Mega-TTS

28 Dec 2023 in Seminar on Text-to-Speech

Mega-TTS 논문 요약

Z. Jiang et al., “Mega-TTS: Zero-Shot Text-to-Speech at Scale with Intrinsic Inductive Bias”, 2023

Goal

Large and wild 학습 데이터셋을 사용하면서도, 동시에 Speech의 본질적 특성과 모델의 Inductive bias를 Matching하는 Zero-shot TTS 모델을 제안했습니다.

Motivation

[Adaspeech, 2021; GenerSpeech, 2022]와 같은 이전의 TTS 연구들은 Limited 데이터 셋에서 학습되었습니다.
[VALL-E, 2023]와 같은 최근 TTS 연구들은 대규모 데이터 셋에서 학습되었고 Neural audio codec 모델을 사용하여 Speech를 인코딩하는 방식을 보였습니다. 이로인해 Zero-shot 성능은 좋지만 Speech의 본질적인 특성들을 고려하지 않은 모델링을 제안했습니다.

Contribution

Phase, Prosody, Timbre, Content와 같이스피치를 다양한 요소로 나누어 모델링하여 Speech의 본질적인 특성을 고려하였습니다.
Zero-shot TTS, Speech editing, Cross-lingual TTS 이렇게 3가지 다운스트림 음성 생성 태스크에서 우수성을 입증하였습니다.

Background

Phase(위상): Phase는 의미론적 측면과는 큰 관련이 없고, 다른 요소들에 비해 사람들이 덜 민감하게 반응합니다. 즉 waveform을 reconstruct 하기 위해 1개의 적합한 위상만 필요할 뿐, 모든 위상을 모델링할 필요는 없습니다. 언어모델이나 디퓨전 모델로 위상을 모델링하는 것은 오히려 낭비라고 할 수 있습니다.
Timbre(음색): Timbre는 문장 안에서 글로벌 vector로 적용되어 안전하게 유지돼야 합니다. 시간에 따라 변하는 latent로 Timbre를 모델링하는 것은 너무 많은 비용을 야기합니다.
Prosody(운율): Prosody는 일반적으로 local and long-term dependency를 다 가지면서, 시간에 따라 빠르게 변하고, 텍스트와의 연관성이 낮습니다. 이러한 특성 때문에, conditional phoneme-level LLM이 본질적으로 Prosody 시퀀스를 생성하는 데 매우 적합합니다.
Content: Content는 스피치와 monotonic alignment를 갖기 때문에, Autoregressive 모델을 사용하면 repeating, skipping 문제가 있을 수 있습니다.

Method

모델은 크게 VQGAN-based TTS 모델과 Prosody-LLM (P-LLM) 으로 구성되어 있습니다.

서로 다른 스피치 특성은 각각 다른 방식으로 모델링 됩니다.

Mel-spectrogram을 중간 representation으로 선택
- 기존의 Neural audio codec을 사용한 모델들은 웨이브폼을 이산코드로 변환하여 Intermediate representation으로 사용했습니다.
- Mega-TTS는 Mel-spectrogram을 Intermediate representation으로 사용하여 Phase와 다른 Attribute들을 분리했습니다.
글로벌 Timbre 인코더를 이용
- Timbre는 시간에 따라 천천히 변하는 글로벌적 특성이 있습니다.
- 동일한 화자의 랜덤한 문장에서 글로벌 Timbre 벡터를 추출합니다.
- 이로 인해 Timbre와 content 정보를 분리했습니다.
P-LLM 이라는 latent code language model을 제안
- LLM은 local and long-range dependency를 잡기에 유용하다는 특징이 있습니다.
- P-LLM을 통해 Prosody의 분포에 Fit할 수 있었고 이로인해 Prosody 정보를 잘 얻습니다.
VQGAN-based 어쿠스틱 모델을 사용
- 위의 3가지 Representation과 어쿠스틱 모델을 이용해 Mel-spectrogram을 합성합니다.

Architecture

3 types of encoder

1. Prosody encoder

구성: 2개의 Conv stacks, Phoneme단위 Pooling, Vector quantization bottleneck
인풋: Mel-spectrogram의 저주파수 대역
아웃풋: Phoneme-level prosody codes $u= {u_1, u_2,…, u_T}$, hidden states $H_{\text{prosody}}$
특이점: [ProsoSpeech, 2022] 에서 보여준 것 처럼, 인풋 Mel-spectrogram의 저주파수 대역(각 Mel-frame마다 20 bins)만 잘라서 넣어주면 Prosody 정보는 살아남지만, Timbre와 Content 정보는 상대적으로 사라지므로 저주파수 대역을 넣어줍니다.

2. Content encoder

구성: Feed-forward 트랜스포머 layers, Duration predictor, length regulator
인풋: Text sequence
아웃풋: Content representation $H_{\text{content}}$
특이점: Prosody encoder에서 추출한 Prosody 정보를 인풋으로 받습니다.

3. Timbre encoder

구성: 여러개의 Convolution stacks
인풋: 레퍼런스 오디오
아웃풋: 스피커 Identity 정보와 Timbre 정보를 포함한 Global vector $H_{\text{timbre}}$
특이점: Timbre encoder의 아웃풋은 시간축으로 averaging하여 Target voice의 전체적인 음색을 잡을 수 있습니다. 또한, 동일한 화자의 different speech를 넣어줌으로써, 레퍼런스 오디오에서 Content 정보는 누락되어 Timbre 정보를 분리시킬 수 있습니다.

Training objective

Untitled

P-LLM

본 논문에서 제안한 P-LLM은 지역 및 장기적인 의존성을 잡는 Prosody 모델링을 위한 latent code 언어모델입니다. Mega-TTS는 아래와 같은 방식으로 Prosody-oriented speech decoding 메커니즘을 적용합니다.

Prompt speech-text: $( y_p, x_p)$, Target speech-text: $( y_t, x_t)$
Encode
- $u=E_{\text{prosody}} (y_p)$
- $H_{\text{content}} = E_{\text{content}} (x_p)$
- $\tilde{H}_ {\text{timbre}} = E_{\text{timbre}} (y_p)$
- $\tilde{H}_ {\text{content}} = E_{\text{content}} (x_t)$
Prosody prediction
- $\tilde{u} = f(\tilde{u} | u, H_{\text{content}}, \tilde{H}_ {\text{timbre}}, \tilde{H}_ {\text{content}} ; \theta)$
- $f$는 Prosody 예측함수
- $\theta$는 P-LLM의 파라미터
Decode
- $\hat{y}_t = D(\tilde{u}, \tilde{H} _{\text{timbre}}, \tilde{H} _{\text{content}} )$
- $\hat{y}_t$는 생성된 스피치

P-LLM은 Prosody 모델링을 위한 트랜스포머 기반의 디코더 아키텍쳐 입니다. 이렇게 설정한 동기는 LLM의 강력한 in-context learning 능력을 활용하면 Prosody code를 예측할 수 있다고 생각했기 때문입니다. Autoregressive하게 Prosody를 예측하는 과정은 아래와 같이 적을 수 있습니다. 학습 과정은 Teacher-forcing과 Cross-entropy loss 를 통해 학습합니다. \[p( \tilde{u} | u, H_{\text{content}}, \tilde{H}_{\text{timbre}}, \tilde{H}_{\text{content}} ; \theta)=\Pi^{T}_{t=0} p( \tilde{u}_t | \tilde{u}_{<t}, u, H_{\text{content}}, \tilde{H}_{\text{timbre}}, \tilde{H}_{\text{content}} ; \theta)\]

Untitled

Experiments

객관적 평가방법은 아래와 같습니다

Pitch: GT Mel-spectrogram과 합성된 Mel-spectrogram의 Pitch contour 사이의 average dynamic time warping (DTW) 거리를 계산합니다.
Similarity: WavLM 모델을 파인튜닝하여 (-1,1) 범위에서 코사인 유사도를 계산하고, 값이 높을수록 스피치 사이의 유사성이 높습니다.

실험결과는 아래 사진들과 같습니다.

Subjective and objective evaluation

Untitled

Evaluation of timbre and prosody encoder for unseen speaker

Untitled

Experiment of hyperparameter for VQ

Untitled

Mega-TTS

Goal

Motivation

Contribution

Background

Method

Architecture

3 types of encoder

1. Prosody encoder

2. Content encoder

3. Timbre encoder

Training objective

P-LLM

Experiments

PRML Lab. Speech Team

Error

Goal

Motivation

Contribution

Background

Method

Architecture

3 types of encoder

1. Prosody encoder

2. Content encoder

3. Timbre encoder

Training objective

P-LLM

Experiments

Templates (for web app):

Error