Diff-TTS

20 Dec 2023 in Seminar on Text-to-Speech

Diff-TTS 논문 요약

• Myeonghun Jeong et al., Diff-TTS: A Denoising Diffusion Model for Text-to-Speech, InterSpeech, 2021

Goal

Denoising diffusion 모델을 이용해서 Non-AR Mel-spectrogram generation 모델 제안

Motivation

Disadvantages of previous TTS models

Autoregressive 방식의 TTS 모델은 속도가 느림
Non-AR Feed-forward 방식의 TTS 모델은 확률적 모델링 없이 Simple regression objective function만 최적화 하기 때문에, 다양한 스피치를 생성하기 어려움
Non-AR Flow 방식의 TTS 모델은 아키텍쳐에 제약이 많고, Inefficient 함

Appearance of diffusion model

Image generation에서 좋은 생성 성능을 보여줌 [DDPM, 2020]
Waveform generation에서 좋은 생성 성능을 보여줌 [DiffWave, 2021; WaveGrad, 2020]

Advantage of diffusion model

Maximum likelihood에 따라 안정적으로 학습이 가능함
아키텍쳐 선정에 큰 제약이 없음
생성 성능이 매우 우수함

Conclusion

최초로 Denoising diffusion을 이용한 Non-AR TTS 모델을 제안
적은 모델 사이즈임에도 성능면에서 기존 모델들을 뛰어넘었음
[DDIM, 2021] 방식을 사용하여 인퍼런스 속도를 높였고, 성능과 속도사이의 Trade-off를 조절할 수 있음
재 학습 없이, 노이즈 정도를 조절하여 Pitch, Prosody 다양성을 조절할 수 있음
Denoising diffusion 프레임워크로 TTS를 가능하게 하는 Log-likelihood-based optimization 방법을 제안

Denoising diffusion model for TTS

Forward process

Forward 프로세스는 텍스트 $c$ 에 대한 Mel-spectrogram $x_0$ 에 노이즈를 점점 더해서 가우시안 노이즈로 만드는 과정입니다. Forward 프로세스는 아래 수식으로 정의됩니다. \[q(x_{t} | x_{t-1},c) = N(x_t; \sqrt{1-\beta_ {t} } x_ {t-1}, \beta _{t} I)\]
위의 Markov transition probability는 텍스트 $c$에 Independent 하다고 가정하겠습니다.
전체 디퓨전 프로세스 $q(x_{1:T} | x_0,c)$ 는 마르코프 프로세스이고, 따라서 아래와 같이 곱 연산으로 나눌 수 있습니다. \[q(x_{1},...,x_{T} | x_0,c)= q(x_{1:T} | x_0,c)= \Pi^T_{t=1}q(x_{t} | x_{t-1})\]

Reverse process

Reverse 프로세스는 노이즈를 Mel-spectrogram으로 복원하는 과정입니다. 아래 수식으로 정의됩니다. \[p_{\theta}(x_{0},...,x_{T-1} | x_T,c)= p_{\theta} (x_{0:T-1} | x_T,c)= \Pi^T_{t=1} p_{\theta}(x_{t-1} | x_{t},c)\]
Forward 과정과 다르게 Reverse transition $p_{\theta}(x_{t-1} | x_{t},c)$ 은 텍스트 $c$ 에 conditioning 되어있습니다. 즉, 원하는 텍스트 $c$ 에 대한 Mel-spectrogram으로 복원하려고 합니다.

Training objective

Diff-TTS는 Reverse 과정에서 모델분포인 $p_{\theta}(x_0 | c)$ 을 학습합니다.
True Mel-spectrogram 데이터 셋의 분포를 $q(x_{0} | c)$ 라고 하겠습니다. $q(x_{0} | c)$를 잘 근사하기 위해서 Reverse 과정은 Mel-spectrogram의 log-likelihood를 최대화 하는 것을 목표로 삼습니다. \[\mathbb{E}_{\log q(x_0 | c)} [ \log p_{\theta} (x_0 |c) ]\]
하지만, likelihood가 intractable하기 때문에, ELBO 문제로 바꾸어서 풀게되고 [DDPM, 2020]에서 보여준 Parametrization 덕분에 아래와 같이 Closed form으로 변환할 수 있습니다. 아래 식이 최종 Training objective 식입니다. \[\min_{\theta} L(\theta) = \mathbb{E}_{x_{0}, \epsilon, t} || \epsilon - \epsilon_{\theta}(\sqrt{\bar{\alpha _ {t}}} x_0 + \sqrt{1-\bar{\alpha_ {t}}}\epsilon,t,c) ||_1\]
최종 Sampling 식은 아래와 같습니다. \[x_{t-1} = \frac{1}{\sqrt{\alpha_{t}}} (x_{t} - \frac{1-\alpha_{t}}{\sqrt{1-\bar{\alpha_{t}}}} \epsilon_{\theta} (x_{t},t,c)) + \sigma_t z_t\]
$z_{t} \sim N(0,I)$ 이고, $\sigma_{t} = \eta \sqrt{\frac{1-\bar{\alpha_{t-1}}}{1-\bar{\alpha_{t}}}\beta_t}$ 입니다. 여기서 Temperature term $\eta$ 는 후에 Prosody와 Pitch variability를 조절할 수 있는 분산의 스케일링 factor 입니다.

Accelerated sampling

[DDPM, 2020]에서 제안한 모델이 Parralel 하게 생성하지만, 디퓨전 스텝 $T$만큼 반복해야하기 때문에 인퍼런스가 느리다는 단점이 있었습니다. [DDIM, 2021]에서는 Accelerated sampling 방식을 제안 및 적용하여 Diffusion 모델의 인퍼런스 속도를 높였습니다. Diff-TTS에서도 Accelerated sampling 방식을 사용하여 인퍼런스 속도를 높였습니다.
Reverse 프로세스 동안, Decimation factor $\gamma$에 의해 Reverse transition이 얼마만큼 Skip될지 정해집니다. (Fig 2 참고)
원래 Reverse path의 Subsequence $\tau = [\tau_1, \tau_2, …, \tau_M] \ (M<T)$를 Decimation factor $\gamma$ 에 따른 새로운 Reverse path라고 할 수 있습니다.
$i>1$ 일때, Accelerated sampling 수식은 아래와 같습니다. \[x_{\tau_{i-1}} = \sqrt{\bar{\alpha}_{\tau_{i-1}}} (\frac{x_{\tau_{i}}- \sqrt{1-\bar{\alpha}_{\tau_{i}}}\epsilon_{\theta}(x_{\tau_{i}},\tau_{i},c)}{\sqrt{\bar{\alpha}_{\tau_{i}}}} + \sqrt{1-\bar{\alpha}_{\tau_{i-1}}-\sigma^2_{\tau_{i}}}\epsilon_{\theta}(x_{\tau_{i}},\tau_{i},c) +\sigma_{\tau_{i}}z_{\tau_{i}}\] \[\sigma_{\tau_{i}} = \eta \sqrt{\frac{1-\bar{\alpha}_{\tau_{i-1}}}{1-\bar{\alpha}_{\tau_{i}}}\beta_{\tau_{i}}}\]
$i=1$ 일때는, 아래와 같습니다. \[x_{0} = \frac{x_{\tau_{1}}- \sqrt{1-\bar{\alpha}_{\tau_{1}}}\epsilon_{\theta}(x_{\tau_{1}},\tau_{1},c)}{\sqrt{\bar{\alpha}_{\tau_{1}}}}\]

Model Architecture

Encoder (text encoder)

Phoneme 정보를 인코딩하고, 인코딩 된 정보는 Duration predictor와 Decoder로 전달됩니다.
[SpeedySpeech, 2020]과 유사한 아키텍쳐를 사용하였습니다.

Duration predictor and Length regulator

[FastSpeech 2, 2020]의 Length regulator를 사용하였습니다.
Duration predictor를 학습하기 위한 Alignment는 MFA를 통해 얻었습니다.

Decoder and Step encoder

$t$번째 스텝의 Latent variable을 인풋으로 받으면, 노이즈를 예측합니다.
디코더는 [DiffWave, 2021]와 유사한 아키텍쳐를 사용하였습니다.
인코딩한 Phoneme 정보와 Time 스텝 정보가 Conv1D 후에 더해짐으로써 Conditioned 됩니다.

Experiments setting

데이터셋: LJSpeech
700,000 스텝동안 학습
Vocoder: HiFi-GAN

Result

Audio quality and model size

Decimation factor가 커져도 성능저하가 심하지 않습니다.
Decimation factor가 작으면 기존 비교 모델들보다 성능이 더 좋습니다.

그리고 모델의 사이즈는 절반정도로, 더 Efficient 합니다.

Inference speed

Decimation factor가 57이어도, Tacotron2 보다 성능이 좋으면서 빠른 시간안에 합성이 가능합니다. Decimation factor를 설정함에 따라 합성 성능과 시간의 Trade-off를 조절할 수 있습니다.
(RTF 란? Real time factor로 1초의 웨이브폼을 형성하는데에 걸리는 시간을 뜻합니다)

Variability and controllabiliy

Untitled

Diff-TTS는 모델을 재학습하지 않고도 Temperature term 스케일을 조절하면서 Prosody의 다양성을 보여줄 수 있습니다.

Diff-TTS

Goal

Motivation

Disadvantages of previous TTS models

Appearance of diffusion model

Advantage of diffusion model

Conclusion

Denoising diffusion model for TTS

Forward process

Reverse process

Training objective

Accelerated sampling

Model Architecture

Encoder (text encoder)

Duration predictor and Length regulator

Decoder and Step encoder

Experiments setting

Result

Audio quality and model size

Inference speed

Variability and controllabiliy

PRML Lab. Speech Team

Error

Goal

Motivation

Disadvantages of previous TTS models

Appearance of diffusion model

Advantage of diffusion model

Conclusion

Denoising diffusion model for TTS

Forward process

Reverse process

Training objective

Accelerated sampling

Model Architecture

Encoder (text encoder)

Duration predictor and Length regulator

Decoder and Step encoder

Experiments setting

Result

Audio quality and model size

Inference speed

Variability and controllabiliy

Templates (for web app):

Error