Voicebox

21 Feb 2024 in Seminar on Text-to-Speech

Voicebox 논문 요약

Matthew Le, Apoorv Vyas, Bowen Shi, Brian Karrer, Leda Sari, Rashel Moritz, Mary Williamson, Vimal Manohar, Yossi Adi, Jay Mahadeokar and Wei-Ning Hsu
"Voicebox: Text-Guided Multilingual Universal Speech Generation at Scale"
Accepted by NeurIPS 2023
[Paper] [Demo] [Unofficial Code]

Goal

Masking strategy 를 통해 In-context learning 을 활용하여, 학습 때 마주하지 않은 Task에 대해서도 우수한 성능을 보이는 TTS 모델 제안
- Mono-lingual zero-shot TTS
- Cross-lingual zero-shot TTS
- Speech denoising
- Speech editing
- Style conversion
- Diverse speech sampling

Motivation

비전, 언어모델 분야에서 대규모 생성모델들이 큰 성과를 이룸
- [GPT, 2020] : In-context learning 능력 보유
- [DALL-E, 2021]
반면, 스피치 분야는 데이터 셋의 분포에 따라 아웃풋이 편향되는 경향이 있음
- Curated 데이터 셋 : 너무 깔끔해서 다양한 감정, 노이즈, 목소리 변화등을 표현하지 못함
- In the wild 데이터 셋 : 일반적으로 성능이 낮음

Contribution

Voicebox는 대규모 데이터 셋으로 Text-guided speech infilling task 을 학습하여 in-context learning 능력을 가져 학습 때 마주하지 않은 Task에 대해서도 우수한 성능을 보임
- 기존 SOTA 모델들과 비교하였을 때 상대적으로 더욱 우수한 성능을 보임
어떠한 길이에도 Speech infilling 이 가능

Promblem formulation

Notation
- $x = ( x^ 1, x ^ 2 , …, x ^N)$ : $N$ 프레임으로 구성된 오디오 샘플
- $y = ( y^ 1,y ^ 2 , …, y ^ M)$ : $M$ 개의 음소로 구성된 텍스트 시퀀스
- $m$ : Binary temporal mask
- $x_{mis} = m \odot x$ : Missing data
- $x_{ctx} = (1-m) \odot x$ : Context data
- $l = ( l ^ 1, l ^ 2 , …, l ^M)$ : 각 음소마다의 Duration
- $z = \text{rep} (y, l ) =( z^ 1, z ^ 2 , …,z ^N)$ : 프레임 별 음소 Transcript
목표
- In-context learning 을 통해 여러개의 Text-guided speech generation task 를 수행할 수 있는 Single model 만들기
Text-guided speech infilling task
- 주변 오디오 샘플과 텍스트를 통해 Speech segment 예측하기
- 모델이 학습 하는 것 : $p(x_{mis} | y, x _{ctx})$

Method

Audio model & Duration model
Flow-matching with an optimal transport path
Classifier-free guidance

Audio model

Untitled

Context 를 담당하는 텍스트 $z$ 와 길이가 $N$ 인 $x_{ctx}$ 가 주어졌을 때, $x _ {mix}$ 의 분포는 매우 Stochastic 합니다. 또한 $x _ {mis}$ 의 Temporal span 이 크다면 이를 예측하는 것은 더욱 어렵습니다.
저자들은 Conditional normalizing flow 를 활용해 $x_{mis}$의 분포를 Parametrizing 하고자 하였고, Flow-matching objective with optimal-transport 로 학습하였습니다.
Context 에 해당하는 $z$ 와 $x _ {ctx}$ 는 Figure 2와 같이 모델의 입력으로 사용됩니다.

Flow-matching

[Flow-matching, 2023] 에서 제안한 Optimal Transport path 인 Conditional flow 를 학습합니다.
- $p _ t ( x | x _ 1) = \mathcal{N} ( x | t x _ 1, (1-(1-\sigma _ {\text{min}}) t )^ 2 I )$
- $u _ t (x | x _ 1) = \frac{x _ 1 - (1- \sigma _ {\text{min}})x}{1- ( 1- \sigma _ {\text{min}})t}$
- $x _ t = ( 1- (1-\sigma _ {\text{min}} ) t ) x _ 0 + tx$

\[\mathcal{L}_ {\text{CFM}} ( \theta ) = \mathbb{E} _ {t,q(x_{1}),p_{t} (x|x_{1})} || u_{t} (x|x_{1}) - v_{t} (x; \theta ) || ^2\] \[\mathcal{L} _ {\text{audio-CFM}} (\theta) = \mathbb{E} _ {t,m,q(x,z),p_0(x_0)} || u_t (x_t | x) - v_t(x_t,x_{ctx},z;\theta) ||^2\]

학습 동안 Loss는 모든 프레임에서 계산됩니다. 비록 마스킹 되지 않은 부분은 Inference 때 필요없을 수 있지만, 이렇게 계산하여 모델을 학습 시켰습니다. 이는 [P-Flow, 2023]과 상반되는 모습입니다. 물론 마스킹 된 프레임만 Loss를 계산하여 학습한 버전과 비교해보았을 때, 모든 프레임에서 Loss를 계산한 버전이 더욱 좋은 성능을 보였다고 말하고 있습니다.
인퍼런스 시, 학습된 오디오 분포 $p _ 1 ( x | z , x _ {ctx})$ 에서 샘플링 하기 위해 아래의 단계를 따릅니다.
- $p _ 0$ 에서 노이즈 $x _ 0 \sim N ( 0,1)$ 를 샘플링합니다.
- $\frac{d \phi _ t ( x _ 0 )}{dt} = v _ t ( \phi _ t ( x _ 0 ), x _ {ctx}, z; \theta )$ 와 초기 조건 $\phi _ 0 ( x _ 0 ) = x _ 0$ 가 주어졌을때 $\phi _ 1 ( x _ 0 )$ 를 평가하기 위해 ODE solver를 이용합니다.
ODE solver는 초기조건 $\phi _ 0 (x_0) = x_0$ 가 주어진 상태에서 $t= 0$ 부터 $t= 1$ 까지의 다양한 $t$ 에서 $v _ t$를 평가함으로써, 최종적으로 $\phi _ 1 ( x _ 0 )$ 를 계산합니다.
Number of function evaluation (NFE) 는 $\frac{d \phi _ t ( x _ 0 )}{dt}$ 가 몇 번 계산되는지를 나타냅니다.
Voicebox의 경우, 10 NFE 이하에서도 고품질의 음성을 생성할 수 있었습니다.

Duration model

저자들은 Duration model 과 관련하여 2가지 방식을 생각했습니다.
1. Audio model 처럼 Conditional flow matching 으로 Duration 분포 $q(l | y, l _ {ctx})$ 모델링하기
2. $l _ {ctx}$ 와 $y$ 를 기반으로 Masked duration $l _ {mis}$ 를 Regression 하기
저자들은 2번 방식을 채택하였습니다. 모델은 마스킹된 Phoneme 에 대해 아래의 $L_1$ regression loss 로 학습하게 됩니다.

\[\mathcal{L}_{\text{dur-regr-m}}(\theta) = \mathbb{E} _{m,q(l,.y)}||m' \odot l _{mis} - g(l _{ctx} , y ; \theta )) || _ 1\]

[FastSpeech2, 2021]에서 사용된 Duration 모델과 비슷하지만, 추가적인 입력으로 Duration context $l_{ctx}$ 가 사용됩니다.

Classifier-free guidance

[Classifier-free guidance, 2022] 의 방법론을 Flow-matching 에 적용하여 확장시켰습니다.
- 기존의 Conditioner $c$ 는, Audio 모델에서는 $(z, x _ {ctx})$ 이고, Duration 모델에서는 $(y, l _ {ctx})$ 입니다.
인퍼런스 시, Audio model에 대해 변경된 Vector field는 아래와 같습니다.

\[\tilde{v}_t ( w, x _{mis}, z ; \theta ) = (1+\alpha ) \cdot v_t(w,x_{ctx}, z ; \theta ) - \alpha \cdot v_t (w; \theta)\]

여기서 $\alpha$ 는 Guidance의 강도를 나타내고, $v_t ( w; \theta)$ 는 $x_{ctx}$와 $z$를 제거함으로써 얻을 수 있습니다. $\alpha$ 와 $\alpha_{dur}$ 의 값은 실험을 통해 결정하였습니다.
Q. Classifier-free guidance를 같이 채택한 이유가 무엇일까?

Application

Q. 특정 Application에서 Speech prompt (reference speech)의 Transcriptioin을 필요로하는 것은 단점인가?

Untitled

Experiment

Data

Voicebox-En (영어모델) : 60,000 ASR-transcribed 영어 오디오북
Voicebox-Multi (다국어 모델) : 50,000 시간 Multi-lingual 오디오북 + 6개 국어

Metrics

Correctness and intelligibility
- Word Error Rate (WER) : Public automatic speech recognition (ASR) 모델을 사용
- Multi-lingual setup : Whisper large-v2 모델 사용
Similarity
- WavLM-TCDNN 을 사용하여 임베딩 추출 후, 임베딩 사이의 유사도 비교 (VALL-E에서 제안한 측정 방식)
Diversity
- Frechet Speech Distance (FSD) : wav2vec 2.0 피쳐를 활용하여 FID에 적용
  - 낮을수록 실제 데이터와 유사한 샘플을 생성
  - 낮을수록 더 다양한 샘플을 생성(?)
그 외 객관적 지표
- QMOS, SMOS

Mono-lingual zero-shot TTS

Cross-sensentence 방식과 Continuation 방식에서 우수한 성능을 거두었습니다.

Untitled

Cross-lingual zero-shot TTS

발음 정확도, 화자 유사도 모두 우수한 성능을 거두었습니다.

Untitled

발음 정확도, 화자 유사도 그래프

Untitled

Transient noise removal

Untitled

Performance vs. efficiency

Untitled

Diverse speech sampling

Untitled

Comparison to diffusion method

Untitled

Goal

Motivation

Contribution

Promblem formulation

Method

Audio model

Flow-matching

Duration model

Classifier-free guidance

Application

Experiment

Data

Metrics

Mono-lingual zero-shot TTS

Cross-lingual zero-shot TTS

Transient noise removal

Performance vs. efficiency

Diverse speech sampling

Comparison to diffusion method

Templates (for web app):

Error