Sparse value completion for prosody control

These speech samples accompany a paper submitted for review at the ICASSP-2024. We introduce a novel framework for generation control whereby an encoder maps a sparse, human-interpretable control space onto the latent space of a generative model. We apply this framework to the problem of prosody control in text-to-speech synthesis. Within this framework, we propose a model, called Multiple-Instance CVAE (MICVAE), that is specifically designed to encode sparse prosodic features and output complete waveforms. We demonstrate empirically that MICVAE possesses the main desirable qualities for a human-in-the-loop control method: efficiency, robustness, and faithfulness. With even a very small number of provided values, a user can use MICVAE to control the rendition and produce a large improvement in the listener preference.

We consider the following systems for prosody control:

Name Description
1. MICVAE Our model. Conditional Variational Autoencoder with a Self-Attention encoder.
2. CVAE Conditional Variational Autoencoder (same decoder as our model) with an RNN encoder and masked inputs.
3. AFP Acoustic Feature Predictor. Prosody generated using only the decoder from our model, no control is taking place.
4. Crude-Control The same as AFP, but some of the features are changed manually in the output. The same features are changed as with MICVAE and CVAE.

Experimental Setup

Morgana viz

a00_08q_003

Text: "¿de qué obras es hijo, pues me niega mi soldada y mi sudor y trabajo?"

Ground-Truth MICVAE CVAE Crude-Control AFP

a18_21_001

Text: "A mí estos políticos de hoy en día me dan un asco, pero un asco que ni te imaginas."

Ground-Truth MICVAE CVAE Crude-Control AFP

a00_18_001

Text: "Nos despertábamos a eso de las ocho u nueve en el silencio de los Alpes y bajábamos a desayunar en el salón del hotel."

Ground-Truth MICVAE CVAE Crude-Control AFP