Abstract: Expressive voice conversion performs identity conversion for emotional speakers by jointly converting speaker identity and emotional style. Due to the hierarchical structure of speech emotion, it is challenging to disentangle the emotional style of different speakers. Inspired by the recent success of speaker disentanglement with variational autoencoders (VAE), we propose an any-to-any expressive voice conversion framework called StyleVC. StyleVC is designed to disentangle linguistic content, speaker identity, pitch, and emotional style information. We study the use of a style encoder to model emotional style explicitly. At run-time, StyleVC converts both speaker identity and emotional style for arbitrary speakers. Experiments validate the effectiveness of our proposed framework in both objective and subjective evaluations.
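
To make the run-time conversion path concrete, below is a minimal PyTorch sketch of the four-way disentanglement described above: content and pitch are taken from the source utterance, while speaker identity and emotional style are taken from the target reference. All module names, dimensions, toy linear encoders, and the mel-spectrogram interface are illustrative assumptions, not the authors' actual StyleVC implementation.

import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Toy encoder: a linear projection standing in for each real sub-network."""
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.proj = nn.Linear(in_dim, out_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.proj(x)

class StyleVCSketch(nn.Module):
    """Disentangles content, speaker, pitch, and emotional style, then decodes."""
    def __init__(self, n_mels: int = 80, d: int = 128):
        super().__init__()
        self.content_encoder = Encoder(n_mels, d)  # linguistic content (frame-level)
        self.speaker_encoder = Encoder(n_mels, d)  # speaker identity (utterance-level)
        self.style_encoder = Encoder(n_mels, d)    # emotional style (utterance-level)
        self.pitch_encoder = Encoder(1, d)         # frame-level F0 contour
        self.decoder = nn.Linear(4 * d, n_mels)    # reconstructs mel frames

    def convert(self, src_mel, src_f0, tgt_mel):
        # Content and pitch representations come from the source utterance.
        content = self.content_encoder(src_mel)           # (T, d)
        pitch = self.pitch_encoder(src_f0.unsqueeze(-1))  # (T, d)
        # Speaker identity and emotional style come from the target reference,
        # averaged over time into single utterance-level embeddings.
        speaker = self.speaker_encoder(tgt_mel).mean(0, keepdim=True)  # (1, d)
        style = self.style_encoder(tgt_mel).mean(0, keepdim=True)      # (1, d)
        T = content.size(0)
        z = torch.cat([content, pitch,
                       speaker.expand(T, -1), style.expand(T, -1)], dim=-1)
        return self.decoder(z)                            # (T, n_mels)

model = StyleVCSketch()
src_mel, src_f0 = torch.randn(120, 80), torch.randn(120)  # 120 source frames
tgt_mel = torch.randn(90, 80)                             # target reference utterance
converted = model.convert(src_mel, src_f0, tgt_mel)
print(converted.shape)  # torch.Size([120, 80])

Because speaker and style are utterance-level embeddings taken from any reference, this factorization is what would allow any-to-any conversion: neither encoder needs to have seen the target speaker during training.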

-------------------------------- Model Architecture -------------------------------


Figure 1. An illustration of the training phase of the proposed framework, where the green boxes represent the modules involved in training.

--------------------------------- Speech Samples ---------------------------------

Baseline: VQMIVC[1]
Proposed Method: StyleVC, an any-to-any expressive voice conversion framework.

The samples cover five emotions (surprise, angry, sad, happy, and neutral) and three conversion scenarios: conversion between seen speakers, conversion from seen to unseen speakers, and conversion between unseen speakers.

We provide the utterances from source speakers, denoted as Source; the converted utterances from the baseline, denoted as VQMIVC [1]; the converted utterances from our proposed method, denoted as StyleVC; and the utterances from target speakers, denoted as Target.


[Audio sample table: each emotion row originally contained four audio players, one per column: Source, VQMIVC [1], StyleVC, Target.]

Seen to Seen Speakers: Surprise, Angry, Sad, Happy, Neutral
Seen to Unseen Speakers: Surprise, Angry, Sad, Happy, Neutral
Unseen to Unseen Speakers: Surprise, Angry, Sad, Happy, Neutral
[1] Wang, Disong, et al. "VQMIVC: Vector Quantization and Mutual Information-Based Unsupervised Speech Representation Disentanglement for One-shot Voice Conversion." arXiv preprint arXiv:2106.10132 (2021).