Expressive Voice Conversion: A Joint Framework for Speaker Identity and Emotional Style Transfer

----------------------------> Model Architecture <-----------------------

Figure 1. Schematic diagram of the proposed framework, JES-StarGAN. Blue boxes represent the modules involved in training, and yellow boxes represent the pre-trained modules.

-----------------------------> Speech Samples <---------------------------

Experimental Setup:

Baseline: StarGAN-VC [1]

Proposed Method: JES-StarGAN, a joint emotional style and speaker identity conversion framework.

The samples are from four speakers (two male and two female) with three emotions (neutral, happy, and sad).

	Source	StarGAN-VC	JES-StarGAN	Target
Neutral
Happy
Sad

[1] Kameoka, Hirokazu, et al. "StarGAN-VC: Non-parallel many-to-many voice conversion using star generative adversarial networks." 2018 IEEE Spoken Language Technology Workshop (SLT). IEEE, 2018.