
Subband-based Generative Adversarial Network for Non-parallel Many-to-many Voice Conversion

Non-parallel many-to-many voice conversion transfers the speech of a source speaker into the style of an arbitrary target speaker without parallel data, while keeping the linguistic content of the source speech unchanged. Extracting style information accurately is especially challenging when the target speaker does not appear in the training set, so the task demands strong robustness and generalization from the model.

We propose a new voice conversion framework, the Subband-based Generative Adversarial Network for Voice Conversion (SGAN-VC), which explicitly exploits the spatial characteristics of speaker style in different subbands and converts the content of each subband of the source speech separately.
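To make the subband idea concrete, the sketch below (a minimal illustration, not the authors' implementation; the four-band split, layer sizes, and the style-injection scheme are assumptions) splits a mel-spectrogram into equal frequency subbands and converts each subband with its own small generator branch, conditioned on a target-speaker style vector:

```python
# Minimal sketch of per-subband conversion; hyperparameters are illustrative.
import torch
import torch.nn as nn

class SubbandGenerator(nn.Module):
    """Converts one frequency subband, conditioned on a style embedding."""
    def __init__(self, n_bins: int, style_dim: int = 64, hidden: int = 128):
        super().__init__()
        self.encode = nn.Conv1d(n_bins, hidden, kernel_size=5, padding=2)
        self.style_proj = nn.Linear(style_dim, hidden)
        self.decode = nn.Conv1d(hidden, n_bins, kernel_size=5, padding=2)

    def forward(self, subband: torch.Tensor, style: torch.Tensor) -> torch.Tensor:
        # subband: (batch, n_bins, frames); style: (batch, style_dim)
        h = torch.relu(self.encode(subband))
        h = h + self.style_proj(style).unsqueeze(-1)  # inject target-speaker style
        return self.decode(h)

class SubbandVC(nn.Module):
    """Splits a mel-spectrogram into equal subbands and converts each separately."""
    def __init__(self, n_mels: int = 80, n_subbands: int = 4, style_dim: int = 64):
        super().__init__()
        assert n_mels % n_subbands == 0
        self.band_bins = n_mels // n_subbands
        self.generators = nn.ModuleList(
            [SubbandGenerator(self.band_bins, style_dim) for _ in range(n_subbands)]
        )

    def forward(self, mel: torch.Tensor, style: torch.Tensor) -> torch.Tensor:
        bands = torch.split(mel, self.band_bins, dim=1)           # frequency split
        converted = [g(b, style) for g, b in zip(self.generators, bands)]
        return torch.cat(converted, dim=1)                        # reassemble

mel = torch.randn(2, 80, 120)         # toy batch of mel-spectrograms
style = torch.randn(2, 64)            # toy target-speaker style vectors
print(SubbandVC()(mel, style).shape)  # torch.Size([2, 80, 120])
```

Giving each frequency band its own branch is what lets the model treat the style content of low and high subbands differently, which is the intuition behind the subband design.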

Many-to-many Voice Conversion Samples

Note:

1. VCTK Corpus

The VCTK Corpus contains approximately 44 hours of speech recordings from 109 speakers with various accents. The sentences are drawn from a variety of media and archive texts, and each speaker reads about 400 sentences selected by a greedy algorithm. The corpus includes 47 male and 62 female speakers, so the gender ratio is relatively balanced.

1.1 Female to Female

Sample 1 (p233 → p236) and Sample 2 (p239 → p244): audio clips for Source, Target, StarGANv2-VC-noASR, StarGANv2-VC-ASR, SGAN-VC-Unseen, and SGAN-VC-Seen.

1.2 Female to Male

Sample 1 (p233 → p254) and Sample 2 (p236 → p259): audio clips for Source, Target, StarGANv2-VC-noASR, StarGANv2-VC-ASR, SGAN-VC-Unseen, and SGAN-VC-Seen.

1.3 Male to Female

Sample 1 (p258 → p236) and Sample 2 (p259 → p239): audio clips for Source, Target, StarGANv2-VC-noASR, StarGANv2-VC-ASR, SGAN-VC-Unseen, and SGAN-VC-Seen.

1.4 Male to Male

Sample 1 (p258 → p254) and Sample 2 (p270 → p258): audio clips for Source, Target, StarGANv2-VC-noASR, StarGANv2-VC-ASR, SGAN-VC-Unseen, and SGAN-VC-Seen.

2. AISHELL3-84

AISHELL3 is a large-scale, high-fidelity multi-speaker Mandarin speech corpus. It contains roughly 85 hours of recordings from 218 native Chinese speakers (176 female and 42 male), amounting to 88,035 utterances in total. Because the gender ratio of AISHELL3 is unbalanced, we use all 42 male speakers and 42 randomly selected female speakers as our evaluation dataset, called AISHELL3-84. From AISHELL3-84, we randomly sample 5 male and 5 female speakers as the final test set.
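For illustration, a hypothetical way to assemble such a subset (the speaker-ID layout, gender lookup, and random seed below are assumptions, not the exact split used for these samples):

```python
# Hypothetical sketch of the AISHELL3-84 subset construction described above.
import random

def build_aishell3_84(speakers, gender, seed=0):
    """speakers: list of speaker IDs; gender: dict mapping each ID to 'M' or 'F'."""
    rng = random.Random(seed)
    males = [s for s in speakers if gender[s] == "M"]          # all 42 male speakers
    females = rng.sample([s for s in speakers if gender[s] == "F"], k=42)
    subset = males + females                                   # AISHELL3-84
    test = rng.sample(males, k=5) + rng.sample(females, k=5)   # 5 male + 5 female test speakers
    train = [s for s in subset if s not in test]
    return train, test
```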

2.1 Female to Female

Sample 1 (11 → 1000) and Sample 2 (482 → 11): audio clips for Source, Target, StarGANv2-VC-noASR, StarGANv2-VC-ASR, SGAN-VC-Unseen, and SGAN-VC-Seen.

2.2 Female to Male

Sample 1 (1000 → 1365) and Sample 2 (1274 → 710): audio clips for Source, Target, StarGANv2-VC-noASR, StarGANv2-VC-ASR, SGAN-VC-Unseen, and SGAN-VC-Seen.

2.3 Male to Female

Sample 1 (394 → 1274) and Sample 2 (407 → 1000): audio clips for Source, Target, StarGANv2-VC-noASR, StarGANv2-VC-ASR, SGAN-VC-Unseen, and SGAN-VC-Seen.

2.4 Male to Male

Sample 1 (394 → 407) and Sample 2 (1935 → 1365): audio clips for Source, Target, StarGANv2-VC-noASR, StarGANv2-VC-ASR, SGAN-VC-Unseen, and SGAN-VC-Seen.