Abstract
Customizable multilingual zero-shot singing voice synthesis (SVS) has various potential applications in music composition and short video dubbing. However, existing SVS models depend heavily on phoneme and note boundary annotations, which limits their robustness in zero-shot scenarios and produces poor transitions between phonemes and notes. They also lack effective multi-level style control via diverse prompts. To overcome these challenges, we introduce TCSinger 2, a multi-task multilingual zero-shot SVS model with style transfer and style control based on various prompts. TCSinger 2 mainly includes three key modules: 1) Blurred Boundary Content (BBC) Encoder, which predicts duration, extends content embedding, and applies masking to the boundaries to enable smooth transitions. 2) Custom Audio Encoder, which uses contrastive learning to extract aligned representations from singing, speech, and textual prompts. 3) Flow-based Custom Transformer, which leverages Cus-MOE with F0 supervision to enhance both the synthesis quality and style modeling of the generated singing voice. Experimental results show that TCSinger 2 outperforms baseline models in both subjective and objective metrics across multiple related tasks.
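The boundary-blurring idea attributed to the BBC Encoder can be illustrated with a toy sketch. This is not the TCSinger 2 implementation: it works on phoneme labels rather than learned embeddings, takes durations as given instead of predicting them, and the function name `expand_with_blurred_boundaries`, the `blur` width, and the `<mask>` token are all hypothetical choices made for illustration.

```python
from typing import List, Tuple


def expand_with_blurred_boundaries(
    phonemes: List[str],
    durations: List[int],
    blur: int = 1,
    mask_token: str = "<mask>",
) -> Tuple[List[str], List[bool]]:
    """Toy illustration (assumed, not the paper's code).

    1. Repeat each phoneme for its duration in frames ("extend content").
    2. Mask `blur` frames on each side of every internal phoneme boundary,
       so a downstream model is not given a hard boundary position.
    """
    if len(phonemes) != len(durations):
        raise ValueError("phonemes and durations must have the same length")
    if blur < 0 or any(d < 0 for d in durations):
        raise ValueError("blur and durations must be non-negative")

    frames: List[str] = []
    boundaries: List[int] = []  # frame index at which a new phoneme starts
    for phoneme, duration in zip(phonemes, durations):
        if frames and duration > 0:
            boundaries.append(len(frames))
        frames.extend([phoneme] * duration)

    masked = [False] * len(frames)
    for b in boundaries:
        # Frames b - blur .. b + blur - 1 straddle the boundary.
        for i in range(max(0, b - blur), min(len(frames), b + blur)):
            masked[i] = True

    blurred = [mask_token if m else f for f, m in zip(frames, masked)]
    return blurred, masked


if __name__ == "__main__":
    out, mask = expand_with_blurred_boundaries(["a", "b"], [3, 3], blur=1)
    print(out)
    print(mask)
```

In a real model the masked positions would presumably be filled in by the generator from context; how wide the masked region is and how it is chosen are not specified on this page.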
Note: We conduct all tasks in the zero-shot scenario, with training and testing on multilingual speech and singing data. All samples are resampled to 48 kHz.
Style Transfer
Parallel Style Transfer
For the parallel experiments, we randomly select samples with unseen singers from the test set as target voices and use different utterances from the same singers to form prompts. We also input music scores as contents.
1. Target Lyric: 回忆里想起模糊的小时候,云朵漂浮在蓝蓝的天空
Language: Chinese
Successfully transferred timbre, accent, enunciation, pop singing method, emotion.
Prompt | Ground Truth |
---|---|
StyleTTS 2 | CosyVoice | VISinger 2 | TCSinger | TCSinger 2 |
---|---|---|---|---|
2. Target Lyric: 入夜渐微凉,繁花落地成霜,你在远方眺望,耗尽所有暮光,不思量
Language: Chinese
Successfully transferred timbre, accent, enunciation, pop singing method.
Prompt | Ground Truth |
---|---|
StyleTTS 2 | CosyVoice | VISinger 2 | TCSinger | TCSinger 2 |
---|---|---|---|---|
3. Target Lyric: how to be brave, how can I love when I’m afraid
Language: English
Successfully transferred timbre, accent, enunciation, pop singing method, and mixed voice technique.
Prompt | Ground Truth |
---|---|
StyleTTS 2 | CosyVoice | VISinger 2 | TCSinger | TCSinger 2 |
---|---|---|---|---|
4. Target Lyric: allons, en garde, allons, allons, ah toréador
Language: French
Successfully transferred timbre, accent, enunciation, bel canto singing method.
Prompt | Ground Truth |
---|---|
StyleTTS 2 | CosyVoice | VISinger 2 | TCSinger | TCSinger 2 |
---|---|---|---|---|
Cross-lingual Style Transfer
Additionally, we utilize unseen test data with different lyric languages (such as English and Chinese) as prompts and targets for inference. We also input music scores as contents.
1. Target Lyric: 在我的怀里,在你的眼里
Language: From English to Chinese
Successfully transferred timbre, accent, enunciation, pop singing method.
Prompt | TCSinger 2 |
---|---|
2. Target Lyric: or do you need more, is there something you’re searching
Language: From Japanese to English
Successfully transferred timbre, accent, enunciation, pop singing method.
Prompt | TCSinger 2 |
---|---|
Non-parallel Style Transfer
Additionally, we employ unseen test data with different styles to generate the same content in entirely distinct ways from the original version. We also use music scores as input content.
It is evident that the timbre, accent, and enunciation have been successfully transferred. In addition, the first example successfully transfers the emotional expression of the singing, the second captures the falsetto technique, and the third reproduces the breathy technique.
Target Lyric: 雨到了这里缠成线,缠着我们流连人世间
Language: Chinese
Prompt Audio | TCSinger 2 |
---|---|
Style Control
Multi-level styles are randomly assigned in a manner that is appropriate for the context. These styles include global timbre (such as the singer’s gender and vocal range), singing method (e.g., bel canto and pop), emotion (e.g., happy and sad), and segment-level or word-level techniques (such as mixed voice, falsetto, breathy, vibrato, glissando, and pharyngeal). We also input the same music scores as content.
Target Lyric: 一壶清酒一生尘灰
Language: Chinese
Textual Prompt | TCSinger 2 |
---|---|
A female singer with an alto vocal range performs a pop song. | |
A male singer with a tenor vocal range performs a pop song. | |
A female singer with an alto vocal range performs a pop song. She sings with the breathy technique in the whole segment. | |
A female singer with an alto vocal range performs a pop song. She begins with the breathy technique in the first half of the song (three words), before transitioning into falsetto for the second half (about five words). | |
Speech-to-Singing Style Transfer
We randomly select unseen singers from the test set as target samples and different speech samples from the same singers to form the prompts. We also input music scores as contents.
1. Target Lyric: 风到这里就是粘,粘住过客的思念
Language: Chinese
Successfully transferred timbre, accent, enunciation.
Prompt | TCSinger 2 |
---|---|
2. Target Lyric: Anytime you whisper my name, you’ll see
Language: English
Successfully transferred timbre, accent, enunciation.
Prompt | TCSinger 2 |
---|---|
3. Target Lyric: parais, astre pur, et charmant
Language: French
Successfully transferred timbre, accent, enunciation.
Prompt | TCSinger 2 |
---|---|
4. Target Lyric: господи, как это, господи, как это, как это больно
Language: Russian
Successfully transferred timbre, accent, enunciation.
Prompt | TCSinger 2 |
---|---|