Abstract

Customizable multilingual zero-shot singing voice synthesis (SVS) has various potential applications in music composition and short video dubbing. However, existing SVS models rely too heavily on phoneme and note boundary annotations, which limits their robustness in zero-shot scenarios and produces poor transitions between phonemes and notes. They also lack effective multi-level style control via diverse prompts. To overcome these challenges, we introduce TCSinger 2, a multi-task multilingual zero-shot SVS model with style transfer and style control based on various prompts. TCSinger 2 mainly includes three key modules: 1) Blurred Boundary Content (BBC) Encoder, which predicts duration, extends content embedding, and applies masking to the boundaries to enable smooth transitions. 2) Custom Audio Encoder, which uses contrastive learning to extract aligned representations from singing, speech, and textual prompts. 3) Flow-based Custom Transformer, which leverages Cus-MOE with F0 supervision to enhance both the synthesis quality and style modeling of the generated singing voice. Experimental results show that TCSinger 2 outperforms baseline models in both subjective and objective metrics across multiple related tasks.

Note: We conduct all tasks in the zero-shot scenario, with training and testing on multilingual speech and singing data. All samples are resampled to 48 kHz.
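The BBC Encoder steps named in the abstract (predict duration, extend the content embedding, mask the boundaries) can be pictured with a toy example. This is only a minimal sketch, not the model's implementation: the function name `expand_and_blur`, the `width` parameter, the use of `None` as a mask value, and the use of strings in place of embedding vectors are all assumptions made for illustration, and the durations are given by hand rather than predicted.

```python
from typing import List, Optional, Sequence

MASK = None  # stand-in for a masked frame (assumption: the real model masks embeddings)


def expand_and_blur(
    content: Sequence[str], durations: Sequence[int], width: int = 1
) -> List[Optional[str]]:
    """Repeat each content unit for its duration, then mask the frames around
    every internal boundary. `width` frames are masked on each side (hypothetical)."""
    if len(content) != len(durations):
        raise ValueError("content and durations must have the same length")

    frames: List[Optional[str]] = []
    boundaries: List[int] = []
    for unit, dur in zip(content, durations):
        frames.extend([unit] * dur)     # length regulation: one frame per time step
        boundaries.append(len(frames))  # index where the next unit would start

    # The last entry marks the end of the sequence, not a transition, so skip it.
    for b in boundaries[:-1]:
        for i in range(max(0, b - width), min(len(frames), b + width)):
            frames[i] = MASK
    return frames


print(expand_and_blur(["a", "b", "c"], [4, 4, 4], width=1))
```

With three units of four frames each and `width=1`, the frames on either side of the two internal boundaries are masked, so the exact switch point between units is no longer given to whatever consumes the sequence.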

Style Transfer

Parallel Style Transfer

For the parallel experiments, we randomly select samples from unseen singers in the test set as target voices and use different utterances from the same singers as prompts. We also input music scores as content.

1. Target Lyric: 回忆里想起模糊的小时候，云朵漂浮在蓝蓝的天空

Language: Chinese

Successfully transferred timbre, accent, enunciation, pop singing method, emotion.

Prompt	Ground Truth

StyleTTS 2	CosyVoice	VISinger 2	TCSinger	TCSinger 2

2. Target Lyric: 入夜渐微凉，繁花落地成霜，你在远方眺望，耗尽所有暮光，不思量

Language: Chinese

Successfully transferred timbre, accent, enunciation, pop singing method.

Prompt	Ground Truth

StyleTTS 2	CosyVoice	VISinger 2	TCSinger	TCSinger 2

3. Target Lyric: how to be brave, how can I love when I’m afraid

Language: English

Successfully transferred timbre, accent, enunciation, pop singing method, and mixed voice technique.

Prompt	Ground Truth

StyleTTS 2	CosyVoice	VISinger 2	TCSinger	TCSinger 2

4. Target Lyric: allons, en garde, allons, allons, ah toréador

Language: French

Successfully transferred timbre, accent, enunciation, bel canto singing method.

Prompt	Ground Truth

StyleTTS 2	CosyVoice	VISinger 2	TCSinger	TCSinger 2

Cross-lingual Style Transfer

Additionally, we use unseen test data in which the prompt and the target have different lyric languages (such as English and Chinese) for inference. We also input music scores as content.

1. Target Lyric: 在我的怀里，在你的眼里

Language: From English to Chinese

Successfully transferred timbre, accent, enunciation, pop singing method.

Prompt	TCSinger 2

2. Target Lyric: or do you need more, is there something you’re searching

Language: From Japanese to English

Successfully transferred timbre, accent, enunciation, pop singing method.

Prompt	TCSinger 2

Non-parallel Style Transfer

Additionally, we employ unseen test data with different styles to generate the same content in entirely distinct ways from the original version. We also use music scores as input content.

It is evident that the timbre, accent, and enunciation have been successfully transferred. In addition, the first example successfully transfers the emotional expression of the singing, the second captures the falsetto technique, and the third reproduces the breathy technique.

Target Lyric: 雨到了这里缠成线，缠着我们流连人世间

Language: Chinese

Prompt Audio	TCSinger 2

Style Control

Multi-level styles are randomly assigned in a manner that is appropriate for the context. These styles include global timbre (such as the singer’s gender and vocal range), singing method (e.g., bel canto and pop), emotion (e.g., happy and sad), and segment-level or word-level techniques (such as mixed voice, falsetto, breathy, vibrato, glissando, and pharyngeal). We also input the same music scores as content.

Target Lyric: 一壶清酒一生尘灰

Language: Chinese

Textual Prompt	TCSinger 2
A female singer with an alto vocal range performs a pop song.
A male singer with a tenor vocal range performs a pop song.
A female singer with an alto vocal range performs a pop song. She sings with the breathy technique in the whole segment.
A female singer with an alto vocal range performs a pop song. She begins with the breathy technique in the first half of the song (three words), before transitioning into falsetto for the second half (about five words).
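The textual prompts above combine a global timbre (gender and vocal range), a singing method, and an optional technique. As a rough illustration of how such sentences could be assembled from structured fields, here is a minimal sketch. The `StyleSpec` class and `build_prompt` helper are hypothetical names introduced for this example; they are not part of TCSinger 2, and the sentence template simply mirrors the shape of the prompts listed above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StyleSpec:
    """Hypothetical container for the style levels described in this section."""
    gender: str                       # global timbre, e.g. "female"
    vocal_range: str                  # global timbre, e.g. "alto"
    method: str                       # singing method, e.g. "pop"
    technique: Optional[str] = None   # segment-level technique, e.g. "breathy"


def build_prompt(spec: StyleSpec) -> str:
    """Compose a textual prompt in the same shape as the examples above."""
    article = "an" if spec.vocal_range[0].lower() in "aeiou" else "a"
    prompt = (
        f"A {spec.gender} singer with {article} {spec.vocal_range} "
        f"vocal range performs a {spec.method} song."
    )
    if spec.technique:
        pronoun = "She" if spec.gender == "female" else "He"
        prompt += (
            f" {pronoun} sings with the {spec.technique} technique "
            "in the whole segment."
        )
    return prompt


print(build_prompt(StyleSpec("female", "alto", "pop", technique="breathy")))
```

Keeping the style levels as separate fields makes it easy to vary one level (for example, the technique) while holding the others fixed, which is how the prompts in the table differ from one another.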

Speech-to-Singing Style Transfer

We randomly select unseen singers from the test set as target samples and use different speech samples from the same singers as prompts. We also input music scores as content.

1. Target Lyric: 风到这里就是粘，粘住过客的思念

Language: Chinese

Successfully transferred timbre, accent, enunciation.

Prompt	TCSinger 2

2. Target Lyric: Anytime you whisper my name, you’ll see

Language: English

Successfully transferred timbre, accent, enunciation.

Prompt	TCSinger 2

3. Target Lyric: parais, astre pur, et charmant

Language: French

Successfully transferred timbre, accent, enunciation.

Prompt	TCSinger 2

4. Target Lyric: господи, как это, господи, как это, как это больно

Language: Russian

Successfully transferred timbre, accent, enunciation.

Prompt	TCSinger 2