📝 Publications
* denotes co-first authors
🔊 Spatial Audio

ISDrama: Immersive Spatial Drama Generation through Multimodal Prompting
Yu Zhang, Wenxiang Guo, Changhao Pan, et al.
- MRSDrama is the first multimodal recorded spatial drama dataset, containing binaural drama audio, scripts, videos, geometric poses, and textual prompts.
- ISDrama is the first immersive spatial drama generation model through multimodal prompting.
- Our work has been featured by multiple media outlets and forums.
A Multimodal Evaluation Framework for Spatial Audio Playback Systems: From Localization to Listener Preference, Changhao Pan*, Wenxiang Guo*, Yu Zhang*, et al. | ACM-MM 2025
ASAudio: A Survey of Advanced Spatial Audio Research, Zhiyuan Zhu*, Yu Zhang*, Wenxiang Guo*, et al. | Preprint
MRSAudio: A Large-Scale Multimodal Recorded Spatial Audio Dataset with Refined Annotations, Wenxiang Guo*, Changhao Pan*, Zhiyuan Zhu*, Xintong Hu*, Yu Zhang*, et al. | Under Review
🎼 Music Generation

Versatile Framework for Song Generation with Prompt-based Control
Yu Zhang, Wenxiang Guo, Changhao Pan, et al.
- VersBand is a multi-task song generation framework for synthesizing high-quality, aligned songs with prompt-based control.
🎙️ Singing Voice Synthesis

TCSinger 2: Customizable Multilingual Zero-shot Singing Voice Synthesis
Yu Zhang, Ziyue Jiang, Ruiqi Li, et al.

TCSinger: Zero-Shot Singing Voice Synthesis with Style Transfer and Multi-Level Style Control
Yu Zhang, Ziyue Jiang, Ruiqi Li, et al.
- TCSinger is the first zero-shot SVS model for style transfer across cross-lingual speech and singing styles, along with multi-level style control.

GTSinger: A Global Multi-Technique Singing Corpus with Realistic Music Scores for All Singing Tasks
Yu Zhang, Changhao Pan, Wenxiang Guo, et al.

StyleSinger: Style Transfer for Out-of-Domain Singing Voice Synthesis
Yu Zhang, Rongjie Huang, Ruiqi Li, et al.
- StyleSinger is the first singing voice synthesis model for zero-shot style transfer of out-of-domain reference singing voice samples.
STARS: A Unified Framework for Singing Transcription, Alignment, and Refined Style Annotation, Wenxiang Guo*, Yu Zhang*, Changhao Pan*, et al. | ACL 2025
Synthetic Singers: A Review of Deep-Learning-based Singing Voice Synthesis Approaches, Changhao Pan*, Dongyu Yao*, Yu Zhang*, et al. | Under Review
TechSinger: Technique Controllable Multilingual Singing Voice Synthesis via Flow Matching, Wenxiang Guo, Yu Zhang, Changhao Pan, et al. | AAAI 2025
Robust Singing Voice Transcription Serves Synthesis, Ruiqi Li, Yu Zhang, Yongqi Wang, et al. | ACL 2024
💬 Speech Synthesis
Conan: A Chunkwise Online Network for Zero-Shot Adaptive Voice Conversion, Yu Zhang, Baotong Tian, Zhiyao Duan | ASRU 2025
MegaTTS 3: Sparse Alignment Enhanced Latent Diffusion Transformer for Zero-Shot Speech Synthesis, Ziyue Jiang, Yi Ren, Ruiqi Li, Shengpeng Ji, Zhenhui Ye, Chen Zhang, Jionghao Bai, Xiaoda Yang, Jialong Zuo, Yu Zhang, et al. | Preprint
💡 Others
Leveraging Pretrained Diffusion Models for Zero-Shot Part Assembly, Ruiyuan Zhang, Qi Wang, Jiaxiang Liu, Yu Zhang, et al. | IJCAI 2025