📝 Publications
* denotes co-first authors
🔊 Spatial Audio

ISDrama: Immersive Spatial Drama Generation through Multimodal Prompting
Yu Zhang, Wenxiang Guo, Changhao Pan, et al.
- MRSDrama is the first multimodal recorded spatial drama dataset, containing binaural drama audio, scripts, videos, geometric poses, and textual prompts.
- ISDrama is the first immersive spatial drama generation model through multimodal prompting.
- Our work has been featured by multiple media outlets and forums.
A Multimodal Evaluation Framework for Spatial Audio Playback Systems: From Localization to Listener Preference, Changhao Pan*, Wenxiang Guo*, Yu Zhang*, et al. | ACM-MM 2025
ASAudio: A Survey of Advanced Spatial Audio Research, Zhiyuan Zhu*, Yu Zhang*, Wenxiang Guo*, et al. | Preprint
MRSAudio: A Large-Scale Multimodal Recorded Spatial Audio Dataset with Refined Annotations, Wenxiang Guo*, Changhao Pan*, Zhiyuan Zhu*, Xintong Hu*, Yu Zhang*, et al. | Under Review
🎼 Music Generation

Versatile Framework for Song Generation with Prompt-based Control
Yu Zhang, Wenxiang Guo, Changhao Pan, et al.
- VersBand is a multi-task song generation framework for synthesizing high-quality, aligned songs with prompt-based control.
🎙️ Singing Voice Synthesis

TCSinger 2: Customizable Multilingual Zero-shot Singing Voice Synthesis
Yu Zhang, Ziyue Jiang, Ruiqi Li, et al.

TCSinger: Zero-Shot Singing Voice Synthesis with Style Transfer and Multi-Level Style Control
Yu Zhang, Ziyue Jiang, Ruiqi Li, et al.
- TCSinger is the first zero-shot SVS model for style transfer across cross-lingual speech and singing styles, along with multi-level style control.

GTSinger: A Global Multi-Technique Singing Corpus with Realistic Music Scores for All Singing Tasks
Yu Zhang, Changhao Pan, Wenxiang Guo, et al.

StyleSinger: Style Transfer for Out-of-Domain Singing Voice Synthesis
Yu Zhang, Rongjie Huang, Ruiqi Li, et al.
- StyleSinger is the first singing voice synthesis model for zero-shot style transfer of out-of-domain reference singing voice samples.
STARS: A Unified Framework for Singing Transcription, Alignment, and Refined Style Annotation, Wenxiang Guo*, Yu Zhang*, Changhao Pan*, et al. | ACL 2025
Synthetic Singers: A Review of Deep-Learning-based Singing Voice Synthesis Approaches, Changhao Pan*, Dongyu Yao*, Yu Zhang*, et al. | Under Review
TechSinger: Technique Controllable Multilingual Singing Voice Synthesis via Flow Matching, Wenxiang Guo, Yu Zhang, Changhao Pan, et al. | AAAI 2025
Robust Singing Voice Transcription Serves Synthesis, Ruiqi Li, Yu Zhang, Yongqi Wang, et al. | ACL 2024
💬 Speech Synthesis
Conan: A Chunkwise Online Network for Zero-Shot Adaptive Voice Conversion, Yu Zhang, Baotong Tian, Zhiyao Duan | ASRU 2025
MegaTTS 3: Sparse Alignment Enhanced Latent Diffusion Transformer for Zero-Shot Speech Synthesis, Ziyue Jiang, Yi Ren, Ruiqi Li, Shengpeng Ji, Zhenhui Ye, Chen Zhang, Jionghao Bai, Xiaoda Yang, Jialong Zuo, Yu Zhang, et al. | Preprint
💡 Others
Leveraging Pretrained Diffusion Models for Zero-Shot Part Assembly, Ruiyuan Zhang, Qi Wang, Jiaxiang Liu, Yu Zhang, et al. | IJCAI 2025