📝 Publications

*denotes co-first authors

🔊 Spatial Audio

ACM-MM 2025

ISDrama: Immersive Spatial Drama Generation through Multimodal Prompting
Yu Zhang, Wenxiang Guo, Changhao Pan, et al.

Demo Hugging Face

  • MRSDrama is the first multimodal recorded spatial drama dataset, containing binaural drama audios, scripts, videos, geometric poses, and textual prompts.
  • ISDrama is the first immersive spatial drama generation model through multimodal prompting.
  • Our work has been covered by multiple media outlets and forums, including Weixin and Zhihu.
  • ACM-MM 2025 A Multimodal Evaluation Framework for Spatial Audio Playback Systems: From Localization to Listener Preference, Changhao Pan*, Wenxiang Guo*, Yu Zhang*, et al. | Demo
  • Preprint ASAudio: A Survey of Advanced Spatial Audio Research, Zhiyuan Zhu*, Yu Zhang*, Wenxiang Guo*, et al.
  • Under Review MRSAudio: A Large-Scale Multimodal Recorded Spatial Audio Dataset with Refined Annotations, Wenxiang Guo*, Changhao Pan*, Zhiyuan Zhu*, Xintong Hu*, Yu Zhang*, et al. | Demo Hugging Face

🎼 Music Generation

Preprint

Versatile Framework for Song Generation with Prompt-based Control
Yu Zhang, Wenxiang Guo, Changhao Pan, et al.

Demo

  • VersBand is a multi-task song generation framework for synthesizing high-quality, aligned songs with prompt-based control.

🎙️ Singing Voice Synthesis

ACL 2025

TCSinger 2: Customizable Multilingual Zero-shot Singing Voice Synthesis
Yu Zhang, Ziyue Jiang, Ruiqi Li, et al.

Demo

  • TCSinger 2 is a multi-task multilingual zero-shot SVS model with style transfer and style control based on various prompts.
  • Our work has been covered by multiple media outlets and forums, including Weixin and Zhihu.
EMNLP 2024

TCSinger: Zero-Shot Singing Voice Synthesis with Style Transfer and Multi-Level Style Control
Yu Zhang, Ziyue Jiang, Ruiqi Li, et al.

Demo

  • TCSinger is the first zero-shot SVS model for style transfer across cross-lingual speech and singing styles, along with multi-level style control.
NeurIPS 2024 Spotlight

GTSinger: A Global Multi-Technique Singing Corpus with Realistic Music Scores for All Singing Tasks
Yu Zhang, Changhao Pan, Wenxiang Guo, et al.

Demo Hugging Face

  • GTSinger is a large Global, multi-Technique, free-to-use, high-quality singing corpus with realistic music scores, designed for all singing tasks.
  • Our work has been covered by multiple media outlets and forums, including Weixin and Zhihu.
AAAI 2024

StyleSinger: Style Transfer for Out-of-Domain Singing Voice Synthesis
Yu Zhang, Rongjie Huang, Ruiqi Li, et al.

Demo

  • StyleSinger is the first singing voice synthesis model for zero-shot style transfer of out-of-domain reference singing voice samples.

💬 Speech Synthesis

💡 Others