I am Yu Zhang (张彧), currently a Research Scientist at ByteDance. If you are seeking any form of academic cooperation, please feel free to email me at aaron9834@icloud.com.

I earned my PhD from the College of Computer Science and Technology, Zhejiang University (浙江大学计算机科学与技术学院), under the supervision of Prof. Zhou Zhao (赵洲). Previously, I graduated from Chu Kochen Honors College, Zhejiang University (浙江大学竺可桢学院), with dual bachelor's degrees in Computer Science and Automation. I have also served as a visiting scholar at the University of Rochester with Prof. Zhiyao Duan and at the University of Massachusetts Amherst with Prof. Przemyslaw Grabowicz.

My research interests primarily focus on Multi-Modal Generative AI, specifically in Spatial Audio, Music, Singing, and Speech. I have published first-author papers at top international AI conferences, such as NeurIPS, ACL, and AAAI.

🔥 News

  • 2025.08: 🎉 1 paper is accepted by EMNLP 2025!
  • 2025.08: 🎉 1 paper is accepted by ASRU 2025!
  • 2025.08: I joined ByteDance as a Research Scientist.
  • 2025.07: We released the full dataset and evaluation code of ISDrama (Immersive Spatial Drama Generation through Multimodal Prompting)!
  • 2025.07: We released the code of TCSinger2 (Customizable Multilingual Zero-shot Singing Voice Synthesis)!
  • 2025.07: 🎉 2 papers are accepted by ACM-MM 2025!
  • 2025.06: 🎉 I earned my PhD in Computer Science from Zhejiang University!
  • 2025.05: 🎉 2 papers are accepted by ACL 2025!
  • 2025.04: I joined the University of Rochester as a visiting scholar, working with Prof. Zhiyao Duan.
  • 2024.12: 🎉 1 paper is accepted by AAAI 2025!
  • 2024.11: We released the code of TCSinger (Zero-Shot Singing Voice Synthesis with Style Transfer and Multi-Level Style Control)!
  • 2024.09: We released the full dataset and code of GTSinger (A Global Multi-Technique Singing Corpus for All Singing Tasks)!
  • 2024.09: 🎉 1 paper is accepted by NeurIPS 2024 (Spotlight)!
  • 2024.09: 🎉 1 paper is accepted by EMNLP 2024!
  • 2024.05: 🎉 1 paper is accepted by ACL 2024!
  • 2024.05: We released the code of StyleSinger (Style Transfer for Out-of-Domain Singing Voice Synthesis)!
  • 2023.12: 🎉 1 paper is accepted by AAAI 2024!

📝 Publications

*denotes co-first authors

🔊 Spatial Audio

ACM-MM 2025

ISDrama: Immersive Spatial Drama Generation through Multimodal Prompting
Yu Zhang, Wenxiang Guo, Changhao Pan, et al.

Demo Hugging Face

  • MRSDrama is the first multimodal recorded spatial drama dataset, containing binaural drama audio, scripts, videos, geometric poses, and textual prompts.
  • ISDrama is the first immersive spatial drama generation model through multimodal prompting.
  • Our work has been featured by multiple media outlets and forums, including WeChat and Zhihu.
  • ACM-MM 2025 A Multimodal Evaluation Framework for Spatial Audio Playback Systems: From Localization to Listener Preference, Changhao Pan*, Wenxiang Guo*, Yu Zhang*, et al. | Demo
  • Preprint ASAudio: A Survey of Advanced Spatial Audio Research, Zhiyuan Zhu*, Yu Zhang*, Wenxiang Guo*, et al.
  • Under Review MRSAudio: A Large-Scale Multimodal Recorded Spatial Audio Dataset with Refined Annotations, Wenxiang Guo*, Changhao Pan*, Zhiyuan Zhu*, Xintong Hu*, Yu Zhang*, et al. | Demo Hugging Face

🎼 Music Generation

EMNLP 2025

Versatile Framework for Song Generation with Prompt-based Control
Yu Zhang, Wenxiang Guo, Changhao Pan, et al.

Demo

  • VersBand is a multi-task song generation framework for synthesizing high-quality, aligned songs with prompt-based control.

🎙️ Singing Voice Synthesis

ACL 2025

TCSinger 2: Customizable Multilingual Zero-shot Singing Voice Synthesis
Yu Zhang, Ziyue Jiang, Ruiqi Li, et al.

Demo

  • TCSinger 2 is a multi-task multilingual zero-shot SVS model with style transfer and style control based on various prompts.
  • Our work has been featured by multiple media outlets and forums, including WeChat and Zhihu.
EMNLP 2024

TCSinger: Zero-Shot Singing Voice Synthesis with Style Transfer and Multi-Level Style Control
Yu Zhang, Ziyue Jiang, Ruiqi Li, et al.

Demo

  • TCSinger is the first zero-shot SVS model for style transfer across cross-lingual speech and singing styles, along with multi-level style control.
NeurIPS 2024 Spotlight

GTSinger: A Global Multi-Technique Singing Corpus with Realistic Music Scores for All Singing Tasks
Yu Zhang, Changhao Pan, Wenxiang Guo, et al.

Demo Hugging Face

  • GTSinger is a large Global, multi-Technique, free-to-use, high-quality singing corpus with realistic music scores, designed for all singing tasks.
  • Our work has been featured by multiple media outlets and forums, including WeChat and Zhihu.
AAAI 2024

StyleSinger: Style Transfer for Out-of-Domain Singing Voice Synthesis
Yu Zhang, Rongjie Huang, Ruiqi Li, et al.

Demo

  • StyleSinger is the first singing voice synthesis model for zero-shot style transfer of out-of-domain reference singing voice samples.

💬 Speech Synthesis

💡 Others

📖 Education

  • 2025.06, PhD in Computer Science and Technology, Zhejiang University, advised by Prof. Zhou Zhao.
  • 2020.06, Dual bachelor's degrees in Computer Science and Automation, Chu Kochen Honors College, Zhejiang University.

💻 Industrial Experiences

  • 2025.08 - Now, Research Scientist at ByteDance.

🔍 Research Experiences

  • 2025.04, Visiting scholar at the University of Rochester, working with Prof. Zhiyao Duan.
  • Visiting scholar at the University of Massachusetts Amherst, working with Prof. Przemyslaw Grabowicz.

🎖 Honors and Awards

  • 2024.09, Outstanding PhD Student Scholarship of Zhejiang University (Top 10%).
  • 2020.06, Outstanding Graduate of Zhejiang University (Undergraduate) (Top 5%).
  • 2019.09, First-Class Academic Scholarship of Zhejiang University (Undergraduate) (Top 5%).

📚 Academic Services

  • Conference Reviewer: NeurIPS (2024, 2025), ICLR (2025), ACL (2024, 2025), AAAI (2026), ACM-MM (2025), EMNLP (2024, 2025), AACL (2025).
  • Journal Reviewer: IEEE TASLP.