Abstract

Zero-shot online voice conversion (VC) holds significant promise for real-time communications and entertainment. However, current VC models struggle to preserve semantic fidelity under real-time constraints, deliver natural-sounding conversions, and adapt effectively to unseen speaker characteristics. To address these challenges, we introduce Conan, a chunkwise online zero-shot voice conversion model that preserves the content of the source speech while matching the timbre and style of the reference speech. Conan comprises three core components: 1) a Stream Content Extractor that leverages Emformer for low-latency streaming content encoding; 2) an Adaptive Style Encoder that extracts fine-grained stylistic features from reference speech for enhanced style adaptation; and 3) a Causal Shuffle Vocoder that implements a fully causal HiFiGAN using a pixel-shuffle mechanism. Experimental evaluations demonstrate that Conan outperforms baseline models on both subjective and objective metrics.

Zero-shot Streaming Voice Conversion

We first show examples of zero-shot streaming voice conversion on the VCTK dataset.

Example 1

Source	Reference	Conan	Conan (fast)

Example 2

Source	Reference	Conan	Conan (fast)

Example 3

Source	Reference	Conan	Conan (fast)

Example 4

Source	Reference	Conan	Conan (fast)

Cross-dataset Comparison

We then demonstrate the cross-dataset performance of our method. During streaming inference, the entire reference speech is first fed into the model to provide timbre and stylistic information. For chunkwise online inference, the source input is buffered until it reaches a predefined chunk size and is then passed to the model.
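The buffering scheme described above can be sketched roughly as follows. This is a minimal illustration, not the Conan implementation: `stream_convert` and `make_placeholder_converter` are hypothetical names, the placeholder converter only applies a gain derived from the reference in place of a real model, and the sketch assumes that each converted chunk has the same length as its input chunk.

```python
from typing import Callable, Iterable, Iterator

import numpy as np


def make_placeholder_converter(reference: np.ndarray) -> Callable[[np.ndarray], np.ndarray]:
    """Hypothetical stand-in for a model: the reference is processed once, up front."""
    # Assumption: a single RMS value plays the role of the timbre/style information.
    ref_gain = float(np.sqrt(np.mean(np.asarray(reference, dtype=np.float32) ** 2)))

    def convert_chunk(chunk: np.ndarray) -> np.ndarray:
        # A real system would run its content encoder and vocoder here.
        return chunk * ref_gain

    return convert_chunk


def stream_convert(
    frames: Iterable[np.ndarray],
    convert_chunk: Callable[[np.ndarray], np.ndarray],
    chunk_size: int,
) -> Iterator[np.ndarray]:
    """Buffer incoming source audio and convert it one fixed-size chunk at a time."""
    buffer = np.zeros(0, dtype=np.float32)
    for frame in frames:
        buffer = np.concatenate([buffer, np.asarray(frame, dtype=np.float32)])
        # Emit a chunk every time the buffer holds at least chunk_size samples.
        while len(buffer) >= chunk_size:
            chunk, buffer = buffer[:chunk_size], buffer[chunk_size:]
            yield convert_chunk(chunk)
    if len(buffer) > 0:
        # Flush the trailing partial chunk: zero-pad, convert, trim the padding.
        remaining = len(buffer)
        padded = np.pad(buffer, (0, chunk_size - remaining))
        yield convert_chunk(padded)[:remaining]
```

In this sketch the chunk size sets the buffering delay: a larger chunk gives the converter more context per call, but no output can be emitted until that many source samples have arrived.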

In this experiment, speech from VCTK provides the reference speaker, while speech from LibriTTS serves as the source content. Note that StreamVC is not open-sourced and requires f0 as input; hence, it is excluded from this comparison. All baselines compared here are offline voice conversion methods that are not designed for streaming inference.

Reference Speaker p231

Reference Speech:

Text: He shrugged his shoulders in ungracious acquiescence, while our visitor in hurried words and with much excitable gesticulation poured forth his story.