
Abstract

Zero-shot online voice conversion (VC) holds significant promise for real-time communications and entertainment. However, current VC models struggle to preserve semantic fidelity under real-time constraints, deliver natural-sounding conversions, and adapt effectively to unseen speaker characteristics. To address these challenges, we introduce Conan, a chunkwise online zero-shot voice conversion model that preserves the content of the source speech while matching the timbre and style of the reference speech. Conan comprises three core components: 1) a Stream Content Extractor that leverages Emformer for low-latency streaming content encoding; 2) an Adaptive Style Encoder that extracts fine-grained stylistic features from reference speech for enhanced style adaptation; and 3) a Causal Shuffle Vocoder that implements a fully causal HiFiGAN using a pixel-shuffle mechanism. Experimental evaluations demonstrate that Conan outperforms baseline models in both subjective and objective metrics.

Zero-shot Streaming Voice Conversion

We first show examples of zero-shot streaming voice conversion on the VCTK dataset.

Example 1

Target Reference Conan Conan (fastest)

Example 2

Target Reference Conan Conan (fastest)

Example 3

Target Reference Conan Conan (fastest)

Example 4

Target Reference Conan Conan (fastest)

Cross-dataset Comparison

We then demonstrate the cross-dataset performance of our method. During streaming inference, the entire reference speech is first fed into the model to provide timbre and stylistic information. For chunkwise online inference, the incoming source audio is buffered until it reaches a predefined chunk size and is then passed to the model.
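The inference flow described above can be illustrated with a minimal Python sketch. It is not Conan's implementation: `encode_reference` and `convert_chunk` are hypothetical stand-ins for the style encoder and the conversion model, and flushing a final partial chunk is an assumption that the description does not specify. It only shows the control flow: the reference is encoded once, and source samples are buffered until a full chunk is available.

```python
from typing import Iterable, Iterator, List, Sequence


def encode_reference(reference: Sequence[float]) -> float:
    """Hypothetical stand-in for a style encoder: mean absolute amplitude."""
    return sum(abs(x) for x in reference) / max(len(reference), 1)


def convert_chunk(chunk: Sequence[float], style: float) -> List[float]:
    """Hypothetical stand-in for the conversion model: scale samples by `style`."""
    return [x * style for x in chunk]


def stream_convert(
    frames: Iterable[Sequence[float]],
    reference: Sequence[float],
    chunk_size: int,
) -> Iterator[List[float]]:
    """Yield converted chunks as soon as `chunk_size` samples are buffered."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")

    # The whole reference utterance is consumed once, before streaming starts.
    style = encode_reference(reference)

    buffer: List[float] = []
    for frame in frames:
        buffer.extend(frame)
        # Emit every complete chunk that the buffer currently holds.
        while len(buffer) >= chunk_size:
            chunk, buffer = buffer[:chunk_size], buffer[chunk_size:]
            yield convert_chunk(chunk, style)

    # Assumption: any leftover samples are flushed as a shorter final chunk.
    if buffer:
        yield convert_chunk(buffer, style)


if __name__ == "__main__":
    incoming = [[1.0] * 3, [1.0] * 3, [1.0] * 2]  # audio arriving in small frames
    for out in stream_convert(incoming, reference=[2.0, -2.0], chunk_size=4):
        print(len(out))
```

In this sketch the chunk size sets the buffering delay: a smaller `chunk_size` emits output sooner but gives the model less context per call.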

In this experiment, speech from VCTK provides the target speaker's voice, while speech from LibriTTS serves as the source content. Note that StreamVC is not open-sourced and requires f0 as input; hence, it is excluded from this comparison. All the baseline approaches compared here are offline voice conversion methods, which are not designed for streaming inference.

Target Speaker p231

Target Speech:

Text: He shrugged his shoulders in ungracious acquiescence, while our visitor in hurried words and with much excitable gesticulation poured forth his story.

Source Conan Conan (fastest) QuickVC DiffVC VQMIVC PPGVC

Target Speaker p360

Target Speech:

Text: He examines the horizon all round with his glass, and folds his arms with the air of an injured man.

Source Conan Conan (fastest) QuickVC DiffVC VQMIVC PPGVC