REF-VC: Robust, Expressive and Fast Zero-Shot Voice Conversion with Diffusion Transformers
A work submitted to ASRU 2025
1. Abstract
In real-world voice conversion applications, environmental noise in source speech and user demands for expressive output pose critical challenges. Traditional ASR-based methods ensure noise robustness but suppress prosody, while SSL-based models improve expressiveness but suffer from timbre leakage and noise sensitivity. This paper proposes REF-VC, a noise-robust expressive voice conversion system. Key innovations include: (1) a random erasing strategy that mitigates the information redundancy inherent in SSL features, enhancing both noise robustness and expressiveness; (2) implicit alignment inspired by E2TTS to suppress non-essential feature reconstruction; (3) integration of Shortcut Models to accelerate flow matching inference, reducing it to 4 steps. Experimental results demonstrate that our model outperforms baselines such as Seed-VC in zero-shot scenarios on the noisy set, while performing comparably to Seed-VC on the clean set. In addition, REF-VC supports singing voice conversion within the same model.
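As a rough illustration of the random erasing idea described above, the sketch below zeroes out random contiguous spans of an SSL feature sequence during training. All names, parameters, and the exact masking policy here are our own assumptions for illustration; the paper's actual scheme may differ.

```python
import numpy as np

def random_erase(ssl_feats: np.ndarray, erase_prob: float = 0.5,
                 max_span: int = 10, rng=None) -> np.ndarray:
    """Randomly zero out contiguous frame spans of an SSL feature
    sequence of shape [T, D], reducing information redundancy so the
    model relies less on any single frame (hypothetical parameters)."""
    rng = rng or np.random.default_rng()
    feats = ssl_feats.copy()
    num_frames = feats.shape[0]
    t = 0
    while t < num_frames:
        if rng.random() < erase_prob:
            # Erase a random span of 1..max_span frames starting at t.
            span = int(rng.integers(1, max_span + 1))
            feats[t:t + span] = 0.0
            t += span
        else:
            t += 1
    return feats
```

Applied as a training-time augmentation, this forces the conversion model to reconstruct speech from partially masked SSL features, which is one plausible way such a strategy could improve robustness to noisy inputs.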
2. Demos
Compared Methods
- VITS-VC: an internal VITS-based VC system.
- Seed-VC: a state-of-the-art VC system based on diffusion transformers.
Clean Samples
Noisy Samples
Singing Voice Conversion
*Note that the Seed-VC samples are generated by a separate SVC model, while the REF-VC samples are generated by the same model used for the speech samples above.