VeRe-Flow: Guiding Flow Matching toward Clean Speech via Velocity Contrastive and Representation Regularization for Noise-Robust Bandwidth Expansion

Abstract

Noise-robust bandwidth expansion (NR-BWE) aims to reconstruct high-fidelity wideband speech from noisy low-resolution inputs. While flow matching has shown strong performance for speech generation, accurately recovering clean speech under noisy conditions remains challenging due to ambiguous velocity estimation. In this work, we propose VeRe-Flow, a clean-guided flow matching framework that introduces multi-level clean supervision to guide the generative process toward clean speech. At the velocity level, we introduce velocity contrastive regularization, which attracts the predicted velocity toward the clean trajectory while repelling it from the noisy manifold. At the representation level, we incorporate a representation alignment objective that aligns intermediate features with clean self-supervised speech representations. Experimental results demonstrate that the proposed method achieves the lowest LSD and the highest DNSMOS OVRL and MOS among NR-BWE baselines. Audio samples are available.

This work has been submitted to Interspeech 2026 for review.
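
The two regularizers named in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the page gives no formulas, so the sketch assumes a straight-line flow matching path x_t = (1 - t) * x0 + t * x1, assumes the "noisy" negative is the velocity pointing toward the noisy signal, and assumes cosine similarity for representation alignment. All function names and the weight `lam` are hypothetical.

```python
import math


def mse(a, b):
    """Mean squared error between two equal-length sequences of floats."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def linear_path_velocity(x0, x1):
    """Target velocity x1 - x0 of the straight path x_t = (1 - t) * x0 + t * x1."""
    return [b - a for a, b in zip(x0, x1)]


def velocity_contrastive_loss(v_pred, x0, x_clean, x_noisy, lam=0.05):
    """Attract v_pred to the clean velocity, repel it from the noisy one.

    ASSUMPTION: the negative target is the velocity toward the noisy
    signal; `lam` is a hypothetical weight, not a value from the paper.
    """
    attract = mse(v_pred, linear_path_velocity(x0, x_clean))
    repel = mse(v_pred, linear_path_velocity(x0, x_noisy))
    return attract - lam * repel


def representation_alignment_loss(features, targets, eps=1e-8):
    """One minus the mean frame-wise cosine similarity.

    ASSUMPTION: `features` are intermediate model features already
    projected to the dimension of `targets`, the clean self-supervised
    representations. Both are lists of per-frame vectors.
    """
    if len(features) != len(targets):
        raise ValueError("frame counts must match")
    total = 0.0
    for f, g in zip(features, targets):
        dot = sum(a * b for a, b in zip(f, g))
        norm_f = math.sqrt(sum(a * a for a in f))
        norm_g = math.sqrt(sum(b * b for b in g))
        total += dot / (norm_f * norm_g + eps)
    return 1.0 - total / len(features)


if __name__ == "__main__":
    x0, x_clean, x_noisy = [0.0, 0.0], [1.0, 1.0], [1.0, 3.0]
    # A prediction matching the clean velocity scores lower (better)
    # than one matching the noisy velocity.
    print(velocity_contrastive_loss([1.0, 1.0], x0, x_clean, x_noisy))
    print(velocity_contrastive_loss([1.0, 3.0], x0, x_clean, x_noisy))
```

In a training setup along these lines, the two terms would presumably be added to the standard flow matching loss with separate weights; the actual formulation and weighting are described in the paper, not on this page.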

Model Structure

Figure 1. Overview of the proposed system


Audio Samples (Valentini-Botinhao test set [1])

Comparison of 8k Noisy Input, 16k Predicted Speech (NU-Wave2, FlowHigh, VeRe-Flow (Proposed)), and 16k Clean Ground Truth.

* NU-Wave2 and FlowHigh were retrained under the same NR-BWE setting as the proposed method.

* We recommend listening with headphones for the best experience.

Sample | 8k Noisy Input | NU-Wave2 | FlowHigh | VeRe-Flow (Proposed) | 16k Clean GT
Speaker p232
Sample 1
Sample 2
Sample 3
Sample 4
Sample 5
Sample 6
Sample 7
Sample 8
Sample 9
Sample 10
Speaker p257
Sample 1
Sample 2
Sample 3
Sample 4
Sample 5
Sample 6
Sample 7
Sample 8
Sample 9
Sample 10

References

[1] C. Valentini-Botinhao et al., "Noisy speech database for training speech enhancement algorithms and TTS models," 2017.