
MVSD: Mutual Learning for Acoustic Matching and Dereverberation via Visual Scene-driven Diffusion

Visually guided audio style transfer has made significant progress with the emergence of cross-modal generation. Nevertheless, simultaneously recording large-scale audio pairs at both the source and receiving ends is a formidable challenge. Worse still, existing methods treat each task independently, overlooking the inverse correlation between certain dual tasks, which hinders their ability to leverage massive unlabeled data. In this paper, we introduce MVSD, a diffusion model-based mutual learning mechanism. MVSD exploits the intrinsic reciprocity between visual acoustic matching (VAM) and dereverberation, enabling learning from symmetric tasks and overcoming the scarcity of data. More specifically, MVSD employs two converters: a reverberator for VAM and a dereverberator for dereverberation. The dereverberator judges whether the reverberant audio generated by the reverberator sounds as if it were recorded in the conditioning visual scene, and vice versa. By forming a closed loop, the two converters generate informative feedback signals that optimize the inverse tasks, even with easily acquired one-way unpaired data. Furthermore, we employ diffusion models as the foundational conditional generators to circumvent the training instability and over-smoothing drawbacks of conventional GAN architectures. Extensive experiments show that our framework improves the performance of each task and better matches the specified visual scenarios. In both tasks, MVSD surpasses competitors on two standard benchmarks. Remarkably, the performance of the models can be further enhanced by adding unpaired data.
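To make the closed-loop idea more concrete, here is a minimal toy sketch in Python. It is not the MVSD implementation: the diffusion-based converters are replaced by simple hand-written filters, the visual scene is reduced to a single hypothetical `decay` value, and the feedback signal is approximated by a plain round-trip reconstruction error. All function and parameter names are illustrative assumptions.

```python
import numpy as np

def reverberate(dry, decay):
    """Toy stand-in for the reverberator: convolve the dry signal with an
    exponentially decaying impulse response set by a scene 'decay' value."""
    impulse_response = decay ** np.arange(len(dry))
    return np.convolve(dry, impulse_response)[: len(dry)]

def dereverberate(wet, decay_estimate):
    """Toy stand-in for the dereverberator: inverse filter for the
    exponential decay, using the decay it believes the scene has."""
    dry = wet.copy()
    dry[1:] -= decay_estimate * wet[:-1]
    return dry

def cycle_feedback(dry, decay, decay_estimate):
    """Round-trip error on unpaired dry audio (assumed surrogate feedback):
    reverberate, dereverberate, and compare against the input."""
    wet = reverberate(dry, decay)
    reconstructed = dereverberate(wet, decay_estimate)
    return float(np.mean((reconstructed - dry) ** 2))

# Unpaired dry audio only: no recorded reverberant counterpart is needed.
rng = np.random.default_rng(0)
dry_audio = rng.standard_normal(256)

matched = cycle_feedback(dry_audio, decay=0.6, decay_estimate=0.6)
mismatched = cycle_feedback(dry_audio, decay=0.6, decay_estimate=0.2)
print("matched scene error:", matched)
print("mismatched scene error:", mismatched)
```

In this toy setting the round-trip error is close to zero when both converters agree on the scene and grows when they disagree, which is the kind of signal a closed loop can provide without paired recordings.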

Visual Acoustic Matching (VAM)

SoundSpaces


[Three examples, each with a scene image. Samples per example: Source, GT, Image2Reverb, Avatir, MVSD]

AVSpeech

[Two examples. Samples per example: Source, GT, Image2Reverb, Avatir, MVSD]

Dereverberation


[Three examples, each with a scene image. Samples per example: Source, GT, MetricGAN, VIDA, MVSD]

Citation

Please consider citing our paper if it helps your research.

@inproceedings{ma2024mutual,
  title={Mutual Learning for Acoustic Matching and Dereverberation via Visual Scene-driven Diffusion},
  author={Ma, Jian and Wang, Wenguan and Yang, Yi and Zheng, Feng},
  booktitle={ECCV},
  year={2024}
}