Leveraging Discriminative Latent Representations for Conditioning GAN-Based Speech Enhancement

Shrishti Saha Shetu, Emanuël A. P. Habets, Andreas Brendel

FhG_IIS
International Audio Laboratories Erlangen, Am Wolfsmantel 33, 91058 Erlangen, Germany

{shrishti.saha.shetu, emanuel.habets, andreas.brendel}@iis.fraunhofer.de

Abstract

Generative speech enhancement methods based on generative adversarial networks (GANs) and diffusion models have shown promising results in various speech enhancement (SE) tasks. However, their performance in very low signal-to-noise ratio (SNR) scenarios remains underexplored and limited, as these conditions pose significant challenges to both discriminative and generative state-of-the-art methods. To address this, we propose a method that leverages latent features extracted from discriminative speech enhancement models as generic conditioning features to improve GAN-based speech enhancement. The proposed method, referred to as DisCoGAN, demonstrates performance improvements over baseline models, particularly in low-SNR scenarios, while also maintaining competitive or superior performance in high-SNR conditions and on real-world recordings. We also conduct a comprehensive evaluation of conventional GAN-based architectures, including GANs trained end-to-end, GANs as a first processing stage, and post-filtering GANs, as well as discriminative models under low-SNR conditions. We show that DisCoGAN consistently outperforms existing methods. Finally, we present an ablation study that investigates the contributions of individual components within DisCoGAN and analyzes the impact of the discriminative conditioning method on overall performance.

Evaluation Scenarios

In our work, we compare our proposed method against different state-of-the-art (SOTA) generative and discriminative deep learning-based noise reduction methods in various SNR scenarios.

Below are samples processed with the different methods:

1. Low SNR Dataset

2. DNS Challenge No-Reverb (High SNR Dataset)

3. Real Recordings


1. Low SNR Dataset

Item 1 (Speaker: Female)

Item 2 (Speaker: Female)

Item 3 (Speaker: Female)

Item 4 (Speaker: Female)

Item 5 (Speaker: Male)

Item 6 (Speaker: Male)

Item 7 (Speaker: Male)

Item 8 (Speaker: Male)

Item 9 (Speaker: Male)

Item 10 (Speaker: Male)


2. DNS Challenge No-Reverb

Item 1

Item 2

Item 3

Item 4

Item 5

Item 6


3. Real Recordings

Item 1

Item 2

Item 3

Item 4

Item 5

Item 6



Conditions of Use

1. Fraunhofer IIS generated this sound material based on material that is publicly available in the VCTK dataset, the DNS Challenge, and ESC-50.

2. The content has been processed using generally accepted rules of technology as well as scientific care, but without guaranteeing the actual attainment of any expected feature.

3. With the exception of willful intent or gross negligence, Fraunhofer IIS shall not be liable that Open Source software or other third-party software is free from any error or claim or its fitness for a particular purpose, even if included within the Sound Material.

4. The Sound Material shall only be used for testing and appreciating noise reduction techniques and shall not be copied, publicly transmitted, distributed, lent or modified for any other reason.

5. No representation or warranties are made or implied regarding the accuracy, non-infringement, or fitness for a particular purpose of Sound Material.

6. Copyright and Permission notice shall be duplicated whenever Sound Material is copied, distributed, or publicly transmitted.

7. The Sound Material cannot be distributed for a charge.