Deep learning-based manifold learning for spatial filtering
My master's thesis, supervised by Dr. Andreas Brendel and examined by Prof. Walter Kellermann from the
Lehrstuhl für Multimediakommunikation und Signalverarbeitung (LMS).
I successfully defended the thesis on 13/05/2022.
Abstract:
Obtaining robust Relative Transfer Function (RTF) estimates is a crucial intermediate step in
many spatial filtering systems. The presence of adverse acoustic conditions (such as interference,
noise, and reverberation) severely degrades the RTF estimates, and thus the spatial filtering
performance. Quite often, the acoustic environment, in particular the location of the microphone
array and the surroundings, is stationary. Thus, the most prominent cause of changes in the RTFs
is the variation of the source positions. As the generative parameters have a much lower dimension
than the RTFs, the use of manifold learning techniques is justified. The most promising results in the
literature have been obtained using Variational Autoencoders (VAEs), where the learned
representations are shown to be useful for a broad range of tasks including RTF enhancement,
source extraction, and localization.
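
To make the notion of an RTF concrete, the following is a minimal sketch of one common estimator: the ratio of the time-averaged cross-power spectral density to the auto-power spectral density of a reference microphone. It is an illustration only and not necessarily the estimator used in the thesis; the function names `stft` and `estimate_rtf`, the frame length, the hop size, and the toy two-microphone signal are all assumptions made for this example.

```python
import numpy as np

def stft(x, frame_len=256, hop=128):
    """Hann-windowed STFT; returns an array of shape (frames, bins)."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    return np.fft.rfft(frames, axis=-1)

def estimate_rtf(x_ref, x_m, frame_len=256, hop=128, eps=1e-12):
    """Estimate the RTF of microphone m relative to the reference microphone.

    Hypothetical helper: averages the cross-PSD and the reference auto-PSD
    over all frames and returns their ratio, one complex value per bin.
    """
    X_ref = stft(x_ref, frame_len, hop)
    X_m = stft(x_m, frame_len, hop)
    cross_psd = np.mean(X_m * np.conj(X_ref), axis=0)
    auto_psd = np.mean(np.abs(X_ref) ** 2, axis=0)
    return cross_psd / (auto_psd + eps)

# Toy setup (assumed): the second microphone observes the source
# attenuated by 0.5 and delayed by two samples, without noise.
rng = np.random.default_rng(0)
source = rng.standard_normal(16000)
x_ref = source
x_m = 0.5 * np.roll(source, 2)

rtf = estimate_rtf(x_ref, x_m)
print(rtf.shape, rtf.dtype)
```

In this noise-free toy case the estimated magnitude should stay close to the 0.5 attenuation across frequency bins. Adding interference or noise to `x_m` and `x_ref` degrades the estimate, which is the problem the manifold learning approach targets.
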
The research to date has tended to be exploratory: the aim was to probe various fields of
application to see whether utilizing VAEs is beneficial at all, and a significant effort
went into comparisons with conventional approaches such as those based on spectral graph theory.
As a consequence, not all design choices for the VAE model have been rigorously evaluated against
the state of the art. This thesis aims to fill these gaps.
In particular, more appropriate modeling could be achieved with complex-valued neural networks (CVNNs).
RTFs are complex-valued by nature, yet previous studies adopted real-valued neural networks, mainly due
to a lack of consensus on the interpretations and implementations within the community. In this thesis,
we perform a comprehensive investigation of various CVNN variants and observe their benefits compared to
the current practices. Another focal point is merging the spatial filtering and manifold learning paradigms
to obtain a superior source extraction algorithm. The proposed modifications are compared against the
state-of-the-art VAE baseline with respect to their expressive, denoising, and speech extraction capabilities.
We found that placing greater emphasis on the complex-valued nature of the RTFs improves the overall
system performance, especially for mediocre Signal-to-Noise Ratio (SNR) values.
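
As an illustration of what a complex-valued building block can look like, the sketch below implements a complex affine layer using only real-valued arrays, followed by a modReLU activation (one known option from the CVNN literature that rectifies the magnitude and keeps the phase). This is not the architecture studied in the thesis; the names `complex_linear` and `mod_relu`, the layer sizes, and the bias value are assumptions chosen for the example.

```python
import numpy as np

def complex_linear(x, w_real, w_imag, b_real, b_imag):
    """Complex affine map computed from real and imaginary parts.

    (W_r + jW_i)(x_r + jx_i) = (W_r x_r - W_i x_i) + j(W_r x_i + W_i x_r)
    """
    x_real, x_imag = x.real, x.imag
    y_real = x_real @ w_real.T - x_imag @ w_imag.T + b_real
    y_imag = x_imag @ w_real.T + x_real @ w_imag.T + b_imag
    return y_real + 1j * y_imag

def mod_relu(z, bias):
    """modReLU: apply ReLU to (|z| + bias) and keep the phase of z."""
    mag = np.abs(z)
    scale = np.maximum(mag + bias, 0.0) / np.maximum(mag, 1e-12)
    return z * scale

# Toy input (assumed sizes): a batch of 4 complex vectors with 8 entries,
# mapped to 3 complex outputs.
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8)) + 1j * rng.standard_normal((4, 8))
w_real = rng.standard_normal((3, 8))
w_imag = rng.standard_normal((3, 8))
b_real = rng.standard_normal(3)
b_imag = rng.standard_normal(3)

y = complex_linear(x, w_real, w_imag, b_real, b_imag)

# Sanity check against NumPy's native complex arithmetic.
reference = x @ (w_real + 1j * w_imag).T + (b_real + 1j * b_imag)
assert np.allclose(y, reference)

activated = mod_relu(y, bias=-0.1)
print(activated.shape)
```

The point of the decomposition is that the real and imaginary weights are coupled through the rules of complex multiplication, whereas a real-valued network that simply stacks real and imaginary parts as separate features is free to ignore that structure.
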