Noise-Robust Adaptation Control for Supervised Acoustic System Identification Exploiting A Noise Dictionary


Abstract

We present a noise-robust adaptation control strategy for block-online supervised acoustic system identification by exploiting a noise dictionary. The proposed algorithm takes advantage of the pronounced spectral structure which characterizes many types of interfering noise signals. We model the noisy observations by a linear Gaussian DFT-domain state space model whose parameters are estimated by an online generalized EM algorithm. In contrast to all state-of-the-art approaches, we propose to model the covariance matrix of the observation pdf by a dictionary model. The noise dictionary is learned from training data, which can be gathered either offline or online whenever the system is not excited, while the activations are inferred continuously. The proposed algorithm represents a novel machine-learning-based approach to noise-robust adaptation control which allows for faster convergence in applications characterized by high-level and non-stationary interfering noise signals and abrupt system changes.

System Identification, Adaptation Control, Nonnegative Matrix Factorization, Acoustic Echo Cancellation

1 Introduction

Online supervised acoustic system identification (OSASI) is required for many modern hands-free acoustic human-machine interfaces, e.g., for the purpose of echo cancellation [1]. During recent decades, a multitude of OSASI algorithms has been developed, originating from the plain time-domain (TD) least mean squares (LMS) algorithm [2] and evolving to sophisticated implementations operating in the block-frequency domain [3]–[5]. In this context, robust OSASI typically has to overcome interfering signals which are either undesired, e.g., keyboard noise, or desired, e.g., near-end talkers. Both types of interference are termed noise in the following.

Noisy observations are usually addressed by adaptation control mechanisms which are motivated by the non-stationarity of many acoustic excitation and noise signals. A popular approach for iterative OSASI algorithms is variable step-size (VSS) control, which leads to either binary or continuous-valued step sizes. As binary VSS control, i.e., halting the filter adaptation during periods with high-level interfering noise [6]–[8], does not allow for permanent filter optimization, it is less suited for time-varying acoustic environments and applications with persistently high-level interfering noise signals. Thus, we focus in this paper on continuous VSS control, which stipulates a permanent filter adaptation. Many VSSs have been developed, ranging from scalar time-domain step sizes [9]–[11] to frequency-dependent step sizes which take into account the temporal correlation of the input and the noise signals [12]–[14] and thus often result in faster convergence. In particular, the inference of the adaptive filter coefficients by a Kalman filter [11], [13], [14] has proven to be a powerful approach to VSS control. The robustness of these algorithms against interfering signals crucially depends on precise estimates of the process and observation noise power, respectively. In [14] it is proposed to estimate both jointly with the adaptive filter coefficients by optimizing a single maximum-likelihood (ML) objective function. However, due to the high-dimensional linear Gaussian DFT-domain state space model [13], [14], this approach greatly overestimates the noise power after an abrupt system change. This results in a slow reconvergence, which has been analyzed theoretically in [15].

In this paper we address this problem by introducing a nonnegative noise dictionary model, which we term SSFDAF-NMF. The proposed model captures the pronounced spectral structure which characterizes many types of interfering noise signals, e.g., wind noise [16], music [17], speech [18] or robotic ego-noise [19]. We propose to estimate the noise dictionary from training data and to continuously infer its activations by a generalized EM algorithm [20]. The optimization of the model shows a close relation to Itakura-Saito (IS) divergence-based nonnegative matrix factorization (NMF). The proposed algorithm represents a computationally efficient block-online supervised adaptive filter with inherent noise-robust VSS control which allows for faster reconvergence after abrupt system changes.

We use bold lowercase letters for vectors and bold uppercase letters for matrices, with underlined symbols indicating TD quantities. The \(D\)-dimensional identity matrix is denoted by \(\boldsymbol{I}_D\), the \(D\)-dimensional DFT matrix by \(\boldsymbol{F}_D\) and the zero-matrix of dimensions \(D_1 \times D_2\) by \(\boldsymbol{0}_{D_1 \times D_2}\). The superscripts \((\cdot)^{\text{T}}\) and \((\cdot)^{\text{H}}\) represent transposition and Hermitian transposition, respectively. We denote the determinant by \(\text{det}(\cdot)\), the trace by \(\text{Tr}(\cdot)\), and introduce the \(\text{diag}(\cdot)\) operator which creates a diagonal matrix from its vector-valued argument. Finally, we use \(\stackrel{\text{c}}{=}\) to denote equality up to a constant and \(\circ\) for indicating element-wise operations.

2 Online Maximum-Likelihood Learning

As the proposed algorithm is based on the noisy linear frequency-domain (FD) echo path model described in [13] and related to the online inference of its parameters introduced in [14], we give a short summary of both in this section.

2.1 Linear Gaussian DFT-Domain State-Space Model

We model the noisy \(R\)-dimensional TD observation vector \[{\underline{\boldsymbol{y}}}_\tau = \boldsymbol{Q}_1^{{\text{T}}} \boldsymbol{F}_M^{-1} \boldsymbol{X}_\tau \boldsymbol{F}_M \boldsymbol{Q}_2 \underline{\mathop{\mathrm{\boldsymbol{w}}}}_\tau + \underline{\mathop{\mathrm{\boldsymbol{s}}}}_\tau \in \mathbb{R}^{R} \label{eq:timeDomObsEq}\tag{1}\] at block-index \(\tau\) by an overlap-save convolution of the input signal \[\underline{\boldsymbol{x}}_\tau = \begin{pmatrix} \underline{x}_{\tau R - M + 1}, \underline{x}_{\tau R - M + 2}, \dots, \underline{x}_{\tau R} \end{pmatrix}^{{\text{T}}} \in \mathbb{R}^{M} \label{eq:timeDomInSig}\tag{2}\] of even length \(M\) and block-shift \(R\) with the FIR filter \(\underline{\mathop{\mathrm{\boldsymbol{w}}}}_\tau \in \mathbb{R}^{M-R}\), which is superimposed by the noise vector \[\underline{\mathop{\mathrm{\boldsymbol{s}}}}_\tau = \begin{pmatrix} \underline{s}_{\tau R - R + 1}, \underline{s}_{\tau R - R + 2}, \dots, \underline{s}_{\tau R} \end{pmatrix}^{{\text{T}}} \in \mathbb{R}^{R}.\] Here, \(\boldsymbol{X}_\tau = \text{diag}\left(\boldsymbol{F}_M \underline{\boldsymbol{x}}_\tau\right) \in \mathbb{C}^{M \times M}\) denotes the diagonal matrix representing the DFT-transformed input signal block \(\underline{\boldsymbol{x}}_\tau\), \(\boldsymbol{Q}_1 = \begin{pmatrix} \boldsymbol{0}_{R \times (M-R)} & \boldsymbol{I}_R \end{pmatrix}^{\text{T}} \in \mathbb{R}^{M \times R}\) the linear convolution constraint matrix and \(\boldsymbol{Q}_2 = \begin{pmatrix} \boldsymbol{I}_{M-R} & \boldsymbol{0}_{(M-R) \times R} \end{pmatrix}^{\text{T}} \in \mathbb{R}^{M \times (M-R)}\) the zero-padding matrix. By transforming the zero-padded TD signals in Eq. 1 to the DFT domain, i.e., \[{\mathop{\mathrm{\boldsymbol{y}}}}_\tau = \boldsymbol{F}_M \boldsymbol{Q}_1 {\underline{\boldsymbol{y}}}_{\tau} \in \mathbb{C}^{M} \quad\text{and}\quad {\mathop{\mathrm{\boldsymbol{s}}}}_\tau = \boldsymbol{F}_M \boldsymbol{Q}_1 \underline{\mathop{\mathrm{\boldsymbol{s}}}}_\tau \in \mathbb{C}^{M}, \label{eq:FD95obsDef}\tag{3}\] we obtain the FD observation equation \[{\mathop{\mathrm{\boldsymbol{y}}}}_{\tau} = {\mathop{\mathrm{\boldsymbol{C}}}}_{\tau} {\mathop{\mathrm{\boldsymbol{w}}}}_{\tau} + {\mathop{\mathrm{\boldsymbol{s}}}}_{\tau}\] with the DFT-transformed FIR filter \({\mathop{\mathrm{\boldsymbol{w}}}}_{\tau} = \boldsymbol{F}_M \boldsymbol{Q}_2 \underline{\mathop{\mathrm{\boldsymbol{w}}}}_\tau \in \mathbb{C}^M\) and the overlap-save constrained input signal matrix \({\mathop{\mathrm{\boldsymbol{C}}}}_{\tau} = \boldsymbol{F}_M \boldsymbol{Q}_1 \boldsymbol{Q}_1^{\text{T}} \boldsymbol{F}_M^{-1} \boldsymbol{X}_\tau\). Here, we model the FD observation noise vector \({\mathop{\mathrm{\boldsymbol{s}}}}_{\tau}\) as a non-stationary, block-wise and spectrally uncorrelated zero-mean complex Gaussian random process which is distributed according to the pdf \[p({\mathop{\mathrm{\boldsymbol{s}}}}_\tau) = p({\mathop{\mathrm{\boldsymbol{s}}}}_\tau|{\boldsymbol{S}}_{1:\tau-1}) = \mathcal{N}_c({\mathop{\mathrm{\boldsymbol{s}}}}_{\tau}|\boldsymbol{0}_{M \times 1}, {\boldsymbol{\Psi}}_{\tau}^S), \label{eq:obsNoiseDistrib}\tag{4}\] with \({\boldsymbol{S}}_{1:\tau-1} = \begin{pmatrix}{\mathop{\mathrm{\boldsymbol{s}}}}_{1}, & \dots, & {\mathop{\mathrm{\boldsymbol{s}}}}_{\tau-1}\end{pmatrix}\). The diagonal entries of the noise covariance matrix \(\left[\boldsymbol{\Psi}_{\tau}^S\right]_{mm} = \mathbb{E}\left[{s}_{m \tau} {s}_{m \tau}^* \right]\), with \(\mathbb{E}\left[\cdot\right]\) denoting the expectation operator, approximate the noise power spectral density (PSD) at block \(\tau\).
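To make the overlap-save model in Eq. 1 concrete, the following NumPy sketch (an illustration of ours with freely chosen small dimensions, not code from [13]) instantiates \(\boldsymbol{Q}_1\), \(\boldsymbol{Q}_2\) and \(\boldsymbol{X}_\tau\) and verifies that Eq. 1 reproduces the last \(R\) samples of the linear convolution of the input block with the FIR filter:

```python
import numpy as np

# Minimal sketch of the overlap-save observation model in Eq. (1);
# M, R and all variable names are illustrative choices, not from the paper.
M, R = 8, 4                      # DFT length (even) and block shift
F = np.fft.fft(np.eye(M))        # DFT matrix F_M
Q1 = np.vstack([np.zeros((M - R, R)), np.eye(R)])      # selects last R samples
Q2 = np.vstack([np.eye(M - R), np.zeros((R, M - R))])  # zero-padding matrix

rng = np.random.default_rng(0)
x = rng.standard_normal(M)       # input block (x_{tau R-M+1}, ..., x_{tau R})
w = rng.standard_normal(M - R)   # FIR filter of length M - R
s = rng.standard_normal(R)       # additive noise block

X = np.diag(F @ x)               # diagonal DFT-domain input matrix X_tau
y = np.real(Q1.T @ np.linalg.inv(F) @ X @ F @ Q2 @ w) + s   # Eq. (1)

# Check against a direct linear convolution: Eq. (1) yields its last R samples,
# which are free of circular wrap-around for a filter of length M - R.
y_ref = np.convolve(x, w)[M - R : M] + s
assert np.allclose(y, y_ref)
```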

In [13], it is proposed to model the temporal evolution of the DFT-transformed FIR filter \({\mathop{\mathrm{\boldsymbol{w}}}}_{\tau}\) in terms of a random-walk Markov model with a stationary and diagonal process noise covariance matrix \(\boldsymbol{\Psi}^\Delta_{\tau}\). This allows us to describe the OSASI problem by the linear Gaussian DFT-domain state space model \[\begin{align} {\mathop{\mathrm{\boldsymbol{w}}}}_{\tau} &= A ~{\mathop{\mathrm{\boldsymbol{w}}}}_{\tau - 1} + \Delta {{\mathop{\mathrm{\boldsymbol{w}}}}}_{\tau} &\text{with}& \quad \Delta {{\mathop{\mathrm{\boldsymbol{w}}}}}_{\tau} \sim \mathcal{N}_c(\Delta {{\mathop{\mathrm{\boldsymbol{w}}}}}_{\tau}|{\boldsymbol{0}_{M \times 1}}, \boldsymbol{\Psi}_{\tau}^\Delta)\tag{5} \\ {\mathop{\mathrm{\boldsymbol{y}}}}_{\tau} &= {\mathop{\mathrm{\boldsymbol{C}}}}_{\tau} {\mathop{\mathrm{\boldsymbol{w}}}}_{\tau} + {\mathop{\mathrm{\boldsymbol{s}}}}_{\tau} &\text{with} & ~~~\quad {\mathop{\mathrm{\boldsymbol{s}}}}_{\tau} \sim \mathcal{N}_c({\mathop{\mathrm{\boldsymbol{s}}}}_{\tau}|{\boldsymbol{0}_{M \times 1}}, \boldsymbol{\Psi}_{\tau}^S), \tag{6} \end{align}\] with the scalar state transition coefficient \(A\), which is typically chosen close to one (cf. Sec. 4).
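Since all covariance matrices in Eqs. 5 and 6 are diagonal, the model decouples across frequency bins. The following short sketch (parameter values and names are illustrative assumptions of ours) simulates a single bin of the state-space model:

```python
import numpy as np

# Per-bin simulation of the random-walk state-space model, Eqs. (5)-(6);
# all parameter values are illustrative, not taken from the paper.
rng = np.random.default_rng(1)
T_blocks, A = 200, 0.9999        # number of blocks, state transition coefficient
psi_D, psi_S = 1e-4, 1e-2        # process / observation noise power of one bin

def c_randn():                   # circularly-symmetric complex Gaussian sample
    return (rng.standard_normal() + 1j * rng.standard_normal()) / np.sqrt(2)

w = np.zeros(T_blocks, complex)  # filter coefficient trajectory of one bin
y = np.zeros(T_blocks, complex)  # corresponding noisy observations
x = np.array([c_randn() for _ in range(T_blocks)])   # DFT-domain input
for t in range(1, T_blocks):
    w[t] = A * w[t - 1] + np.sqrt(psi_D) * c_randn() # Eq. (5)
    y[t] = x[t] * w[t] + np.sqrt(psi_S) * c_randn()  # Eq. (6)
```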

2.2 Online Inference

In [14] it is proposed to infer the state-space model parameters in \(\tilde{\Theta}_\tau = \{ \boldsymbol{\Psi}_\tau^\Delta, \boldsymbol{\Psi}_\tau^S \}\) by optimizing the ML objective function \[\begin{align} \label{eq:mlMCSSFDAF} \begin{aligned} \tilde{\mathcal{C}}_{\text{ML}}(\tilde{\Theta}_\tau) = \log p({\mathop{\mathrm{\boldsymbol{y}}}}_{\tau} | {\boldsymbol{Y}}_{1:\tau-1}, \tilde{\Theta}_\tau) \end{aligned} \end{align}\tag{7}\] with \({\boldsymbol{Y}}_{1:\tau-1} = \begin{pmatrix}{\boldsymbol{y}}_{1}, & \dots, & {\boldsymbol{y}}_{\tau-1}\end{pmatrix}\) by an EM algorithm which we term SSFDAF. The adaptive filter coefficient vector \({\mathop{\mathrm{\boldsymbol{w}}}}_{\tau}\) is treated as a latent random vector of the log-likelihood 7, which leads via Jensen’s inequality to the lower bound [14] \[\begin{align} \tilde{\mathcal{C}}_{\text{ML}}(&\tilde{{\Theta}}_\tau) = \log \int p({\mathop{\mathrm{\boldsymbol{y}}}}_{\tau}, {\mathop{\mathrm{\boldsymbol{w}}}}_{\tau} | {\boldsymbol{Y}}_{1:\tau-1}, \tilde{\Theta}_\tau)~d {\mathop{\mathrm{\boldsymbol{w}}}}_\tau \label{eq:mlMCSSFDAFNMF1} \\ &\geq \int q({\mathop{\mathrm{\boldsymbol{w}}}}_{\tau}) \log \frac{p({\mathop{\mathrm{\boldsymbol{y}}}}_{\tau}, {\mathop{\mathrm{\boldsymbol{w}}}}_{\tau} | {\boldsymbol{Y}}_{1:\tau-1}, \tilde{\Theta}_{\tau})}{q({\mathop{\mathrm{\boldsymbol{w}}}}_{\tau})} ~d {\mathop{\mathrm{\boldsymbol{w}}}}_\tau = \tilde{\mathcal{Q}}(q,\tilde{\Theta}_\tau) \notag \end{align}\tag{8}\] on the log-likelihood function, with \(q({\mathop{\mathrm{\boldsymbol{w}}}}_{\tau})\) being an arbitrary pdf. The \(l\)-th E-step of the algorithm, i.e., the variational optimization w.r.t. \(q({\mathop{\mathrm{\boldsymbol{w}}}}_{\tau})\), is addressed by the Kalman filter update \[\begin{align} \label{eq:eStep} \begin{aligned} \hat{\mathop{\mathrm{\boldsymbol{w}}}}^{+}_{\tau - 1} &= A~\hat{\mathop{\mathrm{\boldsymbol{w}}}}_{\tau - 1,(L)} \\ {\mathop{\mathrm{\boldsymbol{P}}}}^{+}_{\tau - 1,(l)} &= A^2 ~ {\mathop{\mathrm{\boldsymbol{P}}}}_{\tau - 1,(L)} + \boldsymbol{\Psi}^{\Delta}_{\tau,(l)}\\ \boldsymbol{\Lambda}_{\tau,(l)} &= {\mathop{\mathrm{\boldsymbol{P}}}}^{+}_{\tau - 1,(l)} \left( {\mathop{\mathrm{\boldsymbol{X}}}}_\tau {\mathop{\mathrm{\boldsymbol{P}}}}^{+}_{\tau-1,(l)} {\mathop{\mathrm{\boldsymbol{X}}}}_\tau^{\mathop{\mathrm{\text{H}}}} + \frac{M}{R} \boldsymbol{\Psi}^{S}_{\tau,(l)} \right)^{-1}\\ {\mathop{\mathrm{\boldsymbol{e}}}}_{\tau}^{+} &= {\mathop{\mathrm{\boldsymbol{y}}}}_\tau - {\mathop{\mathrm{\boldsymbol{C}}}}_\tau \hat{\mathop{\mathrm{\boldsymbol{w}}}}^{+}_{\tau - 1} \\ \hat{\mathop{\mathrm{\boldsymbol{w}}}}_{\tau,(l)} & = \hat{\mathop{\mathrm{\boldsymbol{w}}}}^{+}_{\tau - 1} + \boldsymbol{\Lambda}_{\tau,(l)} {\mathop{\mathrm{\boldsymbol{X}}}}_\tau^{\mathop{\mathrm{\text{H}}}} {\mathop{\mathrm{\boldsymbol{e}}}}_{\tau}^{+} \\ {\mathop{\mathrm{\boldsymbol{P}}}}_{\tau,(l)} & = \left[{\mathop{\mathrm{\boldsymbol{I}}}}_M - \frac{R}{M} \boldsymbol{\Lambda}_{\tau,(l)} {\mathop{\mathrm{\boldsymbol{X}}}}_\tau^{\mathop{\mathrm{\text{H}}}} {\mathop{\mathrm{\boldsymbol{X}}}}_\tau\right] {\mathop{\mathrm{\boldsymbol{P}}}}^{+}_{\tau-1,(l)} \end{aligned} \end{align}\tag{9}\] with the prior error \({\mathop{\mathrm{\boldsymbol{e}}}}_{\tau}^{+}\), the posterior mean \(\hat{\mathop{\mathrm{\boldsymbol{w}}}}_{\tau,(l)}\) and the corresponding diagonal state uncertainty \({\mathop{\mathrm{\boldsymbol{P}}}}_{\tau,(l)}\) [13].
Note that the frequency-dependent step sizes, contained in the diagonal matrix \(\boldsymbol{\Lambda}_{\tau,(l)}\), depend on the predicted state uncertainty \({\mathop{\mathrm{\boldsymbol{P}}}}^{+}_{\tau - 1,(l)}\), the DFT-domain input signal \({\mathop{\mathrm{\boldsymbol{X}}}}_\tau\) and the estimated noise covariance matrix \(\boldsymbol{\Psi}^{S}_{\tau,(l)}\). By inserting the optimum function \(q^{(l)}\), i.e., the posterior pdf resulting from the Kalman filter Eqs. 9, into 8 we obtain the lower bound [14] \[\begin{align} \tilde{\mathcal{Q}}( q^{(l)}, \tilde{\Theta}_\tau) \stackrel{\text{c}}{=} \tilde{\mathcal{Q}}_{\Delta}(q^{(l)},{\boldsymbol{\Psi}}_\tau^\Delta) + \tilde{\mathcal{Q}}_S(q^{(l)},{\boldsymbol{\Psi}}_\tau^S) \label{eq:addLowBound} \end{align}\tag{10}\] with \[\begin{align} \tilde{\mathcal{Q}}_{\Delta}(&{q^{(l)},\boldsymbol{\Psi}}_\tau^\Delta) \stackrel{\text{c}}{=} - \log {\text{det} \left( \boldsymbol{P}^+_{\tau-1} \right)} \\ & -\text{Tr} \left(\left( \boldsymbol{P}_{\tau-1}^{+}\right)^{-1} \mathbb{E}_{q^{(l)}} \left[ \left( {\mathop{\mathrm{\boldsymbol{w}}}}_\tau - A \hat{\mathop{\mathrm{\boldsymbol{w}}}}_{\tau-1} \right) \left( {\mathop{\mathrm{\boldsymbol{w}}}}_\tau - A \hat{\mathop{\mathrm{\boldsymbol{w}}}}_{\tau-1} \right)^{\text{H}} \right] \right) \notag \\ \tilde{\mathcal{Q}}_{S}(&{q^{(l)}, \boldsymbol{\Psi}_\tau^S}) \stackrel{\text{c}}{=} - \log {\text{det} \left( \boldsymbol{\Psi}_\tau^S \right)} \label{eq:eqlowBoundObsNoise} \\ &-\text{Tr} \left( {\left( \boldsymbol{\Psi}_\tau^S \right)}^{-1} \mathbb{E}_{q^{(l)}} \left[ \left( {\mathop{\mathrm{\boldsymbol{y}}}}_\tau - {\mathop{\mathrm{\boldsymbol{C}}}}_\tau {\mathop{\mathrm{\boldsymbol{w}}}}_\tau\right) \left( {\mathop{\mathrm{\boldsymbol{y}}}}_\tau - {\mathop{\mathrm{\boldsymbol{C}}}}_\tau {\mathop{\mathrm{\boldsymbol{w}}}}_\tau\right)^{\text{H}} \right] \right). \notag \end{align}\tag{11}\] Finally, in the M-step the optimum parameters in \(\tilde{\Theta}_\tau\) are obtained by [14] \[\begin{align} \boldsymbol{\Psi}_{\tau,(l)}^{\Delta} & = (1-A^2)~ \left(\hat{\mathop{\mathrm{\boldsymbol{w}}}}_{\tau,(l)} {\hat{\mathop{\mathrm{\boldsymbol{w}}}}_{\tau,(l)}}^{\text{H}} + {\mathop{\mathrm{\boldsymbol{P}}}}_{\tau,(l)} \right) \tag{12}\\ \boldsymbol{\Psi}_{\tau,(l)}^{S} & = {\mathop{\mathrm{\boldsymbol{e}}}}_{\tau,(l)} {\mathop{\mathrm{\boldsymbol{e}}}}_{\tau,(l)}^{\text{H}} + \frac{R}{M} {\mathop{\mathrm{\boldsymbol{X}}}}_{\tau} {\mathop{\mathrm{\boldsymbol{P}}}}_{\tau,(l)} {\mathop{\mathrm{\boldsymbol{X}}}}_{\tau}^{\mathop{\mathrm{\text{H}}}} \tag{13} \end{align}\] with the posterior error \({\mathop{\mathrm{\boldsymbol{e}}}}_{\tau,(l)} = {\mathop{\mathrm{\boldsymbol{y}}}}_\tau - {\mathop{\mathrm{\boldsymbol{C}}}}_{\tau} \hat{\mathop{\mathrm{\boldsymbol{w}}}}_{\tau,(l)}\). Another common choice for estimating the noise covariance matrix is given by \[\boldsymbol{\Psi}_{\tau,(l)}^{S} = \lambda \boldsymbol{\Psi}_{\tau-1,(L)}^{S} + (1-\lambda) \boldsymbol{\Psi}_{\tau,\text{prior}}^{S} \label{eq:mStepObsNoiseRecAv}\tag{14}\] with the prior rank-one error update \(\boldsymbol{\Psi}_{\tau,\text{prior}}^{S} = {\mathop{\mathrm{\boldsymbol{e}}}}_{\tau}^{+} ({\mathop{\mathrm{\boldsymbol{e}}}}_{\tau}^{+})^{\text{H}}\) [21]. This can be motivated as a heuristic variant of Eq. 13 which omits the excitation signal contribution and uses the predicted mean \(\hat{\mathop{\mathrm{\boldsymbol{w}}}}^{+}_{\tau - 1}\) as initial estimate of the updated mean \(\hat{\mathop{\mathrm{\boldsymbol{w}}}}_{\tau,(l)}\).
As \(\boldsymbol{\Psi}_{\tau,(l)}^{\Delta}\) and \(\boldsymbol{\Psi}_{\tau,(l)}^{S}\) are assumed to be diagonal, the off-diagonal terms of the outer products in Eqs. 12, 13 and 14 are neglected.
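Under this diagonal assumption, one EM block update of Eqs. 9, 12 and 13 reduces to elementwise vector operations. The following NumPy sketch of ours illustrates such a baseline SSFDAF update; function and variable names are our own choices, not the reference implementation of [14]:

```python
import numpy as np

def ssfdaf_em_block(y, X, w, P, psi_D, psi_S, M, R, A=0.9999, L=2):
    """One block update of the baseline SSFDAF-EM-L, Eqs. (9), (12), (13).
    All arguments are length-M vectors holding the diagonals of the
    corresponding matrices, so the Kalman update runs per frequency bin."""
    def constrain(u):                       # C_tau-type projection F Q1 Q1^T F^-1
        t = np.fft.ifft(u)
        t[:M - R] = 0.0                     # keep only the last R TD samples
        return np.fft.fft(t)

    w_prior = A * w                         # state prediction
    e_prior = y - constrain(X * w_prior)    # prior error e_tau^+
    P_prev = P                              # P_{tau-1,(L)} stays fixed over l
    w_new = w_prior
    for _ in range(L):
        # E-step: per-bin Kalman filter update, Eq. (9)
        P_prior = A**2 * P_prev + psi_D
        Lam = P_prior / (np.abs(X)**2 * P_prior + (M / R) * psi_S)
        w_new = w_prior + Lam * np.conj(X) * e_prior
        P = (1.0 - (R / M) * Lam * np.abs(X)**2) * P_prior
        # M-step: diagonal noise power updates, Eqs. (12) and (13)
        e_post = y - constrain(X * w_new)   # posterior error e_tau,(l)
        psi_D = (1.0 - A**2) * (np.abs(w_new)**2 + P)
        psi_S = np.abs(e_post)**2 + (R / M) * np.abs(X)**2 * P
    return w_new, P, psi_D, psi_S
```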

3 Proposed SSFDAF-NMF Algorithm

In this section we describe the proposed VSS control, which exploits a nonnegative dictionary noise model.

3.1 Probabilistic Nonnegative Dictionary Noise Model

Here, in contrast to [14] and all other state-of-the-art VSS control strategies, we model the observation noise covariance matrix by a nonnegative dictionary model \[\boldsymbol{\Psi}_\tau^S = \text{diag}\left({\mathop{\mathrm{\boldsymbol{T}}}} {\mathop{\mathrm{\boldsymbol{v}}}}_{\tau} \right). \label{eq:nmfObsCovMod}\tag{15}\] Hence, the observation model 6 is parametrized by the nonnegative dictionary matrix \({\mathop{\mathrm{\boldsymbol{T}}}} \in \mathbb{R}_{\geq 0}^{M \times K}\), which contains \(K\) noise atoms as its columns, and by the respective activation vector \({\mathop{\mathrm{\boldsymbol{v}}}}_{\tau} \in \mathbb{R}_{\geq 0}^{K}\). We propose to treat only the activation vector \({\mathop{\mathrm{\boldsymbol{v}}}}_{\tau}\) of the observation noise model and the process noise covariance matrix \({\boldsymbol{\Psi}}_{\tau}^\Delta\) as state-space model parameters which need to be inferred continuously online from the noisy observations. The dictionary matrix \({\mathop{\mathrm{\boldsymbol{T}}}}\) is estimated from a TD training data vector \(\underline{\mathop{\mathrm{\boldsymbol{s}}}}_{\text{tr}}\) which can be obtained either offline, if prior knowledge about the expected noise type is available, or online otherwise. In the latter case we append the observed signal \(\underline{\boldsymbol{y}}_\tau\) to \(\underline{\mathop{\mathrm{\boldsymbol{s}}}}_{\text{tr}}\), i.e., \(\underline{\mathop{\mathrm{\boldsymbol{s}}}}_{\text{tr}}^{\text{T}} \gets \begin{pmatrix} \underline{\mathop{\mathrm{\boldsymbol{s}}}}_{\text{tr}}^{\text{T}}, \underline{\boldsymbol{y}}_\tau^{\text{T}} \end{pmatrix}\), whenever the input signal is not active, i.e., \(\underline{\boldsymbol{x}}_\tau\approx \boldsymbol{0}_{M \times 1}\), and thus \(\underline{\boldsymbol{y}}_\tau \approx \underline{\mathop{\mathrm{\boldsymbol{s}}}}_\tau\) (cf. Eq. 1). The TD training data vector \(\underline{\mathop{\mathrm{\boldsymbol{s}}}}_{\text{tr}}\) is straightforwardly transformed to the corresponding short-time Fourier transform (STFT) training data matrix \(\boldsymbol{S}_{\text{tr}} \in \mathbb{C}^{M \times N}\) by decomposing \(\underline{\mathop{\mathrm{\boldsymbol{s}}}}_{\text{tr}}\) into \(N\) blocks \(\underline{\mathop{\mathrm{\boldsymbol{s}}}}_{\text{tr},\tau} \in \mathbb{R}^M\) of length \(M\) with block shift \(R_{\text{tr}}\), followed by windowing and DFT transformation. Note that, in comparison to \(\underline{\mathop{\mathrm{\boldsymbol{s}}}}_\tau\) (cf. Eq. 3), the time-domain training data blocks \(\underline{\mathop{\mathrm{\boldsymbol{s}}}}_{\text{tr},\tau}\) are not zero-padded, which increases the spectral resolution of the corresponding non-stationary noise PSD samples \(|{\mathop{\mathrm{\boldsymbol{s}}}}_{\text{tr},\tau}|^{\circ 2}\) with \(\tau=1,\dots,N\). We propose to estimate the dictionary matrix \(\mathop{\mathrm{\boldsymbol{T}}}\) by maximizing the log-likelihood \(\log p(\boldsymbol{S}_{\text{tr}}) = \sum_{\tau=1}^{N} \log p({\mathop{\mathrm{\boldsymbol{s}}}}_{\text{tr},\tau})\), assuming independent training blocks, which is equivalent to maximizing the IS-NMF objective function [17] \[\mathcal{C}_{\text{IS}}({\mathop{\mathrm{\boldsymbol{T}}}}, \boldsymbol{V}_{\text{tr}}) \stackrel{\text{c}}{=} -\sum_{m,\tau=1}^{M,N} \left(\log \sum_{k=1}^{K} t_{m k} v_{\text{tr},k \tau} + \frac{ |{s}_{\text{tr},m\tau}|^2 }{\sum_{k=1}^{K} t_{m k} v_{\text{tr},k \tau}} \right) \label{eq:isNMFcostFunc}\tag{16}\] with \(s_{\text{tr},m \tau} = \left[ \boldsymbol{S}_{\text{tr}} \right]_{m \tau}\), \(t_{mk} = \left[ \mathop{\mathrm{\boldsymbol{T}}}\right]_{mk}\) and \(v_{\text{tr},k \tau} = \left[ \boldsymbol{V}_{\text{tr}} \right]_{k \tau}\).
This allows us to optimize the dictionary matrix \({\mathop{\mathrm{\boldsymbol{T}}}}\) by the multiplicative IS-NMF update rules [22] \[\begin{align} {\boldsymbol{V}_{\text{tr}}} &\gets {\boldsymbol{V}_{\text{tr}}} \circ \left[ \frac{{\mathop{\mathrm{\boldsymbol{T}}}}^{\text{T}} \left( ({\mathop{\mathrm{\boldsymbol{T}}}} {\boldsymbol{V}_{\text{tr}}})^{\circ -2} \circ |{\boldsymbol{S}}_{\text{tr}}|^{\circ 2} \right)}{{\mathop{\mathrm{\boldsymbol{T}}}}^{\text{T}} ({\mathop{\mathrm{\boldsymbol{T}}}} {\boldsymbol{V}_{\text{tr}}})^{\circ -1} } \right]^{\circ\frac{1}{2}} \tag{17} \\ {\mathop{\mathrm{\boldsymbol{T}}}} &\gets {\mathop{\mathrm{\boldsymbol{T}}}} \circ \left[ \frac{ \left( ({\mathop{\mathrm{\boldsymbol{T}}}} {\boldsymbol{V}_{\text{tr}}})^{\circ -2} \circ |\boldsymbol{S}_{\text{tr}}|^{\circ 2} \right) {\boldsymbol{V}_{\text{tr}}}^{\text{T}}}{ ({\mathop{\mathrm{\boldsymbol{T}}}} {\boldsymbol{V}_{\text{tr}}})^{\circ -1} {\boldsymbol{V}_{\text{tr}}}^{\text{T}} } \right]^{\circ\frac{1}{2}} \tag{18} \end{align}\] which represent an instance of the MM algorithm [23]. Note that the training data activation matrix \(\boldsymbol{V}_{\text{tr}}\) is only needed for estimating the dictionary and that, due to the conjugate symmetry of \({\mathop{\mathrm{\boldsymbol{s}}}}_\tau\) and \({\mathop{\mathrm{\boldsymbol{s}}}}_{\text{tr},\tau}\), it is sufficient to compute the non-redundant part of the dictionary, i.e., its first \(M/2+1\) rows.
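A minimal NumPy sketch of the offline dictionary training via Eqs. 17 and 18 is given below; the initialization, iteration count and all names are illustrative assumptions of ours:

```python
import numpy as np

def train_noise_dictionary(S_pow, K=10, n_iter=200, eps=1e-12, seed=0):
    """IS-NMF dictionary learning via the multiplicative updates (17), (18).
    S_pow: nonnegative training PSD matrix |S_tr|^2 of shape (M, N).
    Returns the dictionary T of shape (M, K); the training activations V_tr
    are discarded, as only T is reused online."""
    rng = np.random.default_rng(seed)
    M, N = S_pow.shape
    T = rng.uniform(0.1, 1.0, (M, K))
    V = rng.uniform(0.1, 1.0, (K, N))
    for _ in range(n_iter):
        Rm = T @ V + eps                                          # model T V_tr
        V *= np.sqrt((T.T @ (S_pow / Rm**2)) / (T.T @ (1.0 / Rm)))  # Eq. (17)
        Rm = T @ V + eps
        T *= np.sqrt(((S_pow / Rm**2) @ V.T) / ((1.0 / Rm) @ V.T))  # Eq. (18)
    return T
```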

3.2 Online Inference

We propose to estimate the parameters in \({\Theta}_\tau = \{ \boldsymbol{\Psi}_\tau^\Delta, {\mathop{\mathrm{\boldsymbol{v}}}}_\tau \}\) by maximizing the log-likelihood function \[\begin{align} \label{eq:mlMCSSFDAFNMF} \begin{aligned} \mathcal{C}_{\text{ML}}({\Theta}_\tau) &= \log p({\mathop{\mathrm{\boldsymbol{y}}}}_{\tau} | {\boldsymbol{Y}}_{1:\tau-1}, {\Theta}_\tau). \end{aligned} \end{align}\tag{19}\] Due to the linear model, we obtain a lower bound depending on the activation vector \(\mathop{\mathrm{\boldsymbol{v}}}_{\tau}\) by inserting the dictionary model 15 into 11 and evaluating the expectation \[\mathcal{Q}_{S}(q^{(l)},{\mathop{\mathrm{\boldsymbol{v}}}}_\tau) \stackrel{\text{c}}{=} - \sum_{m=1}^{M} \left(\log \sum_{k=1}^{K} t_{m k} v_{k \tau} + \frac{\psi_{m \tau,(l)}^e }{\sum_{k=1}^{K} t_{m k} v_{k \tau}} \right) \label{eq:boundForNoiseSimple}\tag{20}\] with \(\psi_{m \tau,(l)}^e = \left[ {\mathop{\mathrm{\boldsymbol{e}}}}_{\tau,(l)} {\mathop{\mathrm{\boldsymbol{e}}}}_{\tau,(l)}^{\text{H}} + {\mathop{\mathrm{\boldsymbol{C}}}}_\tau \boldsymbol{P}_{\tau,(l)} {\mathop{\mathrm{\boldsymbol{C}}}}_\tau^{\text{H}} \right]_{mm}\) representing the expected posterior error power. By comparing Eq. 20 to Eq. 16, we observe that this lower bound is an IS-NMF objective function with the expected posterior error power \(\psi_{m \tau,(l)}^e\) as target variable. In contrast to \(\tilde{\mathcal{Q}}_{S}(q^{(l)},{\boldsymbol{\Psi}}_\tau^S)\) (cf. Eq. 11), the lower bound \(\mathcal{Q}_{S}(q^{(l)},{\mathop{\mathrm{\boldsymbol{v}}}}_\tau)\) does not admit a closed-form maximizer like Eq. 13. Thus, we optimize it iteratively by applying \(P\) times the multiplicative update 17, with \({\boldsymbol{V}_{\text{tr}}} = {\mathop{\mathrm{\boldsymbol{v}}}}_\tau\) and \(|{s}_{\text{tr},m\tau}|^2=\psi_{m \tau,(l)}^e\), which yields a series of non-decreasing values of the objective function [22]. Hence, the proposed SSFDAF-NMF represents a generalized EM algorithm [20].
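The resulting activation M-step can be sketched as follows (our own illustrative code; the dictionary \(\boldsymbol{T}\) is kept fixed and the target is the expected posterior error power \(\psi^e_{m\tau,(l)}\)):

```python
import numpy as np

def update_activations(psi_e, T, v, P=3, eps=1e-12):
    """M-step of the proposed SSFDAF-NMF: P multiplicative IS-NMF updates
    of the activation vector (Eq. (17) with fixed dictionary T and the
    expected posterior error power psi_e as target), warm-started from the
    previous block's activations v. psi_e: (M,), T: (M, K), v: (K,)."""
    for _ in range(P):
        r = T @ v + eps                      # current noise PSD model T v
        v = v * np.sqrt((T.T @ (psi_e / r**2)) / (T.T @ (1.0 / r)))
    return v                                 # then Psi_S = diag(T @ v), Eq. (15)
```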

For updating the process noise covariance matrix \({\boldsymbol{\Psi}}_\tau^\Delta\), we first observe that its optimization is independent of \({\mathop{\mathrm{\boldsymbol{v}}}}_\tau\), which results from the additive structure of the lower bound 10. Hence, we can use Eq. 12.

3.3 Algorithmic Variants

For updating the filter coefficients, two different optimization procedures are investigated: either the E-step 9 is carried out before the M-steps 12 and 13, abbreviated by EM, or the reversed order is used, abbreviated by ME. Note that the EM version, which is termed SSFDAF-EM-\(L\) in the following, requires at least \(L=2\) iterations to use the updated noise covariance matrix \(\boldsymbol{\Psi}_{\tau,(l)}^{S}\) (cf. Eq. 13) in the Kalman step size computation (cf. Eq. 9).

The proposed SSFDAF-NMF update with the EM optimization version including \(L\) iterations is summarized in Fig. 1. Here, for each E-step the posterior mean \(\hat{\mathop{\mathrm{\boldsymbol{w}}}}_{\tau,(l)}\) and the respective state uncertainty matrix \(\boldsymbol{P}_{\tau,(l)}\) of the latent adaptive filter coefficients are updated by the Kalman filter Eqs. 9. Subsequently, in the M-step, the diagonal process noise covariance matrix \(\boldsymbol{\Psi}_{\tau,(l)}^{\Delta}\) and the dictionary activation \({\mathop{\mathrm{\boldsymbol{v}}}}_{\tau,(l)}\) are updated. The dictionary activation is optimized by applying \(P\) times the multiplicative update rule 17 with the target variable \(\psi_{m \tau,(l)}^e\) (cf. Sec. 3.2) and fixed dictionary matrix \(\mathop{\mathrm{\boldsymbol{T}}}\). Note that the expected posterior error power can efficiently be approximated by the right-hand side of Eq. 13. The iterative optimization procedure is initialized with the optimum activation vector \(\boldsymbol{v}_{\tau-1,(L)}\) of the previous time frame. Finally, the observation noise covariance matrix \(\boldsymbol{\Psi}_{\tau,(l)}^S\) is updated by Eq. 15.

Figure 1: Proposed filter update by SSFDAF-NMF-EM-\(L\).
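Combining the Kalman E-step with the NMF-based M-step, a compact sketch of the per-block update of Fig. 1 could look as follows. It assumes a pretrained dictionary \(\boldsymbol{T}\), warm-starts the activations from the previous block and, as our own choice, initializes \(\boldsymbol{\Psi}^\Delta_{\tau,(1)}\) via Eq. 12 from the previous posterior, all under the diagonal-covariance assumption:

```python
import numpy as np

def ssfdaf_nmf_em_block(y, X, w, P, v, T, M, R, A=0.9999, L=2, P_mm=3, eps=1e-12):
    """One block of the proposed SSFDAF-NMF-EM-L: Kalman E-step, Eq. (9),
    followed by the M-step updates of Psi_Delta (Eq. (12)) and of the
    dictionary activations (P_mm multiplicative updates of Eq. (17)).
    A sketch with our own naming, not the authors' reference code."""
    def constrain(u):                        # overlap-save projection F Q1 Q1^T F^-1
        t = np.fft.ifft(u)
        t[:M - R] = 0.0
        return np.fft.fft(t)

    w_prior = A * w
    e_prior = y - constrain(X * w_prior)     # prior error e_tau^+
    P_prev = P                               # P_{tau-1,(L)} stays fixed over l
    psi_D = (1.0 - A**2) * (np.abs(w)**2 + P_prev)   # init, our choice (Eq. (12))
    w_new = w_prior
    for _ in range(L):
        psi_S = T @ v                        # observation noise PSD, Eq. (15)
        # E-step: per-bin Kalman update, Eq. (9)
        P_prior = A**2 * P_prev + psi_D
        Lam = P_prior / (np.abs(X)**2 * P_prior + (M / R) * psi_S + eps)
        w_new = w_prior + Lam * np.conj(X) * e_prior
        P = (1.0 - (R / M) * Lam * np.abs(X)**2) * P_prior
        # M-step: process noise, Eq. (12)
        psi_D = (1.0 - A**2) * (np.abs(w_new)**2 + P)
        # M-step: activations via Eq. (17), with the expected posterior
        # error power approximated as in Eq. (13)
        e_post = y - constrain(X * w_new)
        psi_e = np.abs(e_post)**2 + (R / M) * np.abs(X)**2 * P
        for _ in range(P_mm):
            r = T @ v + eps
            v = v * np.sqrt((T.T @ (psi_e / r**2)) / (T.T @ (1.0 / r)))
    return w_new, P, v
```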

On the other hand, when using the ME version with \(L=1\) iteration and the observation noise covariance estimator 14, one obtains the optimization approach described in [21], which is termed SSFDAF-ME-\(1\) in the sequel. We propose here an alternative NMF-based M-step which uses the instantaneous prior error power, i.e., \(\lambda=0\), as the target variable of the NMF model (cf. Eq. 20); this variant is abbreviated by SSFDAF-NMF-ME-\(1\).

4 Experiments

In this section the proposed SSFDAF-NMF variants are evaluated for an acoustic echo cancellation (AEC) scenario which includes an abrupt echo path change and various types of recorded noise signals. The respective echoes \(\underline{\boldsymbol{y}}_{\text{ec},\tau}\) were simulated by convolving speech source signals, i.e., \(15\) randomly-selected talkers of [24], with measured room impulse responses (RIRs) \(\underline{\boldsymbol{h}}_{\tau}\in \mathbb{R}^D\) of length \(D=48000\) at a sampling frequency of \(16\) kHz. The RIRs were taken from the AIR database [25] and describe a hands-free interaction of a human with a mock-up phone in \(8\) different acoustic environments. The echo path change is modelled by using different RIRs \(\underline{\boldsymbol{h}}_\tau\) for simulating the echoes before and after \(10\) s. The clean echo signals \(\underline{\boldsymbol{y}}_{\text{ec},\tau}\) were distorted by adding recorded noise signals, which were scaled according to the desired signal-to-noise ratio (SNR). As noise signals we considered near-end speech, chosen from \(15\) additional talkers of [24], keyboard clicking and moving cutlery [26].

As performance measures we use the signal-dependent echo return loss enhancement (ERLE) \({\Gamma}_\tau\) and the system mismatch \(\Upsilon_\tau\) [1] \[\begin{align} {\Gamma}_\tau = 10 \log_{10} \frac{\mathbb{E}\left[||\underline{\boldsymbol{y}}_{\text{ec},\tau}||_2^2\right] }{\mathbb{E}\left[||\underline{\boldsymbol{y}}_{\text{ec},\tau}-\hat{\boldsymbol{\underline{y}}}_{\tau}||_2^2\right]}, ~~ \Upsilon_\tau = 10 \log_{10} \frac{||\underline{\boldsymbol{h}}_\tau - \hat{\underline{\boldsymbol{w}}}_\tau||_2^2}{||{\underline{\boldsymbol{h}}}_{\tau}||_2^2} \label{eq:systMisDef} \end{align}\tag{21}\] with \(||\cdot||_2\) denoting the \(\ell_2\)-norm, \(\hat{\boldsymbol{\underline{y}}}_{\tau}\) the estimated echo signal and \(\hat{\underline{\boldsymbol{w}}}_\tau = \boldsymbol{Q}_2^{\text{T}} \boldsymbol{F}_M^{-1} \hat{{\boldsymbol{w}}}_\tau\). Note that we only use the first \(M-R\) taps of the true RIR \(\underline{\boldsymbol{h}}_\tau\) for computing \(\Upsilon_\tau\) to obtain an estimate of the attainable system mismatch. The echo caused by the remaining \(D-(M-R)\) taps is usually dealt with by a residual echo suppression filter, e.g., [7], [13], which we do not include here.

As baseline approaches we use the state-of-the-art noise power estimators of Eq. 13, with EM optimization sequence and \(L=2\) iterations (SSFDAF-EM-2), and of Eq. 14, with \(\lambda=0.5\) and \(L=1\) iteration (SSFDAF-ME-1). The baselines are compared with two optimization variants of the proposed SSFDAF-NMF algorithm with \(K=10\) basis vectors and \(P=3\) MM optimization steps: the first variant uses an EM optimization sequence with \(L=2\) iterations (SSFDAF-NMF-EM-2), the second a single ME iteration and the prior error power approximation (SSFDAF-NMF-ME-1). For all algorithms we chose the state transition coefficient \(A=0.9999\), block length \(M=1536\) and block shift \(R=512\), corresponding to a filter length of \(M-R=1024\) taps.

For training the dictionaries we chose a set of \(25\) s-long signals which are temporally complementary to the evaluation data. The near-end speech training data included neither the input nor the local speech signal used for testing the algorithms, thus allowing for offline speaker-independent training. For computing the training data matrix \(\boldsymbol{S}_{\text{tr}}\) we used a Hamming window and a frame shift of \(R_{\text{tr}}=512\).
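For reference, the performance measures in Eq. 21 can be computed as in the following sketch of ours, where the expectations are replaced by sample averages (here simplified to per-signal averages; all names are illustrative):

```python
import numpy as np

def erle_db(y_ec, y_hat):
    """ERLE, Eq. (21), with expectations replaced by sample averages."""
    return 10.0 * np.log10(np.sum(y_ec**2) / np.sum((y_ec - y_hat)**2))

def system_mismatch_db(h, w_hat):
    """Normalized system mismatch, Eq. (21); the true RIR h is truncated to
    the first len(w_hat) = M - R taps to measure the attainable mismatch."""
    h_t = h[:len(w_hat)]
    return 10.0 * np.log10(np.sum((h_t - w_hat)**2) / np.sum(h_t**2))
```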


Figure 2: Performance evaluation of the proposed SSFDAF-NMF algorithms in comparison to their baseline SSFDAF counterparts for different noise types and SNR levels.


Fig. 2 shows exemplary average ERLE \(\bar{\Gamma}_\tau\) and system mismatch \(\bar{\Upsilon}_\tau\) curves for different noise signal types and SNR levels. Note that due to space constraints not all combinations can be shown. The results have been computed by evaluating \(100\) experiments, with each experiment defined by randomly drawing an RIR sequence, an input signal, an interfering noise signal and a training noise signal. As can be concluded from Fig. 2, both proposed SSFDAF-NMF variants significantly outperform their baseline counterparts in terms of reconvergence rate after echo path changes in all considered scenarios. Thus, the proposed noise dictionary allows the adaptive filter to better distinguish whether the estimated prior error \(\boldsymbol{e}^+_\tau\) results from a noisy observation or from adaptive filter misalignment. Furthermore, the proposed SSFDAF-NMF-EM-2 shows a better reconvergence behaviour than the proposed SSFDAF-NMF-ME-1, at the cost of a slightly slower initial convergence rate. While the benchmark algorithms SSFDAF-EM-2 and SSFDAF-ME-1 require on average \(0.3\) ms and \(0.6\) ms, respectively, to process one signal block \(\boldsymbol{y}_\tau\) of duration \(32\) ms on an Intel Xeon CPU E3-1275 v6 @ 3.80 GHz, their NMF counterparts need \(0.5\) ms and \(1\) ms, respectively. As the runtime remains low, the proposed SSFDAF-NMF thus represents a computationally highly efficient OSASI algorithm.

5 Conclusion

We have introduced a novel computationally efficient adaptation control for block-OSASI by exploiting a nonnegative noise dictionary. The proposed algorithm exhibits significantly faster convergence properties for different types of recorded noise signals in comparison to two high-performance state-of-the-art algorithms.

References

[1]
G. Enzner, H. Buchner, A. Favrot, and F. Kuech, “Acoustic Echo Control,” in Academic Press Library in Signal Processing, vol. 4, pp. 807–877. Elsevier, FL, USA, 2014.
[2]
B. Widrow and M. E. Hoff, “Adaptive Switching Circuits,” in Proc. WESCON Conv. Rec., Los Angeles, USA, Aug. 1960, pp. 96–104.
[3]
E. Ferrara, “Fast implementations of LMS adaptive filters,” IEEE Trans. Acoust., vol. 28, no. 4, pp. 474–475, Aug. 1980.
[4]
D. Mansour and A. Gray, “Unconstrained frequency-domain adaptive filter,” IEEE Trans. Acoust., vol. 30, no. 5, pp. 726–734, Oct. 1982.
[5]
S. Haykin, Adaptive Filter Theory, Prentice Hall, NJ, USA, 2002.
[6]
J. Benesty, D.R. Morgan, and J.H. Cho, “A new class of doubletalk detectors based on cross-correlation,” IEEE Trans. Speech Audio Process., vol. 8, no. 2, pp. 168–172, Mar. 2000.
[7]
E. Hänsler and G. Schmidt, Acoustic Echo and Noise Control: A Practical Approach, Wiley-Interscience, NJ, USA, 2004.
[8]
N. Cahill and R. Lawlor, “An Approach to Doubletalk Detection Based on Non-Negative Matrix Factorization,” in Proc. IEEE Int. Symp. Signal Proc. Inf. Tech., Sarajevo, Bosnia and Herzegovina, Dec. 2008, pp. 497–501.
[9]
J. Benesty, H. Rey, L. R. Vega, and S. Tressens, “A Nonparametric VSS NLMS Algorithm,” IEEE Signal Process. Lett., vol. 13, no. 10, pp. 581–584, 2006.
[10]
H. Huang and J. Lee, “A new variable step-size NLMS algorithm and its performance analysis,” IEEE Trans. Signal Process., vol. 60, no. 4, pp. 2055–2060, 2012.
[11]
C. Huemmer, R. Maas, and W. Kellermann, “The NLMS algorithm with time-variant optimum stepsize derived from a Bayesian network perspective,” IEEE Signal Process. Lett., vol. 22, no. 11, pp. 1874–1878, 2015.
[12]
B. H. Nitsch, “A frequency-selective stepfactor control for an adaptive filter algorithm working in the frequency domain,” Signal Process., vol. 80, no. 9, pp. 1733–1745, 2000.
[13]
G. Enzner and P. Vary, “Frequency-domain adaptive Kalman filter for acoustic echo control in hands-free telephones,” Signal Process., vol. 86, no. 6, pp. 1140–1156, June 2006.
[14]
S. Malik and G. Enzner, “Online maximum-likelihood learning of time-varying dynamical models in block-frequency-domain,” in Proc. Int. Conf. Acoust., Speech, Signal Process., Dallas, USA, Mar. 2010, pp. 3822–3825.
[15]
F. Yang, G. Enzner, and J. Yang, “Frequency-Domain Adaptive Kalman Filter With Fast Recovery of Abrupt Echo-Path Changes,” IEEE Signal Process. Lett., vol. 24, no. 12, pp. 1778–1782, Dec. 2017.
[16]
M. Schmidt, J. Larsen, and F. Hsiao, “Wind Noise Reduction using Non-Negative Sparse Coding,” in IEEE Int. Workshop Mach. Learn. Signal. Process., Thessaloniki, Greece, Aug. 2007, pp. 431–436.
[17]
C. Févotte, N. Bertin, and J. Durrieu, “Nonnegative Matrix Factorization with the Itakura-Saito Divergence: With Application to Music Analysis,” Neural Comput., vol. 21, no. 3, pp. 793–830, Mar. 2009.
[18]
P. Smaragdis, “Convolutive Speech Bases and Their Application to Supervised Speech Separation,” IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 15, no. 1, pp. 1–12, Jan. 2007.
[19]
T. Haubner, A. Schmidt, and W. Kellermann, “Multichannel Nonnegative Matrix Factorization for Ego-Noise Suppression,” in ITG-Symp. Speech Commun., Oldenburg, Germany, Oct. 2018.
[20]
A. P. Dempster, N. M. Laird, and D. B. Rubin, “Maximum Likelihood from Incomplete Data via the EM Algorithm,” J. R. Stat. Soc. Series B Stat. Methodol., vol. 39, no. 1, pp. 1–22, Sept. 1977.
[21]
J. Franzen and T. Fingscheidt, “Improved Measurement Noise Covariance Estimation for N-channel Feedback Cancellation Based on the Frequency Domain Adaptive Kalman Filter,” in Proc. Int. Conf. Acoust., Speech, Signal Process., Brighton, UK, May 2019, pp. 965–969.
[22]
C. Févotte and J. Idier, “Algorithms for nonnegative matrix factorization with the \(\beta\)-divergence,” Neural Comput., vol. 23, no. 9, pp. 2421–2456, 2011.
[23]
D. R. Hunter and K. Lange, “A tutorial on MM algorithms,” Am. Stat., vol. 58, no. 1, pp. 30–37, 2004.
[24]
L. M. Panfili, J. Haywood, D. R. McCloy, P. E. Souza, and R. A. Wright, “The UW/NU corpus, version 2.0,” https://depts.washington.edu/phonlab/projects/uwnu.php, 2017.
[25]
M. Jeub, M. Schäfer, H. Krüger, C. Nelke, C. Beaugeant, and P. Vary, “Do We Need Dereverberation for Hand-Held Telephony?,” in Int. Congr. Acoust., Sydney, Australia, 2010.
[26]
“Freesound database,” https://freesound.org, Accessed: 2020-10-06.