Matching Reverberant Speech Through Learned Acoustic Embeddings and Feedback Delay Networks


Abstract

Reverberation conveys critical acoustic cues about the environment, supporting spatial awareness and immersion. For auditory augmented reality (AAR) systems, generating perceptually plausible reverberation in real time remains a key challenge, especially when explicit acoustic measurements are unavailable. We address this by formulating blind estimation of artificial reverberation parameters as a reverberant signal matching task, leveraging a learned room-acoustic prior. Furthermore, we propose a feedback delay network (FDN) structure that reproduces both frequency-dependent decay times and the direct-to-reverberation ratio of a target space. Experimental evaluation against a leading automatic FDN tuning method demonstrates improvements in estimated room-acoustic parameters and perceptual plausibility of artificial reverberant speech. These results highlight the potential of our approach for efficient, perceptually consistent reverberation rendering in AAR applications.

Audio systems, parameter estimation, reverberation

1 Introduction

Reverberation provides the auditory system with rich information about the size, geometry, and material properties of an environment, allowing listeners to form a sense of space and situational awareness [1], [2]. Accordingly, auditory augmented reality (AAR) systems must ensure that virtual sound sources integrate seamlessly into real acoustic settings to preserve immersion, realism, and telepresence [3]–[6]. Plausible acoustic rendering is a core challenge in AAR, requiring both a method to generate artificial reverberation in real time under computational complexity constraints and knowledge of the acoustic surroundings of the user [7].

Such knowledge about the acoustic environment may be represented by a room impulse response (RIR), from which room-acoustic parameters such as reverberation time \(\mathrm{T}_{60}\) and clarity index \(\mathrm{C_{50}}\) can be derived [8], or through geometric information like the shape of the environment and the positions of sources and receivers. In many practical situations, acquiring such information via dedicated measurements or user input is infeasible, requiring instead a non-intrusive inference from observed reverberant signals [9]. Blind estimation of RIRs from reverberant signals, in the single- and multi-channel case, has long been an active area of research [10], [11]. Recent advances have been driven by leveraging the powerful discriminative and generative modeling capabilities of deep neural networks (DNNs) [12]–[15]. However, many of these methods are computationally demanding and not specifically tailored towards applications that involve real-time processing of audio, such as AAR. Moreover, rendering virtual sound sources directly via convolution with estimated RIRs can be challenging as it is computationally expensive, requires storing or recomputing long filters for dynamic scenes, and lacks flexibility when the environment, source position, or listener position changes. Parametric artificial reverberation methods, by contrast, offer a lightweight alternative that enables real-time rendering while maintaining perceptual plausibility. Among these, feedback delay networks (FDNs) have emerged as a particularly effective and versatile class of structures for interactive applications [16].

FDNs were introduced as a generalization of the parallel comb-filter structure in which the delays are interconnected via a feedback matrix [17], [18]. The continued popularity of FDNs is due to their computational efficiency and flexibility, allowing real-time processing and independent control over parameters such as energy decay, diffusion, and overall equalization [19]. By reformulating the FDN in a differentiable form, DNNs can be used to estimate the parameter values needed to synthesize a target reverberation, thereby enabling blind estimation of RIRs. This idea has been explored by Lee et al. [20], where an artificial reverberator parameter estimation network (ARP-net) is used to determine a subset of FDN parameters from reverberant speech in an end-to-end manner. The ARP-net employs an encoder to convert audio spectrograms into a latent vector, followed by ARP-groupwise layers for FDN parameter projection. The model is particularly large, comprising approximately 7.3M parameters. While the method has shown promising results, the chosen FDN design and optimization scheme, specifically the use of a fixed mixing matrix and a shared attenuation filter, limit further improvements and reduce perceptual plausibility [21].

The main contributions of the present study are twofold. First, we formulate the blind estimation of artificial reverberation parameters as a reverberant signal matching task, utilizing a pre-learned room-acoustic prior. Second, we propose a differentiable FDN structure that, unlike prior work, offers greater flexibility in frequency-dependent decay and energy control, as well as improved temporal density. We evaluate the proposed approach against the leading automatic FDN tuning method of [20], comparing performance in terms of room-acoustic parameters, specifically \(\mathrm{T}_{60}\) and \(\mathrm{C_{50}}\), as well as the perceptual plausibility of the resulting artificial reverberant speech, as reflected by the Fréchet Audio Distance (FAD) [22].

The remainder of this paper is structured as follows. Section 2 introduces the proposed method, including the extraction of a room-acoustic embedding from speech and the differentiable FDN structure. Section 3 presents the evaluation setup and the baseline used for performance comparison. Section 4 contains the presentation and discussion of experimental results, and Section 5 concludes this work.

2 Proposed Method

We consider the reverberant signal \(y[t]\) as the convolution of an anechoic source signal \(x[t]\) with a RIR \(h[t]\) of length \(L\), where the result is corrupted by uncorrelated, additive background and sensor noise \(v[t]\): \[y[t] = (x * h)[t] + v[t] =\sum_{\tau=0}^{L-1} x[t-\tau]h[\tau] + v[t],\] where \(t\) denotes the discrete-time index. Furthermore, we represent signals \(y[t]\), \(x[t]\), and \(h[t]\) in the time–frequency domain as Mel-frequency-scaled, log-magnitude spectrograms, denoted by \(\mathbf{Y}[f,k]\), \(\mathbf{X}[f,k]\) and \(\mathbf{H}[f,k] \in \mathbb{R}^{F \times K}\), where \(f\) and \(k\) index frequency and time, respectively, and \(F\) and \(K\) denote the frequency and time dimensions of the RIR spectrogram.
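For concreteness, the following sketch shows one way to realize this signal model and the Mel-scaled, log-magnitude features with PyTorch and torchaudio; the STFT and Mel settings as well as the \(10^{-8}\) floor are illustrative assumptions rather than the configuration used in this work.

```python
import torch
import torchaudio

def reverberant_mel_logspec(x, h, v, sr=48000, n_fft=1024, hop=256, n_mels=64):
    """Sketch of the signal model and feature extraction (illustrative settings)."""
    # y[t] = (x * h)[t] + v[t]: linear convolution via FFT, truncated to len(x).
    y = torchaudio.functional.fftconvolve(x, h, mode="full")[..., : x.shape[-1]]
    y = y + v[..., : x.shape[-1]]  # uncorrelated, additive noise
    # Mel-frequency-scaled, log-magnitude spectrogram Y[f, k].
    mel = torchaudio.transforms.MelSpectrogram(
        sample_rate=sr, n_fft=n_fft, hop_length=hop, n_mels=n_mels, power=2.0)
    return 10.0 * torch.log10(mel(y) + 1e-8)
```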

The proposed method involves a two-stage representation learning framework, which builds upon the approach presented in [23], and a final parameter estimation stage. Figure 1 provides a schematic overview of the complete approach, with the three colors denoting the different stages.

2.1 Room Acoustic Prior

In the first stage, a variational autoencoder (VAE) [24], consisting of encoder \(\mathcal{E}_{H,\phi}:\mathbb{R}^{F\times K}\rightarrow\mathbb{R}^{2D}\) parameterized by \(\mathbf{\phi}\), and decoder \(\mathcal{D}_{H,\theta}:\mathbb{R}^{D}\rightarrow\mathbb{R}^{F\times K}\) parameterized by \(\mathbf{\theta}\), is trained to learn a compact representation of RIRs with \(D\ll FK\). We jointly learn the posterior and likelihood distributions, \(q_\phi(\mathbf{z}_H|\mathbf{H}) \sim \mathcal{N}(\boldsymbol{\mu}_\phi,\boldsymbol{\Sigma}_\phi)\) and \(p_\theta(\mathbf{H}|\mathbf{z}_H)\), respectively, where \(\boldsymbol{\Sigma}_\phi = \operatorname{diag}(\sigma_\phi^{2,(1)}, \dots, \sigma_\phi^{2,(D)})\). We optimize the well-known evidence lower bound objective [24]: \[\label{eq:elbo} \begin{align} \mathcal{L}_H(\mathbf{\phi}, \mathbf{\theta}, \mathbf{H}) = &\;\mathbb{E}_{p(\mathbf{H})}\bigg[ \lambda\mathrm{KL}\left\{q_\phi(\mathbf{z}_H|\mathbf{H})\,||\,p(\mathbf{z}) \right\} \\ & -(1-\lambda)\mathbb{E}_{q_\phi\left(\mathbf{z}_H|\mathbf{H}\right)}\big[\log p_\theta(\mathbf{H}|\mathbf{z}_H)\big]\bigg], \end{align}\tag{1}\] where \(p(\mathbf{z})\sim\mathcal{N}(\mathbf{0}_D,\mathbf{I}_{D\times D})\) is a standard normal prior, \(\mathrm{KL}\left\{p||q\right\}\) denotes the Kullback-Leibler (KL) divergence between probability distributions \(p\) and \(q\), and \(\mathbb{E}[\cdot]\) denotes statistical expectation. The encoder \(\mathcal{E}_{H,\phi}\) and decoder \(\mathcal{D}_{H,\theta}\) are realized through convolutional layers, and the variational bottleneck constrains the posterior mean to \((-1,1)\) and the variance to \((0,2)\) via \(\tanh(\cdot)\) activation.
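A minimal sketch of the weighted objective in (1) is given below; it assumes a standard \((\boldsymbol{\mu}, \log\boldsymbol{\sigma}^2)\) parameterization of the posterior rather than the bounded activations described above, and the weight value is illustrative.

```python
import torch

def rir_vae_loss(H, H_hat, mu, logvar, lam=0.2):
    """Sketch of the weighted ELBO loss in Eq. (1); lam is an illustrative weight."""
    # Closed-form KL{ N(mu, diag(exp(logvar))) || N(0, I) } per sample.
    kl = 0.5 * torch.sum(mu.pow(2) + logvar.exp() - logvar - 1.0, dim=-1)
    # The negative Gaussian log-likelihood reduces to a squared spectrogram error.
    nll = torch.sum((H - H_hat).pow(2), dim=(-2, -1))
    return torch.mean(lam * kl + (1.0 - lam) * nll)
```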

In the following stage, a second encoder \(\mathcal{E}_{Y,\psi}:\mathbb{R}^{F\times K}\rightarrow\mathbb{R}^{D}\) is trained to approximate the RIR posterior from reverberant speech. To this end, we minimize the KL divergence between the variational distribution \(q_\psi(\mathbf{z}_Y|\mathbf{H},\!\mathbf{X})\), which is conditioned on the anechoic source signal \(\mathbf{X}\), and the posterior distribution of the RIR: \[\label{eq:KLD} \mathop{\mathrm{arg\,min}}_{\mathbf{\psi}}\mathrm{KL}\left\{ q_\psi(\mathbf{z}_Y|\mathbf{H,X})||q_\phi(\mathbf{z}_H|\mathbf{H}) \right\}.\tag{2}\] This objective enforces that \(\mathcal{E}_{Y,\psi}\) learns a latent representation of reverberant speech that is invariant to the anechoic source signal, yielding an approximation of the form \(q_\psi(\mathbf{z}_Y|\mathbf{H})\). In the considered scenario, we do not sample from the speech posterior and can assume a fixed identity covariance matrix \(q_\psi(\mathbf{z}_Y|\mathbf{H,X})\sim\mathcal{N}(\boldsymbol{\mu}_\psi,\mathbf{I})\). Hence, the speech encoder is only required to estimate the variational mean \(\boldsymbol{\mu}_\psi\), which reduces (2) to: \[\mathcal{L}_Z = \frac{1}{2}\Big[\operatorname{tr}(\boldsymbol{\Sigma}_\phi^{-1}) + (\boldsymbol{\mu}_\phi - \boldsymbol{\mu}_\psi)^\top\boldsymbol{\Sigma}_\phi^{-1} (\boldsymbol{\mu}_\phi - \boldsymbol{\mu}_\psi) + \log\lvert\boldsymbol{\Sigma}_\phi\rvert - D \Big],\] where the operator \((\cdot)^\top\) indicates transposition. \(\mathcal{E}_{Y,\psi}\) is realized as a convolutional feature extractor followed by a transformer encoder with attention-based sequence aggregation, which maps the variable-length input spectrogram \(\mathbf{Y}\) to a fixed-size embedding in \(\mathbb{R}^D\).
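Since both distributions are Gaussian, the objective takes the closed form above; the following sketch evaluates it directly from the encoder outputs (variable names are ours, not part of the original implementation).

```python
import torch

def latent_matching_loss(mu_psi, mu_phi, var_phi):
    """Sketch of L_Z: KL{ N(mu_psi, I) || N(mu_phi, diag(var_phi)) } in closed form."""
    D = mu_phi.shape[-1]
    diff = mu_phi - mu_psi
    kl = 0.5 * (
        torch.sum(1.0 / var_phi, dim=-1)            # tr(Sigma_phi^{-1})
        + torch.sum(diff.pow(2) / var_phi, dim=-1)  # Mahalanobis term
        + torch.sum(torch.log(var_phi), dim=-1)     # log |Sigma_phi|
        - D
    )
    return kl.mean()
```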

Figure 1: Overview of the proposed multi-stage approach: the loss functions are enumerated in the order in which the individual models with the corresponding colors are trained. Black arrows indicate the signal flow during the forward pass; dashed, colored arrows indicate the backpropagation of gradients.

2.2 Artificial Reverberation

The latent approximation \(\boldsymbol{\mu}_\psi\) serves as input to the parameter estimation model, which predicts a set of FDN parameters used to synthesize reverberation of the anechoic source signal \(x[t]\) (cf. Fig. 1). The objective in this last stage is to align the artificially reverberated signal with the true reverberant signal.

2.2.1 Differentiable FDN

To synthesize the reverberation, we adopt an implementation of the FDN, depicted in Fig. 2 and defined by the following transfer function: \[\begin{align} \label{eq:tr95fdn} H(z) = T(z)\left(\mathbf{c}^\top\big[\mathbf{D_m}(z)^{-1} -\mathbf{A}(z)\big]^{-1}\mathbf{b} + gz^{-m_{\textrm{d}}}\right), \end{align}\tag{3}\] where \(\mathbf{A}(z)\) is the filter feedback matrix, formed by combining an orthogonal \(N \times N\) mixing matrix \(\mathbf{U}\) with channel-wise attenuation filters \(\mathbf{\Gamma}(z) = \text{diag}\{\Gamma_1(z), \dots,\Gamma_{N}(z)\}\), i.e. \(\mathbf{A}(z) = \mathbf{U} \mathbf{\Gamma}(z)\). Here, \(N\) denotes the number of delay lines. Each filter is a graphic equalizer (GEQ) implemented as a cascade of \(J\) second-order sections, comprising shelving filters, peaking filters, and a scalar gain. The vector of delay lengths is \(\mathbf{m} = [m_1, \dots, m_N]\), and the corresponding delay matrix \(\mathbf{D_m}(z)\) is a diagonal matrix with entries \([z^{-m_1}, \dots, z^{-m_N}]\). Vectors \(\mathbf{b}\) and \(\mathbf{c}\) are \(N \times 1\) column vectors representing the input and output gains, respectively. The direct path is modeled by a gain \(g\) and a delay \(m_{\textrm{d}}\). \(T(z)\) denotes the tone-correction GEQ. All parameters, apart from the delay lines, are learnable.

In traditional FDN design, the attenuation filters \(\mathbf{\Gamma}(z)\) are typically scaled in proportion to the delay lengths [25] to model a single frequency-dependent decay rate. In the proposed design, however, we lift this constraint to avoid numerical instabilities during training. The tone-correction filter \(T(z)\) is necessary to model the frequency-dependent initial energy of the reference RIR. The proposed FDN structure is realized using the FLAMO\(^1\) Python library, which provides a comprehensive toolbox for differentiable signal processing in the frequency domain based on the frequency sampling method [26].
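To illustrate the frequency sampling idea, the sketch below evaluates the transfer function (3) on a uniform frequency grid; for brevity it replaces the attenuation GEQs with broadband per-line gains and omits \(T(z)\), whereas FLAMO provides the full differentiable filter implementations.

```python
import math
import torch

def fdn_freq_response(U, gamma, b, c, m, g, m_d, n_bins=4096):
    """Frequency-sampled sketch of Eq. (3) on n_bins points in [0, pi].
    Shapes: U (N, N); gamma, b, c, m (N,); g and m_d are scalars."""
    w = torch.linspace(0.0, math.pi, n_bins).unsqueeze(-1)       # (n_bins, 1)
    z_inv_m = torch.exp(-1j * w * m)                             # z^{-m_i} per bin
    A = (U * gamma).to(torch.complex64)                          # A(z) = U diag(gamma)
    P = torch.diag_embed(1.0 / z_inv_m) - A                      # D_m(z)^{-1} - A(z)
    rhs = b.to(torch.complex64).reshape(-1, 1)                   # broadcast over bins
    x = torch.linalg.solve(P, rhs).squeeze(-1)                   # [.]^{-1} b
    H = torch.einsum("n,bn->b", c.to(torch.complex64), x)        # c^T [.]^{-1} b
    return H + g * torch.exp(-1j * w.squeeze(-1) * m_d)          # add direct path
```

An inverse FFT of the sampled (Hermitian-extended) response then yields a time-domain approximation of the FDN impulse response.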

Figure 2: Proposed differentiable FDN structure. The blocks highlighted in red are estimated by the parameter estimation model.

2.2.2 FDN Parameter Estimation

We estimate the FDN parameters using a set of regression models \(\mathcal{F}\), each consisting of a shallow multi-layer perceptron (MLP) that maps the latent representation \(\boldsymbol{\mu}_\psi\) to a specific FDN parameter. \(\mathcal{F}:\mathbb{R}^D\rightarrow \mathcal{P}\) yields the parameter set: \[\mathcal{P} = \left\{ \begin{array}{ll} p_T \in \mathbb{R}^{J \times 1}, & \mathbf{b} \in \mathbb{R}^{N \times 1}, \\ p_\mathbf{U} \in \mathbb{R}^{N \times N}, & \mathbf{c} \in \mathbb{R}^{N \times 1}, \\ p_{\boldsymbol{\Gamma}} \in \mathbb{R}^{J \times N}, & g \in \mathbb{R} \end{array} \right\},\] which is used to reverberate the dry signal \(x[t]\), obtaining \(\hat{y}[t]\). In this work, we use \(N=8\) and \(J=11\).
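A possible realization of \(\mathcal{F}\) is sketched below as a dictionary of shallow MLP heads; the embedding dimension \(D\) and hidden width are assumptions, while \(N=8\) and \(J=11\) follow the values above.

```python
import math
import torch
import torch.nn as nn

class FDNParamHeads(nn.Module):
    """Sketch of the regression models F: one shallow MLP per FDN parameter."""
    def __init__(self, D=128, N=8, J=11, hidden=256):
        super().__init__()
        self.shapes = {"p_T": (J,), "p_U": (N, N), "p_Gamma": (J, N),
                       "b": (N,), "c": (N,), "g": (1,)}
        self.heads = nn.ModuleDict({
            name: nn.Sequential(nn.Linear(D, hidden), nn.ReLU(),
                                nn.Linear(hidden, math.prod(shape)))
            for name, shape in self.shapes.items()})

    def forward(self, mu_psi):                     # mu_psi: (batch, D)
        return {name: head(mu_psi).reshape(-1, *self.shapes[name])
                for name, head in self.heads.items()}
```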

To guarantee FDN stability, the intermediate parameters \(p_T\), \(p_\mathbf{U}\), and \(p_{\boldsymbol{\Gamma}}\) are transformed into the corresponding FDN components \(T(z)\), \(\mathbf{U}\), and \(\mathbf{\Gamma}(z)\) through dedicated activation functions. \(\mathbf{U}\) is obtained using the orthogonality mapping \(\mathbf{U} = \exp\{ \textrm{Tr}(p_{\mathbf{U}}) - \textrm{Tr}(p_{\mathbf{U}})^\top\}\), where \({\textrm{Tr}}(\cdot)\) extracts the upper triangular matrix and the operator \(\exp\{ \cdot \}\) denotes the matrix exponential [27]. Parameters \(p_T\) and \(p_{\boldsymbol{\Gamma}}\) control the command gains of the one-octave band GEQs and are limited to \([-12, 12]\,\)dB and \((-\infty, 0]\,\)dB, respectively, using a scaled \(\tanh(\cdot)\) activation and a \(\text{sigmoid}(\cdot)\) activation. The chosen GEQ design at the 48-kHz sample rate has nine peaking stages with center frequencies ranging from 62.5 Hz to 16 kHz, two shelving stages with cutoff frequencies of 44 Hz and 22.6 kHz, and a DC gain factor. The lengths of the delay lines are fixed to \(\mathbf{m} = \left[809, 877, 937, 1049, 1151, 1249, 1373, 1499\right]\). The values in \(\mathbf{m}\) are coprime numbers distributed logarithmically, aiming to maximize the echo density [16] and avoid degenerate patterns. Since the silence before the onset is removed during data pre-processing, the direct path is modeled with a short two-sample delay \(m_\textrm{d}\), and the direct-to-reverberant ratio is controlled by \(g\).
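The sketch below illustrates these mappings; the orthogonality mapping follows the expression above, whereas the exact gain scalings are one plausible reading of the stated ranges rather than the reference implementation.

```python
import torch

def to_mixing_matrix(p_U):
    """Orthogonality mapping U = exp{Tr(p_U) - Tr(p_U)^T}: the matrix
    exponential of a skew-symmetric matrix is orthogonal."""
    tri = torch.triu(p_U)
    return torch.matrix_exp(tri - tri.transpose(-2, -1))

def to_command_gains_db(p_T, p_Gamma):
    """Assumed gain activations: tone-correction gains bounded to [-12, 12] dB
    via a scaled tanh; attenuation gains mapped to (-inf, 0] dB by expressing
    a sigmoid output (a linear gain in (0, 1)) in decibels."""
    g_T = 12.0 * torch.tanh(p_T)
    g_Gamma = 20.0 * torch.log10(torch.sigmoid(p_Gamma))
    return g_T, g_Gamma
```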

We train \(\mathcal{F}\) with a multi-resolution STFT loss function [28], [29], defined as the mean squared distance between the Mel-frequency scaled, logarithmic-magnitude spectra of \(y[t]\) and \(\hat{y}[t]\): \[\begin{align} \mathcal{L}_{Y}(y,\hat{y}) \;=&\; \frac{1}{|\mathcal{R}|}\sum_{r\in\mathcal{R}} \frac{1}{F_r K_r}\; \big\|\, \mathbf{Y}^{(r)} - \widehat{\mathbf{Y}}^{(r)}\,\big\|_F^{2}\\ \text{with}\;\; \mathbf{Y}^{(r)}(y) \;\triangleq&\; 10\log_{10}\!\left(\mathbf{M}_r\,\big|\mathrm{STFT}_{r}\{y\}\big|^2 + \varepsilon\right), \end{align}\] where \(\mathbf{M}_r\) denotes a mapping from linear to Mel-frequency scale, \(\mathcal{R}\) is a set of spectrogram configurations, each with parameters \(\left(N_{\mathrm{fft}}^{(r)}, N_{\mathrm{hop}}^{(r)}, N_{\mathrm{mel}}^{(r)}\right)\), \(F_r\) and \(K_r\) denote the frequency and time dimensions at resolution \(r\), \(|\cdot|\) indicates cardinality, \(\lVert\cdot\rVert_F\) denotes the Frobenius norm, and \(\varepsilon>0\) is a constant preventing \(\log(0)\). In addition to the signal matching objective, we enforce a dense feedback matrix \(\mathbf{U}\) with a sparsity penalty, which encourages a fast build-up of temporal reflection density at the beginning of the FDN response and yields a perceptually smooth reverberation tail [21]: \[\mathcal{L}_U=\frac{N\sqrt{N}-\sum_{i,j}\lvert U_{i,j}\rvert}{N\left(\sqrt{N}-1\right)},\] where \(U_{i,j}\) is the entry of matrix \(\mathbf{U}\) in the \(i\)-th row and \(j\)-th column. We train the parameter estimator \(\mathcal{F}\) with a weighted sum of both loss terms \(\mathcal{L}_3=\mathcal{L}_Y+\lambda \mathcal{L}_U\) (cf. Fig. 1).
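As an illustration, a combined loss of this form could be computed as follows; the STFT resolutions and the weight \(\lambda\) are placeholder values, not the ones used in this work.

```python
import torch
import torchaudio

def mel_log_spec(y, sr, n_fft, hop, n_mels, eps=1e-8):
    mel = torchaudio.transforms.MelSpectrogram(
        sample_rate=sr, n_fft=n_fft, hop_length=hop, n_mels=n_mels, power=2.0)
    return 10.0 * torch.log10(mel(y) + eps)

def matching_loss(y, y_hat, U, lam=0.1, sr=48000,
                  configs=((2048, 512, 128), (1024, 256, 64), (512, 128, 32))):
    """Sketch of L_3 = L_Y + lam * L_U with illustrative resolutions and weight."""
    # Multi-resolution Mel log-spectrogram MSE (L_Y).
    L_Y = 0.0
    for n_fft, hop, n_mels in configs:
        Y, Y_hat = (mel_log_spec(s, sr, n_fft, hop, n_mels) for s in (y, y_hat))
        L_Y = L_Y + torch.mean((Y - Y_hat) ** 2)
    L_Y = L_Y / len(configs)
    # Sparsity penalty (L_U) rewarding a dense orthogonal mixing matrix.
    N = U.shape[-1]
    L_U = (N * N**0.5 - U.abs().sum()) / (N * (N**0.5 - 1.0))
    return L_Y + lam * L_U
```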

3 Evaluation Setup

3.1 Dataset

We generate a large corpus of reverberant speech by combining anechoic speech from the EARS dataset [30] with RIRs from a variety of publicly available datasets, including ACE [9], ASH-IR [31], Multi-Room Transition [32], Arni [33], EchoThief [34], IKS [35], MIT [36], OpenAir [37], Multi-Purpose Room Impulse Response [38], TAU spatial RIR [39], and TH Köln [40]. We partition all RIRs and speech signals into three disjoint sets for training, validation, and testing, and generate approximately \(18\) h of 4-s reverberant speech segments, sampled at \(48\,\text{kHz}\). RIRs are energy-normalized with onset times removed, and spectrograms are standardized to zero mean and unit variance.

3.2 Model and Training Details

The VAE used to learn the RIR posterior distribution comprises 393,857 parameters, the speech encoder for the latent approximation comprises 1,475,168 parameters, and the FDN parameter estimator comprises 573,465 parameters. Hence, during inference, we run a model with a total of 2,048,444 learnable parameters. We train all models using the Adam optimizer with decoupled weight decay [41] and a learning rate schedule, applying early stopping with a patience of 16 epochs and selecting the model with the lowest validation loss.
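A minimal training-setup sketch consistent with this description is shown below; the learning rate, weight decay, and scheduler type are assumptions, as only AdamW [41], a learning rate schedule, and early stopping with a patience of 16 epochs are stated above.

```python
import torch

# Illustrative training setup; hyperparameter values are assumptions.
model = torch.nn.Linear(128, 64)  # placeholder for the stage being trained
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=1e-2)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, factor=0.5, patience=8)
```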

3.3 Evaluation Metrics

Figure 3: Qualitative results of the proposed method: (a) RIR spectrogram, (b) VAE reconstruction, (c) decoded approximation from reverberant speech, and (d) synthesized FDN response spectrogram.

We evaluate the proposed method by comparing acoustic parameters between true and artificial RIRs. Specifically, \(\mathrm{T}_{30}\) and \(\mathrm{C_{50}}\) are assessed across seven octave bands with center frequencies \(\{125, 250, 500, 1\mathrm{k}, 2\mathrm{k}, 4\mathrm{k}, 8\mathrm{k}\}\,\)Hz, reporting the mean absolute percentage error for \(\mathrm{T}_{30}\), the mean absolute error for \(\mathrm{C_{50}}\), and the Pearson correlation coefficient (PCC) for both parameters. In addition to the parametric evaluation, we assess the perceptual plausibility of the artificially reverberated speech, as reflected by the FAD [22].
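For reference, the sketch below computes broadband versions of both metrics from an impulse response via Schroeder backward integration; the per-octave-band evaluation used here would additionally require an octave filter bank.

```python
import numpy as np

def t30_c50(h, sr=48000):
    """Broadband sketch of the evaluation metrics T30 and C50."""
    energy = h.astype(np.float64) ** 2
    # Clarity index C50: early (< 50 ms) vs. late energy, in dB.
    n50 = int(0.050 * sr)
    c50 = 10.0 * np.log10(energy[:n50].sum() / energy[n50:].sum())
    # Schroeder backward integration, normalized to 0 dB at the onset.
    edc = np.cumsum(energy[::-1])[::-1]
    edc_db = 10.0 * np.log10(edc / edc[0])
    # T30: decay slope between -5 dB and -35 dB, extrapolated to -60 dB.
    t = np.arange(len(h)) / sr
    mask = (edc_db <= -5.0) & (edc_db >= -35.0)
    slope, _ = np.polyfit(t[mask], edc_db[mask], 1)   # dB per second
    t30 = -60.0 / slope
    return t30, c50
```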

3.4 Baseline

At the time of writing, the approach proposed by Lee et al. [20] represents the leading method in blind estimation of FDN parameters from reverberant speech. We therefore adopt it as our baseline method. In [20], the encoder and projection layer are trained by minimizing a linear multi-resolution spectral loss. The encoder consists of five 2D convolutional layers, two gated recurrent units, and two linear layers. The projection layers consist of two linear layers followed by the parameter-specific activation function.

The FDN structure for which ARP-net estimates parameters has size \(N=6\) and includes learnable \(\mathbf{b}\) and \(\mathbf{c}\), a fixed Householder feedback matrix \(\mathbf{U}\), learnable attenuation filters with a common response \(\Gamma(z)\), a common tone-correction filter \(T(z)\), and a cascade of four Schroeder allpass (SAP) filter sections in each feedback path with learnable gains. The SAPs were deemed necessary to achieve faster echo density build-up without adding more delay lines [20]. As in our proposed method, all delay lines are fixed at initialization. The filters are implemented as eight-stage parametric equalizers.

To synthesize the energy decay of the reference RIR, Lee et al. [20] employ a common absorption filter \(\Gamma(z)\), implemented as an eight-stage parametric equalizer using state-variable filter (SVF) parameters. This filter consists of one low-shelving, six peaking, and one high-shelving filter. ARP-net is trained to estimate the resonance, cutoff frequencies, and gains of each band. Allowing the cutoff frequency to vary for each RIR improves the network’s generalization ability. The tone-correction filter \(T(z)\) is implemented similarly as a series of eight SVF filters, each with a learnable cutoff frequency, resonance, and mixing coefficients.

Table 1: Comparison of artificially reverberated speech generated by the proposed approach and the baseline in terms of FAD.
Proposed method ARP-net [20]
FAD [22] (\(\downarrow\)) \(0.109\) \(0.523\)

4 Results

The empirical error distributions shown in the top two plots of Fig. 4 indicate that the proposed method accurately matches both acoustic parameters, outperforming the baseline approach. For the proposed method, optimal performance is observed in the frequency range between \(500\) Hz and \(2\) kHz, where the concentration of speech energy yields the highest signal-to-noise ratio and, consequently, supports accurate estimation of the acoustic prior. The PCCs between the parameters of the ground truth and the synthesized RIRs, shown in the bottom plot of Fig. 4, highlight the advantage of the proposed method, which is most pronounced for \(\mathrm{T}_{30}\). A general trend observed for both methods is that estimating acoustic parameters is most challenging in the lowest and highest octave bands, where speech does not sufficiently excite the acoustic response.

The perceptual quality of the synthesized reverberant speech is evaluated using the FAD, as reported in Table 1. While a formal listening experiment would be required to directly assess human preference, the results demonstrate that the proposed approach is capable of generating highly plausible artificially reverberated speech signals\(^2\). Finally, we highlight the considerable modularity and control afforded by the multi-stage design of the proposed method, as illustrated in Fig. 3 with a single RIR example. This design enables the assessment of both the level of detail captured in the acoustic prior and the extent to which this prior can be estimated from a given segment of reverberant speech, on which the quality of the synthesized RIR ultimately depends.

Figure 4: Comparison of the proposed approach and the baseline in terms of acoustic parameters of the synthesized RIRs. For each octave band, the top two plots show empirical distributions of parameter errors (lower is better), and the bottom plot shows the PCC between ground-truth and synthesized parameters (higher is better).

5 Conclusion

We have presented a method for the blind estimation of FDN parameters from reverberant speech. The proposed approach extracts a room-acoustic prior and employs a differentiable FDN structure capable of modeling both frequency-dependent decay times and the direct-to-reverberant ratio. Evaluation of the synthesized RIRs, in terms of acoustic parameters, together with the perceptual plausibility of the generated reverberant speech, demonstrated that the proposed method outperformed the baseline.

References

[1] J. Blauert, Spatial Hearing: The Psychophysics of Human Sound Localization, MIT Press, Cambridge, MA, USA, 1997.
[2] A. S. Bregman, Auditory Scene Analysis: The Perceptual Organization of Sound, MIT Press, Cambridge, MA, USA, 1994.
[3] S. Agrawal et al., “Defining immersion: Literature review and implications for research on immersive audiovisual experiences,” J. Audio Eng. Soc., vol. 67, no. 11, pp. 886–897, 2019.
[4] A. Neidhardt et al., “Perceptual matching of room acoustics for auditory augmented reality in small rooms - literature review and theoretical framework,” Trends Hear., vol. 26, Apr. 2022.
[5] T. Potter et al., “On the relative importance of visual and spatial audio rendering on VR immersion,” Front. Signal Process., vol. 2, May 2022.
[6] F. Immohr et al., “Proof-of-concept study to evaluate the impact of spatial audio on social presence and user behavior in multi-modal VR communication,” in Proc. ACM IMX, Jun. 2023, pp. 209–215.
[7] N. Meyer-Kahlen et al., “Testing auditory illusions in augmented reality: Plausibility, transfer-plausibility, and authenticity,” J. Audio Eng. Soc., vol. 72, no. 11, pp. 797–812, 2024.
[8] H. Kuttruff, Room Acoustics, CRC Press, 2016.
[9] J. Eaton et al., “Estimation of room acoustic parameters: The ACE challenge,” IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 24, no. 10, pp. 1681–1693, Oct. 2016.
[10] Y. Lin et al., “Bayesian regularization and nonnegative deconvolution for room impulse response estimation,” IEEE Trans. Signal Process., vol. 54, no. 3, pp. 839–847, Mar. 2006.
[11] K. Crammer et al., “Room impulse response estimation using sparse online prediction and absolute loss,” in Proc. ICASSP, May 2006, vol. 3, pp. 301–304.
[12] C. J. Steinmetz et al., “Filtered noise shaping for time domain room impulse response estimation from reverberant speech,” in Proc. WASPAA, Oct. 2021, pp. 221–225.
[13] S. Lee et al., “Yet another generative model for room impulse response estimation,” in Proc. WASPAA, Oct. 2023.
[14] A. Ratnarajah et al., “Towards improved room impulse response estimation for speech recognition,” in Proc. ICASSP, Jun. 2023.
[15] E. Moliner et al., “BUDDy: Single-channel blind unsupervised dereverberation with diffusion models,” in Proc. IWAENC, Sep. 2024, pp. 120–124.
[16] S. J. Schlecht et al., “Feedback delay networks: Echo density and mixing time,” IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 25, no. 2, pp. 374–383, Feb. 2017.
[17] M. A. Gerzon, “Synthetic stereo reverberation: Part one,” Studio Sound, vol. 13, no. 12, pp. 632–635, Dec. 1971.
[18] J.-M. Jot et al., “Digital delay networks for designing artificial reverberators,” in Proc. AES Conv., Feb. 1991.
[19] V. Välimäki et al., “Fifty years of artificial reverberation,” IEEE Trans. Audio, Speech, Lang. Process., vol. 20, no. 5, pp. 1421–1448, Jul. 2012.
[20] S. Lee et al., “Differentiable artificial reverberation,” IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 30, pp. 2541–2556, Aug. 2022.
[21] G. Dal Santo et al., “RIR2FDN: An improved room impulse response analysis and synthesis,” in Proc. DAFx, Sep. 2024, pp. 230–237.
[22] K. Kilgour et al., “Fréchet Audio Distance: A metric for evaluating music enhancement algorithms,” in Proc. Interspeech, Sep. 2019, pp. 2350–2354.
[23] P. Götz et al., “Blind acoustic parameter estimation through task-agnostic embeddings using latent approximations,” in Proc. IWAENC, Sep. 2024, pp. 289–293.
[24] D. P. Kingma et al., “Auto-encoding variational Bayes,” arXiv preprint arXiv:1312.6114, Dec. 2013.
[25] J.-M. Jot, “An analysis/synthesis approach to real-time artificial reverberation,” in Proc. ICASSP, 1992, pp. 221–224.
[26] G. Dal Santo et al., “FLAMO: An open-source library for frequency-domain differentiable audio processing,” in Proc. ICASSP, Apr. 2025.
[27] M. Lezcano-Casado et al., “Cheap orthogonal constraints in neural networks: A simple parametrization of the orthogonal and unitary group,” in Proc. ICML, May 2019, pp. 3794–3803.
[28] R. Yamamoto et al., “Parallel WaveGAN: A fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram,” in Proc. ICASSP, May 2020, pp. 6199–6203.
[29] C. J. Steinmetz et al., “Auraloss: Audio focused loss functions in PyTorch,” in Proc. DMRN+15, Dec. 2020.
[30] J. Richter et al., “EARS: An anechoic fullband speech dataset benchmarked for speech enhancement and dereverberation,” in Proc. Interspeech, Sep. 2024.
[31] S. Pearce, “ASH-IR dataset,” https://github.com/ShanonPearce/ASH-IR-Dataset, Accessed: 2024-09-26.
[32] P. Götz et al., “A multi-room transition dataset for blind estimation of energy decay,” in Proc. IWAENC, Sep. 2024, pp. 125–129.
[33] K. Prawda et al., “Dataset of impulse responses from variable acoustics room Arni at Aalto Acoustic Labs,” Zenodo, 10.5281/zenodo.6582103, 2022.
[34] C. Warren, “EchoThief impulse response library,” http://www.echothief.com/, Accessed: 2024-05-14.
[35] M. Jeub et al., “A binaural room impulse response database for the evaluation of dereverberation algorithms,” in Proc. DSP, Jul. 2009.
[36] J. Traer et al., “Statistics of natural reverberation enable perceptual separation of sound and space,” Proc. Nat. Acad. Sci., vol. 113, no. 48, pp. E7856–E7865, Nov. 2016.
[37] D. T. Murphy et al., “OpenAir: An interactive auralization web resource and database,” in Proc. AES Conv., May 2010.
[38] L. Friede et al., “Multi-purpose room impulse response dataset measured on a 3D spatial grid,” in Proc. AES Conv., Jun. 2024.
[39] A. Politis et al., “TAU spatial room impulse response database (TAU-SRIR DB),” Zenodo, https://zenodo.org/record/6408611, Apr. 2022.
[40] T. Lübeck et al., “A high-resolution spatial room impulse response database,” in Proc. 47th DAGA, Aug. 2021.
[41] I. Loshchilov et al., “Decoupled weight decay regularization,” in Proc. ICLR, May 2019.

  1. https://github.com/gdalsanto/flamo

  2. Sound examples: https://www.audiolabs-erlangen.de/resources/2026-ICASSP-RMS