April 02, 2024
In speech separation, both CNN- and Transformer-based models have demonstrated robust separation capabilities, garnering significant attention within the research community. However, CNN-based methods have limited modelling capability for long-sequence audio, leading to suboptimal separation performance. Conversely, Transformer-based methods are limited in practical applications due to their high computational complexity. Notably, within computer vision, Mamba-based methods have been celebrated for their formidable performance and reduced computational requirements. In this paper, we propose a network architecture for speech separation using a state-space model, namely SPMamba. We adopt the TF-GridNet model as the foundational framework and substitute its Transformer component with a bidirectional Mamba module, aiming to capture a broader range of contextual information. Our experimental results highlight the important role Mamba-based models can play in separation performance. SPMamba demonstrates superior performance, with a significant advantage over existing separation models, on a dataset built on Librispeech. Notably, SPMamba achieves a substantial improvement in separation quality, with a 2.42 dB enhancement in SI-SNRi compared to TF-GridNet. The source code for SPMamba is publicly accessible at https://github.com/JusperLee/SPMamba.
Index Terms: Speech separation, state space model, Mamba
Speech separation is pivotal in enhancing the intelligibility and quality of audio in environments with multiple speakers, thereby facilitating clearer communication and better audio analysis. In recent years, deep learning models, particularly those based on Convolutional Neural Networks (CNNs)[1], Recurrent Neural Networks (RNNs)[2]–[4], and Transformer architectures[5]–[7], have significantly advanced the state of the art in various auditory tasks, including speech separation.
However, despite their successes, both CNN-based and Transformer-based models encounter fundamental challenges in the speech separation domain. CNN-based models [1], [8]–[10], for instance, are limited by their local receptive fields, which restricts their ability to capture the full context of audio signals, thus affecting their separation capabilities. On the other hand, while Transformer-based models [5], [6], [11] excel in modelling long-range dependencies, their self-attention mechanisms suffer from quadratic complexity with respect to the sequence length, rendering them computationally expensive for real-time applications.
Recent developments in State Space Models (SSMs) [12], [13] have shown promise in addressing these limitations by establishing long-range dependencies with linear computational complexity, making them particularly suitable for tasks requiring efficient processing of long sequences. Leveraging the foundational principles of classical SSM research, modern SSMs, exemplified by Mamba, have demonstrated their efficacy across various domains, including natural language processing [14], [15] and vision tasks [16]–[18]. In the speech separation field, the potential for SSMs to revolutionize the design of efficient and effective models remains largely untapped [19].
Leveraging the transformative potential of SSMs in capturing long-range dependencies with linear computational complexity, we introduce an innovative architecture for speech separation, SPMamba. This architecture ingeniously integrates the essence of SSMs into the realm of audio processing, specifically targeting the challenges of speech separation. SPMamba is built upon the robust framework of TF-GridNet [11], which is renowned for effectively handling temporal and frequency dimensions in audio signals[11]. By replacing the Transformer components of TF-GridNet with bidirectional Mamba modules, SPMamba is designed to significantly enhance the model’s ability to comprehend and process the vast contextual landscape of audio sequences. This substitution not only addresses the limitations of CNN-based models in dealing with long-sequence audio but also mitigates the computational inefficiencies inherent in RNN-based approaches.
Our comprehensive experiments, conducted on a dataset with noise and reverberation, underscore the remarkable efficacy of SPMamba in the field of speech separation. The results unequivocally demonstrate a marked superiority of SPMamba over conventional separation models, highlighting a significant leap in performance metrics. Specifically, SPMamba achieves an impressive 2.42 dB improvement in SI-SNRi, compared with TF-GridNet. This enhancement is not merely a quantitative victory but a testament to the qualitative leap in separation quality afforded by the integration of SSMs.
The main contribution of this paper is the pioneering exploration of SSMs within the speech separation domain through the introduction of SPMamba. The superior performance of SPMamba, as evidenced by our rigorous experimental validation, sets a new benchmark in the field, offering a compelling alternative to existing models. Beyond the immediate improvements in separation quality and computational efficiency, this work opens new avenues for future research and development of SSM-based audio processing models.
In the realm of speech separation tasks, the challenge lies in disentangling mixed audio signals into their constituent sources. This is particularly relevant in scenarios where multiple speakers are present, and the objective is to isolate the speech of each individual speaker from a single mixed input signal \(\mathbf{x}\in \mathbb{R}^{1 \times T}\). Previous methods [1], [5], [6], [8]–[11] have leveraged CNN, RNN and Transformer to tackle this problem, each offering distinct advantages and drawbacks in terms of computational efficiency and the ability to capture temporal dependencies.
The Mamba method introduces a novel approach by employing Selective SSM that combines the strengths of both CNNs and RNNs, while also addressing their limitations through a selection mechanism that incorporates input-dependent dynamics. This technique enables the model to selectively focus on or ignore parts of the input sequence, a capability that is crucial for effectively separating overlapping speech signals.
Structured SSMs, as described in the Mamba method, operate by mapping each channel of an input \(x\) to an output \(y\) through a higher-dimensional latent state \(h\), as illustrated in the following equations: \[\begin{align} & h_k = \hat{A}h_{k-1} + \hat{B}x_k, \\ & y_k = Ch_k + Dx_k, \end{align}\] where \(\hat{A}\) and \(\hat{B}\) are discretized state matrices tailored for the speech separation task, and \(C\) and \(D\) map the latent state and the input to the output. The discretization process transforms the continuous parameters \((\Delta, A, B)\) into their discrete counterparts \((\hat{A}, \hat{B})\), enabling the model to operate on discrete-time audio signals.
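To make the recurrence concrete, below is a minimal single-channel sketch in NumPy. For readability it uses a simple first-order discretization rather than the zero-order hold used in Mamba, and all names are illustrative.

```python
import numpy as np

def ssm_scan(x, A, B, C, D, delta):
    """Single-channel discrete SSM: h_k = A_hat h_{k-1} + B_hat x_k, y_k = C h_k + D x_k.
    x: (L,) input, A: (N, N), B, C: (N,), D: scalar, delta: discretization step."""
    N = A.shape[0]
    A_hat = np.eye(N) + delta * A      # simple first-order discretization of A
    B_hat = delta * B                  # and of B (Mamba uses zero-order hold instead)
    h = np.zeros(N)
    y = np.zeros_like(x, dtype=float)
    for k, xk in enumerate(x):         # sequential scan over the sequence
        h = A_hat @ h + B_hat * xk
        y[k] = C @ h + D * xk
    return y
```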
Specifically, the Mamba architecture combines elements of the H3 architecture and MLP blocks into a single, homogeneously stacked block. It expands the model dimension by a factor, concentrating most parameters in linear projections. The architecture includes fewer parameters for the inner SSM and employs the SiLU/Swish activation function. It is designed with standard normalization and residual connections, utilizing an optional LayerNorm layer for enhanced performance. The Mamba structure aims to match the efficiency of Transformer models with a streamlined, parameter-efficient design.
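As a structural illustration (not the reference Mamba implementation), the block described above can be sketched as follows, with the selective SSM path abstracted behind an `ssm` argument:

```python
import torch
import torch.nn as nn

class MambaStyleBlock(nn.Module):
    """Illustrative layout of a Mamba-style block: norm -> expanded linear projections
    -> SSM path gated by a SiLU branch -> output projection -> residual."""
    def __init__(self, d_model, expand=2, ssm=None):
        super().__init__()
        d_inner = expand * d_model                       # expand the model dimension
        self.norm = nn.LayerNorm(d_model)
        self.in_proj = nn.Linear(d_model, 2 * d_inner)   # most parameters sit in projections
        self.ssm = ssm if ssm is not None else nn.Identity()
        self.out_proj = nn.Linear(d_inner, d_model)
        self.act = nn.SiLU()

    def forward(self, x):                                # x: (batch, length, d_model)
        residual = x
        u, gate = self.in_proj(self.norm(x)).chunk(2, dim=-1)
        y = self.ssm(self.act(u)) * self.act(gate)       # gated SSM path
        return self.out_proj(y) + residual               # residual connection
```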
In addition, one key innovation of the Mamba is its hardware-aware algorithm that efficiently computes these selective SSMs on modern GPU architectures. By exploiting the memory hierarchy of GPUs, the method ensures that the expanded states are materialized in more efficient levels of the GPU memory hierarchy, such as SRAM, rather than the slower GPU HBM (High Bandwidth Memory). This approach significantly reduces the computational overhead associated with the large effective state size \((DN \times B \times L)\), where \(D\) is the number of channels, \(N\) is the state dimension, \(B\) is the batch size, and \(L\) is the sequence length.
The selection mechanism is realized by making several parameters (\(\Delta, B, C\)) functions of the input, thereby introducing time-varying dynamics into the model. This allows the Mamba method to dynamically adjust its focus on specific parts of the input sequence based on the content, a feature that is particularly beneficial for speech separation tasks where changes occur in different segments of the audio signal.
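A minimal sketch of how such input-dependent parameters can be produced per time step is given below; the projection names and the softplus used to keep \(\Delta\) positive are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelectiveParams(nn.Module):
    """Produce (Delta, B, C) as functions of the input, so the SSM dynamics vary over time."""
    def __init__(self, d_model, n_state):
        super().__init__()
        self.to_delta = nn.Linear(d_model, d_model)  # one step size per channel and time step
        self.to_B = nn.Linear(d_model, n_state)
        self.to_C = nn.Linear(d_model, n_state)

    def forward(self, x):                            # x: (batch, length, d_model)
        delta = F.softplus(self.to_delta(x))         # positive, input-dependent step sizes
        B = self.to_B(x)                             # input-dependent input matrix rows
        C = self.to_C(x)                             # input-dependent output matrix rows
        return delta, B, C
```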
In order to design an efficient speech separation model, we introduce Mamba into the TF-GridNet network structure [11], called SPMamba, as shown in Fig.1. In Section 3.2, we first introduce the bidirectional Mamba layer, the core contribution of this paper; in Section 3.3, we present the details of the SPMamba structure; in Section 3.4, we describe the loss function we use.
While the S6 model exhibits unique features, its causal processing of the input limits it to capturing only historical information. This property makes S6 well suited to causal tasks such as causal speech separation. However, in this paper, our interest is mainly in non-causal speech separation. To overcome this limitation, an intuitive solution is to mimic the processing of a BLSTM by scanning speech frames along both the forward and backward directions, thus enabling the model to combine current and historical features. Fig.2 shows the detailed structure of the BMamba layer.
Specifically, we process the input audio feature \(\mathbf{E}_t\) in both the forward and backward directions, where \(t\) denotes the block index within the overall structure shown in Fig.1. For each direction, we first apply a linear projection of \(\mathbf{E}_t\) onto \(\hat{B}\), \(\hat{C}\), and \(\Delta\), and then compute the forward and backward passes through the SSM. The forward output \(\mathbf{E}^f_t\) and backward output \(\mathbf{E}^b_t\) are then gated and concatenated to obtain the output sequence \(\mathbf{E}^o_t\).
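A sketch of this bidirectional scheme is shown below, assuming `mamba_fwd` and `mamba_bwd` are unidirectional Mamba blocks (whose internal gating provides the gating mentioned above); it illustrates the layout rather than the exact implementation.

```python
import torch
import torch.nn as nn

class BMambaLayer(nn.Module):
    """Bidirectional Mamba layer sketch: scan the sequence forward and (time-reversed)
    backward with two Mamba blocks and concatenate the two outputs."""
    def __init__(self, mamba_fwd, mamba_bwd):
        super().__init__()
        self.fwd = mamba_fwd
        self.bwd = mamba_bwd

    def forward(self, e):                            # e: (batch, length, channels)
        e_f = self.fwd(e)                            # forward scan: past -> present
        e_b = self.bwd(torch.flip(e, dims=[1]))      # backward scan on the reversed sequence
        e_b = torch.flip(e_b, dims=[1])              # restore original time order
        return torch.cat([e_f, e_b], dim=-1)         # E^o = [E^f ; E^b]
```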
The SPMamba model adopts TF-GridNet as its backbone network, which is the state-of-the-art speech separation framework in previous studies. To further improve the efficiency of the model, we replace its BLSTM networks with bidirectional Mamba networks. Next, we elaborate on the structural design and implementation details of the SPMamba model.
Fig.1 illustrates the overall architecture of SPMamba. Consistent with TF-GridNet, it consists of three main components: 1) a time-domain module for learning the feature relationships between different frames; 2) a frequency-domain module for modelling the relationships between different sub-bands; and 3) a time-frequency attention module for capturing long-range global information.
Time-Domain Feature Module. In this module, we treat the input tensor as a series of independent frequency sequences and employ a BMamba layer to capture the complex relationships within each frame. First, we unfold the input tensor using a kernel of size \(K\) and stride \(S\) to enhance the local spectral context. To ensure dimensional consistency, we apply zero padding to the frequency dimension. Next, we apply layer normalization along the channel dimension of the unfolded tensor, followed by a BMamba layer with \(H\) hidden units in each direction to model the intra-frame frequency information. To recover the original dimensions, we use a 1D deconvolution layer with kernel size \(K\), stride \(S\), input channel \(2H\), and output channel \(C\) to process the hidden embeddings of the BMamba layer. Finally, we remove the zero-padding and add the input tensor to the output tensor via a residual connection to facilitate gradient flow and learning. This module effectively captures the intricate relationships within each frame by leveraging the power of the BMamba layer, enabling the network to exploit the rich spectral information present in the input tensor, leading to enhanced feature representations for subsequent processing.
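The unfold, normalize, BMamba, deconvolve, and residual steps can be sketched as follows, assuming the zero padding has already made \((F - K)\) divisible by \(S\) and that the BMamba layer maps its input to \(2H\) channels; shapes and names are illustrative rather than the authors' exact code.

```python
import torch
import torch.nn as nn

class TimeDomainModule(nn.Module):
    """Sketch of the intra-frame module: each time frame is handled as a frequency
    sequence, unfolded with kernel K and stride S, normalized, passed through a
    BMamba layer with H hidden units per direction, projected back with a 1-D
    deconvolution, and added to the input via a residual connection."""
    def __init__(self, C, K, S, H, bmamba):
        super().__init__()
        self.K, self.S = K, S
        self.norm = nn.LayerNorm(C * K)
        self.bmamba = bmamba                                   # assumed to map (.., C*K) -> (.., 2*H)
        self.deconv = nn.ConvTranspose1d(2 * H, C, kernel_size=K, stride=S)

    def forward(self, x):                                      # x: (B, C, T, F), with (F-K) % S == 0
        B, C, T, F = x.shape
        u = x.permute(0, 2, 1, 3).reshape(B * T, C, F)         # one frequency sequence per frame
        u = u.unfold(2, self.K, self.S)                        # (B*T, C, F', K) local spectral context
        u = u.permute(0, 2, 1, 3).reshape(B * T, -1, C * self.K)
        u = self.bmamba(self.norm(u))                          # model intra-frame frequency information
        u = self.deconv(u.transpose(1, 2))                     # back to (B*T, C, F)
        return u.reshape(B, T, C, F).permute(0, 2, 1, 3) + x   # residual connection
```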
Frequency-Domain Feature Module. In this module, the procedure closely resembles that of the time-domain feature module. The key distinction lies in the interpretation of the input tensor, which is treated as \(F\) independent sequences, each of length \(T\). Within this module, a BMamba layer is employed to capture and model the temporal information present within each sub-band.
Time-Frequency Attention Module. This module leverages frame-level embeddings derived from the time-frequency representations within each frame of the output tensor generated by the frequency-domain feature module. It employs whole-sequence self-attention on these frame embeddings to capture long-range global information. Like TF-GridNet, the concatenated attention outputs undergo further processing to obtain the output tensor, which is fed into the next block.
For the Loss function, we employ Permutation Invariant Training (PIT) [20], [21] to calculate the Signal-to-Noise Ratio (SNR) loss [22]. The SNR loss is defined as: \[\text{SNR}(\mathbf{s}, \hat{\mathbf{s}}) = 10 \log_{10} \frac{\|\mathbf{s}\|^2}{\|\mathbf{s} - \hat{\mathbf{s}}\|^2},\] where \(\mathbf{s}\) represents the target signal and \(\hat{\mathbf{s}}\) represents the estimated signal.
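A compact sketch of this objective, assuming two or three sources so that exhaustive permutation search is cheap, is given below; the training loss is the negative of the best-permutation SNR.

```python
import itertools
import torch

def snr(target, estimate, eps=1e-8):
    """SNR(s, s_hat) = 10 log10(||s||^2 / ||s - s_hat||^2), computed per source."""
    num = torch.sum(target ** 2, dim=-1)
    den = torch.sum((target - estimate) ** 2, dim=-1) + eps
    return 10.0 * torch.log10(num / den + eps)

def pit_snr_loss(targets, estimates):
    """Permutation-invariant SNR loss.
    targets, estimates: (batch, n_src, samples). For each utterance the source
    permutation with the highest mean SNR is selected, and its negative is returned."""
    n_src = targets.shape[1]
    best = None
    for perm in itertools.permutations(range(n_src)):
        snr_perm = snr(targets, estimates[:, list(perm)]).mean(dim=1)   # (batch,)
        best = snr_perm if best is None else torch.maximum(best, snr_perm)
    return -best.mean()
```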
Table 1: Separation performance (dB) and efficiency of different models on the constructed dataset.

| Model | SDR (dB) | SDRi (dB) | SI-SNR (dB) | SI-SNRi (dB) | Params (M) | MACs (G/s) |
|---|---|---|---|---|---|---|
| Conv-TasNet [1] | 7.58 | 7.69 | 6.71 | 6.89 | 5.62 | 10.23 |
| DualPathRNN [4] | 5.76 | 5.87 | 4.88 | 5.06 | 2.72 | 85.32 |
| SudoRM-RF [8] | 7.59 | 7.70 | 6.66 | 6.84 | 2.72 | 4.60 |
| A-FRCNN [10] | 9.53 | 9.64 | 8.58 | 8.76 | 6.13 | 81.20 |
| TDANet [9] | 9.93 | 10.14 | 8.95 | 9.21 | 2.33 | 9.13 |
| BSRNN [24] | 12.64 | 12.75 | 12.04 | 12.23 | 25.97 | 98.69 |
| TF-GridNet [11] | 13.59 | 13.70 | 12.62 | 12.81 | 14.43 | 445.56 |
| SPMamba (Ours) | 16.01 | 16.14 | 15.20 | 15.33 | 6.14 | 78.69 |
We have constructed a multi-speaker speech separation dataset with reverberation and noise. For the speaker audio component of the dataset, we selected the publicly available Librispeech dataset [25], with Librispeech-360 containing approximately 360 hours of English data. For the noise component, we chose to use the noise provided by the WHAM! dataset [26] and the sound effects from the DnR data [27]. For background music, we selected the music portion of the cleaned DnR dataset [27].
To create realistic synthetic mixtures, we focused on three main aspects: class overlap between different audio sources in the mixture, relative levels, and multi-channel spatialization. To ensure that the mixtures include multiple complete speech segments and have a sufficient number of onsets and offsets between different classes, we set each mixture’s length to 60 seconds with a sample rate of 16 kHz. We do not allow intra-class overlap, meaning two segments from the same speaker will not overlap, but foreground and background sound effects can overlap. For relative levels, we use Loudness Units relative to Full Scale (LUFS) [28], with music at \(-24\) LUFS, speech at \(-17\) LUFS, and sound effects at \(-21\) LUFS. We uniformly sample an average LUFS value for each class in each mixture from a range of \(\pm 2.0\) around the respective target LUFS. For multi-channel spatialization, we use a simulator [29] to model spatial reverberation through tracing, achieving an effect close to real-world scenarios. Finally, we constructed a 57-hour training set, an 8-hour validation set, and a 3-hour test set to evaluate the performance of different models.
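The level-adjustment step can be sketched as follows, using the `pyloudnorm` package as one possible LUFS meter (an assumption; the paper does not specify the tooling).

```python
import numpy as np
import pyloudnorm as pyln   # one possible LUFS meter; not necessarily the tool used by the authors

TARGET_LUFS = {"music": -24.0, "speech": -17.0, "sfx": -21.0}

def rescale_to_class_lufs(audio, sr, cls, jitter=2.0, rng=None):
    """Rescale one source so its integrated loudness matches a value drawn uniformly
    from +/- `jitter` LUFS around the class target, as described above (sketch only)."""
    rng = rng or np.random.default_rng()
    meter = pyln.Meter(sr)
    measured = meter.integrated_loudness(audio)                   # current loudness in LUFS
    target = rng.uniform(TARGET_LUFS[cls] - jitter, TARGET_LUFS[cls] + jitter)
    return pyln.normalize.loudness(audio, measured, target)       # apply the required gain
```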
For the short-time Fourier transform (STFT), we employ a 512-point Hann window with a hop size of 128 points. We apply a 512-point Fourier transform to extract a 257-dimensional complex spectrum for each frame. We use \(B = 6\) blocks, and \(E\) is set to 4. Each BMamba layer is composed of two Mamba components, with a hidden layer dimension of 128. We utilize RMSNorm to normalize the output of the Mamba components.
During the training process, we randomly select 4-second-long mixed audio segments for training. We employ the Adam optimizer [30] with an initial learning rate of 0.001 and halve the learning rate if the validation loss does not improve within 10 epochs. The maximum value for gradient clipping is set to 5. Training stops when no improved validation model has been found for 20 consecutive epochs. For evaluation, we use SI-SNRi [31] and SDRi [22] as metrics and report the number of parameters and computational complexity of different models.
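These settings map onto standard PyTorch components roughly as follows; this is a sketch, with the model and data pipeline assumed to exist elsewhere.

```python
import torch

def make_optimization(model):
    """Adam with lr 1e-3 and a plateau scheduler that halves the learning rate when the
    validation loss has not improved for 10 epochs (call scheduler.step(val_loss) per epoch)."""
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
        optimizer, mode="min", factor=0.5, patience=10)
    return optimizer, scheduler

def training_step(model, optimizer, batch, loss_fn):
    """One optimization step on a 4-second mixed segment, with gradients clipped at 5."""
    mixture, targets = batch
    loss = loss_fn(targets, model(mixture))            # e.g. the PIT-SNR loss sketched earlier
    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)
    optimizer.step()
    return loss.item()
```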
We compare our proposed method, SPMamba, with several state-of-the-art speech separation models, including Conv-TasNet [1], DualPathRNN [4], SudoRM-RF [8], A-FRCNN [10], TDANet [9], BSRNN [24], and TF-GridNet [11]. Conv-TasNet is a classic time-domain audio separation network. DualPathRNN combines two RNNs with different temporal resolutions to model long-term dependencies. SudoRM-RF is a lightweight time-domain model composed of multiple UNets. A-FRCNN is an asynchronous fully recurrent convolutional neural network. TDANet is a time-domain audio separation network with an encoder-decoder architecture and is the best-performing lightweight separation model. BSRNN achieves state-of-the-art music separation performance using a frequency band-splitting approach. TF-GridNet is a time-frequency domain model that achieves state-of-the-art speech separation performance by alternately modelling in the frequency and time domains. These models represent a variety of architectures and methods in the field of speech separation.
The experimental results in Table 1 demonstrate that our proposed method, SPMamba, outperforms all other compared models regarding SDR(i) and SI-SNR(i) metrics. SPMamba achieves an SDR of 16.01 dB and an SI-SNR of 15.20 dB, surpassing the baseline model, TF-GridNet, by a significant margin of 2.42 dB and 2.58 dB, respectively. It is worth noting that SPMamba achieves these state-of-the-art results with only 6.14 million parameters and a computational complexity of 78.69 G/s, which is considerably lower than that of TF-GridNet (14.43 million parameters and 445.56 G/s). This highlights the efficiency and effectiveness of our proposed architecture in tackling the speech separation task.
In this paper, we introduce SPMamba, a novel speech separation architecture that leverages the power of State Space Models (SSMs) to address the limitations of existing CNN-based and Transformer-based methods. By incorporating a bidirectional Mamba module into the TF-GridNet framework, SPMamba captures a wider range of contextual information while maintaining computational efficiency. Our experimental results demonstrate the superior performance of SPMamba, with a substantial improvement of 2.42 dB in SI-SNRi compared to the baseline TF-GridNet model. Moreover, SPMamba achieves this state-of-the-art performance with significantly fewer parameters and lower computational complexity, highlighting its efficiency and effectiveness in speech separation tasks.