September 30, 2024
Extending the RNN Transducer (RNNT) to recognize multi-talker speech is essential for wider automatic speech recognition (ASR) applications. Multi-talker RNNT (MT-RNNT) aims to achieve recognition without relying on costly front-end source separation. MT-RNNT is conventionally implemented using architectures with multiple encoders or decoders, or by serializing all speakers’ transcriptions into a single output stream. The first approach is computationally expensive, particularly due to the need for multiple encoder processing. In contrast, the second approach involves a complex label generation process, requiring accurate timestamps of all words spoken by all speakers in the mixture, obtained from an external ASR system. In this paper, we propose a novel alignment-free training scheme for the MT-RNNT (MT-RNNT-AFT) that adopts the standard RNNT architecture. The target labels are created by appending a prompt token corresponding to each speaker at the beginning of the transcription, reflecting the order of each speaker’s appearance in the mixtures. Thus, MT-RNNT-AFT can be trained without relying on accurate alignments, and it can recognize all speakers’ speech with just one round of encoder processing. Experiments show that MT-RNNT-AFT achieves performance comparable to that of the state-of-the-art alternatives, while greatly simplifying the training process.
speech recognition, end-to-end, neural transducer, multi-talker, alignment-free training
RNN Transducer (RNNT) [1] is promising for streaming automatic speech recognition (ASR) [2], but it struggles to handle multi-speaker overlapped inputs. To address this, a variety of multi-talker RNNT (MT-RNNT) methods have been proposed to transcribe the overlapping speech of multiple speakers [3]–[15].
Several MT-RNNT approaches employ multiple encoder and/or decoder branches with permutation invariant training (PIT) [3]–[5] or heuristic error assignment training (HEAT) [6]–[11]. Although these MT-RNNTs do not use any front-end speech separation, decoding the speech of all speakers is often computationally intensive. This is because the encoder processing, which is the most computationally demanding operation [16]–[18], must be performed individually for each speaker in the mixture. This significantly increases the computational costs in both training and decoding. This issue poses a critical challenge to streaming applications, and thus it is preferable to have a single encoder that can simultaneously recognize multi-talker inputs.
To achieve this, token-level serialized output training (tSOT) for MT-RNNT (MT-RNNT-tSOT) was proposed [12]. In tSOT, multiple speakers’ transcriptions are serialized into a single output stream based on the order of subword-level occurrence timestamps, regardless of the speakers. This enables multi-talker ASR with the standard RNNT architecture. However, the serialization requires accurate timestamps for each token, which must be obtained through forced alignment from an external ASR system. Moreover, performing forced alignment on real recording mixtures is particularly challenging, and low-quality alignments result in the degradation of MT-RNNT-tSOT performance.
In this paper, we propose a novel MT-RNNT training scheme that retains the standard RNNT architecture while significantly simplifying the training process, without requiring any alignment. We refer to this approach as MT-RNNT with alignment-free training (MT-RNNT-AFT). For MT-RNNT-AFT, we introduce a prompt token that specifies the order of speakers’ appearances in the mixture. The target labels for each speaker are then created by simply appending the prompt token at the beginning of each transcription. The losses are computed individually between the target label and the prediction for each speaker, and then summed. MT-RNNT-AFT can decode all speakers’ speech in a first-in-first-out manner, requiring just one round of encoder processing. The decoder can simultaneously recognize all speakers’ speech by batching its processing [16] for all speakers, which is made possible by the use of identical parameters. The computational cost is thus much lower than that of MT-RNNT variants that require distinct encoder outputs for each speaker, as mentioned above.
MT-RNNT-AFT can output each speaker’s hypothesis individually, unlike MT-RNNT-tSOT, which outputs a single serialized transcription in a more complex format. Therefore, MT-RNNT-AFT can utilize various effective approaches developed for standard single-talker ASR. Leveraging this advantage, we introduce self-knowledge distillation (KD) using parallel single/multi-talker ASR data. We can naturally use the parallel data because the mixture is created on-the-fly using multiple single-talker voices. We distill knowledge from the MT-RNNT-AFT outputs, which are generated from single-talker ASR data, to the outputs of MT-RNNT-AFT itself, produced using multi-talker ASR data, similar to [19], [20]. We also employ language model (LM) integration [21], [22] during decoding.
Experiments demonstrate that MT-RNNT-AFT achieves comparable performance to MT-RNNT-tSOT in offline mode, even though MT-RNNT-AFT does not use any rich alignments from external ASR systems. Moreover, KD and LM integration further improve the recognition performance. Our best systems match the recognition performance of state-of-the-art alternatives in both streaming and offline modes, while employing a much simpler training scheme.
Target-speaker ASR (TS-ASR) [20], [23]–[28] has been proposed as an alternative solution to overlapped ASR. TS-ASR recognizes only the target speaker’s speech from a mixture using speech enrolled in advance that captures the target speaker’s characteristics. It naturally avoids output-speaker ambiguity and can limit the decoding to just the target speaker. However, for TS-ASR to recognize all speakers in a mixture, the encoder output must be recomputed for each speaker involved using distinct enrolled speech, and each encoder output must be decoded individually. Notably, encoder processing is significantly more computationally expensive than decoder processing [16]–[18]. Thus, TS-ASR is not the optimal solution for recognizing all speakers’ voices in the mixture simultaneously. Note that this problem also occurs with MT-RNNT using PIT/HEAT, which requires multiple rounds of processing in both the encoder and decoder branches.
RNNT-based speaker-attributed ASR has also been proposed as an extension of MT-RNNT-tSOT [13]. This approach uses an additional speaker encoder/decoder to classify output tokens by speaker. Incorporating specific speaker information further improves the performance of multi-talker ASR [29]. However, it still requires accurate timestamps for the serialization of both target and speaker labels, and the extra encoder/decoder introduces critical delays for streaming ASR. Furthermore, for real-world data, speaker information is anonymized and difficult to access. In this work, we aim to enhance MT-RNNT to recognize multiple speakers while retaining a standard RNNT architecture, without requiring rich alignments, speaker details, or additional encoders.
Multi-talker ASR recognizes speech from a mixture of \(M\) speakers. In this paper, we focus on the two-speaker multi-talker ASR task (\(M = 2\)), as reported in several studies [3], [4], [6]–[8], [12], [13], [26], [29], [30]. Let \(\boldsymbol{X}^{\text{mixture}}\) be the input mixture signal of duration \(T^{\prime}\); it includes two speakers’ voices, denoted as \(\boldsymbol{X}^{\text{mixture}} = \boldsymbol{X}^{\text{spk1}} + \boldsymbol{X}^{\text{spk2}}\). \(Y^{\text{spk1}} \in \{1, \dots, K\}^U\) and \(Y^{\text{spk2}} \in \{1, \dots, K\}^{U'}\) represent the token sequences associated with each speaker’s transcription. \(y^{\text{spk}m}_{u} \in \{1, \dots, K\}\) indicates the \(u\)-th token of the \(m\)-th speaker in \(Y^{\text{spk}m}\). The vocabulary size, \(K\), includes the blank symbol, “\(\phi\)”.
RNNT [1] learns the mapping between sequences of different lengths. A single-talker speech signal, \(\boldsymbol{X}^{\text{spk}1}\), is encoded into \(\boldsymbol{H}^{\text{enc}} = \left[ \boldsymbol{h}^{\text{enc}}_{1}, \dots, \boldsymbol{h}^{\text{enc}}_{T} \right]\) of length \(T\) via a feature extractor and encoder network \(f^{\text{enc}}(\cdot)\). \(Y^{\text{spk}1}\) is transformed into \(\boldsymbol{H}^{\text{pred}} = \left[ \boldsymbol{h}^{\text{pred}}_{1}, \dots, \boldsymbol{h}^{\text{pred}}_{U} \right]\) via the prediction network \(f^{\text{pred}}(\cdot)\). These encoded features are then fed to the joint network \(f^{\text{joint}}(\cdot)\) to obtain the posteriors \(\hat{\boldsymbol{y}}^{\text{spk}1}_{t,u} \in (0,1)^{K}\). The above operations are defined as follows: \[\begin{align} \boldsymbol{h}^{\text{enc}}_{t} &= f^{\text{enc}} (\boldsymbol{x}_{t^{\prime}}^{\text{spk}1}; \theta^{\text{enc}}), \\ \boldsymbol{h}^{\text{pred}}_{u} &= f^{\text{pred}} (y_{u-1}^{\text{spk}1}; \theta^{\text{pred}}), \\ \hat{\boldsymbol{y}}_{t,u}^{\text{spk}1} &= \text{Softmax} \left(f^{\text{joint}} (\boldsymbol{h}^{\text{enc}}_{t}, \boldsymbol{h}^{\text{pred}}_{u}; \theta^{\text{joint}}) \right), \label{eq:rnnt} \end{align}\tag{1}\] where \(\text{Softmax}(\cdot)\) denotes the softmax operation. RNNT outputs a three-dimensional tensor \(\hat{\boldsymbol{Y}}^{\text{spk}1} \in (0,1)^{T \times U \times K}\) during training. The learnable parameters \(\theta^{\text{RNNT}} \triangleq [\theta^{\text{enc}}, \theta^{\text{pred}}, \theta^{\text{joint}}]\) are optimized using the RNNT loss \({\cal L}_{\text{RNNT}}\) [1]. In this study, we retain the original model structure but replace the inputs and outputs with multi-talker variants in the subsequently described MT-RNNT-tSOT and MT-RNNT-AFT.
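To make the tensor shapes in Eq. (1) concrete, the following is a minimal PyTorch sketch of the joint-network computation; the module sizes and the class name `SimpleJoint` are illustrative assumptions, not the configuration used in this paper.

```python
import torch
import torch.nn as nn

T, U, K = 100, 20, 1001                        # frames, label steps, vocab size incl. blank
enc_dim, pred_dim, joint_dim = 512, 640, 512   # illustrative dimensions

h_enc = torch.randn(T, enc_dim)     # encoder output  [T, enc_dim]
h_pred = torch.randn(U, pred_dim)   # prediction-network output [U, pred_dim]

class SimpleJoint(nn.Module):
    """Additive joint network: project both streams, combine, classify over K tokens."""
    def __init__(self):
        super().__init__()
        self.proj_enc = nn.Linear(enc_dim, joint_dim)
        self.proj_pred = nn.Linear(pred_dim, joint_dim)
        self.out = nn.Linear(joint_dim, K)

    def forward(self, h_enc, h_pred):
        # Broadcast to a [T, U, joint_dim] grid, as required by the RNNT loss lattice.
        z = self.proj_enc(h_enc)[:, None, :] + self.proj_pred(h_pred)[None, :, :]
        return self.out(torch.tanh(z))

logits = SimpleJoint()(h_enc, h_pred)   # [T, U, K]
y_hat = torch.softmax(logits, dim=-1)   # posteriors in (0, 1)^{T x U x K}
print(y_hat.shape)                      # torch.Size([100, 20, 1001])
```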
Fig. 1 shows the training procedure of the MT-RNNT-tSOT system [12]. MT-RNNT-tSOT has the same architecture and training procedure as standard RNNT and differs only in the input mixture and its transcriptions. The tSOT approach generates training mixture \(\boldsymbol{X}^{\text{mixture}}\) and labels \(Y^{\text{tSOT}}\) on-the-fly [12] as briefly explained below.
A two-speaker mixture, \(\boldsymbol{X}^{\text{mixture}}\), is generated by adding two clean speech signals while ensuring that the second speaker’s speech starts after the first speaker’s. The serialized transcription \(Y^{\text{tSOT}}\) is created by sorting all tokens from both speakers based on their timestamps, which are contained in the alignments. This process requires accurate timestamps for all tokens, which must be obtained in advance by performing forced alignment on the speech and transcriptions of all speakers using an external ASR system. Note that a speaker change token, <sc>, is inserted whenever there is a speaker switch.
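For illustration, a minimal sketch of this serialization is given below. It assumes each token already carries a start time and speaker index obtained from forced alignment; the data layout and function name are hypothetical, not the authors' implementation.

```python
def serialize_tsot(spk1_tokens, spk2_tokens, sc="<sc>"):
    """Merge (token, start_time, speaker) tuples into one tSOT label stream,
    inserting <sc> whenever the active speaker changes."""
    merged = sorted(spk1_tokens + spk2_tokens, key=lambda x: x[1])
    labels, prev_spk = [], None
    for token, _, spk in merged:
        if prev_spk is not None and spk != prev_spk:
            labels.append(sc)
        labels.append(token)
        prev_spk = spk
    return labels

# Toy example: start times (seconds) would come from the external aligner.
spk1 = [("hello", 0.1, 1), ("world", 0.5, 1)]
spk2 = [("good", 0.3, 2), ("morning", 0.8, 2)]
print(serialize_tsot(spk1, spk2))
# ['hello', '<sc>', 'good', '<sc>', 'world', '<sc>', 'morning']
```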
In the training step, since MT-RNNT-tSOT adopts the same architecture as the standard RNNT explained in Section 3.1, we simply replace the single-talker speech \(\boldsymbol{X}^{\text{spk}1}\) and its transcription \(Y^{\text{spk}1}\) with the multi-talker variants, \(\boldsymbol{X}^{\text{mixture}}\) and \(Y^{\text{tSOT}}\), respectively. The joint network of MT-RNNT-tSOT outputs the posterior probabilities \(\hat{\boldsymbol{Y}}^{\text{tSOT}} \in (0, 1)^{T \times (U+U^{\prime}+\alpha) \times (K+1)}\), where “\(\alpha\)” is the number of occurrences of <sc> and “\(K+1\)” is the vocabulary size including <sc>. All parameters, \(\theta^{\text{MT-RNNT-tSOT}} \triangleq [\theta^{\text{enc}}, \theta^{\text{pred}}, \theta^{\text{joint}}]\), are optimized with \(\mathcal{L}_{\text{RNNT}}\) using \(\hat{\boldsymbol{Y}}^{\text{tSOT}}\) and \(Y^{\text{tSOT}}\).
For decoding, MT-RNNT-tSOT simultaneously transcribes all speakers’ speech in \(\boldsymbol{X}^{\text{mixture}}\) into a single serial hypothesis \(\hat{Y}^{\text{tSOT}}\). Although MT-RNNT-tSOT can perform streaming multi-talker ASR, unlike the attentional encoder-decoder (AED) using the utterance-level SOT framework [30], it requires accurate alignments from an external pre-trained ASR system. Moreover, obtaining alignments for real recorded mixtures is particularly problematic: forced alignment on overlapped speech is especially challenging, and the resulting low-quality alignments degrade MT-RNNT-tSOT performance. Additionally, since the format of the serialized hypothesis \(\hat{Y}^{\text{tSOT}}\) is complex, MT-RNNT-tSOT cannot straightforwardly utilize either LM integration [21], [22] or the knowledge distillation framework [31] developed for standard single-talker ASR.

Figure 2: Training procedure of MT-RNNT-AFT. MT-RNNT-AFT decodes all speakers’ speech in a first-in-first-out manner. Prompt tokens <spk\(m\)>, which correspond to the sequential order of each speaker’s appearance in mixture \(\boldsymbol{X}^{\text{mixture}}\), are appended to the beginning of each transcript \(Y^{\text{spk}m}\).
In this paper, we propose an alignment-free training scheme for MT-RNNT (MT-RNNT-AFT). This scheme completes the training in a single step and eliminates the need for rich alignments to be generated by an external ASR system. Fig. 2 shows the procedures for mixture and label generation in MT-RNNT-AFT training.
MT-RNNT-AFT decodes each speaker’s speech in a first-in-first-out manner. The mixture generation procedure is the same as that used for MT-RNNT-tSOT, see Section 3.2. The offset of the second speaker must be set so that the order of each speaker’s appearance in the mixture is preserved. In this paper, the offset is set to 0.5 seconds, based on the duration of the initial silence within each LibriSpeech segment [32].
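A minimal sketch of this on-the-fly mixing is shown below, assuming 16 kHz waveforms and the fixed 0.5 s offset; the function and variable names are ours, not from the authors' implementation.

```python
import torch

def mix_two_speakers(x_spk1, x_spk2, offset_sec=0.5, sample_rate=16000):
    """Add the second speaker's waveform after a fixed delay so that the
    first-in-first-out speaker order is preserved in the mixture."""
    offset = int(offset_sec * sample_rate)
    length = max(x_spk1.numel(), offset + x_spk2.numel())
    mixture = torch.zeros(length)
    mixture[:x_spk1.numel()] += x_spk1
    mixture[offset:offset + x_spk2.numel()] += x_spk2
    return mixture

# Example with random 2 s and 1.5 s waveforms:
x_mix = mix_two_speakers(torch.randn(32000), torch.randn(24000))
```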
To adhere to the first-in-first-out approach in label generation, we introduce prompt tokens, namely <spk1> for the first speaker and <spk2> for the second speaker. Each prompt token is appended to the beginning of its respective transcript \(Y^{\text{spk}m}\), and the resulting transcript is denoted \(Y^{\prime \;\text{spk}m}\). In the two-speaker case, there are two target labels: \(Y^{\prime \;\text{spk1}} \in \{1, \dots, K+2\}^{U+1}\) and \(Y^{\prime \;\text{spk2}} \in \{1, \dots, K+2\}^{U^{\prime}+1}\). This appearance-order information is far easier to prepare than the accurate alignments for all training samples required by MT-RNNT-tSOT. Thus, the AFT scheme can also be applied to real data consisting of mixtures and their transcriptions.
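A minimal sketch of this label generation is given below; the token-to-ID mapping is omitted and the helper name is ours.

```python
def make_aft_labels(transcripts):
    """Prepend a prompt token <spkm> reflecting each speaker's order of
    appearance in the mixture; no timestamps are needed."""
    return [[f"<spk{m}>"] + tokens for m, tokens in enumerate(transcripts, start=1)]

y_spk1 = ["hello", "world"]
y_spk2 = ["good", "morning"]
print(make_aft_labels([y_spk1, y_spk2]))
# [['<spk1>', 'hello', 'world'], ['<spk2>', 'good', 'morning']]
```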
In the training step, MT-RNNT-AFT is trained individually for each speaker. This is because, in the MT-RNNT-AFT scheme, there are \(M\) target label sequences \(Y^{\prime \;\text{spk}m}\), as described above. Thus, we feed the mixture \(\boldsymbol{X}^{\text{mixture}}\) and the transcript \(Y^{\prime \;\text{spk}m}\) to the encoder and prediction networks, respectively. The joint network of MT-RNNT-AFT computes the predictions \(\hat{\boldsymbol{Y}}^{\prime \;\text{spk}m}\) for each speaker. Consequently, in the two-speaker case (\(M=2\)), there are two predictions: \(\hat{\boldsymbol{Y}}^{\prime \;\text{spk1}} \in (0,1)^{T \times (U+1) \times (K+2)}\) and \(\hat{\boldsymbol{Y}}^{\prime \;\text{spk2}} \in (0,1)^{T \times (U^{\prime}+1) \times (K+2)}\). Note that the vocabulary size is increased to “\(K+2\)” due to the addition of the two prompt tokens. These predictions are used to calculate the combined loss \(\mathcal{L}^{\prime}_{\text{RNNT}} = \sum_{m=1}^{M} \mathcal{L}^{\text{spk}m}_{\text{RNNT}}\), where each loss \(\mathcal{L}^{\text{spk}m}_{\text{RNNT}}\) is computed using \(\hat{\boldsymbol{Y}}^{\prime \;\text{spk}m}\) and \(Y^{\prime \;\text{spk}m}\). All parameters, \(\theta^{\text{MT-RNNT-AFT}} \triangleq [\theta^{\text{enc}}, \theta^{\text{pred}}, \theta^{\text{joint}}]\), are optimized with \(\mathcal{L}_{\text{RNNT}}^{\prime}\).
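A minimal sketch of this combined loss is shown below. The `model` object with `encoder`, `prediction`, and `joint` methods is hypothetical, and `torchaudio.functional.rnnt_loss` is used only for illustration; the paper's ESPnet-based implementation may differ.

```python
import torch
import torchaudio

def mt_rnnt_aft_loss(model, x_mix, x_len, targets, target_lens, blank_id=0):
    """L'_RNNT = sum over speakers of per-speaker RNNT losses, all computed
    from a single shared encoder pass over the mixture.

    targets[m] already starts with the prompt token <spk(m+1)>."""
    h_enc, enc_len = model.encoder(x_mix, x_len)   # one encoder pass only
    loss = 0.0
    for y, y_len in zip(targets, target_lens):     # loop over the M speakers
        h_pred = model.prediction(y)               # prediction network
        logits = model.joint(h_enc, h_pred)        # [B, T, U_m + 1, K + 2]
        loss = loss + torchaudio.functional.rnnt_loss(
            logits, y, enc_len, y_len, blank=blank_id)
    return loss
```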
In decoding, MT-RNNT-AFT recognizes all speakers’ voices in a first-in-first-out manner by processing the mixture through the encoder just once. By inputting the corresponding prompt token at the beginning, the decoder, consisting of the prediction and joint networks, outputs each speaker’s hypothesis from the shared encoder output. Beam search can be performed in parallel by batching decoder processing [16] for all speakers. Thus, the processing of the encoder and beam search, including the decoder, is completed in just one pass for all speakers, thanks to the fully shared parameters and the use of the shared encoder output. Therefore, the total computational cost of MT-RNNT-AFT is much lower than that of TS-ASR and MT-RNNT using PIT/HEAT. This is because encoder processing is significantly more computationally expensive than decoder processing [16]–[18]. Moreover, TS-ASR and MT-RNNT using PIT/HEAT require multiple invocations of both the encoder and decoder modules.1
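The decoding flow can be summarized by the following greedy sketch; the model interface is hypothetical, and in practice alignment-length synchronous beam search with batched decoder processing for all speakers is used instead of this simple greedy loop.

```python
import torch

@torch.no_grad()
def decode_all_speakers(model, x_mix, prompt_ids, blank_id=0, max_symbols_per_frame=10):
    """Greedy first-in-first-out decoding: the encoder runs once over the
    mixture, and each speaker is then decoded from the shared encoder output,
    seeded with its prompt token <spkm>."""
    h_enc, _ = model.encoder(x_mix.unsqueeze(0), None)   # computed only once
    hypotheses = []
    for prompt in prompt_ids:                            # e.g. ids of <spk1>, <spk2>
        tokens = [prompt]
        for t in range(h_enc.size(1)):                   # frame-synchronous loop
            for _ in range(max_symbols_per_frame):
                h_pred = model.prediction(torch.tensor([tokens]))[:, -1]
                logits = model.joint(h_enc[:, t], h_pred)
                k = int(logits.argmax(dim=-1))
                if k == blank_id:
                    break                                # move to the next frame
                tokens.append(k)
        hypotheses.append(tokens[1:])                    # strip the prompt token
    return hypotheses
```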
In this paper, we also propose a self-knowledge distillation (KD) approach to further enhance MT-RNNT-AFT. Multiple single-talker speech \(\boldsymbol{X}^{\text{spk}m}\) is naturally available for MT-RNNT-AFT training due to the simulated on-the-fly mixture generation process. We exploit the parallel speech data, i.e., \(\boldsymbol{X}^{\text{spk}m}\) and \(\boldsymbol{X}^{\text{mixture}}\), in our KD framework, similar to [19], [20]. The training process consists of three steps. First, we obtain pseudo labels \(\hat{\boldsymbol{Y}}^{\text{spk}m}\) of the \(m\)-th speaker by processing each single-talker ASR data, \(\boldsymbol{X}^{\text{spk}m}\) and \(Y^{\prime \;\text{spk}m}\), with MT-RNNT-AFT before mixing. Then, we obtain predictions \(\hat{\boldsymbol{Y}}^{\prime \;\text{spk}m}\) by processing multi-talker ASR data, \(\boldsymbol{X}^{\text{mixture}}\) and \(Y^{\prime \;\text{spk}m}\), with MT-RNNT-AFT. Finally, we compute each speaker’s KD loss, \(\mathcal{L}^{\text{spk}m}_{\text{KD}}\), using \(\hat{\boldsymbol{Y}}^{\text{spk}m}\) and \(\hat{\boldsymbol{Y}}^{\prime \;\text{spk}m}\), and then sum them into \(\mathcal{L}_{\text{KD}}\). The total KD loss, \(\mathcal{L}_{\text{KD}}\), and the combined loss, \(\mathcal{L}_{\text{RNNT+KD}}\), with \(\mathcal{L}^{\prime}_{\text{RNNT}}\) for MT-RNNT-AFT training are defined as follows: \[\begin{align} \mathcal{L}_{\text{KD}} &= - \sum_{m=1}^{M} \sum_{t=1}^{T} \sum_{u=1}^{U} \sum_{k=1}^{K+M} \hat{y}^{\text{spk}m}_{t,u,k} \;\log \;\hat{y}^{\prime \;\text{spk}m}_{t,u,k}, \\ \mathcal{L}_{\text{RNNT+KD}} &= \mathcal{L}^{\prime}_{\text{RNNT}} + \lambda \mathcal{L}_{\text{KD}}, \label{eq:KD} \end{align}\tag{2}\] where \(\hat{y}^{\text{spk}m}_{t,u,k}\) and \(\hat{y}^{\prime \;\text{spk}m}_{t,u,k}\) correspond to the \(k\)-th class probability of \(\hat{\boldsymbol{Y}}^{\text{spk}m}\) and \(\hat{\boldsymbol{Y}}^{\prime \;\text{spk}m}\) at the \(t\)-th time and \(u\)-th label steps of \(m\)-th speaker, respectively. \(\lambda\) is the weight of \({\cal L}_{\text{KD}}\). We expect that the frame-level pseudo labels from MT-RNNT-AFT, generated using single-talker ASR data, will improve the model’s training stability and guide alignment when processing multi-talker ASR data.
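A minimal sketch of the KD term in Eq. (2) is given below, assuming the teacher posteriors (from the single-talker inputs) and the student posteriors (from the mixture) have already been computed with matching shapes; padding and length handling are omitted.

```python
import torch

def kd_loss(teacher_posteriors, student_posteriors, eps=1e-8):
    """Frame-level cross-entropy between posteriors from the clean single-talker
    inputs (teacher) and from the mixture (student), summed over speakers.
    Each list entry has shape [T, U_m + 1, K + M]."""
    loss = 0.0
    for y_teacher, y_student in zip(teacher_posteriors, student_posteriors):
        loss = loss - (y_teacher * torch.log(y_student + eps)).sum()
    return loss

# Combined objective of Eq. (2), with lambda = 0.001 as used in our experiments:
# total_loss = rnnt_loss_prime + 0.001 * kd_loss(teacher_list, student_list)
```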
We used the LibriSpeech corpus [32] for training, and LibriSpeechMix [30] for the development and evaluation sets. We used simulated mixtures generated on-the-fly, as described in Section 3.2. Volume and speed perturbation [33] and SpecAugment [34] were applied to the speech after on-the-fly mixing during training. The proportions of single-talker and two-speaker ASR data during training were 50% each. For tSOT label creation, the forced alignments were generated using the Montreal Forced Aligner [35]. We adopted 1k subwords determined by SentencePiece [36]. We performed experiments using the ESPnet toolkit [37]. We measured model performance using the concatenated minimum-permutation word error rate (cpWER) [38] for both single-talker (1spk) and two-speaker (2spk) ASR tasks.
We used an 80-dimensional log Mel-filterbank, extracted every 10 ms, as the input feature of the ASR models. We adopted Conformer (L) [39], in which batch normalization was replaced by layer normalization and the convolution kernel size was reduced from 31 to 15. The encoder contained a two-layer 2D convolutional neural network (CNN) followed by 17 Conformer blocks. The prediction network had a 640-dimensional long short-term memory (LSTM) layer. The joint network consisted of a 512-dimensional feed-forward network.
For the streaming experiments, we constructed a variant of the offline system configuration in which only the offline encoder was replaced by a chunkwise Conformer encoder [40]. Both the current and history chunk sizes of the streaming Conformer encoder were set to 60 frames, so the algorithmic latency was \(640\text{ms} = 600\text{ms} + 40\text{ms}\), where the additional 40 ms comes from the CNN lookahead frames. While the parameters of the offline Conformer model were randomly initialized, the streaming Conformer parameters were initialized with those of the trained offline Conformer.
For the MT-RNNT-AFT, we used on-the-fly internal LM estimation (ILME) during decoding [22]. The LM consisted of a four-layer LSTM with 1024 cells, and was trained using a large amount of text data following the LibriSpeech recipe. The ILM was jointly trained with MT-RNNT-AFT as detailed in [41]–[43].
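As a rough sketch of this decoding rule (the interpolation weights are tuned on development data and not restated here; this is a schematic form of [22], not the exact expression used in our implementation), each hypothesis score combines the RNNT, external LM, and estimated internal LM scores as \[\log P(Y \mid \boldsymbol{X}) \approx \log P_{\text{RNNT}}(Y \mid \boldsymbol{X}) + \lambda_{\text{LM}} \log P_{\text{LM}}(Y) - \lambda_{\text{ILM}} \log P_{\text{ILM}}(Y),\] where \(P_{\text{ILM}}\) is the internal LM estimated from the prediction and joint networks with the encoder contribution removed.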
For the training process, we utilized the AdamW optimizer along with a warmup learning rate scheduler; the peak learning rate of 1.5e-3 was reached after 25k warmup steps, and all models were trained for 200 epochs. For the MT-RNNT-AFT model, we set \(\lambda\) to \(0.001\) when applying the KD loss \(\mathcal{L}_{\text{KD}}\) described in Section 4.2, with its application starting at the 180th epoch. The minibatch size was set to 256 in all experiments. For decoding, we utilized alignment-length synchronous decoding [16] with a beam size of 16.
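A minimal sketch of this optimizer and warmup schedule is shown below; the placeholder model and the decay rule after warmup are assumptions, as ESPnet's own scheduler is used in practice.

```python
import torch
import torch.nn as nn

model = nn.Linear(80, 1003)   # placeholder for the MT-RNNT-AFT parameters
optimizer = torch.optim.AdamW(model.parameters(), lr=1.5e-3)

warmup_steps = 25_000
def lr_lambda(step):
    # Linear warmup to the peak LR over 25k steps, then inverse-sqrt decay
    # (the decay rule after warmup is an assumption, not stated in the paper).
    step = max(step, 1)
    return min(step / warmup_steps, (warmup_steps / step) ** 0.5)

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
```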
Table [tab:comp] (a) shows the offline results. The check marks in Table [tab:comp] (a) indicate the additional information utilized for training and/or decoding. The results from the literature are shown above the dashed line, while the values below the dashed line are for our reproduced MT-RNNT-tSOT and our proposed MT-RNNT-AFT.
First, the single-talker RNNT model struggled to recognize speech in a mixture. Our reproduced MT-RNNT-tSOT achieved better cpWERs than the original MT-RNNT-tSOT [12]; it thus establishes a state-of-the-art baseline that uses the standard RNNT architecture without any speaker information or additional encoder. MT-RNNT-AFT achieved performance comparable to that of MT-RNNT-tSOT, despite not using any additional information. By applying the proposed KD loss during training, the performance of MT-RNNT-AFT was further enhanced, allowing it to outperform MT-RNNT-tSOT. Therefore, MT-RNNT-AFT achieved the best performance while retaining the standard RNNT architecture.
Additionally, while LM integration is challenging for MT-RNNT-tSOT as its complex hypotheses contain mixed words from all speakers, MT-RNNT-AFT can be naturally integrated with an external LM. This is because each hypothesis individually contains the words spoken by each speaker. We applied ILME to MT-RNNT-AFT trained with KD, and it achieved performance comparable to the state-of-the-art as reported in [29], which requires additional information such as rich alignments, specific speaker information, and an extra encoder.
Next, we performed streaming experiments; the results are shown in Table [tab:comp] (b). We observed that MT-RNNT-AFT operates effectively in streaming mode, but its performance failed to match that of MT-RNNT-tSOT. The deficiencies were caused by deletion errors in the 1spk task and insertion errors in the 2spk task. These errors occurred during longer periods of inactive speech, i.e., silence or speech from other speakers. The reason is that streaming MT-RNNT-AFT lacks a mechanism to carry speaker information across chunks or to access larger look-ahead windows, such as tracking the order of each speaker’s appearance and their presence in the next input chunk. Notably, KD using frame-level pseudo labels, which convey not only posteriors but also speaker activity information, improved the results of streaming MT-RNNT-AFT to the point of being comparable to those of MT-RNNT-tSOT, which utilizes rich alignments.
We also applied ILME to MT-RNNT-AFT trained with KD and found that its performance matched that of state-of-the-art alternatives, as reported in [12]. Despite the severe challenge of performing the task without rich alignments, speaker information, or an additional encoder, MT-RNNT-AFT achieved performance comparable to that of MT-RNNT-tSOT in both offline and streaming modes.
Table [tab:beam]: cpWERs (%) on the 1spk/2spk tasks for various beam sizes.

Model          Mode        Beam 1    Beam 2    Beam 4    Beam 8    Beam 16
MT-RNNT-tSOT   Offline     2.7/5.7   2.7/4.1   2.6/4.1   2.6/4.0   2.6/4.0
MT-RNNT-tSOT   Streaming   4.4/9.2   4.2/6.7   4.2/6.6   4.2/6.5   4.2/6.5
MT-RNNT-AFT    Offline     2.7/4.1   2.6/3.7   2.6/3.7   2.6/3.7   2.6/3.7
MT-RNNT-AFT    Streaming   4.3/7.2   4.2/6.7   4.2/6.7   4.1/6.7   4.1/6.7
Although the above experiments consistently used a beam size of 16, multi-threaded decoder processing may not always be available, even when batching is supported. In that case, MT-RNNT-AFT must reduce its beam size to equalize the computational cost with that of MT-RNNT-tSOT, since it runs the decoder once per speaker. We therefore investigated the effect of the beam size on the cpWERs of each MT-RNNT model; the results are detailed in Table [tab:beam]. When the beam size for MT-RNNT-tSOT was set to 4 and that for MT-RNNT-AFT to 2, MT-RNNT-AFT matched the performance of MT-RNNT-tSOT, and it showed no significant degradation even at the smallest beam size.
We have proposed MT-RNNT-AFT, an MT-RNNT trained with an alignment-free training scheme that requires no rich alignments while retaining the standard RNNT architecture. We introduced a prompt token that informs MT-RNNT-AFT which speaker to recognize in the mixture. This simplifies both the decoding process and the training procedure, while also enabling the use of KD and ILME. MT-RNNT-AFT achieved performance comparable to that of MT-RNNT-tSOT, which requires rich alignments. Moreover, offline MT-RNNT-AFT matched the performance of state-of-the-art alternatives that require rich alignments, speaker details, and an additional encoder.
We also applied PIT/HEAT to the standard RNNT architecture without additional encoders or decoders for MT-RNNT training, but the training loss failed to converge because the shared parameters had no means of identifying which speaker to recognize. The proposed prompt tokens, which specify each speaker, address this issue.