VoxEval: Benchmarking the Knowledge Understanding Capabilities of End-to-End Spoken Language Models

Wenqian Cui1, Xiaoqi Jiao2, Ziqiao Meng3, Irwin King1
1The Chinese University of Hong Kong, 2LightSpeed Studios, Tencent, 3National University of Singapore


Abstract

With the growing demand for developing speech-based interaction models, end-to-end Spoken Language Models (SLMs) have emerged as a promising solution. When engaging in conversations with humans, it is essential for these models to comprehend a wide range of world knowledge. In this paper, we introduce VoxEval, a novel speech question-answering benchmark specifically designed to assess SLMs’ knowledge understanding through purely speech-based interactions. Unlike existing AudioQA benchmarks, VoxEval maintains speech format for both questions and answers, evaluates model robustness across diverse audio conditions (varying timbres, audio qualities, and speaking styles), and pioneers the assessment of challenging domains like mathematical problem-solving in spoken format. Our comprehensive evaluation of recent SLMs using VoxEval reveals significant performance limitations in current models, highlighting crucial areas for future improvements.1

1 Introduction

The progress in text-based Large Language Models (LLMs) has significantly impacted the field of generative AI, allowing for smooth communication between humans and AI via written text. However, since natural human interactions often involve spoken language, there is a growing movement to integrate speech capabilities into LLMs [1]. This initiative aims to enable direct speech communication between humans and AI by endowing language models with the ability to understand and produce speech. Consequently, end-to-end Spoken Language Models (SLMs), which engage directly in spoken dialogue, represent an exciting and promising direction.

Figure 1: Illustration of the three crucial factors for evaluating SLMs’ knowledge comprehension.

Evaluation of SLMs is crucial for ensuring their effectiveness and reliability in real-world applications. As with Text Language Models (TLMs), the ability to understand world knowledge is essential for SLMs to hold conversations with humans. When evaluating SLMs’ knowledge comprehension, several crucial factors should be taken into account: 1) The evaluation should be conducted entirely in speech format, meaning both the input and output must be audio-based. This differs from typical AudioQA tasks [2]–[5], where audio clips are paired with text questions. SLM evaluation should use speech alone to mimic real-world conversational settings. 2) The evaluation should consider various input audio conditions. Users may interact with SLMs under diverse audio characteristics, such as different speaker timbres, audio qualities, and speaking styles. Consequently, SLMs should demonstrate robustness against these varying conditions, producing consistent answers when the input content remains the same [6]. 3) Not all knowledge comprehension topics are straightforward for SLM evaluation. Knowledge understanding spans diverse subjects, including literature, sociology, and engineering. However, some subjects present unique challenges when evaluated in spoken format. For instance, while math expressions can be written easily and concisely in text, evaluating them in spoken form remains a challenge. Figure 1 illustrates these three key elements.

Based on the key elements outlined, we introduce VoxEval, a speech question-answering benchmark designed to assess the knowledge understanding capabilities of SLMs. This benchmark features audio recordings for both questions and instructions, facilitating a comprehensive evaluation of speech modeling abilities. Additionally, VoxEval incorporates a variety of input conditions for the same question, allowing us to test the robustness of SLMs across different audio scenarios. To address the challenges associated with evaluating certain subjects, such as mathematics, we pinpoint the critical differences between written and spoken formats and convert the former into their spoken counterparts. Our evaluation of recently proposed SLMs using VoxEval reveals that it presents a significant challenge for existing models. Additionally, existing SLMs demonstrate vulnerability to changes in input audio conditions, indicating the need for future research to focus on improving robustness. To summarize our contributions:

  • We introduce a novel benchmark, VoxEval, specifically designed to evaluate the knowledge understanding capabilities of SLMs exclusively in speech format, across a diverse range of audio conditions.

  • We pioneer the assessment of SLMs’ math-solving abilities, addressing a critical gap in existing evaluation methodologies.

  • We conduct a systematic evaluation of VoxEval with several recent SLMs, revealing that current models exhibit significant limitations in performance on our benchmark.

2 Related Work

2.1 Speech Large Language Models

Speech Large Language Models (SpeechLLMs) refer to LLMs that can understand and/or generate speech. There are two primary approaches for integrating speech modalities into LLMs. The first approach focuses on enabling LLMs to understand speech information. This is usually achieved by connecting a speech encoder to a pre-existing LLM while preserving most of the model’s original structure and parameters. We refer to these as Speech-to-Text Large Language Models (S2TLLMs). Since the primary goal of these approaches is to enable existing LLMs to perceive auditory information, they only support text as the output modality [7], [8]. For example, Qwen-Audio [7] utilizes Whisper-large-v2 [9] as the audio encoder and Qwen-7B [10] as the LLM. SALMONN [8] uses both Whisper and BEATs [11] as audio encoders and Vicuna [12] as the LLM.

In contrast, the second approach goes further by developing end-to-end speech interaction models [13]–[15]. In these models, both the input and output modalities are speech, allowing the model to both hear and speak. These approaches model speech directly throughout the LLM-based pipeline and do not rely heavily on textual information. We refer to these models as End-to-End Spoken Language Models (SLMs). SLMs typically involve three components to enable end-to-end speech interaction: a tokenizer, a language model (LM), and a vocoder. Given input speech, the tokenizer [16]–[18] first encodes the audio waveform into speech tokens or representations, the LM [19]–[21] then autoregressively generates a token response, and the vocoder [22]–[24] synthesizes the output tokens back into a speech waveform.
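To make this three-stage pipeline concrete, the minimal sketch below summarizes the flow; the tokenizer, lm, and vocoder objects (and their encode, generate, and synthesize methods) are hypothetical placeholders rather than any specific model's API.

```python
def slm_respond(input_waveform, tokenizer, lm, vocoder):
    """Speech in, speech out: tokenize -> autoregressive LM -> vocode.
    All three components are hypothetical stand-ins for a concrete SLM stack."""
    speech_tokens = tokenizer.encode(input_waveform)   # waveform -> discrete speech tokens
    response_tokens = lm.generate(speech_tokens)       # autoregressive generation of response tokens
    return vocoder.synthesize(response_tokens)         # response tokens -> output waveform
```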

2.2 SpeechLLM Evaluation Benchmarks

The evaluation of SpeechLLMs centers on their ability to model speech data, with researchers examining various aspects and levels. First, multiple datasets assess the linguistic modeling capabilities of SpeechLLMs. For instance, sWUGGY [25] is employed for lexical evaluation, sBLIMP [25] for syntactic assessment, and sStoryCloze [26] for semantic evaluation. Additionally, the modeling of paralinguistic elements (spoken content beyond words) is crucial for SpeechLLMs, prompting the development of datasets that evaluate paralinguistic information at both the token and perceptual levels [14], [27]. Beyond the modeling aspect, the performance of downstream applications is vital for the practical use of SpeechLLMs in real-world scenarios. Recently, numerous downstream evaluation benchmarks have emerged [2]–[5], focusing on general audio understanding, including speech, environmental sounds, and music. However, as highlighted in Section 1, these benchmarks keep all QA pairs in text format, making them suitable for evaluating S2TLLMs but not SLMs. The closest benchmark to our work is VoiceBench [28], which also evaluates end-to-end SLMs under various audio conditions. However, our study emphasizes assessing the knowledge understanding capabilities of SLMs with a particular focus on their mathematical abilities, and provides an in-depth analysis of SLMs’ performance across various settings.

2.3 Knowledge Understanding of LLMs

Knowledge understanding is a vital skill for LLMs, as they often need to accurately grasp and produce content based on knowledge. Many benchmarks assess the knowledge comprehension of TLMs at different levels of difficulty. CommonsenseQA [29] tests commonsense knowledge, the most basic level of human understanding. OpenBookQA [30] focuses on basic science questions. MMLU [31] includes knowledge questions from 57 subjects, mostly at high school or college level. MMLU-Pro [32] is an advanced version of MMLU with increased difficulty. AGIEval [33] assesses challenging human-level tasks.

3 VoxEval

Table 1: Explanations and examples of the five techniques used in linguistic variation.
Linguistic Variation | Explanation | Example
Filler Words | Include words like "um," "uh," "like," and "you know" to see if the model can accurately process the main content despite interruptions. | Uh, I’m not sure if I can, like, make it to the meeting on time.
Mispronunciations | Introduce common mispronunciations to evaluate the model’s ability to understand intended words. | I went to the libary this Febuary (I went to the library this February)
Disfluencies | Test with speech that includes stutters, repetitions, or self-corrections to assess how well the model handles natural speech patterns. | It’s, it’s, it’s really important to be, uh, on time. Don’t, don’t, don’t do this again.
False Starts and Corrections | Insert deliberate errors followed by corrections to assess the model’s ability to track changes in speech. | What is the value of five times, no, I mean, four times two?
Language Proficiency | Incorporate typical errors made by non-native speakers to evaluate the model’s ability to handle non-native grammatical mistakes. | He just tell me he not coming today.

In this section, we introduce a speech-based knowledge understanding benchmark called VoxEval, concentrating primarily on the methods we use to create the data, how we ensure it meets the robustness evaluation criteria, and how we address the evaluation of complex topics like mathematics.

3.1 Data Construction

Building a speech-based knowledge understanding benchmark requires the collection of knowledge from different subjects and the construction of question-answer pairs about the knowledge in speech format. Given that there are numerous existing textual knowledge understanding benchmarks (see Section 2.3), it is logical to build the speech-based benchmark upon textual benchmarks. In this work, we leverage the MMLU [31] dataset to build VoxEval. Specifically, we transform the questions from MMLU into speech using the OpenAI Text-to-Speech (TTS) API 2. We select MMLU as the underlying dataset for the following reasons:

  • The subjects contained in MMLU are structured and comprehensive, including STEM, social sciences, humanities, and more, making it ideal for a holistic evaluation of knowledge understanding.

  • MMLU is widely used for evaluating SLMs [14], [15]. However, this evaluation primarily reflects their capability to handle text, not speech.

  • In datasets available in both written and spoken forms (like sStoryCloze [26]), we find that the spoken version is significantly more challenging than the written one. Therefore, adapting MMLU into a speech format would create a demanding dataset for current SLMs.

To transform MMLU into its spoken version, we need to concatenate all the text within a question into a single sequence. In MMLU, every item is a Multiple-Choice Question (MCQ) with four answer choices. We use a simple sentence to concatenate the question and answer choices. Therefore, every converted item looks like this:

f"{question} Please choose the answer from options A, B, C, and D. Here are the options. A. {option_A}, B. {option_B}, C. {option_C}, D. {option_D}.", 

and its corresponding answer is

f"The correct answer is {answer}.".

Remark 1. Among the 57 subjects in MMLU, we exclude the "high school computer science" subject when constructing VoxEval, as it includes code snippets that are not suitable for verbal evaluation.
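As a minimal sketch, the template above can be instantiated from a single MMLU item as follows; the field layout (question string, list of four options, answer letter) is an assumption about how each item is stored.

```python
def build_voxeval_text(question: str, options: list[str], answer: str) -> tuple[str, str]:
    """Flatten one MMLU multiple-choice item into the question/answer strings
    that are later passed to TTS."""
    option_A, option_B, option_C, option_D = options  # exactly four answer choices
    question_text = (
        f"{question} Please choose the answer from options A, B, C, and D. "
        f"Here are the options. A. {option_A}, B. {option_B}, C. {option_C}, D. {option_D}."
    )
    answer_text = f"The correct answer is {answer}."
    return question_text, answer_text
```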

3.2 Various Input Conditions

As mentioned in Section 1, it is crucial for the answers of SLMs to be robust under various input audio conditions. Inspired by this, we incorporate MMLU questions with different input audio conditions into VoxEval. Specifically, we consider the following types of input conditions.

3.2.1 Different Speakers

SLMs should be robust when different speakers interact with them. Several key factors reflect the uniqueness of a speaker, including but not limited to gender, age, and accent. To evaluate the SLMs’ performance on different speakers, we use all six speakers provided by OpenAI TTS, namely alloy, echo, fable, nova, onyx, and shimmer, to perform the TTS. These speakers span a wide array of speaker properties, such as gender and accent.
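A minimal sketch of this synthesis step, assuming the official OpenAI Python SDK; the model name "tts-1" and the output file naming are illustrative assumptions.

```python
from pathlib import Path

from openai import OpenAI  # official OpenAI Python SDK (assumed available)

VOICES = ["alloy", "echo", "fable", "nova", "onyx", "shimmer"]

def synthesize_all_voices(text: str, out_dir: str = "voxeval_audio") -> None:
    """Render the same question text with each of the six OpenAI TTS voices."""
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    for voice in VOICES:
        response = client.audio.speech.create(model="tts-1", voice=voice, input=text)
        (Path(out_dir) / f"{voice}.mp3").write_bytes(response.content)  # default output format is mp3
```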

Table 2: Comparison between written math and spoken math in various circumstances.
Circumstance | Written Version | Spoken Version
Arabic Numerals | 2351 | Two thousand three hundred fifty-one
Units | 25cm | Twenty-five centimeters
Operators and Brackets | \(4 \div (2 + 8)\) | four divided by the sum of two and eight

3.2.2 Different Speaking Styles

In real-world conversations, even a single individual may use different speaking styles depending on the situation. Therefore, SLMs need to manage various input styles. We identify two types of style variations: linguistic and paralinguistic.

Linguistic variation involves changes in the content. For instance, people may sometimes pause to think about what they want to say, leading to disfluencies in their speech. Another example is when someone accidentally says something incorrectly and then corrects themselves upon realizing the mistake. SLMs should still accurately interpret the speech in these situations. However, these scenarios are not typically captured when using TTS to generate speech from a well-written text. To address this, we propose five techniques for linguistic variation to accommodate different situations, including filler words, mispronunciations, disfluencies, false starts and corrections, and language proficiency. Table 1 provides explanations and examples for each of these five variations. To replicate these scenarios, we first use GPT-4o [6] to modify the original question text into different versions and then convert them into speech. We only alter the question text and not the answer choice texts.
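To illustrate this rewriting step, here is a minimal sketch using the OpenAI chat completions API; the instruction wording is illustrative and not the exact prompt used in this work.

```python
from openai import OpenAI

client = OpenAI()

def add_linguistic_variation(question: str, variation: str) -> str:
    """Rewrite only the question text with one linguistic variation from Table 1
    (e.g. 'filler words', 'disfluencies'); answer choices are left untouched."""
    completion = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system",
             "content": (f"Rewrite the user's question so that it exhibits {variation}, "
                         "keeping the meaning unchanged. Return only the rewritten question.")},
            {"role": "user", "content": question},
        ],
    )
    return completion.choices[0].message.content.strip()
```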

Table 3: VoxEval evaluation results of popular SLMs under various input audio conditions.
SLMs SpeechGPT TWIST SPIRIT-LM Moshi GLM-4-Voice
Speakers
Alloy 0.0001 0.0480 0.2084 0.1216 0.3763
Echo 0.0001 0.0558 0.2096 0.1221 0.3764
Fable 0.0000 0.0116 0.2084 0.1153 0.3642
Nova 0.0001 0.0332 0.2070 0.1298 0.3677
Onyx 0.0002 0.0275 0.1966 0.1192 0.3764
Shimmer 0.0000 0.0516 0.2076 0.1205 0.3815
Speaking Styles
Linguistic 0.0001 0.0488 0.2044 0.1187 0.3643
Speed 0.0001 0.0503 0.1911 0.1013 0.3469
Pitch 0.0000 0.0544 0.1788 0.0609 0.3345
Audio Qualities
Noise 0.0000 0.0368 0.1950 0.1018 0.3695
Other Env Acoustics 0.0001 0.0434 0.2019 0.1051 0.3728
Underlying Text LMs Llama-7B Llama-7B Llama-2-7B Helium-7B GLM-4-9B
Text MMLU 0.3510 0.3510 0.4530 0.5430 0.7470

Paralinguistic variation, in contrast, alters how the input audio sounds without changing its content. As with linguistic variation, SLMs should produce consistent outputs under different paralinguistic variations. We consider the following paralinguistic variations (a code sketch of these augmentations follows the list).

  • Pitch shift: Pitch shift refers to the alteration of the pitch of the speech signal, which can simulate the effect of someone speaking in a higher or lower tone. For each question input, we randomly shift its pitch between -5 and 5 semitones.

  • Tempo change: In real-life situations, people often vary their speaking speed, sometimes talking faster and other times slower. To mimic these variations in speaking pace, we apply tempo changes. For each question input, we randomly select a tempo-changing rate that ranges from half the original speed to twice as fast.
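A minimal sketch of these two variations, assuming the audiomentations package noted in Remark 2; the parameter ranges mirror the description above, and parameter names follow recent versions of the package.

```python
import numpy as np
from audiomentations import PitchShift, TimeStretch

# Pitch shift of +/-5 semitones; tempo change between 0.5x and 2x speed.
pitch_variant = PitchShift(min_semitones=-5.0, max_semitones=5.0, p=1.0)
tempo_variant = TimeStretch(min_rate=0.5, max_rate=2.0, leave_length_unchanged=False, p=1.0)

def apply_paralinguistic_variants(samples: np.ndarray, sample_rate: int) -> tuple[np.ndarray, np.ndarray]:
    """Return a pitch-shifted and a tempo-changed copy of a mono float32 waveform."""
    return (
        pitch_variant(samples=samples, sample_rate=sample_rate),
        tempo_variant(samples=samples, sample_rate=sample_rate),
    )
```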

3.2.3 Different Audio Qualities

When interacting with SLMs, users may be in environments with various acoustic conditions. However, in a standard TTS system, the speech is synthesized in an "ideal" acoustic environment (very clear, no echo, etc.), thus failing to mimic real-world interaction scenarios. We consider two types of acoustic conditions when evaluating SLMs (see the sketch after the list):

  • Background noise: Sometimes the input audio may be accompanied by background noise, which can affect the clarity and intelligibility of the spoken content. We consider different kinds of noise, including Gaussian noise, colored noise, background music, and short interruption noise. For each question input, a kind of noise is randomly selected and applied to the original audio.

  • Other environment acoustics: We consider speech recorded under various conditions, including aliasing, room impulse response, low pass filter, high pass filter, band pass filter, bitcrush, gain, clipping distortion, and seven band parameter EQ. Full explanations and transformation settings are presented in Appendix 6. A random audio effect is applied to each question input.
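A minimal sketch of applying one randomly selected degradation per question, again assuming the audiomentations package from Remark 2; only a subset of the effects listed in Appendix 6 is shown, with settings following Table 6 where applicable.

```python
import numpy as np
from audiomentations import (
    AddGaussianNoise, BandPassFilter, ClippingDistortion,
    Gain, HighPassFilter, LowPassFilter, OneOf,
)

# One randomly chosen degradation per question (subset of Appendix 6 effects).
acoustic_augment = OneOf([
    AddGaussianNoise(p=1.0),
    LowPassFilter(min_cutoff_freq=5000.0, max_cutoff_freq=5000.0, p=1.0),
    HighPassFilter(min_cutoff_freq=1000.0, max_cutoff_freq=1000.0, p=1.0),
    BandPassFilter(min_center_freq=2000.0, max_center_freq=2000.0, p=1.0),
    Gain(min_gain_db=-12.0, max_gain_db=12.0, p=1.0),
    ClippingDistortion(min_percentile_threshold=20, max_percentile_threshold=20, p=1.0),
], p=1.0)

def degrade(samples: np.ndarray, sample_rate: int) -> np.ndarray:
    """Apply one randomly selected acoustic degradation to a mono float32 waveform."""
    return acoustic_augment(samples=samples, sample_rate=sample_rate)
```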

Remark 2. We apply all the variations to the speech data generated using the speaker "alloy" only. To create different paralinguistic and acoustic conditions, we utilize the "audiomentations" package3 to augment the audio.

3.3 Handle Math Expressions and More

Logical reasoning and critical thinking are crucial for AI assistants like LLMs, and tackling math problems is a major part of these abilities. Math-solving skills are vital for SLMs because humans do not always express mathematical ideas in written form, and spoken math has many uses for AI assistants, such as in tutoring. Hence, we pioneer the evaluation of math-solving abilities for SLMs.

Mathematical problems often include numerous mathematical expressions, and current TTS systems struggle to accurately convert these expressions into speech, as they are typically trained only on textual words. To tackle this issue, we propose a two-step method for synthesizing math problems. First, we convert the written math problems into their spoken form, and then we use the TTS system to generate speech. Specifically, we create a set of few-shot examples and use GPT-4o to handle the conversion process. Our focus is on transforming Arabic numerals, units, operators, and brackets, as illustrated by the examples in Table 2. The complete prompts are provided in Figures 3 and 4.
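A minimal sketch of this conversion step; the few-shot examples below are drawn from Table 2 and are illustrative only, while the full prompt used in this work is the one shown in Figures 3 and 4.

```python
from openai import OpenAI

client = OpenAI()

# Illustrative few-shot examples drawn from Table 2 (not the full prompt of Figures 3 and 4).
FEW_SHOT_INSTRUCTIONS = (
    "Rewrite all math expressions in spoken English, leaving other text unchanged.\n"
    "2351 -> Two thousand three hundred fifty-one\n"
    "25cm -> Twenty-five centimeters\n"
    "4 / (2 + 8) -> four divided by the sum of two and eight"
)

def to_spoken_math(written_question: str) -> str:
    """Convert a written math question into its spoken form before TTS."""
    completion = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": FEW_SHOT_INSTRUCTIONS},
            {"role": "user", "content": written_question},
        ],
    )
    return completion.choices[0].message.content.strip()
```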

Table 4: Average scores for various subject groups across different input audio conditions. The scores are shown in the order "STEM/Social Science/Humanities/Others". MGS stands for Mean Grouped Score. The highest score in each scenario is highlighted in bold.
SLMs (MGS) SpeechGPT TWIST SPIRIT-LM Moshi GLM-4-Voice
Speakers
Alloy 0.0000/0.0000/0.0000/0.0003 0.0561/0.0381/0.0319/0.0587 0.2244/0.2171/0.1929/0.2330 0.1309/0.1195/0.1198/0.1285 0.3250/0.3858/0.3973/0.4303
Echo 0.0004/0.0000/0.0000/0.0000 0.0611/0.0548/0.0341/0.0692 0.2248/0.2182/0.1867/0.2422 0.1413/0.1246/0.1129/0.1371 0.3325/0.4226/0.3906/0.4123
Fable 0.0000/0.0000/0.0000/0.0000 0.0088/0.0056/0.0044/0.0170 0.2122/0.2290/0.1915/0.2364 0.1257/0.1165/0.1042/0.1191 0.3081/0.4252/0.3832/0.3933
Nova 0.0000/0.0002/0.0000/0.0001 0.0349/0.0274/0.0197/0.0408 0.2302/0.2172/0.1799/0.2291 0.1504/0.1269/0.1084/0.1432 0.3195/0.4280/0.3843/0.4012
Onyx 0.0001/0.0000/0.0001/0.0000 0.0346/0.0273/0.0149/0.0395 0.2085/0.2109/0.1823/0.2242 0.1356/0.1274/0.1105/0.1339 0.3238/0.4427/0.4084/0.3832
Shimmer 0.0000/0.0000/0.0000/0.0000 0.0686/0.0479/0.0301/0.0691 0.2190/0.2229/0.1909/0.2172 0.1387/0.1210/0.1085/0.1306 0.3179/0.4596/0.3935/0.4434
Speaking Styles
Linguistic 0.0000/0.0000/0.0004/0.0000 0.0581/0.0431/0.0305/0.0588 0.2062/0.2157/0.2020/0.2257 0.1310/0.1167/0.1009/0.1344 0.3218/0.3818/0.3694/0.3991
Speed 0.0000/0.0000/0.0002/0.0005 0.0571/0.0514/0.0342/0.0667 0.2063/0.2094/0.1783/0.2038 0.1059/0.1137/0.0916/0.1171 0.3055/0.3874/0.3630/0.3798
Pitch 0.0000/0.0000/0.0000/0.0000 0.0620/0.0555/0.0367/0.0685 0.1794/0.1844/0.1760/0.2019 0.0660/0.0573/0.0567/0.0585 0.3003/0.3735/0.3476/0.3643
Audio Qualities
Noise 0.0000/0.0000/0.0000/0.0000 0.0422/0.0337/0.0247/0.0484 0.2077/0.2173/0.1659/0.2149 0.1073/0.1079/0.0983/0.1078 0.3145/0.4193/0.3868/0.4146
Other Env Acoustics 0.0000/0.0000/0.0000/0.0003 0.0519/0.0371/0.0279/0.0546 0.2244/0.2066/0.1684/0.2325 0.1118/0.1008/0.1024/0.1130 0.3045/0.4340/0.3909/0.4142
Underlying Text LMs Llama-7B Llama-7B Llama-2-7B Helium-7B GLM-4-9B
Text MMLU 0.3050/0.3830/0.3400/0.3810 0.3050/0.3830/0.3400/0.3810 - - -

Figure 2: Box plots displaying the maximum performance score differences across different settings: (a) absolute score differences, (b) relative percentage differences.

Table 5: Evaluation results for different levels of math problems under the “Alloy" speaker setting.
SLMs Elementary Math High School Math College Math
SpeechGPT 0.0000 0.0000 0.0000
TWIST 0.0317 0.0444 0.0800
SPIRIT-LM 0.2302 0.2000 0.2200
Moshi 0.1138 0.1926 0.1700
GLM-4-Voice 0.2566 0.2593 0.2700

4 Experiments

We evaluate recently proposed SLMs on our proposed VoxEval benchmark. In this section, we explain the experimental settings and the corresponding results.

4.1 Experimental Setups

Data Preparations. We assess the performance of SLMs across various input audio conditions, including different speakers, speaking styles, and audio qualities, as described in Section 3.2. Few-shot prompting is employed to guide the SLMs in answering questions from VoxEval. Specifically, for each subject, we synthesize five examples from the MMLU validation set using all six speaker voices provided by OpenAI TTS. The five in-context examples in the corresponding speaker’s voice are prepended to the final question. To accommodate the input constraints of certain SLMs, the resulting speech is truncated to the last 80 seconds.
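A minimal sketch of this prompt assembly, assuming pydub for audio handling (an assumption; any audio library would do) and illustrative file paths.

```python
from pydub import AudioSegment  # pydub is an assumption; any audio library works

def build_prompt_audio(example_paths: list[str], question_path: str,
                       max_seconds: int = 80) -> AudioSegment:
    """Prepend the five in-context example clips to the question clip, then keep
    only the last `max_seconds` seconds to fit the SLM's input limit."""
    prompt = AudioSegment.empty()
    for path in example_paths:          # spoken question + spoken answer pairs
        prompt += AudioSegment.from_file(path)
    prompt += AudioSegment.from_file(question_path)
    return prompt[-max_seconds * 1000:]  # pydub slices in milliseconds
```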

Models. We evaluate VoxEval on five recently introduced SLMs: SpeechGPT [13], TWIST [26], SPIRIT-LM [14], Moshi [15], and GLM-4-Voice [34]. These models represent diverse training approaches within the SLM domain. Specifically, SpeechGPT and TWIST are pre-trained using unsupervised speech data. SPIRIT-LM and GLM-4-Voice are pre-trained on interleaved speech and text data. Moshi is pre-trained on aligned speech-text data. All the models are built upon TLM checkpoints. Furthermore, while the other models process audio data through a single stream, Moshi uniquely represents audio data using multiple streams.

Evaluation Metric. To assess the spoken responses provided by the SLMs, we utilize the OpenAI ASR model whisper-large-v3 [9] to convert their answers into text. Afterward, we apply string matching to determine the final answer (e.g., A, B, C, or D) from the transcription and calculate the accuracy.
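A minimal sketch of this scoring step, assuming the Hugging Face transformers ASR pipeline for whisper-large-v3; the answer-extraction regex is illustrative.

```python
import re
from transformers import pipeline

asr = pipeline("automatic-speech-recognition", model="openai/whisper-large-v3")

def extract_choice(audio_path: str) -> str | None:
    """Transcribe a spoken response and pull out the selected option letter."""
    text = asr(audio_path)["text"]
    # Prefer the expected "The correct answer is X" pattern, fall back to any
    # standalone option letter.
    match = re.search(r"correct answer is\s*([ABCD])\b", text, flags=re.IGNORECASE)
    if match is None:
        match = re.search(r"\b([ABCD])\b", text)
    return match.group(1).upper() if match else None

# accuracy = mean(extract_choice(path) == gold for path, gold in eval_set)
```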

4.2 Results

We aim to answer the following Research Questions (RQs) through experiments. RQ1: What is the overall performance of existing SLMs on the VoxEval benchmark? RQ2: How do existing SLMs perform across various knowledge domains? RQ3: To what extent do different input audio conditions influence the performance of existing SLMs? RQ4: How well do current SLMs perform at reasoning, particularly mathematical reasoning?

To answer RQ1, we evaluate the five SLMs on VoxEval, with results shown in Table 3. We observe that existing SLMs perform poorly on VoxEval. In fact, most SLMs’ performance does not even surpass random guessing. Since each question in VoxEval offers four answer choices, the expected score for random guessing is 0.25. However, all SLMs, except for GLM-4-Voice, fail to exceed this baseline. This indicates that, in many instances, the SLMs struggle to correctly follow instructions to select from the options. This highlights the difficulty of enabling SLMs to "speak" the correct answer, making VoxEval a particularly challenging benchmark for current SLMs. A case study of different SLMs’ spoken responses is shown in Appendix 7. During the evaluation, we find that SpeechGPT’s performance is close to 0%. This is likely due to our approach of directly performing speech-to-speech generation without utilizing the "chain-of-modality" mode described in [13]. We choose not to use the "chain-of-modality" approach because it involves converting the input into text before generating a response, which we believe would significantly increase inference latency and is not an optimal strategy for end-to-end SLMs.

To answer RQ2, we present the SLM performance results across various subject groups within VoxEval. Following MMLU, all knowledge subjects are divided into four groups: STEM (18 subjects), Social Science (12 subjects), Humanities (13 subjects), and Others (13 subjects). Table 4 shows the average performance of the subjects in each group. We notice that SLMs exhibit significant performance variations across different subject groups. For instance, when using alloy as the input voice, SPIRIT-LM shows a score difference of approximately 4%, and GLM-4-Voice demonstrates a difference of around 10%. Moreover, different SLMs excel in different subject groups. For example, TWIST, Moshi, and GLM-4-Voice typically perform the best in the Others group, STEM group, and Social Science group, respectively.

To answer RQ3, we present the overall performance of SLMs across various input audio conditions in Table 3. Our key findings are summarized as follows.

  • SLMs are susceptible to different input audio conditions. Although the performance of SLMs does not vary significantly with changes in speaker voices, we observe that differences in speaking styles and audio quality can negatively impact their performance, sometimes leading to a performance drop of up to 6%.

  • Different input audio conditions have different impacts on SLMs. For instance, environment acoustics and linguistic variations have minimal impact on SLMs’ performance, whereas pitch shifts pose the greatest challenge for most SLMs.

We further visualize performance differences across all the settings (input audio conditions), as depicted in Figure 2. For each setting, we compute the maximum performance difference for each of the 56 subjects in the VoxEval dataset and present the results (56 difference scores) using box plots; a sketch of this computation follows the observations below. In the "Speaker" setting, the differences are calculated using audio from all six speakers provided by OpenAI TTS. For the remaining settings, since all audio variations are applied to speech data generated by the "alloy" voice, the differences are computed by comparing the original "alloy" audio with the transformed audio. The top part of the figure illustrates the absolute differences in scores, while the bottom part shows the relative percentage differences. We have the following key observations.

  • Performance variation among individual subjects is significantly greater than the overall performance variation. While overall performance differences are typically within 1–2%, we found many instances where the performance gap for a specific subject exceeded 10%.

  • Models exhibit varying levels of stability when faced with changes in audio conditions. As shown in Figure 2 (b), TWIST and Moshi tend to be less stable, whereas GLM-4-Voice demonstrates minimal relative score differences.
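For reference, here is a minimal sketch of the per-subject difference computation behind Figure 2, assuming each setting’s accuracies are stored in a pandas DataFrame with one row per subject and one column per condition (a layout assumption; the relative version divides by the original "alloy" score).

```python
import matplotlib.pyplot as plt
import pandas as pd

def max_subject_differences(scores: pd.DataFrame) -> pd.Series:
    """Per-subject maximum score difference across the conditions of one setting
    (e.g. the six speakers): rows = 56 subjects, columns = conditions."""
    return scores.max(axis=1) - scores.min(axis=1)

def plot_setting_differences(per_setting: dict[str, pd.DataFrame]) -> None:
    """One box per setting, each summarizing 56 per-subject difference scores."""
    data = [max_subject_differences(df).to_numpy() for df in per_setting.values()]
    plt.boxplot(data)
    plt.xticks(range(1, len(data) + 1), list(per_setting.keys()), rotation=45)
    plt.ylabel("Max per-subject score difference")
    plt.tight_layout()
    plt.show()
```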

Additionally, SLMs display relatively consistent behavior across different condition changes. According to Table 4, when input audio conditions are altered, TWIST, Moshi, and GLM-4-Voice consistently perform the best in the Others, STEM, and Social Science groups, respectively.

To answer RQ4, we present the performance of SLMs across three math subjects in VoxEval, as shown in Table 5. The selected subjects (elementary math, high school math, and college math) are intended to represent varying levels of reasoning complexity. We observe that existing SLMs perform similarly across different levels of math questions. Interestingly, in some cases, SLMs perform better on more challenging questions. For example, TWIST and GLM-4-Voice achieve the highest performance in college math. This counterintuitive result suggests that existing SLMs are unlikely to possess basic reasoning abilities.

5 Conclusion

In this paper, we introduce VoxEval, a novel speech-based question-answering benchmark designed to evaluate the knowledge understanding capabilities of end-to-end Spoken Language Models (SLMs) under diverse audio conditions. Comprehensive evaluations of recent SLMs reveal significant performance limitations, including accuracies that often fall below random guessing, heightened susceptibility to variations in input audio conditions, and notable challenges in domains such as mathematical reasoning. These findings highlight the need for future research to improve the robustness and reasoning abilities of SLMs in real-world conversational settings. VoxEval serves as a critical step toward advancing the evaluation and development of SLMs, paving the way for more effective and reliable speech-based AI systems.

6 Explanation for Various Speech Recording Conditions

Table 6 explains the input audio conditions categorized under "other environment acoustics" and lists the corresponding transformation settings in the audiomentations package.

Table 6: Explanations and transformation settings categorized under "other environment acoustics."
Audio Conditions | Explanation | Settings
Aliasing | Aliasing occurs when a signal is sampled at a rate lower than twice its highest frequency (violating the Nyquist theorem), causing higher frequencies to fold back and appear as lower, incorrect frequencies in the sampled signal. | min_sample_rate=4000, max_sample_rate=6000
Room Impulse Response | Room impulse response refers to the acoustic fingerprint of a space, capturing how sound propagates, reflects, and decays in that environment. | A bedroom impulse response is used.
Low Pass Filter | A low-pass filter in audio processing allows low-frequency sounds to pass through while attenuating (reducing) higher frequencies, helping to remove noise or create a smoother sound. | min_cutoff_freq=5000.0, max_cutoff_freq=5000.0
Band Pass Filter | A band-pass filter in audio processing allows a band of mid-frequency sounds to pass through while attenuating frequencies outside that band. | min_center_freq=2000.0, max_center_freq=2000.0
High Pass Filter | A high-pass filter in audio processing allows high-frequency sounds to pass through while attenuating lower frequencies. | min_cutoff_freq=1000.0, max_cutoff_freq=1000.0
Bitcrush | Bitcrush reduces the resolution (bit depth) and/or sampling rate of a sound, creating a lo-fi, distorted, or "crunchy" effect by introducing digital artifacts and aliasing. | min_bit_depth=5, max_bit_depth=6
Gain | Gain refers to the adjustment of the amplitude or volume of an audio signal. | min_gain_db=-12.0, max_gain_db=12.0
Clipping Distortion | Clipping distortion occurs when an audio signal exceeds the maximum amplitude that a system can handle, causing the waveform to be "clipped" at the peaks and resulting in a harsh, distorted sound. | min_percentile_threshold=20, max_percentile_threshold=20
Seven Band Parameter EQ | A seven-band parametric EQ is an equalizer that allows precise control over seven specific frequency bands, enabling adjustments to the gain, bandwidth (Q), and center frequency of each band to shape the tonal balance or correct issues in the audio signal. | min_gain_db=-12.0, max_gain_db=12.0

7 Case Study for SLM Responses

Table 7 presents example SLM responses to questions from VoxEval as well as our comments.

Table 7: Examples of each SLM’s response and our comments.
SLMs | Example Response | Comment
SpeechGPT | I bet. | SpeechGPT outputs random words most of the time.
TWIST | A, C, A, and E plus, that’s where H, H, and A, F, and E, D, F, A, A, | TWIST sometimes gives an answer but cannot form an interpretable sentence most of the time.
SPIRIT-LM | The correct answer is A. Statement 1. K is a normal subgroup of X. X is a normal subgroup of X. Please choose the answer of the question from Options A, B, C, or D. | SPIRIT-LM is able to follow the required format and output an answer, but it does not stop after giving an answer.
Moshi | I think it’s D D | Moshi is able to interpret the question and output an answer, but it does not follow the format of "The correct answer is x."
GLM-4-Voice | The correct answer is B. Multiplication is not associative. | GLM-4-Voice is able to answer in the required format and also gives an explanation afterwards.


Figure 3: The prompt for GPT-4o to convert questions with math expressions.


Figure 4: The prompt for GPT-4o to convert answer choices with math expressions.

References

[1]
Wenqian Cui, Dianzhi Yu, Xiaoqi Jiao, Ziqiao Meng, Guangyan Zhang, Qichao Wang, Yiwen Guo, and Irwin King. 2024. Recent advances in speech language models: A survey. arXiv preprint arXiv:2410.03751.
[2]
S Sakshi, Utkarsh Tyagi, Sonal Kumar, Ashish Seth, Ramaneswaran Selvakumar, Oriol Nieto, Ramani Duraiswami, Sreyan Ghosh, and Dinesh Manocha. 2024. Mmau: A massive multi-task audio understanding and reasoning benchmark. arXiv preprint arXiv:2410.19168.
[3]
Bin Wang, Xunlong Zou, Geyu Lin, Shuo Sun, Zhuohan Liu, Wenyu Zhang, Zhengyuan Liu, AiTi Aw, and Nancy F Chen. 2024. Audiobench: A universal benchmark for audio large language models. arXiv preprint arXiv:2406.16020.
[4]
Qian Yang, Jin Xu, Wenrui Liu, Yunfei Chu, Ziyue Jiang, Xiaohuan Zhou, Yichong Leng, Yuanjun Lv, Zhou Zhao, Chang Zhou, et al. 2024. Air-bench: Benchmarking large audio-language models via generative comprehension. arXiv preprint arXiv:2402.07729.
[5]
Chien-yu Huang, Wei-Chih Chen, Shu-wen Yang, Andy T Liu, Chen-An Li, Yu-Xiang Lin, Wei-Cheng Tseng, Anuj Diwan, Yi-Jen Shih, Jiatong Shi, et al. 2024. Dynamic-superb phase-2: A collaboratively expanding benchmark for measuring the capabilities of spoken language models with 180 tasks. arXiv preprint arXiv:2411.05361.
[6]
OpenAI. 2024. https://openai.com/index/gpt-4o-system-card/. Online; Accessed on 6-September-2024.
[7]
Yunfei Chu, Jin Xu, Xiaohuan Zhou, Qian Yang, Shiliang Zhang, Zhijie Yan, Chang Zhou, and Jingren Zhou. 2023. Qwen-audio: Advancing universal audio understanding via unified large-scale audio-language models. arXiv preprint arXiv:2311.07919.
[8]
Changli Tang, Wenyi Yu, Guangzhi Sun, Xianzhao Chen, Tian Tan, Wei Li, Lu Lu, Zejun Ma, and Chao Zhang. 2024. https://openreview.net/forum?id=14rn7HpKVk. In Proceedings of the 12th International Conference on Learning Representations (ICLR).
[9]
Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. 2023. Robust speech recognition via large-scale weak supervision. In International Conference on Machine Learning, pages 28492–28518. PMLR.
[10]
Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. 2023. Qwen technical report. arXiv preprint arXiv:2309.16609.
[11]
Sanyuan Chen, Yu Wu, Chengyi Wang, Shujie Liu, Daniel Tompkins, Zhuo Chen, Wanxiang Che, Xiangzhan Yu, and Furu Wei. 2023. Beats: audio pre-training with acoustic tokenizers. In Proceedings of the 40th International Conference on Machine Learning, ICML’23. JMLR.org.
[12]
Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. 2023. https://lmsys.org/blog/2023-03-30-vicuna/.
[13]
Dong Zhang, Shimin Li, Xin Zhang, Jun Zhan, Pengyu Wang, Yaqian Zhou, and Xipeng Qiu. 2023. https://doi.org/10.18653/v1/2023.findings-emnlp.1055. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 15757–15773, Singapore. Association for Computational Linguistics.
[14]
Tu Anh Nguyen, Benjamin Muller, Bokai Yu, Marta R Costa-Jussa, Maha Elbayad, Sravya Popuri, Paul-Ambroise Duquenne, Robin Algayres, Ruslan Mavlyutov, Itai Gat, et al. 2024. Spirit-lm: Interleaved spoken and written language model. arXiv preprint arXiv:2402.05755.
[15]
Alexandre Défossez, Laurent Mazaré, Manu Orsini, Amélie Royer, Patrick Pérez, Hervé Jégou, Edouard Grave, and Neil Zeghidour. 2024. http://kyutai.org/Moshi.pdf. Technical report, Kyutai.
[16]
Xin Zhang, Dong Zhang, Shimin Li, Yaqian Zhou, and Xipeng Qiu. 2024. https://openreview.net/forum?id=AF9Q8Vip84. In Proceedings of the 12th International Conference on Learning Representations (ICLR).
[17]
Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, and Abdelrahman Mohamed. 2021. Hubert: Self-supervised speech representation learning by masked prediction of hidden units. IEEE/ACM transactions on audio, speech, and language processing, 29:3451–3460.
[18]
Yu-An Chung, Yu Zhang, Wei Han, Chung-Cheng Chiu, James Qin, Ruoming Pang, and Yonghui Wu. 2021. W2v-bert: Combining contrastive learning and masked language modeling for self-supervised speech pre-training. In 2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pages 244–250. IEEE.
[19]
Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. 2024. The llama 3 herd of models. arXiv preprint arXiv:2407.21783.
[20]
Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. 2023. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.
[21]
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Ł ukasz Kaiser, and Illia Polosukhin. 2017. https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf. In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc.
[22]
Jungil Kong, Jaehyeon Kim, and Jaekyoung Bae. 2020. Hifi-gan: Generative adversarial networks for efficient and high fidelity speech synthesis. Advances in neural information processing systems, 33:17022–17033.
[23]
Adam Polyak, Yossi Adi, Jade Copet, Eugene Kharitonov, Kushal Lakhotia, Wei-Ning Hsu, Abdelrahman Mohamed, and Emmanuel Dupoux. 2021. https://doi.org/10.21437/Interspeech.2021-475. In Proc. Interspeech 2021, pages 3615–3619.
[24]
Neil Zeghidour, Alejandro Luebs, Ahmed Omran, Jan Skoglund, and Marco Tagliasacchi. 2021. Soundstream: An end-to-end neural audio codec. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 30:495–507.
[25]
Tu Anh Nguyen, Maureen de Seyssel, Patricia Rozé, Morgane Rivière, Evgeny Kharitonov, Alexei Baevski, Ewan Dunbar, and Emmanuel Dupoux. 2020. The zero resource speech benchmark 2021: Metrics and baselines for unsupervised spoken language modeling. In Proceedings of the Workshop on Self-Supervised Learning for Speech and Audio Processing. NeurIPS.
[26]
Michael Hassid, Tal Remez, Tu Anh Nguyen, Itai Gat, Alexis Conneau, Felix Kreuk, Jade Copet, Alexandre Defossez, Gabriel Synnaeve, Emmanuel Dupoux, et al. 2024. Textually pretrained speech language models. Advances in Neural Information Processing Systems, 36.
[27]
Eugene Kharitonov, Ann Lee, Adam Polyak, Yossi Adi, Jade Copet, Kushal Lakhotia, Tu Anh Nguyen, Morgane Riviere, Abdelrahman Mohamed, Emmanuel Dupoux, and Wei-Ning Hsu. 2022. https://doi.org/10.18653/v1/2022.acl-long.593. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8666–8681, Dublin, Ireland. Association for Computational Linguistics.
[28]
Yiming Chen, Xianghu Yue, Chen Zhang, Xiaoxue Gao, Robby T Tan, and Haizhou Li. 2024. Voicebench: Benchmarking llm-based voice assistants. arXiv preprint arXiv:2410.17196.
[29]
Alon Talmor, Jonathan Herzig, Nicholas Lourie, and Jonathan Berant. 2019. https://doi.org/10.18653/v1/N19-1421. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4149–4158, Minneapolis, Minnesota. Association for Computational Linguistics.
[30]
Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. 2018. https://doi.org/10.18653/v1/D18-1260. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2381–2391, Brussels, Belgium. Association for Computational Linguistics.
[31]
Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. 2021. https://openreview.net/forum?id=d7KBjmI3GmQ. In Proceedings of the 9th International Conference on Learning Representations (ICLR).
[32]
Yubo Wang, Xueguang Ma, Ge Zhang, Yuansheng Ni, Abhranil Chandra, Shiguang Guo, Weiming Ren, Aaran Arulraj, Xuan He, Ziyan Jiang, et al. 2024. Mmlu-pro: A more robust and challenging multi-task language understanding benchmark. arXiv preprint arXiv:2406.01574.
[33]
Wanjun Zhong, Ruixiang Cui, Yiduo Guo, Yaobo Liang, Shuai Lu, Yanlin Wang, Amin Saied, Weizhu Chen, and Nan Duan. 2024. https://doi.org/10.18653/v1/2024.findings-naacl.149. In Findings of the Association for Computational Linguistics: NAACL 2024, pages 2299–2314, Mexico City, Mexico. Association for Computational Linguistics.
[34]
Aohan Zeng, Zhengxiao Du, Mingdao Liu, Kedong Wang, Shengmin Jiang, Lei Zhao, Yuxiao Dong, and Jie Tang. 2024. Glm-4-voice: Towards intelligent and human-like end-to-end spoken chatbot. arXiv preprint arXiv:2412.02612.

  1. The full benchmark dataset will be released soon.

  2. https://platform.openai.com/docs/guides/text-to-speech

  3. https://github.com/iver56/audiomentations