April 02, 2024
Due to the substantial number of clinicians, patients, and data collection environments involved in clinical trials, collecting high-quality data poses a significant challenge. In clinical trials, patients are assessed based on their speech data to detect and monitor cognitive and mental health disorders. We propose using these speech recordings to verify the identities of enrolled patients and to identify and exclude individuals who attempt to enroll multiple times in the same trial. Since clinical studies are often conducted across different countries, creating a system that can perform speaker verification in diverse languages without additional development effort is imperative. We evaluate pre-trained TitaNet, ECAPA-TDNN, and SpeakerNet models by enrolling and testing with speech-impaired patients speaking English, German, Danish, Spanish, and Arabic. Our results demonstrate that the tested models can effectively generalize to clinical speakers, with less than 2.7% EER for European languages and 8.26% EER for Arabic. This represents a significant step toward developing more versatile and efficient speaker verification systems for cognitive and mental health clinical trials that can be used across a wide range of languages and dialects, substantially reducing the effort required to develop speaker verification systems for multiple languages. We also evaluate how the speech task and the number of speakers involved in the trial influence performance, and show that the type of speech task impacts model performance.
Extensive clinical trials span numerous patients, doctors, clinics, and even countries, making it hard to know whether an enrolled patient is unique. They are also usually conducted over a long period of time, requiring participants to visit the doctor’s office multiple times. [1] found that 7.78% of patients participating in large clinical trials were duplicated across different sites. Such duplication diminishes the quality of results of a drug development process worth billions of dollars. We propose using machine learning models for speaker verification (SV) to solve this problem in cognitive and mental health trials that record speech data for assessment.
Speaker verification is the process of comparing unknown speech utterances to speech signals belonging to known enrolled individuals to decide whether a new recording belongs to one of the known users. SV systems are used across industries including banking [2], [3], transportation [4], telecommunications [5], and healthcare [6], [7].
There are two types of SV systems: 1) text-dependent (TD), requiring the speaker to use the same text in enrollment and verification, and 2) text-independent (TI), without that constraint. TD systems dominated in the past because it was easier to verify users on repeated text [8]. Recently, TI SV models have become more common in practice since they only require audio and no transcripts. TI models are comparable in accuracy to TD models [9] when pre-trained on large amounts of audio samples [10]. However, when it comes to cross-lingual SV, obtaining enough training data for each language is difficult, especially for clinical studies in low-resource languages.
Traditionally, TI SV systems depended fairly strongly on the language being spoken [11] and required language-specific training or fine-tuning. However, training and deploying a separate model for each language is inefficient, increasing the operational costs and overheads in multilingual systems. These concerns are even more substantial in clinical settings when working with patients with abnormal speech patterns.
We propose using large TI SV models, pre-trained on speech from multiple languages, to verify speakers in clinical trials in zero-shot settings. Our target data comes from speakers with abnormal speech patterns. We also test the performance of these models on recordings from different speech tasks and characterize the relationship between SV quality and data properties (i.e., speech task and number of speakers).
In summary, the main contributions of this work are:
We propose using pre-trained end-to-end SV models to enroll and verify patients in cognitive and mental health clinical trials in zero-shot settings.
We test these models on the speech of English, German, Danish, Spanish, and Arabic patients and demonstrate that the models can effectively generalize to different languages and speech impairments, achieving high performance and helping solve the duplicate participant problem.
We show how performance changes when speech from different speech tasks is used, demonstrating that the speech task may influence EER.
For a long time, SV models used a Gaussian mixture model-universal background model (GMM-UBM) approach, where acoustic features (e.g., Mel-frequency cepstral coefficients (MFCC)) were used to generate a GMM for each speaker from a set of speech recordings [8]. GMM-UBM-based models are easily trained, require a small number of speakers, and are often more interpretable than other approaches. However, they assume the data is normally distributed, which limits their ability to learn more complex speaker representations. The accuracy of SV models is evaluated using the Equal Error Rate (EER) [12], the point at which the true positive rate (TPR) is equal to the true negative rate (TNR). The GMM-UBM approach, as detailed in [13], achieved an EER of approximately 9.25% on English datasets. However, these models are phoneme-dependent, making knowledge transfer between languages very challenging [8].
In recent years, significant advancements have been made in pre-trained deep-learning speaker verification (SV) models, achieving a remarkably low EER of 0.68% for English [14]. These models leverage the benefits of end-to-end semi-supervised learning, utilizing the more readily available data in languages like English to generate improved speaker embeddings. Moreover, such models have demonstrated the ability to learn intricate and sophisticated representations. For instance, a neural-network-based architecture successfully learned the SV task along with language detection [15]. However, these models tend to be less interpretable and require substantial computational resources and extensive datasets for training [16].
More recently, work has been done on assessing whether neural network-based models trained on a single source language (i.e., English) can be adapted to perform SV in another (target) language in zero-shot settings, without being fine-tuned or trained on that target language. For example, a convolutional time-delay deep neural network (CT-DNN) [17] trained on the Fisher dataset [18], containing 5,000 English speakers, achieves an EER of 3.71% at the task of TI SV in Chinese/Uyghur. While we also use zero-shot adaptation, we use larger pre-trained models, and our target data comes from speakers with abnormal speech patterns.
[9] shows that SV models trained on one language can be used for SV in other languages without language-specific fine-tuning by testing 46 different languages in zero-shot settings. However, the datasets used for training and evaluation in [9] contained specific keywords (i.e., ‘Hey Google’ and ‘Ok Google’), making the task partially TD. Their results might not generalize to speech that is unconstrained and fully TI. In addition, their model can only compare speech utterances of a fixed duration, which limits its applicability. The models we evaluate operate on speech of variable length.
Our work continues with the idea that a neural network-based TI SV model, pre-trained on large datasets, can be successfully applied for TI SV of speech-impaired patients in different languages. We evaluate the ability of TI SV systems to generalize by testing three models on three clinical trial datasets in five languages with variable length speech.
In this paper, we used speech recordings from three clinical trial studies (Table 1).
| Language | English (en) | German (de) | Danish (da) | Spanish (es) | Arabic (ar) |
|---|---|---|---|---|---|
| Dataset | ADCT | CSMCI | CSMCI | CSMCI | SCZCS |
| # Speakers | 659 | 29 | 69 | 43 | 30 |
| # Samples | 7084 | 135 | 483 | 157 | 1192 |
| Avg. # Samples per Speaker | 10.7 \(\pm\) 7.0 | 4.7 \(\pm\) 2.5 | 7.0 \(\pm\) 3.6 | 3.6 \(\pm\) 1.0 | 39.7 \(\pm\) 12.0 |
| Avg. Dur. of Audio (sec) | 69.31 | 150.41 | 134.61 | 106.23 | 39.41 |
| Avg. Dur. of Speech (sec) | 37.30 | 110.07 | 89.31 | 74.46 | 21.88 |
Alzheimer’s Disease Clinical Trial (ADCT) is a dataset that was collected during a clinical trial involving patients with mild to moderate Alzheimer’s disease (AD). This is a longitudinal dataset of speech recordings of English speakers performing a picture description task [19], [20], as well as phonemic verbal fluency [21] and semantic (categorical) verbal fluency [22] tasks every 12 weeks for their 48-week treatment period. All participants were confirmed to have a clinical diagnosis of AD based on the National Institute on Aging/Alzheimer’s Association (NIA-AA) criteria [23].
Clinical Study of Mild Cognitive Impairment (CSMCI) is a dataset that was collected during the clinical trial involving patients with mild cognitive impairment (MCI) and early AD. This is a longitudinal dataset of speech recordings of German, Danish, and Spanish speakers performing picture description tasks [19], [20] every 12 weeks for the 96-week treatment period. All participants have a clinical diagnosis of either MCI or mild AD according to NIA-AA criteria [23].
Schizophrenia Clinical Study (SCZCS) is a dataset that was collected during a clinical trial involving patients with a diagnosis of schizophrenia (SCZ) based on the Diagnostic and Statistical Manual of Mental Disorders, Fifth Edition (DSM-5) [24]. This is a longitudinal dataset of speech recordings of Arabic speakers performing picture description [19], [20], phonemic fluency, semantic fluency, paragraph reading & recall, and journaling tasks. These tasks were conducted on a monthly basis for a duration of 6 months.
There are substantially more samples and more speakers in the ADCT dataset than in CSMCI and SCZCS datasets (Table 1). The SCZCS dataset has the most samples per speaker, with the smallest average duration of speech. The CSMCI dataset has the longest average duration of speech, with the smallest average number of samples per speaker.
In each trial, the subjects carried out a set of standardized speech tasks in every recording session. These speech tasks were used in multiple previous studies [25]–[28] due to their ability to elicit speech patterns that can be examined for acoustic and linguistic characteristics associated with mental and cognitive health conditions:
Picture Description Task: The subject was presented with a static image depicting an event and was then asked to describe the scenario in their own words. Such tasks have been demonstrated to serve as reliable substitutes for spontaneous discourse [29]. Describing a picture was determined to be an effective speech task for eliciting situations requiring a higher level of cognitive effort and resulting in noticeable changes in speech, which can then be utilized to identify cognitive disorders such as AD or MCI [30]. In all studies, proprietary images were employed. They were designed to match the style and content of the well-researched ‘Cookie theft’ picture [19]. The guiding principles utilized to develop these pictures were based on the core design guidelines described in [31].
Phonemic Fluency Task: The FAS (‘F’, ‘A’, ‘S’) task [21], specifically focusing on the letter ‘F’, was employed to evaluate phonemic verbal fluency in participants. Participants were asked to name as many unique words starting with the letter ‘F’ as they could in one minute. This type of assessment has been extensively utilized in diverse populations, including individuals afflicted with AD [32].
Semantic Fluency Task: To evaluate semantic (categorical) fluency, participants were asked to list as many different animals [22], household objects, or food items as they could think of in one minute. This assessment has been widely used in a variety of populations, including AD patients [32].
Paragraph Reading & Recall Task: This task involves an individual reading a short story at the beginning of the session [33], [34]. Participants were asked to read one of three standard paragraphs, each containing the same number of details and information content units. They were asked to recall information about the story just after reading and again at the end of the session. It is observed that people who have SCZ show deficiencies in memory recall [35], making these tasks good indicators for measuring SCZ.
Journaling Task: Participants were provided with a prompt and asked to create a narrative in response. Prompts were open-ended, allowing participants to provide as much or as little detail as they chose. An example prompt could be "What did you do yesterday?". This task is used to assess the participants’ emotions, mental health, and verbal ability [36], [37].
In this study, we evaluated three state-of-the-art SV models (Table 2), pre-trained on a mix of languages, on speech from speech-impaired patients speaking English, German, Danish, Spanish, and Arabic in zero-shot settings. All models incorporate a combination of 1D convolutions, batch normalization (BN), and Rectified Linear Unit (ReLU) activations to learn speech representations. We used the NVIDIA NeMo (Neural Modules) toolkit 1 implementations of all models.
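As an illustration, a speaker embedding can be extracted from a pre-trained NeMo checkpoint roughly as in the sketch below. The checkpoint name and audio file path are examples only; the exact pre-trained models available depend on the NeMo release being used.

```python
# Minimal sketch of speaker-embedding extraction with NVIDIA NeMo.
# The checkpoint name and the audio path are illustrative assumptions.
import torch
from nemo.collections.asr.models import EncDecSpeakerLabelModel

# Load a pre-trained speaker verification model (e.g., TitaNet-Large).
model = EncDecSpeakerLabelModel.from_pretrained(model_name="titanet_large")
model.eval()

# Extract a fixed-size speaker embedding from a (hypothetical) recording.
with torch.no_grad():
    embedding = model.get_embedding("session_001_picture_description.wav")

print(embedding.shape)  # one embedding vector per audio file
```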
The SpeakerNet model [38] is based on the QuartzNet ASR (Automatic Speech Recognition) architecture, which consists of an encoder and a decoder. This model is smaller in size compared to the other two models. It consists of 5M parameters, making it very efficient at quickly generating speaker embeddings. It was trained on the VoxCeleb 1 [39] and VoxCeleb 2 [40] datasets.
The TitaNet model [14] architecture is similar to SpeakerNet’s but five times larger in terms of the number of parameters. It also uses squeeze-excitation (SE) blocks and a global average pooling layer after each SE block. Compared to the other two models, it shows the best reported EER on the VoxCeleb1 [39] dataset, but it is slower at generating speaker embeddings. This model was trained on the VoxCeleb 1, VoxCeleb 2 [40], Fisher [18], and Switchboard [41] datasets.
| Model | # of Params | # of Spkrs |
|---|---|---|
| SpeakerNet | 5M | 7,205 |
| TitaNet | 25.3M | 16,681 |
| ECAPA-TDNN | 22M | 14,343 |
Finally, the ECAPA-TDNN model [42] is a time-delay neural network-based model specifically designed for SV tasks. Like TitaNet, it uses SE blocks. ECAPA-TDNN has been shown to achieve performance similar to TitaNet’s, although it has fewer parameters and was trained on less data, specifically subsets of the VoxCeleb 1 [39], VoxCeleb 2 [40], Fisher [18], and Switchboard [41] datasets.
We separately evaluated each SV model on the ADCT (en), CSMCI (es), CSMCI (de), CSMCI (da), and SCZCS (ar) datasets. First, we generated embeddings of all audio files in each dataset within the same language. Then, we joined those embeddings to create a set of positive and negative pair tuples. The positive tuples comprised pairs of embeddings where enrollment and test speech came from the same speaker. There were \(\sum_{i=1}^m {n_i \choose 2}\) positive tuples, where \(m\) is the number of speakers and \(n_i\) is the number of speech recordings for the i-th speaker within the same dataset and language. The negative tuples were pairs of embeddings of speech belonging to different speakers within the same dataset and language. There were \(\frac{1}{2}\sum_{i=1}^m n_i\,(N-n_i)\) negative pairs, where \(N\) is the total number of audio files within the same dataset and language.
Once we had all positive and negative pair tuples, we calculated the cosine similarity between the vector embeddings in each tuple. Then, we calculated EER by balancing the true positive rate and the true negative rate. If the cosine similarity was above this balance point, the tuple was predicted to belong to the same speaker. Otherwise, the two embeddings were predicted to belong to different individuals.
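The following sketch illustrates this pairing and scoring procedure. It assumes speaker embeddings have already been extracted (e.g., as in the NeMo snippet above); the `embeddings` dictionary and the function names are illustrative, not part of our pipeline.

```python
# Illustrative sketch of the pairing and EER computation described above.
# Assumes `embeddings` maps each speaker ID to a list of embedding vectors.
from itertools import combinations
import numpy as np
from sklearn.metrics import roc_curve

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def score_pairs(embeddings):
    scores, labels = [], []
    speakers = list(embeddings)
    # Positive pairs: all combinations of recordings from the same speaker.
    for spk in speakers:
        for e1, e2 in combinations(embeddings[spk], 2):
            scores.append(cosine_similarity(e1, e2))
            labels.append(1)
    # Negative pairs: recordings from two different speakers.
    for s1, s2 in combinations(speakers, 2):
        for e1 in embeddings[s1]:
            for e2 in embeddings[s2]:
                scores.append(cosine_similarity(e1, e2))
                labels.append(0)
    return np.array(scores), np.array(labels)

def equal_error_rate(scores, labels):
    # EER is the error rate at the threshold where the false positive rate
    # equals the false negative rate (i.e., where TPR equals TNR).
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1 - tpr
    idx = np.nanargmin(np.abs(fnr - fpr))
    return (fpr[idx] + fnr[idx]) / 2
```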
| Models | English Test Data | Non-English Test Data | English EER (%) | German EER (%) | Danish EER (%) | Spanish EER (%) | Arabic EER (%) |
|---|---|---|---|---|---|---|---|
| SpeakerNet [38] | ADCT | SCZCS and CSMCI | 4.99 | 0.52 | 1.16 | 4.29 | 14.97 |
| TitaNet [14] | ADCT | SCZCS and CSMCI | 3.10 | 0.50 | 0.57 | 0.67 | 9.42 |
| ECAPA-TDNN [42] | ADCT | SCZCS and CSMCI | 2.69 | 0.58 | 1.30 | 1.20 | 8.26 |
| | English (en) | German (de) | Danish (da) | Spanish (es) | Arabic (ar) |
|---|---|---|---|---|---|
| English | 0.0 | 31.3 | 24.6 | 59.3 | 85.5 |
| German | - | 0.0 | 28.3 | 56.8 | 76.3 |
| Danish | - | - | 0.0 | 51.4 | 84.9 |
| Spanish | - | - | - | 0.0 | 76.6 |
| Arabic | - | - | - | - | 0.0 |
It should also be noted that there were around \(m\) times as many negative tuples as positive tuples, since there are many more ways to form negative pairs than positive pairs.
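For intuition, under the simplifying assumption that each of the \(m\) speakers contributes the same number of recordings \(n\) (so \(N = mn\)), the counts above reduce to
\[
\#\text{pos} = m\binom{n}{2} = \frac{m\,n(n-1)}{2},
\qquad
\#\text{neg} = \frac{1}{2}\,m\,n\,(mn-n) = \frac{m(m-1)\,n^2}{2},
\]
so the ratio \(\#\text{neg}/\#\text{pos} = (m-1)\,n/(n-1) \approx m\) when \(m\) and \(n\) are not small.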
The datasets available to us consist mainly of European (Germanic and Romance) languages, which are also more common in the datasets used to pre-train the examined models. In addition, we examine performance on Arabic, which belongs to the Semitic language family and has different linguistic features and structures, including more complex grammar, additional sounds, and a right-to-left script. Similar to English, Arabic is spoken across many countries, with distinct dialects. We focus on Jordanian Arabic due to current data availability. Jordanian Arabic has its own phonetic variations and pronunciation patterns that differ from other Arabic dialects, as well as unique idioms, expressions, phrases, cultural influences, intonation, and accent. In [43], it is shown that verifying English speakers with accents is easier.
The proximity scores based on the eLinguistics metric [44] give additional information on how linguistically close studied languages are to each other (Table 4). For example, German and Danish have closer linguistic ties to English than Spanish does, while Arabic remains the most distinct.
TitaNet performed best on German, Danish, and Spanish, with ECAPA-TDNN following closely behind and SpeakerNet performing worst (Table 3). On the English and Arabic datasets, the best results were obtained with the ECAPA-TDNN model, with 2.69% and 8.26% EER, respectively. These results may be due to the fact that TitaNet and ECAPA-TDNN are larger models, pre-trained on a higher number of speakers than SpeakerNet (Table 2). TitaNet is also flexible with respect to variable-length speech recordings.
All tested models achieve better EER on the non-English European languages in the zero-shot setting than on English. While this may look like a surprising outcome, it was also observed by [9]. We hypothesize that the primary cause is the diversity of the ADCT English test dataset, which is significantly larger in terms of the number of speakers and samples. Additionally, it is the only dataset among those tested that includes a variety of speech tasks. To investigate these assumptions further, we examine the EER for different tasks (Table 6). Lastly, considering the widespread use of English globally and the resultant diversity among English speakers, we anticipate that the increase in EER may be partially attributable to the dataset containing a considerable number of individuals for whom English is not the first language, or who speak English with a range of accents.
All tested models exhibit notably poorer performance on Arabic compared to European languages, which aligns with our expectations. Several potential reasons account for this behavior.
First, there are the lexical and acoustic similarities between the source and target languages. As shown in Table 4, German, Danish, and English are closer to each other in terms of linguistic proximity, which could partially account for better performance in models trained on these languages. On the other hand, Arabic, with its higher eLinguistics value, poses a significant challenge due to its linguistic dissimilarity from the other languages. This might explain the observed drop in performance when dealing with Arabic, underlining the importance of considering linguistic distance in cross-lingual SV [45]. Second, the models have been exposed to substantially more data from the tested European languages and languages similar to them (e.g., French, Italian). In contrast, Arabic data is less prevalent in the pre-training datasets and is linguistically distinct from most languages in them. Additionally, there could be unique clinical patterns of speech impairment in Arabic that negatively affect model performance. Further research is needed to examine this hypothesis across speech tasks and cognitive diseases. Despite this, it is important to highlight that the ECAPA-TDNN model still demonstrates commendable performance, with an EER of 8.26%, making it a viable option for zero-shot SV in Arabic. Nevertheless, for optimal performance, we recommend fine-tuning ECAPA-TDNN with additional Arabic data.
The performance of the TitaNet and ECAPA-TDNN models is also compared across different languages (Table 5), but in these tests we controlled for confounding factors: we matched the number of speakers and the average number of samples per speaker across experiments by subsampling the datasets. TitaNet outperforms ECAPA-TDNN on German, Spanish, and Danish, with a slight drop in performance compared to the results on the data with the original number of speakers (Table 5 vs. Table 3). In addition, SV performance on German, Spanish, and Danish is still better than on the English dataset. This may be due to the differences in speech patterns between patients with MCI and AD [46].
| Language | Model | EER (%) | TPR | TNR | # Spkrs | # Smpls | Avg # Smpls per Spkr |
|---|---|---|---|---|---|---|---|
| English | TitaNet | 5.30 | 0.946 | 0.948 | 29 | 212 | 7.31 \(\pm\) 1.86 |
| | ECAPA-TDNN | 5.30 | 0.946 | 0.948 | | | |
| German | TitaNet | 0.50 | 1.000 | 0.990 | 29 | 135 | 4.66 \(\pm\) 2.54 |
| | ECAPA-TDNN | 0.58 | 0.994 | 0.994 | | | |
| Danish | TitaNet | 1.22 | 0.990 | 0.980 | 29 | 188 | 6.48 \(\pm\) 3.45 |
| | ECAPA-TDNN | 1.60 | 0.981 | 0.986 | | | |
| Spanish | TitaNet | 0.81 | 0.990 | 0.994 | 29 | 109 | 3.76 \(\pm\) 1.07 |
| | ECAPA-TDNN | 1.46 | 0.985 | 0.986 | | | |
| Speech Task | EER (%) | TPR | TNR | # Spkrs | # Smpls | Avg # Smpls per Spkr | Avg Dur of Audio (sec) | Avg Dur of Speech (sec) |
|---|---|---|---|---|---|---|---|---|
| Phonemic Fluency | 4.45 | 0.956 | 0.955 | 394 | 1641 | 4.24 \(\pm\) 0.94 | 49.32 \(\pm\) 10.51 | 19.02 \(\pm\) 9.26 |
| Semantic Fluency | 3.99 | 0.957 | 0.963 | 394 | 1643 | 4.25 \(\pm\) 0.95 | 51.58 \(\pm\) 9.30 | 21.33 \(\pm\) 9.89 |
| Picture Description | 2.11 | 0.978 | 0.980 | 397 | 2932 | 7.54 \(\pm\) 1.89 | 93.21 \(\pm\) 54.61 | 58.82 \(\pm\) 41.81 |
Many potential factors can influence SV results across different datasets or languages. In particular, we discuss the effect of three factors on SV performance: the type of mental disorder or cognitive impairment, the target language, and the data collection procedure of a dataset.
Type of the mental disorder or cognitive impairment may impact the ability to distinguish participants from one another. Participants in the SCZCS are diagnosed with schizophrenia, whereas ADCT and CSMCI patients have Alzheimer’s disease or MCI. It is known that the nature of speech may vary between AD, MCI, and schizophrenia [46].
Another potential factor influencing the ability to generalize across different languages is the lexical or acoustic similarity [45] between the source and target languages. Languages that have vocabulary or linguistic characteristics similar to the source language may benefit when performing TI SV. If the source and target languages are dissimilar, this may hinder the ability of a monolingual model to generalize to the target language. We believe this is the reason behind the worse performance on Arabic [45].
Finally, the data collection procedure can impact the quality and length of audio and, consequently, SV performance. The three models we used in this study can handle variable lengths of speech, but it is not known what length of speech is optimal. In addition, the influence of environmental noise and possible fragments of the examiner’s speech is not taken into consideration when analyzing the results.
When considering the applicability of these models to our primary use case, we see that TI SV models are a feasible approach that could be applied to detecting duplicate participants across studies. However, setting a language-specific threshold may be necessary to get the best results: in our experiments, a separate threshold was tuned for each language to reach the EER operating point. This need may be caused by dialect variability within languages and the proportion of non-native speakers. It is also evident that these models are sufficiently generalizable across languages without language-specific fine-tuning.
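A minimal sketch of how such a language-specific operating threshold could be derived from scored enrollment pairs is shown below; the data here are synthetic and the variable and function names are hypothetical, not part of our pipeline.

```python
# Hypothetical sketch: derive a per-language threshold at the point where
# false positive and false negative rates balance (the EER point), then
# reuse it for accept/reject decisions in that language. Data are synthetic.
import numpy as np
from sklearn.metrics import roc_curve

def eer_threshold(scores, labels):
    fpr, tpr, thresholds = roc_curve(labels, scores)
    idx = np.nanargmin(np.abs((1 - tpr) - fpr))
    return thresholds[idx]

rng = np.random.default_rng(0)
# Synthetic cosine scores standing in for held-out scored pairs per language.
scored_pairs = {
    "de": (np.concatenate([rng.normal(0.7, 0.10, 200), rng.normal(0.20, 0.10, 800)]),
           np.concatenate([np.ones(200), np.zeros(800)])),
    "ar": (np.concatenate([rng.normal(0.6, 0.15, 200), rng.normal(0.25, 0.15, 800)]),
           np.concatenate([np.ones(200), np.zeros(800)])),
}
language_thresholds = {lang: eer_threshold(s, y) for lang, (s, y) in scored_pairs.items()}

def same_speaker(score, lang):
    # Accept a pair as the same speaker if its score exceeds the language threshold.
    return score >= language_thresholds[lang]
```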
The TitaNet model is tested on ADCT (en) data across different speech tasks in Table 6. Picture description shows the lowest EER compared to the phonemic and semantic fluency tasks. These results suggest that picture description samples better characterize the differences between speakers in SV tasks used in clinical settings. It is also important to note that the average number of samples per speaker and the average durations of audio and speech are also the highest for the picture description task. In future work, we will examine in more detail the influence of each of these confounding factors.
This paper proposes the utilization of speech recordings for speaker verification in clinical trials, particularly trials for detecting and monitoring cognitive and mental health disorders. We demonstrated the effectiveness of the TitaNet, ECAPA-TDNN, and SpeakerNet models in zero-shot settings on speech-impaired patients speaking European languages and Arabic.
Our findings indicate that these TI SV models offer a promising solution to the issue of duplicate participants without requiring fine-tuning, achieving high performance across multiple languages. We also show that the type of speech task conducted impacts the performance of SV models. This insight can inform the design and selection of appropriate tasks to enhance the accuracy and reliability of speaker verification systems.
In future work, it would be valuable to establish a baseline by evaluating results from speakers without cognitive impairments in each language, providing a comparative measure of performance. Additionally, investigating the influence of specific diseases, noise, speech length, and the number of samples on SV model performance would deepen our understanding of the factors that affect accuracy.
Expanding the evaluation to include additional target languages would further validate the robustness and applicability of the SV models in diverse linguistic contexts. Lastly, exploring within-language performance variations, particularly in languages with distinct dialects, such as Arabic, would provide insights into the challenges and opportunities associated with dialectal variations in speaker verification.
As with any research involving human subjects, ethical considerations must be taken into account when analyzing and reporting results. In the context of SV research, the following ethical considerations are relevant.
Speaker verification systems rely on biometric data, which can raise privacy concerns. It is crucial to ensure that the privacy of the enrolled speakers and their sensitive personal information is protected throughout the process. Our research used clinical trial datasets containing recordings from patients with Alzheimer’s disease, mild cognitive impairment, and schizophrenia. Due to the sensitivity of the data, we made sure to conceal the identities of the individuals in the study and refrained from using personally identifiable information.
Speaker verification systems have the potential to reinforce biases in society if not properly designed and evaluated. For instance, if the training data is not representative of the diverse population, the resulting system may not perform equally well across all groups. We cannot ensure the diversity of the data used to train the models, as we do not have access to the demographic characteristics of those datasets, nor to other sensitive information about the participants enrolled in the trials.
Speaker verification systems can be misused for unethical purposes, such as identity theft or surveillance. It is important to emphasize that our research is focused on developing and evaluating SV models for legitimate purposes, i.e., solving the duplicate participant problem.
Applying speaker verification (SV) to cognitive and mental health data comes with certain domain-specific limitations. First, the variability in speech patterns among individuals with cognitive and mental health disorders can lead to variations in speech characteristics and affect the performance of SV models, as they are typically trained on relatively standardized speech data. Additionally, the way cognitive impairment from the same disease manifests may vary with language, and those patterns are understudied for non-English languages. Impairment patterns have also been shown to differ across diseases. Consequently, the EER of SV systems may differ when applied to populations with cognitive and mental health disorders, which is examined in [43].
Another limitation arises from the limited availability of data. Obtaining large amounts of speech data from individuals with specific cognitive and mental health disorders is challenging. Datasets contain different numbers of speakers at each severity level, the impact of which is shown in [43]. Additionally, the type of speech task performed by individuals in clinical trials can influence speech characteristics and, subsequently, SV model performance, as discussed in this paper. It is also important to consider the presence of co-occurring disorders among individuals with cognitive and mental health disorders. Comorbid conditions can introduce additional variability in speech patterns, making it harder for SV models to accurately distinguish between speakers.
https://github.com/NVIDIA/NeMo↩︎