Zero-Shot Multi-Lingual Speaker Verification in Clinical Trials


Abstract

Due to the substantial number of clinicians, patients, and data collection environments involved in clinical trials, gathering data of consistently high quality poses a significant challenge. In clinical trials, patients are assessed based on their speech data to detect and monitor cognitive and mental health disorders. We propose using these speech recordings to verify the identities of enrolled patients and to identify and exclude individuals who try to enroll multiple times in the same trial. Since clinical studies are often conducted across different countries, creating a system that can perform speaker verification in diverse languages without additional development effort is imperative. We evaluate pre-trained TitaNet, ECAPA-TDNN, and SpeakerNet models by enrolling and testing with speech-impaired patients speaking English, German, Danish, Spanish, and Arabic. Our results demonstrate that the tested models can effectively generalize to clinical speakers, with less than 2.7% EER for European languages and 8.26% EER for Arabic. This represents a significant step in developing more versatile and efficient speaker verification systems for cognitive and mental health clinical trials that can be used across a wide range of languages and dialects, substantially reducing the effort required to develop speaker verification systems for multiple languages. We also evaluate how the speech tasks and the number of speakers involved in a trial influence performance, and show that the type of speech task impacts model performance.


1 Introduction

Extensive clinical trials span numerous patients, doctors, clinics, and even countries, making it hard to know whether an enrolled patient is unique. They are also usually conducted over a long period of time, requiring participants to visit the doctor’s office multiple times. [1] found that 7.78% of patients participating in large clinical trials were duplicated across different sites. This diminishes the quality of results in a drug development process worth billions of dollars. We propose using machine learning models for speaker verification (SV) to solve this problem in cognitive and mental health trials that record speech data for assessment.

Speaker verification is the process of comparing unknown speech utterances to speech signals belonging to known enrolled individuals to decide if a new recording belongs to one of the known users. SV systems are used across industries such as banking [2], [3], transportation [4], telecommunications [5], and healthcare [6], [7].

There are two types of SV systems: 1) text-dependent (TD), requiring the speaker to use the same text in enrollment and verification, and 2) text-independent (TI), without such constraints. TD systems dominated in the past because it was easier to verify users on repeated text [8]. Recently, TI SV models have become more common in practice since they only require audio, with no need for transcripts. TI models are comparable in accuracy to TD models [9] when pre-trained on large amounts of audio samples [10]. However, when it comes to cross-lingual SV, obtaining enough training data for each language is difficult, especially for clinical studies in low-resource languages.

Traditionally, TI SV systems depended fairly strongly on the language being spoken [11] and required language-specific training or fine-tuning. However, training and deploying a separate model for each language is inefficient, increasing the operational costs and overhead of multilingual systems. These concerns are even more substantial in clinical settings when working with patients with abnormal speech patterns.

We propose using large TI SV models pre-trained on speech from multiple languages to verify speakers in clinical trials in zero-shot settings. Our target data comes from speakers with abnormal speech patterns. We also test the performance of these models on speech recordings across different speech tasks and characterize the relation between the quality of SV and data properties (i.e., speech tasks and number of speakers).

In summary, the main contributions of this work are:

  • We propose using pre-trained end-to-end SV models to enroll and verify patients in cognitive and mental health clinical trials in zero-shot settings.

  • We test those models on the speech of English, German, Danish, Spanish, and Arabic patients and demonstrate that the models can effectively generalize to different languages and speech impairments, achieving high performance and helping solve the duplicate participant problem.

  • We show how performance changes across different speech tasks, demonstrating that the type of speech task may influence EER.

2 Related Work

For a long time, SV models used a Gaussian mixture model universal background model (GMM-UBM) approach, where acoustic features (e.g., Mel frequency cepstral coefficients (MFCC)) were used to generate a GMM for each speaker from a set of speech recordings [8]. GMM-UBM-based models are easily trained, require a small number of speakers, and are often more interpretable than other approaches. However, they assume that the data is normally distributed, which limits their ability to learn more complex speaker representations. The accuracy of SV models is evaluated using the Equal Error Rate (EER) [12], which is the point at which the true positive rate (TPR) equals the true negative rate (TNR). The GMM-UBM approach, as detailed in [13], achieved an EER of approximately 9.25% on English datasets. However, those models are phoneme-dependent, making knowledge transfer between languages very challenging [8].

In recent years, significant advancements have been made in pre-trained deep-learning speaker verification (SV) models, achieving a remarkably low EER of 0.68% for the English language [14]. These models leverage the benefits of end-to-end semi-supervised learning, utilizing more readily available data in languages like English to generate improved speaker embeddings. Moreover, such models have demonstrated the ability to learn intricate and sophisticated representations. For instance, a neural-network-based architecture successfully learned the SV task along with language detection [15]. However, these models tend to be less interpretable and require substantial computational resources and extensive datasets for training [16].

More recently, work has been done on assessing whether neural network-based models trained on a single source language (i.e., English) can be adapted to perform SV in another (target) language in zero-shot settings, without being fine-tuned or trained on that target language. For example, a convolutional time-delay deep neural network (CT-DNN) [17] trained on the Fisher dataset [18], containing 5,000 English speakers, achieves an EER of 3.71% on TI SV in Chinese/Uyghur. While we also use zero-shot adaptation, we use larger pre-trained models, and our target data comes from speakers with abnormal speech patterns.

[9] shows that SV models trained on one language can be used for SV in other languages without language-specific fine-tuning, by testing 46 different languages in zero-shot settings. However, the datasets used for training and evaluation in [9] contained specific keywords (i.e., ‘Hey Google’ and ‘Ok Google’), making the task partially TD. Their results might not generalize to speech that is unconstrained and fully TI. In addition, their model can only compare speech utterances of a fixed duration, which limits its applicability. The models we evaluate handle speech of variable length.

Our work builds on the idea that a neural network-based TI SV model, pre-trained on large datasets, can be successfully applied to TI SV of speech-impaired patients in different languages. We evaluate the ability of TI SV systems to generalize by testing three models on three clinical trial datasets covering five languages with variable-length speech.

3 Methods

3.1 Datasets

In this paper, we used speech recordings from three clinical trial studies (Table 1).

Table 1: Speech datasets used in this work. ‘# Speakers’ denotes the number of speakers in each dataset. ‘# Samples’ denotes the number of audio samples in each dataset.
Language                    English        German        Danish        Spanish       Arabic
                            (en)           (de)          (da)          (es)          (ar)
Dataset                     ADCT           CSMCI         CSMCI         CSMCI         SCZCS
# Speakers                  659            29            69            43            30
# Samples                   7084           135           483           157           1192
Avg. # Samp. per Speaker    10.7\(\pm\)7.0   4.7\(\pm\)2.5   7.0\(\pm\)3.6   3.6\(\pm\)1.0   39.7\(\pm\)12.0
Avg. Dur. of Audio (sec)    69.31          150.41        134.61        106.23        39.41
Avg. Dur. of Speech (sec)   37.30          110.07        89.31         74.46         21.88

Alzheimer’s Disease Clinical Trial (ADCT) is a dataset that was collected during a clinical trial involving patients with mild to moderate Alzheimer’s disease (AD). This is a longitudinal dataset of speech recordings of English speakers performing a picture description task [19], [20], as well as phonemic verbal fluency [21] and semantic (categorical) verbal fluency [22] tasks every 12 weeks for their 48-week treatment period. All participants were confirmed to have a clinical diagnosis of AD based on the National Institute on Aging/Alzheimer’s Association (NIA-AA) criteria [23].

Clinical Study of Mild Cognitive Impairment (CSMCI) is a dataset that was collected during the clinical trial involving patients with mild cognitive impairment (MCI) and early AD. This is a longitudinal dataset of speech recordings of German, Danish, and Spanish speakers performing picture description tasks [19], [20] every 12 weeks for the 96-week treatment period. All participants have a clinical diagnosis of either MCI or mild AD according to NIA-AA criteria [23].

Schizophrenia Clinical Study (SCZCS) is a dataset that was collected during a clinical trial involving patients with a diagnosis of schizophrenia (SCZ) based on the Diagnostic and Statistical Manual of Mental Disorders, Fifth Edition (DSM-5) [24]. This is a longitudinal dataset of speech recordings of Arabic speakers performing picture description [19], [20], phonemic fluency, semantic fluency, paragraph reading & recall, and journaling tasks. These tasks were conducted on a monthly basis for a duration of 6 months.

There are substantially more samples and more speakers in the ADCT dataset than in CSMCI and SCZCS datasets (Table 1). The SCZCS dataset has the most samples per speaker, with the smallest average duration of speech. The CSMCI dataset has the longest average duration of speech, with the smallest average number of samples per speaker.

3.2 Speech tasks

In each trial, the subjects carried out a set of standardized speech tasks in every recording session. These speech tasks have been used in multiple previous studies [25]–[28] due to their ability to elicit speech patterns that can be examined for acoustic and linguistic characteristics associated with mental and cognitive health conditions:

  • Picture Description Task: The subject was presented with a static image depicting an event and was then asked to describe the scenario in their own words. Such tasks have been demonstrated to serve as reliable substitutes for spontaneous discourse [29]. Describing a picture was determined to be an effective speech task for eliciting situations requiring a higher level of cognitive effort and resulting in noticeable changes in speech, which can then be utilized to identify cognitive disorders such as AD or MCI [30]. In all studies, proprietary images were employed. They were designed to match the style and content of the well-researched ‘Cookie theft’ picture [19]. The guiding principles utilized to develop these pictures were based on the core design guidelines described in [31].

  • Phonemic Fluency Task: The FAS (‘F’, ‘A’, ‘S’) task [21], specifically focusing on the letter ‘F’, was employed to evaluate phonemic verbal fluency in participants. Participants were asked to name as many unique words starting with the letter ‘F’ as they could in one minute. This type of assessment has been extensively utilized in diverse populations, including individuals afflicted with AD [32].

  • Semantic Fluency Task: To evaluate semantic (categorical) fluency, participants were asked to list as many different animals [22], household objects, or food items as they could think of in one minute. This assessment has been widely used in a variety of populations, including AD patients [32].

  • Paragraph Reading & Recall Task: This task involves an individual reading a short story at the beginning of the session [33], [34]. Participants were asked to read one of three standard paragraphs, each containing the same number of details and information content units. They were asked to recall information about the story just after reading it and again at the end of the session. It has been observed that people who have SCZ show deficiencies in memory recall [35], making these tasks good indicators for measuring SCZ.

  • Journaling Task: Participants were provided with a prompt and asked to create a narrative in response. Prompts were open-ended, allowing participants to provide as much or as little detail as they chose. An example prompt is “What did you do yesterday?”. This task is used to assess the participants’ emotions, mental health, and verbal ability [36], [37].

3.3 Models

In this study, we evaluated three state-of-the-art SV models (Table 2), pre-trained on a mix of languages, on speech from speech-impaired patients speaking English, German, Danish, Spanish, and Arabic in zero-shot settings. All models incorporate a combination of 1D convolution, batch normalization (BN), and Rectified Linear Unit (ReLU) layers to learn speech representations. We used the NVIDIA NeMo (Neural Modules) toolkit 1 implementations of all models.
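As an illustration, the snippet below is a minimal sketch of how a pre-trained NeMo speaker model can be loaded and used to extract an embedding from a recording; the checkpoint names and the audio file path are assumptions based on the publicly released NeMo speaker models, not necessarily the exact configuration used in this study.

```python
# Minimal sketch of embedding extraction with NeMo; checkpoint names and the
# audio path are illustrative assumptions, not the study's exact setup.
import nemo.collections.asr as nemo_asr

# Load a pre-trained speaker verification model (e.g., TitaNet-Large).
speaker_model = nemo_asr.models.EncDecSpeakerLabelModel.from_pretrained(
    model_name="titanet_large"  # alternatives: "ecapa_tdnn", "speakerverification_speakernet"
)

# Extract a fixed-dimensional speaker embedding from a variable-length recording.
embedding = speaker_model.get_embedding("session_recording.wav")
```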

The SpeakerNet model [38] is based on the QuartzNet ASR (Automatic Speech Recognition) architecture, which consists of an encoder and a decoder. With 5M parameters, it is smaller than the other two models, making it very efficient at quickly generating speaker embeddings. It was trained on the VoxCeleb 1 [39] and VoxCeleb 2 [40] datasets.

The TitaNet model [14] architecture is similar to SpeakerNet’s but about five times larger in number of parameters. It also uses squeeze-and-excitation (SE) blocks and a global average pooling layer after each SE block. Compared to the other two models, it reports the best EER on the VoxCeleb 1 [39] dataset, but it is slower at generating speaker embeddings. This model was trained on the VoxCeleb 1, VoxCeleb 2 [40], Fisher [18], and Switchboard [41] datasets.

Table 2: Comparison of models by size and by the number of speakers in their training data. The number of parameters is represented in millions (M).
Model        # of Params   # of Spkrs
SpeakerNet   5M            7,205
TitaNet      25.3M         16,681
ECAPA-TDNN   22M           14,343

Finally, the ECAPA-TDNN model [42] is a time-delay neural network-based model designed specifically for SV tasks. Like TitaNet, it uses SE blocks. ECAPA-TDNN has been shown to achieve performance similar to TitaNet’s, although it has fewer parameters and was trained on less data, specifically a subset of the VoxCeleb 1 [39], VoxCeleb 2 [40], Fisher [18], and Switchboard [41] datasets.

We separately evaluated each SV model on the ADCT (en), CSMCI (es), CSMCI (de), CSMCI (da), and SCZCS (ar) datasets. First, we generated embeddings of all audio files in each dataset within the same language. Then, we combined those embeddings to create a set of positive and negative pair tuples. The positive tuples comprised pairs of embeddings where the enrollment and test speech came from the same speaker. There were \(\sum_{i=1}^m {n_i \choose 2}\) positive tuples, where \(m\) is the number of speakers and \(n_i\) is the number of speech recordings for the i-th speaker within the same dataset and language. The negative tuples were pairs of embeddings of speech belonging to different speakers within the same dataset and language. There were \(\frac{1}{2}\sum_{i=1}^m n_i (N - n_i)\) negative pairs, where \(N\) is the total number of audio files within the same dataset and language.
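The pair construction described above can be sketched as follows; `embeddings_by_speaker`, mapping each speaker ID to the embeddings of that speaker's recordings within one dataset and language, is a hypothetical structure introduced only for illustration.

```python
from itertools import combinations

def build_trial_pairs(embeddings_by_speaker):
    """Build positive (same-speaker) and negative (cross-speaker) embedding pairs.

    embeddings_by_speaker: hypothetical dict mapping a speaker ID to a list of
    embeddings of that speaker's recordings within one dataset and language.
    """
    positives, negatives = [], []
    speakers = list(embeddings_by_speaker)
    # Positive pairs: all within-speaker combinations, sum_i C(n_i, 2) in total.
    for spk in speakers:
        positives.extend(combinations(embeddings_by_speaker[spk], 2))
    # Negative pairs: all cross-speaker combinations, (1/2) * sum_i n_i * (N - n_i) in total.
    for spk_a, spk_b in combinations(speakers, 2):
        for emb_a in embeddings_by_speaker[spk_a]:
            for emb_b in embeddings_by_speaker[spk_b]:
                negatives.append((emb_a, emb_b))
    return positives, negatives
```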

Once we had all positive and negative pair tuples, we calculated the cosine similarity between the embeddings in each tuple. We then calculated the EER by finding the similarity threshold at which the true positive rate equals the true negative rate. If the cosine similarity of a pair was above this threshold, the tuple was predicted to belong to the same speaker; otherwise, the two embeddings were predicted to belong to different individuals.
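A possible implementation of this scoring step, assuming NumPy arrays of cosine-similarity scores and binary same-speaker labels, is sketched below; the EER is read off at the threshold where the false positive and false negative rates cross, which is equivalent to the point where TPR equals TNR.

```python
import numpy as np
from sklearn.metrics import roc_curve

def cosine_similarity(a, b):
    # Cosine similarity between two embedding vectors.
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def compute_eer(scores, labels):
    """scores: cosine similarities for all pairs; labels: 1 = same speaker, 0 = different."""
    fpr, tpr, thresholds = roc_curve(labels, scores)
    fnr = 1.0 - tpr
    idx = np.nanargmin(np.abs(fnr - fpr))   # operating point where FPR and FNR cross
    eer = (fpr[idx] + fnr[idx]) / 2.0       # equal error rate at that threshold
    return eer, thresholds[idx]
```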

Table 3: Benchmarking SV models across English, German, Danish, Spanish, and Arabic languages based on EER% metric. All models are evaluated with cosine similarity. The best (lowest) EER achieved on each dataset is represented in bold. The ADCT, SCZCS, and CSMCI datasets of speech-impaired patients are proprietary datasets we collected for testing the TI SV models.
Models            English     Non-English       Performance on Different Languages (EER%)
                  Test Data   Test Data         English   German   Danish   Spanish   Arabic
SpeakerNet [38]   ADCT        SCZCS and CSMCI   4.99      0.52     1.16     4.29      14.97
TitaNet [14]      ADCT        SCZCS and CSMCI   3.10      0.50     0.57     0.67      9.42
ECAPA-TDNN [42]   ADCT        SCZCS and CSMCI   2.69      0.58     1.30     1.20      8.26
Table 4: The eLinguistics metric evaluating the proximity among the studied languages. A lower score indicates a higher degree of similarity. The metric is symmetric, so only values above the diagonal are shown for clarity.
          English   German   Danish   Spanish   Arabic
          (en)      (de)     (da)     (es)      (ar)
English   0.0       31.3     24.6     59.3      85.5
German    -         0.0      28.3     56.8      76.3
Danish    -         -        0.0      51.4      84.9
Spanish   -         -        -        0.0       76.6
Arabic    -         -        -        -         0.0

It should also be noted that there were around \(m\) times as many negative tuples as positive tuples, because there are many more ways of forming cross-speaker pairs than within-speaker pairs.

3.4 Language Selection

The datasets available to us mainly consist of European (Germanic and Romance) languages, which are also more common in the datasets used to pre-train the examined models. In addition, we examine performance on Arabic, which belongs to the Semitic language family and has different linguistic features and structures, including more complex grammar, additional sounds, and a right-to-left script. Similar to English, Arabic is spoken across many countries, with distinct dialects. We focus on Jordanian Arabic due to current data availability. Jordanian Arabic has its own phonetic variations and pronunciation patterns that differ from other Arabic dialects. It also has unique idioms, expressions, phrases, cultural influences, intonation, and accent. In [43], it is shown that verifying English speakers with an accent is easier.

The proximity scores based on the eLinguistics metric [44] give additional information on how linguistically close the studied languages are to each other (Table 4). For example, German and Danish have closer linguistic ties to English than Spanish does, while Arabic remains the most distinct.

4 Results

4.1 Benchmarking Speaker Verification Models across Different Languages

TitaNet performed best on German, Danish, and Spanish, with ECAPA-TDNN following closely behind and SpeakerNet performing worst (Table 3). The outcomes on the English and Arabic datasets were best when using the ECAPA-TDNN model, with 2.69% and 8.26% EER, respectively. These results may be due to the fact that TitaNet and ECAPA-TDNN are larger models, pre-trained on a higher number of speakers than SpeakerNet (Table 2). Also, TitaNet is flexible with respect to variable-length speech recordings.

All tested models achieve a lower (better) EER in the zero-shot setting on the non-English European languages than on English. While this may look like a surprising outcome, it was also observed by [9]. We hypothesize that the primary cause is the diversity of the ADCT English test dataset, which is significantly larger in terms of the number of speakers and samples. Additionally, it is the only dataset among those tested that includes a variety of speech tasks. To investigate these assumptions further, we examine the EER for different tasks (Table 6). Lastly, considering the widespread use of English globally and the resulting diversity among English speakers, the increase in EER may be partially attributable to the dataset containing a considerable number of individuals for whom English is not the first language, or who speak English with a range of accents.

All tested models exhibit notably poorer performance on Arabic compared to European languages, which aligns with our expectations. Several potential reasons account for this behavior.

First, there are the lexical and acoustic similarities between the source and target languages. As shown in Table 4, languages like German, Danish, and English are closer to each other in terms of linguistic proximity, which could partially account for better performance in models trained on these languages. On the other hand, Arabic, with its higher eLinguistics values, poses a significant challenge due to its linguistic dissimilarity from the other languages. This might explain the observed drop in performance when dealing with Arabic, underlining the importance of considering linguistic distance in cross-lingual SV [45]. Second, the models have been exposed to substantially more data from the tested European languages and others similar to them (e.g., French, Italian). In contrast, Arabic data is less prevalent in the pre-training datasets and is linguistically distinct from most languages they contain. Additionally, there could be unique clinical patterns of speech impairment in Arabic that negatively affect model performance; further research is needed to examine this hypothesis across speech tasks and cognitive diseases. Despite this, it is crucial to highlight that the ECAPA-TDNN model still demonstrates commendable performance, with an EER of 8.26%, making it a viable option for zero-shot SV in Arabic. Nevertheless, for optimal performance, we recommend fine-tuning ECAPA-TDNN with additional Arabic data.

The performance of the TitaNet and ECAPA-TDNN models is also compared across different languages while controlling for confounding factors (Table 5). We matched the number of speakers and the average number of samples per speaker across the experiments by sampling the datasets. TitaNet outperforms ECAPA-TDNN on German, Spanish, and Danish, with a slight drop in performance compared to the results on the data with the original number of speakers (Table 5 vs. Table 3). In addition, the SV performance on German, Spanish, and Danish is still better than on the English dataset. This may be due to the differences in speech patterns between patients with MCI and AD [46].

Table 5: Benchmarking SV models across different languages with a balanced number of subjects per dataset. The best (lowest) EER% for each language is bold. ‘#Spkrs’ denotes the number of speakers participating in each dataset. ‘#Smpls’ denotes the number of audio recordings per dataset.
Language   Models       EER(%)   TPR | TNR       #Spkrs   #Smpls   Avg #Smpls per Spkr
English    TitaNet      5.30     0.946 | 0.948   29       212      7.31 \(\pm\) 1.86
           ECAPA-TDNN   5.30     0.946 | 0.948
German     TitaNet      0.50     1.000 | 0.990   29       135      4.66 \(\pm\) 2.54
           ECAPA-TDNN   0.58     0.994 | 0.994
Danish     TitaNet      1.22     0.990 | 0.980   29       188      6.48 \(\pm\) 3.45
           ECAPA-TDNN   1.60     0.981 | 0.986
Spanish    TitaNet      0.81     0.990 | 0.994   29       109      3.76 \(\pm\) 1.07
           ECAPA-TDNN   1.46     0.985 | 0.986
Table 6: Benchmarking TitaNet SV model across different speech tasks on the ADCT English dataset. ‘#Spkrs’ denotes the number of speakers participating in each speech task. ‘#Smpls’ denotes the number of audio recordings per task.
Speech Task           EER(%)   TPR | TNR       #Spkrs   #Smpls   Avg #Smpls per Spkr   Avg Dur of Audio (sec)   Avg Dur of Speech (sec)
Phonemic Fluency      4.45     0.956 | 0.955   394      1641     4.24 \(\pm\) 0.94     49.32 \(\pm\) 10.51      19.02 \(\pm\) 9.26
Semantic Fluency      3.99     0.957 | 0.963   394      1643     4.25 \(\pm\) 0.95     51.58 \(\pm\) 9.30       21.33 \(\pm\) 9.89
Picture Description   2.11     0.978 | 0.980   397      2932     7.54 \(\pm\) 1.89     93.21 \(\pm\) 54.61      58.82 \(\pm\) 41.81

There are many potential factors influencing the SV results across different datasets and languages. In particular, we discuss the effect of three factors on SV performance: the type of mental disorder or cognitive impairment, the target language, and the data collection procedure of a dataset.

Type of the mental disorder or cognitive impairment may impact the ability to distinguish participants from one another. Participants in the SCZCS are diagnosed with schizophrenia, whereas ADCT and CSMCI patients have Alzheimer’s disease or MCI. It is known that the nature of speech may vary between AD, MCI, and schizophrenia [46].

Another potential factor influencing the ability to generalize across different languages is the lexical or acoustic similarity [45] between the source and target languages. Languages with vocabulary or linguistic characteristics similar to the source language may benefit in TI SV. If the source and target languages are dissimilar, this may limit the ability of a monolingual model to generalize to that target language. We believe this is the reason behind the worse performance on Arabic [45].

Finally, the data collection procedure can impact the quality and length of audio and, consequently, SV performance. The three models used in this study can handle speech of variable length, but it is not known what length of speech is optimal. In addition, the influence of environmental noise and of portions of the examiner’s speech is not taken into consideration when analyzing the results.

4.2 Evaluation

Considering the applicability of these models to our primary use case, we find that TI SV models are a feasible approach for detecting duplicate participants across studies. However, setting a language-specific threshold may be necessary to obtain the best results. In our experiments, a separate threshold was tuned for each language to reach its EER operating point. This need may be caused by dialect variability within languages and the number of non-native speakers. It is also evident that these models are sufficiently generalizable across languages without language-specific fine-tuning.
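As an illustration of this per-language thresholding, the sketch below reuses the hypothetical compute_eer function from the earlier Section 3.3 sketch to derive a decision threshold for each language; scores_by_language is an assumed mapping from a language code to that language's pair scores and labels, not an artifact of the study.

```python
def tune_language_thresholds(scores_by_language):
    """scores_by_language: hypothetical dict mapping a language code to
    (cosine-similarity scores, same-speaker labels) for that language's trial pairs."""
    thresholds = {}
    for lang, (scores, labels) in scores_by_language.items():
        _eer, threshold = compute_eer(scores, labels)  # compute_eer from the earlier sketch
        thresholds[lang] = threshold  # accept a pair if its cosine similarity >= threshold
    return thresholds
```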

4.3 Benchmarking Speaker Verification Models across Different Speech Tasks

The TitaNet model is tested on the ADCT (en) data across different speech tasks (Table 6). The picture description task shows the lowest EER compared to the phonemic and semantic fluency tasks. These results suggest that picture description samples better capture the differences between speakers in SV tasks used in clinical settings. It is also important to note that the average number of samples per speaker and the average durations of audio and speech are the highest for the picture description task. In future work, we will examine the influence of each of these confounding factors in more detail.

5 Conclusion

This paper proposes using speech recordings for speaker verification in clinical trials, particularly trials for detecting and monitoring cognitive and mental health disorders. We demonstrated the effectiveness of the TitaNet, ECAPA-TDNN, and SpeakerNet models in zero-shot settings on speech-impaired patients speaking European languages and Arabic.

Our findings indicate that these TI SV models offer a promising solution to the issue of duplicate participants without requiring fine-tuning, achieving high performance across multiple languages. We also show that the type of speech task conducted impacts the performance of SV models. This insight can inform the design and selection of appropriate tasks to enhance the accuracy and reliability of speaker verification systems.

In future work, it would be valuable to establish a baseline by evaluating results from speakers without cognitive impairments in each language, providing a comparative measure of performance. Additionally, investigating the influence of specific diseases, noise, speech length, and the number of samples on SV model performance would deepen our understanding of the factors that affect accuracy.

Expanding the evaluation to include additional target languages would further validate the robustness and applicability of the SV models in diverse linguistic contexts. Lastly, exploring within-language performance variations, particularly in languages with distinct dialects, such as Arabic, would provide insights into the challenges and opportunities associated with dialectal variations in speaker verification.

6 Ethical Considerations

As with any research involving human subjects, ethical considerations must be taken into account when analyzing and reporting the results. In the context of SV research, the following ethical considerations are relevant.

6.1 Privacy

Speaker verification systems rely on biometric data, which can raise privacy concerns. It is crucial to ensure that the privacy of the enrolled speakers and their sensitive personal information is protected throughout the process. Our research used clinical trial datasets containing recordings from patients with Alzheimer’s disease, mild cognitive impairment, and schizophrenia. Due to the sensitivity of the data, we have made sure to conceal the identities of the individuals in the study and refrain from using personally identifiable information.

6.2 Bias

Speaker verification systems have the potential to reinforce biases in society if not properly designed and evaluated. For instance, if the training data is not representative of the diverse population, the resulting system may not perform equally well across all groups. We cannot ensure the diversity of the data used to train the models, as we do not have access to the demographic characteristics of the training datasets or to other sensitive information about the participants enrolled in the trials.

6.3 Misuse

Speaker verification systems can be misused for unethical purposes, such as identity theft or surveillance. It is important to emphasize that our research is focused on developing and evaluating SV models for legitimate purposes, i.e., solving the duplicate participant problem.

7 Limitations

Applying speaker verification (SV) to cognitive and mental health data comes with certain domain-specific limitations. First, the variability in speech patterns among individuals with cognitive and mental health disorders can affect the performance of SV models, as these models are typically trained on relatively standardized speech data. Additionally, the manifestation of cognitive impairment for the same disease may vary with language, and those patterns are understudied for non-English languages. Impairment patterns have also been shown to differ between diseases. Consequently, the EER of SV systems may differ when applied to populations with cognitive and mental health disorders, which is examined in [43].

Another limitation arises from the limited availability of data. Obtaining large amounts of speech data from individuals with specific cognitive and mental health disorders is challenging. Datasets contain different numbers of speakers at each severity level, the impact of which is shown in [43]. Additionally, the type of speech tasks performed by individuals in clinical trials can influence speech characteristics and, subsequently, SV model performance, as discussed in this paper. It is also important to consider the presence of co-occurring disorders among individuals with cognitive and mental health disorders. Comorbid conditions can introduce additional variability in speech patterns, making it harder for SV models to accurately distinguish between speakers.

References

[1] Thomas M Shiovitz, Charles S Wilcox, Lilit Gevorgyan, and Adnan Shawkat. 2013. . Innovations in Clinical Neuroscience 10, 2 (2013), 17.
[2] R Ramya, P Deepan Nivash, I Anwar Khansa, and B Harish Iyappa. 2022. . In 2022 3rd International Conference on Smart Electronics and Communication (ICOSEC). IEEE, 1468–1471.
[3] Shi-Huang Chen, Yu-Ren Luo, and Rodrigo Capobianco Guido. 2009. . In 2009 11th IEEE International Symposium on Multimedia. IEEE, 562–566.
[4] Atsuki Tamoto and Katunobu Itou. 2019. . In Proceedings of the 10th International Symposium on Information and Communication Technology. 336–341.
[5] Håkan Melin. 1996. . Department of Speech, Music and Hearing, KTH. Available from: http://www.speech.kth.se/~melin/publications.html (1996).
[6] Soroosh Tayebi Arasteh, Tobias Weise, Maria Schuster, Elmar Nöth, Andreas Maier, and Seung Hee Yang. 2022. . arXiv preprint arXiv:2204.06450 (2022).
[7] Shrikant Upadhyay, Mohit Kumar, Ashwani Kumar, Ramesh Karnati, Gouse Baig Mahommad, Sara A Althubiti, Fayadh Alenezi, and Kemal Polat. 2022. . Computational and Mathematical Methods in Medicine 2022 (2022).
[8] Amna Irum and Ahmad Salman. 2019. . International Journal of Machine Learning and Computing 9, 1 (2019).
[9] Roza Chojnacka, Jason Pelecanos, Quan Wang, and Ignacio Lopez Moreno. 2021. . arXiv preprint arXiv:2104.02125 (2021).
[10] Youzhi Tu, Weiwei Lin, and Man-Wai Mak. 2022. . IEEE Access (2022).
[11] Neil T Kleynhans and Etienne Barnard. 2005. . (2005).
[12] Jyh-Min Cheng and Hsiao-Chuan Wang. 2004. . In 2004 International Symposium on Chinese Spoken Language Processing. IEEE, 285–288.
[13] M Erdal Özbek, Mohamed Amine Haytom, and Estelle Cherrier. 2018. . In 2018 26th Signal Processing and Communications Applications Conference (SIU). IEEE, 1–4.
[14] Nithin Rao Koluguri, Taejin Park, and Boris Ginsburg. 2022. . In ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 8102–8106.
[15] Yi Zhou, Xiaohai Tian, and Haizhou Li. 2021. . IEEE/ACM Transactions on Audio, Speech, and Language Processing 29 (2021), 3427–3439.
[16] Xuhong Li, Haoyi Xiong, Xingjian Li, Xuanyu Wu, Xiao Zhang, Ji Liu, Jiang Bian, and Dejing Dou. 2022. . Knowledge and Information Systems 64, 12 (2022), 3197–3234.
[17] Lantian Li, Dong Wang, Askar Rozi, and Thomas Fang Zheng. 2017. . In 2017 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC). IEEE, 1040–1044.
[18] Christopher Cieri, David Miller, and Kevin Walker. 2004. . In LREC, Vol. 4. 69–71.
[19] Harold Goodglass, Edith Kaplan, and Sandra Weintraub. 2001. BDAE: The Boston Diagnostic Aphasia Examination. Lippincott Williams & Wilkins, Philadelphia, PA.
[20] James T Becker, François Boiler, Oscar L Lopez, Judith Saxton, and Karen L McGonigle. 1994. . Archives of Neurology 51, 6 (1994), 585–594.
[21] John G Borkowski, Arthur L Benton, and Otfried Spreen. 1967. . Neuropsychologia 5, 2 (1967), 135–140.
[22] Tom N Tombaugh, Jean Kozak, and Laura Rees. 1999. . Archives of Clinical Neuropsychology 14, 2 (1999), 167–177.
[23] Giovanni B Frisoni, Bengt Winblad, and John T O’Brien. 2011. . International Psychogeriatrics 23, 8 (2011), 1191–1196.
[24] American Psychiatric Association. 2013. Diagnostic and Statistical Manual of Mental Disorders: DSM-5. Vol. 5. American Psychiatric Association, Washington, DC.
[25] Zining Zhu, Jekaterina Novikova, and Frank Rudzicz. 2018. . In NeurIPS Workshop on Interpretability and Robustness in Audio, Speech, and Language (IRASL).
[26] Ben Eyre, Aparna Balagopalan, and Jekaterina Novikova. 2020. . In Proceedings of the Sixth Workshop on Noisy User-generated Text (W-NUT 2020). Association for Computational Linguistics, Online, 193–199. https://doi.org/10.18653/v1/2020.wnut-1.25.
[27] Jekaterina Novikova. 2021. . In Proceedings of the Seventh Workshop on Noisy User-generated Text (W-NUT 2021), Wei Xu, Alan Ritter, Tim Baldwin, and Afshin Rahimi (Eds.). Association for Computational Linguistics, Online, 334–339. https://doi.org/10.18653/v1/2021.wnut-1.37.
[28] Mashrura Tasnim, Malikeh Ehghaghi, Brian Diep, and Jekaterina Novikova. 2022. . In Proceedings of the Eighth Workshop on Computational Linguistics and Clinical Psychology, Ayah Zirikly, Dana Atzil-Slonim, Maria Liakata, Steven Bedrick, Bart Desmet, Molly Ireland, Andrew Lee, Sean MacAvaney, Matthew Purver, Rebecca Resnik, and Andrew Yates (Eds.). Association for Computational Linguistics, Seattle, USA, 1–16. https://doi.org/10.18653/v1/2022.clpsych-1.1.
[29] Elaine Giles, Karalyn Patterson, and John R Hodges. 1996. . Aphasiology 10, 4 (1996), 395–408.
[30] Kimberly D Mueller, Bruce Hermann, Jonilda Mecollari, and Lyn S Turkstra. 2018. . Journal of Clinical and Experimental Neuropsychology 40, 9 (2018), 917–939.
[31] Rupal Patel and Kathryn Connaghan. 2014. . International Journal of Speech-Language Pathology 16, 4 (2014), 337–343.
[32] Margaret Crossley, Carl D’arcy, and Nigel SB Rawson. 1997. . Journal of Clinical and Experimental Neuropsychology 19, 1 (1997), 52–62.
[33] John D Bransford and Marcia K Johnson. 1972. . Journal of Verbal Learning and Verbal Behavior 11, 6 (1972), 717–726.
[34] John W Newcomer, Suzanne Craft, Robert Fucetola, Steven O Moldin, Gregg Selke, Leilani Paras, and Ryan Miller. 1999. . Schizophrenia Bulletin 25, 2 (1999), 321–335.
[35] TE Goldberg, DR Weinberger, NH Pliskin, KF Berman, and MH Podd. 1989. . Schizophrenia Research 2, 3 (1989), 251–257. https://doi.org/10.1016/0920-9964(89)90001-7.
[36] Raluca Nicoleta Trifu, Bogdan Nemeş, Carolina Bodea-Hațegan, and Doina Cozman. 2017. . Journal of Evidence-Based Psychotherapies 17, 1 (2017).
[37] Monika Sohal, Pavneet Singh, Bhupinder Singh Dhillon, and Harbir Singh Gill. 2022. . Family Medicine and Community Health 10, 1 (2022).
[38] Nithin Rao Koluguri, Jason Li, Vitaly Lavrukhin, and Boris Ginsburg. 2020. . arXiv preprint arXiv:2010.12653 (2020).
[39] Arsha Nagrani, Joon Son Chung, and Andrew Zisserman. 2017. . arXiv preprint arXiv:1706.08612 (2017).
[40] Joon Son Chung, Arsha Nagrani, and Andrew Zisserman. 2018. . arXiv preprint arXiv:1806.05622 (2018).
[41] John Godfrey and Edward Holliman. 1993. . Linguistic Data Consortium (1993), 34.
[42] Nauman Dawalatabad, Mirco Ravanelli, François Grondin, Jenthe Thienpondt, Brecht Desplanques, and Hwidong Na. 2021. . arXiv preprint arXiv:2104.01466 (2021).
[43] Malikeh Ehghaghi, Marija Stanojevic, Ali Akram, and Jekaterina Novikova. 2023. . (2023).
[44] Vincent Beaufils and Johannes Tomin. 2023. Stochastic approach to worldwide language classification: the signals and the noise towards long-range exploration. https://www.eLinguistics.net.
[45] Barry R. Chiswick and Paul W. Miller. 2008. . Journal of Multilingual and Multicultural Development (2008), 1–11.
[46] I. Martínez-Nicolás, T.E. Llorente, F. Martínez-Sánchez, and J.J.G. Meilán. 2021. . Frontiers in Psychology 12 (2021), 620251. https://doi.org/10.3389/fpsyg.2021.620251.

  1. https://github.com/NVIDIA/NeMo