September 30, 2025
This paper introduces the Extended-Length Audio Dataset for Synthetic Voice Detection and Speaker Recognition (ELAD-SVDSR), a resource specifically designed to facilitate the creation of high-quality deepfakes and support the development of detection
systems trained against them. The dataset comprises 45-minute audio recordings from 36 participants, each reading various newspaper articles recorded under controlled conditions and captured via five microphones of differing quality. By focusing on
extended-duration audio, ELAD-SVDSR captures a richer range of speech attributes—such as pitch contours, intonation patterns, and nuanced delivery—enabling models to generate more realistic and coherent synthetic voices. In turn, this approach allows for
the creation of robust deepfakes that can serve as challenging examples in datasets used to train and evaluate synthetic voice detection methods. As part of this effort, 20 deepfake voices have already been created and added to the dataset to showcase its
potential. Anonymized metadata on speaker demographics accompanies the dataset. ELAD-SVDSR is expected to spur significant advancements in audio forensics, biometric security, and voice authentication systems.
IEEE SOCIETY/COUNCIL Signal Processing Society (SPS)
DATA DOI/PID 10.21227/ab5w-0c23
DATA TYPE/LOCATION Audio; Potsdam, NY, USA
KEYWORDS Audio Dataset, Synthetic Voice Detection, Speaker Recognition, Deepfake Detection, Voice Synthesis, Biometric Security.
The rapid advancement of text-to-speech (TTS) and deep learning techniques has enabled the production of highly realistic synthetic voices, often referred to as deepfakes [1]. Early TTS systems were limited in quality primarily due to short-duration training data, which provided only a narrow sampling of speech features. Modern approaches such as WaveNet [1] and Tacotron 2 [2], however, accept longer audio inputs and employ more sophisticated architectures.
As the fidelity of TTS models improves, concerns about misuse for impersonation and fraud have grown [3]. Moreover, detecting synthetic audio has become increasingly challenging, leading to dedicated efforts such as the ASVspoof initiative [4] to develop and benchmark anti-spoofing methods. While many detection algorithms perform well on short speech segments, longer-duration recordings can better reveal subtle artifacts in synthesis [5].
Extended-length audio also benefits speaker recognition systems, which rely on robust feature representations extracted from diverse acoustic conditions [6]. Longer recordings allow for more comprehensive modeling of a speaker’s unique vocal traits, helping TTS models produce stronger deepfake voices and providing richer data for detection systems to identify the presence of synthetic elements. Despite these advantages, there remains a shortage of publicly available datasets that provide extended-duration recordings suitable for both training and evaluating deepfake generation and detection algorithms. The dataset presented in this paper, ELAD-SVDSR, addresses this gap by offering 45-minute recordings from 36 participants, alongside 20 generated deepfake samples.
The VCTK Corpus [7] is a multi-speaker English dataset recorded under controlled studio conditions. It contains roughly 110 speakers, each providing short read passages. Although this dataset has good speaker diversity, it focuses on shorter utterances and lacks synthetic parallel data.
LibriSpeech [8] is a large corpus derived from public domain audiobooks, offering around 1,000 hours of speech from more than 2,400 speakers. While it provides extended recordings for ASR, it lacks controlled microphone variations and does not include high-fidelity synthetic samples.
VoxCeleb [9] consists of thousands of speakers drawn from YouTube interviews. Its real-world noise and broad speaker coverage are strengths for speaker identification tasks, but the dataset primarily contains shorter clips, lacks microphone diversity, and does not include synthetic utterances.
LJSpeech [10] is a single-speaker dataset with about 24 hours of read speech. It is widely used for building TTS models due to its consistent recording environment. However, the single-speaker limitation and relatively short utterances reduce its suitability for multi-speaker deepfake research.
Mozilla’s Common Voice [11] is a crowdsourced dataset with contributions from global volunteers. It supports multiple languages and speaker demographics but varies significantly in recording quality and clip length, and it does not offer paired synthetic samples for deepfake detection.
TIMIT [12] is a classic dataset featuring phonetically balanced, short utterances in English. With high-quality, time-aligned transcriptions, TIMIT remains popular for phonetic research. However, the limited speaker set and short, lab-recorded prompts render it insufficient for extended deepfake generation and detection.
AISHELL-1 [13] is an open-source Mandarin speech corpus recorded under relatively quiet conditions. While it provides useful data for non-English speech research, it consists mostly of short utterances and lacks any synthetic component, limiting its value for deepfake-related studies.
Overall, these datasets (summarized in Table 1) have significantly advanced speech technology but do not comprehensively address the need for extended-duration recordings, controlled microphone diversity, and integrated synthetic samples. ELAD-SVDSR fills this gap by providing 45-minute recordings per speaker under multiple microphone conditions alongside high-quality deepfake audio.
Dataset | # Speakers | Total Hours | Avg. Clip Length | Synthetic Data? | Mic Diversity | Extended Per Speaker |
---|---|---|---|---|---|---|
VCTK [7] | \(\sim110\) | \(\sim44\) | Short (a few seconds) | No | Single, Studio | No |
LibriSpeech [8] | 2,484+ | \(\sim1000\) | Mostly a few seconds | No | Not specified | Partial |
VoxCeleb [9] | 7,000+ | \(\sim2000\) | Short to medium (\(<\)10 s) | No | In-the-wild | No |
LJSpeech [10] | 1 | \(\sim24\) | 10–15 s | No | Single, Studio | No |
Common Voice [11] | Thousands | Several thousand | Mostly a few seconds | No | User-recorded, variable | No |
TIMIT [12] | 630 | \(\sim5\) | 3–4 s | No | Single, Lab | No |
AISHELL-1 [13] | 400+ | \(\sim178\) | Short utterances | No | Single, Quiet | No |
ELAD-SVDSR (Proposed) | 36 | \(\sim27\) | 45 min/speaker | Yes (20 deepfakes) | 5 distinct mics | Yes |
ELAD-SVDSR was developed following rigorous ethical guidelines and procedures approved by the Institutional Review Board (IRB Approval No. 24-42) at Clarkson University [14]. This approval ensures that all aspects of the research involving human subjects adhere to the highest ethical standards, particularly regarding informed consent, data confidentiality, and the overall treatment of participants.
The IRB’s primary role is to safeguard the rights and well-being of research participants. All participants were fully informed about the study’s nature, potential risks and benefits, and their rights as participants. Each participant signed an informed consent form detailing the study’s purpose, procedures, data-sharing permissions, and measures to ensure their confidentiality. Flyers were distributed throughout the university campus to recruit participants.
The IRB addressed the critical concern of protecting participant data. All voice recordings and associated metadata were anonymized before inclusion in the dataset, meaning no identifying information was linked to the recordings. After anonymization, all identifiable data were permanently deleted, and the consent forms were physically and securely stored at the university. The data was stored in secure, password-protected environments, accessible only to authorized researchers.
Participants for the ELAD-SVDSR dataset were recruited from the Clarkson University community through flyers and electronic communications. All potential participants were provided with detailed information about the study, including its objectives, procedures, and their rights as participants. Informed consent was obtained from all participants before the data collection began.
The recording sessions were conducted in a closed-room environment with minimal external noise and consistent acoustic conditions. Participants were seated as close as possible to all recording equipment. Additionally, researchers followed a uniform protocol during all sessions, monitoring microphone placement, participant comfort, and other environmental factors, to ensure high-quality and reliable audio samples. Figure 1 shows the live data collection environment.
Participants were instructed to read aloud three contemporary news articles during their recording sessions, making the data text-dependent. The first article, sourced from The New York Times, covered economic and political topics related to tariff policies [15]. The remaining two were taken from TIME magazine: one discussing issues surrounding pregnancy criminalization in a post-Dobbs context [16] and another examining how scams and fraud schemes evolve in the digital age [17]. These articles were chosen for their diverse vocabulary and subject matter, which together cover a wide range of English phoneme variations and reflect distinct intonations, linguistic structures, and contexts.
All voice recordings for this study were captured using:
Audio-Technica AT2020 [18]
Shure SM58 [19]
TOZO A1 [20]
Inni Oasis R1 [21]
ZIPCIDE Digital Voice Activated Recorder (Spy Pen) [22]
Each device’s complete specifications are publicly available through its respective manufacturer’s website [18]–[22]. The recordings were made at a standard sampling rate of 44.1 kHz. The Audio-Technica AT2020 operates within a frequency range of 20 Hz to 20 kHz, has a sensitivity of -37 dB, and can handle a maximum sound pressure level (SPL) of 144 dB. It requires 48V phantom power to operate and connects via an XLR output [18]. The Shure SM58 operates within a frequency response range of 50 Hz to 15 kHz and is optimized to emphasize clarity in the vocal midrange while attenuating low-frequency background noise [19]. The TOZO A1 earbuds capture near-field audio through an integrated microphone and connect via Bluetooth [20]. The Inni Oasis R1 is a touchscreen digital recorder capable of high-fidelity audio capture under relatively quiet conditions [21]. Finally, the ZIPCIDE Digital Voice Activated Recorder (commonly referred to as a “spy pen”) discreetly captures speech in a compact form factor for on-the-go recording [22]. Figure 2 displays the microphones used for data collection.
Figure 2: Microphones and recording devices used in this study (from top to bottom): Audio-Technica AT2020, Shure SM58, TOZO A1 earbuds, Inni Oasis R1 recorder, and the ZIPCIDE Spy Pen.
Each recording was manually reviewed for clarity and consistency to ensure the highest quality data. Recordings that exhibited clipping, deviated from the 44.1 kHz sampling rate, contained unclear or interrupted speech, or were affected by technical malfunctions were excluded from the final dataset.
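These screening criteria can also be checked programmatically before manual review. The following is a minimal sketch of such a check, assuming the `soundfile` and `numpy` packages and a hypothetical clipping threshold; it is an illustration, not the exact procedure used for ELAD-SVDSR.

```python
import numpy as np
import soundfile as sf

EXPECTED_SR = 44100      # required sampling rate (44.1 kHz)
CLIP_THRESHOLD = 0.999   # assumed amplitude threshold for flagging digital clipping

def passes_quality_screen(path: str) -> bool:
    """Return True if a recording meets the basic technical criteria."""
    audio, sr = sf.read(path, dtype="float32")
    if sr != EXPECTED_SR:
        return False                      # inconsistent sampling rate
    if audio.size == 0:
        return False                      # empty or corrupted file
    if np.max(np.abs(audio)) >= CLIP_THRESHOLD:
        return False                      # likely clipped samples
    return True

# Example: screen a single file before passing it on to manual review
# print(passes_quality_screen("subject01_mic1.wav"))
```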
Table 2 summarizes participant demographics, showing broad representation across gender, race, and age.
Category | Subcategory | Number of Participants |
---|---|---|
Gender | Man | |
Gender | Woman | |
Gender | Prefer not to respond | |
Race | Caucasian | |
Race | Black | |
Race | Hispanic | |
Race | Native American | |
Race | Middle Eastern | |
Race | Indian | |
Race | Asian | |
Race | Other | |
Age | -25 years | |
Age | -30 years | |
Age | -40 years | |
Table 3 summarizes the key audio properties of the dataset. Noise levels were measured in A-weighted decibels (dBA) using calibrated microphones. The noise levels ranged from 2.66 dBA to 10.82 dBA, with an average of 6.27 dBA, a range close to that of near-silent environments (4.86 dB) [23]–[25].
Metric | Value |
---|---|
Sampling Rate | 44.1 kHz |
Bit Depth | 16-bit |
Average Noise Level of all audio files | 6.27 dBA |
Total Duration of Recordings | 27 hours |
Table 4 shows the notably high signal-to-noise ratio (SNR) values obtained from this analysis. The SNR is computed following Ahmed et al. [26] as: \[\text{SNR (in dB)} = 10 \times \log_{10} \biggl( \frac{P_{\text{signal}}}{P_{\text{noise}}} \biggr),\] where \(P_{\text{signal}}\) and \(P_{\text{noise}}\) are the mean-square powers of the speech signal and the background noise, respectively. Specifically, \[P = \frac{1}{N} \sum_{n=1}^{N} x[n]^2,\] where \(x[n]\) denotes the amplitude of the signal or noise at the \(n\)-th sample and \(N\) is the total sample count.
To approximate the noise level, any audio segment whose amplitude falls below a set threshold, derived from the average overall signal energy, is designated as noise; the remaining portion of the audio is treated as the speech signal. Two additional SNR measures were evaluated: Segmented SNR (SegSNR) and Frequency-Weighted SNR (fwSNR).
Segmented SNR (SegSNR): This metric divides the audio into short, fixed-length frames (around 20–30 ms). The SNR for each segment is calculated as: \[\text{SegSNR (in dB)} = 10 \times \log_{10} \biggl(\frac{P_{\text{segment}}}{P_{\text{noise}}}\biggr),\] where \(P_{\text{segment}}\) corresponds to the mean-square power of that segment, and \(P_{\text{noise}}\) is the noise power. The final SegSNR is obtained by averaging across all segments, offering a granular view of time-varying noise levels.
Frequency-Weighted SNR (fwSNR): An A-weighting filter is applied to emphasize frequencies crucial for human hearing (typically 500 Hz–5 kHz). The fwSNR is then given by: \[\text{fwSNR (in dB)} = 10 \times \log_{10} \biggl(\frac{P_{\text{A-weighted signal}}}{P_{\text{A-weighted noise}}}\biggr),\] where speech and noise powers are computed after the filter is applied. As with the other methods, noise segments are estimated based on low-energy detection. The scripts for SNR calculations can be accessed at: https://github.com/ahmedajan/SNR_Calculation_For_VPQAD/tree/main/VPQAD_SNR
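To complement the formulas above, the snippet below sketches the global SNR and SegSNR computations using the low-energy thresholding described in this section. It assumes `numpy`, and the threshold rule (a fixed fraction of the average frame power) is an illustrative assumption; the released scripts at the link above remain the authoritative implementation, and fwSNR would additionally require applying an A-weighting filter before taking the same power ratio.

```python
import numpy as np

def snr_db(p_signal: float, p_noise: float) -> float:
    """SNR in dB from mean-square powers."""
    return 10.0 * np.log10(p_signal / p_noise)

def estimate_snr(x: np.ndarray, sr: int, frame_ms: float = 25.0):
    """Estimate global SNR and SegSNR for a mono signal x sampled at sr Hz.

    Frames whose mean-square power falls below a low-energy threshold
    (assumed here to be 10% of the average frame power) are labeled noise;
    the remaining frames are treated as speech.
    """
    frame_len = int(sr * frame_ms / 1000)
    n_frames = len(x) // frame_len
    frames = x[: n_frames * frame_len].reshape(n_frames, frame_len)
    powers = np.mean(frames ** 2, axis=1)       # per-frame mean-square power

    threshold = 0.1 * np.mean(powers)           # assumed low-energy threshold
    noise_mask = powers < threshold
    if not noise_mask.any() or noise_mask.all():
        raise ValueError("could not separate speech and noise frames")

    p_noise = np.mean(powers[noise_mask])
    p_signal = np.mean(powers[~noise_mask])

    global_snr = snr_db(p_signal, p_noise)
    seg_snr = float(np.mean([snr_db(p, p_noise) for p in powers[~noise_mask]]))
    return global_snr, seg_snr
```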
Metric | Highest Value (dB) | Lowest Value (dB) | Mean Value (dB) |
---|---|---|---|
SNR | | | |
Segmented SNR (SegSNR) | | | |
Frequency-Weighted SNR (fwSNR) | | | |
Under these high-SNR conditions, background noise remains minimal relative to the speech signal, indicating a more favorable environment for intelligibility.
A total of 20 deepfake voices were generated using Tortoise TTS [27]. Before feeding each participant’s 45-minute speech into the model, the data was manually pre-processed in the Audacity audio editor [28]: each recording was reviewed by ear, silences were trimmed, and disfluencies (fillers, long pauses) were removed. The resulting annotated audio was segmented into \(\sim\)10-second clips. Two of these clips were randomly held out for comparison with the generated deepfakes, and the remaining clips were used to train the model. Deepfakes were generated so that the synthetic audio spoke the same sentences as the two held-out clips, allowing a direct match.
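The preprocessing and clip selection were performed manually in Audacity; purely as an illustration of the workflow just described, the sketch below splits a cleaned recording into roughly 10-second clips and randomly holds out two of them, assuming `soundfile` and `numpy` (the actual clip boundaries in the dataset were chosen by hand).

```python
import numpy as np
import soundfile as sf

def split_into_clips(path: str, clip_seconds: float = 10.0,
                     n_holdout: int = 2, seed: int = 0):
    """Split a cleaned recording into ~10 s clips and hold two out for evaluation."""
    audio, sr = sf.read(path, dtype="float32")
    clip_len = int(clip_seconds * sr)
    clips = [audio[i:i + clip_len]
             for i in range(0, len(audio) - clip_len + 1, clip_len)]

    rng = np.random.default_rng(seed)
    holdout_idx = set(rng.choice(len(clips), size=n_holdout, replace=False))
    holdout = [clips[i] for i in sorted(holdout_idx)]
    training = [c for i, c in enumerate(clips) if i not in holdout_idx]
    return training, holdout
```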
To evaluate the quality of the generated deepfakes, VeriSpeak [29], an industry-standard speaker recognition tool, was used. VeriSpeak match scores quantify how closely the generated voice resembles the genuine speaker’s characteristics. Although VeriSpeak internally calculates match scores on an absolute scale, we present normalized percentages, with the original-to-original comparison defined as 100%. Table 5 summarizes these normalized results for three randomly selected subjects.
Participant | Recording Type | Match Score (Normalized %) |
---|---|---|
Subject 3 | Original | 100.0 |
Subject 3 | Deepfake | 41.7 |
Subject 6 | Original | 100.0 |
Subject 6 | Deepfake | 39.3 |
Subject 18 | Original | 100.0 |
Subject 18 | Deepfake | 34.5 |
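The normalization used in Table 5 is a simple ratio of the deepfake-to-original match score to the original-to-original score, expressed as a percentage. A minimal sketch follows; the raw scores in the example are hypothetical, since VeriSpeak's absolute scale is not reproduced here.

```python
def normalized_match(deepfake_vs_original: float, original_vs_original: float) -> float:
    """Express a deepfake-to-original score as a percentage of the
    original-to-original score (treated as 100%)."""
    return 100.0 * deepfake_vs_original / original_vs_original

# Hypothetical raw VeriSpeak scores, for illustration only
print(round(normalized_match(105.0, 252.0), 1))  # 41.7, cf. Subject 3 in Table 5
```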
While these similarity percentages may seem low, they indicate comparatively high-quality deepfakes, as demonstrated by Table 6. For this comparison, ten audio samples were taken from each dataset to generate deepfakes and compute VeriSpeak match scores. The deepfake-to-original scores were then normalized as percentages of the corresponding original-to-original scores, and the results are reported below.
Dataset | Mean Similarity (%) |
---|---|
VCTK [7] | 15.6 |
LibriSpeech [8] | 18.9 |
VoxCeleb [9] | 24.2 |
LJSpeech [10] | 28.9 |
Common Voice [11] | 19.3 |
TIMIT [12] | 10.1 |
AISHELL-1 [13] | 2.4 |
ELAD-SVDSR (Proposed) | 37.2 |
After each recording session, the data was stored in a secure, password-protected digital archive accessible only to authorized researchers. To maintain privacy, all identifying information was separated from the audio files and permanently deleted. The files were anonymized by assigning each recording a unique identification code, ensuring that no personal information was linked to the voice data. These measures allowed the dataset to be made public while preserving participants’ privacy in accordance with IRB regulations.
At the top level, the dataset contains essential files such as the README (explaining how to use the dataset, in particular the mic numbers and their corresponding microphones) and a metadata folder holding CSV files with speaker demographics. The main body of the dataset is contained within the "subjects" directory, where each subject is assigned its own folder (e.g., subject01, subject02, etc.). Within each subject folder there are three subdirectories: "original_audio," which stores the raw recordings; "preprocessed_audio," which contains annotated versions of these recordings; and "deepfake_audio," which holds the synthesized deepfake outputs. All files are in .wav format, and the naming convention indicates the origin of each audio file; for instance, a file named "subject01_mic1.wav" in the original_audio folder is a recording from subject 01 captured with the Shure SM58, while "subject01_mic1_deepfake01.wav" in the deepfake_audio folder denotes the first deepfake generated from that same microphone recording, as shown in Figure 3.
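For convenience, this layout and naming convention can be navigated programmatically. The following standard-library sketch parses the subject, microphone, and deepfake indices from file names matching the convention described above; names in the preprocessed_audio folder may follow a different pattern and are simply skipped here.

```python
import re
from pathlib import Path

# Matches e.g. "subject01_mic1.wav" or "subject01_mic1_deepfake01.wav"
NAME_PATTERN = re.compile(
    r"subject(?P<subject>\d+)_mic(?P<mic>\d+)(?:_deepfake(?P<deepfake>\d+))?\.wav"
)

def index_dataset(root: str):
    """Yield (subject, mic, deepfake_index_or_None, path) for every .wav file
    under the 'subjects' directory that follows the naming convention."""
    for path in Path(root, "subjects").rglob("*.wav"):
        m = NAME_PATTERN.fullmatch(path.name)
        if m is None:
            continue  # e.g. preprocessed clips with a different naming scheme
        deepfake = m.group("deepfake")
        yield (int(m.group("subject")),
               int(m.group("mic")),
               int(deepfake) if deepfake is not None else None,
               path)

# Example: count deepfake files per subject (root path is hypothetical)
# from collections import Counter
# counts = Counter(s for s, _, d, _ in index_dataset("ELAD-SVDSR") if d is not None)
```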
The Extended-Length Audio Dataset for Synthetic Voice Detection and Speaker Recognition(ELAD-SVDSR) is intended exclusively for academic research. To obtain access, researchers are required to sign the End User License Agreement (EULA), which can be requested via email at mimtiaz@clarkson.edu or downloaded directly from the IEEE Dataport. A signed EULA must then be returned to this email address. Only emails originating from academic accounts will be accepted.
ELAD-SVDSR’s extended-duration recordings capture a range of speech characteristics, yet several practical constraints remain. Although the dataset features 36 speakers, this may not fully reflect the linguistic diversity seen in broader populations. The collection was conducted in a controlled, low-noise environment, producing high-fidelity audio that may not mirror real-world acoustic conditions. Additionally, the focus on English newspaper readings restricts the applicability of the dataset to other languages and more spontaneous speech tasks. The deepfake profiles were generated using a specific text-to-speech model, which can introduce biases that limit their generalizability. Future expansions to include more varied recording scenarios, a broader speaker population, and alternative speech domains would help mitigate these limitations and enhance ELAD-SVDSR’s utility.
The scripts for calculating signal-to-noise ratios (SNR), segmenting audio files, and evaluating speech quality metrics are publicly accessible on GitHub at: https://github.com/ahmedajan/SNR_Calculation_For_VPQAD/tree/main/VPQAD_SNR
R.V., A.A., J.P. and D.P. conducted the data collection. J.P., D.P. and A.C. conducted preprocessing of the data. A.A. curated the data collection, analyzed the data and wrote the manuscript. S.S. and M.H.I. reviewed the data collection, curation, and analysis. All authors reviewed the manuscript.
This work is supported by the Center for Identification Technology Research and the National Science Foundation under Grant No. 1650503.
The article authors have declared no conflicts of interest.