🪴 Anil's Garden

Speech Datasets

YODAS Youtube-Oriented Dataset for Audio and Speech
Releasing Youtube-Commons a massive open corpus for conversational and multimodal data - from Pleias - Pierre-Carl Langlais
Libriheavy a 50,000 hours ASR corpus with punctuation casing and context
MOSEL 950,000 Hours of Speech Data for Open-Source Speech Foundation Model Training on EU Languages
Granary Speech Recognition and Translation Dataset in 25 European Languages
VoxForge: Free Speech… Recognition (Linux, Windows and Mac) - voxforge.org
NUTSHELL A Dataset for Abstract Generation from Scientific Talks
MSR-86K An Evolving, Multilingual Corpus with 86,300 Hours of Transcribed Audio for Speech Recognition Research
AISHELL-2 Transforming Mandarin ASR Research Into Industrial Scale
VoxCommunis A Corpus for Cross-linguistic Phonetic Analysis
Learning Audio-Text Agreement for Open-vocabulary Keyword Spotting
People’s Speech Dataset MLCommons Datasets - from https://mlcommons.org/
Multilingual Spoken Words Dataset MLCommons Datasets - from https://mlcommons.org/
Speech Commands A Dataset for Limited-Vocabulary Speech Recognition
Towards Measuring Fairness in AI the Casual Conversations Dataset
TED-LIUM an Automatic Speech Recognition dedicated corpus
TED-LIUM 3 twice as much data and corpus repartition for experiments on speaker adaptation
The M-AILABS Speech Dataset
The AMI Meeting Corpus
FLEURS Few-shot Learning Evaluation of Universal Representations of Speech
FLEURS-R A Restored Multilingual Speech Corpus for Generation Tasks
Building a Time-Aligned Cross-Linguistic Reference Corpus from Language Documentation Data (DoReCo)
Speech-MASSIVE A Multilingual Speech Dataset for SLU and Beyond
Europarl-ST A Multilingual Corpus For Speech Translation Of Parliamentary Debates
Europarl-ASR A Large Corpus of Parliamentary Debates for Streaming ASR Benchmarking and Speech Data Filtering/Verbatimization
CoVoST A Diverse Multilingual Speech-To-Text Translation Corpus
CoVoST 2 and Massively Multilingual Speech-to-Text Translation
MuST-C a Multilingual Speech Translation Corpus
MLS A Large-Scale Multilingual Dataset for Speech Research
VoxPopuli A Large-Scale Multilingual Speech Corpus for Representation Learning, Semi-Supervised Learning and Interpretation
SPGISpeech 5,000 hours of transcribed financial audio for fully formatted end-to-end speech recognition
GigaSpeech An Evolving, Multi-domain ASR Corpus with 10,000 Hours of Transcribed Audio
The People’s Speech A Large-Scale Diverse English Speech Recognition Dataset for Commercial Usage
Librispeech An ASR corpus based on public domain audio books
- LibriSpeech is a corpus of approximately 1000 hours of read English speech with sampling rate of 16 kHz, prepared by Vassil Panayotov with the assistance of Daniel Povey. The data is derived from read audiobooks from the LibriVox project, and has been carefully segmented and aligned
- available at OpenSLR
Libri-Light A Benchmark for ASR with Limited or No Supervision
- Meta blog post: Libri-light
- Preparation and download script
- large.tar (51934 hours, 3.05 TB; download link: https://dl.fbaipublicfiles.com/librilight/data/large.tar)
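
A quick back-of-the-envelope check on the large.tar figures can help when planning storage and download time. This is only a sketch: it assumes "TB" means decimal terabytes, and `download_hours` is a hypothetical helper, not part of the Libri-Light download script.

```python
# Figures quoted in the notes for Libri-Light's large.tar.
HOURS = 51_934
SIZE_TB = 3.05  # assumption: decimal terabytes (1 TB = 10**12 bytes)

size_bytes = SIZE_TB * 10**12
bytes_per_hour = size_bytes / HOURS
mb_per_hour = bytes_per_hour / 10**6
# Average bitrate of the stored audio, in kilobits per second.
kbit_per_second = bytes_per_hour * 8 / 3600 / 1000


def download_hours(link_mbit_per_s: float) -> float:
    """Hours needed to fetch the archive at a sustained link speed (hypothetical helper)."""
    return size_bytes * 8 / (link_mbit_per_s * 10**6) / 3600


print(f"{mb_per_hour:.1f} MB per hour of audio")
print(f"{kbit_per_second:.1f} kbit/s average bitrate")
print(f"{download_hours(1000):.1f} h at a sustained 1 Gbit/s")
```

The per-hour figure is an average over the whole archive, so individual files will differ.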
LibriTTS A Corpus Derived from LibriSpeech for Text-to-Speech
CVSS Corpus and Massively Multilingual Speech-to-Speech Translation
Fisher and CALLHOME Spanish—English Speech Translation
Festvox CMU_ARCTIC Databases
- constructed at the Language Technologies Institute at Carnegie Mellon University as phonetically balanced, US English single speaker databases
- designed for unit selection speech synthesis research
- ~1150 utterances carefully selected from out-of-copyright texts from Project Gutenberg
- databases include US English male (bdl) and female (slt) speakers (both experienced voice talent)…as well as other accented speakers
- The 1132 sentence prompt list is available from cmuarctic.data
- the distributions include 16 kHz waveforms and simultaneous EGG signals
- full phonetic labelling was performed by CMU Sphinx using the FestVox-based labelling scripts
AISHELL-1 An Open-Source Mandarin Speech Corpus and A Speech Recognition Baseline - Aishell is an open-source Chinese Mandarin speech corpus
- 400 people from different accent areas in China are invited to participate in the recording, which is conducted in a quiet indoor environment using high fidelity microphone and downsampled to 16kHz
- manual transcription accuracy is above 95%, through professional speech annotation and strict quality inspection
- free for academic use
- published by Beijing Shell Shell Technology Co. Ltd.
TIMIT Acoustic-Phonetic Continuous Speech Corpus - Linguistic Data Consortium
- Available: https://github.com/philipperemy/timit
- Paper / Docs from NIST: https://nvlpubs.nist.gov/nistpubs/Legacy/IR/nistir4930.pdf
- published 1993
- standard dataset used for the evaluation of automatic speech recognition (ASR) systems
- 630 speakers
- 8 dialects of American English
- each speaker reads 10 phonetically-rich sentences
- corpus includes time-aligned
  - orthographic
  - phonetic
  - word transcriptions
- 16-bit, 16kHz speech waveform
- transcriptions have been hand-verified
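
The TIMIT numbers above imply a corpus size that is easy to work out. A minimal sketch, assuming single-channel (mono) audio and ignoring file headers; `pcm_bytes` is a hypothetical helper for illustration only.

```python
# Figures taken from the notes above.
SPEAKERS = 630
SENTENCES_PER_SPEAKER = 10
SAMPLE_RATE_HZ = 16_000
BITS_PER_SAMPLE = 16

# Every speaker reads the same number of sentences.
total_utterances = SPEAKERS * SENTENCES_PER_SPEAKER

# Raw PCM data rate, assuming one channel (mono).
bytes_per_second = SAMPLE_RATE_HZ * BITS_PER_SAMPLE // 8


def pcm_bytes(duration_s: float) -> int:
    """Size in bytes of an uncompressed mono clip of the given length (hypothetical helper)."""
    return int(duration_s * bytes_per_second)


print(total_utterances, "utterances in total")
print(bytes_per_second, "bytes per second of raw audio")
print(pcm_bytes(3.0), "bytes for a 3-second clip")
```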
AVSpeech Audio Visual Speech Dataset
- See paper: Looking to Listen at the Cocktail Party A Speaker-Independent Audio-Visual Model for Speech Separation
- large-scale audio-visual dataset
- speech video clips with no interfering background noises
- segments 3-10 seconds long
- in each clip the audible sound in the soundtrack belongs to a single speaking person, visible in the video
- ~4,700 hours of video segments
- from ~290k YouTube videos
- spans a wide variety of people, languages and face poses
VoxCeleb2 Deep Speaker Recognition
- audio-visual speaker recognition dataset
- over 1 million utterances
- 6,000 speakers (celebrities)
- 61% of speakers are male (gender balanced-ish)
- span ethnicities, accents and ages
- varied visual and auditory environments
- from Zisserman’s VGG @ Oxford
WenetSpeech
- WenetSpeech A 10000+ Hours Multi-domain Mandarin Corpus for Speech Recognition
- Repository (downloaders?)
- Non-commercial licence
- 10,000+ hours
- multi-domain Chinese ASR data (slight pain to download)
Mozilla Common Voice
- We’re building an open source, multi-language dataset of voices that anyone can use to train speech-enabled applications.
VCTK
- CSTR VCTK Corpus
  - 110 English speakers
  - …with various accents
  - each speaker reads out ~400 sentences
  - sentences selected from a newspaper
  - rainbow passage and an elicitation paragraph used for the speech accent archive
- downloadable via Hugging Face datasets
Hi-Fi Multi-Speaker English TTS Dataset
CSS10 A Collection of Single Speaker Speech Datasets for 10 Languages
The LJ Speech Dataset
HUI-Audio-Corpus-German
MAVD The First Open Large-Scale Mandarin Audio-Visual Dataset with Depth Information
Sam, the Non-Binary TTS Voice
- Sam is a non-binary text-to-speech (TTS) voice that can be embedded into any voice assistant software solution. Accenture Labs created Sam in collaboration with Cereproc. By open-sourcing the components that we used to create this voice, we hope to encourage adoption and creation of others like it in the future so that there will eventually be a diversity of non-binary voices out in the world.
The Spotify Podcast Dataset - this paper was replaced by the COLING paper: 100,000 Podcasts A Spoken English Document Corpus
- is this dataset still available?
Emilia An Extensive, Multilingual, and Diverse Speech Dataset for Large-Scale Speech Generation
- available via HF: https://huggingface.co/datasets/amphion/Emilia-Dataset (also included in document metadata)
Building Naturalistic Emotionally Balanced Speech Corpus by Retrieving Emotional Speech from Existing Podcast Recordings

Speech Translation Datasets

How2 - English instructional videos with Portuguese translations
Augmented Librispeech - English audiobooks (Gutenberg Project) from the LibriSpeech Corpus translated into written French text
MuST-C a Multilingual Speech Translation Corpus - from FBK, a corpus of English TED talks translated into 14 languages
CoVoST and CoVoST 2 and Massively Multilingual Speech-to-Text Translation - CoVoST 2 has one-to-many and many-to-one datasets in 15 languages. Both CoVoST 1 and 2 use Mozilla Common Voice data
Europarl-ST - from European Parliament proceedings from 2008-2012. Has multiple sources and targets for both speech and text. 4 languages.
VoxPopuli - expansion deck for Europarl-ST - extends it with data from 2009-2020
Kosp2E - Korean speech audio with English parallel texts. Domains: Newspapers (Zeroth), textbooks (KSS), EmphStyleKQC (AI stuff), Covid-ED (Covid diaries; lots of emotional speech audio)
GigaST - English speech audio translated to German and Chinese text. Based on GigaSpeech ASR corpus - 10,000 hours transcribed speech, audiobooks, podcasts and YouTube.
Prabhupadavani - code-mixed speech, mainly English with Bengali and Sanskrit interspersed, text in 25 languages

Besides these popular ST datasets, there are some other smaller size datasets such as Fisher (Cieri et al., 2004), Call-Home, Godard Corpus (Godard et al., 2017), Glosse Audio Corpus, BTEC, WSJ, IWSLT, CHiME-4 Corpus (Christensen et al., 2010), Miami Corpus (Deuchar, 2008), and MSLT Corpus (Federmann and Lewis, 2016). - from Recent Advances in Direct Speech-to-text Translation

Speech Dataset Lists / Directories

TensorFlow Datasets
Google Research > Resources > Datasets
- In order to contribute to the broader research community, Google periodically releases data of interest to researchers in a wide range of computer science disciplines.
Google Dataset Search
Papers with Code Datasets
- Papers with Code Datasets - Speech
Text-to-speech datasets - Hugging Face Audio Course - see original website
OpenSLR Resources - We are open to hosting any type of data that’s useful for speech recognition and related tasks, that needs a stable URL where it can be downloaded from. We may think more carefully in cases where the data is very large (e.g. tens of gigabytes or more).
View Datasets provided by MLCommons
- MLCommons datasets on Hugging Face
Spotify Research Data Sources (link):
TORCHAUDIO.DATASETS
- All datasets are subclasses of torch.utils.data.Dataset and have __getitem__ and __len__ methods implemented.
- Hence, they can all be passed to a torch.utils.data.DataLoader which can load multiple samples in parallel using torch.multiprocessing workers. For example:

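import torch
import torchaudio

# Download YESNO into the current directory if it is not already there.
yesno_data = torchaudio.datasets.YESNO('.', download=True)
# Wrap it in a DataLoader so samples are shuffled and loaded by worker processes.
data_loader = torch.utils.data.DataLoader(
    yesno_data,
    batch_size=1,
    shuffle=True,
    num_workers=2)  # the torchaudio docs pass args.nThreads here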
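
To make the `__getitem__` / `__len__` point concrete, here is a minimal sketch of the same map-style protocol in plain Python. `ManifestDataset` and its tab-separated manifest format are made up for illustration and are not part of torchaudio. Real torchaudio datasets subclass torch.utils.data.Dataset and return waveforms plus metadata rather than path strings.

```python
from typing import List, Tuple


class ManifestDataset:
    """Toy map-style dataset: one (audio_path, transcript) pair per manifest line."""

    def __init__(self, manifest_text: str) -> None:
        self.items: List[Tuple[str, str]] = []
        for line in manifest_text.splitlines():
            if not line.strip():
                continue  # skip blank lines
            # Hypothetical format: "<audio path>\t<transcript>"
            path, transcript = line.split("\t", 1)
            self.items.append((path, transcript))

    def __len__(self) -> int:
        # Number of samples; a loader uses this to build its index range.
        return len(self.items)

    def __getitem__(self, index: int) -> Tuple[str, str]:
        # Fetch one sample by integer index.
        return self.items[index]


manifest = "clips/0001.wav\tyes no yes\nclips/0002.wav\tno no yes\n"
dataset = ManifestDataset(manifest)
print(len(dataset), dataset[0])
```

In a real project the class would subclass torch.utils.data.Dataset and load the audio inside `__getitem__`, so the loading work is spread across the DataLoader workers.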

Audio Datasets

The MAESTRO Dataset - MIDI and Audio Edited for Synchronous TRacks and Organization
- ~200 hours of virtuosic piano performances captured with fine alignment (~3 ms) between note labels and audio waveforms
OpenMIC-2018
Clotho An Audio Captioning Dataset
Clotho-AQA A Crowdsourced Dataset for Audio Question Answering
Spotify data sources:
- Developer.spotify.com
- Spotify.design
- Research.spotify.com
- News.spotify.com
- Backstage.io
  - open source framework for building developer portals
  - doesn’t have Spotify in the name but is built by them too
- Investors.spotify.com
- Spotifycodes.com

Multimodal Datasets

Generic section for all datasets that combine vision, speech, audio and other modalities. Some overlap with the speech and other sections is fine.

LAION datasets (https://laion.ai/projects/) including:
Coyo dataset - 747M image-text pairs as well as many other meta-attributes to increase the usability to train various models
Conceptual 12M Pushing Web-Scale Image-Text Pre-Training To Recognize Long-Tail Visual Concepts
Im2Text Describing Images Using 1 Million Captioned Photographs - SBU Captions

Vision Datasets

COCO: Microsoft COCO Common Objects in Context
COIN: COIN A Large-scale Dataset for Comprehensive Instructional Video Analysis
CrossTask: Cross-task weakly supervised learning from instructional videos
DeVAn: DeVAn Dense Video Annotation for Video-Language Models
DiDeMo:
EPIC-KITCHENS:
HMDB
HowTo100M: HowTo100M Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips
JFT-300M: Revisiting Unreasonable Effectiveness of Data in Deep Learning Era
JFT-3B
JHMDB
Kinetics: Quo Vadis, Action Recognition A New Model and the Kinetics Dataset
MSR-VTT:
Something-Something V2 (SS-V2): The something something video database for learning and evaluating visual common sense
Tasty: TASTY A Transformer based Approach to Space and Time complexity
TVQA: TVQA Localized, Compositional Video Question Answering
UCF101
UCF101-24
YT-Temporal-180M introduced in MERLOT Multimodal Neural Script Knowledge Models
YT-Temporal-1B introduced in MERLOT Reserve Neural Script Knowledge through Vision and Language and Sound
YouCook2: Towards Automatic Learning of Procedures from Web Instructional Videos
Video Instruction Dataset: Video-ChatGPT Towards Detailed Video Understanding via Large Vision and Language Models
Visual Genome: Visual Genome Connecting Language and Vision Using Crowdsourced Dense Image Annotations

Evaluation datasets for vision (some of which are missing above) can be found in Evaluation

Text Datasets

BigDocs An Open and Permissively-Licensed Dataset for Training Multimodal Models on Document and Code Tasks
Toxicity of the Commons Curating Open-Source Pre-Training Data
The Pile An 800GB Dataset of Diverse Text for Language Modeling
C4 from the T5 paper: Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
RedPajama-Data-v2 An open dataset with 30 trillion tokens for training large language models
MASSIVE A 1M-Example Multilingual Natural Language Understanding Dataset with 51 Typologically-Diverse Languages
Training Verifiers to Solve Math Word Problems
MAWPS A Math Word Problem Repository
Global MMLU Understanding and Addressing Cultural and Linguistic Biases in Multilingual Evaluation
Measuring Massive Multitask Language Understanding
EMMA-500 Enhancing Massively Multilingual Adaptation of Large Language Models
Europarl: A Parallel Corpus for Statistical Machine Translation (Philipp Koehn, MT Summit 2005) - consists of the European Parliament Proceedings Parallel Corpus, covering 1996 to 2011
RecipeNLG
Common Corpus: common_corpus from PleIAs (Datasets at Hugging Face)

Parallel Corpora / Machine Translation Corpora

Instruction Tuning / Supervised Finetuning Datasets (Textual)

Look under 👉 Instruction Tuning for Large Language Models A Survey

Neuroscience Datasets

MEG-MASC a high-quality magneto-encephalography dataset for evaluating natural speech processing

See Data availability
