Speech Datasets
- MSR-86K An Evolving, Multilingual Corpus with 86,300 Hours of Transcribed Audio for Speech Recognition Research
- Libriheavy a 50,000 hours ASR corpus with punctuation casing and context
- AISHELL-2 Transforming Mandarin ASR Research Into Industrial Scale
- VoxCommunis A Corpus for Cross-linguistic Phonetic Analysis
- Learning Audio-Text Agreement for Open-vocabulary Keyword Spotting
- People's Speech Dataset MLCommons Datasets - from https://mlcommons.org/
- Multilingual Spoken Words Dataset MLCommons Datasets - from https://mlcommons.org/
- Speech Commands A Dataset for Limited-Vocabulary Speech Recognition
- Towards Measuring Fairness in AI the Casual Conversations Dataset
- TED-LIUM an Automatic Speech Recognition dedicated corpus
- TED-LIUM 3 twice as much data and corpus repartition for experiments on speaker adaptation
- The M-AILABS Speech Dataset
- The AMI Meeting Corpus
- FLEURS Few-shot Learning Evaluation of Universal Representations of Speech
- FLEURS-R A Restored Multilingual Speech Corpus for Generation Tasks
- Building a Time-Aligned Cross-Linguistic Reference Corpus from Language Documentation Data (DoReCo)
- Speech-MASSIVE A Multilingual Speech Dataset for SLU and Beyond
- Europarl-ST A Multilingual Corpus For Speech Translation Of Parliamentary Debates
- Europarl-ASR A Large Corpus of Parliamentary Debates for Streaming ASR Benchmarking and Speech Data Filtering/Verbatimization
- CoVoST A Diverse Multilingual Speech-To-Text Translation Corpus
- CoVoST 2 and Massively Multilingual Speech-to-Text Translation
- MuST-C a Multilingual Speech Translation Corpus
- MLS A Large-Scale Multilingual Dataset for Speech Research
- VoxPopuli A Large-Scale Multilingual Speech Corpus for Representation Learning, Semi-Supervised Learning and Interpretation
- SPGISpeech 5,000 hours of transcribed financial audio for fully formatted end-to-end speech recognition
- GigaSpeech An Evolving, Multi-domain ASR Corpus with 10,000 Hours of Transcribed Audio
- The People's Speech A Large-Scale Diverse English Speech Recognition Dataset for Commercial Usage
- Librispeech An ASR corpus based on public domain audio books
- LibriSpeech is a corpus of approximately 1,000 hours of read English speech with a sampling rate of 16 kHz, prepared by Vassil Panayotov with the assistance of Daniel Povey. The data is derived from read audiobooks from the LibriVox project, and has been carefully segmented and aligned
- available at OpenSLR
- Libri-Light A Benchmark for ASR with Limited or No Supervision
- Meta blog post: Libri-light
- Preparation and download script
- large.tar (51,934 hours, 3.05 TB; download link: https://dl.fbaipublicfiles.com/librilight/data/large.tar)
- LibriTTS A Corpus Derived from LibriSpeech for Text-to-Speech
- CVSS Corpus and Massively Multilingual Speech-to-Speech Translation
- Fisher and CALLHOME Spanish-English Speech Translation
- Festvox CMU_ARCTIC Databases
- constructed at the Language Technologies Institute at Carnegie Mellon University as phonetically balanced
- US English single speaker databases
- designed for unit selection speech synthesis research
- ~1150 utterances carefully selected from out-of-copyright texts from Project Gutenberg
- databases include US English male (bdl) and female (slt) speakers (both experienced voice talent), as well as other accented speakers
- The 1132-sentence prompt list is available from cmuarctic.data
- the distributions include 16 kHz waveforms and simultaneous EGG signals
- full phonetic labelling was performed with CMU Sphinx using the FestVox-based labelling scripts
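The prompt list uses Festival's parenthesized data format, pairing each utterance id with its prompt text on one line. A minimal parsing sketch (the sample line follows that format but is reproduced from memory, not copied from the distribution):

```python
import re

# Each line of cmuarctic.data looks like:
#   ( arctic_a0001 "Author of the danger trail, Philip Steels, etc." )
LINE_RE = re.compile(r'\(\s*(\S+)\s+"(.*)"\s*\)')

def parse_arctic_prompts(text):
    """Return {utterance_id: prompt} from cmuarctic.data-style content."""
    prompts = {}
    for line in text.splitlines():
        m = LINE_RE.match(line.strip())
        if m:
            prompts[m.group(1)] = m.group(2)
    return prompts

sample = '( arctic_a0001 "Author of the danger trail, Philip Steels, etc." )'
print(parse_arctic_prompts(sample))
```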
- AISHELL-1 An Open-Source Mandarin Speech Corpus and A Speech Recognition Baseline - Aishell is an open-source Chinese Mandarin speech corpus
- 400 people from different accent areas in China were invited to participate in the recording, which was conducted in a quiet indoor environment using a high-fidelity microphone and downsampled to 16 kHz
- manual transcription accuracy is above 95%, through professional speech annotation and strict quality inspection
- free for academic use
- published by Beijing Shell Shell Technology Co. Ltd.
- TIMIT Acoustic-Phonetic Continuous Speech Corpus - Linguistic Data Consortium
- Available: https://github.com/philipperemy/timit
- Paper / Docs from NIST: https://nvlpubs.nist.gov/nistpubs/Legacy/IR/nistir4930.pdf
- published 1993
- standard dataset used for the evaluation of automatic speech recognition (ASR) systems
- 630 speakers
- 8 dialects of American English
- each speaker reads 10 phonetically-rich sentences
- corpus includes time-aligned
- orthographic
- phonetic
- word transcriptions
- 16-bit, 16 kHz speech waveforms
- transcriptions have been hand-verified
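The time-aligned phonetic transcriptions ship as .PHN files whose lines give a begin sample, an end sample, and a phone label. A minimal parsing sketch (the example lines are illustrative, not copied from an actual corpus file):

```python
def parse_phn(text, sample_rate=16_000):
    """Parse TIMIT .PHN alignment lines: 'begin_sample end_sample phone'.

    Returns a list of (phone, start_seconds, end_seconds) tuples.
    """
    segments = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 3:
            continue
        begin, end, phone = int(parts[0]), int(parts[1]), parts[2]
        segments.append((phone, begin / sample_rate, end / sample_rate))
    return segments

# Illustrative lines in the .PHN layout
example = "0 3050 h#\n3050 4559 sh\n4559 5723 iy"
for phone, start, end in parse_phn(example):
    print(f"{phone}: {start:.3f}s - {end:.3f}s")
```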
- AVSpeech Audio Visual Speech Dataset
- See paper: Looking to Listen at the Cocktail Party A Speaker-Independent Audio-Visual Model for Speech Separation
- large-scale audio-visual dataset
- speech video clips with no interfering background noises
- segments 3-10 seconds long
- in each clip the audible sound in the soundtrack belongs to a single speaking person, visible in the video
- ~ 4,700 hours of video segments
- from ~290k YouTube videos
- spans a wide variety of people, languages and face poses
- VoxCeleb2 Deep Speaker Recognition
- audio-visual speaker recognition dataset
- over 1 million utterances
- 6,000 speakers (celebrities)
- 61% of speakers are male (gender balanced-ish)
- span ethnicities, accents and ages
- varied visual and auditory environments
- from Zisserman's VGG @ Oxford
- WenetSpeech
- WenetSpeech A 10000+ Hours Multi-domain Mandarin Corpus for Speech Recognition
- Repository (downloaders?)
- Non-commercial licence
- 10,000+ hours
- multi-domain Chinese ASR data (slight pain to download)
- Mozilla Common Voice
- We're building an open source, multi-language dataset of voices that anyone can use to train speech-enabled applications.
- VCTK
- CSTR VCTK Corpus
- 110 English speakers
- with various accents
- each speaker reads out ~400 sentences
- sentences selected from a newspaper
- plus the rainbow passage and an elicitation paragraph used for the Speech Accent Archive
- downloadable via Hugging Face datasets
- Hi-Fi Multi-Speaker English TTS Dataset
- CSS10 A Collection of Single Speaker Speech Datasets for 10 Languages
- The LJ Speech Dataset
- HUI-Audio-Corpus-German
- MAVD The First Open Large-Scale Mandarin Audio-Visual Dataset with Depth Information
- Sam, the Non-Binary TTS Voice
- Sam is a non-binary text-to-speech (TTS) voice that can be embedded into any voice assistant software solution. Accenture Labs created Sam in collaboration with Cereproc. By open-sourcing the components that we used to create this voice, we hope to encourage adoption and creation of others like it in the future so that there will eventually be a diversity of non-binary voices out in the world.
- The Spotify Podcast Dataset - this paper was replaced by the Coling paper: 100,000 Podcasts A Spoken English Document Corpus
- is this dataset still available?
- Emilia An Extensive, Multilingual, and Diverse Speech Dataset for Large-Scale Speech Generation
- available via HF: https://huggingface.co/datasets/amphion/Emilia-Dataset (also included in document metadata)
- Building Naturalistic Emotionally Balanced Speech Corpus by Retrieving Emotional Speech from Existing Podcast Recordings
See also:
- Phoneme-based Speech Datasets in The taste of IPA Towards open-vocabulary keyword spotting and forced alignment in any language
Speech Translation Datasets
- How2 - English instructional videos with Portuguese translations
- Augmented Librispeech - English audiobooks (Gutenberg Project) from the LibriSpeech Corpus translated into written French text
- MuST-C a Multilingual Speech Translation Corpus - from FBK, a corpus of English TED talks translated into 14 languages
- CoVoST and CoVoST 2 and Massively Multilingual Speech-to-Text Translation - CoVoST 2 has one-to-many and many-to-one datasets in 15 languages. Both CoVoST 1 and 2 use Mozilla Common Voice data
- Europarl-ST - from European Parliament proceedings from 2008-2012. Has multiple sources and targets for both speech and text. 4 languages.
- VoxPopuli - expansion deck for Europarl-ST - extends it with data from 2009-2020
- Kosp2E - Korean speech audio with English parallel texts. Domains: Newspapers (Zeroth), textbooks (KSS), EmphStyleKQC (AI stuff), Covid-ED (Covid diaries; lots of emotional speech audio)
- GigaST - English speech audio translated to German and Chinese text. Based on GigaSpeech ASR corpus - 10,000 hours transcribed speech, audiobooks, podcasts and YouTube.
- Prabhupadavani - code-mixed speech, mainly English with Bengali and Sanskrit interspersed, text in 25 languages
Besides these popular ST datasets, there are some other smaller datasets such as Fisher (Cieri et al., 2004), CallHome, the Godard Corpus (Godard et al., 2017), the Glosse Audio Corpus, BTEC, WSJ, IWSLT, the CHiME-4 Corpus (Christensen et al., 2010), the Miami Corpus (Deuchar, 2008), and the MSLT Corpus (Federmann and Lewis, 2016). (quoted from Recent Advances in Direct Speech-to-text Translation)
Speech Dataset Lists / Directories
- TensorFlow Datasets
- Google Research > Resources > Datasets
- In order to contribute to the broader research community, Google periodically releases data of interest to researchers in a wide range of computer science disciplines.
- Google Dataset Search
- Papers with Code Datasets
- Text-to-speech datasets - Hugging Face Audio Course - see original website
- OpenSLR Resources - We are open to hosting any type of data that's useful for speech recognition and related tasks, that needs a stable URL where it can be downloaded from. We may think more carefully in cases where the data is very large (e.g. tens of gigabytes or more).
- View Datasets provided by MLCommons
- Spotify Research Data Sources (link):
- TORCHAUDIO.DATASETS
- All datasets are subclasses of torch.utils.data.Dataset and have __getitem__ and __len__ methods implemented. Hence, they can all be passed to a torch.utils.data.DataLoader, which can load multiple samples in parallel using torch.multiprocessing workers. For example:

```python
import torch
import torchaudio

yesno_data = torchaudio.datasets.YESNO('.', download=True)
data_loader = torch.utils.data.DataLoader(
    yesno_data,
    batch_size=1,
    shuffle=True,
    num_workers=2)
```
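Since raw clips differ in duration, batch_size values above 1 generally require a custom collate_fn that pads waveforms to a common length before stacking. A minimal sketch using synthetic 1-D tensors (the (waveform, label) item shape is an assumption for illustration; real torchaudio dataset items are richer tuples that also carry the sample rate):

```python
import torch
from torch.nn.utils.rnn import pad_sequence

def collate_audio(batch):
    """Zero-pad variable-length waveforms so they can be stacked.

    Assumes each item is a (waveform, label) pair with waveform shaped (time,).
    """
    waveforms, labels = zip(*batch)
    padded = pad_sequence(waveforms, batch_first=True)  # (batch, max_time)
    lengths = torch.tensor([w.shape[0] for w in waveforms])
    return padded, lengths, list(labels)

# Synthetic stand-in for dataset items of different durations
batch = [(torch.ones(5), 0), (torch.ones(3), 1)]
padded, lengths, labels = collate_audio(batch)
print(padded.shape)
```

The function would then be passed to the loader as torch.utils.data.DataLoader(dataset, batch_size=8, collate_fn=collate_audio).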
Audio Datasets
- The MAESTRO Dataset - MIDI and Audio Edited for Synchronous TRacks and Organization
- ~200 hours of virtuosic piano performances captured with fine alignment (~3 ms) between note labels and audio waveforms
- OpenMIC-2018
- Clotho An Audio Captioning Dataset
- Clotho-AQA A Crowdsourced Dataset for Audio Question Answering
- Spotify data sources:
- Developer.spotify.com
- Spotify.design
- Research.spotify.com
- News.spotify.com
- Backstage.io
- open source framework for building developer portals
- doesn't have Spotify in the name but is built by them too
- Investors.spotify.com
- Spotifycodes.com
Multimodal Datasets
Generic section for all datasets with vision, speech, audio and other. Some overlap with speech/other sections is fine.
- LAION datasets (https://laion.ai/projects/) including:
- Coyo dataset - 747M image-text pairs, with many additional meta-attributes to increase usability for training various models
- Conceptual 12M Pushing Web-Scale Image-Text Pre-Training To Recognize Long-Tail Visual Concepts
- Im2Text Describing Images Using 1 Million Captioned Photographs - SBU Captions
Vision Datasets
- COCO: Microsoft COCO Common Objects in Context
- COIN: COIN A Large-scale Dataset for Comprehensive Instructional Video Analysis
- CrossTask: Cross-task weakly supervised learning from instructional videos
- DeVAn: DeVAn Dense Video Annotation for Video-Language Models
- DiDeMo:
- EPIC-KITCHENS:
- HMDB
- HowTo100M: HowTo100M Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips
- JFT-300M: Revisiting Unreasonable Effectiveness of Data in Deep Learning Era
- JFT-3B
- JHMDB
- Kinetics: Quo Vadis, Action Recognition A New Model and the Kinetics Dataset
- MSR-VTT:
- Something-Something V2 (SS-V2): The something something video database for learning and evaluating visual common sense
- Tasty: TASTY A Transformer based Approach to Space and Time complexity
- TVQA: TVQA Localized, Compositional Video Question Answering
- UCF101
- UCF101-24
- YT-Temporal-180M introduced in MERLOT Multimodal Neural Script Knowledge Models
- YT-Temporal-1B introduced in MERLOT Reserve Neural Script Knowledge through Vision and Language and Sound
- YouCook2: Towards Automatic Learning of Procedures from Web Instructional Videos
- Video Instruction Dataset: Video-ChatGPT Towards Detailed Video Understanding via Large Vision and Language Models
- Visual Genome: Visual Genome Connecting Language and Vision Using Crowdsourced Dense Image Annotations
Evaluation datasets for vision (some of which are missing above) can be found under Evaluation
Text Datasets
- BigDocs An Open and Permissively-Licensed Dataset for Training Multimodal Models on Document and Code Tasks
- Toxicity of the Commons Curating Open-Source Pre-Training Data
- The Pile An 800GB Dataset of Diverse Text for Language Modeling
- C4 from the T5 paper: Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
- RedPajama-Data-v2 An open dataset with 30 trillion tokens for training large language models
- MASSIVE A 1M-Example Multilingual Natural Language Understanding Dataset with 51 Typologically-Diverse Languages
- Training Verifiers to Solve Math Word Problems
- MAWPS A Math Word Problem Repository
- Global MMLU Understanding and Addressing Cultural and Linguistic Biases in Multilingual Evaluation
- Measuring Massive Multitask Language Understanding
- EMMA-500 Enhancing Massively Multilingual Adaptation of Large Language Models
- Europarl: A Parallel Corpus for Statistical Machine Translation Philipp Koehn, MT Summit 2005 PDF - consists of European Parliament Proceedings Parallel Corpus from 1996 to 2011
- RecipeNLG
- Common Corpus: common_corpus from PleIAs (PleIAs/common_corpus · Datasets at Hugging Face)
Parallel Corpora / Machine Translation Corpora
Instruction Tuning / Supervised Finetuning Datasets (Textual)
Look under Instruction Tuning for Large Language Models A Survey