Speech Datasets

See also:

Speech Translation Datasets

  • How2 - English instructional videos with Portuguese translations
  • Augmented Librispeech - English audiobooks (Project Gutenberg) from the LibriSpeech Corpus translated into written French text
  • MuST-C: a Multilingual Speech Translation Corpus - from FBK, a corpus of English TED talks translated into 14 languages
  • CoVoST and CoVoST 2 and Massively Multilingual Speech-to-Text Translation - CoVoST 2 has one-to-many and many-to-one datasets in 15 languages. Both CoVoST 1 and 2 use Mozilla Common Voice data
  • Europarl-ST - from European Parliament proceedings from 2008-2012. Has multiple sources and targets for both speech and text. 4 languages.
  • VoxPopuli - an expansion of Europarl-ST that extends it with data from 2009-2020
  • Kosp2E - Korean speech audio with English parallel texts. Domains: newspapers (Zeroth), textbooks (KSS), AI agents (StyleKQC), Covid diaries (Covid-ED; lots of emotional speech audio)
  • GigaST - English speech audio translated to German and Chinese text. Based on the GigaSpeech ASR corpus: 10,000 hours of transcribed speech from audiobooks, podcasts and YouTube.
  • Prabhupadavani - code-mixed speech, mainly English with Bengali and Sanskrit interspersed, text in 25 languages

Besides these popular ST datasets, there are some other smaller size datasets such as Fisher (Cieri et al., 2004), Call-Home, Godard Corpus (Godard et al., 2017), Glosse Audio Corpus, BTEC, WSJ, IWSLT, CHiME-4 Corpus (Christensen et al., 2010), Miami Corpus (Deuchar, 2008), and MSLT Corpus (Federmann and Lewis, 2016). (Quoted from: Recent Advances in Direct Speech-to-text Translation)

Speech Dataset Lists / Directories

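import torch
import torchaudio

# Download the YESNO dataset into the current directory and wrap it in a DataLoader.
yesno_data = torchaudio.datasets.YESNO('.', download=True)
data_loader = torch.utils.data.DataLoader(
    yesno_data,
    batch_size=1,
    shuffle=True,
    num_workers=0)  # was args.nThreads (undefined here); set to the desired worker count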
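
The loader above uses batch_size=1, which avoids the fact that speech clips usually differ in length and cannot be stacked by the default collate function. Below is a minimal sketch of one common workaround, zero-padding each batch to its longest clip. FakeYesNo and pad_collate are hypothetical names introduced for illustration only; FakeYesNo is a synthetic stand-in that mimics the (waveform, sample_rate, labels) items YESNO returns, so the sketch runs without downloading anything. The 8 kHz sample rate and the clip lengths are assumptions.

```python
import torch
from torch.utils.data import DataLoader, Dataset


class FakeYesNo(Dataset):
    """Hypothetical stand-in for YESNO: yields (waveform, sample_rate, labels)."""

    def __init__(self, lengths, sample_rate=8000):  # 8 kHz is an assumption
        self.lengths = lengths
        self.sample_rate = sample_rate

    def __len__(self):
        return len(self.lengths)

    def __getitem__(self, idx):
        waveform = torch.zeros(1, self.lengths[idx])  # (channels, samples)
        labels = [idx % 2] * 8  # eight 0/1 labels per clip, as in YESNO
        return waveform, self.sample_rate, labels


def pad_collate(batch):
    """Zero-pad waveforms to the longest clip in the batch and stack them."""
    waveforms, sample_rates, labels = zip(*batch)
    lengths = torch.tensor([w.shape[-1] for w in waveforms])
    channels = waveforms[0].shape[0]
    padded = torch.zeros(len(waveforms), channels, int(lengths.max()))
    for i, w in enumerate(waveforms):
        padded[i, :, : w.shape[-1]] = w
    # Assumes every clip in the batch shares one sample rate.
    return padded, lengths, torch.tensor(labels), sample_rates[0]


loader = DataLoader(FakeYesNo([5, 3, 7, 4]), batch_size=2, collate_fn=pad_collate)
batches = list(loader)
```

Returning the original lengths alongside the padded tensor lets downstream code mask out the padding. With a real dataset, the same collate_fn can be passed to the DataLoader in place of the synthetic one.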

Audio Datasets

Multimodal Datasets

Generic section for all datasets that combine vision, speech, audio and other modalities. Some overlap with the speech and other sections is fine.

Vision Datasets

Evaluation datasets for vision (some of which are missing above) can be found in Evaluation.

Text Datasets

Parallel Corpora / Machine Translation Corpora

Instruction Tuning / Supervised Finetuning Datasets (Textual)

Look under 👉 Instruction Tuning for Large Language Models: A Survey

Neuroscience Datasets

See Data availability