Over 1.5 TB of Labeled Audio Datasets

List of 25 Large Audio Datasets I use for my audio research

Christopher Dossman · Towards Data Science

At Wonder Technologies, we have spent a lot of time building deep learning systems that understand the world through audio. From deep-learning-based voice extraction to teaching computers how to read our emotions, we needed a wide range of data to deliver APIs that worked even in the craziest sound environments. Here is a list of datasets that I found useful in our research and that I’ve personally used to make my audio-related models perform much better in real-world environments.

Trying to build a custom dataset? Not sure where to start? Join me for a 30-minute one-on-one to talk about your project. Sign up for a time slot.

Music Datasets

Free Music Archive

FMA is a dataset for music analysis. The dataset consists of full-length and HQ audio, pre-computed features, and track and user-level meta-data. It is an open dataset created for evaluating several tasks in Music Information Retrieval (MIR).

This one’s huge, almost 1000 GB in size.

Million Song Dataset

The Million Song Dataset is a freely-available collection of audio features and meta-data for a million contemporary popular music tracks. The core of the dataset is the feature analysis and meta-data for one million songs. The dataset does not include any audio, only the derived features. The sample audio can be fetched from services like 7digital, using the code provided by Columbia University. The size of this dataset is about 280 GB.
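Since the derived features ship as one HDF5 file per track, a quick way to see what a file contains is to walk it with h5py. A minimal sketch, assuming you have at least one track file downloaded (the filename below is a placeholder):

```python
import h5py

# Minimal sketch: inspect the feature layout of one MSD track file.
# "some_track.h5" stands in for any per-track HDF5 file from the set.
with h5py.File("some_track.h5", "r") as f:
    def show(name, obj):
        # Print every stored array with its shape and dtype.
        if isinstance(obj, h5py.Dataset):
            print(name, obj.shape, obj.dtype)
    f.visititems(show)
```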


Speech Datasets

Free Spoken Digit Dataset

This one was created for the task of identifying spoken digits in audio samples. It’s an open dataset, so the hope is that it will keep growing as people contribute more samples. Currently, it contains:

  • 3 speakers
  • 1,500 recordings (50 of each digit per speaker)
  • English pronunciations

This is a really small set, about 10 MB in size.
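If you want to poke at it quickly, here is a minimal Python sketch for loading the clips and deriving labels, assuming the dataset’s usual {digit}_{speaker}_{index}.wav naming scheme and librosa for audio I/O (the directory path is a placeholder):

```python
import os
import librosa  # assumed available for audio loading

def load_fsdd(recordings_dir):
    """Load FSDD clips; the spoken digit is encoded in each file name."""
    examples = []
    for name in sorted(os.listdir(recordings_dir)):
        if not name.endswith(".wav"):
            continue
        digit = int(name.split("_")[0])  # "{digit}_{speaker}_{index}.wav"
        audio, sr = librosa.load(os.path.join(recordings_dir, name), sr=None)
        examples.append((audio, sr, digit))
    return examples
```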

LibriSpeech

This dataset is a large-scale corpus of around 1,000 hours of English speech. The data is sourced from audiobooks from the LibriVox project and is 60 GB in size.
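One convenient way to get at LibriSpeech from Python is torchaudio’s built-in dataset wrapper, which downloads and indexes a split for you. A small sketch (the root directory is a placeholder, and it pulls the smaller train-clean-100 split rather than the full 1,000 hours):

```python
import torchaudio

# Download and index the "train-clean-100" split under ./data.
dataset = torchaudio.datasets.LIBRISPEECH(
    "./data", url="train-clean-100", download=True)

# Each item pairs a waveform with its transcript and speaker metadata.
waveform, sample_rate, transcript, speaker_id, chapter_id, utt_id = dataset[0]
print(sample_rate, transcript)
```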

VoxCeleb

VoxCeleb is a large-scale speaker identification dataset. It contains around 100,000 utterances from 1,251 celebrities, extracted from YouTube videos. The data is mostly gender balanced (55% of the speakers are male). The celebrities span a diverse range of accents, professions, and ages. There is no overlap between the development and test sets. It’s an intriguing use case for isolating and identifying which celebrity a voice belongs to.

This set is 150 MB in size and has about 2000 hours of speech.

The Spoken Wikipedia Corpora

This is a corpus of aligned spoken Wikipedia articles from the English, German, and Dutch Wikipedia. Hundreds of hours of aligned audio and annotations can be mapped back to the original HTML. The entire set is about 38 GB in size and is available both with and without audio.

Flickr Audio Caption Corpus

40,000 spoken captions of 8,000 natural images, 4.2 GB in size. This corpus was collected in 2015 to investigate multi-modal learning schemes for unsupervised speech pattern discovery.

TED-LIUM

Audio transcriptions of TED talks: 1,495 TED talk audio recordings along with full-text transcriptions of those recordings, created by the Laboratoire d’Informatique de l’Université du Maine (LIUM).

Speech Commands Dataset

The dataset (1.4 GB) has 65,000 one-second-long utterances of 30 short words, spoken by thousands of different people and contributed by members of the public through the AIY website. It’s released under a Creative Commons BY 4.0 license and will continue to grow in future releases as more contributions are received. The dataset is designed to let you build basic but useful voice interfaces for applications, with common words like “Yes”, “No”, digits, and directions included. The infrastructure used to create the data has been open-sourced too, and its creators hope to see it used by the wider community to create their own versions, especially to cover underserved languages and applications.
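The archive unpacks into one folder per word, with the wav files inside. A stdlib-only sketch for counting clips per label under that layout (the extraction path is a placeholder):

```python
import os

# Assumed layout: one directory per spoken word, each holding .wav clips.
root = "./speech_commands"
for label in sorted(os.listdir(root)):
    folder = os.path.join(root, label)
    if not os.path.isdir(folder) or label.startswith("_"):
        continue  # skip loose files and the _background_noise_ folder
    count = sum(name.endswith(".wav") for name in os.listdir(folder))
    print(label, count)
```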

Common Voice

Common Voice (12 GB in size) is a corpus of speech data read by users on the Common Voice website, based on text from a number of public domain sources like user-submitted blog posts, old books, movies, and other public speech corpora. Its primary purpose is to enable the training and testing of automatic speech recognition (ASR) systems.

Persian Consonant Vowel Combination (PCVC) Speech Dataset

The Persian Consonant Vowel Combination (PCVC) Speech Dataset is a Modern Persian speech corpus for speech recognition and speaker recognition. The dataset contains sound samples of Modern Persian consonant-vowel phoneme combinations from different speakers. Every sound sample contains exactly one consonant and one vowel, so it is effectively labeled at the phoneme level. The dataset covers 23 Persian consonants and 6 vowels; the sound samples are all possible vowel-consonant combinations (23 × 6 = 138 samples per speaker), each 30,000 data samples long.

If you use this dataset, please cite this paper:

Saber Malekzadeh, Mohammad Hossein Gholizadeh, Seyed Naser Razavi, “Full Persian Vowel Recognition with MFCC and ANN on PCVC Speech Dataset,” 5th International Conference of Electrical Engineering, Computer Science and Information Technology, Tehran, Iran, 2018. (PDF)

VoxForge

Clean speech dataset of accented English. Useful for instances in which you expect to need robustness to different accents or intonations.

CHIME

This is a noisy speech recognition challenge dataset (~4 GB in size). It contains real, simulated, and clean voice recordings: the real recordings consist of 4 speakers in nearly 9,000 recordings across 4 noisy locations; the simulated recordings are generated by combining clean speech utterances with multiple noisy environments; and the clean recordings are noise-free.

You can download the dataset from here.
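To make the “simulated” idea concrete, here is a minimal NumPy sketch of mixing a clean utterance with environment noise at a chosen signal-to-noise ratio. This illustrates the technique only, not the challenge’s official tooling, and the two arrays are assumed to share a sample rate:

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db):
    """Mix `noise` into `clean` so the result has the requested SNR in dB."""
    noise = noise[: len(clean)]  # trim the noise to the utterance length
    clean_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12  # avoid division by zero
    # Scale the noise so clean_power / (scale**2 * noise_power) hits the SNR.
    scale = np.sqrt(clean_power / (noise_power * 10 ** (snr_db / 10)))
    return clean + scale * noise
```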

2000 HUB5 English

English-only speech data used most recently in the Deep Speech paper from Baidu.

Parkinson Speech Dataset with Multiple Types of Sound Recordings Data Set

The training data comes from 20 Parkinson’s Disease (PD) patients and 20 healthy subjects. Multiple types of sound recordings (26 per subject) were taken for this 20 MB set.

Zero Resource Speech Challenge

The ultimate goal of the Zero Resource Speech Challenge is to construct a system that learns an end-to-end Spoken Dialog (SD) system, in an unknown language, from scratch, using only information available to a language-learning infant. “Zero resource” refers to zero linguistic expertise (e.g., orthographic/linguistic transcriptions), not zero information besides audio (visual cues, limited human feedback, etc.). The fact that 4-year-olds spontaneously learn a language without supervision from language experts shows that this goal is theoretically reachable.

ISOLET Data Set

This 38.7 GB dataset is for predicting which letter name was spoken, a simple classification task.

Arabic Speech Corpus

The Arabic Speech Corpus (1.5 GB) is a Modern Standard Arabic (MSA) speech corpus for speech synthesis. The corpus contains phonetic and orthographic transcriptions of more than 3.7 hours of MSA speech, aligned with the recorded speech at the phoneme level. The annotations include word stress marks on the individual phonemes. The corpus was developed as part of PhD work by Nawar Halabi at the University of Southampton and was recorded in South Levantine Arabic (Damascian accent) in a professional studio. Speech synthesized from this corpus produces a high-quality, natural voice.

TIMIT Corpus

The TIMIT corpus (440 MB) of read speech is designed to provide speech data for acoustic-phonetic studies and for the development and evaluation of automatic speech recognition systems. TIMIT contains broadband recordings of 630 speakers of eight major dialects of American English, each reading ten phonetically rich sentences. It includes time-aligned orthographic, phonetic and word transcriptions as well as a 16-bit, 16 kHz speech waveform file for each utterance.

Multimodal EmotionLines Dataset (MELD)

Multimodal EmotionLines Dataset (MELD) was created by enhancing and extending the EmotionLines dataset. MELD contains the same dialogue instances available in EmotionLines, but also encompasses audio and visual modalities along with the text. MELD has more than 1,400 dialogues and 13,000 utterances from the TV series Friends. Each utterance in a dialogue is labeled with one of seven emotions: Anger, Disgust, Sadness, Joy, Neutral, Surprise, or Fear. Download here.

Getting more data for your algorithms is one way to increase accuracy. Explore a few more tricks in the Deep Learning Performance Cheat Sheet.

Sound/Nature

AudioSet

An expanding ontology of 632 audio event classes and a collection of 2,084,320 human-labeled 10-second sound clips drawn from YouTube videos. To download this set, click on this GitHub link.
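AudioSet is distributed as CSV lists of labeled YouTube segments rather than raw audio. A small sketch for parsing one of those segment lists, assuming the published four-column layout (YouTube ID, start time, end time, comma-separated label IDs in a quoted field); the filename is a placeholder:

```python
import csv

def read_segments(path):
    """Parse an AudioSet segments CSV into (ytid, start, end, labels) tuples."""
    segments = []
    with open(path) as f:
        for row in csv.reader(f, skipinitialspace=True):
            if not row or row[0].startswith("#"):
                continue  # skip comment/header lines
            ytid, start, end = row[0], float(row[1]), float(row[2])
            labels = row[3].split(",")  # e.g. ["/m/09x0r", "/t/dd00088"]
            segments.append((ytid, start, end, labels))
    return segments

print(len(read_segments("balanced_train_segments.csv")))
```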

Mivia Audio Events Dataset

6,000 audio events for surveillance applications, namely glass breaking, gunshots, and screams. The events are divided into a training set of 4,200 events and a test set of 1,800 events.

To download this dataset, you must register yourself on the Mivia website.

Environmental Audio Datasets

This page maintains a list of datasets suitable for environmental audio research. In addition to the freely available datasets, proprietary and commercial datasets are listed for completeness. Some online sound services are also listed at the end of the page; these can be used to form new datasets for special research needs.

The datasets are divided into two tables:

  • Sound events table contains datasets suitable for research in the field of automatic sound event detection and automatic sound tagging.
  • Acoustic scenes table contains datasets suitable for research involving audio-based context recognition and acoustic scene classification.

FSD: a dataset of everyday sounds (Freesound)

The AudioSet Ontology is a hierarchical collection of over 600 sound classes, which have been populated with 297,159 audio samples from Freesound. This process generated 678,511 candidate annotations that express the potential presence of sound sources in audio clips. FSD includes a variety of everyday sounds, from human and animal sounds to music and sounds made by things, all under Creative Commons licenses. By creating this dataset, its creators seek to promote research that will enable machines to hear and interpret sound similarly to humans.

FSD is built through Freesound Datasets, a platform for the collaborative creation of audio collections labeled by humans and based on Freesound content.

Urban Sound Classification

The dataset (6 GB) is called UrbanSound and contains 8,732 labeled sound excerpts (up to 4 s each) of urban sounds from 10 classes: air conditioner, car horn, children playing, dog bark, drilling, engine idling, gun shot, jackhammer, siren, and street music. Each excerpt has two attributes: ID (the unique ID of the sound excerpt) and Class (the type of sound).
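A short sketch for indexing the set under the assumptions that the labels ship as a CSV with the ID and Class columns described above and that each clip is stored as "<ID>.wav"; both the CSV name and the audio directory here are hypothetical:

```python
import pandas as pd

# Assumed layout: a label file with `ID` and `Class` columns, plus a
# directory of audio clips named after each ID.
labels = pd.read_csv("train.csv")
print(labels["Class"].value_counts())      # excerpts per class

first = labels.iloc[0]
path = "train/{}.wav".format(first["ID"])  # locate the matching audio clip
print(path, first["Class"])
```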

Urban Sound Dataset

This dataset contains 1,302 labeled sound recordings. Each recording is labeled with the start and end times of sound events from 10 classes: air_conditioner, car_horn, children_playing, dog_bark, drilling, engine_idling, gun_shot, jackhammer, siren, and street_music. Each recording may contain multiple sound events, but for each file, only events from a single class are labeled. The classes are drawn from the urban sound taxonomy.

Bird Audio Detection challenge

In collaboration with the IEEE Signal Processing Society, a research data challenge was introduced to create a robust and scalable bird detection algorithm. The challenge contains new datasets (5.4 GB) collected in real-life bioacoustic monitoring projects, along with an objective, standardized evaluation framework.

Let’s also connect on Twitter or LinkedIn