🪴 Anil's Garden


VoxCommunis: A Corpus for Cross-linguistic Phonetic Analysis

17 Jun 2025, 3 min read

paper
low-resource
phonetics
speech
lrec/2022
annotated

Title: VoxCommunis: A Corpus for Cross-linguistic Phonetic Analysis
Authors: Emily Ahn, Eleanor Chodroff
Published: 2022-06-01
Link: https://aclanthology.org/2022.lrec-1.566/

Abstract

Cross-linguistic phonetic analysis has long been limited by data scarcity and insufficient computational resources. In the past few years, the availability of large-scale cross-linguistic spoken corpora has increased dramatically, but the data still require considerable computational power and processing for downstream phonetic analysis. To facilitate large-scale cross-linguistic phonetic research in the field, we release the VoxCommunis Corpus, which contains acoustic models, pronunciation lexicons, and word- and phone-level alignments, derived from the publicly available Mozilla Common Voice Corpus. The current release includes data from 36 languages. The corpus also contains acoustic-phonetic measurements, which currently consist of formant frequencies (F1–F4) from all vowel quartiles. Major advantages of this corpus for phonetic analysis include the number of available languages, the large amount of speech per language, as well as the fact that most language datasets have dozens to hundreds of contributing speakers. We demonstrate the utility of this corpus for downstream phonetic research in a descriptive analysis of language-specific vowel systems, as well as an analysis of “uniformity” in vowel realization across languages. The VoxCommunis Corpus is free to download and use under a CC0 license.

Data: https://huggingface.co/datasets/pacscilab/VoxCommunis
- Previously hosted on OSF; moved to Hugging Face (HF) for space reasons.
Code: https://github.com/pacscilab/voxcommunis

Overview from OSF

Due to space limitations, the current version of the VoxCommunis Corpus is now maintained on HuggingFace at: https://huggingface.co/datasets/pacscilab/VoxCommunis/

The VoxCommunis Corpus contains acoustic models, lexicons, and force-aligned TextGrids with phone- and word-level segmentations derived from the Mozilla Common Voice Corpus, which contains audio data with transcriptions from over 70 languages. Both the Mozilla Common Voice Corpus and the derivative VoxCommunis Corpus are free to download and use under a CC0 license. As of writing, most files are based on Common Voice Version 7.0; files with the suffix “_cv10” are based on Common Voice Version 10.0.
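The suffix convention above can be checked programmatically when sorting downloaded files. The following is a minimal sketch, assuming the suffix sits at the end of the file name before any extension. The function name and the example file names are hypothetical, and the sketch only covers the two versions mentioned in this overview.

```python
from pathlib import Path


def common_voice_version(filename: str) -> str:
    """Guess the Common Voice version a VoxCommunis file is based on.

    Assumption: "_cv10" appears at the end of the name, before any
    extension(s). Anything without that suffix is treated as 7.0,
    following the overview text.
    """
    # Drop directories and all extensions (e.g. ".tar.gz").
    stem = Path(filename).name.split(".")[0]
    return "10.0" if stem.endswith("_cv10") else "7.0"


# Hypothetical file names, for illustration only.
for name in ["some_lexicon_cv10.txt", "some_lexicon.txt"]:
    print(name, "is based on Common Voice", common_voice_version(name))
```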

The lexicons were developed using Epitran and the XPF Corpus, both of which are rule-based G2P (grapheme-to-phoneme) systems. Some manual correction has been applied, and we hope to continue improving the lexicons. Any updates from the community are welcome.

The acoustic models have been trained using the Montreal Forced Aligner (version 2.0), and the force-aligned TextGrids are obtained directly from those alignments. These acoustic models can be downloaded and re-used with the Montreal Forced Aligner for new data.
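As a rough illustration of re-using a downloaded model on new data, the sketch below assembles an `mfa align` call (corpus directory, dictionary, acoustic model, output directory, in that order, as in the Montreal Forced Aligner 2.x command line). All paths are hypothetical placeholders, and the helper name is my own. The command is only printed here, so nothing runs unless you pass it to `subprocess.run` on a machine with MFA installed.

```python
import shlex


def build_mfa_align_command(corpus_dir, dictionary_path, acoustic_model_path, output_dir):
    """Return the argument list for an `mfa align` invocation.

    Positional order follows the MFA 2.x CLI:
    corpus directory, pronunciation dictionary, acoustic model, output directory.
    """
    return [
        "mfa",
        "align",
        str(corpus_dir),
        str(dictionary_path),
        str(acoustic_model_path),
        str(output_dir),
    ]


# Hypothetical paths: replace with your own data and the files
# downloaded from the VoxCommunis repository.
cmd = build_mfa_align_command(
    "my_new_recordings",
    "downloaded_lexicon.dict",
    "downloaded_acoustic_model.zip",
    "aligned_textgrids",
)
print(shlex.join(cmd))
# To actually run it (requires MFA to be installed):
#   import subprocess; subprocess.run(cmd, check=True)
```

Building the command as a list rather than a single string avoids shell-quoting problems if any path contains spaces.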

The spkr_files contain the mapping from the original client_id to the simplified spkr_id in the formants data. The speaker IDs in the formant data are based on the client_id order in the validated set of Common Voice Version 7.0 and are generated by running remap_spkrs.py on validated.tsv (included in the Common Voice language-specific download).
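To make the idea of the mapping concrete, here is a minimal sketch of one way to derive simplified speaker IDs from `validated.tsv`. This is not the authors' `remap_spkrs.py`: it assumes IDs are integers assigned in order of first appearance of each `client_id`, which is only one reading of "client_id order". The actual script may number or format IDs differently, so use the published spkr_files when matching against the formant data.

```python
import csv
import io


def remap_speakers(tsv_file, id_column="client_id"):
    """Map each client_id to a small integer in order of first appearance.

    `tsv_file` is any open text file (or file-like object) in the
    tab-separated Common Voice format with a header row.
    """
    # QUOTE_NONE: transcribed sentences may contain stray quote characters.
    reader = csv.DictReader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    mapping = {}
    for row in reader:
        client_id = row[id_column]
        if client_id not in mapping:
            mapping[client_id] = len(mapping) + 1
    return mapping


# Tiny made-up sample standing in for validated.tsv.
sample = (
    "client_id\tpath\tsentence\n"
    "abc\tclip1.mp3\thello\n"
    "xyz\tclip2.mp3\tworld\n"
    "abc\tclip3.mp3\tagain\n"
)
print(remap_speakers(io.StringIO(sample)))
```

With a real download, the same function could be called as `remap_speakers(open("validated.tsv", encoding="utf-8", newline=""))`.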

For use of this derivative data, please cite the original corpus (Mozilla Common Voice Corpus), as well as:

Ahn, Emily, and Chodroff, Eleanor. (2022). VoxCommunis: A corpus for cross-linguistic phonetic analysis. Proceedings of the Thirteenth Language Resources and Evaluation Conference (LREC 2022).

Some errors have been flagged for the Armenian corpus: it merges Western and Eastern Armenian and misses several schwas in the transcription. Please be forewarned! If you identify other major errors, let us know; we will document the limitations and, if and when possible, update the resource. We will likely migrate this repository over to GitHub soon to enable pull requests and facilitate updates.

Some additional code relevant to the classification of speakers as having “high” or “low” formant settings can be found here: https://github.com/emilyahn/outliers/blob/main/src/assign_formant_range.py

Source: https://osf.io/t957v/wiki/home/

Epitran: https://github.com/dmort27/epitran

