Speech Datasets are under 👉 Datasets » Speech Datasets

Surveys & Reviews

Resources 📚

Evaluation, Leaderboards and Challenges

Metrics » Speech-to-Speech Translation (S2ST)

Automatic S2ST Metrics

  • ASR-BLEU: the speech output will be automatically transcribed with a Chinese ASR system trained on WenetSpeech, and then BLEU and chrF will be computed between the produced transcript and a textual human reference.
  • BLASER: a recently proposed text-free speech-to-speech translation evaluation metric, computed directly between the translated speech and the reference speech, with no transcription step.
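
In practice the BLEU/chrF step of ASR-BLEU is usually computed with a library such as sacrebleu, but the core of corpus BLEU is simple enough to sketch directly. The following is a minimal, hypothetical stdlib-only illustration (single reference per hypothesis, whitespace tokenization, uniform 4-gram weights), not the official implementation:

```python
import math
from collections import Counter

def corpus_bleu(hypotheses, references, max_n=4):
    """Minimal corpus-level BLEU sketch: clipped n-gram precision
    with a brevity penalty. One reference per hypothesis."""
    clipped = [0] * max_n   # clipped n-gram matches per order
    totals = [0] * max_n    # total hypothesis n-grams per order
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        h, r = hyp.split(), ref.split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            h_ngrams = Counter(tuple(h[i:i + n]) for i in range(len(h) - n + 1))
            r_ngrams = Counter(tuple(r[i:i + n]) for i in range(len(r) - n + 1))
            # Clip each hypothesis n-gram count by its count in the reference.
            clipped[n - 1] += sum(min(c, r_ngrams[g]) for g, c in h_ngrams.items())
            totals[n - 1] += max(len(h) - n + 1, 0)
    if min(clipped) == 0:
        return 0.0  # some n-gram order has zero matches
    log_prec = sum(math.log(c / t) for c, t in zip(clipped, totals)) / max_n
    # Brevity penalty: punish hypotheses shorter than the reference.
    bp = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * bp * math.exp(log_prec)

# A perfect match scores 100; partial overlap scores in between.
print(corpus_bleu(["the cat sat on the mat"], ["the cat sat on the mat"]))  # → 100.0
```

For real evaluation, sacrebleu additionally handles standardized tokenization (important for Chinese transcripts), smoothing, and chrF.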

Human S2ST Metrics (Human Evaluation; taken from IWSLT 2023)

  • Translation quality: bilingual annotators will be presented with the source audio and the target audio, and will give a score between 1 and 5.
  • Output speech quality: in addition to translation quality (capturing meaning), the quality of the speech output will also be human-evaluated along three dimensions: naturalness (voice and pronunciation), clarity of speech (understandability), and sound quality (noise and other artifacts). These axes are more fine-grained than the traditional overall MOS score.

The detailed guidelines for speech quality are as follows:

  • Naturalness: recordings that sound human-like, with natural-sounding pauses, stress, and intonation, should be given a high score. Recordings that sound robotic, flat, or otherwise unnatural should be given a low score.
  • Clarity of speech: recordings with clear speech and no mumbling or unclear phrases should be given a high score. Recordings with a large amount of mumbling or unclear phrases should be given a low score.
  • Sound quality: recordings with clean audio and no noise or static in the background should be given a high score. Recordings with a large amount of noise or static in the background should be given a low score.
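
Each axis yields a set of 1–5 annotator ratings that is typically aggregated into a per-axis mean opinion score. A minimal sketch of that aggregation, with entirely hypothetical ratings:

```python
import statistics

def mos(scores):
    """Mean opinion score over 1-5 ratings, with sample standard deviation."""
    return statistics.mean(scores), statistics.stdev(scores)

# Hypothetical annotator ratings for one system output, per quality axis.
ratings = {
    "naturalness": [4, 5, 4, 3, 4],
    "clarity": [5, 5, 4, 4, 5],
    "sound_quality": [3, 4, 3, 4, 3],
}
for axis, scores in ratings.items():
    m, s = mos(scores)
    print(f"{axis}: {m:.2f} ± {s:.2f}")
```

Reporting the three axes separately, rather than a single overall MOS, is exactly the fine-grained distinction described above.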

Challenges, Workshops & Conferences

Tools & Frameworks

See also resources filed under 👉 Audio, Speech and Music Tools

ASR

Text-to-Speech Tools

Speech Translation

The Textless NLP Project from Meta

Initiative from Meta, kicked off in 2021 and written up in Textless NLP: Generating expressive speech from raw audio, contemporaneous with the release of the papers:

  1. Generative Spoken Language Modeling from Raw Audio
  2. Speech Resynthesis from Discrete Disentangled Self-Supervised Representations
  3. Text-Free Prosody-Aware Generative Spoken Language Modeling

Implementations exist in the fairseq (v1) repo under examples/textless_nlp.

See any other papers tagged textless-nlp.

From Generative Spoken Language Modeling from Raw Audio:

Being able to achieve 'textless NLP' would be beneficial for the majority of the world's languages which do not have large textual resources or even a widely used standardized orthography (Swiss German, dialectal Arabic, Igbo, etc.), and which, despite being used by millions of users, have little chance of being served by current text-based technology. It would also be useful for 'high-resource' languages, where the oral and written forms often mismatch in terms of lexicon and syntax, and where some linguistically relevant signals carried by prosody and intonation are basically absent from text.

Audiocraft (Meta)

Release: AudioCraft: A simple one-stop shop for audio modeling
Code: https://github.com/facebookresearch/audiocraft

Groups

Groups doing significant work on speech, worth monitoring.

Footnotes

  1. Tom Bäckström, Okko Räsänen, Abraham Zewoudie, Pablo Pérez Zarazaga, Liisa Koivusalo, Sneha Das, Esteban Gómez Mellado, Mariem Bouafif Mansali, Daniel Ramos, Sudarsana Kadiri, Paavo Alku, and Mohammad Hassan Vali, “Introduction to Speech Processing”, 2nd Edition, 2022. URL: https://speechprocessingbook.aalto.fi, DOI: 10.5281/zenodo.6821775. ↩