Speech Datasets are under 👉 Datasets » Speech Datasets

Surveys & Reviews

Resources 📚

Evaluation, Leaderboards and Challenges

Metrics » Speech-to-Speech Translation (S2ST)

Automatic S2ST Metrics

  • ASR-BLEU: the speech output will be automatically transcribed with a Chinese ASR system trained on WenetSpeech, and then BLEU and chrF will be computed between the produced transcript and a textual human reference.
  • BLASER: a recently proposed text-free speech-to-speech translation evaluation metric, computed directly between the translated speech and the reference speech, with no transcription step.
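
In practice the BLEU/chrF step of ASR-BLEU is usually computed with a library such as sacrebleu, but the core of corpus BLEU is simple enough to sketch directly. The following is a minimal, hypothetical stdlib-only illustration (single reference per hypothesis, whitespace tokenization, uniform 4-gram weights), not the official implementation:

```python
import math
from collections import Counter

def corpus_bleu(hypotheses, references, max_n=4):
    """Minimal corpus-level BLEU sketch: clipped n-gram precision
    with a brevity penalty. One reference per hypothesis."""
    clipped = [0] * max_n   # clipped n-gram matches per order
    totals = [0] * max_n    # total hypothesis n-grams per order
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        h, r = hyp.split(), ref.split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            h_ngrams = Counter(tuple(h[i:i + n]) for i in range(len(h) - n + 1))
            r_ngrams = Counter(tuple(r[i:i + n]) for i in range(len(r) - n + 1))
            # Clip each hypothesis n-gram count by its count in the reference.
            clipped[n - 1] += sum(min(c, r_ngrams[g]) for g, c in h_ngrams.items())
            totals[n - 1] += max(len(h) - n + 1, 0)
    if min(clipped) == 0:
        return 0.0  # some n-gram order has zero matches
    log_prec = sum(math.log(c / t) for c, t in zip(clipped, totals)) / max_n
    # Brevity penalty: punish hypotheses shorter than the reference.
    bp = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * bp * math.exp(log_prec)

# A perfect match scores 100; partial overlap scores in between.
print(corpus_bleu(["the cat sat on the mat"], ["the cat sat on the mat"]))  # → 100.0
```

For real evaluation, sacrebleu additionally handles standardized tokenization (important for Chinese transcripts), smoothing, and chrF.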

Human S2ST Metrics (Human Evaluation; taken from IWSLT 2023)

  • Translation quality: bilingual annotators will be presented with the source audio and the target audio, and will give a score between 1 and 5.
  • Output speech quality: in addition to translation quality (capturing meaning), the quality of the speech output will also be human-evaluated along three dimensions: naturalness (voice and pronunciation), clarity of speech (understandability), and sound quality (noise and other artifacts). These axes are more fine-grained than the traditional overall MOS score.

The detailed guidelines for speech quality are as follows:

  • Naturalness: recordings that sound human-like, with natural-sounding pauses, stress, and intonation, should be given a high score. Recordings that sound robotic, flat, or otherwise unnatural should be given a low score.
  • Clarity of speech: recordings with clear speech and no mumbling or unclear phrases should be given a high score. Recordings with a large amount of mumbling or unclear phrases should be given a low score.
  • Sound quality: recordings with clean audio and no noise or static in the background should be given a high score. Recordings with a large amount of noise or static in the background should be given a low score.
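
Each axis yields a set of 1–5 annotator ratings that is typically aggregated into a per-axis mean opinion score. A minimal sketch of that aggregation, with entirely hypothetical ratings:

```python
import statistics

def mos(scores):
    """Mean opinion score over 1-5 ratings, with sample standard deviation."""
    return statistics.mean(scores), statistics.stdev(scores)

# Hypothetical annotator ratings for one system output, per quality axis.
ratings = {
    "naturalness": [4, 5, 4, 3, 4],
    "clarity": [5, 5, 4, 4, 5],
    "sound_quality": [3, 4, 3, 4, 3],
}
for axis, scores in ratings.items():
    m, s = mos(scores)
    print(f"{axis}: {m:.2f} ± {s:.2f}")
```

Reporting the three axes separately, rather than a single overall MOS, is exactly the fine-grained distinction described above.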

Challenges, Workshops & Conferences

Tools & Frameworks

See also resources filed under 👉 Audio, Speech and Music Tools

ASR

Text-to-Speech Tools

Speech Translation

The Textless NLP Project from Meta

Initiative from Meta, kicked off in 2021 and written up in Textless NLP: Generating expressive speech from raw audio, contemporaneous with the release of the papers:

  1. Generative Spoken Language Modeling from Raw Audio
  2. Speech Resynthesis from Discrete Disentangled Self-Supervised Representations
  3. Text-Free Prosody-Aware Generative Spoken Language Modeling

Implementations exist in the fairseq (v1) repo under examples/textless_nlp.

See any other papers tagged textless-nlp.

From Generative Spoken Language Modeling from Raw Audio:

Being able to achieve 'textless NLP' would be beneficial for the majority of the world's languages which do not have large textual resources or even a widely used standardized orthography (Swiss German, dialectal Arabic, Igbo, etc.), and which, despite being used by millions of users, have little chance of being served by current text-based technology. It would also be useful for 'high-resource' languages, where the oral and written forms often mismatch in terms of lexicon and syntax, and where some linguistically relevant signals carried by prosody and intonation are basically absent from text.

Audiocraft (Meta)

Release: AudioCraft: A simple one-stop shop for audio modeling
Code: https://github.com/facebookresearch/audiocraft

Groups

Groups doing significant work on speech, worth monitoring.

Footnotes

  1. Tom Bäckström, Okko Räsänen, Abraham Zewoudie, Pablo Pérez Zarazaga, Liisa Koivusalo, Sneha Das, Esteban Gómez Mellado, Mariem Bouafif Mansali, Daniel Ramos, Sudarsana Kadiri, Paavo Alku, and Mohammad Hassan Vali, “Introduction to Speech Processing”, 2nd Edition, 2022. URL: https://speechprocessingbook.aalto.fi, DOI: 10.5281/zenodo.6821775. ↩