- 2025-03-19: Vision-Speech Models Teaching Speech Models to Converse about Images - MoshiVis
- AudioX Diffusion Transformer for Anything-to-Audio Generation
- MoonCast High-Quality Zero-Shot Podcast Generation
- FunAudioLLM Voice Understanding and Generation Foundation Models for Natural Interaction Between Humans and LLMs
- MinMo A Multimodal Large Language Model for Seamless Voice Interaction
- SEANet A Multi-modal Speech Enhancement Network
- CMU's IWSLT 2024 Simultaneous Speech Translation System
- 2025-03-XX: Crossing the uncanny valley of conversational voice - Sesame
- 2025-02-17: Step-Audio Unified Understanding and Generation in Intelligent Speech Interaction
- 2025-02-14: OWLS Scaling Laws for Multilingual Speech Recognition and Translation Models
- 2025-02-05: High-Fidelity Simultaneous Speech-To-Speech Translation
- 2024-12-13: MERaLiON-AudioLLM Bridging Audio and Language with Large Language Models
- 2024-12-06: Continuous Speech Tokens Makes LLMs Robust Multi-Modality Learners
- 2024-12-02: AlignFormer Modality Matching Can Achieve Better Zero-shot Instruction-Following Speech-LLM
- 2024-11-29: Scaling Transformers for Low-Bitrate High-Quality Speech Coding
- 2024-11-27: SALMONN-omni A Codec-free LLM for Full-duplex Speech Understanding and Generation
- 2024-11-01: Freeze-Omni A Smart and Low Latency Speech-to-speech Dialogue Model with Frozen LLM
- 2024-10-23: OmniFlatten An End-to-end GPT Model for Seamless Voice Conversation
- 2024-10-22: VoiceBench Benchmarking LLM-Based Voice Assistants
- 2024-10-22: Continuous Speech Tokenizer in Text To Speech
- 2024-10-20: Ichigo Mixed-Modal Early-Fusion Realtime Voice Assistant
- 2024-10-06: HALL-E Hierarchical Neural Codec Language Model for Minute-Long Zero-Shot Text-to-Speech Synthesis
- 2024-09-10: LLaMA-Omni Seamless Speech Interaction with Large Language Models
- 2024-08-05: Language Model Can Listen While Speaking
- 2023-03-07: Speak Foreign Languages with Your Own Voice Cross-Lingual Neural Codec Language Modeling
- 2025-01-04: Prepending or Cross-Attention for Speech-to-Text An Empirical Comparison
- 2025-01-20: LLM supervised Pre-training for Multimodal Emotion Recognition in Conversations
- 2025-01-10: xLSTM-SENet xLSTM for Single-Channel Speech Enhancement
- 2024-12-24: How Real is Your Real-Time Simultaneous Speech-to-Text Translation System
- 2024-12-16: Speech Foundation Models and Crowdsourcing for Efficient, High-Quality Data Collection
- 2024-12-06: Continuous Speech Tokens Makes LLMs Robust Multi-Modality Learners
- 2024-12-04: Explainability for Speech Models On the Challenges of Acoustic Feature Selection
- 2024-11-13: A Comparative Study of Discrete Speech Tokens for Semantic-Related Tasks with Large Language Models
- 2024-11-09: Selective State Space Model for Monaural Speech Enhancement
- 2024-11-04: Align-SLM Textless Spoken Language Models with Reinforcement Learning from AI Feedback
- 2024-11-03: SPES Spectrogram Perturbation for Explainable Speech-to-Text Generation
- 2024-11-03: Introducing hertz-dev - Standard Intelligence - waiting for a paper on this one; added a PLACEHOLDER entry for now
- 2024-10-31: DC-Spin A Speaker-invariant Speech Tokenizer for Spoken Language Models
- 2024-10-22: WavTokenizer an Efficient Acoustic Discrete Codec Tokenizer for Audio Language Modeling
- 2024-10-20: MaskGCT Zero-Shot Text-to-Speech with Masked Generative Codec Transformer
- 2024-10-19: DM-Codec Distilling Multimodal Representations for Speech Tokenization
- 2024-10-16: What Do Speech Foundation Models Not Learn About Speech
- 2024-10-05: SyllableLM Learning Coarse Semantic Units for Speech Language Models
- 2024-09-26: EMOVA Empowering Language Models to See, Hear and Speak with Vivid Emotions
- 2024-09-22: What Are They Doing Joint Audio-Speech Co-Reasoning
- 2024-09-18: Moshi a speech-text foundation model for real-time dialogue - Moshi
- 2024-09-11: A Suite for Acoustic Language Model Evaluation
- 2024-09-09: Leveraging Content and Acoustic Representations for Speech Emotion Recognition
- 2024-09-01: Comparing Discrete and Continuous Space LLMs for Speech Recognition
- 2024-08-29: Mini-Omni Language Models Can Hear, Talk While Thinking in Streaming
- 2024-08-13: Style-Talker Finetuning Audio Language Model and Style-Based Text-to-Speech Model for Fast Spoken Dialogue Generation
- 2024-07-04: DASS Distilled Audio State Space Models Are Stronger and More Duration-Scalable Learners - DASS
- 2024-06-28: BESTOW Efficient and Streamable Speech Language Model with the Best of Two Worlds in GPT and T5
- 2024-06-20: DASB - Discrete Audio and Speech Benchmark
- 2024-06-17: GAMA A Large Audio-Language Model with Advanced Audio Understanding and Complex Reasoning Abilities - GAMA
- 2024-06-15: How Should We Extract Discrete Audio Tokens from Self-Supervised Models
- 2024-06-12: VALL-E R Robust and Efficient Zero-Shot Text-to-Speech Synthesis via Monotonic Alignment
- 2024-06-11: The Interspeech 2024 Challenge on Speech Processing Using Discrete Units
- 2024-06-09: MS-HuBERT Mitigating Pre-training and Inference Mismatch in Masked Language Modelling methods for learning Speech Representations
- 2024-06-08: VALL-E 2 Neural Codec Language Models are Human Parity Zero-Shot Text to Speech Synthesizers
- 2024-06-08: Exploring the Benefits of Tokenization of Discrete Acoustic Units
- 2024-05-12: Unified Video-Language Pre-training with Synchronized Audio - VLSA
- 2024-04-27: T-CLAP Temporal-Enhanced Contrastive Language-Audio Pretraining - T-CLAP
- 2024-03-31: WavLLM Towards Robust and Adaptive Speech Large Language Model
- 2024-03-19: Listenable Maps for Audio Classifiers
- 2024-02-29: Compact Speech Translation Models via Discrete Speech Units Pretraining
- 2024-02-20: OWSM-CTC An Open Encoder-Only Speech Foundation Model for Speech Recognition, Translation, and Language Identification
- 2024-02-19: AnyGPT Unified Multimodal LLM with Discrete Sequence Modeling - AnyGPT
- 2024-02-16: Pushing the Limits of Zero-shot End-to-End Speech Translation
- 2024-02-12: Careless Whisper Speech-to-Text Hallucination Harms
- 2024-02-12: BASE TTS Lessons from building a billion-parameter Text-to-Speech model on 100K hours of data
- 2024-02-08: SpiRit-LM Interleaved Spoken and Written Language Model
- 2024-02-08: Paralinguistics-Aware Speech-Empowered Large Language Models for Natural Conversation
- 2024-01-30: OWSM v3.1 Better and Faster Open Whisper-Style Speech Models based on E-Branchformer
- 2024-01-24: SpeechGPT-Gen Scaling Chain-of-Information Speech Generation
- 2023-12-23: Paralinguistics-Enhanced Large Language Modeling of Spoken Dialogue
- 2023-12-21: EmphAssess a Prosodic Benchmark on Assessing Emphasis Transfer in Speech-to-Speech Models
- 2023-11-14: Qwen-Audio Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models
- 2023-11-12: AudioChatLlama Towards General-Purpose Speech Abilities for LLMs
- 2023-10-24: P-Flow A Fast and Data-Efficient Zero-Shot TTS through Speech Prompting
- 2023-10-23: SALMONN Towards Generic Hearing Abilities for Large Language Models
- 2023-10-23: Intuitive Multilingual Audio-Visual Speech Recognition with a Single-Trained Model
- 2023-10-13: SALM Speech-augmented Language Model with In-context Learning for Speech Recognition and Translation
- 2023-09-27: HyPoradise An Open Baseline for Generative Speech Recognition with Large Language Models
- 2023-09-25: Reproducing Whisper-Style Training Using an Open-Source Toolkit and Publicly Available Data
- 2023-09-14: VoxtLM unified decoder-only models for consolidating speech recognition/synthesis and speech/text continuation tasks
- 2023-09-13: Can Whisper Perform Speech-Based In-Context Learning
- 2023-09-07: Exploring Speech Recognition, Translation, and Understanding with Discrete Speech Units A Comparative Study
- 2023-08-31: SpeechTokenizer Unified Speech Tokenizer for Speech Large Language Models
- 2023-08-23: SpeechX Neural Codec Language Model as a Versatile Speech Transformer - SpeechX
- 2023-08-22: SeamlessM4T - Introducing a foundational multimodal model for speech translation - SeamlessM4T
- 2023-08-22: SeamlessM4T Massively Multilingual & Multimodal Machine Translation
- 2023-06-23: Voicebox Text-Guided Multilingual Universal Speech Generation at Scale
- 2023-06-13: StyleTTS 2 Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models
- 2023-05-29: VAST A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset
- 2023-05-25: VioLA Unified Codec Language Models for Speech Recognition, Synthesis, and Translation
- 2023-05-24: Spoken Question Answering and Speech Continuation Using Spectrogram-Powered LLM - Spectron
- 2023-05-22: Textually Pretrained Speech Language Models - TWIST
- 2023-05-18: SpeechGPT Empowering Large Language Models with Intrinsic Cross-Modal Conversational Abilities
- 2023-05-16: SoundStorm Efficient Parallel Audio Generation
- 2023-04-25: AudioGPT Understanding and Generating Speech, Music, Sound, and Talking Head
- 2023-03-14: I3D Transformer architectures with input-dependent dynamic depth for speech recognition
- 2023-01-05: Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers - Vall-E
- 2022-12-06: Robust Speech Recognition via Large-Scale Weak Supervision - Whisper
- 2022-11-12: Large-scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation
- 2022-11-11: Speech-to-Speech Translation For A Real-world Unwritten Language - Meta paper on Taiwanese Hokkien
- 2022-11-08: Comparative layer-wise analysis of self-supervised speech models
- 2022-10-24: High Fidelity Neural Audio Compression
- 2022-10-12: SQuId Measuring Speech Naturalness in Many Languages
- 2022-09-30: SpeechLM Enhanced Speech Pre-Training with Unpaired Textual Data
- 2022-09-30: AudioGen Textually Guided Audio Generation
- 2022-09-07: AudioLM a Language Modeling Approach to Audio Generation - AudioLM
- 2022-06-05: Variable-rate hierarchical CPC leads to acoustic unit discovery in speech
- 2022-04-05: UTMOS UTokyo-SaruLab System for VoiceMOS Challenge 2022
- 2022-02-07: data2vec A General Framework for Self-supervised Learning in Speech, Vision and Language
- 2022-02-03: mSLAM Massively multilingual joint pre-training for speech and text
- 2021-12-04: YourTTS Towards Zero-Shot Multi-Speaker TTS and Zero-Shot Voice Conversion for everyone
- 2021-11-17: XLS-R Self-supervised Cross-lingual Speech Representation Learning at Scale
- 2021-11-09: Multimodal and Multilingual Embeddings for Large-Scale Speech Mining
- 2021-11-03: A Comparison of Discrete and Soft Speech Units for Improved Voice Conversion
- 2021-10-26: WavLM Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing
- 2021-10-14: SpeechT5 Unified-Modal Encoder-Decoder Pre-Training for Spoken Language Processing
- 2021-10-05: DistilHuBERT Speech Representation Learning by Layer-wise Distillation of Hidden-unit BERT
- 2021-09-07: Text-Free Prosody-Aware Generative Spoken Language Modeling
- 2021-08-07: W2v-BERT Combining Contrastive Learning and Masked Language Modeling for Self-Supervised Speech Pre-Training - w2v-BERT
- 2021-07-13: Zero-shot Speech Translation
- 2021-07-07: SoundStream An End-to-End Neural Audio Codec - SoundStream
- 2021-06-14: HuBERT Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units - HuBERT
- 2021-06-11: Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech - VITS
- 2021-04-05: AST Audio Spectrogram Transformer
- 2021-04-01: Speech Resynthesis from Discrete Disentangled Self-Supervised Representations
- 2021-02-01: Generative Spoken Language Modeling from Raw Audio
- 2021-01-09: UniSpeech Unified Speech Representation Learning with Labeled and Unlabeled Data
- 2020-10-20: Pushing the Limits of Semi-Supervised Learning for Automatic Speech Recognition
- 2020-06-23: Real Time Speech Enhancement in the Waveform Domain
- 2020-06-20: wav2vec 2.0 A Framework for Self-Supervised Learning of Speech Representations - wav2vec 2.0
- 2020-05-16: Conformer Convolution-augmented Transformer for Speech Recognition - Conformer
- 2020-01-25: Multi-task self-supervised learning for Robust Speech Recognition - Mirco Ravanelli and co
- 2019-11-21: Prosody Transfer in Neural Text to Speech Using Global Pitch and Loudness Features
- 2019-06-06: Understanding and Improving Transformer From a Multi-Particle Dynamic System Point of View - Macaron-Net
- 2019-04-18: SpecAugment A Simple Data Augmentation Method for Automatic Speech Recognition
- 2019-04-11: wav2vec Unsupervised Pre-training for Speech Recognition - wav2vec
- 2019-04-06: Learning Problem-agnostic Speech Representations from Multiple Self-supervised Tasks - Mirco Ravanelli and co
- 2018-12-01: Learning Speaker Representations with Mutual Information from Mirco Ravanelli and Yoshua Bengio
- 2018-07-10: Representation Learning with Contrastive Predictive Coding - CPC
- 2018-04-04: Learning Filterbanks from Raw Speech for Phone Recognition
- 2018-03-23: Style Tokens Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis - Style Tokens
- 2017-12-16: Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions - Tacotron 2
- 2017-09-22: Unsupervised Learning of Disentangled and Interpretable Representations from Sequential Data
- 2017-03-19: Tacotron Towards End-to-End Speech Synthesis - Tacotron
- 2016-09-12: WaveNet A Generative Model for Raw Audio
- 2014-09-01: Neural Machine Translation by Jointly Learning to Align and Translate
Speech Datasets are under Datasets » Speech Datasets
Surveys & Reviews
- Self-Supervised Speech Representation Learning A Review
- Audio-Language Models for Audio-Centric Tasks A survey
- WavChat A Survey of Spoken Dialogue Models
- A Survey on Speech Large Language Models
- Towards audio language modeling - an overview
- Speech Trident - Awesome Speech LM - very comprehensive repository of relevant speech/audio/SFM papers
- Recent Advances in Speech Language Models A Survey - adults wrote this
- Speech Translation with Speech Foundation Models and Large Language Models What is There and What is Missing - from Marco Gaido and former colleagues from FBK. Nice framing of modality and length adapters, LLM backbones, speech foundation models ("SFMs")
- A Brief Overview of Unsupervised Neural Speech Representation Learning - takes a slightly more formal but still relaxed approach to explaining unsupervised speech representation learning (the team from Copenhagen in Denmark)
- A Review of Deep Learning Techniques for Speech Processing
- A Survey on Neural Speech Synthesis
- End-to-End Speech Recognition A Survey
- Recent Advances in Direct Speech-to-text Translation -s2st
- End-to-End Speech-to-Text Translation A Survey -s2tt
Resources
- Introduction to Speech Processing - open-access and Creative Commons book on speech processing, intended as pedagogical material for engineering students, from a team at Aalto University[^1]
- Spectral Audio Signal Processing (Julius O. Smith III)
- Lawrence R. Rabiner and Ronald W. Schafer (2007) Introduction to Digital Speech Processing - also available in Drive
- Lawrence R. Rabiner and Ronald W. Schafer (1978). Digital Processing of Speech Signals - messy photocopy available here
- The Scientist and Engineer's Guide to Digital Signal Processing - Chapter 22 Audio Processing
- Mel Frequency Cepstral Coefficient (MFCC) tutorial - Practical Cryptography - a complete and rigorous but minimal walk-through of computation of the MFCCs including explanations and formulae for the Discrete Fourier Transform (DFT) and explanation for the delta/delta deltas (as well as the obvious stuff, like computing the mel filterbank)
- the Python implementation is actually the same jameslyons/python_speech_features repo that I recorded in Audio, Speech and Music Tools
- An Intuitive Discrete Fourier Transform Tutorial - Practical Cryptography
- Speech Processing for Machine Learning Filter banks, Mel-Frequency Cepstral Coefficients (MFCCs) and What's In-Between from Haytham M Fayek
- walks through pre-emphasis, framing, windowing, the Fourier Transform and Power Spectrum, filter banks, MFCCs and mean normalization
- another great resource like the MFCC tutorial from Practical Cryptography
- SpeechBrain A General-Purpose Speech Toolkit and Open-Source Conversational AI with SpeechBrain 1.0 - the two SpeechBrain papers
- SpeechBrain is one of the Audio, Speech and Music Tools
- Connectionist Temporal Classification
- especially Sequence Modeling with CTC - distill.pub write-up / explainer for the CTC loss
- A Course in Phonetics (2005; 6th Edition) by Peter Ladefoged and Keith Johnson
- Egor Shvecov: Lecture on Neural Audio Codecs - Encodec and SoundStream
- Conformer: An interesting ML architecture that I'm abandoning - Knowing.NET
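Since several of the tutorials above center on computing the mel filterbank, here is a minimal, dependency-free sketch of the hz↔mel conversion and the filterbank center frequencies (formulae as in the Practical Cryptography tutorial; the parameter values are illustrative):

```python
import math

def hz_to_mel(f):
    # O'Shaughnessy's formula, as used in the Practical Cryptography tutorial
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filter_centers(n_filters, low_hz, high_hz):
    """Center frequencies (Hz) of a mel filterbank: points are spaced
    evenly on the mel scale, then mapped back to Hz."""
    low_mel, high_mel = hz_to_mel(low_hz), hz_to_mel(high_hz)
    step = (high_mel - low_mel) / (n_filters + 1)
    return [mel_to_hz(low_mel + i * step) for i in range(1, n_filters + 1)]

# 26 filters between 300 Hz and 8 kHz, as in the tutorial's worked example
centers = mel_filter_centers(26, 300.0, 8000.0)
```

Each center then becomes the peak of a triangular filter whose feet sit at the neighbouring centers; applying the filters to the power spectrum and taking log + DCT yields the MFCCs.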
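As a companion to the CTC explainer above, a minimal sketch of greedy CTC decoding - the collapse rule (merge adjacent repeats, then drop blanks) that the distill.pub write-up builds on. The label indices are illustrative:

```python
def ctc_greedy_decode(frame_labels, blank=0):
    """Collapse a per-frame best-path label sequence per the CTC rule:
    merge adjacent repeated labels first, then remove blanks."""
    out, prev = [], None
    for lab in frame_labels:
        if lab != prev and lab != blank:  # new non-blank label
            out.append(lab)
        prev = lab
    return out

# best path "h h _ e _ l l _ l o" (blank = 0) collapses to "h e l l o"
path = [8, 8, 0, 5, 0, 12, 12, 0, 12, 15]
decoded = ctc_greedy_decode(path)  # -> [8, 5, 12, 12, 15]
```

Note the blank between the two runs of 12 is what lets CTC emit a doubled letter; without it the repeats would merge.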
Evaluation, Leaderboards and Challenges
- 🤗 Open ASR Leaderboard - ranks and evaluates speech recognition models on the Hugging Face Hub
- Aishell1
- cochlscene
- Clotho: Clotho An Audio Captioning Dataset
- Clotho-AQA: Clotho-AQA A Crowdsourced Dataset for Audio Question Answering
- VocalSound
- ML-SUPERB 2.0 Benchmarking Multilingual Speech Models Across Modeling Constraints, Languages, and Datasets
- SUPERB Speech processing Universal PERformance Benchmark
- Dynamic-SUPERB Towards A Dynamic, Collaborative, and Comprehensive Instruction-Tuning Benchmark for Speech - seems to have fallen by the wayside since mid-2024
- A Suite for Acoustic Language Model Evaluation
- What Are They Doing Joint Audio-Speech Co-Reasoning
- The Zero Resource Speech Challenge 2021 Spoken language modelling
- The Zero Resource Speech Benchmark 2021 Metrics and baselines for unsupervised spoken language modeling
- Accuracy Benchmarking Speechmatics - direct link
- shows unambiguously that you should perform text normalization on both the references and the hypotheses
- DASB - Discrete Audio and Speech Benchmark
- MOSNet Deep Learning based Objective Assessment for Voice Conversion
- ViSQOL v3 An Open Source Production Ready Objective Speech and Audio Metric
- Mean Opinion Score
- Multiple Stimuli with Hidden Reference and Anchor (MUSHRA)
- codified in Recommendation ITU-R BS.1534-3
- Mel cepstral distortion (MCD) - a measure of how different two sequences of mel cepstra are; can be used to evaluate the quality of synthesized speech
- Perceptual Evaluation of Speech Quality (PESQ)
- superseded by ITU-T - P.863.X (P.863.2 at the time of writing)
- P.863.2: Extension of ITU-T P.863 for multidimensional assessment of degradations in telephony speech signals up to fullband
- TorchMetrics: Audio Metrics
- Complex Scale-Invariant Signal-to-Noise Ratio (C-SI-SNR)
- Deep Noise Suppression Mean Opinion Score (DNSMOS)
- Non-Intrusive Speech Quality Assessment (NISQA v2.0)
- Perceptual Evaluation of Speech Quality (PESQ)
- Permutation Invariant Training (PIT)
- Scale-Invariant Signal-to-Distortion Ratio (SI-SDR)
- Scale-Invariant Signal-to-Noise Ratio (SI-SNR)
- Short-Time Objective Intelligibility (STOI)
- Signal to Distortion Ratio (SDR)
- Signal-to-Noise Ratio (SNR)
- Source Aggregated Signal-to-Distortion Ratio (SA-SDR)
- Speech-to-Reverberation Modulation Energy Ratio (SRMR)
- Artificial Analysis TTS Leaderboard - A/B (Elo) leaderboard for TTS including the latest models (2024-12-14)
- Personal leaderboard (generated after personally doing 30+ A/B comparisons)
- HF TTS Arena Leaderboard - mostly open models; includes ElevenLabs but no other industry systems
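To make the Speechmatics point about text normalization concrete, here is a minimal sketch: a toy normalizer (lowercase, strip punctuation - real benchmarks use richer normalizers such as Whisper's) and a plain Levenshtein WER over word sequences:

```python
import re

def normalize(text):
    # Toy normalizer: lowercase and strip punctuation. Real pipelines also
    # handle numbers, contractions, spelling variants, etc.
    return re.sub(r"[^\w\s']", "", text.lower()).split()

def wer(ref_words, hyp_words):
    """Word error rate via Levenshtein distance over word sequences."""
    d = list(range(len(hyp_words) + 1))  # DP row for the empty reference
    for i, r in enumerate(ref_words, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp_words, 1):
            cur = min(d[j] + 1,           # deletion
                      d[j - 1] + 1,       # insertion
                      prev + (r != h))    # substitution / match
            prev, d[j] = d[j], cur
    return d[-1] / len(ref_words)

ref, hyp = "The cat sat.", "the cat sat"
raw = wer(ref.split(), hyp.split())              # case/punctuation counted as errors
clean = wer(normalize(ref), normalize(hyp))      # 0.0 after normalizing both sides
```

Normalizing only one side (or neither) inflates WER with errors that are not recognition errors, which is exactly the failure mode the benchmark calls out.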
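Many of the TorchMetrics entries above reduce to energy ratios between a projection and a residual; as one worked example, a dependency-free sketch of SI-SNR (zero-mean both signals, project the estimate onto the target, compare energies):

```python
import math

def si_snr(estimate, target, eps=1e-8):
    """Scale-invariant SNR in dB: the component of the estimate aligned
    with the target is 'signal', the residual is 'noise'."""
    me = sum(estimate) / len(estimate)
    mt = sum(target) / len(target)
    e = [x - me for x in estimate]           # zero-mean estimate
    t = [x - mt for x in target]             # zero-mean target
    dot = sum(a * b for a, b in zip(e, t))
    t_energy = sum(a * a for a in t) + eps
    s_target = [dot / t_energy * a for a in t]       # projection onto target
    noise = [a - b for a, b in zip(e, s_target)]     # residual
    num = sum(a * a for a in s_target)
    den = sum(a * a for a in noise) + eps
    return 10.0 * math.log10(num / den + eps)
```

Because of the projection, rescaling the estimate leaves the metric (essentially) unchanged - the "scale-invariant" part of the name.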
Metrics » Speech-to-Speech Translation (S2ST)
Automatic S2ST Metrics
- ASR-BLEU: the speech output is automatically transcribed with an ASR system (for the IWSLT task, a Chinese ASR system trained on WenetSpeech), and BLEU and chrF are computed between the produced transcript and a textual human reference.
- BLASER: a text-free speech-to-speech translation evaluation metric, computed between the translated speech and the reference speech.
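The scoring half of ASR-BLEU can be sketched with a toy sentence-level BLEU. Whitespace tokenization and a crude smoothing floor are simplifying assumptions here; real pipelines run sacrebleu on the ASR transcripts:

```python
import math
from collections import Counter

def _ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def sentence_bleu(hypothesis, reference, max_n=4):
    """Toy sentence-level BLEU: clipped n-gram precision with a floor
    (crude smoothing) and the standard brevity penalty."""
    hyp, ref = hypothesis.split(), reference.split()
    log_p = 0.0
    for n in range(1, max_n + 1):
        h, r = _ngrams(hyp, n), _ngrams(ref, n)
        overlap = sum((h & r).values())           # clipped n-gram matches
        log_p += math.log(max(overlap, 1e-9) / max(sum(h.values()), 1))
    bp = 1.0 if len(hyp) >= len(ref) else math.exp(1.0 - len(ref) / max(len(hyp), 1))
    return bp * math.exp(log_p / max_n)
```

The key point for S2ST evaluation is that the hypothesis here is an ASR transcript, so ASR errors and translation errors are conflated - one motivation for text-free metrics like BLASER.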
Human S2ST Metrics (Human Evaluation; taken from IWSLT 2023)
- Translation quality: bilingual annotators will be presented with the source audio and the target audio, and give scores between 1 and 5.
- Output speech quality: in addition to translation quality (capturing meaning), the quality of the speech output will also be human-evaluated along three dimensions: naturalness (voice and pronunciation), clarity of speech (understandability), and sound quality (noise and other artifacts). These axes are more fine-grained than the traditional overall MOS score.
The detailed guidelines for speech quality are as follows:
- Naturalness: recordings that sound human-like, with natural-sounding pauses, stress, and intonation, should be given a high score. Recordings that sound robotic, flat, or otherwise unnatural should be given a low score.
- Clarity of speech: recordings with clear speech and no mumbling and unclear phrases should be given a high score. Recordings with a large amount of mumbling and unclear phrases should be given a low score.
- Sound quality: recordings with clean audio and no noise and static in the background should be given a high score. Recordings with a large amount of noise and static in the background should be given a low score.
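Scores on these 1-5 scales are typically aggregated into a mean opinion score with a confidence interval; a minimal sketch under a normal approximation (z = 1.96 for a 95% interval is an assumption, not part of the IWSLT guidelines):

```python
import math
from statistics import mean, stdev

def mos_with_ci(ratings, z=1.96):
    """Mean opinion score and half-width of a normal-approximation CI
    over per-recording ratings on the 1-5 scale."""
    m = mean(ratings)
    half = z * stdev(ratings) / math.sqrt(len(ratings))
    return m, half

score, half_width = mos_with_ci([4, 5, 3, 4, 4])  # report as score ± half_width
```

For small rater pools a t-distribution (or bootstrap) interval would be more defensible, but the shape of the calculation is the same.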
Challenges, Workshops & Conferences
- URGENT Challenge
- NeurIPS SAS 2020 - the NeurIPS 2020 workshop on Self-Supervised Learning for Speech and Audio Processing
- Interspeech
- ISCA Online Archive: Proceedings for all INTERSPEECH, EUROSPEECH and ICSLP conferences
Tools & Frameworks
See also resources filed under đ Audio, Speech and Music Tools
ASR
Text-to-Speech Tools
- CoquiTTS: đžđŹ - a deep learning toolkit for Text-to-Speech, battle-tested in research and production
- Hugging Face Audio Pipelines - Text-to-Speech
- 2023-05-12: Better speech synthesis through scaling - TorToise TTS
- EmotiVoice
- eSpeak - Wikipedia
- MBROLA - Wikipedia
Speech Translation
- ESPnet-ST All-in-One Speech Translation Toolkit
- NeurST
- SLT.KIT
The Textless NLP Project from Meta
Initiative from Meta, kicked off in 2021 and written up in "Textless NLP: Generating expressive speech from raw audio", contemporaneous with the release of the papers:
- Generative Spoken Language Modeling from Raw Audio
- Speech Resynthesis from Discrete Disentangled Self-Supervised Representations
- Text-Free Prosody-Aware Generative Spoken Language Modeling
Implementations exist in the fairseq (v1) repo under examples/textless_nlp
See any other papers tagged textless-nlp
From Generative Spoken Language Modeling from Raw Audio:
Being able to achieve "textless NLP" would be beneficial for the majority of the world's languages which do not have large textual resources or even a widely used standardized orthography (Swiss German, dialectal Arabic, Igbo, etc.), and which, despite being used by millions of users, have little chance of being served by current text-based technology. It would also be useful for "high-resource" languages, where the oral and written forms often mismatch in terms of lexicon and syntax, and where some linguistically relevant signals carried by prosody and intonation are basically absent from text.
Audiocraft (Meta)
Release: AudioCraft A simple one-stop shop for audio modeling
Code: https://github.com/facebookresearch/audiocraft
- AudioGen Textually Guided Audio Generation
- Simple and Controllable Music Generation
- From Discrete Tokens to High-Fidelity Audio Using Multi-Band Diffusion
Groups
Groups doing significant work on speech, worth monitoring.
- MIT CSAIL Spoken Language Systems Group
- publications "clipped" snapshot on 2024-10-22 MIT CSAIL Spoken Language Systems Group - Publications
- Microsoft Speech Research
- clipped on 2024-12-21: (Microsoft) Speech Research
- Digital Phonetics research group at IMS, University of Stuttgart
- the group has been headed by Prof. Dr. Thang Vu since June 2018
Footnotes
[^1]: Tom Bäckström, Okko Räsänen, Abraham Zewoudie, Pablo Pérez Zarazaga, Liisa Koivusalo, Sneha Das, Esteban Gómez Mellado, Mariem Bouafif Mansali, Daniel Ramos, Sudarsana Kadiri, Paavo Alku, and Mohammad Hassan Vali, "Introduction to Speech Processing", 2nd Edition, 2022. URL: https://speechprocessingbook.aalto.fi, DOI: 10.5281/zenodo.6821775.