TODO: make this a timeline, ideally by migrating to tags and then using either a Dataview table with the publication date as a column or, better, a script that generates a materialised timeline of speech papers/models/releases (by parsing this list, including clipped releases/repos, for example)
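
A minimal sketch of the script option from the todo above. It assumes dated entries keep the `- YYYY-MM-DD: Title` shape used in this list and that undated entries are plain `- Title` bullets. The names `parse_entries` and `render_timeline` are hypothetical, the per-year headings are just one possible output layout, and clipped releases/repos are not handled here.

```python
import re
from collections import defaultdict
from datetime import date

# Matches "- YYYY-MM-DD: Title" (dated) or "- Title" (undated).
ENTRY_RE = re.compile(r"^- (?:(\d{4})-(\d{2})-(\d{2}): )?(.+)$")


def parse_entries(text):
    """Return (dated, undated): dated is a list of (date, title) pairs,
    undated is a list of titles. Verbatim duplicate entries are dropped."""
    dated, undated, seen = [], [], set()
    for line in text.splitlines():
        match = ENTRY_RE.match(line.strip())
        if not match:
            continue  # skip non-list lines such as the todo note
        year, month, day, title = match.groups()
        key = (year, month, day, title.strip())
        if key in seen:
            continue
        seen.add(key)
        if year:
            # date() raises ValueError on an impossible date such as 2024-13-40
            dated.append((date(int(year), int(month), int(day)), title.strip()))
        else:
            undated.append(title.strip())
    return dated, undated


def render_timeline(dated, undated):
    """Render entries newest first, grouped by year, with undated ones last."""
    by_year = defaultdict(list)
    for published, title in dated:
        by_year[published.year].append((published, title))
    lines = []
    for year in sorted(by_year, reverse=True):
        lines.append(f"## {year}")
        for published, title in sorted(by_year[year], key=lambda e: e[0], reverse=True):
            lines.append(f"- {published.isoformat()}: {title}")
    if undated:
        lines.append("## Undated")
        lines.extend(f"- {title}" for title in undated)
    return "\n".join(lines)


if __name__ == "__main__":
    sample = "\n".join([
        "- BLAB Brutally Long Audio Bench",
        "- 2016-09-12: WaveNet A Generative Model for Raw Audio",
        "- 2025-01-10: MinMo A Multimodal Large Language Model for Seamless Voice Interaction",
    ])
    print(render_timeline(*parse_entries(sample)))
```

Grouping by year keeps the output readable as the list grows. Collecting undated entries into their own section makes it obvious which notes still need a publication date.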
- TranSpeech Speech-to-Speech Translation With Bilateral Perturbation
- CASPER A Large Scale Spontaneous Speech Dataset
- Chain-of-Thought Training for Open E2E Spoken Dialogue Systems
- Neural Networks Fail to Learn Periodic Functions and How to Fix It - “Snake activations”
- NaturalSpeech 3 Zero-Shot Speech Synthesis with Factorized Codec and Diffusion Models
- Audio-Visual Speech Enhancement Using Multimodal Deep Convolutional Neural Networks
- BLAB Brutally Long Audio Bench
- boson-ai/higgs-audio Text-audio foundation model from Boson AI
- CLaM-TTS Improving Neural Codec Language Model for Zero-Shot Text-to-Speech
- dMel Speech Tokenization made Simple
- Exploration on HuBERT with Multiple Resolutions
- FocalCodec Low-Bitrate Speech Coding via Focal Modulation Networks from Mirco Ravanelli
- HiFi-Codec Group-residual Vector quantization for High Fidelity Audio Codec - HiFi-Codec
- Kimi-Audio Technical Report
- Llasa Scaling Train-Time and Inference-Time Compute for Llama-based Speech Synthesis
- Looking to Listen at the Cocktail Party A Speaker-Independent Audio-Visual Model for Speech Separation
- Meta Audiobox Aesthetics Unified Automatic Quality Assessment for Speech, Music, and Sound
- Multi-resolution HuBERT Multi-resolution Speech Self-Supervised Learning with Masked Unit Prediction
- NUTSHELL A Dataset for Abstract Generation from Scientific Talks
- Parallel WaveGAN A fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram
- SemantiCodec An Ultra Low Bitrate Semantic Audio Codec for General Sound
- SHuBERT Self-Supervised Sign Language Representation Learning via Multi-Stream Cluster Prediction
- SpeechCLIP Integrating Speech with Pre-Trained Vision and Language Model
- SpeechQE Estimating the Quality of Direct Speech Translation
- Spoken Language Modeling with Duration-Penalized Self-Supervised Units
- tinyCLAP Distilling Constrastive Language-Audio Pretrained Models
- Translation in the Hands of Many Centering Lay Users in Machine Translation Interactions
- TS3-Codec Transformer-Based Simple Streaming Single Codec
- Voice Conversion With Just Nearest Neighbors
- Voxtral Mistral AI
- WhisperX Time-Accurate Speech Transcription of Long-Form Audio
- ZSVC Zero-shot Style Voice Conversion with Disentangled Latent Diffusion Models and Adversarial Training
- Jasper An End-to-End Convolutional Neural Acoustic Model
- End-to-end ASR from Supervised to Semi-Supervised Learning with Modern Architectures
- Scaling Up Online Speech Recognition Using ConvNets
- EmotiVoice 😊 a Multi-Voice and Prompt-Controlled TTS Engine - netease-youdao
- 2025-04-03: Scaling Analysis of Interleaved Speech-Text Language Models
- 2025-03-19: Vision-Speech Models Teaching Speech Models to Converse about Images - MoshiVis from kyutai
- 2025-03-18: SVLA A Unified Speech-Vision-Language Assistant with Multimodal Reasoning and Speech Generation
- 2025-03-18: MoonCast High-Quality Zero-Shot Podcast Generation
- 2025-03-13: AudioX Diffusion Transformer for Anything-to-Audio Generation
- 2025-02-27: Crossing the uncanny valley of conversational voice - Sesame
- 2025-02-17: Step-Audio Unified Understanding and Generation in Intelligent Speech Interaction
- 2025-02-14: OWLS Scaling Laws for Multilingual Speech Recognition and Translation Models
- 2025-02-05: High-Fidelity Simultaneous Speech-To-Speech Translation - Hibiki from kyutai
- 2025-01-20: LLM supervised Pre-training for Multimodal Emotion Recognition in Conversations
- 2025-01-10: xLSTM-SENet xLSTM for Single-Channel Speech Enhancement
- 2025-01-10: MinMo A Multimodal Large Language Model for Seamless Voice Interaction
- 2025-01-04: Prepending or Cross-Attention for Speech-to-Text An Empirical Comparison
- 2024-12-24: How Real is Your Real-Time Simultaneous Speech-to-Text Translation System
- 2024-12-16: Speech Foundation Models and Crowdsourcing for Efficient, High-Quality Data Collection
- 2024-12-13: MERaLiON-AudioLLM Bridging Audio and Language with Large Language Models
- 2024-12-06: Continuous Speech Tokens Makes LLMs Robust Multi-Modality Learners
- 2024-12-04: Explainability for Speech Models On the Challenges of Acoustic Feature Selection
- 2024-12-02: AlignFormer Modality Matching Can Achieve Better Zero-shot Instruction-Following Speech-LLM
- 2024-11-29: Scaling Transformers for Low-Bitrate High-Quality Speech Coding
- 2024-11-27: SALMONN-omni A Codec-free LLM for Full-duplex Speech Understanding and Generation
- 2024-11-13: A Comparative Study of Discrete Speech Tokens for Semantic-Related Tasks with Large Language Models
- 2024-11-09: Selective State Space Model for Monaural Speech Enhancement
- 2024-11-04: Align-SLM Textless Spoken Language Models with Reinforcement Learning from AI Feedback
- 2024-11-03: SPES Spectrogram Perturbation for Explainable Speech-to-Text Generation
- 2024-11-03: Introducing hertz-dev - Standard Intelligence - waiting for a paper on this one; added PLACEHOLDER hertz-dev - Standard Intelligence for now
- 2024-11-01: Freeze-Omni A Smart and Low Latency Speech-to-speech Dialogue Model with Frozen LLM
- 2024-10-31: DC-Spin A Speaker-invariant Speech Tokenizer for Spoken Language Models
- 2024-10-23: OmniFlatten An End-to-end GPT Model for Seamless Voice Conversation
- 2024-10-22: WavTokenizer an Efficient Acoustic Discrete Codec Tokenizer for Audio Language Modeling
- 2024-10-22: VoiceBench Benchmarking LLM-Based Voice Assistants
- 2024-10-22: Continuous Speech Tokenizer in Text To Speech
- 2024-10-20: MaskGCT Zero-Shot Text-to-Speech with Masked Generative Codec Transformer
- 2024-10-20: Ichigo Mixed-Modal Early-Fusion Realtime Voice Assistant
- 2024-10-19: DM-Codec Distilling Multimodal Representations for Speech Tokenization
- 2024-10-16: What Do Speech Foundation Models Not Learn About Speech
- 2024-10-06: HALL-E Hierarchical Neural Codec Language Model for Minute-Long Zero-Shot Text-to-Speech Synthesis
- 2024-10-05: SyllableLM Learning Coarse Semantic Units for Speech Language Models
- 2024-09-26: EMOVA Empowering Language Models to See, Hear and Speak with Vivid Emotions
- 2024-09-22: What Are They Doing Joint Audio-Speech Co-Reasoning
- 2024-09-18: Moshi a speech-text foundation model for real-time dialogue - Moshi from kyutai
- 2024-09-11: A Suite for Acoustic Language Model Evaluation
- 2024-09-10: LLaMA-Omni Seamless Speech Interaction with Large Language Models
- 2024-09-09: Leveraging Content and Acoustic Representations for Speech Emotion Recognition
- 2024-09-01: Comparing Discrete and Continuous Space LLMs for Speech Recognition
- 2024-08-29: Mini-Omni Language Models Can Hear, Talk While Thinking in Streaming
- 2024-08-14: CMU’s IWSLT 2024 Simultaneous Speech Translation System
- 2024-08-13: Style-Talker Finetuning Audio Language Model and Style-Based Text-to-Speech Model for Fast Spoken Dialogue Generation
- 2024-08-05: Language Model Can Listen While Speaking
- 2024-07-04: FunAudioLLM Voice Understanding and Generation Foundation Models for Natural Interaction Between Humans and LLMs
- 2024-07-04: DASS Distilled Audio State Space Models Are Stronger and More Duration-Scalable Learners - DASS
- 2024-06-28: BESTOW Efficient and Streamable Speech Language Model with the Best of Two Worlds in GPT and T5
- 2024-06-20: DASB - Discrete Audio and Speech Benchmark
- 2024-06-17: GAMA A Large Audio-Language Model with Advanced Audio Understanding and Complex Reasoning Abilities - GAMA
- 2024-06-15: How Should We Extract Discrete Audio Tokens from Self-Supervised Models
- 2024-06-12: VALL-E R Robust and Efficient Zero-Shot Text-to-Speech Synthesis via Monotonic Alignment
- 2024-06-11: The Interspeech 2024 Challenge on Speech Processing Using Discrete Units
- 2024-06-09: MS-HuBERT Mitigating Pre-training and Inference Mismatch in Masked Language Modelling methods for learning Speech Representations
- 2024-06-08: VALL-E 2 Neural Codec Language Models are Human Parity Zero-Shot Text to Speech Synthesizers
- 2024-06-08: Exploring the Benefits of Tokenization of Discrete Acoustic Units
- 2024-05-12: Unified Video-Language Pre-training with Synchronized Audio - VLSA
- 2024-04-27: T-CLAP Temporal-Enhanced Contrastive Language-Audio Pretraining - T-CLAP
- 2024-03-31: WavLLM Towards Robust and Adaptive Speech Large Language Model
- 2024-03-19: Listenable Maps for Audio Classifiers
- 2024-02-29: Compact Speech Translation Models via Discrete Speech Units Pretraining
- 2024-02-20: OWSM-CTC An Open Encoder-Only Speech Foundation Model for Speech Recognition, Translation, and Language Identification
- 2024-02-19: AnyGPT Unified Multimodal LLM with Discrete Sequence Modeling - AnyGPT
- 2024-02-16: Pushing the Limits of Zero-shot End-to-End Speech Translation
- 2024-02-12: Careless Whisper Speech-to-Text Hallucination Harms
- 2024-02-12: BASE TTS Lessons from building a billion-parameter Text-to-Speech model on 100K hours of data
- 2024-02-08: SpiRit-LM Interleaved Spoken and Written Language Model
- 2024-02-08: Paralinguistics-Aware Speech-Empowered Large Language Models for Natural Conversation
- 2024-01-30: OWSM v3.1 Better and Faster Open Whisper-Style Speech Models based on E-Branchformer
- 2024-01-24: SpeechGPT-Gen Scaling Chain-of-Information Speech Generation
- 2023-12-23: Paralinguistics-Enhanced Large Language Modeling of Spoken Dialogue
- 2023-12-21: EmphAssess a Prosodic Benchmark on Assessing Emphasis Transfer in Speech-to-Speech Models
- 2023-11-14: Qwen-Audio Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models
- 2023-11-12: AudioChatLlama Towards General-Purpose Speech Abilities for LLMs
- 2023-10-24: P-Flow A Fast and Data-Efficient Zero-Shot TTS through Speech Prompting
- 2023-10-23: SALMONN Towards Generic Hearing Abilities for Large Language Models
- 2023-10-23: Intuitive Multilingual Audio-Visual Speech Recognition with a Single-Trained Model
- 2023-10-13: SALM Speech-augmented Language Model with In-context Learning for Speech Recognition and Translation
- 2023-09-27: HyPoradise An Open Baseline for Generative Speech Recognition with Large Language Models
- 2023-09-25: Reproducing Whisper-Style Training Using an Open-Source Toolkit and Publicly Available Data
- 2023-09-14: VoxtLM unified decoder-only models for consolidating speech recognition/synthesis and speech/text continuation tasks
- 2023-09-13: Can Whisper Perform Speech-Based In-Context Learning
- 2023-09-07: Exploring Speech Recognition, Translation, and Understanding with Discrete Speech Units A Comparative Study
- 2023-08-31: SpeechTokenizer Unified Speech Tokenizer for Speech Large Language Models
- 2023-08-23: SpeechX Neural Codec Language Model as a Versatile Speech Transformer - SpeechX
- 2023-08-22: SeamlessM4T - Introducing a foundational multimodal model for speech translation - SeamlessM4T
- 2023-08-22: SeamlessM4T Massively Multilingual & Multimodal Machine Translation
- 2023-06-23: Voicebox Text-Guided Multilingual Universal Speech Generation at Scale
- 2023-06-13: StyleTTS 2 Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models
- 2023-05-29: VAST A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset
- 2023-05-25: VioLA Unified Codec Language Models for Speech Recognition, Synthesis, and Translation
- 2023-05-24: Spoken Question Answering and Speech Continuation Using Spectrogram-Powered LLM - Spectron
- 2023-05-22: Textually Pretrained Speech Language Models - TWIST
- 2023-05-18: SpeechGPT Empowering Large Language Models with Intrinsic Cross-Modal Conversational Abilities
- 2023-05-16: SoundStorm Efficient Parallel Audio Generation - SoundStorm
- 2023-05-12: Better speech synthesis through scaling - TorToise TTS
- 2023-04-25: AudioGPT Understanding and Generating Speech, Music, Sound, and Talking Head - AudioGPT
- 2023-03-14: I3D Transformer architectures with input-dependent dynamic depth for speech recognition
- 2023-03-07: Speak Foreign Languages with Your Own Voice Cross-Lingual Neural Codec Language Modeling
- 2023-01-05: Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers - VALL-E
- 2022-12-06: Robust Speech Recognition via Large-Scale Weak Supervision - Whisper
- 2022-11-12: Large-scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation
- 2022-11-11: Speech-to-Speech Translation For A Real-world Unwritten Language - Meta paper on Taiwanese Hokkien
- 2022-11-08: Comparative layer-wise analysis of self-supervised speech models
- 2022-10-24: High Fidelity Neural Audio Compression
- 2022-10-12: SQuId Measuring Speech Naturalness in Many Languages
- 2022-09-30: SpeechLM Enhanced Speech Pre-Training with Unpaired Textual Data
- 2022-09-30: AudioGen Textually Guided Audio Generation
- 2022-09-07: AudioLM a Language Modeling Approach to Audio Generation - AudioLM
- 2022-06-05: Variable-rate hierarchical CPC leads to acoustic unit discovery in speech
- 2022-04-05: UTMOS UTokyo-SaruLab System for VoiceMOS Challenge 2022
- 2022-02-07: data2vec A General Framework for Self-supervised Learning in Speech, Vision and Language
- 2022-02-03: mSLAM Massively multilingual joint pre-training for speech and text
- 2021-12-04: YourTTS Towards Zero-Shot Multi-Speaker TTS and Zero-Shot Voice Conversion for everyone
- 2021-11-17: XLS-R Self-supervised Cross-lingual Speech Representation Learning at Scale
- 2021-11-09: Multimodal and Multilingual Embeddings for Large-Scale Speech Mining
- 2021-11-03: A Comparison of Discrete and Soft Speech Units for Improved Voice Conversion
- 2021-10-26: WavLM Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing
- 2021-10-14: SpeechT5 Unified-Modal Encoder-Decoder Pre-Training for Spoken Language Processing
- 2021-10-05: DistilHuBERT Speech Representation Learning by Layer-wise Distillation of Hidden-unit BERT
- 2021-09-07: Text-Free Prosody-Aware Generative Spoken Language Modeling
- 2021-08-07: W2v-BERT Combining Contrastive Learning and Masked Language Modeling for Self-Supervised Speech Pre-Training - w2v-BERT
- 2021-07-13: Zero-shot Speech Translation
- 2021-07-07: SoundStream An End-to-End Neural Audio Codec - SoundStream
- 2021-06-14: HuBERT Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units - HuBERT
- 2021-06-11: Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech - VITS
- 2021-04-05: AST Audio Spectrogram Transformer
- 2021-04-01: Speech Resynthesis from Discrete Disentangled Self-Supervised Representations
- 2021-02-01: Generative Spoken Language Modeling from Raw Audio
- 2021-01-09: UniSpeech Unified Speech Representation Learning with Labeled and Unlabeled Data
- 2020-10-20: Pushing the Limits of Semi-Supervised Learning for Automatic Speech Recognition
- 2020-09-04: SEANet A Multi-modal Speech Enhancement Network
- 2020-06-23: Real Time Speech Enhancement in the Waveform Domain
- 2020-06-20: wav2vec 2.0 A Framework for Self-Supervised Learning of Speech Representations - wav2vec 2.0
- 2020-05-16: Conformer Convolution-augmented Transformer for Speech Recognition - Conformer
- 2020-01-25: Multi-task self-supervised learning for Robust Speech Recognition - Mirco Ravanelli and co
- 2019-11-21: Prosody Transfer in Neural Text to Speech Using Global Pitch and Loudness Features
- 2019-10-08: MelGAN Generative Adversarial Networks for Conditional Waveform Synthesis - MelGAN
- 2019-06-06: Understanding and Improving Transformer From a Multi-Particle Dynamic System Point of View - Macron-Net
- 2019-04-18: SpecAugment A Simple Data Augmentation Method for Automatic Speech Recognition
- 2019-04-11: wav2vec Unsupervised Pre-training for Speech Recognition - wav2vec
- 2019-04-06: Learning Problem-agnostic Speech Representations from Multiple Self-supervised Tasks - Mirco Ravanelli and co
- 2018-12-01: Learning Speaker Representations with Mutual Information from Mirco Ravanelli and Yoshua Bengio
- 2018-07-10: Representation Learning with Contrastive Predictive Coding - CPC
- 2018-04-04: Learning Filterbanks from Raw Speech for Phone Recognition
- 2018-03-23: Style Tokens Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis - Style Tokens
- 2017-12-16: Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions - Tacotron 2
- 2017-09-22: Unsupervised Learning of Disentangled and Interpretable Representations from Sequential Data
- 2017-03-19: Tacotron Towards End-to-End Speech Synthesis - Tacotron
- 2016-09-12: WaveNet A Generative Model for Raw Audio
- 2014-09-01: Neural Machine Translation by Jointly Learning to Align and Translate
Speech Datasets are under 👉 Datasets » Speech Datasets