TODO: make this a timeline, ideally by migrating to tags and then either using a Dataview with the publication date as a column or, better, writing a script that generates a materialised timeline of speech papers/models/releases (by parsing this list, including clipped releases/repos, for example)

  1. LFM2-Audio An End-to-End Audio Foundation Model Liquid AI
  2. TranSpeech Speech-to-Speech Translation With Bilateral Perturbation
  3. CASPER A Large Scale Spontaneous Speech Dataset
  4. Chain-of-Thought Training for Open E2E Spoken Dialogue Systems
  5. Neural Networks Fail to Learn Periodic Functions and How to Fix It - “Snake activations”
  6. NaturalSpeech 3 Zero-Shot Speech Synthesis with Factorized Codec and Diffusion Models
  7. Audio-Visual Speech Enhancement Using Multimodal Deep Convolutional Neural Networks
  8. BLAB Brutally Long Audio Bench
  9. boson-ai/higgs-audio Text-audio foundation model from Boson AI
  10. CLaM-TTS Improving Neural Codec Language Model for Zero-Shot Text-to-Speech
  11. dMel Speech Tokenization made Simple
  12. Exploration on HuBERT with Multiple Resolutions
  13. FocalCodec Low-Bitrate Speech Coding via Focal Modulation Networks from Mirco Ravanelli
  14. HiFi-Codec Group-residual Vector quantization for High Fidelity Audio Codec - HiFi-Codec
  15. Kimi-Audio Technical Report
  16. Llasa Scaling Train-Time and Inference-Time Compute for Llama-based Speech Synthesis
  17. Looking to Listen at the Cocktail Party A Speaker-Independent Audio-Visual Model for Speech Separation
  18. Meta Audiobox Aesthetics Unified Automatic Quality Assessment for Speech, Music, and Sound
  19. Multi-resolution HuBERT Multi-resolution Speech Self-Supervised Learning with Masked Unit Prediction
  20. NUTSHELL A Dataset for Abstract Generation from Scientific Talks
  21. Parallel WaveGAN A fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram
  22. SemantiCodec An Ultra Low Bitrate Semantic Audio Codec for General Sound
  23. SHuBERT Self-Supervised Sign Language Representation Learning via Multi-Stream Cluster Prediction
  24. SpeechCLIP Integrating Speech with Pre-Trained Vision and Language Model
  25. SpeechQE Estimating the Quality of Direct Speech Translation
  26. Spoken Language Modeling with Duration-Penalized Self-Supervised Units
  27. tinyCLAP Distilling Constrastive Language-Audio Pretrained Models
  28. Translation in the Hands of Many Centering Lay Users in Machine Translation Interactions
  29. TS3-Codec Transformer-Based Simple Streaming Single Codec
  30. Voice Conversion With Just Nearest Neighbors
  31. Voxtral Mistral AI
  32. WhisperX Time-Accurate Speech Transcription of Long-Form Audio
  33. ZSVC Zero-shot Style Voice Conversion with Disentangled Latent Diffusion Models and Adversarial Training
  34. Jasper An End-to-End Convolutional Neural Acoustic Model
  35. End-to-end ASR from Supervised to Semi-Supervised Learning with Modern Architectures
  36. Scaling Up Online Speech Recognition Using ConvNets
  37. EmotiVoice 😊 a Multi-Voice and Prompt-Controlled TTS Engine - netease-youdao
  38. 2025-04-03: Scaling Analysis of Interleaved Speech-Text Language Models
  39. 2025-03-19: Vision-Speech Models Teaching Speech Models to Converse about Images - MoshiVis from kyutai
  40. 2025-03-18: SVLA A Unified Speech-Vision-Language Assistant with Multimodal Reasoning and Speech Generation
  41. 2025-03-18: MoonCast High-Quality Zero-Shot Podcast Generation
  42. 2025-03-13: AudioX Diffusion Transformer for Anything-to-Audio Generation
  43. 2025-02-27: Crossing the uncanny valley of conversational voice - Sesame
  44. 2025-02-17: Step-Audio Unified Understanding and Generation in Intelligent Speech Interaction
  45. 2025-02-14: OWLS Scaling Laws for Multilingual Speech Recognition and Translation Models
  46. 2025-02-05: High-Fidelity Simultaneous Speech-To-Speech Translation - Hibiki from kyutai
  47. 2025-01-20: LLM supervised Pre-training for Multimodal Emotion Recognition in Conversations
  48. 2025-01-10: xLSTM-SENet xLSTM for Single-Channel Speech Enhancement
  49. 2025-01-10: MinMo A Multimodal Large Language Model for Seamless Voice Interaction
  50. 2025-01-04: Prepending or Cross-Attention for Speech-to-Text An Empirical Comparison
  51. 2024-12-24: How Real is Your Real-Time Simultaneous Speech-to-Text Translation System
  52. 2024-12-16: Speech Foundation Models and Crowdsourcing for Efficient, High-Quality Data Collection
  53. 2024-12-13: MERaLiON-AudioLLM Bridging Audio and Language with Large Language Models
  54. 2024-12-06: Continuous Speech Tokens Makes LLMs Robust Multi-Modality Learners
  56. 2024-12-04: Explainability for Speech Models On the Challenges of Acoustic Feature Selection
  57. 2024-12-02: AlignFormer Modality Matching Can Achieve Better Zero-shot Instruction-Following Speech-LLM
  58. 2024-11-29: Scaling Transformers for Low-Bitrate High-Quality Speech Coding
  59. 2024-11-27: SALMONN-omni A Codec-free LLM for Full-duplex Speech Understanding and Generation
  60. 2024-11-13: A Comparative Study of Discrete Speech Tokens for Semantic-Related Tasks with Large Language Models
  61. 2024-11-09: Selective State Space Model for Monaural Speech Enhancement
  62. 2024-11-04: Align-SLM Textless Spoken Language Models with Reinforcement Learning from AI Feedback
  63. 2024-11-03: SPES Spectrogram Perturbation for Explainable Speech-to-Text Generation
  64. 2024-11-03: Introducing hertz-dev - Standard Intelligence - waiting for a paper on this one; added PLACEHOLDER hertz-dev - Standard Intelligence for now
  65. 2024-11-01: Freeze-Omni A Smart and Low Latency Speech-to-speech Dialogue Model with Frozen LLM
  66. 2024-10-31: DC-Spin A Speaker-invariant Speech Tokenizer for Spoken Language Models
  67. 2024-10-23: OmniFlatten An End-to-end GPT Model for Seamless Voice Conversation
  68. 2024-10-22: WavTokenizer an Efficient Acoustic Discrete Codec Tokenizer for Audio Language Modeling
  69. 2024-10-22: VoiceBench Benchmarking LLM-Based Voice Assistants
  70. 2024-10-22: Continuous Speech Tokenizer in Text To Speech
  71. 2024-10-20: MaskGCT Zero-Shot Text-to-Speech with Masked Generative Codec Transformer
  72. 2024-10-20: Ichigo Mixed-Modal Early-Fusion Realtime Voice Assistant
  73. 2024-10-19: DM-Codec Distilling Multimodal Representations for Speech Tokenization
  74. 2024-10-16: What Do Speech Foundation Models Not Learn About Speech
  75. 2024-10-06: HALL-E Hierarchical Neural Codec Language Model for Minute-Long Zero-Shot Text-to-Speech Synthesis
  76. 2024-10-05: SyllableLM Learning Coarse Semantic Units for Speech Language Models
  77. 2024-09-26: EMOVA Empowering Language Models to See, Hear and Speak with Vivid Emotions
  78. 2024-09-22: What Are They Doing Joint Audio-Speech Co-Reasoning
  79. 2024-09-18: Moshi a speech-text foundation model for real-time dialogue - Moshi from kyutai
  80. 2024-09-11: A Suite for Acoustic Language Model Evaluation
  81. 2024-09-10: LLaMA-Omni Seamless Speech Interaction with Large Language Models
  82. 2024-09-09: Leveraging Content and Acoustic Representations for Speech Emotion Recognition
  83. 2024-09-01: Comparing Discrete and Continuous Space LLMs for Speech Recognition
  84. 2024-08-29: Mini-Omni Language Models Can Hear, Talk While Thinking in Streaming
  85. 2024-08-14: CMU’s IWSLT 2024 Simultaneous Speech Translation System
  86. 2024-08-13: Style-Talker Finetuning Audio Language Model and Style-Based Text-to-Speech Model for Fast Spoken Dialogue Generation
  87. 2024-08-05: Language Model Can Listen While Speaking
  88. 2024-07-04: FunAudioLLM Voice Understanding and Generation Foundation Models for Natural Interaction Between Humans and LLMs
  89. 2024-07-04: DASS Distilled Audio State Space Models Are Stronger and More Duration-Scalable Learners - DASS
  90. 2024-06-28: BESTOW Efficient and Streamable Speech Language Model with the Best of Two Worlds in GPT and T5
  91. 2024-06-20: DASB - Discrete Audio and Speech Benchmark
  92. 2024-06-17: GAMA A Large Audio-Language Model with Advanced Audio Understanding and Complex Reasoning Abilities - GAMA
  93. 2024-06-15: How Should We Extract Discrete Audio Tokens from Self-Supervised Models
  94. 2024-06-12: VALL-E R Robust and Efficient Zero-Shot Text-to-Speech Synthesis via Monotonic Alignment
  95. 2024-06-11: The Interspeech 2024 Challenge on Speech Processing Using Discrete Units
  96. 2024-06-09: MS-HuBERT Mitigating Pre-training and Inference Mismatch in Masked Language Modelling methods for learning Speech Representations
  97. 2024-06-08: VALL-E 2 Neural Codec Language Models are Human Parity Zero-Shot Text to Speech Synthesizers
  98. 2024-06-08: Exploring the Benefits of Tokenization of Discrete Acoustic Units
  99. 2024-05-12: Unified Video-Language Pre-training with Synchronized Audio - VLSA
  100. 2024-04-27: T-CLAP Temporal-Enhanced Contrastive Language-Audio Pretraining - T-CLAP
  101. 2024-03-31: WavLLM Towards Robust and Adaptive Speech Large Language Model
  102. 2024-03-19: Listenable Maps for Audio Classifiers
  103. 2024-02-29: Compact Speech Translation Models via Discrete Speech Units Pretraining
  104. 2024-02-20: OWSM-CTC An Open Encoder-Only Speech Foundation Model for Speech Recognition, Translation, and Language Identification
  105. 2024-02-19: AnyGPT Unified Multimodal LLM with Discrete Sequence Modeling - AnyGPT
  106. 2024-02-16: Pushing the Limits of Zero-shot End-to-End Speech Translation
  107. 2024-02-12: Careless Whisper Speech-to-Text Hallucination Harms
  108. 2024-02-12: BASE TTS Lessons from building a billion-parameter Text-to-Speech model on 100K hours of data
  109. 2024-02-08: SpiRit-LM Interleaved Spoken and Written Language Model
  110. 2024-02-08: Paralinguistics-Aware Speech-Empowered Large Language Models for Natural Conversation
  111. 2024-01-30: OWSM v3.1 Better and Faster Open Whisper-Style Speech Models based on E-Branchformer
  112. 2024-01-24: SpeechGPT-Gen Scaling Chain-of-Information Speech Generation
  113. 2023-12-23: Paralinguistics-Enhanced Large Language Modeling of Spoken Dialogue
  114. 2023-12-21: EmphAssess a Prosodic Benchmark on Assessing Emphasis Transfer in Speech-to-Speech Models
  115. 2023-11-14: Qwen-Audio Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models
  116. 2023-11-12: AudioChatLlama Towards General-Purpose Speech Abilities for LLMs
  117. 2023-10-24: P-Flow A Fast and Data-Efficient Zero-Shot TTS through Speech Prompting
  118. 2023-10-23: SALMONN Towards Generic Hearing Abilities for Large Language Models
  119. 2023-10-23: Intuitive Multilingual Audio-Visual Speech Recognition with a Single-Trained Model
  120. 2023-10-13: SALM Speech-augmented Language Model with In-context Learning for Speech Recognition and Translation
  121. 2023-09-27: HyPoradise An Open Baseline for Generative Speech Recognition with Large Language Models
  122. 2023-09-25: Reproducing Whisper-Style Training Using an Open-Source Toolkit and Publicly Available Data
  123. 2023-09-14: Voxtlm unified decoder-only models for consolidating speech recognition/synthesis and speech/text continuation tasks
  124. 2023-09-13: Can Whisper Perform Speech-Based In-Context Learning
  125. 2023-09-07: Exploring Speech Recognition, Translation, and Understanding with Discrete Speech Units A Comparative Study
  126. 2023-08-31: SpeechTokenizer Unified Speech Tokenizer for Speech Large Language Models
  127. 2023-08-23: SpeechX Neural Codec Language Model as a Versatile Speech Transformer - SpeechX
  128. 2023-08-22: SeamlessM4T - Introducing a foundational multimodal model for speech translation - SeamlessM4T
  129. 2023-08-22: SeamlessM4T Massively Multilingual & Multimodal Machine Translation
  130. 2023-06-23: Voicebox Text-Guided Multilingual Universal Speech Generation at Scale
  131. 2023-06-13: StyleTTS 2 Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models
  132. 2023-05-29: VAST A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset
  133. 2023-05-25: VioLA Unified Codec Language Models for Speech Recognition, Synthesis, and Translation
  134. 2023-05-24: Spoken Question Answering and Speech Continuation Using Spectrogram-Powered LLM - Spectron
  135. 2023-05-22: Textually Pretrained Speech Language Models - TWIST
  136. 2023-05-18: SpeechGPT Empowering Large Language Models with Intrinsic Cross-Modal Conversational Abilities
  137. 2023-05-16: SoundStorm Efficient Parallel Audio Generation - SoundStorm
  138. 2023-05-12: Better speech synthesis through scaling - TorToise TTS
  139. 2023-04-25: AudioGPT Understanding and Generating Speech, Music, Sound, and Talking Head - AudioGPT
  140. 2023-03-14: I3D Transformer architectures with input-dependent dynamic depth for speech recognition
  141. 2023-03-07: Speak Foreign Languages with Your Own Voice Cross-Lingual Neural Codec Language Modeling
  142. 2023-01-05: Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers - Vall-E
  143. 2022-12-06: Robust Speech Recognition via Large-Scale Weak Supervision - Whisper
  144. 2022-11-12: Large-scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation
  145. 2022-11-11: Speech-to-Speech Translation For A Real-world Unwritten Language - Meta paper on Taiwanese Hokkien
  146. 2022-11-08: Comparative layer-wise analysis of self-supervised speech models
  147. 2022-10-24: High Fidelity Neural Audio Compression
  148. 2022-10-12: SQuId Measuring Speech Naturalness in Many Languages
  149. 2022-09-30: SpeechLM Enhanced Speech Pre-Training with Unpaired Textual Data
  150. 2022-09-30: AudioGen Textually Guided Audio Generation
  151. 2022-09-07: AudioLM a Language Modeling Approach to Audio Generation - AudioLM
  152. 2022-06-05: Variable-rate hierarchical CPC leads to acoustic unit discovery in speech
  153. 2022-04-05: UTMOS UTokyo-SaruLab System for VoiceMOS Challenge 2022
  154. 2022-02-07: data2vec A General Framework for Self-supervised Learning in Speech, Vision and Language
  155. 2022-02-03: mSLAM Massively multilingual joint pre-training for speech and text
  156. 2021-12-04: YourTTS Towards Zero-Shot Multi-Speaker TTS and Zero-Shot Voice Conversion for everyone
  157. 2021-11-17: XLS-R Self-supervised Cross-lingual Speech Representation Learning at Scale
  158. 2021-11-09: Multimodal and Multilingual Embeddings for Large-Scale Speech Mining
  159. 2021-11-03: A Comparison of Discrete and Soft Speech Units for Improved Voice Conversion
  160. 2021-10-26: WavLM Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing
  161. 2021-10-14: SpeechT5 Unified-Modal Encoder-Decoder Pre-Training for Spoken Language Processing
  162. 2021-10-05: DistilHuBERT Speech Representation Learning by Layer-wise Distillation of Hidden-unit BERT
  163. 2021-09-07: Text-Free Prosody-Aware Generative Spoken Language Modeling
  164. 2021-08-07: W2v-BERT Combining Contrastive Learning and Masked Language Modeling for Self-Supervised Speech Pre-Training - w2v-BERT
  165. 2021-07-13: Zero-shot Speech Translation
  166. 2021-07-07: SoundStream An End-to-End Neural Audio Codec - SoundStream
  167. 2021-06-14: HuBERT Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units - HuBERT
  168. 2021-06-11: Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech - VITS
  169. 2021-04-05: AST Audio Spectrogram Transformer
  170. 2021-04-01: Speech Resynthesis from Discrete Disentangled Self-Supervised Representations
  171. 2021-02-01: Generative Spoken Language Modeling from Raw Audio
  172. 2021-01-09: UniSpeech Unified Speech Representation Learning with Labeled and Unlabeled Data
  173. 2020-10-20: Pushing the Limits of Semi-Supervised Learning for Automatic Speech Recognition
  174. 2020-09-04: SEANet A Multi-modal Speech Enhancement Network
  175. 2020-06-23: Real Time Speech Enhancement in the Waveform Domain
  176. 2020-06-20: wav2vec 2.0 A Framework for Self-Supervised Learning of Speech Representations - wav2vec 2.0
  177. 2020-05-16: Conformer Convolution-augmented Transformer for Speech Recognition - Conformer
  178. 2020-01-25: Multi-task self-supervised learning for Robust Speech Recognition - Mirco Ravanelli and co
  179. 2019-11-21: Prosody Transfer in Neural Text to Speech Using Global Pitch and Loudness Features
  180. 2019-10-08: MelGAN Generative Adversarial Networks for Conditional Waveform Synthesis - MelGAN
  181. 2019-06-06: Understanding and Improving Transformer From a Multi-Particle Dynamic System Point of View - Macron-Net
  182. 2019-04-18: SpecAugment A Simple Data Augmentation Method for Automatic Speech Recognition
  183. 2019-04-11: wav2vec Unsupervised Pre-training for Speech Recognition - wav2vec
  184. 2019-04-06: Learning Problem-agnostic Speech Representations from Multiple Self-supervised Tasks - Mirco Ravanelli and co
  185. 2018-12-01: Learning Speaker Representations with Mutual Information from Mirco Ravanelli and Yoshua Bengio
  186. 2018-07-10: Representation Learning with Contrastive Predictive Coding - CPC
  187. 2018-04-04: Learning Filterbanks from Raw Speech for Phone Recognition
  188. 2018-03-23: Style Tokens Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis - Style Tokens
  189. 2017-12-16: Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions - Tacotron 2
  190. 2017-09-22: Unsupervised Learning of Disentangled and Interpretable Representations from Sequential Data
  191. 2017-03-19: Tacotron Towards End-to-End Speech Synthesis - Tacotron
  192. 2016-09-12: WaveNet A Generative Model for Raw Audio
  193. 2014-09-01: Neural Machine Translation by Jointly Learning to Align and Translate

Speech Datasets are under 👉 Datasets » Speech Datasets
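
A rough sketch of the "script" option from the TODO at the top of this note, assuming the list keeps its current shape: numbered lines, with an optional `YYYY-MM-DD: ` prefix before the title. The names `Entry`, `parse_entries` and `build_timeline` are made up for illustration, and the output format (a year line followed by indented entries, undated items last) is only one possible layout, not a settled design.

```python
import re
from dataclasses import dataclass
from datetime import date
from typing import List, Optional

# Matches "  12. 2024-10-22: Some Title" as well as "  12. Some Title".
# The date prefix is optional because the first entries in the list are undated.
ENTRY_RE = re.compile(r"^\s*\d+\.\s+(?:(\d{4})-(\d{2})-(\d{2}):\s+)?(.+?)\s*$")


@dataclass(frozen=True)
class Entry:
    published: Optional[date]  # None when the list line carries no date
    title: str


def parse_entries(text: str) -> List[Entry]:
    """Collect every numbered list line; lines that are not list items are skipped."""
    entries = []
    for line in text.splitlines():
        match = ENTRY_RE.match(line)
        if match is None:
            continue
        year, month, day, title = match.groups()
        published = date(int(year), int(month), int(day)) if year else None
        entries.append(Entry(published, title))
    return entries


def build_timeline(entries: List[Entry]) -> str:
    """Render dated entries newest first, grouped by year, then the undated ones."""
    dated = sorted(
        (e for e in entries if e.published is not None),
        key=lambda e: e.published,
        reverse=True,
    )
    undated = [e for e in entries if e.published is None]

    lines = []
    current_year = None
    for entry in dated:
        if entry.published.year != current_year:
            current_year = entry.published.year
            lines.append(str(current_year))
        lines.append(f"  {entry.published.isoformat()}  {entry.title}")
    if undated:
        lines.append("undated")
        lines.extend(f"  {entry.title}" for entry in undated)
    return "\n".join(lines)
```

Running `build_timeline(parse_entries(note_text))` over the text of this note would give the materialised timeline. Picking up publication dates for clipped releases/repos would need an extra source for the date (for example note frontmatter), which this sketch does not attempt.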