🪴 Anil's Garden

Speech and Audio - Rolodex - Papers, Models and Releases

19 Dec 2025 · 11 min read

todo
kyutai

TODO: make this a timeline, ideally by migrating to tags and then using either a Dataview with the publication date as a column or, better, a script that generates a materialised timeline of speech papers, models and releases (by parsing this note, including clipped releases and repos).
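A minimal sketch of what such a script could look like, assuming the entries stay in the current `YYYY-MM-DD: Title` form and that undated lines should be kept in a separate bucket. The names `parse_entries` and `build_timeline` are hypothetical, and this reads plain lines rather than tags or Dataview fields.

```python
import re
from collections import defaultdict
from datetime import date

# Assumption: a dated entry looks like "2024-09-18: Some title".
ENTRY = re.compile(r"^(\d{4})-(\d{2})-(\d{2}):\s*(.+)$")


def parse_entries(lines):
    """Split note lines into dated (date, title) pairs and undated titles.

    Blank lines and verbatim duplicate lines are skipped.
    An impossible date such as 2024-13-01 raises ValueError.
    """
    dated, undated = [], []
    seen = set()
    for raw in lines:
        line = raw.strip()
        if not line or line in seen:
            continue
        seen.add(line)
        match = ENTRY.match(line)
        if match:
            year, month, day, title = match.groups()
            dated.append((date(int(year), int(month), int(day)), title))
        else:
            undated.append(line)
    return dated, undated


def build_timeline(dated):
    """Group dated entries by year, newest first within and across years."""
    by_year = defaultdict(list)
    for when, title in sorted(dated, reverse=True):
        by_year[when.year].append((when, title))
    return dict(by_year)


if __name__ == "__main__":
    sample = [
        "Voxtral Mistral AI",
        "2024-12-06: Continuous Speech Tokens Makes LLMs Robust Multi-Modality Learners",
        "2024-12-06: Continuous Speech Tokens Makes LLMs Robust Multi-Modality Learners",
        "2016-09-12: WaveNet A Generative Model for Raw Audio",
    ]
    dated, undated = parse_entries(sample)
    for year, items in build_timeline(dated).items():
        print(year)
        for when, title in items:
            print(f"  {when.isoformat()}  {title}")
    print("undated:", undated)
```

Undated entries are returned separately so they can be backfilled with a publication date later instead of being dropped from the timeline.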

LFM2-Audio An End-to-End Audio Foundation Model Liquid AI
TranSpeech Speech-to-Speech Translation With Bilateral Perturbation
CASPER A Large Scale Spontaneous Speech Dataset
Chain-of-Thought Training for Open E2E Spoken Dialogue Systems
Neural Networks Fail to Learn Periodic Functions and How to Fix It - “Snake activations”
NaturalSpeech 3 Zero-Shot Speech Synthesis with Factorized Codec and Diffusion Models
Audio-Visual Speech Enhancement Using Multimodal Deep Convolutional Neural Networks
BLAB Brutally Long Audio Bench
boson-ai/higgs-audio Text-audio foundation model from Boson AI
CLaM-TTS Improving Neural Codec Language Model for Zero-Shot Text-to-Speech
dMel Speech Tokenization made Simple
Exploration on HuBERT with Multiple Resolutions
FocalCodec Low-Bitrate Speech Coding via Focal Modulation Networks from Mirco Ravanelli
HiFi-Codec Group-residual Vector quantization for High Fidelity Audio Codec - HiFi-Codec
Kimi-Audio Technical Report
Llasa Scaling Train-Time and Inference-Time Compute for Llama-based Speech Synthesis
Looking to Listen at the Cocktail Party A Speaker-Independent Audio-Visual Model for Speech Separation
Meta Audiobox Aesthetics Unified Automatic Quality Assessment for Speech, Music, and Sound
Multi-resolution HuBERT Multi-resolution Speech Self-Supervised Learning with Masked Unit Prediction
NUTSHELL A Dataset for Abstract Generation from Scientific Talks
Parallel WaveGAN A fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram
SemantiCodec An Ultra Low Bitrate Semantic Audio Codec for General Sound
SHuBERT Self-Supervised Sign Language Representation Learning via Multi-Stream Cluster Prediction
SpeechCLIP Integrating Speech with Pre-Trained Vision and Language Model
SpeechQE Estimating the Quality of Direct Speech Translation
Spoken Language Modeling with Duration-Penalized Self-Supervised Units
tinyCLAP Distilling Constrastive Language-Audio Pretrained Models
Translation in the Hands of Many Centering Lay Users in Machine Translation Interactions
TS3-Codec Transformer-Based Simple Streaming Single Codec
Voice Conversion With Just Nearest Neighbors
Voxtral Mistral AI
WhisperX Time-Accurate Speech Transcription of Long-Form Audio
ZSVC Zero-shot Style Voice Conversion with Disentangled Latent Diffusion Models and Adversarial Training
Jasper An End-to-End Convolutional Neural Acoustic Model
End-to-end ASR from Supervised to Semi-Supervised Learning with Modern Architectures
Scaling Up Online Speech Recognition Using ConvNets
EmotiVoice 😊 a Multi-Voice and Prompt-Controlled TTS Engine - netease-youdao
2025-04-03: Scaling Analysis of Interleaved Speech-Text Language Models
2025-03-19: Vision-Speech Models Teaching Speech Models to Converse about Images - MoshiVis from kyutai
2025-03-18: SVLA A Unified Speech-Vision-Language Assistant with Multimodal Reasoning and Speech Generation
2025-03-18: MoonCast High-Quality Zero-Shot Podcast Generation
2025-03-13: AudioX Diffusion Transformer for Anything-to-Audio Generation
2025-02-27: Crossing the uncanny valley of conversational voice - Sesame
2025-02-17: Step-Audio Unified Understanding and Generation in Intelligent Speech Interaction
2025-02-14: OWLS Scaling Laws for Multilingual Speech Recognition and Translation Models
2025-02-05: High-Fidelity Simultaneous Speech-To-Speech Translation - Hibiki from kyutai
2025-01-20: LLM supervised Pre-training for Multimodal Emotion Recognition in Conversations
2025-01-10: xLSTM-SENet xLSTM for Single-Channel Speech Enhancement
2025-01-10: MinMo A Multimodal Large Language Model for Seamless Voice Interaction
2025-01-04: Prepending or Cross-Attention for Speech-to-Text An Empirical Comparison
2024-12-24: How Real is Your Real-Time Simultaneous Speech-to-Text Translation System
2024-12-16: Speech Foundation Models and Crowdsourcing for Efficient, High-Quality Data Collection
2024-12-13: MERaLiON-AudioLLM Bridging Audio and Language with Large Language Models
2024-12-06: Continuous Speech Tokens Makes LLMs Robust Multi-Modality Learners
2024-12-04: Explainability for Speech Models On the Challenges of Acoustic Feature Selection
2024-12-02: AlignFormer Modality Matching Can Achieve Better Zero-shot Instruction-Following Speech-LLM
2024-11-29: Scaling Transformers for Low-Bitrate High-Quality Speech Coding
2024-11-27: SALMONN-omni A Codec-free LLM for Full-duplex Speech Understanding and Generation
2024-11-13: A Comparative Study of Discrete Speech Tokens for Semantic-Related Tasks with Large Language Models
2024-11-09: Selective State Space Model for Monaural Speech Enhancement
2024-11-04: Align-SLM Textless Spoken Language Models with Reinforcement Learning from AI Feedback
2024-11-03: SPES Spectrogram Perturbation for Explainable Speech-to-Text Generation
2024-11-03: Introducing hertz-dev - Standard Intelligence - waiting for a paper on this one; added PLACEHOLDER hertz-dev - Standard Intelligence for now
2024-11-01: Freeze-Omni A Smart and Low Latency Speech-to-speech Dialogue Model with Frozen LLM
2024-10-31: DC-Spin A Speaker-invariant Speech Tokenizer for Spoken Language Models
2024-10-23: OmniFlatten An End-to-end GPT Model for Seamless Voice Conversation
2024-10-22: WavTokenizer an Efficient Acoustic Discrete Codec Tokenizer for Audio Language Modeling
2024-10-22: VoiceBench Benchmarking LLM-Based Voice Assistants
2024-10-22: Continuous Speech Tokenizer in Text To Speech
2024-10-20: MaskGCT Zero-Shot Text-to-Speech with Masked Generative Codec Transformer
2024-10-20: Ichigo Mixed-Modal Early-Fusion Realtime Voice Assistant
2024-10-19: DM-Codec Distilling Multimodal Representations for Speech Tokenization
2024-10-16: What Do Speech Foundation Models Not Learn About Speech
2024-10-06: HALL-E Hierarchical Neural Codec Language Model for Minute-Long Zero-Shot Text-to-Speech Synthesis
2024-10-05: SyllableLM Learning Coarse Semantic Units for Speech Language Models
2024-09-26: EMOVA Empowering Language Models to See, Hear and Speak with Vivid Emotions
2024-09-22: What Are They Doing Joint Audio-Speech Co-Reasoning
2024-09-18: Moshi a speech-text foundation model for real-time dialogue - Moshi from kyutai
2024-09-11: A Suite for Acoustic Language Model Evaluation
2024-09-10: LLaMA-Omni Seamless Speech Interaction with Large Language Models
2024-09-09: Leveraging Content and Acoustic Representations for Speech Emotion Recognition
2024-09-01: Comparing Discrete and Continuous Space LLMs for Speech Recognition
2024-08-29: Mini-Omni Language Models Can Hear, Talk While Thinking in Streaming
2024-08-14: CMU’s IWSLT 2024 Simultaneous Speech Translation System
2024-08-13: Style-Talker Finetuning Audio Language Model and Style-Based Text-to-Speech Model for Fast Spoken Dialogue Generation
2024-08-05: Language Model Can Listen While Speaking
2024-07-04: FunAudioLLM Voice Understanding and Generation Foundation Models for Natural Interaction Between Humans and LLMs
2024-07-04: DASS Distilled Audio State Space Models Are Stronger and More Duration-Scalable Learners - DASS
2024-06-28: BESTOW Efficient and Streamable Speech Language Model with the Best of Two Worlds in GPT and T5
2024-06-20: DASB - Discrete Audio and Speech Benchmark
2024-06-17: GAMA A Large Audio-Language Model with Advanced Audio Understanding and Complex Reasoning Abilities - GAMA
2024-06-15: How Should We Extract Discrete Audio Tokens from Self-Supervised Models
2024-06-12: VALL-E R Robust and Efficient Zero-Shot Text-to-Speech Synthesis via Monotonic Alignment
2024-06-11: The Interspeech 2024 Challenge on Speech Processing Using Discrete Units
2024-06-09: MS-HuBERT Mitigating Pre-training and Inference Mismatch in Masked Language Modelling methods for learning Speech Representations
2024-06-08: VALL-E 2 Neural Codec Language Models are Human Parity Zero-Shot Text to Speech Synthesizers
2024-06-08: Exploring the Benefits of Tokenization of Discrete Acoustic Units
2024-05-12: Unified Video-Language Pre-training with Synchronized Audio - VLSA
2024-04-27: T-CLAP Temporal-Enhanced Contrastive Language-Audio Pretraining - T-CLAP
2024-03-31: WavLLM Towards Robust and Adaptive Speech Large Language Model
2024-03-19: Listenable Maps for Audio Classifiers
2024-02-29: Compact Speech Translation Models via Discrete Speech Units Pretraining
2024-02-20: OWSM-CTC An Open Encoder-Only Speech Foundation Model for Speech Recognition, Translation, and Language Identification
2024-02-19: AnyGPT Unified Multimodal LLM with Discrete Sequence Modeling - AnyGPT
2024-02-16: Pushing the Limits of Zero-shot End-to-End Speech Translation
2024-02-12: Careless Whisper Speech-to-Text Hallucination Harms
2024-02-12: BASE TTS Lessons from building a billion-parameter Text-to-Speech model on 100K hours of data
2024-02-08: SpiRit-LM Interleaved Spoken and Written Language Model
2024-02-08: Paralinguistics-Aware Speech-Empowered Large Language Models for Natural Conversation
2024-01-30: OWSM v3.1 Better and Faster Open Whisper-Style Speech Models based on E-Branchformer
2024-01-24: SpeechGPT-Gen Scaling Chain-of-Information Speech Generation
2023-12-23: Paralinguistics-Enhanced Large Language Modeling of Spoken Dialogue
2023-12-21: EmphAssess a Prosodic Benchmark on Assessing Emphasis Transfer in Speech-to-Speech Models
2023-11-14: Qwen-Audio Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models
2023-11-12: AudioChatLlama Towards General-Purpose Speech Abilities for LLMs
2023-10-24: P-Flow A Fast and Data-Efficient Zero-Shot TTS through Speech Prompting
2023-10-23: SALMONN Towards Generic Hearing Abilities for Large Language Models
2023-10-23: Intuitive Multilingual Audio-Visual Speech Recognition with a Single-Trained Model
2023-10-13: SALM Speech-augmented Language Model with In-context Learning for Speech Recognition and Translation
2023-09-27: HyPoradise An Open Baseline for Generative Speech Recognition with Large Language Models
2023-09-25: Reproducing Whisper-Style Training Using an Open-Source Toolkit and Publicly Available Data
2023-09-14: VoxtLM unified decoder-only models for consolidating speech recognition/synthesis and speech/text continuation tasks
2023-09-13: Can Whisper Perform Speech-Based In-Context Learning
2023-09-07: Exploring Speech Recognition, Translation, and Understanding with Discrete Speech Units A Comparative Study
2023-08-31: SpeechTokenizer Unified Speech Tokenizer for Speech Large Language Models
2023-08-23: SpeechX Neural Codec Language Model as a Versatile Speech Transformer - SpeechX
2023-08-22: SeamlessM4T - Introducing a foundational multimodal model for speech translation - SeamlessM4T
2023-08-22: SeamlessM4T Massively Multilingual & Multimodal Machine Translation
2023-06-23: Voicebox Text-Guided Multilingual Universal Speech Generation at Scale
2023-06-13: StyleTTS 2 Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models
2023-05-29: VAST A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset
2023-05-25: VioLA Unified Codec Language Models for Speech Recognition, Synthesis, and Translation
2023-05-24: Spoken Question Answering and Speech Continuation Using Spectrogram-Powered LLM - Spectron
2023-05-22: Textually Pretrained Speech Language Models - TWIST
2023-05-18: SpeechGPT Empowering Large Language Models with Intrinsic Cross-Modal Conversational Abilities
2023-05-16: SoundStorm Efficient Parallel Audio Generation - SoundStorm
2023-05-12: Better speech synthesis through scaling - TorToise TTS
2023-04-25: AudioGPT Understanding and Generating Speech, Music, Sound, and Talking Head - AudioGPT
2023-03-14: I3D Transformer architectures with input-dependent dynamic depth for speech recognition
2023-03-07: Speak Foreign Languages with Your Own Voice Cross-Lingual Neural Codec Language Modeling
2023-01-05: Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers - Vall-E
2022-12-06: Robust Speech Recognition via Large-Scale Weak Supervision - Whisper
2022-11-12: Large-scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation
2022-11-11: Speech-to-Speech Translation For A Real-world Unwritten Language - Meta paper on Taiwanese Hokkien
2022-11-08: Comparative layer-wise analysis of self-supervised speech models
2022-10-24: High Fidelity Neural Audio Compression
2022-10-12: SQuId Measuring Speech Naturalness in Many Languages
2022-09-30: SpeechLM Enhanced Speech Pre-Training with Unpaired Textual Data
2022-09-30: AudioGen Textually Guided Audio Generation
2022-09-07: AudioLM a Language Modeling Approach to Audio Generation - AudioLM
2022-06-05: Variable-rate hierarchical CPC leads to acoustic unit discovery in speech
2022-04-05: UTMOS UTokyo-SaruLab System for VoiceMOS Challenge 2022
2022-02-07: data2vec A General Framework for Self-supervised Learning in Speech, Vision and Language
2022-02-03: mSLAM Massively multilingual joint pre-training for speech and text
2021-12-04: YourTTS Towards Zero-Shot Multi-Speaker TTS and Zero-Shot Voice Conversion for everyone
2021-11-17: XLS-R Self-supervised Cross-lingual Speech Representation Learning at Scale
2021-11-09: Multimodal and Multilingual Embeddings for Large-Scale Speech Mining
2021-11-03: A Comparison of Discrete and Soft Speech Units for Improved Voice Conversion
2021-10-26: WavLM Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing
2021-10-14: SpeechT5 Unified-Modal Encoder-Decoder Pre-Training for Spoken Language Processing
2021-10-05: DistilHuBERT Speech Representation Learning by Layer-wise Distillation of Hidden-unit BERT
2021-09-07: Text-Free Prosody-Aware Generative Spoken Language Modeling
2021-08-07: W2v-BERT Combining Contrastive Learning and Masked Language Modeling for Self-Supervised Speech Pre-Training - w2v-BERT
2021-07-13: Zero-shot Speech Translation
2021-07-07: SoundStream An End-to-End Neural Audio Codec - SoundStream
2021-06-14: HuBERT Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units - HuBERT
2021-06-11: Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech - VITS
2021-04-05: AST Audio Spectrogram Transformer
2021-04-01: Speech Resynthesis from Discrete Disentangled Self-Supervised Representations
2021-02-01: Generative Spoken Language Modeling from Raw Audio
2021-01-09: UniSpeech Unified Speech Representation Learning with Labeled and Unlabeled Data
2020-10-20: Pushing the Limits of Semi-Supervised Learning for Automatic Speech Recognition
2020-09-04: SEANet A Multi-modal Speech Enhancement Network
2020-06-23: Real Time Speech Enhancement in the Waveform Domain
2020-06-20: wav2vec 2.0 A Framework for Self-Supervised Learning of Speech Representations - wav2vec 2.0
2020-05-16: Conformer Convolution-augmented Transformer for Speech Recognition - Conformer
2020-01-25: Multi-task self-supervised learning for Robust Speech Recognition - Mirco Ravanelli and co
2019-11-21: Prosody Transfer in Neural Text to Speech Using Global Pitch and Loudness Features
2019-10-08: MelGAN Generative Adversarial Networks for Conditional Waveform Synthesis - MelGAN
2019-06-06: Understanding and Improving Transformer From a Multi-Particle Dynamic System Point of View - Macron-Net
2019-04-18: SpecAugment A Simple Data Augmentation Method for Automatic Speech Recognition
2019-04-11: wav2vec Unsupervised Pre-training for Speech Recognition - wav2vec
2019-04-06: Learning Problem-agnostic Speech Representations from Multiple Self-supervised Tasks - Mirco Ravanelli and co
2018-12-01: Learning Speaker Representations with Mutual Information from Mirco Ravanelli and Yoshua Bengio
2018-07-10: Representation Learning with Contrastive Predictive Coding - CPC
2018-04-04: Learning Filterbanks from Raw Speech for Phone Recognition
2018-03-23: Style Tokens Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis - Style Tokens
2017-12-16: Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions - Tacotron 2
2017-09-22: Unsupervised Learning of Disentangled and Interpretable Representations from Sequential Data
2017-03-19: Tacotron Towards End-to-End Speech Synthesis - Tacotron
2016-09-12: WaveNet A Generative Model for Raw Audio
2014-09-01: Neural Machine Translation by Jointly Learning to Align and Translate

Speech Datasets are under 👉 Datasets » Speech Datasets
