Website: https://poonehmousavi.github.io/rg
Every Thursday, 11am-12pm EDT
- Join us via Zoom
- X / Twitter @convAI2024
- Bluesky @convai-rg.bsky.social
- YouTube channel for past recordings
- Conversational AI Slack to discuss with the community: here.
- Contact here if there are any issues with the invite link.
- Sign up here to receive email communications about the reading group
- Recurring Google Calendar Event
Upcoming Talks
[Dec 19th, 2024]
- Discrete Audio Tokens for Multimodal LLMs
Presenter: Mirco Ravanelli, Concordia University - Mila. Speaker Bio
Mirco Ravanelli received his Ph.D. (with cum laude distinction) from the University of Trento, Italy, in December 2017. He is currently an Assistant Professor at Concordia University, Montreal, QC, Canada, an Adjunct Professor at the Université de Montréal, and a Mila Associate Member. He is the founder and leader of the SpeechBrain project, which aims to build an open-source toolkit for conversational AI and speech processing. He has authored or co-authored more than 80 papers; his research interests include deep learning and conversational AI. He is also an active member of the speech and machine learning communities.
Discrete audio tokens have recently gained considerable attention for their potential to connect audio and language processing, enabling the creation of modern multimodal large language models. Ideal audio tokens must effectively preserve phonetic and semantic content along with paralinguistic information, speaker identity, and other details. While several types of audio tokens have been recently proposed, identifying the optimal tokenizer for various tasks is challenging due to the inconsistent evaluation settings in existing studies. To address this gap, we release the Discrete Audio and Speech Benchmark (DASB), a comprehensive leaderboard for benchmarking discrete audio tokens across a wide range of discriminative tasks, including speech recognition, speaker identification and verification, emotion recognition, keyword spotting, and intent classification, as well as generative tasks such as speech enhancement, separation, and text-to-speech. Our results show that, on average, semantic tokens outperform compression tokens across most discriminative and generative tasks. However, the performance gap between semantic tokens and standard continuous representations remains substantial, highlighting the need for further research in this field.
[Jan 9th, 2025]
- Neural Audio Codecs in the Era of Speech LMs
Presenter: Haibin Wu, Microsoft. Speaker Bio
Haibin Wu is a senior researcher at Microsoft, focusing on speech processing. He completed his Ph.D. at National Taiwan University under Prof. Hung-yi Lee. He is a recipient of the Google PhD Fellowship, awarded to only 75 scholars worldwide every year. Haibin has published more than 20 first-author papers in top conferences and journals like ICASSP, Interspeech, TASLP, ACL, ASRU, and SLT. He is also a key contributor to S3prl, an open-source speech toolkit with 2.2k GitHub stars. He gained industry experience through internships at Microsoft, Meta, Amazon, and Tencent, working on speech generation, enhancement, and model compression. Haibin also conducted research as a visiting student at Tsinghua University and the Chinese University of Hong Kong. In addition, Haibin co-organizes the SUPERB and Codec-SUPERB challenges, helping set benchmarks for speech SSL and codec model evaluation.
Neural audio codecs (NACs) have gained significant attention as essential technologies for audio compression and as foundational components for speech language models. In the era of speech LMs, the codec domain presents both challenges and opportunities. This talk covers three aspects of NACs: modeling, evaluation, and security. It first introduces TS3-Codec, a Transformer-based Simple Streaming Single Codec, which offers streaming capability, low computational demands, low bitrate, and a single-codebook design, all while delivering high audio quality. It then presents Codec-SUPERB, the first benchmark designed to evaluate codec models in terms of reconstruction quality from both signal-level and application-level perspectives. Finally, it presents CodecFake, the first deepfake audio dataset based on codecs, which equips models to effectively counter codec-based speech generation systems.
[Jan 16th, 2025]
- TBA
Presenter: Martijn Bartelds, Stanford University. Speaker Bio
Martijn Bartelds is a Postdoctoral Scholar at Stanford University, advised by Dan Jurafsky. His research focuses on multilingual speech and language processing, with a particular interest in understanding where language variety and dialect information is encoded in neural speech models, benchmarking, and model training. He received his PhD with the highest distinction from the University of Groningen, where his thesis was nominated for the university’s best thesis award. He also received a prestigious NWO Rubicon fellowship and was a visiting researcher at Delft University of Technology and the University of Pennsylvania.
[Jan 23rd, 2025]
- TBA
Presenter: Piotr Żelasko, NVIDIA. Speaker Bio
Abstract
[Jan 30th, 2025]
- TBA
Presenter: Karen Livescu, TTIC. Speaker Bio
Abstract
Past Talks, Fall 2024
[Dec 5th, 2024]
- Posthoc Explanations for Audio Models
Presenter: Cem Subakan, Université Laval - Mila. Speaker Bio
Cem Subakan is an Assistant Professor in the Computer Science Department at Université Laval, an Affiliate Assistant Professor at Concordia University, and an Associate Academic Member at Mila. His research is on machine learning for speech and audio, with a recent focus on explainable machine learning. He recently co-organized the Explainable AI for Speech and Audio workshop at ICASSP 2024 and will serve as a general chair of the IEEE MLSP 2025 conference.
He discusses his recent work on generating explanations for audio models. While deep learning models excel at achieving high performance, they often function as black boxes, offering little transparency into their decision-making processes. His aim in this line of work is to develop methods that produce listenable explanations for these black-box audio models without compromising their original performance. Through several metrics, he demonstrates that the explanations generated by his approach remain faithful to the original model and are both listenable and understandable.
[Nov 21st, 2024]
- Parameter Averaging Is All You Need to Prevent Forgetting
Presenter: Peter Plantinga, McGill University. Speaker Bio
Peter Plantinga is a Postdoctoral Researcher at McGill University’s Department of Neurology and Neurosurgery, where his research leverages speech and audio data to develop biomarkers for neurodegenerative diseases. With a long-standing passion for applying AI to assistive technologies, Peter has published extensively on enhancing speech intelligibility in noisy environments for both human listeners and automated systems. He is a core developer of the open-source SpeechBrain toolkit, widely used in the speech processing and conversational AI communities, and previously led speech AI projects at JPMorganChase’s Machine Learning Center of Excellence, contributing to several patents in conversational AI technologies. Peter’s current work sits at the intersection of neuroscience and AI, aiming to advance the understanding and treatment of different neurological disorders through innovations in interpretable machine learning for voice analysis.
Continual learning in end-to-end automatic speech recognition (E2E-ASR) often suffers from catastrophic forgetting, where fine-tuning leads to significant performance degradation on previously seen data. While adapters offer a way to switch between fine-tuned models, they still underperform in unseen domains—a challenge when the input domain is unknown. We propose a method that reduces forgetting to just 3.4%, significantly outperforming fine-tuning strategies like LoRA, which exhibits a 49% forgetting rate. By linearly interpolating the parameters of multiple models fine-tuned from the same generalist model, we achieve a unified model that excels across diverse datasets. Moreover, this model can be iteratively fine-tuned and averaged while maintaining low forgetting rates. Our experiments demonstrate the robustness of this approach across various datasets and models, presenting a promising solution for continual learning in E2E-ASR.
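As a rough illustration of the parameter-averaging idea described in this abstract, the sketch below linearly interpolates the state dicts of several models fine-tuned from the same generalist checkpoint. This is a minimal PyTorch sketch, not the authors' implementation: the checkpoint file names and the equal interpolation weights are placeholder assumptions.

```python
# Minimal sketch of linear parameter interpolation ("model averaging") across
# checkpoints fine-tuned from the same generalist model. The file names and
# equal weights below are illustrative assumptions, not the paper's setup.
import torch

def average_state_dicts(state_dicts, weights=None):
    """Return a state dict whose floating-point tensors are the weighted
    average of the corresponding tensors in `state_dicts`."""
    if weights is None:
        weights = [1.0 / len(state_dicts)] * len(state_dicts)

    averaged = {}
    for key, ref in state_dicts[0].items():
        if torch.is_floating_point(ref):
            averaged[key] = sum(w * sd[key] for w, sd in zip(weights, state_dicts))
        else:
            # Integer buffers (e.g., step counters) are copied from the first model.
            averaged[key] = ref.clone()
    return averaged

# Hypothetical usage: combine domain-specific fine-tunes into one model.
# paths = ["asr_domain_a.pt", "asr_domain_b.pt", "asr_domain_c.pt"]
# state_dicts = [torch.load(p, map_location="cpu") for p in paths]
# model.load_state_dict(average_state_dicts(state_dicts))
```

The sketch relies on the general observation that models fine-tuned from a shared initialization tend to stay close in parameter space, so a simple linear combination of their weights can behave like a single unified model.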
Organizers
Pooneh Mousavi
Pooneh Mousavi (she/her) is a computer science PhD student at Mila and Concordia University, supervised by Professor Mirco Ravanelli. She has a broad interest in deep learning for Conversational AI. Her research focuses on discrete self-supervised learning for speech and audio, exploring its potential to bridge audio and language models. She is also one of the main contributors to the SpeechBrain project, a popular open-source conversational AI toolkit.
Website, Google Scholar, LinkedIn
Hiba Akhaddar
Hiba Akhaddar (she/her) is a master’s student in Computer Science at Concordia University and Mila, supervised by Prof. Tristan Glatard and Prof. Mirco Ravanelli. Her interests revolve around applications of deep learning in the medical field, and she works on detecting Parkinson’s disease and tracking its progression from speech.
Website, LinkedIn