Interspeech 2024

Kos, Greece

1-5 September 2024

Chairs: Itshak Lapidot, Sharon Gannot
doi: 10.21437/Interspeech.2024

L2 Speech, Bilingualism and Code-Switching


The influence of L2 accent strength and different error types on personality trait ratings
Sarah Wesolek, Piotr Gulgowski, Joanna Blaszczak, Marzena Zygis

Characterizing code-switching: Applying Linguistic Principles for Metric Assessment and Development
Jie Chi, Electra Wallington, Peter Bell

Towards a better understanding of receptive multilingualism: listening conditions and priming effects
Wei Xue, Ivan Yuen, Bernd Möbius

2.5D Vocal Tract Modeling: Bridging Low-Dimensional Efficiency with 3D Accuracy
Debasish Ray Mohapatra, Victor Zappi, Sidney Fels

Speaker Diarization 1


Investigating Confidence Estimation Measures for Speaker Diarization
Anurag Chowdhury, Abhinav Misra, Mark C. Fuhs, Monika Woszczyna

Speakers Unembedded: Embedding-free Approach to Long-form Neural Diarization
Xiang Li, Vivek Govindan, Rohit Paturi, Sundararajan Srinivasan

On the Success and Limitations of Auxiliary Network Based Word-Level End-to-End Neural Speaker Diarization
Yiling Huang, Weiran Wang, Guanlong Zhao, Hank Liao, Wei Xia, Quan Wang

EEND-M2F: Masked-attention mask transformers for speaker diarization
Marc Härkönen, Samuel J. Broughton, Lahiru Samarakoon

AFL-Net: Integrating Audio, Facial, and Lip Modalities with a Two-step Cross-attention for Robust Speaker Diarization in the Wild
YongKang Yin, Xu Li, Ying Shan, YueXian Zou

Exploiting Wavelet Scattering Transform for an Unsupervised Speaker Diarization in Deep Neural Network Framework
Arunav Arya, Murtiza Ali, Karan Nathwani

Speech and Audio Analysis and Representations


MINT: Boosting Audio-Language Model via Multi-Target Pre-Training and Instruction Tuning
Hang Zhao, Yifei Xin, Zhesong Yu, Bilei Zhu, Lu Lu, Zejun Ma

M2D-CLAP: Masked Modeling Duo Meets CLAP for Learning General-purpose Audio-Language Representation
Daisuke Niizumi, Daiki Takeuchi, Yasunori Ohishi, Noboru Harada, Masahiro Yasuda, Shunsuke Tsubaki, Keisuke Imoto

Audio Fingerprinting with Holographic Reduced Representations
Yusuke Fujita, Tatsuya Komatsu

RAST: A Reference-Audio Synchronization Tool for Dubbed Content
David Meyer, Eitan Abecassis, Clara Fernandez-Labrador, Christopher Schroers

YOLOPitch: A Time-Frequency Dual-Branch YOLO Model for Pitch Estimation
Xuefei Li, Hao Huang, Ying Hu, Liang He, Jiabao Zhang, Yuyi Wang

Reduce, Reuse, Recycle: Is Perturbed Data Better than Other Language Augmentation for Low Resource Self-Supervised Speech Models
Asad Ullah, Alessandro Ragano, Andrew Hines

AlignNet: Learning dataset score alignment functions to enable better training of speech quality estimators
Jaden Pieper, Stephen Voran

Acoustic Event Detection and Classification 2


Improving Audio Classification with Low-Sampled Microphone Input: An Empirical Study Using Model Self-Distillation
Dawei Liang, Alice Zhang, David Harwath, Edison Thomaz

MFF-EINV2: Multi-scale Feature Fusion across Spectral-Spatial-Temporal Domains for Sound Event Localization and Detection
Da Mu, Zhicheng Zhang, Haobo Yue

Diversifying and Expanding Frequency-Adaptive Convolution Kernels for Sound Event Detection
Hyeonuk Nam, Seong-Hu Kim, Deokki Min, Junhyeok Lee, Yong-Hwa Park

Stream-based Active Learning for Anomalous Sound Detection in Machine Condition Monitoring
Tuan Vu Ho, Kota Dohi, Yohei Kawaguchi

AnoPatch: Towards Better Consistency in Machine Anomalous Sound Detection
Anbai Jiang, Bing Han, Zhiqiang Lv, Yufeng Deng, Wei-Qiang Zhang, Xie Chen, Yanmin Qian, Jia Liu, Pingyi Fan

FakeSound: Deepfake General Audio Detection
Zeyu Xie, Baihan Li, Xuenan Xu, Zheng Liang, Kai Yu, Mengyue Wu

Sound of Traffic: A Dataset for Acoustic Traffic Identification and Counting
Shabnam Ghaffarzadegan, Luca Bondi, Wei-Chang Lin, Abinaya Kumar, Ho-Hsiang Wu, Hans-Georg Horst, Samarjit Das

Detection and Classification of Bioacoustic Signals


Vision Transformer Segmentation for Visual Bird Sound Denoising
Sahil Kumar, Jialu Li, Youshan Zhang

DB3V: A Dialect Dominated Dataset of Bird Vocalisation for Cross-corpus Bird Species Recognition
Xin Jing, Luyang Zhang, Jiangjian Xie, Alexander Gebhard, Alice Baird, Björn Schuller

Investigating self-supervised speech models’ ability to classify animal vocalizations: The case of gibbon’s vocal signatures
Jules Cauzinille, Benoît Favre, Ricard Marxer, Dena Clink, Abdul Hamid Ahmad, Arnaud Rey

Study Selectively: An Adaptive Knowledge Distillation based on a Voting Network for Heart Sound Classification
Xihang Qiu, Lixian Zhu, Zikai Song, Zeyu Chen, Haojie Zhang, Kun Qian, Ye Zhang, Bin Hu, Yoshiharu Yamamoto, Björn W. Schuller

SimuSOE: A Simulated Snoring Dataset for Obstructive Sleep Apnea-Hypopnea Syndrome Evaluation during Wakefulness
Jie Lin, Xiuping Yang, Li Xiao, Xinhong Li, Weiyan Yi, Yuhong Yang, Weiping Tu, Xiong Chen

Acoustic Echo Cancellation


Multi-mic Echo Cancellation Coalesced with Beamforming for Real World Adverse Acoustic Conditions
Premanand Nayak, Kamini Sabu, M. Ali Basha Shaik

Interference Aware Training Target for DNN based joint Acoustic Echo Cancellation and Noise Suppression
Vahid Khanagha, Dimitris Koutsaidis, Kaustubh Kalgaonkar, Sriram Srinivasan

Low Complexity Echo Delay Estimator Based on Binarized Feature Matching
Yi Gao, Xiang Su

MSA-DPCRN: A Multi-Scale Asymmetric Dual-Path Convolution Recurrent Network with Attentional Feature Fusion for Acoustic Echo Cancellation
Ye Ni, Cong Pang, Chengwei Huang, Cairong Zou

Efficient Joint Beamforming and Acoustic Echo Cancellation Structure for Conference Call Scenarios
Ofer Schwartz, Sharon Gannot

SDAEC: Signal Decoupling for Advancing Acoustic Echo Cancellation
Fei Zhao, Jinjiang Liu, Xueliang Zhang

Speech Synthesis: Voice Conversion 1


Spatial Voice Conversion: Voice Conversion Preserving Spatial Information and Non-target Signals
Kentaro Seki, Shinnosuke Takamichi, Norihiro Takamune, Yuki Saito, Kanami Imamura, Hiroshi Saruwatari

Neural Codec Language Models for Disentangled and Textless Voice Conversion
Alan Baade, Puyuan Peng, David Harwath

Fine-Grained and Interpretable Neural Speech Editing
Max Morrison, Cameron Churchwell, Nathan Pruyne, Bryan Pardo

FastVoiceGrad: One-step Diffusion-Based Voice Conversion with Adversarial Conditional Diffusion Distillation
Takuhiro Kaneko, Hirokazu Kameoka, Kou Tanaka, Yuto Kondo

DualVC 3: Leveraging Language Model Generated Pseudo Context for End-to-end Low Latency Streaming Voice Conversion
Ziqian Ning, Shuai Wang, Pengcheng Zhu, Zhichao Wang, Jixun Yao, Lei Xie, Mengxiao Bi

Towards Realistic Emotional Voice Conversion using Controllable Emotional Intensity
Tianhua Qi, Shiyan Wang, Cheng Lu, Yan Zhao, Yuan Zong, Wenming Zheng

Neural Network Architectures for ASR 2


InterBiasing: Boost Unseen Word Recognition through Biasing Intermediate Predictions
Yu Nakagome, Michael Hentschel

SEQ-former: A context-enhanced and efficient automatic speech recognition framework
Qinglin Meng, Min Liu, Kaixun Huang, Kun Wei, Lei Xie, Zongfeng Quan, Weihong Deng, Quan Lu, Ning Jiang, Guoqing Zhao

How Much Context Does My Attention-Based ASR System Need?
Robert Flynn, Anton Ragni

Rich speech signal: exploring and exploiting end-to-end automatic speech recognizers’ ability to model hesitation phenomena
Vincenzo Norman Vitale, Loredana Schettino, Francesco Cutugno

Transmitted and Aggregated Self-Attention for Automatic Speech Recognition
Tian-Hao Zhang, Xinyuan Qian, Feng Chen, Xu-Cheng Yin

MULTI-CONVFORMER: Extending Conformer with Multiple Convolution Kernels
Darshan Prabhu, Yifan Peng, Preethi Jyothi, Shinji Watanabe

Exploring the Capability of Mamba in Speech Applications
Koichi Miyazaki, Yoshiki Masuyama, Masato Murata

Robust Voice Activity Detection using Locality-Sensitive Hashing and Residual Frequency-Temporal Attention
Shu Li, Peng Zhang, Ye Li

Lightweight Transducer Based on Frame-Level Criterion
Genshun Wan, Mengzhi Wang, Tingzhi Mao, Hang Chen, Zhongfu Ye

Exploring the limits of decoder-only models trained on public speech recognition corpora
Ankit Gupta, George Saon, Brian Kingsbury

Contextual Biasing Speech Recognition in Speech-enhanced Large Language Model
Xun Gong, Anqi Lv, Zhiming Wang, Yanmin Qian

Decoding Algorithms


Towards Effective and Efficient Non-autoregressive Decoding Using Block-based Attention Mask
Tianzi Wang, Xurong Xie, Zhaoqing Li, Shoukang Hu, Zengrui Jin, Jiajun Deng, Mingyu Cui, Shujie Hu, Mengzhe Geng, Guinan Li, Helen Meng, Xunying Liu

E-Paraformer: A Faster and Better Parallel Transformer for Non-autoregressive End-to-End Mandarin Speech Recognition
Kun Zou, Fengyun Tan, Ziyang Zhuang, Chenfeng Miao, Tao Wei, Shaodan Zhai, Zijian Li, Wei Hu, Shaojun Wang, Jing Xiao

Beam-search SIEVE for low-memory speech recognition
Martino Ciaperoni, Athanasios Katsamanis, Aristides Gionis, Panagiotis Karras

Speed of Light Exact Greedy Decoding for RNN-T Speech Recognition Models on GPU
Daniel Galvez, Vladimir Bataev, Hainan Xu, Tim Kaldewey

Contextual Biasing with the Knuth-Morris-Pratt Matching Algorithm
Weiran Wang, Zelin Wu, Diamantino Caseiro, Tsendsuren Munkhdalai, Khe Chai Sim, Pat Rondon, Golan Pundak, Gan Song, Rohit Prabhavalkar, Zhong Meng, Ding Zhao, Tara Sainath, Yanzhang He, Pedro Moreno Mengibar

Text-only Domain Adaptation for CTC-based Speech Recognition through Substitution of Implicit Linguistic Information in the Search Space
Tatsunari Takagi, Yukoh Wakabayashi, Atsunori Ogawa, Norihide Kitaoka

Pronunciation Assessment


Pitch-Aware RNN-T for Mandarin Chinese Mispronunciation Detection and Diagnosis
Xintong Wang, Mingqian Shi, Ye Wang

MultiPA: A Multi-task Speech Pronunciation Assessment Model for Open Response Scenarios
Yu-Wen Chen, Zhou Yu, Julia Hirschberg

A Framework for Phoneme-Level Pronunciation Assessment Using CTC
Xinwei Cao, Zijian Fan, Torbjørn Svendsen, Giampiero Salvi

Phonological-Level Mispronunciation Detection and Diagnosis
Mostafa Shahin, Beena Ahmed

Acoustic Feature Mixup for Balanced Multi-aspect Pronunciation Assessment
Heejin Do, Wonjun Lee, Gary Geunbae Lee

Automated content assessment and feedback for Finnish L2 learners in a picture description speaking task
Nhan Phan, Anna von Zansen, Maria Kautonen, Ekaterina Voskoboinik, Tamas Grosz, Raili Hilden, Mikko Kurimo

Spoken Language Processing


Query-by-Example Keyword Spotting Using Spectral-Temporal Graph Attentive Pooling and Multi-Task Learning
Zhenyu Wang, Shuyu Kong, Li Wan, Biqiao Zhang, Yiteng Huang, Mumin Jin, Ming Sun, Xin Lei, Zhaojun Yang

Relational Proxy Loss for Audio-Text based Keyword Spotting
Youngmoon Jung, Seungjin Lee, Joon-Young Yang, Jaeyoung Roh, Chang Woo Han, Hoon-Young Cho

CTC-aligned Audio-Text Embedding for Streaming Open-vocabulary Keyword Spotting
Sichen Jin, Youngmoon Jung, Seungjin Lee, Jaeyoung Roh, Changwoo Han, Hoonyoung Cho

Text-aware Speech Separation for Multi-talker Keyword Spotting
Haoyu Li, Baochen Yang, Yu Xi, Linfeng Yu, Tian Tan, Hao Li, Kai Yu

Language-Universal Speech Attributes Modeling for Zero-Shot Multilingual Spoken Keyword Recognition
Hao Yen, Pin-Jui Ku, Sabato Marco Siniscalchi, Chin-Hui Lee

Adding User Feedback To Enhance CB-Whisper
Raul Monteiro

OWSM v3.1: Better and Faster Open Whisper-Style Speech Models based on E-Branchformer
Yifan Peng, Jinchuan Tian, William Chen, Siddhant Arora, Brian Yan, Yui Sudo, Muhammad Shakeel, Kwanghee Choi, Jiatong Shi, Xuankai Chang, Jee-weon Jung, Shinji Watanabe

Spoken Machine Translation 2


Parameter-Efficient Adapter Based on Pre-trained Models for Speech Translation
Nan Chen, Yonghe Wang, Feilong Bao

Wave to Interlingua: Analyzing Representations of Multilingual Speech Transformers for Spoken Language Translation
Badr M. Abdullah, Mohammed Maqsood Shaik, Dietrich Klakow

Knowledge-Preserving Pluggable Modules for Multilingual Speech Translation Tasks
Nan Chen, Yonghe Wang, Feilong Bao

Navigating the Minefield of MT Beam Search in Cascaded Streaming Speech Translation
Rastislav Rabatin, Frank Seide, Ernie Chang

Soft Language Identification for Language-Agnostic Many-to-One End-to-End Speech Translation
Peidong Wang, Jian Xue, Jinyu Li, Junkun Chen, Aswin Shanmugam Subramanian

A Unit-based System and Dataset for Expressive Direct Speech-to-Speech Translation
Anna Min, Chenxu Hu, Yi Ren, Hang Zhao

Translating speech with just images
Dan Oneata, Herman Kamper

ZeroST: Zero-Shot Speech Translation
Sameer Khurana, Chiori Hori, Antoine Laurent, Gordon Wichern, Jonathan Le Roux

Biosignal-enabled Spoken Communication


A multimodal approach to study the nature of coordinative patterns underlying speech rhythm
Jinyu Li, Leonardo Lancia

Towards EMG-to-Speech with Necklace Form Factor
Peter Wu, Ryan Kaveh, Raghav Nautiyal, Christine Zhang, Albert Guo, Anvitha Kachinthaya, Tavish Mishra, Bohan Yu, Alan W Black, Rikky Muller, Gopala Krishna Anumanchipalli

Using articulated speech EEG signals for imagined speech decoding
Chris Bras, Tanvina Patel, Odette Scharenborg

Direct Speech Synthesis from Non-Invasive, Neuromagnetic Signals
Jinuk Kwon, David Harwath, Debadatta Dash, Paul Ferrari, Jun Wang

Optical Flow Guided Tongue Trajectory Generation for Diffusion-based Acoustic to Articulatory Inversion
Yudong Yang, Rongfeng Su, Rukiye Ruzi, Manwa Ng, Shaofeng Zhao, Nan Yan, Lan Wang

Multimodal Segmentation for Vocal Tract Modeling
Rishi Jain, Bohan Yu, Peter Wu, Tejas Prabhune, Gopala Anumanchipalli

Articulatory synthesis using representations learnt through phonetic label-aware contrastive loss
Jesuraj Bandekar, Sathvik Udupa, Prasanta Kumar Ghosh

Auditory Attention Decoding in Four-Talker Environment with EEG
Yujie Yan, Xiran Xu, Haolin Zhu, Pei Tian, Zhongshu Ge, Xihong Wu, Jing Chen

ASA: An Auditory Spatial Attention Dataset with Multiple Speaking Locations
Zijie Lin, Tianyu He, Siqi Cai, Haizhou Li

Leveraging Graphic and Convolutional Neural Networks for Auditory Attention Detection with EEG
Saurav Pahuja, Gabriel Ivucic, Pascal Himmelmann, Siqi Cai, Tanja Schultz, Haizhou Li

Individual and Social Factors in Phonetics


Echoes of Implicit Bias: Exploring Aesthetics and Social Meanings of Swiss German Dialect Features
Tillmann Pistor, Adrian Leemann

In search of structure and correspondence in intra-speaker trial-to-trial variability
Vivian G. Li

Modelled Multivariate Overlap: A method for measuring vowel merger
Irene Smith, Morgan Sonderegger, The Spade Consortium

Entrainment Analysis and Prosody Prediction of Subsequent Interlocutor’s Backchannels in Dialogue
Keiko Ochi, Koji Inoue, Divesh Lala, Tatsuya Kawahara

Exploring the anatomy of articulation rate in spontaneous English speech: relationships between utterance length effects and social factors
James Tanner, Morgan Sonderegger, Jane Stuart-Smith, Tyler Kendall, Jeff Mielke, Robin Dodsworth, Erik Thomas

Familiar and Unfamiliar Speaker Identification in Speech and Singing
Katelyn Taylor, Amelia Gully, Helena Daffern

Paralinguistics


Cross-transfer Knowledge between Speech and Text Encoders to Evaluate Customer Satisfaction
Luis Felipe Parra-Gallego, Tilak Purohit, Bogdan Vlasenko, Juan Rafael Orozco-Arroyave, Mathew Magimai.-Doss

Fine-tuning of Pre-trained Models for Classification of Vocal Intensity Category from Speech Signals
Manila Kodali, Sudarsana Reddy Kadiri, Paavo Alku

Real-world PTSD Recognition: A Cross-corpus and Cross-linguistic Evaluation
Alexander Kathan, Martin Bürger, Andreas Triantafyllopoulos, Sabrina Milkus, Jonas Hohmann, Pauline Muderlak, Jürgen Schottdorf, Richard Musil, Björn Schuller, Shahin Amiriparian

Switching Tongues, Sharing Hearts: Identifying the Relationship between Empathy and Code-switching in Speech
Debasmita Bhattacharya, Eleanor Lin, Run Chen, Julia Hirschberg

Speaker Recognition: Adversarial and Spoofing Attacks


Anti-spoofing Ensembling Model: Dynamic Weight Allocation in Ensemble Models for Improved Voice Biometrics Security
Eros Rosello, Angel M. Gomez, Iván López-Espejo, Antonio M. Peinado, Juan M. Martín-Doñas

Spoof Diarization: “What Spoofed When” in Partially Spoofed Audio
Lin Zhang, Xin Wang, Erica Cooper, Mireia Diez, Federico Landini, Nicholas Evans, Junichi Yamagishi

Spoofing Speech Detection by Modeling Local Spectro-Temporal and Long-term Dependency
Haochen Wu, Wu Guo, Zhentao Zhang, Wenting Zhao, Shengyu Peng, Jie Zhang

Improving Copy-Synthesis Anti-Spoofing Training Method with Rhythm and Speaker Perturbation
Jingze Lu, Yuxiang Zhang, Zhuo Li, Zengqiang Shang, Wenchao Wang, Pengyuan Zhang

VoiceDefense: Protecting Automatic Speaker Verification Models Against Black-box Adversarial Attacks
Yip Keng Kan, Ke Xu, Hao Li, Jie Shi

Neural Codec-based Adversarial Sample Detection for Speaker Verification
Xuanjun Chen, Jiawei Du, Haibin Wu, Jyh-Shing Roger Jang, Hung-yi Lee

Textual-Driven Adversarial Purification for Speaker Verification
Sizhou Chen, Yibo Bai, Jiadi Yao, Xiao-Lei Zhang, Xuelong Li

Boosting the Transferability of Adversarial Examples with Gradient-Aligned Ensemble Attack for Speaker Recognition
Zhuhai Li, Jie Zhang, Wu Guo, Haochen Wu

Temporal-Channel Modeling in Multi-head Self-Attention for Synthetic Speech Detection
Duc-Tuan Truong, Ruijie Tao, Tuan Nguyen, Hieu-Thi Luong, Kong Aik Lee, Eng Siong Chng

Audio Event Detection and Classification 1


Can Synthetic Audio From Generative Foundation Models Assist Audio Recognition and Speech Modeling?
Tiantian Feng, Dimitrios Dimitriadis, Shrikanth S. Narayanan

Scaling up masked audio encoder learning for general audio classification
Heinrich Dinkel, Zhiyong Yan, Yongqing Wang, Junbo Zhang, Yujun Wang, Bin Wang

Audio Mamba: Selective State Spaces for Self-Supervised Audio Representations
Sarthak Yadav, Zheng-Hua Tan

MAT-SED: A Masked Audio Transformer with Masked-Reconstruction Based Pre-training for Sound Event Detection
Pengfei Cai, Yan Song, Kang Li, Haoyu Song, Ian McLoughlin

Sound Event Bounding Boxes
Janek Ebbers, François G. Germain, Gordon Wichern, Jonathan Le Roux

Low-Complexity Acoustic Scene Classification Using Parallel Attention-Convolution Network
Yanxiong Li, Jiaxin Tan, Guoqing Chen, Jialong Li, Yongjie Si, Qianhua He

Source Separation 2


Towards Explainable Monaural Speaker Separation with Auditory-based Training
Hassan Taherian, Vahid Ahmadi Kalkhorani, Ashutosh Pandey, Daniel Wong, Buye Xu, DeLiang Wang

Does the Lombard Effect Matter in Speech Separation? Introducing the Lombard-GRID-2mix Dataset
Iva Ewert, Marvin Borsdorf, Haizhou Li, Tanja Schultz

PARIS: Pseudo-AutoRegressIve Siamese Training for Online Speech Separation
Zexu Pan, Gordon Wichern, François G. Germain, Kohei Saijo, Jonathan Le Roux

OR-TSE: An Overlap-Robust Speaker Encoder for Target Speech Extraction
Yiru Zhang, Linyu Yao, Qun Yang

Multimodal Representation Loss Between Timed Text and Audio for Regularized Speech Separation
Tsun-An Hsieh, Heeyoul Choi, Minje Kim

SA-WavLM: Speaker-Aware Self-Supervised Pre-training for Mixture Speech
Jingru Lin, Meng Ge, Junyi Ao, Liqun Deng, Haizhou Li

TSE-PI: Target Sound Extraction under Reverberant Environments with Pitch Information
Yiwen Wang, Xihong Wu

Enhanced Reverberation as Supervision for Unsupervised Speech Separation
Kohei Saijo, Gordon Wichern, François G. Germain, Zexu Pan, Jonathan Le Roux

Noise Reduction, Dereverberation, and Echo Cancellation


Deep Echo Path Modeling for Acoustic Echo Cancellation
Fei Zhao, Chenggang Zhang, Shulin He, Jinjiang Liu, Xueliang Zhang

Graph Attention Based Multi-Channel U-Net for Speech Dereverberation With Ad-Hoc Microphone Arrays
Hongmei Guo, Yijiang Chen, Xiao-Lei Zhang, Xuelong Li

Speech dereverberation constrained on room impulse response characteristics
Louis Bahrman, Mathieu Fontaine, Jonathan Le Roux, Gaël Richard

DeWinder: Single-Channel Wind Noise Reduction using Ultrasound Sensing
Kuang Yuan, Shuo Han, Swarun Kumar, Bhiksha Raj

ANIMAL-CLEAN – A Deep Denoising Toolkit for Animal-Independent Signal Enhancement
Alexander Barnhill, Elmar Noeth, Andreas Maier, Christian Bergler

Elucidating Clock-drift Using Real-world Audios In Wireless Mode For Time-offset Insensitive End-to-End Asynchronous Acoustic Echo Cancellation
Premanand Nayak, M. Ali Basha Shaik

QMixCAT: Unsupervised Speech Enhancement Using Quality-guided Signal Mixing and Competitive Alternating Model Training
Shilin Wang, Haixin Guan, Yanhua Long

Computationally-Efficient Speech Enhancement


Speech Boosting: Low-Latency Live Speech Enhancement for TWS Earbuds
Hanbin Bae, Pavel Andreev, Azat Saginbaev, Nicholas Babaev, WonJun Lee, Hosang Sung, Hoon-Young Cho

Knowledge Distillation for Tiny Speech Enhancement with Latent Feature Augmentation
Behnam Gholami, Mostafa El-Khamy, KeeBong Song

Sub-PNWR: Speech Enhancement Based on Signal Sub-Band Splitting and Pseudo Noisy Waveform Reconstruction Loss
Yuewei Zhang, Huanbin Zou, Jie Zhu

Streamlining Speech Enhancement DNNs: an Automated Pruning Method Based on Dependency Graph with Advanced Regularized Loss Strategies
Zugang Zhao, Jinghong Zhang, Yonghui Liu, Jianbing Liu, Kai Niu, Zhiqiang He

Lightweight Dynamic Sparse Transformer for Monaural Speech Enhancement
Zehua Zhang, Xuyi Zhuang, Yukun Qian, Mingjiang Wang

MUSE: Flexible Voiceprint Receptive Fields and Multi-Path Fusion Enhanced Taylor Transformer for U-Net-based Speech Enhancement
Zizhen Lin, Xiaoting Chen, Junyu Wang

Dynamic Gated Recurrent Neural Network for Compute-efficient Speech Enhancement
Longbiao Cheng, Ashutosh Pandey, Buye Xu, Tobi Delbruck, Shih-Chii Liu

Zero-shot TTS


Improving Audio Codec-based Zero-Shot Text-to-Speech Synthesis with Multi-Modal Context and Large Language Model
Jinlong Xue, Yayue Deng, Yicheng Han, Yingming Gao, Ya Li

An Investigation of Noise Robustness for Flow-Matching-Based Zero-Shot TTS
Xiaofei Wang, Sefik Emre Eskimez, Manthan Thakker, Hemin Yang, Zirun Zhu, Min Tang, Yufei Xia, Jinzhu Li, Sheng Zhao, Jinyu Li, Naoyuki Kanda

Lightweight Zero-shot Text-to-Speech with Mixture of Adapters
Kenichi Fujita, Takanori Ashihara, Marc Delcroix, Yusuke Ijima

DINO-VITS: Data-Efficient Zero-Shot TTS with Self-Supervised Speaker Verification Loss for Noise Robustness
Vikentii Pankov, Valeria Pronina, Alexander Kuzmin, Maksim Borisov, Nikita Usoltsev, Xingshan Zeng, Alexander Golubkov, Nikolai Ermolenko, Aleksandra Shirshova, Yulia Matveeva

Noise Robustness, Far-Field, and Multi-Talker ASR


LibriheavyMix: A 20,000-Hour Dataset for Single-Channel Reverberant Multi-Talker Speech Separation, ASR and Speaker Diarization
Zengrui Jin, Yifan Yang, Mohan Shi, Wei Kang, Xiaoyu Yang, Zengwei Yao, Fangjun Kuang, Liyong Guo, Lingwei Meng, Long Lin, Yong Xu, Shi-Xiong Zhang, Daniel Povey

A Joint Noise Disentanglement and Adversarial Training Framework for Robust Speaker Verification
Xujiang Xing, Mingxing Xu, Thomas Fang Zheng

Serialized Output Training by Learned Dominance
Ying Shi, Lantian Li, Shi Yin, Dong Wang, Jiqing Han

SOT Triggered Neural Clustering for Speaker Attributed ASR
Xianrui Zheng, Guangzhi Sun, Chao Zhang, Philip C. Woodland

Neural Blind Source Separation and Diarization for Distant Speech Recognition
Yoshiaki Bando, Tomohiko Nakamura, Shinji Watanabe

Unified Multi-Talker ASR with and without Target-speaker Enrollment
Ryo Masumura, Naoki Makishima, Tomohiro Tanaka, Mana Ihori, Naotaka Kawata, Shota Orihashi, Kazutoshi Shinoda, Taiga Yamane, Saki Mizuno, Keita Suzuki, Satoshi Suzuki, Nobukatsu Hojo, Takafumi Moriya, Atsushi Ando

Contextual Biasing and Adaptation


Keyword-Guided Adaptation of Automatic Speech Recognition
Aviv Shamsian, Aviv Navon, Neta Glazer, Gill Hetz, Joseph Keshet

Improving Speech Recognition with Prompt-based Contextualized ASR and LLM-based Re-predictor
Nguyen Manh Tien Anh, Thach Ho Sy

Incorporating Class-based Language Model for Named Entity Recognition in Factorized Neural Transducer
Peng Wang, Yifan Yang, Zheng Liang, Tian Tan, Shiliang Zhang, Xie Chen

Contextual Biasing with Confidence-based Homophone Detector for Mandarin End-to-End Speech Recognition
Chengxu Yang, Lin Zheng, Sanli Tian, Gaofeng Cheng, Sujie Xiao, Ta Li

Improving Neural Biasing for Contextual Speech Recognition by Early Context Injection and Text Perturbation
Ruizhe Huang, Mahsa Yarmohammadi, Sanjeev Khudanpur, Daniel Povey

Fast Context-Biasing for CTC and Transducer ASR models with CTC-based Word Spotter
Andrei Andrusenko, Aleksandr Laptev, Vladimir Bataev, Vitaly Lavrukhin, Boris Ginsburg

Prompt Tuning for Speech Recognition on Unknown Spoken Name Entities
Xizi Wei, Stephen McGregor

Improved Factorized Neural Transducer Model For Text-only Domain Adaptation
Junzhe Liu, Jianwei Yu, Xie Chen

Modality Translation Learning for Joint Speech-Text Model
Pin-Yen Liu, Jen-Tzung Chien

SAML: Speaker Adaptive Mixture of LoRA Experts for End-to-End ASR
Qiuming Zhao, Guangzhi Sun, Chao Zhang, Mingxing Xu, Thomas Fang Zheng

Factor-Conditioned Speaking-Style Captioning
Atsushi Ando, Takafumi Moriya, Shota Horiguchi, Ryo Masumura

Dual-Pipeline with Low-Rank Adaptation for New Language Integration in Multilingual ASR
Yerbolat Khassanov, Zhipeng Chen, Tianfeng Chen, Tze Yuang Chong, Wei Li, Jun Zhang, Lu Lu, Yuxuan Wang

Speculative Speech Recognition by Audio-Prefixed Low-Rank Adaptation of Language Models
Bolaji Yusuf, Murali Karthick Baskar, Andrew Rosenberg, Bhuvana Ramabhadran

Domain-Aware Data Selection for Speech Classification via Meta-Reweighting
Junghun Kim, Ka Hyun Park, Hoyoung Yoon, U Kang

Spoken Language Understanding


Finding Task-specific Subnetworks in Multi-task Spoken Language Understanding Model
Hayato Futami, Siddhant Arora, Yosuke Kashiwagi, Emiru Tsunoo, Shinji Watanabe

Out-of-distribution generalisation in spoken language understanding
Dejan Porjazovski, Anssi Moisio, Mikko Kurimo

A dual task learning approach to fine-tune a multilingual semantic speech encoder for Spoken Language Understanding
Gaëlle Laperrière, Sahar Ghannay, Bassam Jabaian, Yannick Estève

Speech-MASSIVE: A Multilingual Speech Dataset for SLU and Beyond
Beomseok Lee, Ioan Calapodescu, Marco Gaido, Matteo Negri, Laurent Besacier

Using Large Language Model for End-to-End Chinese ASR and NER
Yuang Li, Jiawei Yu, Min Zhang, Mengxin Ren, Yanqing Zhao, Xiaofeng Zhao, Shimin Tao, Jinsong Su, Hao Yang

A Contrastive Learning Approach to Mitigate Bias in Speech Models
Alkis Koudounas, Flavio Giobergia, Eliana Pastor, Elena Baralis

Spoken Machine Translation 1


Investigating Decoder-only Large Language Models for Speech-to-text Translation
Chao-Wei Huang, Hui Lu, Hongyu Gong, Hirofumi Inaguma, Ilia Kulikov, Ruslan Mavlyutov, Sravya Popuri

Diffusion Synthesizer for Efficient Multilingual Speech to Speech Translation
Nameer Hirschkind, Xiao Yu, Mahesh Kumar Nandwana, Joseph Liu, Eloi DuBois, Dao Le, Nicolas Thiebaut, Colin Sinclair, Kyle Spence, Charles Shang, Zoe Abrams, Morgan McGuire

Sign Value Constraint Decomposition for Efficient 1-Bit Quantization of Speech Translation Tasks
Nan Chen, Yonghe Wang, Feilong Bao

Lightweight Audio Segmentation for Long-form Speech Translation
Jaesong Lee, Soyoon Kim, Hanbyul Kim, Joon Son Chung

Contrastive Feedback Mechanism for Simultaneous Speech Translation
Haotian Tan, Sakriani Sakti

Towards Speech-to-Pictograms Translation
Cécile Macaire, Chloé Dion, Didier Schwab, Benjamin Lecouteux, Emmanuelle Esperança-Rodier

Hearing Disorders


Automatic Assessment of Speech Production Skills for Children with Cochlear Implants Using Wav2Vec2.0 Acoustic Embeddings
Seonwoo Lee, Sunhee Kim, Minhwa Chung

SyncVSR: Data-Efficient Visual Speech Recognition with End-to-End Crossmodal Audio Token Synchronization
Young Jin Ahn, Jungwoo Park, Sangha Park, Jonghyun Choi, Kee-Eung Kim

Evaluating a 3-factor listener model for prediction of speech intelligibility to hearing-impaired listeners
Mark Huckvale, Gaston Hilkhuysen

Production of fricative consonants in French-speaking children with cochlear implants and typical hearing: acoustic and phonological analyses
Sophie Fagniart, Brigitte Charlier, Véronique Delvaux, Bernard Harmegnies, Anne Huberlant, Myriam Piccaluga, Kathy Huet

Signal processing algorithm effective for sound quality of hearing loss simulators
Toshio Irino, Shintaro Doan, Minami Ishikawa

Auditory Spatial Attention Detection Based on Feature Disentanglement and Brain Connectivity-Informed Graph Neural Networks
Yixiang Niu, Ning Chen, Hongqing Zhu, Zhiying Zhu, Guangqiang Li, Yibo Chen

Automatic Detection of Hearing Loss from Children’s Speech using wav2vec 2.0 Features
Jessica Monaghan, Arun Sebastian, Nicky Chong-White, Vicky Zhang, Vijayalakshmi Easwar, Padraig Kitterick

Speech Disorders 2


Whister: Using Whisper’s representations for Stuttering detection
Vrushank Changawala, Frank Rudzicz

Improving Speech-Based Dysarthria Detection using Multi-task Learning with Gradient Projection
Yan Xiong, Visar Berisha, Julie Liss, Chaitali Chakrabarti

Cascaded Transfer Learning Strategy for Cross-Domain Alzheimer’s Disease Recognition through Spontaneous Speech
Guanlin Chen, Yun Jin

A Cross-Attention Layer coupled with Multimodal Fusion Methods for Recognizing Depression from Spontaneous Speech
Loukas Ilias, Dimitris Askounis

Segmental and Suprasegmental Speech Foundation Models for Classifying Cognitive Risk Factors: Evaluating Out-of-the-Box Performance
Si-Ioi Ng, Lingfeng Xu, Kimberly D. Mueller, Julie Liss, Visar Berisha

Multimodal Continuous Fingerspelling Recognition via Visual Alignment Learning
Katerina Papadimitriou, Gerasimos Potamianos

Contrastive Learning Approach for Assessment of Phonological Precision in Patients with Tongue Cancer Using MRI Data
Tomas Arias-Vergara, Paula Andrea Pérez-Toro, Xiaofeng Liu, Fangxu Xing, Maureen Stone, Jiachen Zhuo, Jerry L. Prince, Maria Schuster, Elmar Noeth, Jonghye Woo, Andreas Maier

DysArinVox: DYSphonia & DYSarthria mandARIN speech corpus
Haojie Zhang, Tao Zhang, Ganjun Liu, Dehui Fu, Xiaohui Hou, Ying Lv

YOLO-Stutter: End-to-end Region-Wise Speech Dysfluency Detection
Xuanru Zhou, Anshul Kashyap, Steve Li, Ayati Sharma, Brittany Morin, David Baquirin, Jet Vonk, Zoe Ezzes, Zachary Miller, Maria Tempini, Jiachen Lian, Gopala Anumanchipalli

Automatic Longitudinal Investigation of Multiple Sclerosis Subjects
Gábor Gosztolya, Veronika Svindt, Judit Bóna, Ildikó Hoffmann

TAUKADIAL Challenge: Speech-Based Cognitive Assessment in Chinese and English (Special Session)


Connected Speech-Based Cognitive Assessment in Chinese and English
Saturnino Luz, Sofia De La Fuente Garcia, Fasih Haider, Davida Fromm, Brian MacWhinney, Alyssa Lanzi, Ya-Ning Chang, Chia-Ju Chou, Yi-Chien Liu

Cognitive Insights Across Languages: Enhancing Multimodal Interview Analysis
David Ortiz-Perez, Jose Garcia-Rodriguez, David TomĂĄs

Combining Acoustic Feature Sets for Detecting Mild Cognitive Impairment in the Interspeech’24 TAUKADIAL Challenge
Gábor Gosztolya, László Tóth

Pre-trained Feature Fusion and Matching for Mild Cognitive Impairment Detection
Junwen Duan, Fangyuan Wei, Hong-Dong Li, Jin Liu

The Interspeech 2024 TAUKADIAL Challenge: Multilingual Mild Cognitive Impairment Detection with Multimodal Approach
Benjamin Barrera-Altuna, Daeun Lee, Zaima Zarnaz, Jinyoung Han, Seungbae Kim

Leveraging Universal Speech Representations for Detecting and Assessing the Severity of Mild Cognitive Impairment Across Languages
Anna Favaro, Tianyu Cao, Najim Dehak, Laureano Moro-Velazquez

Translingual Language Markers for Cognitive Assessment from Spontaneous Speech
Bao Hoang, Yijiang Pang, Hiroko Dodge, Jiayu Zhou

Multilingual Speech and Language Analysis for the Assessment of Mild Cognitive Impairment: Outcomes from the Taukadial Challenge
Paula Andrea Pérez-Toro, Tomas Arias-Vergara, Philipp Klumpp, Tobias Weise, Maria Schuster, Elmar Noeth, Juan Rafael Orozco-Arroyave, Andreas Maier

Show and Tell 1


Production of phrases by mechanical models of the human vocal tract
Takayuki Arai, Ryohei Suzuki, Chandler Earp, Shinya Tsuji, Keiko Ochi

Faster Vocoder: a multi threading approach to achieve low latency during TTS Inference
Vishal Gourav, Ankit Tyagi, Phanindra Mankale

A powerful and modern AAC composition tool for impaired speakers
Aanchan Mohan, Monideep Chakraborti, Katelyn Eng, Nailia Kushaeva, Mirjana Prpa, Jordan Lewis, Tianyi Zhang, Vince Geisler, Carol Geisler

VoxFlow AI: wearable voice converter for atypical speech
Grzegorz P. Mika, Konrad Zieliński, Paweł Cyrta, Marek Grzelec

Stress transfer in speech-to-speech machine translation
Sai Akarsh, Vamshiraghusimha Narasinga, Anil Kumar Vuppala

Mobile PresenTra: NICT fast neural text-to-speech system on smartphones with incremental inference of MS-FC-HiFi-GAN for low-latency synthesis
Takuma Okamoto, Yamato Ohtani, Hisashi Kawai

Multi-speaker and multi-dialectal Catalan TTS models for video gaming
Alex Peiró-Lilja, José Giraldo, Martí Llopart-Font, Carme Armentano-Oller, Baybars Külebi, Mireia Farrús

ConnecTone: a modular AAC system prototype with contextual generative text prediction and style-adaptive conversational TTS
Juliana Francis, Éva Székely, Joakim Gustafson

Reliable dialogue system for facilitating student-counselor communication
Mahdin Rohmatillah, Bryan Gautama Ngo, Willianto Sulaiman, Po-Chuan Chen, Jen-Tzung Chien

CreakVC: a voice conversion tool for modulating creaky voice
Harm Lameris, Joakim Gustafson, Éva Székely

EZTalking: English assessment platform for teachers and students
Yu-Sheng Tsao, Yung-Chang Hsu, Jiun-Ting Li, Siang-Hong Weng, Tien-Hong Lo, Berlin Chen

Phonetics and Phonology of Second Language Acquisition


Mmm whatcha say? Uncovering distal and proximal context effects in first and second-language word perception using psychophysical reverse correlation
Paige Tuttösí, H. Henny Yeung, Yue Wang, Fenqi Wang, Guillaume Denis, Jean-Julien Aucouturier, Angelica Lim

Automatic Speech Recognition with parallel L1 and L2 acoustic phone models to evaluate /l/ allophony in L2 English speech production
Anisia Popescu, Lori Lamel, Ioana Vasilescu, Laurence Devillers

Analysis of articulatory setting for L1 and L2 English speakers using MRI data
Kevin Huang, Jack Goldberg, Louis Goldstein, Shrikanth Narayanan

Bilingual Rhotic Production Patterns: A Generational Comparison of Spanish-English Bilingual Speakers in Canada
Ioana Colgiu, Laura Spinu, Rajiv Rao, Yasaman Rafat

Exploring Impact of Pausing and Lexical Stress Patterns on L2 English Comprehensibility in Real Time
Sylvain Coulange, Tsuneo Kato, Solange Rossato, Monica Masperi

Mandarin T3 Production by Chinese and Japanese Native Speakers
Qi Wu

Corpora-based Approaches in Automatic Emotion Recognition


Reinforcement Learning based Data Augmentation for Noise Robust Speech Emotion Recognition
Sumit Ranjan, Rupayan Chakraborty, Sunil Kumar Kopparapu

Unsupervised Domain Adaptation for Speech Emotion Recognition using K-Nearest Neighbors Voice Conversion
Pravin Mote, Berrak Sisman, Carlos Busso

Confidence-aware Hypothesis Transfer Networks for Source-Free Cross-Corpus Speech Emotion Recognition
Jincen Wang, Yan Zhao, Cheng Lu, Hailun Lian, Hongli Chang, Yuan Zong, Wenming Zheng

An Effective Local Prototypical Mapping Network for Speech Emotion Recognition
Yuxuan Xi, Yan Song, Lirong Dai, Haoyu Song, Ian McLoughlin

Speech Emotion Recognition with Multi-level Acoustic and Semantic Information Extraction and Interaction
Yuan Gao, Hao Shi, Chenhui Chu, Tatsuya Kawahara

Analysis of Speakers States and Traits


How rhythm metrics are linked to produced and perceived speaker charisma
Oliver Niebuhr, Nafiseh Taghva

A Functional Trade-off between Prosodic and Semantic Cues in Conveying Sarcasm
Zhu Li, Xiyuan Gao, Yuqing Zhang, Shekhar Nayak, Matt Coler

Multimodal Belief Prediction
John Murzaku, Adil Soubki, Owen Rambow

Detecting Empathy in Speech
Run Chen, Haozhe Chen, Anushka Kulkarni, Eleanor Lin, Linda Pang, Divya Tadimeti, Jun Shin, Julia Hirschberg

Learning Representation of Therapist Empathy in Counseling Conversation Using Siamese Hierarchical Attention Network
Dehua Tao, Tan Lee, Harold Chui, Sarah Luk

Modelling Lexical Characteristics of the Healthy Aging Population: A Corpus-Based Study
Han Kunmei

Exploring Gender-Specific Speech Patterns in Automatic Suicide Risk Assessment
Maurice Gerczuk, Shahin Amiriparian, Justina Lutz, Wolfgang Strube, Irina Papazova, Alkomiet Hasan, Björn W. Schuller

Spoofing and Deepfake Detection


Source Tracing of Audio Deepfake Systems
Nicholas Klein, Tianxiang Chen, Hemlata Tak, Ricardo Casal, Elie Khoury

How Do Neural Spoofing Countermeasures Detect Partially Spoofed Audio?
Tianchi Liu, Lin Zhang, Rohan Kumar Das, Yi Ma, Ruijie Tao, Haizhou Li

Revisiting and Improving Scoring Fusion for Spoofing-aware Speaker Verification Using Compositional Data Analysis
Xin Wang, Tomi Kinnunen, Kong Aik Lee, Paul-Gauthier Noé, Junichi Yamagishi

SecureSpectra: Safeguarding Digital Identity from Deep Fake Threats via Intelligent Signatures
Oguzhan Baser, Kaan Kale, Sandeep P. Chinchali

Interpretable Temporal Class Activation Representation for Audio Spoofing Detection
Menglu Li, Xiao-Ping Zhang

DGPN: A Dual Graph Prototypical Network for Few-Shot Speech Spoofing Algorithm Recognition
Zirui Ge, Xinzhou Xu, Haiyan Guo, Tingting Wang, Zhen Yang, Björn W. Schuller

Audio Captioning, Tagging, and Audio-Text Retrieval


PFCA-Net: Pyramid Feature Fusion and Cross Content Attention Network for Automated Audio Captioning
Jianyuan Sun, Wenwu Wang, Mark D. Plumbley

Enhancing Automated Audio Captioning via Large Language Models with Optimized Audio Encoding
Jizhong Liu, Gang Li, Junbo Zhang, Heinrich Dinkel, Yongqing Wang, Zhiyong Yan, Yujun Wang, Bin Wang

Audio-text Retrieval with Transformer-based Hierarchical Alignment and Disentangled Cross-modal Representation
Yifei Xin, Zhihong Zhu, Xuxin Cheng, Xusheng Yang, Yuexian Zou

Streaming Audio Transformers for Online Audio Tagging
Heinrich Dinkel, Zhiyong Yan, Yongqing Wang, Junbo Zhang, Yujun Wang, Bin Wang

Efficient CNNs with Quaternion Transformations and Pruning for Audio Tagging
Aryan Chaudhary, Arshdeep Singh, Vinayak Abrol, Mark D. Plumbley

ParaCLAP – Towards a general language-audio model for computational paralinguistic tasks
Xin Jing, Andreas Triantafyllopoulos, Björn Schuller

Efficient Audio Captioning with Encoder-Level Knowledge Distillation
Xuenan Xu, Haohe Liu, Mengyue Wu, Wenwu Wang, Mark D. Plumbley

Generative Speech Enhancement


Universal Score-based Speech Enhancement with High Content Preservation
Robin Scheibler, Yusuke Fujita, Yuma Shirahata, Tatsuya Komatsu

Genhancer: High-Fidelity Speech Enhancement via Generative Modeling on Discrete Codec Tokens
Haici Yang, Jiaqi Su, Minje Kim, Zeyu Jin

Schrödinger Bridge for Generative Speech Enhancement
Ante Jukić, Roman Korostik, Jagadeesh Balam, Boris Ginsburg

Thunder: Unified Regression-Diffusion Speech Enhancement with a Single Reverse Step using Brownian Bridge
Thanapat Trachu, Chawan Piansaddhayanon, Ekapol Chuangsuwanich

Pre-training Feature Guided Diffusion Model for Speech Enhancement
Yiyuan Yang, Niki Trigoni, Andrew Markham

Guided conditioning with predictive network on score-based diffusion model for speech enhancement
Dail Kim, Da-Hee Yang, Donghyun Kim, Joon-Hyuk Chang, Jeonghwan Choi, Moa Lee, Jaemo Yang, Han-gil Moon

Speech Synthesis: Evaluation


SVSNet+: Enhancing Speaker Voice Similarity Assessment Models with Representations from Speech Foundation Models
Chun Yin, Tai-Shih Chi, Yu Tsao, Hsin-Min Wang

Enhancing Out-of-Vocabulary Performance of Indian TTS Systems for Practical Applications through Low-Effort Data Strategies
Srija Anand, Praveen Srinivasa Varadhan, Ashwin Sankar, Giri Raju, Mitesh M. Khapra

Assessing the impact of contextual framing on subjective TTS quality
Jens Edlund, Christina Tånnander, Sébastien Le Maguer, Petra Wagner

What do people hear? Listeners’ Perception of Conversational Speech
Adaeze Adigwe, Sarenne Wallbridge, Simon King

Uncertainty-Aware Mean Opinion Score Prediction
Hui Wang, Shiwan Zhao, Jiaming Zhou, Xiguang Zheng, Haoqin Sun, Xuechen Wang, Yong Qin

Lifelong Learning MOS Prediction for Synthetic Speech Quality Evaluation
Félix Saget, Meysam Shamsi, Marie Tahon

Multilingual ASR


Continual Learning Optimizations for Auto-regressive Decoder of Multilingual ASR systems
Chin Yuen Kwok, Jia Qi Yip, Eng Siong Chng

ML-SUPERB 2.0: Benchmarking Multilingual Speech Models Across Modeling Constraints, Languages, and Datasets
Jiatong Shi, Shih-Heng Wang, William Chen, Martijn Bartelds, Vanya Bannihatti Kumar, Jinchuan Tian, Xuankai Chang, Dan Jurafsky, Karen Livescu, Hung-yi Lee, Shinji Watanabe

Weighted Cross-entropy for Low-Resource Languages in Multilingual Speech Recognition
Andrés Piñeiro-Martín, Carmen García-Mateo, Laura Docio-Fernandez, María del Carmen López-Pérez, Georg Rehm

M2ASR: Multilingual Multi-task Automatic Speech Recognition via Multi-objective Optimization
A F M Saif, Lisha Chen, Xiaodong Cui, Songtao Lu, Brian Kingsbury, Tianyi Chen

MSR-86K: An Evolving, Multilingual Corpus with 86,300 Hours of Transcribed Audio for Speech Recognition Research
Song Li, Yongbin You, Xuezhi Wang, Zhengkun Tian, Ke Ding, Guanglu Wan

Improving Multilingual ASR Robustness to Errors in Language Input
Brady Houston, Omid Sadjadi, Zejiang Hou, Srikanth Vishnubhotla, Kyu J. Han

General Topics in ASR


Improving Domain-Specific ASR with LLM-Generated Contextual Descriptions
Jiwon Suh, Injae Na, Woohwan Jung

A Multitask Training Approach to Enhance Whisper with Open-Vocabulary Keyword Spotting
Yuang Li, Min Zhang, Chang Su, Yinglu Li, Xiaosong Qiao, Mengxin Ren, Miaomiao Ma, Daimeng Wei, Shimin Tao, Hao Yang

CrisperWhisper: Accurate Timestamps on Verbatim Speech Transcriptions
Mario Zusag, Laurin Wagner, Bernhard Thallinger

On Disfluency and Non-lexical Sound Labeling for End-to-end Automatic Speech Recognition
Peter Mihajlik, Yan Meng, Mate S Kadar, Julian Linke, Barbara Schuppler, Katalin Mády

Inclusive ASR for Disfluent Speech: Cascaded Large-Scale Self-Supervised Learning with Targeted Fine-Tuning and Data Augmentation
Dena Mujtaba, Nihar R. Mahapatra, Megan Arney, J. Scott Yaruss, Caryn Herring, Jia Bin

DualPure: An Efficient Adversarial Purification Method for Speech Command Recognition
Hao Tan, Xiaochen Liu, Huan Zhang, Junjian Zhang, Yaguan Qian, Zhaoquan Gu

A Comparative Analysis of Bilingual and Trilingual Wav2Vec Models for Automatic Speech Recognition in Multilingual Oral History Archives
Jan Lehečka, Josef V. Psutka, Lubos Smidl, Pavel Ircing, Josef Psutka

A layer-wise analysis of Mandarin and English suprasegmentals in SSL speech models
Anton de la Fuente, Dan Jurafsky

Fine-Tuning Strategies for Dutch Dysarthric Speech Recognition: Evaluating the Impact of Healthy, Disease-Specific, and Speaker-Specific Data
Spyretta Leivaditi, Tatsunari Matsushima, Matt Coler, Shekhar Nayak, Vass Verkhodanova

Dysarthric Speech Recognition Using Curriculum Learning and Articulatory Feature Embedding
I-Ting Hsieh, Chung-Hsien Wu

Enhancing Dysarthric Speech Recognition for Unseen Speakers via Prototype-Based Adaptation
Shiyao Wang, Shiwan Zhao, Jiaming Zhou, Aobo Kong, Yong Qin

An efficient text augmentation approach for contextualized Mandarin speech recognition
Naijun Zheng, Xucheng Wan, Kai Liu, Ziqing Du, Zhou Huan

Investigating ASR Error Correction with Large Language Model and Multilingual 1-best Hypotheses
Sheng Li, Chen Chen, Chin Yuen Kwok, Chenhui Chu, Eng Siong Chng, Hisashi Kawai

Efficiently Train ASR Models that Memorize Less and Perform Better with Per-core Clipping
Lun Wang, Om Thakkar, Zhong Meng, Nicole Rafidi, Rohit Prabhavalkar, Arun Narayanan

Spoken Language Understanding


AR-NLU: A Framework for Enhancing Natural Language Understanding Model Robustness against ASR Errors
Emmy Phung, Harsh Deshpande, Ahmad Emami, Kanishk Singh

Prompting Whisper for QA-driven Zero-shot End-to-end Spoken Language Understanding
Mohan Li, Simon Keizer, Rama Doddipatla

VN-SLU: A Vietnamese Spoken Language Understanding Dataset
Tuyen Tran, Khanh Le, Ngoc Dang Nguyen, Minh Vu, Huyen Ngo, Woomyoung Park, Thi Thu Trang Nguyen

Textless Dependency Parsing by Labeled Sequence Prediction
Shunsuke Kando, Yusuke Miyao, Jason Naradowsky, Shinnosuke Takamichi

Towards Speech Classification from Acoustic and Vocal Tract data in Real-time MRI
Yaoyao Yue, Michael Proctor, Luping Zhou, Rijul Gupta, Tharinda Piyadasa, Amelia Gully, Kirrie Ballard, Craig Jin

Efficient SQA from Long Audio Contexts: A Policy-driven Approach
Alexander Johnson, Peter Plantinga, Pheobe Sun, Swaroop Gadiyaram, Abenezer Girma, Ahmad Emami

Speech and Multimodal Resources


BESST Dataset: A Multimodal Resource for Speech-based Stress Detection and Analysis
Jan Pešán, Vojtěch Juřík, Martin Karafiát, Jan Černocký

HebDB: a Weakly Supervised Dataset for Hebrew Speech Processing
Arnon Turetzky, Or Tal, Yael Segal, Yehoshua Dissen, Ella Zeldes, Amit Roth, Eyal Cohen, Yosi Shrem, Bronya R. Chernyak, Olga Seleznova, Joseph Keshet, Yossi Adi

GLOBE: A High-quality English Corpus with Global Accents for Zero-shot Speaker Adaptive Text-to-Speech
Wenbin Wang, Yang Song, Sanjay Jha

STraDa: A Singer Traits Dataset
Yuexuan Kong, Viet-Anh Tran, Romain Hennequin

MaViLS, a Benchmark Dataset for Video-to-Slide Alignment, Assessing Baseline Accuracy with a Multimodal Alignment Algorithm Leveraging Speech, OCR, and Visual Features
Katharina Anderer, Andreas Reich, Matthias Wölfel

MultiTalk: Enhancing 3D Talking Head Generation Across Languages with Multilingual Video Dataset
Kim Sung-Bin, Lee Chae-Yeon, Gihun Son, Oh Hyun-Bin, Janghoon Ju, Suekyeong Nam, Tae-Hyun Oh

Towards measuring fairness in speech recognition: Fair-Speech dataset
Irina-Elena Veliche, Zhuangqun Huang, Vineeth Ayyat Kochaniyan, Fuchun Peng, Ozlem Kalinli, Michael L. Seltzer

Codecfake: An Initial Dataset for Detecting LLM-based Deepfake Audio
Yi Lu, Yuankun Xie, Ruibo Fu, Zhengqi Wen, Jianhua Tao, Zhiyong Wang, Xin Qi, Xuefei Liu, Yongwei Li, Yukun Liu, Xiaopeng Wang, Shuchen Shi

SER Evals: In-domain and Out-of-domain benchmarking for speech emotion recognition
Mohamed Osman, Daniel Z. Kaplan, Tamer Nadeem

Pathological Speech Analysis 1


The MARRYS helmet: A new device for researching and training “jaw dancing”
Vidar Freyr Gudmundsson, Keve Márton Gönczi, Malin Svensson Lundmark, Donna Erickson, Oliver Niebuhr

Exploiting Foundation Models and Speech Enhancement for Parkinson’s Disease Detection from Speech in Real-World Operative Conditions
Moreno La Quatra, Maria Francesca Turco, Torbjørn Svendsen, Giampiero Salvi, Juan Rafael Orozco-Arroyave, Sabato Marco Siniscalchi

Sustained Vowels for Pre- vs Post-Treatment COPD Classification
Andreas Triantafyllopoulos, Anton Batliner, Wolfgang Mayr, Markus Fendler, Florian Pokorny, Maurice Gerczuk, Shahin Amiriparian, Thomas Berghaus, Björn Schuller

Adversarial Robustness Analysis in Automatic Pathological Speech Detection Approaches
Mahdi Amiri, Ina Kodrasi

Automatic Children Speech Sound Disorder Detection with Age and Speaker Bias Mitigation
Gahye Kim, Yunjung Eom, Selina S. Sung, Seunghee Ha, Tae-Jin Yoon, Jungmin So

Speech and Language in Health: from Remote Monitoring to Medical Conversations - 1 (Special Session)


Reference-Free Estimation of the Quality of Clinical Notes Generated from Doctor-Patient Conversations
Mojtaba Kadkhodaie Elyaderani, John Glover, Thomas Schaaf

Developing an End-to-End Framework for Predicting the Social Communication Severity Scores of Children with Autism Spectrum Disorder
Jihyun Mun, Sunhee Kim, Minhwa Chung

Multimodal Fusion for Vocal Biomarkers Using Vector Cross-Attention
Vladimir Despotovic, Abir Elbéji, Petr V. Nazarov, Guy Fagherazzi

Revealing Confounding Biases: A Novel Benchmarking Approach for Aggregate-Level Performance Metrics in Health Assessments
Stefano Goria, Roseline Polle, Salvatore Fara, Nicholas Cummins

Developing Multi-Disorder Voice Protocols: A team science approach involving clinical expertise, bioethics, standards, and DEI.
Yael Bensoussan, Satrajit Ghosh, Anais Rameau, Micah Boyer, Ruth Bahr, Stephanie Watts, Frank Rudzicz, Don Bolser, Jordan Lerner-Ellis, Shaheen Awan, Maria Powell, Jean-Christophe Belisle-Pipon, Vardit Ravitsky, Alistair Johnson, Alexandros Sigaras, Olivier Elemento, David Dorr, Philip Payne

Self-Supervised Embeddings for Detecting Individual Symptoms of Depression
Sri Harsha Dumpala, Katerina Dikaios, Abraham Nunes, Frank Rudzicz, Rudolf Uher, Sageev Oore

Comparing ambulatory voice measures during daily life with brief laboratory assessments in speakers with and without vocal hyperfunction
Daryush D. Mehta, Jarrad H. Van Stan, Hamzeh Ghasemzadeh, Robert E. Hillman

Predicting Acute Pain Levels Implicitly from Vocal Features
Jennifer Williams, Eike Schneiders, Henry Card, Tina Seabrooke, Beatrice Pakenham-Walsh, Tayyaba Azim, Lucy Valls-Reed, Ganesh Vigneswaran, John Robert Bautista, Rohan Chandra, Arya Farahi

Towards Intelligent Speech Assistants in Operating Rooms: A Multimodal Model for Surgical Workflow Analysis
Kubilay Can Demir, Belén Lojo Rodríguez, Tobias Weise, Andreas Maier, Seung Hee Yang

A Multimodal Framework for the Assessment of the Schizophrenia Spectrum
Gowtham Premananth, Yashish M. Siriwardena, Philip Resnik, Sonia Bansal, Deanna L. Kelly, Carol Espy-Wilson

Speech and Brain


Exploring the Complementary Nature of Speech and Eye Movements for Profiling Neurological Disorders
Yuzhe Wang, Anna Favaro, Thomas Thebaud, Jesus Villalba, Najim Dehak, Laureano Moro-Velazquez

Refining Self-supervised Learnt Speech Representation using Brain Activations
HengYu Li, Kangdi Mei, Zhaoci Liu, Yang Ai, Liping Chen, Jie Zhang, Zhenhua Ling

Large Language Model-based FMRI Encoding of Language Functions for Subjects with Neurocognitive Disorder
Yuejiao Wang, Xianmin Gong, Lingwei Meng, Xixin Wu, Helen Meng

From Sound to Meaning in the Auditory Cortex: A Neuronal Representation and Classification Analysis
Kumar Neelabh, Vishnu Sreekumar

Towards an End-to-End Framework for Invasive Brain Signal Decoding with Large Language Models
Sheng Feng, Heyang Liu, Yu Wang, Yanfeng Wang

Toward Fully-End-to-End Listened Speech Decoding from EEG Signals
Jihwan Lee, Aditya Kommineni, Tiantian Feng, Kleanthis Avramidis, Xuan Shi, Sudarsana Reddy Kadiri, Shrikanth Narayanan

Innovative Methods in Phonetics and Phonology


The Use of Phone Categories and Cross-Language Modeling for Phone Alignment of Panãra
Emily P. Ahn, Eleanor Chodroff, Myriam Lapierre, Gina-Anne Levow

Deciphering Assamese Vowel Harmony with Featural InfoWaveGAN
Sneha Ray Barman, Shakuntala Mahanta, Neeraj Kumar Sharma

Phonological Feature Detection for US English using the Phonet Library
Harsha Veena Tadavarthy, Austin Jones, Margaret E. L. Renwick

K-means and hierarchical clustering of f0 contours
Constantijn Kaland, Jeremy Steffman, Jennifer Cole

Tradition or Innovation: A Comparison of Modern ASR Methods for Forced Alignment
Rotem Rousso, Eyal Cohen, Joseph Keshet, Eleanor Chodroff

Using wav2vec 2.0 for phonetic classification tasks: methodological aspects
Lila Kim, Cédric Gendrot

The sub-band cepstrum as a tool for locating local spectral regions of phonetic sensitivity: A first attempt with multi-speaker vowel data
Michael Lambropoulos, Frantz Clermont, Shunichi Ishihara

Speaker-Independent Acoustic-to-Articulatory Inversion through Multi-Channel Attention Discriminator
Woo-Jin Chung, Hong-Goo Kang

Speaker- and Text-Independent Estimation of Articulatory Movements and Phoneme Alignments from Speech
Tobias Weise, Philipp Klumpp, Kubilay Can Demir, Paula Andrea PĂ©rez-Toro, Maria Schuster, Elmar Noeth, Bjoern Heismann, Andreas Maier, Seung Hee Yang

Preprocessing for acoustic-to-articulatory inversion using real-time MRI movies of Japanese speech
Anna Oura, Hideaki Kikuchi, Tetsunori Kobayashi

Voice, Tones and F0


Impact of the tonal factor on diphthong realizations in Standard Mandarin with Generalized Additive Mixed Models
Chenyu Li, Jalal Al-Tamimi

A Study on the Information Mechanism of the 3rd Tone Sandhi Rule in Mandarin Disyllabic Words
Liu Xiaowang, Jinsong Zhang

Gender and age based f0-variation in the German Plapper Corpus
Melanie Weirich, Daniel Duran, Stefanie Jannedy

Voice quality in telephone speech: Comparing acoustic measures between VoIP telephone and high-quality recordings
Chenzi Xu, Jessica Wormald, Paul Foulkes, Philip Harrison, Vincent Hughes, Poppy Welch, Finnian Kelly, David van der Vloed

The Use of Modifiers and f0 in Remote Referential Communication with Human and Computer Partners
Iona Gessinger, Bistra Andreeva, Benjamin R. Cowan

Emotion Recognition: Resources and Benchmarks


EmoBox: Multilingual Multi-corpus Speech Emotion Recognition Toolkit and Benchmark
Ziyang Ma, Mingjie Chen, Hezhao Zhang, Zhisheng Zheng, Wenxi Chen, Xiquan Li, Jiaxin Ye, Xie Chen, Thomas Hain

INTERSPEECH 2009 Emotion Challenge Revisited: Benchmarking 15 Years of Progress in Speech Emotion Recognition
Andreas Triantafyllopoulos, Anton Batliner, Simon Rampp, Manuel Milling, Björn Schuller

What Does it Take to Generalize SER Model Across Datasets? A Comprehensive Benchmark
Adham Ibrahim, Shady Shehata, Ajinkya Kulkarni, Mukhtar Mohamed, Muhammad Abdul-Mageed

WHiSER: White House Tapes Speech Emotion Recognition Corpus
Abinay Reddy Naini, Lucas Goncalves, Mary A. Kohler, Donita Robinson, Elizabeth Richerson, Carlos Busso

Evaluating Transformer-Enhanced Deep Reinforcement Learning for Speech Emotion Recognition
Siddique Latif, Raja Jurdak, Björn W. Schuller

Boosting Cross-Corpus Speech Emotion Recognition using CycleGAN with Contrastive Learning
Jincen Wang, Yan Zhao, Cheng Lu, Chuangao Tang, Sunan Li, Yuan Zong, Wenming Zheng

Speaker and Language Identification and Diarization


Multi-latency look-ahead for streaming speaker segmentation
Bilal Rahou, Hervé Bredin

Once more Diarization: Improving meeting transcription systems through segment-level speaker reassignment
Christoph Boeddeker, Tobias Cord-Landwehr, Reinhold Haeb-Umbach

ASoBO: Attentive Beamformer Selection for Distant Speaker Diarization in Meetings
Théo Mariotte, Anthony Larcher, Silvio Montrésor, Jean-Hugh Thomas

Hybrid-Diarization System with Overlap Post-Processing for the DISPLACE 2024 Challenge
Gabriel Pîrlogeanu, Octavian Pascu, Alexandru-Lucian Georgescu, Horia Cucu

The Second DISPLACE Challenge: DIarization of SPeaker and LAnguage in Conversational Environments
Shareef Babu Kalluri, Prachi Singh, Pratik Roy Chowdhuri, Apoorva Kulkarni, Shikha Baghel, Pradyoth Hegde, Swapnil Sontakke, Deepak K T, S.R. Mahadeva Prasanna, Deepu Vijayasenan, Sriram Ganapathy

TalTech-IRIT-LIS Speaker and Language Diarization Systems for DISPLACE 2024
Joonas Kalda, Tanel Alumae, Martin Lebourdais, Hervé Bredin, Séverin Baroudi, Ricard Marxer

Exploring Energy-Based Models for Out-of-Distribution Detection in Dialect Identification
Yaqian Hao, Chenguang Hu, Yingying Gao, Shilei Zhang, Junlan Feng

Exploring Spoken Language Identification Strategies for Automatic Transcription of Multilingual Broadcast and Institutional Speech
Martina Valente, Fabio Brugnara, Giovanni Morrone, Enrico Zovato, Leonardo Badino

AG-LSEC: Audio Grounded Lexical Speaker Error Correction
Rohit Paturi, Xiang Li, Sundararajan Srinivasan

Speaker Change Detection with Weighted-sum Knowledge Distillation based on Self-supervised Pre-trained Models
Hang Su, Yuxiang Kong, Lichun Fan, Peng Gao, Yujun Wang, Zhiyong Wu

SOMSRED: Sequential Output Modeling for Joint Multi-talker Overlapped Speech Recognition and Speaker Diarization
Naoki Makishima, Naotaka Kawata, Mana Ihori, Tomohiro Tanaka, Shota Orihashi, Atsushi Ando, Ryo Masumura

Song Data Cleansing for End-to-End Neural Singer Diarization Using Neural Analysis and Synthesis Framework
Hokuto Munakata, Ryo Terashima, Yusuke Fujita

Audio-Text Retrieval


DiffATR: Diffusion-based Generative Modeling for Audio-Text Retrieval
Yifei Xin, Xuxin Cheng, Zhihong Zhu, Xusheng Yang, Yuexian Zou

Bridging Language Gaps in Audio-Text Retrieval
Zhiyong Yan, Heinrich Dinkel, Yongqing Wang, Jizhong Liu, Junbo Zhang, Yujun Wang, Bin Wang

Domain Adaptation for Contrastive Audio-Language Models
Soham Deshmukh, Rita Singh, Bhiksha Raj

tinyCLAP: Distilling Contrastive Language-Audio Pretrained Models
Francesco Paissan, Elisabetta Farella

BTS: Bridging Text and Sound Modalities for Metadata-Aided Respiratory Sound Classification
June-Woo Kim, Miika Toikkanen, Yera Choi, Seoung-Eun Moon, Ho-Young Jung

Enhanced Feature Learning with Normalized Knowledge Distillation for Audio Tagging
Yuwu Tang, Ziang Ma, Haitao Zhang

Speech Enhancement


RaD-Net 2: A causal two-stage repairing and denoising speech enhancement network with knowledge distillation and complex axial self-attention
Mingshuai Liu, Zhuangqi Chen, Xiaopeng Yan, Yuanjun Lv, Xianjun Xia, Chuanzeng Huang, Yijian Xiao, Lei Xie

DNN-based monaural speech enhancement using alternate analysis windows for phase and magnitude modification
Xi Liu, John H.L. Hansen

Improved Remixing Process for Domain Adaptation-Based Speech Enhancement by Mitigating Data Imbalance in Signal-to-Noise Ratio
Li Li, Shogo Seki

Neural Network Augmented Kalman Filter for Robust Acoustic Howling Suppression
Yixuan Zhang, Hao Zhang, Meng Yu, Dong Yu

Improving Speech Enhancement by Integrating Inter-Channel and Band Features with Dual-branch Conformer
Jizhen Li, Xinmeng Xu, Weiping Tu, Yuhong Yang, Rong Zhu

An Exploration of Length Generalization in Transformer-Based Speech Enhancement
Qiquan Zhang, Hongxu Zhu, Xinyuan Qian, Eliathamby Ambikairajah, Haizhou Li

Reducing Speech Distortion and Artifacts for Speech Enhancement by Loss Function
Haixin Guan, Wei Dai, Guangyong Wang, Xiaobin Tan, Peng Li, Jiaen Liang

Are Recent Deep Learning-Based Speech Enhancement Methods Ready to Confront Real-World Noisy Environments?
Candy Olivia Mawalim, Shogo Okada, Masashi Unoki

Beyond Performance Plateaus: A Comprehensive Study on Scalability in Speech Enhancement
Wangyou Zhang, Kohei Saijo, Jee-weon Jung, Chenda Li, Shinji Watanabe, Yanmin Qian

Speech Coding


TD-PLC: A Semantic-Aware Speech Encoding for Improved Packet Loss Concealment
Jinghong Zhang, Zugang Zhao, Yonghui Liu, Jianbing Liu, Zhiqiang He, Kai Niu

BS-PLCNet 2: Two-stage Band-split Packet Loss Concealment Network with Intra-model Knowledge Distillation
Zihan Zhang, Xianjun Xia, Chuanzeng Huang, Yijian Xiao, Lei Xie

On Improving Error Resilience of Neural End-to-End Speech Coders
Kishan Gupta, Nicola Pia, Srikanth Korse, Andreas Brendel, Guillaume Fuchs, Markus Multrus

Speech quality evaluation of neural audio codecs
Thomas Muller, Stephane Ragot, Laetitia Gros, Pierrick Philippe, Pascal Scalart

A Low-Bitrate Neural Audio Codec Framework with Bandwidth Reduction and Recovery for High-Sampling-Rate Waveforms
Yang Ai, Ye-Xin Lu, Xiao-Hang Jiang, Zheng-Yan Sheng, Rui-Chen Zheng, Zhen-Hua Ling

CodecFake: Enhancing Anti-Spoofing Models Against Deepfake Audios from Codec-Based Speech Synthesis Systems
Haibin Wu, Yuan Tseng, Hung-yi Lee

Speech Synthesis: Expressivity and Emotion


GTR-Voice: Articulatory Phonetics Informed Controllable Expressive Speech Synthesis
Zehua Kcriss Li, Meiying Melissa Chen, Yi Zhong, Pinxin Liu, Zhiyao Duan

TSP-TTS: Text-based Style Predictor with Residual Vector Quantization for Expressive Text-to-Speech
Donghyun Seong, Hoyoung Lee, Joon-Hyuk Chang

Spontaneous Style Text-to-Speech Synthesis with Controllable Spontaneous Behaviors Based on Language Models
Weiqin Li, Peiji Yang, Yicheng Zhong, Yixuan Zhou, Zhisheng Wang, Zhiyong Wu, Xixin Wu, Helen Meng

Text-aware and Context-aware Expressive Audiobook Speech Synthesis
Dake Guo, Xinfa Zhu, Liumeng Xue, Yongmao Zhang, Wenjie Tian, Lei Xie

Controlling Emotion in Text-to-Speech with Natural Language Prompts
Thomas Bott, Florian Lux, Ngoc Thang Vu

Retrieval Augmented Generation in Prompt-based Text-to-Speech Synthesis with Context-Aware Contrastive Language-Audio Pretraining
Jinlong Xue, Yayue Deng, Yingming Gao, Ya Li

Emotion Arithmetic: Emotional Speech Synthesis via Weight Space Interpolation
Pavan Kalyan, Preeti Rao, Preethi Jyothi, Pushpak Bhattacharyya

EmoSphere-TTS: Emotional Style and Intensity Modeling via Spherical Emotion Vector for Controllable Emotional Text-to-Speech
Deok-Hyeon Cho, Hyung-Seok Oh, Seung-Bin Kim, Sang-Hoon Lee, Seong-Whan Lee

Expressive paragraph text-to-speech synthesis with multi-step variational autoencoder
Xuyuan Li, Zengqiang Shang, Peiyang Shi, Hua Hua, Ta Li, Pengyuan Zhang

Differentiable Time-Varying Linear Prediction in the Context of End-to-End Analysis-by-Synthesis
Chin-Yun Yu, György Fazekas

Speech Synthesis: Tools and Data


SRC4VC: Smartphone-Recorded Corpus for Voice Conversion Benchmark
Yuki Saito, Takuto Igarashi, Kentaro Seki, Shinnosuke Takamichi, Ryuichi Yamamoto, Kentaro Tachibana, Hiroshi Saruwatari

Rasa: Building Expressive Speech Synthesis Systems for Indian Languages in Low-resource Settings
Praveen Srinivasa Varadhan, Ashwin Sankar, Giri Raju, Mitesh M Khapra

FLEURS-R: A Restored Multilingual Speech Corpus for Generation Tasks
Min Ma, Yuma Koizumi, Shigeki Karita, Heiga Zen, Jason Riesa, Haruko Ishikawa, Michiel Bacchiani

WenetSpeech4TTS: A 12,800-hour Mandarin TTS Corpus for Large Speech Generation Model Benchmark
Linhan Ma, Dake Guo, Kun Song, Yuepeng Jiang, Shuai Wang, Liumeng Xue, Weiming Xu, Huan Zhao, Binbin Zhang, Lei Xie

MSceneSpeech: A Multi-Scene Speech Dataset For Expressive Speech Synthesis
Qian Yang, Jialong Zuo, Zhe Su, Ziyue Jiang, Mingze Li, Zhou Zhao, Feiyang Chen, Zhefeng Wang, Baoxing Huai

LibriTTS-P: A Corpus with Speaking Style and Speaker Identity Prompts for Text-to-Speech and Style Captioning
Masaya Kawamura, Ryuichi Yamamoto, Yuma Shirahata, Takuya Hasumi, Kentaro Tachibana

1000 African Voices: Advancing inclusive multi-speaker multi-accent speech synthesis
Sewade Ogun, Abraham T. Owodunni, Tobi Olatunji, Eniola Alese, Babatunde Oladimeji, Tejumade Afonja, Kayode Olaleye, Naome A. Etori, Tosin Adewumi

SaSLaW: Dialogue Speech Corpus with Audio-visual Egocentric Information Toward Environment-adaptive Dialogue Speech Synthesis
Osamu Take, Shinnosuke Takamichi, Kentaro Seki, Yoshiaki Bando, Hiroshi Saruwatari

Speech Synthesis: Singing Voice Synthesis


MakeSinger: A Semi-Supervised Training Method for Data-Efficient Singing Voice Synthesis via Classifier-free Diffusion Guidance
Semin Kim, Myeonghun Jeong, Hyeonseung Lee, Minchan Kim, Byoung Jin Choi, Nam Soo Kim

Challenge of Singing Voice Synthesis Using Only Text-To-Speech Corpus With FIRNet Source-Filter Neural Vocoder
Takuma Okamoto, Yamato Ohtani, Sota Shimizu, Tomoki Toda, Hisashi Kawai

Period Singer: Integrating Periodic and Aperiodic Variational Autoencoders for Natural-Sounding End-to-End Singing Voice Synthesis
Taewoo Kim, Choonsang Cho, Young Han Lee

Singing Voice Data Scaling-up: An Introduction to ACE-Opencpop and ACE-KiSing
Jiatong Shi, Yueqian Lin, Xinyi Bai, Keyi Zhang, Yuning Wu, Yuxun Tang, Yifeng Yu, Qin Jin, Shinji Watanabe

X-Singer: Code-Mixed Singing Voice Synthesis via Cross-Lingual Learning
Ji-Sang Hwang, Hyeongrae Noh, Yoonseok Hong, Insoo Oh

An End-to-End Approach for Chord-Conditioned Song Generation
Shuochen Gao, Shun Lei, Fan Zhuo, Hangyu Liu, Feng Liu, Boshi Tang, Qiaochu Huang, Shiyin Kang, Zhiyong Wu

LLM in ASR


Speech Prefix-Tuning with RNNT Loss for Improving LLM Predictions
Murali Karthick Baskar, Andrew Rosenberg, Bhuvana Ramabhadran, Neeraj Gaur, Zhong Meng

Speech ReaLLM – Real-time Speech Recognition with Multimodal Language Models by Teaching the Flow of Time
Frank Seide, Yangyang Shi, Morrie Doulaty, Yashesh Gaur, Junteng Jia, Chunyang Wu

A Transcription Prompt-based Efficient Audio Large Language Model for Robust Speech Recognition
Yangze Li, Xiong Wang, Songjun Cao, Yike Zhang, Long Ma, Lei Xie

Pinyin Regularization in Error Correction for Chinese Speech Recognition with Large Language Models
Zhiyuan Tang, Dong Wang, Shen Huang, Shidong Shang

Vision and Speech


AVCap: Leveraging Audio-Visual Features as Text Tokens for Captioning
Jongsuk Kim, Jiwon Shin, Junmo Kim

LipGER: Visually-Conditioned Generative Error Correction for Robust Automatic Speech Recognition
Sreyan Ghosh, Sonal Kumar, Ashish Seth, Purva Chiniya, Utkarsh Tyagi, Ramani Duraiswami, Dinesh Manocha

Joint Speaker Features Learning for Audio-visual Multichannel Speech Separation and Recognition
Guinan Li, Jiajun Deng, Youjun Chen, Mengzhe Geng, Shujie Hu, Zhe Li, Zengrui Jin, Tianzi Wang, Xurong Xie, Helen Meng, Xunying Liu

CNVSRC 2023: The First Chinese Continuous Visual Speech Recognition Challenge
Chen Chen, Zehua Liu, Xiaolou Li, Lantian Li, Dong Wang

Spoken Document Summarization


Optimizing the role of human evaluation in LLM-based spoken document summarization systems
Margaret Kroll, Kelsey Kraus

Key-Element-Informed sLLM Tuning for Document Summarization
Sangwon Ryu, Heejin Do, Yunsu Kim, Gary Geunbae Lee, Jungseul Ok

Sentence-wise Speech Summarization: Task, Datasets, and End-to-End Modeling with LM Knowledge Distillation
Kohei Matsuura, Takanori Ashihara, Takafumi Moriya, Masato Mimura, Takatomo Kano, Atsunori Ogawa, Marc Delcroix

An End-to-End Speech Summarization Using Large Language Model
Hengchao Shang, Zongyao Li, Jiaxin Guo, Shaojun Li, Zhiqiang Rao, Yuanchang Luo, Daimeng Wei, Hao Yang

Prompting Large Language Models with Audio for General-Purpose Speech Summarization
Wonjune Kang, Deb Roy

Real-time Speech Summarization for Medical Conversations
Khai Le-Duc, Khai-Nguyen Nguyen, Long Vo-Dang, Truong-Son Hy

Speech and Language in Health: from Remote Monitoring to Medical Conversations - 2 (Special Sessions)


It’s Time to Take Action: Acoustic Modeling of Motor Verbs to Detect Parkinson’s Disease
Daniel Escobar-Grisales, Cristian David Ríos-Urrego, Ilja Baumann, Korbinian Riedhammer, Elmar Noeth, Tobias Bocklet, Adolfo M. Garcia, Juan Rafael Orozco-Arroyave

Towards objective and interpretable speech disorder assessment: a comparative analysis of CNN and transformer-based models
Malo Maisonneuve, Corinne Fredouille, Muriel Lalain, Alain Ghio, Virginie Woisard

Macro-descriptors for Alzheimer’s disease detection using large language models
Catarina Botelho, John Mendonça, Anna Pompili, Tanja Schultz, Alberto Abad, Isabel Trancoso

Infusing Acoustic Pause Context into Text-Based Dementia Assessment
Franziska Braun, Sebastian P. Bayerl, Florian Hönig, Hartmut Lehfeld, Thomas Hillemacher, Tobias Bocklet, Korbinian Riedhammer

Towards Scalable Remote Assessment of Mild Cognitive Impairment Via Multimodal Dialog
Oliver Roesler, Jackson Liscombe, Michael Neumann, Hardik Kothare, Abhishek Hosamath, Lakshmi Arbatti, Doug Habberstad, Christiane Suendermann-Oeft, Meredith Bartlett, Cathy Zhang, Nikhil Sukhdev, Kolja Wilms, Anusha Badathala, Sandrine Istas, Steve Ruhmel, Bryan Hansen, Madeline Hannan, David Henley, Arthur Wallace, Ira Shoulson, David Suendermann-Oeft, Vikram Ramanarayanan

Automatic recognition and detection of aphasic natural speech
Mara Barberis, Pieter De Clercq, Bastiaan Tamm, Hugo Van hamme, Maaike Vandermosten

When Whisper Listens to Aphasia: Advancing Robust Post-Stroke Speech Recognition
Giulia Sanguedolce, Sophie Brook, Dragos C. Gruia, Patrick A. Naylor, Fatemeh Geranmayeh

Automatic Prediction of Amyotrophic Lateral Sclerosis Progression using Longitudinal Speech Transformer
Liming Wang, Yuan Gong, Nauman Dawalatabad, Marco Vilela, Katerina Placek, Brian Tracey, Yishu Gong, Alan Premasiri, Fernando Vieira, James Glass

How Consistent are Speech-Based Biomarkers in Remote Tracking of ALS Disease Progression Across Languages? A Case Study of English and Dutch
Hardik Kothare, Michael Neumann, Cathy Zhang, Jackson Liscombe, Jordi W J van Unnik, Lianne C M Botman, Leonard H van den Berg, Ruben P A van Eijk, Vikram Ramanarayanan

“So 
 my child 
” – How Child ADHD Influences the Way Parents Talk
Anika A. Spiesberger, Andreas Triantafyllopoulos, Alexander Kathan, Anastasia Semertzidou, Caterina Gawrilow, Tilman Reinelt, Wolfgang A. Rauch, Björn Schuller

Variability of speech timing features across repeated recordings: a comparison of open-source extraction techniques
Judith Dineley, Ewan Carr, Lauren L. White, Catriona Lucas, Zahia Rahman, Tian Pan, Faith Matcham, Johnny Downs, Richard J. Dobson, Thomas F. Quatieri, Nicholas Cummins

Zero-Shot End-To-End Spoken Question Answering In Medical Domain
Yanis Labrak, Adel Moumen, Richard Dufour, Mickael Rouvier

Perceiver-Prompt: Flexible Speaker Adaptation in Whisper for Chinese Disordered Speech Recognition
Yicong Jiang, Tianzi Wang, Xurong Xie, Juan Liu, Wei Sun, Nan Yan, Hui Chen, Lan Wang, Xunying Liu, Feng Tian

Show and Tell 2


Custom wake word detection
Kesavaraj V, Charan Devarkonda, Vamshiraghusimha Narasinga, Anil Kumar Vuppala

Edged based audio-visual speech enhancement demonstrator
Song Chen, Mandar Gogate, Kia Dashtipour, Jasper Kirton-Wingate, Adeel Hussain, Faiyaz Doctor, Tughrul Arslan, Amir Hussain

Real-Time Gaze-directed speech enhancement for audio-visual hearing-aids
Arif Reza Anway, Bryony Buck, Mandar Gogate, Kia Dashtipour, Michael Akeroyd, Amir Hussain

Detection of background agents speech in contact centers
Abhishek Kumar, Srikanth Konjeti, Jithendra Vepa

Leveraging large language models for post-transcription correction in contact centers
Bramhendra Koilakuntla, Prajesh Rana, Paras Ahuja, Srikanth Konjeti, Jithendra Vepa

Understanding “understanding”: presenting a richly annotated multimodal corpus of dyadic interaction
Leonie Schade, Nico Dallmann, Olcay Türk, Stefan Lazarov, Petra Wagner

A demonstrator for articulation-based command word recognition
Joao Vitor Possamai de Menezes, Arne-Lukas Fietkau, Tom Diener, Steffen Kurbis, Peter Birkholz

Pragmatically similar utterance finder demonstration
Nigel G. Ward, Andres Segura

Real-time scheme for rapid extraction of speaker embeddings in challenging recording conditions
Kai Liu, Ziqing Du, Zhou Huan, Xucheng Wan, Naijun Zheng

TEEMI: a speaking practice tool for L2 English learners
Szu-Yu Chen, Tien-Hong Lo, Yao-Ting Sung, Ching-Yu Tseng, Berlin Chen

Prosody


Automatic pitch accent classification through image classification
Na Hu, Hugo Schnack, Amalia Arvaniti

Form and Function in Prosodic Representation: In the Case of ‘ma’ in Tianjin Mandarin
Tianqi Geng, Hui Feng

On Comparing Time- and Frequency-Domain Rhythm Measures in Classifying Assamese Dialects
Joyshree Chakraborty, Leena Dihingia, Priyankoo Sarmah, Rohit Sinha

The prosody of the verbal prefix ge-: historical and experimental evidence
Chiara Riegger, Tina Bögel, George Walkden

Influences of Morphosyntax and Semantics on the Intonation of Mandarin Chinese Wh-indeterminates
Hongchen Wu, Jiwon Yun

Urdu Alternative Questions: A Hat Pattern
Benazir Mumtaz, Miriam Butt

Foundational Models for Deepfake and Spoofed Speech Detection


Spoofed Speech Detection with a Focus on Speaker Embedding
Hoan My Tran, David Guennec, Philippe Martin, Aghilas Sini, Damien Lolive, Arnaud Delhay, Pierre-François Marteau

Exploring Self-supervised Embeddings and Synthetic Data Augmentation for Robust Audio Deepfake Detection
Juan M. Martín-Doñas, Aitor Álvarez, Eros Rosello, Angel M. Gomez, Antonio M. Peinado

Attentive Merging of Hidden Embeddings from Pre-trained Speech Model for Anti-spoofing Detection
Zihan Pan, Tianchi Liu, Hardik B. Sailor, Qiongqiong Wang

Adapter Learning from Pre-trained Model for Robust Spoof Speech Detection
Haochen Wu, Wu Guo, Shengyu Peng, Zhuhai Li, Jie Zhang

Speech Formants Integration for Generalized Detection of Synthetic Speech Spoofing Attacks
Kexu Liu, Yuanxin Wang, Shengchen Li, Xi Shao

Balance, Multiple Augmentation, and Re-synthesis: A Triad Training Strategy for Enhanced Audio Deepfake Detection
Thien-Phuc Doan, Long Nguyen-Vu, Kihun Hong, Souhwan Jung

Speaker Recognition 1


Fine-tune Pre-Trained Models with Multi-Level Feature Fusion for Speaker Verification
Shengyu Peng, Wu Guo, Haochen Wu, Zuoliang Li, Jie Zhang

Speaker Conditional Sinc-Extractor for Personal VAD
En-Lun Yu, Kuan-Hsun Ho, Jeih-weih Hung, Shih-Chieh Huang, Berlin Chen

Enhancing ECAPA-TDNN with Feature Processing Module and Attention Mechanism for Speaker Verification
Shiu-Hsiang Liou, Po-Cheng Chan, Chia-Ping Chen, Tzu-Chieh Lin, Chung-Li Lu, Yu-Han Cheng, Hsiang-Feng Chuang, Wei-Yu Chen

MR-RawNet: Speaker verification system with multiple temporal resolutions for variable duration utterances using raw waveforms
Seung-bin Kim, Chan-yeong Lim, Jungwoo Heo, Ju-ho Kim, Hyun-seo Shin, Kyo-Won Koo, Ha-Jin Yu

Disentangled Representation Learning for Environment-agnostic Speaker Recognition
KiHyun Nam, Hee-Soo Heo, Jee-weon Jung, Joon Son Chung

Multi-Channel Extension of Pre-trained Models for Speaker Verification
Ladislav Mošner, Romain Serizel, Lukáš Burget, Oldřich Plchot, Emmanuel Vincent, Junyi Peng, Jan Černocký

Efficient Integrated Features Based on Pre-trained Models for Speaker Verification
Yishuang Li, Wenhao Guan, Hukai Huang, Shiyu Miao, Qi Su, Lin Li, Qingyang Hong

SE/BN Adapter: Parametric Efficient Domain Adaptation for Speaker Recognition
Tianhao Wang, Lantian Li, Dong Wang

DB-PMAE: Dual-Branch Prototypical Masked AutoEncoder with locality for domain robust speaker verification
Wei-lin Xie, Yu-Xuan Xi, Yan Song, Jian-tao Zhang, Hao-yu Song, Ian McLoughlin

Evaluating the Santa Barbara Corpus: Challenges of the Breadth of Conversational Spoken Language
Matthew Maciejewski, Dominik Klement, Ruizhe Huang, Matthew Wiesner, Sanjeev Khudanpur

A Comprehensive Investigation on Speaker Augmentation for Speaker Recognition
Zhenyu Zhou, Shibiao Xu, Shi Yin, Lantian Li, Dong Wang

Source Separation 1


Noise-robust Speech Separation with Fast Generative Correction
Helin Wang, Jesús Villalba, Laureano Moro-Velazquez, Jiarui Hai, Thomas Thebaud, Najim Dehak

MSDET: Multitask Speaker Separation and Direction-of-Arrival Estimation Training
Roland Hartanto, Sakriani Sakti, Koichi Shinoda

Unsupervised Improved MVDR Beamforming for Sound Enhancement
Jacob Kealey, John R. Hershey, François Grondin

Improving Generalization of Speech Separation in Real-World Scenarios: Strategies in Simulation, Optimization, and Evaluation
Ke Chen, Jiaqi Su, Taylor Berg-Kirkpatrick, Shlomo Dubnov, Zeyu Jin

Enhanced Deep Speech Separation in Clustered Ad Hoc Distributed Microphone Environments
Jihyun Kim, Stijn Kindt, Nilesh Madhu, Hong-Goo Kang

Towards Audio Codec-based Speech Separation
Jia Qi Yip, Shengkui Zhao, Dianwen Ng, Eng Siong Chng, Bin Ma

Audio-Visual and Generative Speech Enhancement


Locally Aligned Rectified Flow Model for Speech Enhancement Towards Single-Step Diffusion
Zhengxiao Li, Nakamasa Inoue

Diffusion Gaussian Mixture Audio Denoise
Pu Wang, Junhui Li, Jialu Li, Liangdong Guo, Youshan Zhang

An Analysis of the Variance of Diffusion-based Speech Enhancement
Bunlong Lay, Timo Gerkmann

FlowAVSE: Efficient Audio-Visual Speech Enhancement with Conditional Flow Matching
Chaeyoung Jung, Suyeon Lee, Ji-Hoon Kim, Joon Son Chung

RT-LA-VocE: Real-Time Low-SNR Audio-Visual Speech Enhancement
Honglie Chen, Rodrigo Mira, Stavros Petridis, Maja Pantic

Complex Image-Generative Diffusion Transformer for Audio Denoising
Junhui Li, Pu Wang, Jialu Li, Youshan Zhang

Noise-aware Speech Enhancement using Diffusion Probabilistic Model
Yuchen Hu, Chen Chen, Ruizhe Li, Qiushi Zhu, Eng Siong Chng

Speech Privacy and Bandwidth Expansion


Privacy PORCUPINE: Anonymization of Speaker Attributes Using Occurrence Normalization for Space-Filling Vector Quantization
Mohammad Hassan Vali, Tom Bäckström

SilentCipher: Deep Audio Watermarking
Mayank Kumar Singh, Naoya Takahashi, Weihsiang Liao, Yuki Mitsufuji

Frequency-mix Knowledge Distillation for Fake Speech Detection
Cunhang Fan, Shunbo Dong, Jun Xue, Yujie Chen, Jiangyan Yi, Zhao Lv

A New Approach to Voice Authenticity
Nicolas M. Müller, Piotr Kawa, Shen Hu, Matthias Neu, Jennifer Williams, Philip Sperl, Konstantin Böttinger

TraceableSpeech: Towards Proactively Traceable Text-to-Speech with Watermarking
Junzuo Zhou, Jiangyan Yi, Tao Wang, Jianhua Tao, Ye Bai, Chu Yuan Zhang, Yong Ren, Zhengqi Wen

HarmoNet: Partial DeepFake Detection Network based on Multi-scale HarmoF0 Feature Fusion
Liwei Liu, Huihui Wei, Dongya Liu, Zhonghua Fu

Unmasking Neural Codecs: Forensic Identification of AI-compressed Speech
Denise Moussa, Sandra Bergmann, Christian Riess

SWiBE: A Parameterized Stochastic Diffusion Process for Noise-Robust Bandwidth Expansion
Yin-Tse Lin, Shreya G. Upadhyay, Bo-Hao Su, Chi-Chun Lee

MultiStage Speech Bandwidth Extension with Flexible Sampling Rate Control
Ye-Xin Lu, Yang Ai, Zheng-Yan Sheng, Zhen-Hua Ling

MaskSR: Masked Language Model for Full-band Speech Restoration
Xu Li, Qirui Wang, Xiaoyu Liu

Speech Synthesis: Prosody


Word-level Text Markup for Prosody Control in Speech Synthesis
Yuliya Korotkova, Ilya Kalinovskiy, Tatiana Vakhrusheva

Should you use a probabilistic duration model in TTS? Probably! Especially for spontaneous speech
Shivam Mehta, Harm Lameris, Rajiv Punmiya, Jonas Beskow, Eva Szekely, Gustav Eje Henter

Total-Duration-Aware Duration Modeling for Text-to-Speech Systems
Sefik Emre Eskimez, Xiaofei Wang, Manthan Thakker, Chung-Hsien Tsai, Canrun Li, Zhen Xiao, Hemin Yang, Zirun Zhu, Min Tang, Jinyu Li, Sheng Zhao, Naoyuki Kanda

A Human-in-the-Loop Approach to Improving Cross-Text Prosody Transfer
Himanshu Maurya, Atli Sigurgeirsson

Towards Expressive Zero-Shot Speech Synthesis with Hierarchical Prosody Modeling
Yuepeng Jiang, Tao Li, Fengyu Yang, Lei Xie, Meng Meng, Yujun Wang

Multi-Modal Automatic Prosody Annotation with Contrastive Pretraining of Speech-Silence and Word-Punctuation
Jinzuomu Zhong, Yang Li, Hui Huang, Korin Richmond, Jie Liu, Zhiba Su, Jing Guo, Benlai Tang, Fengjie Zhu

Accented Speech, Prosodic Features, Dialect, Emotion, Sound Classification


Improving Self-supervised Pre-training using Accent-Specific Codebooks
Darshan Prabhu, Abhishek Gupta, Omkar Nitsure, Preethi Jyothi, Sriram Ganapathy

Performant ASR Models for Medical Entities in Accented Speech
Tejumade Afonja, Tobi Olatunji, Sewade Ogun, Naome A. Etori, Abraham Owodunni, Moshood Yekini

LAHAJA: A Robust Multi-accent Benchmark for Evaluating Hindi ASR Systems
Tahir Javed, Janki Nawale, Sakshi Joshi, Eldho George, Kaushal Bhogale, Deovrat Mehendale, Mitesh M. Khapra

LearnerVoice: A Dataset of Non-Native English Learners’ Spontaneous Speech
Haechan Kim, Junho Myung, Seoyoung Kim, Sungpah Lee, Dongyeop Kang, Juho Kim

MinSpeech: A Corpus of Southern Min Dialect for Automatic Speech Recognition
Jiayan Lin, Shenghui Lu, Hukai Huang, Wenhao Guan, Binbin Xu, Hui Bu, Qingyang Hong, Lin Li

Cross-modal Features Interaction-and-Aggregation Network with Self-consistency Training for Speech Emotion Recognition
Ying Hu, Huamin Yang, Hao Huang, Liang He

Exploring Multilingual Unseen Speaker Emotion Recognition: Leveraging Co-Attention Cues in Multitask Learning
Arnav Goel, Medha Hira, Anubha Gupta

SELM: Enhancing Speech Emotion Recognition for Out-of-Domain Scenarios
Hazim Bukhari, Soham Deshmukh, Hira Dhamyal, Bhiksha Raj, Rita Singh

The Processing of Stress in End-to-End Automatic Speech Recognition Models
Martijn Bentum, Louis ten Bosch, Tom Lentz

LingWav2Vec2: Linguistic-augmented wav2vec 2.0 for Vietnamese Mispronunciation Detection
Tuan Nguyen, Huy Dat Tran

Learning from memory-based models
Rhiannon Mogridge, Anton Ragni

Towards End-to-End Unified Recognition for Mandarin and Cantonese
Meiling Chen, Pengjie Liu, Heng Yang, Haofeng Wang

Neural Network Adaptation


Shared-Adapters: A Novel Transformer-based Parameter Efficient Transfer Learning Approach For Children’s Automatic Speech Recognition
Thomas Rolland, Alberto Abad

AdaRA: Adaptive Rank Allocation of Residual Adapters for Speech Foundation Model
Zhouyuan Huo, Dongseong Hwang, Gan Song, Khe Chai Sim, Weiran Wang

Leveraging Adapter for Parameter-Efficient ASR Encoder
Kyuhong Shim, Jinkyu Lee, Hyunjae Kim

Whisper Multilingual Downstream Task Tuning Using Task Vectors
Ji-Hun Kang, Jae-Hong Lee, Mun-Hak Lee, Joon-Hyuk Chang

Speaker-Smoothed kNN Speaker Adaptation for End-to-End ASR
Shaojun Li, Daimeng Wei, Hengchao Shang, Jiaxin Guo, ZongYao Li, Zhanglin Wu, Zhiqiang Rao, Yuanchang Luo, Xianghui He, Hao Yang

Qifusion-Net: Layer-adapted Stream/Non-stream Model for End-to-End Multi-Accent Speech Recognition
Jinming Chen, Jingyi Fang, Yuanzhong Zheng, Yaoxuan Wang, Haojun Fei

ASR and LLMs


HuBERT-EE: Early Exiting HuBERT for Efficient Speech Recognition
Ji Won Yoon, Beom Jun Woo, Nam Soo Kim

MaLa-ASR: Multimedia-Assisted LLM-Based ASR
Guanrou Yang, Ziyang Ma, Fan Yu, Zhifu Gao, Shiliang Zhang, Xie Chen

Spoken-to-written text conversion with Large Language Model
HyunJung Choi, Muyeol Choi, Yohan Lim, Minkyu Lee, Seonhui Kim, Seung Yun, Donghyun Kim, SangHun Kim

MM-KWS: Multi-modal Prompts for Multilingual User-defined Keyword Spotting
Zhiqi Ai, Zhiyong Chen, Shugong Xu

Whisper-Flamingo: Integrating Visual Features into Whisper for Audio-Visual Speech Recognition and Translation
Andrew Rouditchenko, Yuan Gong, Samuel Thomas, Leonid Karlinsky, Hilde Kuehne, Rogerio Feris, James Glass

Speech Recognition Models are Strong Lip-readers
K R Prajwal, Triantafyllos Afouras, Andrew Zisserman

Pathological Speech Analysis 3


Towards Self-Attention Understanding for Automatic Articulatory Processes Analysis in Cleft Lip and Palate Speech
Ilja Baumann, Dominik Wagner, Maria Schuster, Korbinian Riedhammer, Elmar Noeth, Tobias Bocklet

Clever Hans Effect Found in Automatic Detection of Alzheimer’s Disease through Speech
Yin-Long Liu, Rui Feng, Jia-Hong Yuan, Zhen-Hua Ling

Leveraging Phonemic Transcription and Whisper toward Clinically Significant Indices for Automatic Child Speech Assessment
Yeh-Sheng Lin, Shu-Chuan Tseng, Jyh-Shing Roger Jang

Developing vocal system impaired patient-aimed voice quality assessment approach using ASR representation-included multiple features
Shaoxiang Dang, Tetsuya Matsumoto, Yoshinori Takeuchi, Takashi Tsuboi, Yasuhiro Tanaka, Daisuke Nakatsubo, Satoshi Maesawa, Ryuta Saito, Masahisa Katsuno, Hiroaki Kudo

A Cluster-based Personalized Federated Learning Strategy for End-to-End ASR of Dementia Patients
Wei-Tung Hsu, Chin-Po Chen, Yun-Shao Lin, Chi-Chun Lee

A Comparative Analysis of Federated Learning for Speech-Based Cognitive Decline Detection
Stefan Kalabakov, Monica Gonzalez-Machorro, Florian Eyben, Björn W. Schuller, Bert Arnrich

Multimodal Digital Biomarkers for Longitudinal Tracking of Speech Impairment Severity in ALS: An Investigation of Clinically Important Differences
Michael Neumann, Hardik Kothare, Jackson Liscombe, Emma C.L. Leschly, Oliver Roesler, Vikram Ramanarayanan

Speech Disorders 3


Enhancing Voice Wake-Up for Dysarthria: Mandarin Dysarthria Speech Corpus Release and Customized System Design
Ming Gao, Hang Chen, Jun Du, Xin Xu, Hongxiao Guo, Hui Bu, Jianxing Yang, Ming Li, Chin-Hui Lee

Towards Improving NAM-to-Speech Synthesis Intelligibility using Self-Supervised Speech Models
Neil Shah, Shirish Karande, Vineet Gandhi

PARAN: Variational Autoencoder-based End-to-End Articulation-to-Speech System for Speech Intelligibility
Seyun Um, Doyeon Kim, Hong-Goo Kang

Acoustic changes in speech prosody produced by children with autism after robot-assisted speech training
Si Chen, Bruce Xiao Wang, Yitian Hong, Fang Zhou, Angel Chan, Po-yi Tang, Bin Li, Chunyi Wen, James Cheung, Yan Liu, Zhuoming Chen

Fine-Tuning Automatic Speech Recognition for People with Parkinson’s: An Effective Strategy for Enhancing Speech Technology Accessibility
Xiuwen Zheng, Bornali Phukon, Mark Hasegawa-Johnson

Learnings from curating a trustworthy, well-annotated, and useful dataset of disordered English speech
Pan-Pan Jiang, Jimmy Tobin, Katrin Tomanek, Robert MacDonald, Katie Seaver, Richard Cave, Marilyn Ladewig, Rus Heywood, Jordan Green

Training Data Augmentation for Dysarthric Automatic Speech Recognition by Text-to-Dysarthric-Speech Synthesis
Wing-Zin Leung, Mattias Cross, Anton Ragni, Stefan Goetze

Wav2vec 2.0 Embeddings Are No Swiss Army Knife — A Case Study for Multiple Sclerosis
Gábor Gosztolya, Mercedes Vetráb, Veronika Svindt, Judit Bóna, Ildikó Hoffmann

Speech Recognition with Large Pretrained Speech Models for Under-represented Languages (Special Session)


Interface Design for Self-Supervised Speech Models
Yi-Jen Shih, David Harwath

Comparing Discrete and Continuous Space LLMs for Speech Recognition
Yaoxun Xu, Shi-Xiong Zhang, Jianwei Yu, Zhiyong Wu, Dong Yu

Improving Whisper’s Recognition Performance for Under-Represented Language Kazakh Leveraging Unpaired Speech and Text
Jinpeng Li, Yu Pu, Qi Sun, Wei-Qiang Zhang

Empowering Low-Resource Language ASR via Large-Scale Pseudo Labeling
Kaushal Santosh Bhogale, Deovrat Mehendale, Niharika Parasa, Sathish Kumar Reddy G, Tahir Javed, Pratyush Kumar, Mitesh M. Khapra

Interleaved Audio/Audiovisual Transfer Learning for AV-ASR in Low-Resourced Languages
Zhengyang Li, Patrick Blumenberg, Jing Liu, Thomas Graave, Timo Lohrenz, Siegfried Kunzmann, Tim Fingscheidt

Adapter pre-training for improved speech recognition in unseen domains using low resource adapter tuning of self-supervised models
Sathvik Udupa, Jesuraj Bandekar, Saurabh Kumar, Deekshitha G, Sandhya B, Abhayjeet S, Savitha Murthy, Priyanka Pai, Srinivasa Raghavan, Raoul Nanavati, Prasanta Kumar Ghosh

Towards Rehearsal-Free Multilingual ASR: A LoRA-based Case Study on Whisper
Tianyi Xu, Kaixun Huang, Pengcheng Guo, Yu Zhou, Longtao Huang, Hui Xue, Lei Xie

Exploring adaptation techniques of large speech foundation models for low-resource ASR: a case study on Northern Sámi
Yaroslav Getman, Tamas Grosz, Katri Hiovain-Asikainen, Mikko Kurimo

Learn and Don’t Forget: Adding a New Language to ASR Foundation Models
Mengjie Qian, Siyuan Tang, Rao Ma, Kate M. Knill, Mark J.F. Gales

Speech Processing Using Discrete Speech Units (Special Session)


TokSing: Singing Voice Synthesis based on Discrete Tokens
Yuning Wu, Chunlei Zhang, Jiatong Shi, Yuxun Tang, Shan Yang, Qin Jin

How Should We Extract Discrete Audio Tokens from Self-Supervised Models?
Pooneh Mousavi, Jarod Duret, Salah Zaiem, Luca Della Libera, Artem Ploujnikov, Cem Subakan, Mirco Ravanelli

The Interspeech 2024 Challenge on Speech Processing Using Discrete Units
Xuankai Chang, Jiatong Shi, Jinchuan Tian, Yuning Wu, Yuxun Tang, Yihan Wu, Shinji Watanabe, Yossi Adi, Xie Chen, Qin Jin

SingOMD: Singing Oriented Multi-resolution Discrete Representation Construction from Speech Models
Yuxun Tang, Yuning Wu, Jiatong Shi, Qin Jin

MMM: Multi-Layer Multi-Residual Multi-Stream Discrete Speech Representation from Self-supervised Learning Model
Jiatong Shi, Xutai Ma, Hirofumi Inaguma, Anna Sun, Shinji Watanabe

Codec-ASR: Training Performant Automatic Speech Recognition Systems with Discrete Speech Representations
Kunal Dhawan, Nithin Rao Koluguri, Ante Jukić, Ryan Langman, Jagadeesh Balam, Boris Ginsburg

Databases and Progress in Methodology


VoxSim: A perceptual voice similarity dataset
Junseok Ahn, Youkyum Kim, Yeunju Choi, Doyeop Kwak, Ji-Hoon Kim, Seongkyu Mun, Joon Son Chung

UY/CH-CHILD — A Public Chinese L2 Speech Database of Uyghur Children
Mewlude Nijat, Chen Chen, Dong Wang, Askar Hamdulla

State-of-the-art speech production MRI protocol for new 0.55 Tesla scanners
Prakash Kumar, Ye Tian, Yongwan Lim, Sophia X. Cui, Christina Hagedorn, Dani Byrd, Uttam K. Sinha, Shrikanth Narayanan, Krishna S. Nayak

DBD-CI: Doubling the Band Density for Bilateral Cochlear Implants
Mingyue Shi, Huali Zhou, Qinglin Meng, Nengheng Zheng

Leveraging Large Language Models to Refine Automatic Feedback Generation at Articulatory Level in Computer Aided Pronunciation Training
Huihang Zhong, Yanlu Xie, ZiJin Yao

Decoding Human Language Acquisition: EEG Evidence for Predictive Probabilistic Statistics in Word Segmentation
Bin Zhao, Mingxuan Huang, Chenlu Ma, Jinyi Xue, Aijun Li, Kunyu Xu

Articulation, Convergence and Perception


Behavioral evidence for higher speech rate convergence following natural than artificial time altered speech
Jérémy Giroud, Jessica Lei, Kirsty Phillips, Matthew H. Davis

A novel experimental design for the study of listener-to-listener convergence in phoneme categorization
Qingye Shen, Leonardo Lancia, Noel Nguyen

Cross-Attention-Guided WaveNet for EEG-to-MEL Spectrogram Reconstruction
Hao Li, Yuan Fang, Xueliang Zhang, Fei Chen, Guanglai Gao

What if HAL breathed? Enhancing Empathy in Human-AI Interactions with Breathing Speech Synthesis
Nicolò Loddo, Francisca Pessanha, Almila Akdag

Magnitude and timing of acceleration peaks in stressed and unstressed syllables
Malin Svensson Lundmark

Speech Emotion Recognition


ExHuBERT: Enhancing HuBERT Through Block Extension and Fine-Tuning on 37 Emotion Datasets
Shahin Amiriparian, Filip Packań, Maurice Gerczuk, Björn W. Schuller

Dataset-Distillation Generative Model for Speech Emotion Recognition
Fabian Ritter-Gutierrez, Kuan-Po Huang, Jeremy H. M. Wong, Dianwen Ng, Hung-yi Lee, Nancy F. Chen, Eng-Siong Chng

DropFormer: A Dynamic Noise-Dropping Transformer for Speech Emotion Recognition
Jialong Mai, Xiaofen Xing, Weidong Chen, Xiangmin Xu

From Text to Emotion: Unveiling the Emotion Annotation Capabilities of LLMs
Minxue Niu, Mimansa Jaiswal, Emily Mower Provost

Self-Supervised Models in Speaker Recognition


Self-supervised speaker verification with relational mask prediction
Ju-ho Kim, Hee-Soo Heo, Bong-Jin Lee, Youngki Kwon, Minjae Lee, Ha-Jin Yu

Towards Supervised Performance on Speaker Verification with Self-Supervised Learning by Leveraging Large-Scale ASR Models
Victor Miara, Theo Lepage, Reda Dehak

Improving Noise Robustness in Self-supervised Pre-trained Model for Speaker Verification
Chan-yeong Lim, Hyun-seo Shin, Ju-ho Kim, Jungwoo Heo, Kyo-Won Koo, Seung-bin Kim, Ha-Jin Yu

On the impact of several regularization techniques on label noise robustness of self-supervised speaker verification systems
Abderrahim Fathan, Xiaolin Zhu, Jahangir Alam

Parameter-efficient Fine-tuning of Speaker-Aware Dynamic Prompts for Speaker Verification
Zhe Li, Man-wai Mak, Hung-yi Lee, Helen Meng

Whisper-PMFA: Partial Multi-Scale Feature Aggregation for Speaker Verification using Whisper Models
Yiyang Zhao, Shuai Wang, Guangzhi Sun, Zehua Chen, Chao Zhang, Mingxing Xu, Thomas Fang Zheng

Speech Quality Assessment


Embedding Learning for Preference-based Speech Quality Assessment
ChengHung Hu, Yusuke Yasuda, Tomoki Toda

IndicMOS: Multilingual MOS Prediction for 7 Indian languages
Sathvik Udupa, Soumi Maiti, Prasanta Kumar Ghosh

Experimental evaluation of MOS, AB and BWS listening test designs
Dan Wells, Andrea Lorena Aldana Blanco, Cassia Valentini, Erica Cooper, Aidan Pine, Junichi Yamagishi, Korin Richmond

Enhancing No-Reference Speech Quality Assessment with Pairwise, Triplet Ranking Losses, and ASR Pretraining
Bao Thang Ta, Minh Tu Le, Van Hai Do, Huynh Thi Thanh Binh

Privacy and Security in Speech Communication 1


Harder or Different? Understanding Generalization of Audio Deepfake Detection
Nicolas M. Müller, Nicholas Evans, Hemlata Tak, Philip Sperl, Konstantin Böttinger

Prompt Tuning for Audio Deepfake Detection: Computationally Efficient Test-time Domain Adaptation with Limited Target Dataset
Hideyuki Oiso, Yuto Matsunaga, Kazuya Kakizaki, Taiki Miyagawa

Robust spread spectrum speech watermarking using linear prediction and deep spectral shaping
David Looney, Nikolay D. Gaubitch

RawBMamba: End-to-End Bidirectional State Space Model for Audio Deepfake Detection
Yujie Chen, Jiangyan Yi, Jun Xue, Chenglong Wang, Xiaohui Zhang, Shunbo Dong, Siding Zeng, Jianhua Tao, Zhao Lv, Cunhang Fan

How Private is Low-Frequency Speech Audio in the Wild? An Analysis of Verbal Intelligibility by Humans and Machines
Ailin Liu, Pepijn Vunderink, Jose Vargas Quiros, Chirag Raman, Hayley Hung

RW-VoiceShield: Raw Waveform-based Adversarial Attack on One-shot Voice Conversion
Ching-Yu Yang, Shreya G. Upadhyay, Ya-Tse Wu, Bo-Hao Su, Chi-Chun Lee

Speech Synthesis: Voice Conversion 2


Improvement Speaker Similarity for Zero-Shot Any-to-Any Voice Conversion of Whispered and Regular Speech
Aleksei Gusev, Anastasia Avdeeva

Utilizing Adaptive Global Response Normalization and Cluster-Based Pseudo Labels for Zero-Shot Voice Conversion
Ji Sub Um, Hoirin Kim

Vec-Tok-VC+: Residual-enhanced Robust Zero-shot Voice Conversion with Progressive Constraints in a Dual-mode Training Strategy
Linhan Ma, Xinfa Zhu, Yuanjun Lv, Zhichao Wang, Ziqian Wang, Wendi He, Hongbin Zhou, Lei Xie

Noise-Robust Voice Conversion by Conditional Denoising Training Using Latent Variables of Recording Quality and Environment
Takuto Igarashi, Yuki Saito, Kentaro Seki, Shinnosuke Takamichi, Ryuichi Yamamoto, Kentaro Tachibana, Hiroshi Saruwatari

Pre-training Neural Transducer-based Streaming Voice Conversion for Faster Convergence and Alignment-free Training
Hiroki Kanagawa, Takafumi Moriya, Yusuke Ijima

Residual Speaker Representation for One-Shot Voice Conversion
Le Xu, Jiangyan Yi, Tao Wang, Yong Ren, Rongxiu Zhong, Zhengqi Wen, Jianhua Tao

Disentangling prosody and timbre embeddings via voice conversion
Nicolas Gengembre, Olivier Le Blouch, Cédric Gendrot

LDM-SVC: Latent Diffusion Model Based Zero-Shot Any-to-Any Singing Voice Conversion with Singer Guidance
Shihao Chen, Yu Gu, Jie Zhang, Na Li, Rilin Chen, Liping Chen, Lirong Dai

Speech Synthesis: Text Processing


A Language Modeling Approach to Diacritic-Free Hebrew TTS
Amit Roth, Arnon Turetzky, Yossi Adi

Exploring the Benefits of Tokenization of Discrete Acoustic Units
Avihu Dekel, Raul Fernandez

Homograph Disambiguation with Text-to-Text Transfer Transformer
Markéta Řezáčková, Daniel Tihelka, Jindřich Matoušek

Enhancing Japanese Text-to-Speech Accuracy with a Novel Combination Transformer-BERT-based G2P: Integrating Pronunciation Dictionaries and Accent Sandhi
Kiyoshi Kurihara, Masanori Sano

Audio-conditioned phonemic and prosodic annotation for building text-to-speech models from unlabeled speech data
Yuma Shirahata, Byeongseon Park, Ryuichi Yamamoto, Kentaro Tachibana

G2PA: G2P with Aligned Audio for Mandarin Chinese
Xingxing Yang

Learning Pronunciation from Other Accents via Pronunciation Knowledge Transfer
Siqi Sun, Korin Richmond

Positional Description for Numerical Normalization
Deepanshu Gupta, Javier Latorre

Beyond graphemes and phonemes: continuous phonological features in neural text-to-speech synthesis
Christina Tånnander, Shivam Mehta, Jonas Beskow, Jens Edlund

Training Methods, Self-Supervised Learning, Adaptation


MSRS: Training Multimodal Speech Recognition Models from Scratch with Sparse Mask Optimization
Adriana Fernandez-Lopez, Honglie Chen, Pingchuan Ma, Lu Yin, Qiao Xiao, Stavros Petridis, Shiwei Liu, Maja Pantic

Speech and Language Recognition with Low-rank Adaptation of Pretrained Models
Amrutha Prasad, Srikanth Madikeri, Driss Khalil, Petr Motlicek, Christof Schuepbach

Convolution-Augmented Parameter-Efficient Fine-Tuning for Speech Recognition
Kwangyoun Kim, Suwon Shon, Yi-Te Hsu, Prashant Sridhar, Karen Livescu, Shinji Watanabe

LASER: Learning by Aligning Self-supervised Representations of Speech for Improving Content-related Tasks
Amit Meghanani, Thomas Hain

Self-Train Before You Transcribe
Robert Flynn, Anton Ragni

Unsupervised Online Continual Learning for Automatic Speech Recognition
Steven Vander Eeckt, Hugo Van hamme

Dual-path Adaptation of Pretrained Feature Extraction Module for Robust Automatic Speech Recognition
Hao Shi, Tatsuya Kawahara

Hierarchical Multi-Task Learning with CTC and Recursive Operation
Nahomi Kusunoki, Yosuke Higuchi, Tetsuji Ogawa, Tetsunori Kobayashi

Boosting CTC-based ASR using inter-layer attention-based CTC loss
Keigo Hojo, Yukoh Wakabayashi, Kengo Ohta, Atsunori Ogawa, Norihide Kitaoka

Self-training ASR Guided by Unsupervised ASR Teacher
Hyung Yong Kim, Byeong-Yeol Kim, Yunkyu Lim, Jihwan Park, Shukjae Choi, Yooncheol Ju, Jinseok Park, Youshin Lim, Seung Woo Yu, Hanbin Lee, Shinji Watanabe

Personality-memory Gated Adaptation: An Efficient Speaker Adaptation for Personalized End-to-end Automatic Speech Recognition
Yue Gu, Zhihao Du, Shiliang Zhang, Jiqing Han, Yongjun He

Speaker Personalization for Automatic Speech Recognition using Weight-Decomposed Low-Rank Adaptation
George Joseph, Arun Baby

Online Subloop Search via Uncertainty Quantization for Efficient Test-Time Adaptation
Jae-Hong Lee, Sang-Eon Lee, Dong-Hyun Kim, DoHee Kim, Joon-Hyuk Chang

ROAR: Reinforcing Original to Augmented Data Ratio Dynamics for Wav2vec2.0 Based ASR
Vishwanath Pratap Singh, Federico Malato, Ville HautamÀki, Md. Sahidullah, Tomi Kinnunen

Online Knowledge Distillation of Decoder-Only Large Language Models for Efficient Speech Recognition
Jeehye Lee, Hyeji Seo

Novel Architectures for ASR


Efficient and Robust Long-Form Speech Recognition with Hybrid H3-Conformer
Tomoki Honda, Shinsuke Sakai, Tatsuya Kawahara

Rapid Language Adaptation for Multilingual E2E Speech Recognition Using Encoder Prompting
Yosuke Kashiwagi, Hayato Futami, Emiru Tsunoo, Siddhant Arora, Shinji Watanabe

Quantifying Unintended Memorization in BEST-RQ ASR Encoders
Virat Shejwalkar, Om Thakkar, Arun Narayanan

SWAN: SubWord Alignment Network for HMM-free word timing estimation in end-to-end automatic speech recognition
Woo Hyun Kang, Srikanth Vishnubhotla, Rudolf Braun, Yogesh Virkar, Raghuveer Peri, Kyu J. Han

Multimodality and Foundation Models


Spontaneous Speech-Based Suicide Risk Detection Using Whisper and Large Language Models
Ziyun Cui, Chang Lei, Wen Wu, Yinan Duan, Diyang Qu, Ji Wu, Runsen Chen, Chao Zhang

Spoken Word2Vec: Learning Skipgram Embeddings from Speech
Mohammad Amaan Sayeed, Hanan Aldarmaki

SAMSEMO: New dataset for multilingual and multimodal emotion recognition
Pawel Bujnowski, Bartlomiej Kuzma, Bartlomiej Paziewski, Jacek Rutkowski, Joanna Marhula, Zuzanna Bordzicka, Piotr Andruszkiewicz

LLM-Driven Multimodal Opinion Expression Identification
Bonian Jia, Huiyao Chen, Yueheng Sun, Meishan Zhang, Min Zhang

Zero-Shot Fake Video Detection by Audio-Visual Consistency
Xiaolou Li, Zehua Liu, Chen Chen, Lantian Li, Li Guo, Dong Wang

Enhancing Speech-Driven 3D Facial Animation with Audio-Visual Guidance from Lip Reading Expert
Han EunGi, Oh Hyun-Bin, Kim Sung-Bin, Corentin Nivelet Etcheberry, Suekyeong Nam, Janghoon Ju, Tae-Hyun Oh

Spoken Dialogue Systems and Conversational Analysis 1


Autoregressive cross-interlocutor attention scores meaningfully capture conversational dynamics
Matthew McNeill, Rivka Levitan

ConvoCache: Smart Re-Use of Chatbot Responses
Conor Atkins, Ian Wood, Mohamed Ali Kaafar, Hassan Asghar, Nardine Basta, Michal Kepkowski

Joint Learning of Context and Feedback Embeddings in Spoken Dialogue
Livia Qian, Gabriel Skantze

DubWise: Video-Guided Speech Duration Control in Multimodal LLM-based Text-to-Speech for Dubbing
Neha Sahipjohn, Ashishkumar Gudmalwar, Nirmesh Shah, Pankaj Wasnik, Rajiv Ratn Shah

Contextual Interactive Evaluation of TTS Models in Dialogue Systems
Siyang Wang, Éva SzĂ©kely, Joakim Gustafson

GSQA: An End-to-End Model for Generative Spoken Question Answering
Min-Han Shih, Ho-Lam Chung, Yu-Chi Pai, Ming-Hao Hsu, Guan-Ting Lin, Shang-Wen Li, Hung-yi Lee

Speech Technology


Resource-Efficient Speech Quality Prediction through Quantization Aware Training and Binary Activation Maps
Mattias Nilsson, Riccardo Miccini, Clement Laroche, Tobias Piechowiak, Friedemann Zenke

Towards interfacing large language models with ASR systems using confidence measures and prompting
Maryam Naderi, Enno Hermann, Alexandre Nanchen, Sevada Hovsepyan, Mathew Magimai.-Doss

Text Injection for Neural Contextual Biasing
Zhong Meng, Zelin Wu, Rohit Prabhavalkar, Cal Peyser, Weiran Wang, Nanxin Chen, Tara N. Sainath, Bhuvana Ramabhadran

Prompting Large Language Models with Mispronunciation Detection and Diagnosis Abilities
Minglin Wu, Jing Xu, Xixin Wu, Helen Meng

Acceleration of Posteriorgram-based DTW by Distilling the Class-to-class Distances Encoded in the Classifier Used to Calculate Posteriors
Haitong Sun, Jaehyun Choi, Nobuaki Minematsu, Daisuke Saito

VECL-TTS: Voice identity and Emotional style controllable Cross-Lingual Text-to-Speech
Ashishkumar Gudmalwar, Nirmesh Shah, Sai Akarsh, Pankaj Wasnik, Rajiv Ratn Shah

Transferable speech-to-text large language model alignment module
Boyong Wu, Chao Yan, Haoran Pu

Sparse Binarization for Fast Keyword Spotting
Jonathan Svirsky, Uri Shaham, Ofir Lindenbaum

Pathological Speech Analysis 2


Quantifying the effect of speech pathology on automatic and human speaker verification
Bence Mark Halpern, Thomas Tienkamp, Wen-Chin Huang, Lester Phillip Violeta, Teja Rebernik, Sebastiaan de Visscher, Max Witjes, Martijn Wieling, Defne Abur, Tomoki Toda

Investigation of Layer-Wise Speech Representations in Self-Supervised Learning Models: A Cross-Lingual Study in Detecting Depression
Bubai Maji, Rajlakshmi Guha, Aurobinda Routray, Shazia Nasreen, Debabrata Majumdar

Detection of Cognitive Impairment and Alzheimer’s Disease Using a Speech- and Language-Based Protocol
Tanya Talkar, Sherman Charles, Chelsea Krantsevich, Kan Kawabata

Analyzing Multimodal Features of Spontaneous Voice Assistant Commands for Mild Cognitive Impairment Detection
Nana Lin, Youxiang Zhu, Xiaohui Liang, John A. Batsis, Caroline Summerour

Prosody-Driven Privacy-Preserving Dementia Detection
Dominika Woszczyk, Ranya Aloufi, Soteris Demetriou

Voice Disorder Analysis: a Transformer-based Approach
Alkis Koudounas, Gabriele Ciravegna, Marco Fantini, Erika Crosetti, Giovanni Succo, Tania Cerquitelli, Elena Baralis

Speech Science, Speech Technology, and Gender (Special Session)


Challenges of German Speech Recognition: A Study on Multi-ethnolectal Speech Among Adolescents
Martha Schubert, Daniel Duran, Ingo Siegert

Just Because We Camp, Doesn’t Mean We Should: The Ethics of Modelling Queer Voices
Atli Sigurgeirsson, Eddie L. Ungless

Automatic Classification of News Subjects in Broadcast News: Application to a Gender Bias Representation Analysis
Valentin Pelloin, LĂ©na Dodson, Émile Chapuis, Nicolas HervĂ©, David Doukhan

Gender Representation in TV and Radio: Automatic Information Extraction methods versus Manual Analyses
David Doukhan, Lena Dodson, Manon Conan, Valentin Pelloin, Aurélien Clamouse, Mélina Lepape, Géraldine Van Hille, Cécile Méadel, MarlÚne Coulomb-Gully

Acoustic Effects of Facial Feminisation Surgery on Speech and Singing: A Case Study
Cliodhna Hughes, Guy Brown, Ning Ma, Nicola Dibben

An inclusive approach to creating a palette of synthetic voices for gender diversity
Eva Szekely, Maxwell Hope

Speech After Gender: A Trans-Feminine Perspective on Next Steps for Speech Science and Technology
Robin Netzorg, Alyssa Cote, Sumi Koshin, Klo Vivienne Garoute, Gopala Krishna Anumanchipalli

Voice Quality Variation in AAE: An Additional Challenge for Addressing Bias in ASR Models?
Li-Fang Lai, Nicole Holliday

Articulatory Configurations across Genders and Periods in French Radio and TV archives
Benjamin Elie, David Doukhan, RĂ©mi Uro, Lucas Ondel-Yang, Albert Rilliard, Simon Devauchelle

On the Encoding of Gender in Transformer-based ASR Representations
Aravind Krishnan, Badr M. Abdullah, Dietrich Klakow

Speech Production and Perception


Towards a Quantitative Analysis of Coarticulation with a Phoneme-to-Articulatory Model
Chaofei Fan, Jaimie M. Henderson, Chris Manning, Francis R. Willett

A comparative study of the impact of voiceless alveolar and palato-alveolar sibilants in English on lip aperture and protrusion during VCV production
Chetan Sharma, Vaishnavi Chandwanshi, Prasanta Kumar Ghosh

Measurement and simulation of pressure losses due to airflow in vocal tract models
Peter Birkholz, Patrick HĂ€sner

On The Performance of EMA-synchronized Speech and Stand-alone Speech in Acoustic-to-articulatory Inversion
Qiang Fang

Glottal inverse filtering and vocal tract tuning for the numerical simulation of vowel /a/ with different levels of vocal effort
Marc Freixes, Marc Arnela, Joan Claudi SocorĂł, Luis Joglar-Ongay, Oriol Guasch, Francesc AlĂ­as-Pujol

Temporal Co-Registration of Simultaneous Electromagnetic Articulography and Electroencephalography for Precise Articulatory and Neural Data Alignment
Daniel Friedrichs, Monica Lancheros, Sam Kirkham, Lei He, Andrew Clark, Clemens Lutz, Volker Dellwo, Steven Moran

Phonetics and Phonology: Segmentals and Suprasegmentals


Frication noise features of Polish voiceless dental fricative and affricate produced by children with and without speech disorder
Zuzanna Miodonska, Michal Kręcichwost, Ewa Kwaƛniok, Agata Sage, Pawel Badura

Key Acoustic Cues for the Realization of Metrical Prominence in Tone Languages: A Cross-Dialect Study
Yiying Hu, Hui Feng

Revisiting Pitch Jumps: F0 Ratio in Seoul Korean
Michaela Watkins, Paul Boersma, Silke Hamann

Aerodynamics of Sakata labial-velar oral stops
Lorenzo Maselli, VĂ©ronique Delvaux

Collecting Mandible Movement in Brazilian Portuguese
Donna Erickson, Albert Rilliard, Malin Svensson Lundmark, Adelaide Silva, Leticia Rebollo Couto, Oliver Niebuhr, JoĂŁo Antonio de Moraes

Pitch-driven adjustments in tongue positions: Insights from ultrasound imaging
May Pik Yu Chan, Jianjing Kuang

Topics in Paralinguistics


Speaking of Health: Leveraging Large Language Models to assess Exercise Motivation and Behavior of Rehabilitation Patients
Suhas BN, Amanda Rebar, Saeed Abdullah

Confidence Estimation for Automatic Detection of Depression and Alzheimer’s Disease Based on Clinical Interviews
Wen Wu, Chao Zhang, Philip C. Woodland

Who Finds This Voice Attractive? A Large-Scale Experiment Using In-the-Wild Data
Hitoshi Suda, Aya Watanabe, Shinnosuke Takamichi

Acoustical analysis of the initial phones in speech-laugh
Ryo Setoguchi, Yoshiko Arimoto

On Calibration of Speech Classification Models: Insights from Energy-Based Model Investigations
Yaqian Hao, Chenguang Hu, Yingying Gao, Shilei Zhang, Junlan Feng

Emotion-Aware Speech Self-Supervised Representation Learning with Intensity Knowledge
Rui Liu, Zening Ma

Emotion Recognition: Fairness, Variability, Uncertainty


Dual-Constrained Dynamical Neural ODEs for Ambiguity-aware Continuous Emotion Prediction
Jingyao Wu, Ting Dang, Vidhyasaharan Sethu, Eliathamby Ambikairajah

An Inter-Speaker Fairness-Aware Speech Emotion Regression Framework
Hsing-Hang Chou, Woan-Shiuan Chien, Ya-Tse Wu, Chi-Chun Lee

The Whole Is Bigger Than the Sum of Its Parts: Modeling Individual Annotators to Capture Emotional Variability
James Tavernor, Yara El-Tawil, Emily Mower Provost

Iterative Prototype Refinement for Ambiguous Speech Emotion Recognition
Haoqin Sun, Shiwan Zhao, Xiangyu Kong, Xuechen Wang, Hui Wang, Jiaming Zhou, Yong Qin

An Investigation of Group versus Individual Fairness in Perceptually Fair Speech Emotion Recognition
Woan-Shiuan Chien, Chi-Chun Lee

Are you sure? Analysing Uncertainty Quantification Approaches for Real-world Speech Emotion Recognition
Oliver SchrĂŒfer, Manuel Milling, Felix Burkhardt, Florian Eyben, Björn Schuller

Speech emotion recognition with deep learning beamforming on a distant human-robot interaction scenario
Ricardo GarcĂ­a, Rodrigo Mahu, NicolĂĄs GrĂĄgeda, Alejandro Luzanto, Nicolas Bohmer, Carlos Busso, NĂ©stor Becerra Yoma

Speaker Verification


Challenging margin-based speaker embedding extractors by using the variational information bottleneck
Themos Stafylakis, Anna Silnova, Johan Rohdin, Oldƙich Plchot, Lukáơ Burget

Collaborative Contrastive Learning for Hypothesis Domain Adaptation
Jen-Tzung Chien, I-Ping Yeh, Man-Wai Mak

Extraction of interpretable and shared speaker-specific speech attributes through binary auto-encoder
Imen Ben-Amor, Jean-Francois Bonastre, Salima Mdhaffar

Reshape Dimensions Network for Speaker Recognition
Ivan Yakovlev, Rostislav Makarov, Andrei Balykin, Pavel Malov, Anton Okhotnikov, Nikita Torgashov

To what extent can ASV systems naturally defend against spoofing attacks?
Jee-weon Jung, Xin Wang, Nicholas Evans, Shinji Watanabe, Hye-jin Shim, Hemlata Tak, Siddhant Arora, Junichi Yamagishi, Joon Son Chung

ERes2NetV2: Boosting Short-Duration Speaker Verification Performance with Computational Efficiency
Yafeng Chen, Siqi Zheng, Hui Wang, Luyao Cheng, Qian Chen, Shiliang Zhang, Junjie Li

Spatial Audio and Acoustics


Classification of Room Impulse Responses and its application for channel verification and diarization
Yuri Khokhlov, Tatiana Prisyach, Anton Mitrofanov, Dmitry Dutov, Igor Agafonov, Tatiana Timofeeva, Aleksei Romanenko, Maxim Korenevsky

RIR-in-a-Box: Estimating Room Acoustics from 3D Mesh Data through Shoebox Approximation
Liam Kelley, Diego Di Carlo, Aditya Arie Nugraha, Mathieu Fontaine, Yoshiaki Bando, Kazuyoshi Yoshii

Novel-view Acoustic Synthesis From 3D Reconstructed Rooms
Byeongjoo Ahn, Karren Yang, Brian Hamilton, Jonathan Sheaffer, Anurag Ranjan, Miguel Sarabia, Oncel Tuzel, Jen-Hao Rick Chang

Spatial Acoustic Enhancement Using Unbiased Relative Harmonic Coefficients
Liang Tao, Maoshen Jia, Yonggang Hu, Changchun Bao

Design of Feedback Active Noise Cancellation Filter Using Nested Recurrent Neural Networks
Alireza Bayestehtashk, Amit Kumar, Mike Wurtz

Neuromorphic Keyword Spotting with Pulse Density Modulation MEMS Microphones
Sidi Yaya Arnaud Yarga, Sean U N Wood

RevRIR: Joint Reverberant Speech and Room Impulse Response Embedding using Contrastive Learning with Application to Room Shape Classification
Jacob Bitterman, Daniel Levi, Hilel Hagai Diamandi, Sharon Gannot, Tal Rosenwein

Generative Models for Speech and Audio


ConsistencyTTA: Accelerating Diffusion-Based Text-to-Audio Generation with Consistency Distillation
Yatong Bai, Trung Dang, Dung Tran, Kazuhito Koishida, Somayeh Sojoudi

Audio Editing with Non-Rigid Text Prompts
Francesco Paissan, Luca Della Libera, Zhepei Wang, Paris Smaragdis, Mirco Ravanelli, Cem Subakan

Phoneme Discretized Saliency Maps for Explainable Detection of AI-Generated Voice
Shubham Gupta, Mirco Ravanelli, Pascal Germain, Cem Subakan

Exploring compressibility of transformer based text-to-music (TTM) models
Vasileios Moschopoulos, Thanasis Kotsiopoulos, Pablo Peso Parada, Konstantinos Nikiforidis, Alexandros Stergiadis, Gerasimos Papakostas, Md Asif Jalal, Jisi Zhang, Anastasios Drosou, Karthikeyan Saravanan

Sound of Vision: Audio Generation from Visual Text Embedding through Training Domain Discriminator
Jaewon Kim, Won-Gook Choi, Seyun Ahn, Joon-Hyuk Chang

Retrieval-Augmented Classifier Guidance for Audio Generation
Ho-Young Choi, Won-Gook Choi, Joon-Hyuk Chang

Efficient Fine-tuning of Audio Spectrogram Transformers via Soft Mixture of Adapters
Umberto Cappellazzo, Daniele Falavigna, Alessio Brutti

PAM: Prompting Audio-Language Models for Audio Quality Assessment
Soham Deshmukh, Dareen Alharthi, Benjamin Elizalde, Hannes Gamper, Mahmoud Al Ismail, Rita Singh, Bhiksha Raj, Huaming Wang

Speech and Audio Modelling


GenDistiller: Distilling Pre-trained Language Models based on an Autoregressive Generative Model
Yingying Gao, Shilei Zhang, Chao Deng, Junlan Feng

Gender and Language Identification in Multilingual Models of Speech: Exploring the Genericity and Robustness of Speech Representations
SĂ©verine Guillaume, Maxime Fily, Alexis Michaud, Guillaume Wisniewski

Neural Compression Augmentation for Contrastive Audio Representation Learning
Zhaoyu Wang, Haohe Liu, Harry Coppock, Björn Schuller, Mark D. Plumbley

Post-Net: A linguistically inspired sequence-dependent transformed neural architecture for automatic syllable stress detection
Sai Harshitha Aluru, Jhansi Mallela, Chiranjeevi Yarra

Multi-Channel Speech Enhancement


Array Geometry-Robust Attention-Based Neural Beamformer for Moving Speakers
Marvin Tammen, Tsubasa Ochiai, Marc Delcroix, Tomohiro Nakatani, Shoko Araki, Simon Doclo

FoVNet: Configurable Field-of-View Speech Enhancement with Low Computation and Distortion for Smart Glasses
Zhongweiyang Xu, Ali Aroudi, Ke Tan, Ashutosh Pandey, Jung-Suk Lee, Buye Xu, Francesco Nesta

Audio Enhancement from Multiple Crowdsourced Recordings: A Simple and Effective Baseline
Shiran Aziz, Yossi Adi, Shmuel Peleg

DeFTAN-AA: Array Geometry Agnostic Multichannel Speech Enhancement
Dongheon Lee, Jung-Woo Choi

SA-MF: A Novel Self-Attention Mechanism for Multifeature Fusion in Speech Enhancement Networks
Ruizhe Wang

PLDNet: PLD-Guided Lightweight Deep Network Boosted by Efficient Attention for Handheld Dual-Microphone Speech Enhancement
Nan Zhou, Youhai Jiang, Jialin Tan, Chongmin Qi

Speech Synthesis: Paradigms and Methods 1


Highly Intelligible Speaker-Independent Articulatory Synthesis
Charles McGhee, Kate Knill, Mark Gales

An Attribute Interpolation Method in Speech Synthesis by Model Merging
Masato Murata, Koichi Miyazaki, Tomoki Koriyama

Low-dimensional Style Token Control for Hyperarticulated Speech Synthesis
Miku Nishihara, Dan Wells, Korin Richmond, Aidan Pine

Single-Codec: Single-Codebook Speech Codec towards High-Performance Speech Generation
Hanzhao Li, Liumeng Xue, Haohan Guo, Xinfa Zhu, Yuanjun Lv, Lei Xie, Yunlin Chen, Hao Yin, Zhifei Li

LiveSpeech: Low-Latency Zero-shot Text-to-Speech via Autoregressive Modeling of Audio Discrete Codes
Trung Dang, David Aponte, Dung Tran, Kazuhito Koishida

ClariTTS: Feature-ratio Normalization and Duration Stabilization for Code-mixed Multi-speaker Speech Synthesis
Changhwan Kim

Multi-modal Adversarial Training for Zero-Shot Voice Cloning
John Janiczek, Dading Chong, Dongyang Dai, Arlo Faria, Chao Wang, Tao Wang, Yuzong Liu

Learning Fine-Grained Controllability on Speech Generation via Efficient Fine-Tuning
Chung-Ming Chien, Andros Tjandra, Apoorv Vyas, Matt Le, Bowen Shi, Wei-Ning Hsu

Modeling Vocal Tract Like Acoustic Tubes Using the Immersed Boundary Method
Rongshuai Wu, Debasish Ray Mohapatra, Sidney Fels

Speech Synthesis: Paradigms and Methods 2


Small-E: Small Language Model with Linear Attention for Efficient Speech Synthesis
Théodor Lemerle, Nicolas Obin, Axel Roebel

Improving Robustness of LLM-based Speech Synthesis by Learning Monotonic Alignment
Paarth Neekhara, Shehzeen Hussain, Subhankar Ghosh, Jason Li, Boris Ginsburg

Synthesizing Long-Form Speech merely from Sentence-Level Corpus with Content Extrapolation and LLM Contextual Enrichment
Shijie Lai, Minglu He, Zijing Zhao, Kai Wang, Hao Huang, Jichen Yang

FluentEditor: Text-based Speech Editing by Considering Acoustic and Prosody Consistency
Rui Liu, Jiatian Xi, Ziyue Jiang, Haizhou Li

Phonetic Enhanced Language Modeling for Text-to-Speech Synthesis
Kun Zhou, Shengkui Zhao, Yukun Ma, Chong Zhang, Hao Wang, Dianwen Ng, Chongjia Ni, Trung Hieu Nguyen, Jia Qi Yip, Bin Ma

High Fidelity Text-to-Speech Via Discrete Tokens Using Token Transducer and Group Masked Language Model
Joun Yeop Lee, Myeonghun Jeong, Minchan Kim, Ji-Hyun Lee, Hoon-Young Cho, Nam Soo Kim

FastLips: an End-to-End Audiovisual Text-to-Speech System with Lip Features Prediction for Virtual Avatars
Martin Lenglet, Olivier Perrotin, Gerard Bailly

Neural Network Architectures for ASR 1


Contemplative Mechanism for Speech Recognition: Speech Encoders can Think
Tien-Ju Yang, Andrew Rosenberg, Bhuvana Ramabhadran

SummaryMixing: A Linear-Complexity Alternative to Self-Attention for Speech Recognition and Understanding
Titouan Parcollet, Rogier van Dalen, Shucong Zhang, Sourav Bhattacharya

Boosting Hybrid Autoregressive Transducer-based ASR with Internal Acoustic Model Training and Dual Blank Thresholding
Takafumi Moriya, Takanori Ashihara, Masato Mimura, Hiroshi Sato, Kohei Matsuura, Ryo Masumura, Taichi Asami

RepCNN: Micro-sized, Mighty Models for Wakeword Detection
Arnav Kundu, Prateeth Nayak, Priyanka Padmanabhan, Devang Naik

Conformer without Convolutions
Matthijs Van keirsbilck, Alexander Keller

Linear-Complexity Self-Supervised Learning for Speech Processing
Shucong Zhang, Titouan Parcollet, Rogier van Dalen, Sourav Bhattacharya

Error Correction and Rescoring


SALSA: Speedy ASR-LLM Synchronous Aggregation
Ashish Mittal, Darshan Prabhu, Sunita Sarawagi, Preethi Jyothi

LI-TTA: Language Informed Test-Time Adaptation for Automatic Speech Recognition
Eunseop Yoon, Hee Suk Yoon, John Harvill, Mark Hasegawa-Johnson, Chang D. Yoo

HypR: A comprehensive study for ASR hypothesis revising with a reference corpus
Yi-Wei Wang, Ke-Han Lu, Kuan-Yu Chen

Error Correction by Paying Attention to Both Acoustic and Confidence References for Automatic Speech Recognition
Yuchun Shu, Bo Hu, Yifeng He, Hao Shi, Longbiao Wang, Jianwu Dang

Transformer-based Model for ASR N-Best Rescoring and Rewriting
Iwen E Kang, Christophe Van Gysel, Man-Hung Siu

RASU: Retrieval Augmented Speech Understanding through Generative Modeling
Hao Yang, Min Zhang, Minghan Wang, Jiaxin Guo

Spoken Language Understanding


Towards Unified Evaluation of Continual Learning in Spoken Language Understanding
Muqiao Yang, Xiang Li, Umberto Cappellazzo, Shinji Watanabe, Bhiksha Raj

Convolutional gated MLP and attention improve end-to-end spoken language understanding
Beida Zheng, Mijit Ablimit, Hankiz Yilahun, Askar Hamdulla

An Uyghur Extension to the MASSIVE Multi-lingual Spoken Language Understanding Corpus with Comprehensive Evaluations
Ainikaerjiang Aimaiti, Di Wu, Liting Jiang, Gulinigeer Abudouwaili, Hao Huang, Wushour Silamu

This Paper Had the Smartest Reviewers - Flattery Detection Utilising an Audio-Textual Transformer-Based Approach
Lukas Christ, Shahin Amiriparian, Friederike Hawighorst, Ann-Kathrin Schill, Angelo Boutalikakis, Lorenz Graf-Vlachy, Andreas König, Björn Schuller

Unified Framework for Spoken Language Understanding and Summarization in Task-Based Human Dialog processing
Eunice Akani, Frederic Bechet, BenoĂźt Favre, Romain Gemignani

Automated Human-Readable Label Generation in Open Intent Discovery
Grant Anderson, Emma Hart, Dimitra Gkatzia, Ian Beaver

Applying Reinforcement Learning and Multi-Generators for Stage Transition in an Emotional Support Dialogue System
Jeremy Chang, Kuan-Yu Chen, Chung-Hsien Wu

Spoken Dialogue Systems and Conversational Analysis 2


Target conversation extraction: Source separation using turn-taking dynamics
Tuochao Chen, Qirui Wang, Bohan Wu, Malek Itani, Emre Sefik Eskimez, Takuya Yoshioka, Shyamnath Gollakota

Investigating the Influence of Stance-Taking on Conversational Timing of Task-Oriented Speech
Sara Ng, Gina-Anne Levow, Mari Ostendorf, Richard Wright

Detecting the terminality of speech-turn boundary for spoken interactions in French TV and Radio content
RĂ©mi Uro, Marie Tahon, David Doukhan, Antoine Laurent, Albert Rilliard

Utilization of Text Data for Response Timing Detection in Attentive Listening
Yu Watanabe, Koichiro Ito, Shigeki Matsubara

Backchannel prediction, based on who, when and what
Yo-Han Park, Wencke Liermann, Yong-Seok Choi, Seung Hi Kim, Jeong-Uk Bang, Seung Yun, Kong Joo Lee

Uh, um and mh: Are filled pauses prone to conversational convergence?
Mathilde Hutin, Junfei Hu, Liesbeth Degand

Investigation of look-ahead techniques to improve response time in spoken dialogue system
Masaya Ohagi, Tomoya Mizumoto, Katsumasa Yoshikawa

Computational Models of Human Language Acquisition, Perception, and Production (Special Session)


Information-theoretic hypothesis generation of relative cue weighting for the voicing contrast
Annika Heuser, Jianjing Kuang

Neurocomputational model of speech recognition for pathological speech detection: a case study on Parkinson’s disease speech detection
Sevada Hovsepyan, Mathew Magimai.-Doss

Simulating articulatory trajectories with phonological feature interpolation
Angelo Ortiz Tandazo, Thomas Schatz, Thomas Hueber, Emmanuel Dupoux

A Pilot Study of GSLM-based Simulation of Foreign Accentuation Only Using Native Speech Corpora
Kentaro Onda, Joonyong Park, Nobuaki Minematsu, Daisuke Saito

Dirichlet process mixture model based on topologically augmented signal representation for clustering infant vocalizations
Guillem Bonafos, Clara Bourot, Pierre Pudlo, Jean-Marc Freyermuth, Laurence Reboul, Samuel Tronçon, Arnaud Rey

A data-driven model of acoustic speech intelligibility for optimization-based models of speech production
Benjamin Elie, Juraj Simko, Alice Turk

The Difficulty and Importance of Estimating the Lower and Upper Bounds of Infant Speech Exposure
Joseph Coffey, Okko RÀsÀnen, Camila Scaff, Alejandrina Cristia

Spoken-Term Discovery using Discrete Speech Units
Benjamin van Niekerk, Julian Zaïdi, Marc-André Carbonneau, Herman Kamper

Orthogonality and isotropy of speaker and phonetic information in self-supervised speech representations
Mukhtar Mohamed, Oli Danyi Liu, Hao Tang, Sharon Goldwater

Show and Tell 3


OCEAN-AI: open multimodal framework for personality traits assessment and HR-processes automatization
Elena Ryumina, Dmitry Ryumin, Alexey Karpov

VoxMed: one-step respiratory disease classifier using digital stethoscope sounds
Paridhi Mundra, Manik Sharma, Yashwardhan Chaudhuri, Orchid Chetia Phukan, Arun Balaji Buduru

AVR: synergizing foundation models for audio-visual humor detection
Sarthak Sharma, Orchid Chetia Phukan, Drishti Singh, Arun Balaji Buduru, Rajesh Sharma

ASGIR: audio spectrogram transformer guided classification and information retrieval for birds
Yashwardhan Chaudhuri, Paridhi Mundra, Arnesh Batra, Orchid Chetia Phukan, Arun Balaji Buduru

PERSONA: an application for emotion recognition, gender recognition and age estimation
Devyani Koshal, Orchid Chetia Phukan, Sarthak Jain, Arun Balaji Buduru, Rajesh Sharma

NeuRO: an application for code-switched autism detection in children
Mohd Mujtaba Akhtar, Girish, Orchid Chetia Phukan, Muskaan Singh

ComFeAT: combination of neural and spectral features for improved depression detection
Orchid Chetia Phukan, Sarthak Jain, Shubham Singh, Muskaan Singh, Arun Balaji Buduru, Rajesh Sharma

The reasonable effectiveness of speaker embeddings for violence detection
Sarthak Jain, Orchid Chetia Phukan, Arun Balaji Buduru, Rajesh Sharma

ATTEST: an analytics tool for the testing and evaluation of speech technologies
Dmitrii Obukhov, Marcel de Korte, Andrey Adaschik

PhoneViz: exploring alignments at a glance
Margot Masson, Erfan A. Shams, Iona Gessinger, Julie Carson-Berndsen

Gryannote open-source speaker diarization labeling tool
Clément Pages, Hervé Bredin

A toolkit for joint speaker diarization and identification with application to speaker-attributed ASR
Giovanni Morrone, Enrico Zovato, Fabio Brugnara, Enrico Sartori, Leonardo Badino

Phonetics, Phonology and Prosody


Speaker Detection by the Individual Listener and the Crowd: Parametric Models Applicable to Bonafide and Deepfake Speech
Tomi H. Kinnunen, Rosa Gonzalez HautamÀki, Xin Wang, Junichi Yamagishi

NumberLie: a game-based experiment to understand the acoustics of deception and truthfulness
Alessandro De Luca, Andrew Clark, Volker Dellwo

Preservation, conservation and phonetic study of the voices of Italian poets: A study on the seven years of the VIP archive
Federico Lo Iacono, Valentina Colonna, Antonio Romano

Do Speaker-dependent Vowel Characteristics depend on Speech Style?
Nicolas Audibert, Cecile Fougeron, Christine Meunier

A comparison of voice similarity through acoustics, human perception and deep neural network (DNN) speaker verification systems
Suyuan Liu, Molly Babel, Jian Zhu

Evaluating Italian Vowel Variation with the Recurrent Neural Network Phonet
Austin Jones, Margaret E. L. Renwick

Prosodic marking of syntactic boundaries in Khoekhoe
Kira Tulchynska, Sylvanus Job, Alena Witzlack-Makarevich, Margaret Zellers

New Avenues in Emotion Recognition


Can Modelling Inter-Rater Ambiguity Lead To Noise-Robust Continuous Emotion Predictions?
Ya-Tse Wu, Jingyao Wu, Vidhyasaharan Sethu, Chi-Chun Lee

MFDR: Multiple-stage Fusion and Dynamically Refined Network for Multimodal Emotion Recognition
Ziping Zhao, Tian Gao, Haishuai Wang, Björn Schuller

Multimodal Fusion of Music Theory-Inspired and Self-Supervised Representations for Improved Emotion Recognition
Xiaohan Shi, Xingfeng Li, Tomoki Toda

Enrolment-based personalisation for improving individual-level fairness in speech emotion recognition
Andreas Triantafyllopoulos, Björn Schuller

Keep, Delete, or Substitute: Frame Selection Strategy for Noise-Robust Speech Emotion Recognition
Seong-Gyun Leem, Daniel Fulford, Jukka-Pekka Onnela, David Gard, Carlos Busso

Hierarchical Distribution Adaptation for Unsupervised Cross-corpus Speech Emotion Recognition
Cheng Lu, Yuan Zong, Yan Zhao, Hailun Lian, Tianhua Qi, Björn Schuller, Wenming Zheng

Speaker Diarization 2


Variable Segment Length and Domain-Adapted Feature Optimization for Speaker Diarization
Chenyuan Zhang, Linkai Luo, Hong Peng, Wei Wen

Efficient Speaker Embedding Extraction Using a Twofold Sliding Window Algorithm for Speaker Diarization
Jeong-Hwan Choi, Ye-Rin Jeoung, Ilseok Kim, Joon-Hyuk Chang

DiarizationLM: Speaker Diarization Post-Processing with Large Language Models
Quan Wang, Yiling Huang, Guanlong Zhao, Evan Clark, Wei Xia, Hank Liao

Joint vs Sequential Speaker-Role Detection and Automatic Speech Recognition for Air-traffic Control
Alexander Blatt, Aravind Krishnan, Dietrich Klakow

On the calibration of powerset speaker diarization models
Alexis Plaquet, Hervé Bredin

Specializing Self-Supervised Speech Representations for Speaker Segmentation
Séverin Baroudi, Thomas Pellegrini, Hervé Bredin

Speaker Recognition 2


On the Usefulness of Speaker Embeddings for Speaker Retrieval in the Wild: A Comparative Study of x-vector and ECAPA-TDNN Models
Erfan Loweimi, Mengjie Qian, Kate Knill, Mark Gales

W-GVKT: Within-Global-View Knowledge Transfer for Speaker Verification
Zezhong Jin, Youzhi Tu, Man-Wai Mak

CEC: A Noisy Label Detection Method for Speaker Recognition
Yao Shen, Yingying Gao, Yaqian Hao, Chenguang Hu, Fulin Zhang, Junlan Feng, Shilei Zhang

Disentangling Age and Identity with a Mutual Information Minimization for Cross-Age Speaker Verification
Fengrun Zhang, Wangjin Zhou, Yiming Liu, Wang Geng, Yahui Shan, Chen Zhang

Contrastive Learning and Inter-Speaker Distribution Alignment Based Unsupervised Domain Adaptation for Robust Speaker Verification
Zuoliang Li, Wu Guo, Bin Gu, Shengyu Peng, Jie Zhang

Identifying Speakers in Dialogue Transcripts: A Text-based Approach Using Pretrained Language Models
Minh Nguyen, Franck Dernoncourt, Seunghyun Yoon, Hanieh Deilamsalehy, Hao Tan, Ryan Rossi, Quan Hung Tran, Trung Bui, Thien Huu Nguyen

Attention-augmented X-vectors for the Evaluation of Mimicked Speech Using Sparse Autoencoder-LSTM framework
Bhasi K. C., Rajeev Rajan, Noumida A

Speech and Audio Analysis


Predefined Prototypes for Intra-Class Separation and Disentanglement
Antonio Almudévar, Théo Mariotte, Alfonso Ortega, Marie Tahon, Luis Vicente, Antonio Miguel, Eduardo Lleida

VAE-based Phoneme Alignment Using Gradient Annealing and SSL Acoustic Features
Tomoki Koriyama

A Transformer-Based Voice Activity Detector
Biswajit Karan, Joshua Jansen van VĂŒren, Febe de Wet, Thomas Niesler

XANE: eXplainable Acoustic Neural Embeddings
Sri Harsha Dumpala, Dushyant Sharma, Chandramouli Shama Sastry, Stanislav Kruchinin, James Fosburgh, Patrick A. Naylor

A comparative analysis of sequential models that integrate syllable dependency for automatic syllable stress detection
Jhansi Mallela, Sai Harshitha Aluru, Chiranjeevi Yarra

Motion Based Audio-Visual Segmentation
Jiahao Li, Miao Liu, Shu Yang, Jing Wang, Xiang Xie

Speech Quality and Intelligibility: Prediction and Enhancement


Transfer Learning from Whisper for Microscopic Intelligibility Prediction
Paul Best, Santiago Cuervo, Ricard Marxer

Non-Intrusive Speech Intelligibility Prediction for Hearing Aids using Whisper and Metadata
Ryandhimas E. Zezario, Fei Chen, Chiou-Shann Fuh, Hsin-Min Wang, Yu Tsao

No-Reference Speech Intelligibility Prediction Leveraging a Noisy-Speech ASR Pre-Trained Model
Haolan Wang, Amin Edraki, Wai-Yip Chan, IvĂĄn LĂłpez-Espejo, Jesper Jensen

The PESQetarian: On the Relevance of Goodhart’s Law for Speech Enhancement
Danilo de Oliveira, Simon Welker, Julius Richter, Timo Gerkmann

Enhancing Non-Matching Reference Speech Quality Assessment through Dynamic Weight Adaptation
Bao Thang Ta, Van Hai Do, Huynh Thi Thanh Binh

Exploring Sentence Type Effects on the Lombard Effect and Intelligibility Enhancement: A Comparative Study of Natural and Grid Sentences
Hongyang Chen, Yuhong Yang, Zhongyuan Wang, Weiping Tu, Haojun Ai, Cedar Lin

Speech Synthesis: Vocoders


FreeV: Free Lunch For Vocoders Through Pseudo Inversed Mel Filter
Yuanjun Lv, Hai Li, Ying Yan, Junhui Liu, Danming Xie, Lei Xie

QGAN: Low Footprint Quaternion Neural Vocoder for Speech Synthesis
Aryan Chaudhary, Vinayak Abrol

JenGAN: Stacked Shifted Filters in GAN-Based Speech Synthesis
Hyunjae Cho, Junhyeok Lee, Wonbin Jung

FA-GAN: Artifacts-free and Phase-aware High-fidelity GAN-based Vocoder
Rubing Shen, Yanzhen Ren, Zongkun Sun

QHM-GAN: Neural Vocoder based on Quasi-Harmonic Modeling
Shaowen Chen, Tomoki Toda

BiVocoder: A Bidirectional Neural Vocoder Integrating Feature Extraction and Waveform Generation
Hui-Peng Du, Ye-Xin Lu, Yang Ai, Zhen-Hua Ling

ASR Model Training Methods


Investigating the Effect of Label Topology and Training Criterion on ASR Performance and Alignment Quality
Tina Raissi, Christoph Lüscher, Simon Berger, Ralf Schlüter, Hermann Ney

ASTRA: Aligning Speech and Text Representations for Asr without Sampling
Neeraj Gaur, Rohan Agrawal, Gary Wang, Parisa Haghani, Andrew Rosenberg, Bhuvana Ramabhadran

Contextualized End-to-end Automatic Speech Recognition with Intermediate Biasing Loss
Muhammad Shakeel, Yui Sudo, Yifan Peng, Shinji Watanabe

Learnable Layer Selection and Model Fusion for Speech Self-Supervised Learning Models
Sheng-Chieh Chiu, Chia-Hua Wu, Jih-Kang Hsieh, Yu Tsao, Hsin-Min Wang

Sequential Editing for Lifelong Training of Speech Recognition Models
Devang Kulshreshtha, Nikolaos Pappas, Brady Houston, Saket Dingliwal, Srikanth Ronanki

Cross-Modality Diffusion Modeling and Sampling for Speech Recognition
Chia-Kai Yeh, Chih-Chun Chen, Ching-Hsien Hsu, Jen-Tzung Chien

Cross-Lingual and Multilingual Processing


A Parameter-efficient Language Extension Framework for Multilingual ASR
Wei Liu, Jingyong Hou, Dong Yang, Muyong Cao, Tan Lee

LoRA-Whisper: Parameter-Efficient and Extensible Multilingual ASR
Zheshu Song, Jianheng Zhuo, Yifan Yang, Ziyang Ma, Shixiong Zhang, Xie Chen

mHuBERT-147: A Compact Multilingual HuBERT Model
Marcely Zanon Boito, Vivek Iyer, Nikolaos Lagos, Laurent Besacier, Ioan Calapodescu

All Ears: Building Self-Supervised Learning based ASR models for Indian Languages at scale
Vasista Sai Lodagala, Abhishek Biswas, Shoutrik Das, Jordan F, S Umesh

A Unified Approach to Multilingual Automatic Speech Recognition with Improved Language Identification for Indic Languages
Nikhil Jakhar, Sudhanshu Srivastava, Arun Baby

Integrating Speech Self-Supervised Learning Models and Large Language Models for ASR
Ling Dong, Zhengtao Yu, Wenjun Wang, Yuxin Huang, Shengxiang Gao, Guojiang Zhou

On the Effects of Heterogeneous Data Sources on Speech-to-Text Foundation Models
Jinchuan Tian, Yifan Peng, William Chen, Kwanghee Choi, Karen Livescu, Shinji Watanabe

Less is More: Accurate Speech Recognition & Translation without Web-Scale Data
Krishna C. Puvvada, Piotr Żelasko, He Huang, Oleksii Hrinchuk, Nithin Rao Koluguri, Kunal Dhawan, Somshubra Majumdar, Elena Rastorgueva, Zhehuai Chen, Vitaly Lavrukhin, Jagadeesh Balam, Boris Ginsburg

The Greek podcast corpus: Competitive speech models for low-resourced languages with weakly supervised data
Georgios Paraskevopoulos, Chara Tsoukala, Athanasios Katsamanis, Vassilis Katsouros

Speech Recognition for Greek Dialects: A Challenging Benchmark
Socrates Vakirtzian, Chara Tsoukala, Stavros Bompolas, Katerina Mouzou, Vivian Stamou, Georgios Paraskevopoulos, Antonios Dimakis, Stella Markantonatou, Angela Ralli, Antonios Anastasopoulos

LUPET: Incorporating Hierarchical Information Path into Multilingual ASR
Wei Liu, Jingyong Hou, Dong Yang, Muyong Cao, Tan Lee

Whispering in Norwegian: Navigating Orthographic and Dialectic Challenges
Per E Kummervold, Javier de la Rosa, Freddy Wetjen, Rolv-Arild Braaten, Per Erik Solberg

EFFUSE: Efficient Self-Supervised Feature Fusion for E2E ASR in Low Resource and Multilingual Scenarios
Tejes Srivastava, Jiatong Shi, William Chen, Shinji Watanabe

Enhancing Neural Transducer for Multilingual ASR with Synchronized Language Diarization
Amir Hussein, Desh Raj, Matthew Wiesner, Daniel Povey, Paola Garcia, Sanjeev Khudanpur

SC-MoE: Switch Conformer Mixture of Experts for Unified Streaming and Non-streaming Code-Switching ASR
Shuaishuai Ye, Shunfei Chen, Xinhui Hu, Xinkang Xu

Speech Assessment


Optimizing Automatic Speech Assessment: W-RankSim Regularization and Hybrid Feature Fusion Strategies
Chung-Wen Wu, Berlin Chen

Context-Aware Speech Recognition Using Prompts for Language Learners
Jian Cheng

A Dataset and Two-pass System for Reading Miscue Detection
Raj Gothi, Rahul Kumar, Mildred Pereira, Nagesh Nayak, Preeti Rao

Oversampling, Augmentation and Curriculum Learning for Speaking Assessment with Limited Training Data
Tin Mei Lun, Ekaterina Voskoboinik, Ragheb Al-Ghezi, Tamas Grosz, Mikko Kurimo

Analysis and Visualization of Directional Diversity in Listening Fluency of World Englishes Speakers in the Framework of Mutual Shadowing
Yu Tomita, Yingxiang Gao, Nobuaki Minematsu, Noriko Nakanishi, Daisuke Saito

Quantifying the Role of Textual Predictability in Automatic Speech Recognition
Sean Robertson, Gerald Penn, Ewan Dunbar

Question Answering from Speech and Spoken Dialogue Systems


TM-PATHVQA: 90000+ Textless Multilingual Questions for Medical Visual Question Answering
Tonmoy Rajkhowa, Amartya Roy Chowdhury, Sankalp Nagaonkar, Achyut Mani Tripathi, Mahadeva Prasanna

Towards Multilingual Audio-Visual Question Answering
Orchid Chetia Phukan, Priyabrata Mallick, Swarup Ranjan Behera, Aalekhya Satya Narayani, Arun Balaji Buduru, Rajesh Sharma

Reinforcement Learning from Answer Reranking Feedback for Retrieval-Augmented Answer Generation
Minh Nguyen, Toan Quoc Nguyen, Kishan KC, Zeyu Zhang, Thuy Vu

Instruction Data Generation and Unsupervised Adaptation for Speech Language Models
Vahid Noroozi, Zhehuai Chen, Somshubra Majumdar, Steve Huang, Jagadeesh Balam, Boris Ginsburg

On the Use of Plausible Arguments in Explainable Conversational AI
Martina Di Bratto, Maria Di Maro, Antonio Origlia

Rapport-Driven Virtual Agent: Rapport Building Dialogue Strategy for Improving User Experience at First Meeting
Muhammad Yeza Baihaqi, Angel Garcia Contreras, Seiya Kawano, Koichiro Yoshino

Cross-Modal Denoising: A Novel Training Paradigm for Enhancing Speech-Image Retrieval
Lifeng Zhou, Yuke Li, Rui Deng, Yuting Yang, Haoqi Zhu

Spoken Dialogue Systems and Conversational Analysis 3


MM-NodeFormer: Node Transformer Multimodal Fusion for Emotion Recognition in Conversation
Zilong Huang, Man-Wai Mak, Kong Aik Lee

Emotional Cues Extraction and Fusion for Multi-modal Emotion Prediction and Recognition in Conversation
Haoxiang Shi, Ziqi Liang, Jun Yu

Participant-Pair-Wise Bottleneck Transformer for Engagement Estimation from Video Conversation
Keita Suzuki, Nobukatsu Hojo, Kazutoshi Shinoda, Saki Mizuno, Ryo Masumura

Well, what can you do with messy data? Exploring the prosody and pragmatic function of the discourse marker “well” with found data and speech synthesis
Johannah O’Mahony, Catherine Lai, Éva Székely

Learning from Multiple Annotator Biased Labels in Multimodal Conversation
Kazutoshi Shinoda, Nobukatsu Hojo, Saki Mizuno, Keita Suzuki, Satoshi Kobashikawa, Ryo Masumura

Non-Linear Inference Time Intervention: Improving LLM Truthfulness
Jakub Hoscilowicz, Adam Wiacek, Jan Chojnacki, Adam Cieslak, Leszek Michon, Artur Janicki

Evaluating Speech Recognition Performance Towards Large Language Model Based Voice Assistants
Zhe Liu, Suyoun Kim, Ozlem Kalinli

Dysarthric Speech Assessment


Automatic Assessment of Dysarthria using Speech and synthetically generated Electroglottograph signal
Fathima Zaheera, Supritha Shetty, Gayadhar Pradhan, Deepak K T

CDSD: Chinese Dysarthria Speech Database
Yan Wan, Mengyi Sun, Xinchen Kang, Jingting Li, Pengfei Guo, Ming Gao, Su-Jing Wang

Exploring Syllable Discriminability during Diadochokinetic Task with Increasing Dysarthria Severity for Patients with Amyotrophic Lateral Sclerosis
Neelesh Samptur, Tanuka Bhattacharjee, Anirudh Chakravarty K, Seena Vengalil, Yamini Belur, Atchayaram Nalini, Prasanta Kumar Ghosh

Beyond Binary: Multiclass Paraphasia Detection with Generative Pretrained Transformers and End-to-End Models
Matthew Perez, Aneesha Sampath, Minxue Niu, Emily Mower Provost

Electroglottography for the assessment of dysphonia in Parkinson’s disease and multiple system atrophy
Khalid Daoudi, Solange Milhé de Saint Victor, Alexandra Foubert-Samier, Margherita Fabbri, Anne Pavy-Le Traon, Olivier Rascol, Virginie Woisard, Wassilios G. Meissner

CoLM-DSR: Leveraging Neural Codec Language Modeling for Multi-Modal Dysarthric Speech Reconstruction
Xueyuan Chen, Dongchao Yang, Dingdong Wang, Xixin Wu, Zhiyong Wu, Helen Meng

Spoken Language Models for Universal Speech Processing (Special Session)


On the Effectiveness of Acoustic BPE in Decoder-Only TTS
Bohan Li, Feiyu Shen, Yiwei Guo, Shuai Wang, Xie Chen, Kai Yu

Exploring In-Context Learning of Textless Speech Language Model for Speech Classification Tasks
Kai-Wei Chang, Ming-Hao Hsu, Shan-Wen Li, Hung-yi Lee

Understanding Sounds, Missing the Questions: The Challenge of Object Hallucination in Large Audio-Language Models
Chun-Yi Kuan, Wei-Ping Huang, Hung-yi Lee

Can Large Language Models Understand Spatial Audio?
Changli Tang, Wenyi Yu, Guangzhi Sun, Xianzhao Chen, Tian Tan, Wei Li, Jun Zhang, Lu Lu, Zejun Ma, Yuxuan Wang, Chao Zhang

DiscreteSLU: A Large Language Model with Self-Supervised Discrete Speech Units for Spoken Language Understanding
Suwon Shon, Kwangyoun Kim, Yi-Te Hsu, Prashant Sridhar, Shinji Watanabe, Karen Livescu

DeSTA: Enhancing Speech Language Models through Descriptive Speech-Text Alignment
Ke-Han Lu, Zhehuai Chen, Szu-Wei Fu, He Huang, Boris Ginsburg, Yu-Chiang Frank Wang, Hung-yi Lee

COSMIC: Data Efficient Instruction-tuning For Speech In-Context Learning
Jing Pan, Jian Wu, Yashesh Gaur, Sunit Sivasankaran, Zhuo Chen, Shujie Liu, Jinyu Li

NAST: Noise Aware Speech Tokenization for Speech Language Models
Shoval Messica, Yossi Adi

Low Bitrate High-Quality RVQGAN-based Discrete Speech Tokenizer
Slava Shechtman, Avihu Dekel

L1/L2 Acquisition and Cross-Linguistic Factors


Acquisition of high vowel devoicing in Japanese: A production experiment with three and four year olds
Hyun Kyung Hwang, Manami Hirayama

The Production of Contrastive Focus by 7 to 13-year-olds Learning Mandarin Chinese
Zimeng Li, Zhongxuan Mao, Shengting Shen, Ivan Yuen, Ping Tang

Cross-Linguistic Intelligibility of Non-Compositional Expressions in Spoken Context
Iuliia Zaitova, Irina Stenger, Wei Xue, Tania Avgustinova, Bernd Möbius, Dietrich Klakow

On the relationship between speech production and vocabulary size in 3-5 year olds
Alexis DeMaere, Nicole van Rootselaar, Fangfang Li, Robbin Gibb, Claudia L. R. Gonzalez

Towards Classifying Mother Tongue from Infant Cries - Findings Substantiating Prenatal Learning Theory
Tim Polzehl, Tim Herzig, Friedrich Wicke, Kathleen Wermke, Razieh Khamsehashari, Michiko Dahlem, Sebastian Möller

Effect of Complex Boundary Tones on Tone Identification: An Experimental Study with Mandarin-speaking Preschool Children
Aijun Li, Jun Gao, Zhiwei Wang

Ethnolinguistic Identification of Vietnamese-German Heritage Speech
Thanh Lan Truong, Andrea Weber

Experimental Phonetics and Laboratory Phonology


Age-related Differences in Acoustic Cues for the Perception of Checked Syllables in Shengzhou Wu
Bingliang Zhao, Jiangping Kong, Xiyu Wu

Quantity-sensitivity affects recall performance of word stress
Constantijn Kaland, Maria Lialiou

Phonological Symmetry Does Not Predict Generalization of Perceptual Adaptation to Vowels
Zuheyra Tokac, Jennifer Cole

Perceptual Learning in Lexical Tone: Phonetic Similarity vs. Phonological Categories
Ariëlle Reitsema, Chenxin Li, Leanne van Lambalgen, Laura Preining, Saskia Galindo Jong, Qing Yang, Xinyi Wen, Yiya Chen

Modeling probabilistic reduction across domains with Naive Discriminative Learning
Anna Stein, Kevin Tang

Do we EXPECT TO find phonetic traces for syntactic traces?
Jonathan Him Nok Lee, Mark Liberman, Martin Salzmann

Speaker recognition evaluation and resources


VoxBlink2: A 100K+ Speaker Recognition Corpus and the Open-Set Speaker-Identification Benchmark
Yuke Lin, Ming Cheng, Fulin Zhang, Yingying Gao, Shilei Zhang, Ming Li

As Biased as You Measure: Methodological Pitfalls of Bias Evaluations in Speaker Verification Research
Wiebke Hutiri, Tanvina Patel, Aaron Yi Ding, Odette Scharenborg

WeSep: A Scalable and Flexible Toolkit Towards Generalizable Target Speaker Extraction
Shuai Wang, Ke Zhang, Shaoxiong Lin, Junjie Li, Xuefei Wang, Meng Ge, Jianwei Yu, Yanmin Qian, Haizhou Li

ESPnet-SPK: full pipeline speaker embedding toolkit with reproducible recipes, self-supervised front-ends, and off-the-shelf models
Jee-weon Jung, Wangyou Zhang, Jiatong Shi, Zakaria Aldeneh, Takuya Higuchi, Alex Gichamba, Barry-John Theobald, Ahmed Hussen Abdelaziz, Shinji Watanabe

Active Speaker Detection in Fisheye Meeting Scenes with Scene Spatial Spectrums
Xinghao Huang, Weiwei Jiang, Long Rao, Wei Xu, Wenqing Cheng

VSASV: a Vietnamese Dataset for Spoofing-Aware Speaker Verification
Vu Hoang, Viet Thanh Pham, Hoa Nguyen Xuan, Pham Nhi, Phuong Dat, Thi Thu Trang Nguyen

Speech Type Classification


E-ODN: An Emotion Open Deep Network for Generalised and Adaptive Speech Emotion Recognition
Liuxian Ma, Lin Shen, Ruobing Li, Haojie Zhang, Kun Qian, Bin Hu, Björn W. Schuller, Yoshiharu Yamamoto

Enhancing Multilingual Voice Toxicity Detection with Speech-Text Alignment
Joseph Liu, Mahesh Kumar Nandwana, Janne Pylkkönen, Hannes Heikinheimo, Morgan McGuire

AraOffence: Detecting Offensive Speech Across Dialects in Arabic Media
Youssef Nafea, Shady Shehata, Zeerak Talat, Ahmed Aboeitta, Ahmed Sharshar, Preslav Nakov

CogniVoice: Multimodal and Multilingual Fusion Networks for Mild Cognitive Impairment Assessment from Spontaneous Speech
Jiali Cheng, Mohamed Elgaar, Nidhi Vakil, Hadi Amiri

Speech Topic Classification Based on Multi-Scale and Graph Attention Networks
Fangjing Niu, Xiaozhe Qi, Xinya Chen, Liang He

Enhancing Speech and Music Discrimination Through the Integration of Static and Dynamic Features
Liangwei Chen, Xiren Zhou, Qiang Tu, Huanhuan Chen

Target Speaker Extraction


Binaural Selective Attention Model for Target Speaker Extraction
Hanyu Meng, Qiquan Zhang, Xiangyu Zhang, Vidhyasaharan Sethu, Eliathamby Ambikairajah

All Neural Low-latency Directional Speech Extraction
Ashutosh Pandey, Sanha Lee, Juan Azcarreta, Daniel Wong, Buye Xu

Centroid Estimation with Transformer-Based Speaker Embedder for Robust Target Speaker Extraction
Woon-Haeng Heo, Joongyu Maeng, Yoseb Kang, Namhyun Cho

Knowledge boosting during low-latency inference
Vidya Srinivas, Malek Itani, Tuochao Chen, Emre Sefik Eskimez, Takuya Yoshioka, Shyamnath Gollakota

Unified Audio Visual Cues for Target Speaker Extraction
Tianci Wu, Shulin He, Jiahui Pan, Haifeng Huang, Zhijian Mo, Xueliang Zhang

Target Speaker Extraction with Curriculum Learning
Yun Liu, Xuechen Liu, Xiaoxiao Miao, Junichi Yamagishi

Speech Synthesis: Voice Conversion 3


SPA-SVC: Self-supervised Pitch Augmentation for Singing Voice Conversion
Bingsong Bai, Fengping Wang, Yingming Gao, Ya Li

Towards Naturalistic Voice Conversion: NaturalVoices Dataset with an Automatic Processing Pipeline
Ali N. Salman, Zongyang Du, Shreeram Suresh Chandra, İsmail Rasim Ülgen, Carlos Busso, Berrak Sisman

PRVAE-VC2: Non-Parallel Voice Conversion by Distillation of Speech Representations
Kou Tanaka, Hirokazu Kameoka, Takuhiro Kaneko, Yuto Kondo

HybridVC: Efficient Voice Style Conversion with Text and Audio Prompts
Xinlei Niu, Jing Zhang, Charles Patrick Martin

DreamVoice: Text-Guided Voice Conversion
Jiarui Hai, Karan Thakkar, Helin Wang, Zengyi Qin, Mounya Elhilali

Hear Your Face: Face-based voice conversion with F0 estimation
Jaejun Lee, Yoori Oh, Injune Hwang, Kyogu Lee

Accent Conversion with Articulatory Representations
Yashish M. Siriwardena, Nathan Swedlow, Audrey Howard, Evan Gitterman, Dan Darcy, Carol Espy-Wilson, Andrea Fanelli

USD-AC: Unsupervised Speech Disentanglement for Accent Conversion
Jen-Hung Huang, Wei-Tsung Lee, Chung-Hsien Wu

Knowledge Distillation from Self-Supervised Representation Learning Model with Discrete Speech Units for Any-to-Any Streaming Voice Conversion
Hiroki Kanagawa, Yusuke Ijima

Speech Synthesis: Paradigms and Methods 3


SimpleSpeech: Towards Simple and Efficient Text-to-Speech with Scalar Latent Transformer Diffusion Models
Dongchao Yang, Dingdong Wang, Haohan Guo, Xueyuan Chen, Xixin Wu, Helen Meng

Sample-Efficient Diffusion for Text-To-Speech Synthesis
Justin Lovelace, Soham Ray, Kwangyoun Kim, Kilian Q. Weinberger, Felix Wu

Exploring the Robustness of Text-to-Speech Synthesis Based on Diffusion Probabilistic Models to Heavily Noisy Transcriptions
Jingyi Feng, Yusuke Yasuda, Tomoki Toda

VoiceTailor: Lightweight Plug-In Adapter for Diffusion-Based Personalized Text-to-Speech
Heeseung Kim, Sang-gil Lee, Jiheum Yeom, Che Hyun Lee, Sungwon Kim, Sungroh Yoon

PitchFlow: adding pitch control to a Flow-matching based TTS model
Tasnima Sadekova, Mikhail Kudinov, Vadim Popov, Assel Yermekova, Artem Khrapov

DualSpeech: Enhancing Speaker-Fidelity and Text-Intelligibility Through Dual Classifier-Free Guidance
Jinhyeok Yang, Junhyeok Lee, Hyeong-Seok Choi, Seunghoon Ji, Hyeongju Kim, Juheon Lee

Generating Speakers by Prompting Listener Impressions for Pre-trained Multi-Speaker Text-to-Speech Systems
Zhengyang Chen, Xuechen Liu, Erica Cooper, Junichi Yamagishi, Yanmin Qian

TacoLM: GaTed Attention Equipped Codec Language Model are Efficient Zero-Shot Text to Speech Synthesizers
Yakun Song, Zhuo Chen, Xiaofei Wang, Ziyang Ma, Guanrou Yang, Xie Chen

Privacy and Security in Speech Communication 2


Anonymising Elderly and Pathological Speech: Voice Conversion Using DDSP and Query-by-Example
Suhita Ghosh, Melanie Jouaiti, Arnab Das, Yamini Sinha, Tim Polzehl, Ingo Siegert, Sebastian Stober

Asynchronous Voice Anonymization Using Adversarial Perturbation On Speaker Embedding
Rui Wang, Liping Chen, Kong Aik Lee, Zhen-Hua Ling

Probing the Feasibility of Multilingual Speaker Anonymization
Sarina Meyer, Florian Lux, Ngoc Thang Vu

DiffVC+: Improving Diffusion-based Voice Conversion for Speaker Anonymization
Fan Huang, Kun Zeng, Wei Zhu

Streaming ASR


Learning from Back Chunks: Acquiring More Future Knowledge for Streaming ASR Models via Self Distillation
Yuting Yang, Guodong Ma, Yuke Li, Binbin Du, Haoqi Zhu, Liang Ruan

Decoder-only Architecture for Streaming End-to-end Speech Recognition
Emiru Tsunoo, Hayato Futami, Yosuke Kashiwagi, Siddhant Arora, Shinji Watanabe

Streaming Decoder-Only Automatic Speech Recognition with Discrete Speech Units: A Pilot Study
Peikun Chen, Sining Sun, Changhao Shan, Qing Yang, Lei Xie

TfCleanformer: A streaming, array-agnostic, full- and sub-band modeling front-end for robust ASR
Jens Heitkaemper, Joe Caroselli, Arun Narayanan, Nathan Howard

Improving Streaming Speech Recognition With Time-Shifted Contextual Attention And Dynamic Right Context Masking
Khanh Le, Duc Chau

Simul-Whisper: Attention-Guided Streaming Whisper with Truncation Detection
Haoyu Wang, Guoqiang Hu, Guodong Lin, Wei-Qiang Zhang, Jian Li

Computational Resource Constrained ASR


Dynamic Data Pruning for Automatic Speech Recognition
Qiao Xiao, Pingchuan Ma, Adriana Fernandez-Lopez, Boqian Wu, Lu Yin, Stavros Petridis, Mykola Pechenizkiy, Maja Pantic, Decebal Constantin Mocanu, Shiwei Liu

Mitigating Overfitting in Structured Pruning of ASR Models with Gradient-Guided Parameter Regularization
Dong-Hyun Kim, Joon-Hyuk Chang

SparseWAV: Fast and Accurate One-Shot Unstructured Pruning for Large Speech Foundation Models
Tianteng Gu, Bei Liu, Hang Shao, Yanmin Qian

One-pass Multiple Conformer and Foundation Speech Systems Compression and Quantization Using An All-in-one Neural Model
Zhaoqing Li, Haoning Xu, Tianzi Wang, Shoukang Hu, Zengrui Jin, Shujie Hu, Jiajun Deng, Mingyu Cui, Mengzhe Geng, Xunying Liu

USM RNN-T model weights binarization
Oleg Rybakov, Dmitriy Serdyuk, Chengjian Zheng

DAISY: Data Adaptive Self-Supervised Early Exit for Speech Representation Models
Tzu-Quan Lin, Hung-yi Lee, Hao Tang

RepTor: Re-parameterizable Temporal Convolution for Keyword Spotting via Differentiable Kernel Search
Eunik Park, Daehyun Ahn, Hyungjun Kim

Global-Local Convolution with Spiking Neural Networks for Energy-efficient Keyword Spotting
Shuai Wang, Dehao Zhang, Kexin Shi, Yuchen Wang, Wenjie Wei, Jibin Wu, Malu Zhang

ED-sKWS: Early-Decision Spiking Neural Networks for Rapid, and Energy-Efficient Keyword Spotting
Zeyang Song, Qianhui Liu, Qu Yang, Yizhou Peng, Haizhou Li

A Small and Fast BERT for Chinese Medical Punctuation Restoration
Tongtao Ling, Yutao Lai, Lei Chen, Shilei Huang, Yi Liu

Evaluation of Speech Technology Systems


Quantification of stylistic differences in human- and ASR-produced transcripts of African American English
Annika Heuser, Tyler Kendall, Miguel del Rio, Quinn McNamara, Nishchal Bhandari, Corey Miller, Migüel Jetté

Beyond Levenshtein: Leveraging Multiple Algorithms for Robust Word Error Rate Computations And Granular Error Classifications
Korbinian Kuhn, Verena Kersken, Gottfried Zimmermann

Comparing ASR Systems in the Context of Speech Disfluencies
Maria Teleki, Xiangjue Dong, Soohwan Kim, James Caverlee

Deep Prosodic Features in Tandem with Perceptual Judgments of Word Reduction for Tone Recognition in Conversed Speech
Xiang-Li Lu, Yi-Fen Liu

SeMaScore: A new evaluation metric for automatic speech recognition tasks
Zitha Sasindran, Harsha Yelchuri, T. V. Prabhakar

Neural Network Training for Speech Recognition


Dynamic Encoder Size Based on Data-Driven Layer-wise Pruning for Speech Recognition
Jingjing Xu, Wei Zhou, Zijian Yang, Eugen Beck, Ralf Schlüter

Revisiting Convolution-free Transformer for Speech Recognition
Zejiang Hou, Goeric Huybrechts, Anshu Bhatia, Daniel Garcia-Romero, Kyu J. Han, Katrin Kirchhoff

Optimizing Large-Scale Context Retrieval for End-to-End ASR
Zhiqi Huang, Diamantino Caseiro, Kandarp Joshi, Christopher Li, Pat Rondon, Zelin Wu, Petr Zadrazil, Lillian Zhou

Self-Supervised Speech Representations are More Phonetic than Semantic
Kwanghee Choi, Ankita Pasad, Tomohiko Nakamura, Satoru Fukayama, Karen Livescu, Shinji Watanabe

Enhancing CTC-based speech recognition with diverse modeling units
Shiyi Han, Mingbin Xu, Zhihong Lei, Zhen Huang, Xingyu Na

Guiding Frame-Level CTC Alignments Using Self-knowledge Distillation
Eungbeom Kim, Hantae Kim, Kyogu Lee

Leveraging Large Language Models and Contextual Features for Phonetic Analysis (Special Session)


Human-like Linguistic Biases in Neural Speech Models: Phonetic Categorization and Phonotactic Constraints in Wav2Vec2.0
Marianne de Heer Kloots, Willem Zuidema

Exploring Pre-trained Speech Model for Articulatory Feature Extraction in Dysarthric Speech Using ASR
Yuqin Lin, Longbiao Wang, Jianwu Dang, Nobuaki Minematsu

Exploring Self-Supervised Speech Representations for Cross-lingual Acoustic-to-Articulatory Inversion
Yun Hao, Reihaneh Amooie, Wietse de Vries, Thomas Tienkamp, Rik van Noord, Martijn Wieling

Are Articulatory Feature Overlaps Shrouded in Speech Embeddings?
Erfan A. Shams, Iona Gessinger, Patrick Cormac English, Julie Carson-Berndsen

Searching for Structure: Appraising the Organisation of Speech Features in wav2vec 2.0 Embeddings
Patrick Cormac English, John D. Kelleher, Julie Carson-Berndsen

Responsible Speech Foundation Models (Special Session)


Speech foundation models in healthcare: Effect of layer selection on pathological speech feature prediction
Daniela A. Wiepert, Rene L. Utianski, Joseph R. Duffy, John L. Stricker, Leland R. Barnard, David T. Jones, Hugo Botha

Outlier Reduction with Gated Attention for Improved Post-training Quantization in Large Sequence-to-sequence Speech Foundation Models
Dominik Wagner, Ilja Baumann, Korbinian Riedhammer, Tobias Bocklet

Unveiling Biases while Embracing Sustainability: Assessing the Dual Challenges of Automatic Speech Recognition Systems
Ajinkya Kulkarni, Atharva Kulkarni, Miguel Couceiro, Isabel Trancoso

Emo-bias: A Large Scale Evaluation of Social Bias on Speech Emotion Recognition
Yi-Cheng Lin, Haibin Wu, Huang-Cheng Chou, Chi-Chun Lee, Hung-yi Lee

On the social bias of speech self-supervised models
Yi-Cheng Lin, Tzu-Quan Lin, Hsi-Che Lin, Andy T. Liu, Hung-yi Lee

Self-supervised Speech Representations Still Struggle with African American Vernacular English
Kalvin Chang, Yi-Hui Chou, Jiatong Shi, Hsuan-Ming Chen, Nicole Holliday, Odette Scharenborg, David R. Mortensen

Can you Remove the Downstream Model for Speaker Recognition with Self-Supervised Speech Features?
Zakaria Aldeneh, Takuya Higuchi, Jee-weon Jung, Skyler Seto, Tatiana Likhomanenko, Stephen Shum, Ahmed Hussen Abdelaziz, Shinji Watanabe, Barry-John Theobald

Empowering Whisper as a Joint Multi-Talker and Target-Talker Speech Recognition System
Lingwei Meng, Jiawen Kang, Yuejiao Wang, Zengrui Jin, Xixin Wu, Xunying Liu, Helen Meng

Multimodal Paralinguistics


LoRA-MER: Low-Rank Adaptation of Pre-Trained Speech Models for Multimodal Emotion Recognition Using Mutual Information
Yunrui Cai, Zhiyong Wu, Jia Jia, Helen Meng

Enhancing Modal Fusion by Alignment and Label Matching for Multimodal Emotion Recognition
Qifei Li, Yingming Gao, Yuhua Wen, Cong Wang, Ya Li

Prompt Link Multimodal Fusion in Multimodal Sentiment Analysis
Kang Zhu, Cunhang Fan, Jianhua Tao, Zhao Lv

A multimodal analysis of different types of laughter expression in conversational dialogues
Kexin Wang, Carlos Ishi, Ryoko Hayashi

Tackling Missing Modalities in Audio-Visual Representation Learning Using Masked Autoencoders
Georgios Chochlakis, Chandrashekhar Lavania, Prashant Mathur, Kyu J. Han

Enhancing Multimodal Emotion Recognition through ASR Error Compensation and LLM Fine-Tuning
Jehyun Kyung, Serin Heo, Joon-Hyuk Chang

Bridging Emotions Across Languages: Low Rank Adaptation for Multilingual Speech Emotion Recognition
Lucas Goncalves, Donita Robinson, Elizabeth Richerson, Carlos Busso

Automatic Emotion Recognition


A Layer-Anchoring Strategy for Enhancing Cross-Lingual Speech Emotion Recognition
Shreya G. Upadhyay, Carlos Busso, Chi-Chun Lee

Are Paralinguistic Representations all that is needed for Speech Emotion Recognition?
Orchid Chetia Phukan, Gautam Siddharth Kashyap, Arun Balaji Buduru, Rajesh Sharma

MFSN: Multi-perspective Fusion Search Network For Pre-training Knowledge in Speech Emotion Recognition
Haiyang Sun, Fulin Zhang, Yingying Gao, Shilei Zhang, Zheng Lian, Junlan Feng

Exploring Self-Supervised Multi-view Contrastive Learning for Speech Emotion Recognition with Limited Annotations
Bulat Khaertdinov, Pedro Jeruis, Annanda Sousa, Enrique Hortal

Acoustic Event Detection, Segmentation and Classification


FastAST: Accelerating Audio Spectrogram Transformer via Token Merging and Cross-Model Knowledge Distillation
Swarup Ranjan Behera, Abhishek Dhiman, Karthik Gowda, Aalekhya Satya Narayani

LungAdapter: Efficient Adapting Audio Spectrogram Transformer for Lung Sound Classification
Li Xiao, Lucheng Fang, Yuhong Yang, Weiping Tu

ElasticAST: An Audio Spectrogram Transformer for All Length and Resolutions
Jiu Feng, Mehmet Hamza Erol, Joon Son Chung, Arda Senocak

Robust Laughter Segmentation with Automatic Diverse Data Synthesis
Taisei Omine, Kenta Akita, Reiji Tsuruno

Explainable by-design Audio Segmentation through Non-Negative Matrix Factorization and Probing
Martin Lebourdais, Théo Mariotte, Antonio Almudévar, Marie Tahon, Alfonso Ortega

Predicting Heart Activity from Speech using Data-driven and Knowledge-based features
Gasser Elbanna, Zohreh Mostaani, Mathew Magimai.-Doss

Measuring acoustic dissimilarity of hierarchical markers in task-oriented dialogue with MFCC-based dynamic time warping
Natalia Morozova, Guanghao You, Sabine Stoll, Adrian Bangerter

Comparative Analysis of Personalized Voice Activity Detection Systems: Assessing Real-World Effectiveness
Sai Srujana Buddi, Satyam Kumar, Utkarsh Sarawgi, Vineet Garg, Shivesh Ranjan, Ognjen Rudovic, Ahmed Hussen Abdelaziz, Saurabh Adya

Generalized Fake Audio Detection via Deep Stable Learning
Zhiyong Wang, Ruibo Fu, Zhengqi Wen, Yuankun Xie, Yukun Liu, Xiaopeng Wang, Xuefei Liu, Yongwei Li, Jianhua Tao, Xin Qi, Yi Lu, Shuchen Shi

Multimodal Large Language Models with Fusion Low Rank Adaptation for Device Directed Speech Detection
Shruti Palaskar, Ognjen Rudovic, Sameer Dharur, Florian Pesce, Gautam Krishna, Aswin Sivaraman, Jack Berkowitz, Ahmed Hussen Abdelaziz, Saurabh Adya, Ahmed Tewfik

CtrSVDD: A Benchmark Dataset and Baseline Analysis for Controlled Singing Voice Deepfake Detection
Yongyi Zang, Jiatong Shi, You Zhang, Ryuichi Yamamoto, Jionghao Han, Yuxun Tang, Shengyuan Xu, Wenxiao Zhao, Jing Guo, Tomoki Toda, Zhiyao Duan

Fully Few-shot Class-incremental Audio Classification Using Expandable Dual-embedding Extractor
Yongjie Si, Yanxiong Li, Jialong Li, Jiaxin Tan, Qianhua He

Multi-label Bird Species Classification from Field Recordings using Mel_Graph-GCN Framework
Noumida A, Rajeev Rajan

Speech and Audio Modelling


DiveSound: LLM-Assisted Automatic Taxonomy Construction for Diverse Audio Generation
Baihan Li, Zeyu Xie, Xuenan Xu, Yiwei Guo, Ming Yan, Ji Zhang, Kai Yu, Mengyue Wu

Leveraging Language Model Capabilities for Sound Event Detection
Hualei Wang, Jianguo Mao, Zhifang Guo, Jiarui Wan, Hong Liu, Xiangdong Wang

Enhancing Zero-shot Audio Classification using Sound Attribute Knowledge from Large Language Models
Xuenan Xu, Pingyue Zhang, Ming Yan, Ji Zhang, Mengyue Wu

LAFMA: A Latent Flow Matching Model for Text-to-Audio Generation
Wenhao Guan, Kaidi Wang, Wangjin Zhou, Yang Wang, Feng Deng, Hui Wang, Lin Li, Qingyang Hong, Yong Qin

DNSMOS Pro: A Reduced-Size DNN for Probabilistic MOS of Speech
Fredrik Cumlin, Xinyu Liang, Victor Ungureanu, Chandan K. A. Reddy, Christian SchĂŒldt, Saikat Chatterjee

Blind Zero-Shot Audio Restoration: A Variational Autoencoder Approach for Denoising and Inpainting
Veranika Boukun, Jakob Drefs, Jörg Lücke

Fake Audio Detection


Towards generalisable and calibrated audio deepfake detection with self-supervised representations
Octavian Pascu, Adriana Stan, Dan Oneata, Elisabeta Oneata, Horia Cucu

Generalized Source Tracing: Detecting Novel Audio Deepfake Algorithm with Real Emphasis and Fake Dispersion Strategy
Yuankun Xie, Ruibo Fu, Zhengqi Wen, Zhiyong Wang, Xiaopeng Wang, Haonan Cheng, Long Ye, Jianhua Tao

Enhancing Partially Spoofed Audio Localization with Boundary-aware Attention Mechanism
Jiafeng Zhong, Bin Li, Jiangyan Yi

Singing Voice Graph Modeling for SingFake Detection
Xuanjun Chen, Haibin Wu, Roger Jang, Hung-yi Lee

Genuine-Focused Learning using Mask AutoEncoder for Generalized Fake Audio Detection
Xiaopeng Wang, Ruibo Fu, Zhengqi Wen, Zhiyong Wang, Yuankun Xie, Yukun Liu, Jianhua Tao, Xuefei Liu, Yongwei Li, Xin Qi, Yi Lu, Shuchen Shi

One-class learning with adaptive centroid shift for audio deepfake detection
Hyun Myung Kim, Kangwook Jang, Hoirin Kim

Deep Learning-Based Speech Enhancement: Approaches, Scalability, and Evaluation


VoiCor: A Residual Iterative Voice Correction Framework for Monaural Speech Enhancement
Rui Cao, Tianrui Wang, Meng Ge, Andong Li, Longbiao Wang, Jianwu Dang, Yungang Jia

Personalized Speech Enhancement Without a Separate Speaker Embedding Model
Tanel Pärnamaa, Ando Saabas

URGENT Challenge: Universality, Robustness, and Generalizability For Speech Enhancement
Wangyou Zhang, Robin Scheibler, Kohei Saijo, Samuele Cornell, Chenda Li, Zhaoheng Ni, Jan Pirklbauer, Marvin Sach, Shinji Watanabe, Tim Fingscheidt, Yanmin Qian

EARS: An Anechoic Fullband Speech Dataset Benchmarked for Speech Enhancement and Dereverberation
Julius Richter, Yi-Chiao Wu, Steven Krenn, Simon Welker, Bunlong Lay, Shinji Watanabe, Alexander Richard, Timo Gerkmann

Speech Synthesis: Other Topics 1


LiteFocus: Accelerated Diffusion Inference for Long Audio Synthesis
Zhenxiong Tan, Xinyin Ma, Gongfan Fang, Xinchao Wang

Speak in the Scene: Diffusion-based Acoustic Scene Transfer toward Immersive Speech Generation
Miseul Kim, Soo-Whan Chung, Youna Ji, Hong-Goo Kang, Min-Seok Choi

PL-TTS: A Generalizable Prompt-based Diffusion TTS Augmented by Large Language Model
Shuhua Li, Qirong Mao, Jiatong Shi

Towards realtime co-speech gestures synthesis using STARGATE
Louis Abel, Vincent Colotte, Slim Ouni

PPPR: Portable Plug-in Prompt Refiner for Text to Audio Generation
Shuchen Shi, Ruibo Fu, Zhengqi Wen, Jianhua Tao, Tao Wang, Chunyu Qiang, Yi Lu, Xin Qi, Xuefei Liu, Yukun Liu, Yongwei Li, Zhiyong Wang, Xiaopeng Wang

Neural ATSM: Fully Neural Network-based Adaptive Time-Scale Modification Using Sentence-Specific Dynamic Control
Jaeuk Lee, Sohee Jang, Joon-Hyuk Chang

FLY-TTS: Fast, Lightweight and High-Quality End-to-End Text-to-Speech Synthesis
Yinlin Guo, Yening Lv, Jinqiao Dou, Yan Zhang, Yuehai Wang

Zero-shot Out-of-domain is No Joke: Lessons Learned in the VoiceMOS 2023 MOS Prediction Challenge
Marie Kunešová, Jan Lehečka, Josef Michálek, Jindrich Matousek, Jan Švec

Towards a General-Purpose Model of Perceived Pragmatic Similarity
Nigel G. Ward, Andres Segura, Alejandro Ceballos, Divette Marco

Speech Synthesis: Other Topics 2


Enabling Conversational Speech Synthesis using Noisy Spontaneous Data
Liisa RĂ€tsep, Rasmus Lellep, Mark Fishel

Frame-Wise Breath Detection with Self-Training: An Exploration of Enhancing Breath Naturalness in Text-to-Speech
Dong Yang, Tomoki Koriyama, Yuki Saito

H4C-TTS: Leveraging Multi-Modal Historical Context for Conversational Text-to-Speech
Donghyun Seong, Joon-Hyuk Chang

Bilingual and Code-switching TTS Enhanced with Denoising Diffusion Model and GAN
Huai-Zhe Yang, Chia-Ping Chen, Shan-Yun He, Cheng-Ruei Li

SpeechBERTScore: Reference-Aware Automatic Evaluation of Speech Generation Leveraging NLP Evaluation Metrics
Takaaki Saeki, Soumi Maiti, Shinnosuke Takamichi, Shinji Watanabe, Hiroshi Saruwatari

UNIQUE: Unsupervised Network for Integrated Speech Quality Evaluation
Juhwan Yoon, WooSeok Ko, Seyun Um, Sungwoong Hwang, Soojoong Hwang, Changhwan Kim, Hong-Goo Kang

FVTTS: Face Based Voice Synthesis for Text-to-Speech
Minyoung Lee, Eunil Park, Sungeun Hong

Speech synthesis: Cross-lingual and multilingual aspects


Meta Learning Text-to-Speech Synthesis in over 7000 Languages
Florian Lux, Sarina Meyer, Lyonel Behringer, Frank Zalkow, Phat Do, Matt Coler, Emanuël A. P. Habets, Ngoc Thang Vu

An Initial Investigation of Language Adaptation for TTS Systems under Low-resource Scenarios
Cheng Gong, Erica Cooper, Xin Wang, Chunyu Qiang, Mengzhe Geng, Dan Wells, Longbiao Wang, Jianwu Dang, Marc Tessier, Aidan Pine, Korin Richmond, Junichi Yamagishi

Improving Multilingual Text-to-Speech with Mixture-of-Language-Experts and Accent Disentanglement
Jing Wu, Ting Chen, Minchuan Chen, Wei Hu, Shaojun Wang, Jing Xiao

Seamless Language Expansion: Enhancing Multilingual Mastery in Self-Supervised Models
Jing Xu, Minglin Wu, Xixin Wu, Helen Meng

XTTS: a Massively Multilingual Zero-Shot Text-to-Speech Model
Edresson Casanova, Kelly Davis, Eren Gölge, Görkem Göknar, Iulian Gulea, Logan Hart, Aya Aljafari, Joshua Meyer, Reuben Morais, Samuel Olayemi, Julian Weber

X-E-Speech: Joint Training Framework of Non-Autoregressive Cross-lingual Emotional Text-to-Speech and Voice Conversion
Houjian Guo, Chaoran Liu, Carlos Toshinori Ishi, Hiroshi Ishiguro

Noise, Far-Field, Multi-Talker, Enhancement, Audio Classification


RIR-SF: Room Impulse Response Based Spatial Feature for Target Speech Recognition in Multi-Channel Multi-Speaker Scenarios
Yiwen Shao, Shi-Xiong Zhang, Dong Yu

Multi-Channel Multi-Speaker ASR Using Target Speaker’s Solo Segment
Yiwen Shao, Shi-Xiong Zhang, Yong Xu, Meng Yu, Dong Yu, Daniel Povey, Sanjeev Khudanpur

Transcription-Free Fine-Tuning of Speech Separation Models for Noisy and Reverberant Multi-Speaker Automatic Speech Recognition
William Ravenscroft, George Close, Stefan Goetze, Thomas Hain, Mohammad Soleymanpour, Anurag Chowdhury, Mark C. Fuhs

NOTSOFAR-1 Challenge: New Datasets, Baseline, and Tasks for Distant Meeting Transcription
Alon Vinnikov, Amir Ivry, Aviv Hurvitz, Igor Abramovski, Sharon Koubi, Ilya Gurvich, Shai Peer, Xiong Xiao, Benjamin Martinez Elizalde, Naoyuki Kanda, Xiaofei Wang, Shalev Shaer, Stav Yagev, Yossi Asher, Sunit Sivasankaran, Yifan Gong, Min Tang, Huaming Wang, Eyal Krupka

Enhanced ASR Robustness to Packet Loss with a Front-End Adaptation Network
Yehoshua Dissen, Shiry Yonash, Israel Cohen, Joseph Keshet

Hold Me Tight: Stable Encoder-Decoder Design for Speech Enhancement
Daniel Haider, Felix Perfler, Vincent Lostanlen, Martin Ehler, Peter Balazs

DGSRN: Noise-Robust Speech Recognition Method with Dual-Path Gated Spectral Refinement Network
Wenjun Wang, Shangbin Mo, Ling Dong, Zhengtao Yu, Junjun Guo, Yuxin Huang

Towards Robust Few-shot Class Incremental Learning in Audio Classification using Contrastive Representation
Riyansha Singh, Parinita Nema, Vinod K Kurmi

Bird Whisperer: Leveraging Large Pre-trained Acoustic Model for Bird Call Classification
Muhammad Umer Sheikh, Hassan Abid, Bhuiyan Sanjid Shafique, Asif Hanif, Muhammad Haris Khan

SpeakerBeam-SS: Real-time Target Speaker Extraction with Lightweight Conv-TasNet and State Space Modeling
Hiroshi Sato, Takafumi Moriya, Masato Mimura, Shota Horiguchi, Tsubasa Ochiai, Takanori Ashihara, Atsushi Ando, Kentaro Shinayama, Marc Delcroix

wTIMIT2mix: A Cocktail Party Mixtures Database to Study Target Speaker Extraction for Normal and Whispered Speech
Marvin Borsdorf, Zexu Pan, Haizhou Li, Tanja Schultz

Self-Supervised Learning for ASR


What happens in continued pre-training? Analysis of self-supervised speech models with continued pre-training for colloquial Finnish ASR
Yaroslav Getman, Tamas Grosz, Mikko Kurimo

Self-Supervised Learning for ASR Pre-Training with Uniquely Determined Target Labels and Controlling Cepstrum Truncation for Speech Augmentation
Akihiro Kato, Hiroyuki Nagano, Kohei Chike, Masaki Nose

MS-HuBERT: Mitigating Pre-training and Inference Mismatch in Masked Language Modelling methods for learning Speech Representations
Hemant Yadav, Sunayana Sitaram, Rajiv Ratn Shah

Balanced-Wav2Vec: Enhancing Stability and Robustness of Representation Learning Through Sample Reweighting Techniques
Mun-Hak Lee, Jae-Hong Lee, DoHee Kim, Ye-Eun Ko, Joon-Hyuk Chang

Spoken Term Detection and Speech Retrieval


Few-Shot Keyword Spotting from Mixed Speech
Junming Yuan, Ying Shi, LanTian Li, Dong Wang, Askar Hamdulla

Pretraining End-to-End Keyword Search with Automatically Discovered Acoustic Units
Bolaji Yusuf, Jan Honza Cernocky, Murat Saraçlar

2DP-2MRC: 2-Dimensional Pointer-based Machine Reading Comprehension Method for Multimodal Moment Retrieval
Jiajun He, Tomoki Toda

GPA: Global and Prototype Alignment for Audio-Text Retrieval
Yuxin Xie, Zhihong Zhu, Xianwei Zhuang, Liming Liang, Zhichang Wang, Yuexian Zou

Few-Shot Keyword-Incremental Learning with Total Calibration
Ilseok Kim, Ju-Seok Seong, Joon-Hyuk Chang

Leveraging Speech Data Diversity to Document Indigenous Heritage and Culture
Allahsera Tapo, Éric Le Ferrand, Zoey Liu, Christopher Homan, Emily Prud’hommeaux

Speech Disorders 1


Missingness-resilient Video-enhanced Multimodal Disfluency Detection
Payal Mohapatra, Shamika Likhite, Subrata Biswas, Bashima Islam, Qi Zhu

AS-70: A Mandarin stuttered speech dataset for automatic speech recognition and stuttering event detection
Rong Gong, Hongfei Xue, Lezhi Wang, Xin Xu, Qisheng Li, Lei Xie, Hui Bu, Shaomei Wu, Jiaming Zhou, Yong Qin, Binbin Zhang, Jun Du, Jia Bin, Ming Li

Analyzing Speech Motor Movement using Surface Electromyography in Minimally Verbal Adults with Autism Spectrum Disorder
Wazeer Zulfikar, Nishat Protyasha, Camila Canales, Heli Patel, James Williamson, Laura Sarnie, Lisa Nowinski, Nataliya Kosmyna, Paige Townsend, Sophia Yuditskaya, Tanya Talkar, Utkarsh Oggy Sarawgi, Christopher McDougle, Thomas Quatieri, Pattie Maes, Maria Mody

Prosody of speech production in latent post-stroke aphasia
Cong Zhang, Tong Li, Gayle DeDe, Christos Salis

MMSD-Net: Towards Multi-modal Stuttering Detection
Liangyu Nie, Sudarsana Reddy Kadiri, Ruchit Agrawal

Large Language Models for Dysfluency Detection in Stuttered Speech
Dominik Wagner, Sebastian P. Bayerl, Ilja Baumann, Elmar Noeth, Korbinian Riedhammer, Tobias Bocklet

Connecting Speech-science and Speech-technology for Children’s Speech (Special Session)


Preliminary Investigation of Psychometric Properties of a Novel Multimodal Dialog Based Affect Production Task in Children and Adolescents with Autism
Carly Demopoulos, Linnea Lampinen, Cristian Preciado, Hardik Kothare, Vikram Ramanarayanan

Training speech-breathing coordination in computer-assisted reading
Delphine Charuau, Andrea Briglia, Erika Godde, Gérard Bailly

How Does Alignment Error Affect Automated Pronunciation Scoring in Children’s Speech?
Prad Kadambi, Tristan Mahr, Lucas Annear, Henry Nomeland, Julie Liss, Katherine Hustad, Visar Berisha

Examining Vocal Tract Coordination in Childhood Apraxia of Speech with Acoustic-to-Articulatory Speech Inversion Feature Sets
Nina R. Benway, Jonathan L. Preston, Carol Espy-Wilson

Children’s Speech Recognition through Discrete Token Enhancement
Vrunda N. Sukhadia, Shammur Absar Chowdhury

Bridging Child-Centered Speech Language Identification and Language Diarization via Phonetics
Yujia Wang, Hexin Liu, Leibny Paola Garcia

Reading Miscue Detection in Primary School through Automatic Speech Recognition
Lingyun Gao, Cristian Tejedor-Garcia, Helmer Strik, Catia Cucchiarini

Automatic Evaluation of a Sentence Memory Test for Preschool Children
Ilja Baumann, Nicole Unger, Dominik Wagner, Korbinian Riedhammer, Tobias Bocklet

Enhancing Child Vocalization Classification with Phonetically-Tuned Embeddings for Assisting Autism Diagnosis
Jialu Li, Mark Hasegawa-Johnson, Karrie Karahalios

Self-Supervised Models for Phoneme Recognition: Applications in Children’s Speech for Reading Learning
Lucas Block Medin, Thomas Pellegrini, Lucile Gelin

Benchmarking Children’s ASR with Supervised and Self-supervised Speech Foundation Models
Ruchao Fan, Natarajan Balaji Shankar, Abeer Alwan

Introduction To Partial Fine-tuning: A Comprehensive Evaluation Of End-to-end Children’s Automatic Speech Recognition Adaptation
Thomas Rolland, Alberto Abad

Improving child speech recognition with augmented child-like speech
Yuanyuan Zhang, Zhengjun Yue, Tanvina Patel, Odette Scharenborg

Mixed Children/Adult/Childrenized Fine-Tuning for Children’s ASR: How to Reduce Age Mismatch and Speaking Style Mismatch
Thomas Graave, Zhengyang Li, Timo Lohrenz, Tim Fingscheidt

Exploring Speech Foundation Models for Speaker Diarization in Child-Adult Dyadic Interactions
Anfeng Xu, Kevin Huang, Tiantian Feng, Lue Shen, Helen Tager-Flusberg, Shrikanth Narayanan

Show and Tell 4


IIITH Ucchar e-Sudharak: an automatic English pronunciation corrector for school-going children with a teacher in the loop
Meenakshi Sirigiraju, Arjun Rajasekar, Abhishikth Meejuri, Chiranjeevi Yarra

Speech enabled visual acuity test
Boon Peng Yap, Kok Liang Tan, Zhenghao Li, Rong Tong

A ChatGPT-based oral Q&A practice system for first-time student participants in international conferences
Mayuko Aiba, Daisuke Saito, Nobuaki Minematsu

Visual scene display application for augmentative and alternative communication
Karthik Venkat Sridaran, Raja Praveen, Reuben T Varghese, Ajish K Abraham, Shankar R, Winnie Rachel Cherian

CALL system using pitch-accent feature representations reflecting listeners’ subjective adequacy
Ikuyo Masuda-Katsuse, Ayako Shirose

The speech motor chaining web app for speech motor learning
Jonathan L Preston, Nina R Benway, Nathan Prestopnik, Nathan Preston

Visualization for improving foreign language pronunciation
Charlotte Yoder, Karrie Karahalios, Mark Hasegawa-Johnson, Shreyansh Agrawal

CaptainA self-study mobile app for practising speaking: task completion assessment and feedback with generative AI
Nhan Phan, Anna von Zansen, Maria Kautonen, Tamás Grósz, Mikko Kurimo