Title: Explainability for Speech Models: On the Challenges of Acoustic Feature Selection
Authors: Dennis Fucci, Beatrice Savoldi, Marco Gaido, Matteo Negri, Mauro Cettolo, Luisa Bentivogli
Published: 2024-12-04
Link: https://clic2024.ilc.cnr.it/wp-content/uploads/2024/12/44_main_long.pdf
Abstract
Spurred by the demand for transparency and interpretability in Artificial Intelligence (AI), the field of eXplainable AI (XAI) has experienced significant growth, marked by both theoretical reflections and technical advancements. While various XAI techniques, especially feature attribution methods, have been extensively explored across diverse tasks, their adaptation for the speech modality is comparatively lagging behind. We argue that a key challenge in feature attribution for speech processing lies in identifying informative acoustic features. In this paper, we discuss the key challenges in selecting the features for speech explanations. Also, in light of existing research, we highlight current gaps and propose future avenues to enhance the depth and informativeness of explanations for speech.
A quick overview via quotes (bold mine)
it is essential to recognize that the effectiveness of feature attribution explanations relies not only on the techniques themselves but also on the informativeness of the input features used as explanatory variables. If an explanation highlights unintelligible or poorly informative features, it does little to enhance the understanding of the model's behavior [1]. This can undermine key principles in XAI, such as accuracy (the property of correctly reflecting the factors that led the model to a specific decision, including all relevant information) and meaningfulness (the property of offering explanations that are comprehensible to the user) [24]. The properties of accuracy and meaningfulness can be associated with those of faithfulness and plausibility, respectively [25, 26].
this paper reflects on the impact of the chosen acoustic features in explaining the rationale behind speech models, aiming to gain a deeper understanding of the trade-offs associated with acoustic features
- Linguistic content: what is said
- Paralinguistic cues: how it is said
The frequency dimension also plays a vital role in shaping suprasegmental aspects of speech (broader phenomena that span multiple segments) such as intonation, obtained by varying pitch
Acoustic correlates:
- time, or the sequential occurrence of sounds
- intensity, corresponding to the energy level of the wave due to the strength of molecular vibration, which we perceive as loudness;
- frequency, regarding the rate of vibrations produced by the vocal cords (interpreted as pitch) and whose modulation is responsible for shaping the type of speech sound
Pitch, for instance, has a distinctive function in tonal languages, where it is used to distinguish lexical or grammatical meaning [41]. But even in non-tonal languages, these prosodic elements are indispensable to delivering different meanings and intents, as the reader can perceive by reading out loud two contrastive sentences such as: "You got the joke right" and "You got the joke, right?", where pauses and prosody play pivotal roles.
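To make these correlates concrete, here is a minimal sketch, assuming librosa is installed and a hypothetical 16 kHz file speech.wav: it extracts the three dimensions from a recording as frame times, frame-wise RMS energy (intensity), and the fundamental frequency estimated with pYIN (pitch).

```python
import librosa
import numpy as np

# Load a mono waveform (hypothetical file path; 16 kHz is common for speech models)
y, sr = librosa.load("speech.wav", sr=16000)

frame_length, hop_length = 1024, 256

# Time: the position of each analysis frame, in seconds
times = librosa.frames_to_time(
    np.arange(1 + len(y) // hop_length), sr=sr, hop_length=hop_length
)

# Intensity: frame-wise RMS energy, perceived as loudness
rms = librosa.feature.rms(y=y, frame_length=frame_length, hop_length=hop_length)[0]

# Frequency: fundamental frequency (F0) estimated with pYIN, perceived as pitch
f0, voiced_flag, voiced_prob = librosa.pyin(
    y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C7"),
    sr=sr, frame_length=frame_length, hop_length=hop_length
)

print(f"{len(times)} frames, mean RMS {rms.mean():.4f}, "
      f"median F0 {np.nanmedian(f0):.1f} Hz")
```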
Speech Representations
- Waveform - "This type of representation is leveraged by models like Wav2vec [6]."
- Spectrogram - "articulation of sounds produces time-frequency patterns which are visible as darker regions [36]. Prominent examples of state-of-the-art models leveraging spectrograms are Whisper [9] and SeamlessM4T [44]"
- MFCC - "each coefficient captures important details about how the frequency content of the signal changes over time. Like spectrograms, MFCCs offer information about both frequency and time, but in a more compact form. MFCCs are commonly used in the implementation of ASR models within popular toolkits like Kaldi [45] and Mozilla DeepSpeech"
For human understanding, however, they actually vary in terms of informativeness with respect to the acoustic correlates discussed in §2. Indeed, although both intensity and frequency are somewhat discernible in waveforms, qualitative distinctions of patterns specific to pitch or phoneme frequencies are rarely feasible [36]. Comparatively, spectrograms and MFCCs are richer and more descriptive, because they capture the multiple dimensions of time, frequency, and intensity with finer detail. Still, spectrograms are more conducive to phonetic analyses, given the established knowledge in analyzing frequency patterns over time within this representation [36]. In contrast, MFCCs are rarely used for phonetic analysis [46].
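For reference, here is a minimal sketch that derives all three representations from the same signal, again assuming librosa and the hypothetical speech.wav; the parameters (25 ms windows, 80 mel bands, 13 coefficients) are common defaults rather than the settings of any specific model.

```python
import librosa
import numpy as np

# Waveform: the raw time-domain signal, as consumed by models like Wav2vec
y, sr = librosa.load("speech.wav", sr=16000)           # shape: (n_samples,)

# Spectrogram: time-frequency representation via STFT; log-mel variants
# feed models such as Whisper and SeamlessM4T
stft = librosa.stft(y, n_fft=400, hop_length=160)       # 25 ms window, 10 ms hop
spectrogram = np.abs(stft) ** 2                          # shape: (n_freq_bins, n_frames)
mel_spec = librosa.feature.melspectrogram(S=spectrogram, sr=sr, n_mels=80)
log_mel = librosa.power_to_db(mel_spec)

# MFCCs: a more compact description of how the frequency content evolves over
# time, common in toolkits like Kaldi and Mozilla DeepSpeech
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, n_fft=400, hop_length=160)

print(y.shape, log_mel.shape, mfcc.shape)
```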
Richness of Explanations
To prioritize explanation accuracy and conduct analyses considering the crucial role of acoustic correlates such as frequency, it is advisable to take into account all dimensions embedded in the speech representation.
owing to the compatibility of current models with various representation types, the explanations generated are inevitably confined by the specific input features provided to the model
[about a model which accepts waveform input only] previous works by Wu et al. [31] and Wu et al. [32], based on waveforms, solely focus on the temporal dimension to explain ASR
Various techniques exist to analyze how models extract relevant patterns from waveforms through convolutions:
- Interpretable Convolutional Filters with SincNet
- Interpretation of convolutional neural networks for speech spectrogram regression from intracranial recordings
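As a rough illustration of this line of analysis (a sketch of my own, not the cited methods), one can inspect the frequency response of a waveform model's first convolutional layer by taking the FFT of its kernels. The Conv1d layer below is a hypothetical stand-in that would in practice be loaded from a trained model.

```python
import torch
import numpy as np

# Hypothetical first layer of a waveform-based model: 1-D convolutions over raw
# samples (in practice, load this layer from a trained checkpoint)
sample_rate = 16000
first_conv = torch.nn.Conv1d(in_channels=1, out_channels=64, kernel_size=251)

# Frequency response of each learned kernel, obtained by zero-padded FFT
kernels = first_conv.weight.detach().squeeze(1).numpy()    # shape: (64, 251)
n_fft = 1024
responses = np.abs(np.fft.rfft(kernels, n=n_fft, axis=1))  # (64, n_fft//2 + 1)
freqs = np.fft.rfftfreq(n_fft, d=1.0 / sample_rate)

# The frequency each filter responds to most strongly hints at which acoustic
# band (e.g., pitch range vs. formant range) the model has learned to extract
peak_freqs = freqs[responses.argmax(axis=1)]
print(sorted(np.round(peak_freqs, 1)))
```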
two models tested by Wu et al. [31], namely, DeepSpeech [51] and Sphinx [52], are fed with spectrograms and MFCCs, respectively. However, explanations based on raw waveforms are provided for these models. This inconsistency between the features used in explanations and those used by the models inevitably offers only a partial overview of the models' behavior and limits the exploration of important acoustic aspects. This, in turn, can impact the accuracy of the explanations, which ideally should encompass all relevant information.
the works of Markert et al. [30], who provide explanations that account for the most influential elements in MFCCs, as well as Trinh and Mandel [29] and Becker et al. [27], who base the explanations on spectrograms
Trinh and Mandel [29] demonstrated that neural ASR models focus on high-energy time-frequency regions for transcription
Becker et al. [27] found that lower frequency ranges, typically associated with pitch, exhibit higher attribution scores in speaker gender classification tasks [27], showing some alignment with human speech processing.
I guess lower pitch is the actual vocal cords' resonance and the higher pitch (as in a mel-spectrogram) corresponds to the formants and harmonics?
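For reference, a minimal sketch of this kind of frequency-band attribution analysis (not the cited authors' code): a hypothetical PyTorch classifier over log-mel spectrograms, gradient-times-input saliency on the input, and an aggregation per mel band to see which frequency ranges receive the highest attribution.

```python
import torch

# Hypothetical spectrogram classifier: (batch, 1, n_mels, n_frames) -> class logits
model = torch.nn.Sequential(
    torch.nn.Conv2d(1, 16, kernel_size=3, padding=1),
    torch.nn.ReLU(),
    torch.nn.AdaptiveAvgPool2d(1),
    torch.nn.Flatten(),
    torch.nn.Linear(16, 2),   # e.g., a binary speaker gender classifier
)
model.eval()

# Stand-in log-mel spectrogram (80 mel bands, 300 frames)
log_mel = torch.randn(1, 1, 80, 300, requires_grad=True)

# Gradient x input attribution for the predicted class
logits = model(log_mel)
logits[0, logits.argmax()].backward()
attribution = (log_mel.grad * log_mel.detach()).squeeze()   # (80, 300)

# Aggregate over time to see which mel bands (frequency ranges) matter most
per_band = attribution.abs().mean(dim=1)
top_bands = per_band.topk(5).indices.tolist()
print("Most influential mel bands:", top_bands)
```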
Granularity of Explanations
Explanations should be obtained with low-level units to avoid biasing them towards human understanding.
Wu et al. [32] and Pastor et al. [28] resort to the alignment of audio to text, for individual phonemes and words respectively, and apply explainability techniques to such units. While this approach helps decipher the contribution of input features based on more intuitive linguistic units, it diverges from how current models process speech features in small frames and samples [43]. This divergence risks overlooking the model's behavior and compromises the accuracy and effectiveness of the explanations. For instance, whether ASR systems rely on shorter or longer time intervals than individual words remains unclear [29]. Therefore, analyzing this aspect requires a more granular approach at the time level.
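To illustrate the trade-off, a minimal sketch with made-up inputs (not the cited works' code) that collapses frame-level attribution scores into word-level scores given a forced alignment; the intuitive word-level view is exactly what discards the frame-level granularity the model actually operates on.

```python
import numpy as np

# Assumed inputs: one attribution score per 10 ms frame, plus a forced alignment
# of words to time spans in seconds (both are stand-ins for real outputs)
hop_seconds = 0.01
frame_scores = np.abs(np.random.randn(500))
alignment = [("you", 0.10, 0.32), ("got", 0.32, 0.55),
             ("the", 0.55, 0.68), ("joke", 0.68, 1.10), ("right", 1.20, 1.65)]

# Word-level explanation: mean attribution over the frames each word spans
word_scores = {}
for word, start, end in alignment:
    first, last = int(start / hop_seconds), int(end / hop_seconds)
    word_scores[word] = float(frame_scores[first:last].mean())

for word, score in sorted(word_scores.items(), key=lambda kv: -kv[1]):
    print(f"{word:>6s}: {score:.3f}")
```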