Title: Explainability for Speech Models: On the Challenges of Acoustic Feature Selection
Authors: Dennis Fucci, Beatrice Savoldi, Marco Gaido, Matteo Negri, Mauro Cettolo, Luisa Bentivogli
Published: 2024-12-04
Link: https://clic2024.ilc.cnr.it/wp-content/uploads/2024/12/44_main_long.pdf
Abstract
Spurred by the demand for transparency and interpretability in Artificial Intelligence (AI), the field of eXplainable AI (XAI) has experienced significant growth, marked by both theoretical reflections and technical advancements. While various XAI techniques, especially feature attribution methods, have been extensively explored across diverse tasks, their adaptation for the speech modality is comparatively lagging behind. We argue that a key challenge in feature attribution for speech processing lies in identifying informative acoustic features. In this paper, we discuss the key challenges in selecting the features for speech explanations. Also, in light of existing research, we highlight current gaps and propose future avenues to enhance the depth and informativeness of explanations for speech.
A quick overview via quotes (bold mine)
it is essential to recognize that the effectiveness of feature attribution explanations relies not only on the techniques themselves but also on the informativeness of the input features used as explanatory variables. If an explanation highlights unintelligible or poorly informative features, it does little to enhance the understanding of the model's behavior [1]. This can undermine key principles in XAI, such as accuracy (the property of correctly reflecting the factors that led the model to a specific decision, including all relevant information) and meaningfulness (the property of offering explanations that are comprehensible to the user) [24]. The properties of accuracy and meaningfulness can be associated with those of faithfulness and plausibility, respectively [25, 26].
this paper reflects on the impact of the chosen acoustic features in explaining the rationale behind speech models, aiming to gain a deeper understanding of the trade-offs associated with acoustic features
- Linguistic content: what is said
- Paralinguistic cues: how it is said
The frequency dimension also plays a vital role in shaping suprasegmental aspects of speech (broader phenomena that span multiple segments) such as intonation, obtained by varying pitch
Acoustic correlates:
- time, or the sequential occurrence of sounds
- intensity, corresponding to the energy level of the wave due to the strength of molecular vibration, which we perceive as loudness;
- frequency, regarding the rate of vibrations produced by the vocal cords (interpreted as pitch) and whose modulation is responsible for shaping the type of speech sound
Pitch, for instance, has a distinctive function in tonal languages, where it is used to distinguish lexical or grammatical meaning [41]. But even in non-tonal languages, these prosodic elements are indispensable to delivering different meanings and intents, as the reader can perceive by reading out loud two contrastive sentences such as: "You got the joke right" and "You got the joke, right?", where pauses and prosody play pivotal roles.
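To make these correlates concrete, here is a minimal sketch, assuming librosa is installed and a hypothetical 16 kHz file speech.wav: it extracts the three dimensions from a recording as frame times, frame-wise RMS energy (intensity), and the fundamental frequency estimated with pYIN (pitch).

```python
import librosa
import numpy as np

# Load a mono waveform (hypothetical file path; 16 kHz is common for speech models)
y, sr = librosa.load("speech.wav", sr=16000)

frame_length, hop_length = 1024, 256

# Time: the position of each analysis frame, in seconds
times = librosa.frames_to_time(
    np.arange(1 + len(y) // hop_length), sr=sr, hop_length=hop_length
)

# Intensity: frame-wise RMS energy, perceived as loudness
rms = librosa.feature.rms(y=y, frame_length=frame_length, hop_length=hop_length)[0]

# Frequency: fundamental frequency (F0) estimated with pYIN, perceived as pitch
f0, voiced_flag, voiced_prob = librosa.pyin(
    y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C7"),
    sr=sr, frame_length=frame_length, hop_length=hop_length
)

print(f"{len(times)} frames, mean RMS {rms.mean():.4f}, "
      f"median F0 {np.nanmedian(f0):.1f} Hz")
```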
Speech Representations
- Waveform - "This type of representation is leveraged by models like Wav2vec [6]."
- Spectrogram - "articulation of sounds produces time-frequency patterns which are visible as darker regions [36]. Prominent examples of state-of-the-art models leveraging spectrograms are Whisper [9] and SeamlessM4T [44]"
- MFCC - "each coefficient captures important details about how the frequency content of the signal changes over time. Like spectrograms, MFCCs offer information about both frequency and time, but in a more compact form. MFCCs are commonly used in the implementation of ASR models within popular toolkits like Kaldi [45] and Mozilla DeepSpeech"
For human understanding, however, they actually vary in terms of informativeness with respect to the acoustic correlates discussed in §2. Indeed, although both intensity and frequency are somewhat discernible in waveforms, qualitative distinctions of patterns specific to pitch or phoneme frequencies are rarely feasible [36]. Comparatively, spectrograms and MFCCs are richer and more descriptive, because they capture the multiple dimensions of time, frequency, and intensity with finer detail. Still, spectrograms are more conducive to phonetic analyses, given the established knowledge in analyzing frequency patterns over time within this representation [36]. In contrast, MFCCs are rarely used for phonetic analysis [46].
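For reference, here is a minimal sketch that derives all three representations from the same signal, again assuming librosa and the hypothetical speech.wav; the parameters (25 ms windows, 80 mel bands, 13 coefficients) are common defaults rather than the settings of any specific model.

```python
import librosa
import numpy as np

# Waveform: the raw time-domain signal, as consumed by models like Wav2vec
y, sr = librosa.load("speech.wav", sr=16000)           # shape: (n_samples,)

# Spectrogram: time-frequency representation via STFT; log-mel variants
# feed models such as Whisper and SeamlessM4T
stft = librosa.stft(y, n_fft=400, hop_length=160)       # 25 ms window, 10 ms hop
spectrogram = np.abs(stft) ** 2                          # shape: (n_freq_bins, n_frames)
mel_spec = librosa.feature.melspectrogram(S=spectrogram, sr=sr, n_mels=80)
log_mel = librosa.power_to_db(mel_spec)

# MFCCs: a more compact description of how the frequency content evolves over
# time, common in toolkits like Kaldi and Mozilla DeepSpeech
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, n_fft=400, hop_length=160)

print(y.shape, log_mel.shape, mfcc.shape)
```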
Richness of Explanations
To prioritize explanation accuracy and conduct analyses considering the crucial role of acoustic correlates such as frequency, it is advisable to take into account all dimensions embedded in the speech representation.
owing to the compatibility of current models with various representation types, the explanations generated are inevitably confined by the specific input features provided to the model
[about a model which accepts waveform input only] previous works by Wu et al. [31] and Wu et al. [32], based on waveforms, solely focus on the temporal dimension to explain ASR
Various techniques exist to analyze how models extract relevant patterns from waveforms through convolutions:
- Interpretable Convolutional Filters with SincNet
- Interpretation of convolutional neural networks for speech spectrogram regression from intracranial recordings
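As a rough illustration of this line of analysis (a sketch of my own, not the cited methods), one can inspect the frequency response of a waveform model's first convolutional layer by taking the FFT of its kernels. The Conv1d layer below is a hypothetical stand-in that would in practice be loaded from a trained model.

```python
import torch
import numpy as np

# Hypothetical first layer of a waveform-based model: 1-D convolutions over raw
# samples (in practice, load this layer from a trained checkpoint)
sample_rate = 16000
first_conv = torch.nn.Conv1d(in_channels=1, out_channels=64, kernel_size=251)

# Frequency response of each learned kernel, obtained by zero-padded FFT
kernels = first_conv.weight.detach().squeeze(1).numpy()    # shape: (64, 251)
n_fft = 1024
responses = np.abs(np.fft.rfft(kernels, n=n_fft, axis=1))  # (64, n_fft//2 + 1)
freqs = np.fft.rfftfreq(n_fft, d=1.0 / sample_rate)

# The frequency each filter responds to most strongly hints at which acoustic
# band (e.g., pitch range vs. formant range) the model has learned to extract
peak_freqs = freqs[responses.argmax(axis=1)]
print(sorted(np.round(peak_freqs, 1)))
```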
two models tested by Wu et al. [31], namely, DeepSpeech [51] and Sphinx [52], are fed with spectrograms and MFCCs, respectively. However, explanations based on raw waveforms are provided for these models. This inconsistency between the features used in explanations and those used by the models inevitably offers only a partial overview of the models' behavior and limits the exploration of important acoustic aspects. This, in turn, can impact the accuracy of the explanations, which ideally should encompass all relevant information.
the works of Markert et al. [30], who provide explanations that account for the most influential elements in MFCCs, as well as Trinh and Mandel [29] and Becker et al. [27], who base the explanations on spectrograms
Trinh and Mandel [29] demonstrated that neural ASR models focus on high-energy time-frequency regions for transcription
Becker et al. [27] found that lower frequency ranges, typically associated with pitch, exhibit higher attribution scores in speaker gender classification tasks [27], showing some alignment with human speech processing.
I guess lower pitch is the actual vocal cords' resonance and the higher pitch (as in a mel-spectrogram) corresponds to the formants and harmonics?
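For reference, a minimal sketch of this kind of frequency-band attribution analysis (not the cited authors' code): a hypothetical PyTorch classifier over log-mel spectrograms, gradient-times-input saliency on the input, and an aggregation per mel band to see which frequency ranges receive the highest attribution.

```python
import torch

# Hypothetical spectrogram classifier: (batch, 1, n_mels, n_frames) -> class logits
model = torch.nn.Sequential(
    torch.nn.Conv2d(1, 16, kernel_size=3, padding=1),
    torch.nn.ReLU(),
    torch.nn.AdaptiveAvgPool2d(1),
    torch.nn.Flatten(),
    torch.nn.Linear(16, 2),   # e.g., a binary speaker gender classifier
)
model.eval()

# Stand-in log-mel spectrogram (80 mel bands, 300 frames)
log_mel = torch.randn(1, 1, 80, 300, requires_grad=True)

# Gradient x input attribution for the predicted class
logits = model(log_mel)
logits[0, logits.argmax()].backward()
attribution = (log_mel.grad * log_mel.detach()).squeeze()   # (80, 300)

# Aggregate over time to see which mel bands (frequency ranges) matter most
per_band = attribution.abs().mean(dim=1)
top_bands = per_band.topk(5).indices.tolist()
print("Most influential mel bands:", top_bands)
```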
Granularity of Explanations
Explanations should be obtained with low-level units to avoid biasing them towards human understanding.
Wu et al. [32] and Pastor et al. [28] resort to the alignment of audio to text, for individual phonemes and words respectively, and apply explainability techniques to such units. While this approach helps decipher the contribution of input features based on more intuitive linguistic units, it diverges from how current models process speech features in small frames and samples [43]. This divergence risks overlooking the model's behavior and compromises the accuracy and effectiveness of the explanations. For instance, whether ASR systems rely on shorter or longer time intervals than individual words remains unclear [29]. Therefore, analyzing this aspect requires a more granular approach at the time level.
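To illustrate the trade-off, a minimal sketch with made-up inputs (not the cited works' code) that collapses frame-level attribution scores into word-level scores given a forced alignment; the intuitive word-level view is exactly what discards the frame-level granularity the model actually operates on.

```python
import numpy as np

# Assumed inputs: one attribution score per 10 ms frame, plus a forced alignment
# of words to time spans in seconds (both are stand-ins for real outputs)
hop_seconds = 0.01
frame_scores = np.abs(np.random.randn(500))
alignment = [("you", 0.10, 0.32), ("got", 0.32, 0.55),
             ("the", 0.55, 0.68), ("joke", 0.68, 1.10), ("right", 1.20, 1.65)]

# Word-level explanation: mean attribution over the frames each word spans
word_scores = {}
for word, start, end in alignment:
    first, last = int(start / hop_seconds), int(end / hop_seconds)
    word_scores[word] = float(frame_scores[first:last].mean())

for word, score in sorted(word_scores.items(), key=lambda kv: -kv[1]):
    print(f"{word:>6s}: {score:.3f}")
```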