Title: Leveraging Content and Acoustic Representations for Speech Emotion Recognition
Authors: Soumya Dutta, Sriram Ganapathy
Published: 9 September 2024
Link: http://arxiv.org/abs/2409.05566v2

Abstract

Speech emotion recognition (SER), the task of identifying the expression of emotion from spoken content, is challenging due to the difficulty in extracting representations that capture emotional attributes from speech. The scarcity of labeled datasets further complicates the challenge where large models are prone to over-fitting. In this paper, we propose CARE (Content and Acoustic Representations of Emotions), where we design a dual encoding scheme which emphasizes semantic and acoustic factors of speech. While the semantic encoder is trained using distillation from utterance-level text representations, the acoustic encoder is trained to predict low-level frame-wise features of the speech signal. The proposed dual encoding scheme is a base-sized model trained only on unsupervised raw speech. With a simple light-weight classification model trained on the downstream task, we show that the CARE embeddings provide effective emotion recognition on a variety of datasets. We compare the proposal with several other self-supervised models as well as recent large-language model based approaches. In these evaluations, the proposed CARE is shown to be the best performing model based on average performance across 8 diverse datasets. We also conduct several ablation studies to analyze the importance of various design choices.


Nice section outlining previous work on Self-supervision for Speech Emotion Recognition (“SER”):

One of the earliest self-supervised models for the task of speech emotion recognition was proposed by Pascual et al. [27]. It processes the speech signal with the SincNet model [24], followed by trainable convolutional blocks that predict a number of speech features such as the waveform, mel-frequency cepstral coefficients (MFCCs), and pitch. Ravanelli et al. [23] further extended this model with additional self-supervised tasks, such as predicting FBANK and Gammatone features [28], to develop the PASE+ model. Among the general-purpose speech SSL models proposed over the years, WavLM [16] was shown to outperform other models such as HuBERT [15] and wav2vec2.0 [14] for emotion recognition. Vesper [20] used a modified masking strategy to emphasize high-pitch/high-energy regions of speech (known indicators of emotion) and derived targets for these masked regions from a WavLM teacher model, which allowed a smaller student model to be learned with enhanced SER performance. A similar strategy was employed by Ma et al. in emotion2vec [21], which uses a pre-trained data2vec model as the teacher; emotion2vec [21] also learns a global embedding to enhance SER performance.

[back to their work] In contrast, the proposed CARE model integrates semantic content along with acoustic features to perform emotion recognition. A distillation loss from a text model enables the semantic encoding in a parameter-efficient way.

Also a really nice quick summary of speech language models and Speech-text Aligned Representations.

The alignment of speech and text modalities has received renewed attention for speech representation learning.

  • The SONAR model [29] aligns a speech encoder with textual representations at the utterance level.

With the increasing prominence of large language models (LLMs), recent approaches have integrated speech encoders with LLMs. Notably:

  • the SALMONN model by Tang et al. [30] introduced an audio encoder consisting of the Whisper model and a music encoder, along with the LLaMA language model [31].
  • Hu et al. [32] proposed WavLLM, combining Whisper and WavLM encoders with the LLaMA model.

These LLM-based approaches harness aligned speech-text representations and enable prompt-based applications. However, their substantial model sizes (e.g., 7B parameters for SALMONN [30]) present significant computational demands for both training and inference. In contrast, the CARE model achieves superior performance on various downstream datasets with a much smaller size of 160M parameters.

Figure 1 - Note: this figure is basically the headline result as well (y-axis): CARE outperforms much larger models (e.g. SALMONN 13B), where those models are evaluated by taking their representations and training classification heads on top (rather than prompting SALMONN zero-shot, which the authors tried first but found to give very inconsistent results).
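For context, a minimal PyTorch sketch of that evaluation recipe: keep the pre-trained encoder frozen and train only a light-weight classification head on its pooled embeddings (the abstract describes the same setup for CARE). The encoder interface, head sizes, pooling, and hyper-parameters below are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class LightweightHead(nn.Module):
    """Small classifier trained on top of frozen speech-emotion embeddings."""
    def __init__(self, emb_dim: int = 768, num_emotions: int = 4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(emb_dim, 256),
            nn.ReLU(),
            nn.Dropout(0.1),
            nn.Linear(256, num_emotions),
        )

    def forward(self, utterance_emb: torch.Tensor) -> torch.Tensor:
        return self.net(utterance_emb)

def train_step(frozen_encoder, head, optimizer, wav_batch, labels):
    """One update of the head; the upstream encoder is never fine-tuned."""
    with torch.no_grad():                       # frozen representations
        frames = frozen_encoder(wav_batch)      # assumed shape (B, T, emb_dim)
        utt_emb = frames.mean(dim=1)            # simple mean-pool over time
    logits = head(utt_emb)
    loss = nn.functional.cross_entropy(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```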

Their angle / “USP”

The landscape of various SER methods is summarized in Fig. 1. We highlight a clear gap in current modeling frameworks: models either prioritize efficiency with limited performance (those at the lower end of the x-axis), or focus on maximizing performance with increased memory and compute requirements (typically based on LLMs). To address this gap, we propose CARE, which combines the computational efficiency of smaller models with the high performance of large-scale systems, thereby providing a superior trade-off between efficiency and performance.

Proposed Approach / Method


Block diagram of the proposed CARE model. An ASR system is used to generate the transcripts for the pre-training data, which are passed through a pre-trained RoBERTa model to generate targets for the semantic encoder. The acoustic encoder of the model is trained with PASE+ features as targets. Blocks in blue indicate either frozen components or those with no learnable parameters. For the semantic encoder, the transformer layers are frozen while the convolutional adapters are trained. As the output of the acoustic encoder has dimension 768, an FC layer is attached to match the PASE+ feature dimension of 256. This FC layer and the average-pool block after the semantic encoder are not used during inference.
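The caption above describes how the semantic targets are produced: ASR transcripts are passed through a frozen pre-trained RoBERTa and pooled into a single utterance vector. A minimal Hugging Face sketch of that target-generation step, assuming the roberta-base checkpoint and masked token-level mean-pooling (the paper may pool at the word level and may use a different checkpoint):

```python
import torch
from transformers import RobertaModel, RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
roberta = RobertaModel.from_pretrained("roberta-base").eval()  # frozen teacher

@torch.no_grad()
def utterance_target(transcript: str) -> torch.Tensor:
    """Mean-pool RoBERTa's contextual embeddings of an ASR transcript into one 768-dim vector."""
    inputs = tokenizer(transcript, return_tensors="pt", truncation=True)
    hidden = roberta(**inputs).last_hidden_state         # (1, T, 768)
    mask = inputs["attention_mask"].unsqueeze(-1)         # (1, T, 1)
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)   # masked mean over tokens
```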

  • Semantic “supervision”: they extract contextual word-level embeddings from the transcripts using a pre-trained RoBERTa model [33] and mean-pool these embeddings to obtain a single feature vector representing the entire transcript
    • motivation for doing sentence-level: a sentence-level representation of the text is more appropriate for the task of emotion recognition, as established by Fan et al. [35]
    • These utterance-level embeddings serve as the supervisory signal, or “teacher”, for the semantic encoder in the CARE model
  • Acoustic “supervision”: a frame-level target is chosen for the acoustic encoder
    • mean-pooled speech representations can capture a wide range of properties—such as speaker identity, accent, and language—beyond emotion alone [36]. Moreover, emotion in speech is often contained in the change of different parameters such as pitch, rhythm or energy from one frame to another [6].
    • a straightforward approach for these frame-level acoustic targets would be to mask parts of the speech signal and reconstruct them, an approach adopted in both Vesper [20] and emotion2vec [21]
      • however, random masking is less effective for emotion recognition than selectively masking high-energy or high-pitch regions, as demonstrated by Chen et al. [20].
    • This highlights the role of low-level speech descriptors—such as filter-bank energies and pitch—as they are rich with emotional cues.
    • we choose to predict PASE+ features, which encompass filter-bank energies, pitch, and other low-level descriptors essential for capturing emotion.
    • Specifically, we use frame-level PASE+ features with 256 dimensions as targets for the acoustic encoder in our CARE model.
      • These features are down-sampled by a factor of 2, producing target descriptors at a frequency of 50 Hz.
  • MSE losses are used for both objectives: a frame-wise acoustic loss (PASE+ targets vs. acoustic-encoder output) and an utterance/sentence-level semantic loss (RoBERTa embedding vs. pooled semantic-encoder output); see the sketch below
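Putting the two objectives together, a minimal sketch of the pre-training loss as described above. Only the 768-to-256 projection, the 50 Hz / 256-dim PASE+ targets, and the average pooling of the semantic branch come from the notes; module names and the equal weighting of the two MSE terms are my assumptions.

```python
import torch
import torch.nn as nn

class CAREPretrainLoss(nn.Module):
    """Dual MSE objective: utterance-level semantic distillation + frame-level acoustic regression."""
    def __init__(self, enc_dim: int = 768, pase_dim: int = 256):
        super().__init__()
        # FC layer mapping acoustic-encoder frames to the PASE+ dimension;
        # per the figure caption, it is discarded at inference time.
        self.acoustic_proj = nn.Linear(enc_dim, pase_dim)
        self.mse = nn.MSELoss()

    def forward(self, semantic_frames, acoustic_frames, roberta_utt, pase_targets):
        # semantic_frames: (B, T, enc_dim)  output of the semantic encoder
        # acoustic_frames: (B, T, enc_dim)  output of the acoustic encoder
        # roberta_utt:     (B, enc_dim)     mean-pooled RoBERTa embedding of the transcript
        # pase_targets:    (B, T, pase_dim) PASE+ descriptors down-sampled to 50 Hz
        sem_utt = semantic_frames.mean(dim=1)              # average pool over time
        loss_semantic = self.mse(sem_utt, roberta_utt)     # utterance-level distillation
        loss_acoustic = self.mse(self.acoustic_proj(acoustic_frames), pase_targets)
        return loss_semantic + loss_acoustic               # equal weighting assumed
```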

Results

Results Summary - Table II

Comparison with other works for the downstream datasets. # indicates the models where the downstream dataset is contained in the pre-training dataset. Results in bold indicate the best performing model, while those underlined indicate the second-best model. All numbers are weighted F1-scores computed over 5 random initializations (mean and standard deviation shown). The number of parameters used during inference is also mentioned. Vesper [20] is not compared due to unavailability of the model weights.
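A small sketch of how those numbers are presumably aggregated: weighted F1 on the test split, repeated over 5 random initializations, reported as mean and standard deviation. The train_and_predict(seed) callable is a hypothetical stand-in for one full downstream training run.

```python
import numpy as np
from sklearn.metrics import f1_score

def evaluate_over_seeds(train_and_predict, seeds=(0, 1, 2, 3, 4)):
    """train_and_predict(seed) -> (y_true, y_pred) on the test split; returns mean/std of weighted F1."""
    scores = []
    for seed in seeds:
        y_true, y_pred = train_and_predict(seed)
        scores.append(f1_score(y_true, y_pred, average="weighted"))
    return float(np.mean(scores)), float(np.std(scores))
```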

Downstream Datasets

  1. IEMOCAP: video conversations with 10 emotion labels - they merge these (per previous similar work) into four categories: “angry”, “happy”, “sad” and “neutral”, with “excited” folded into “happy” (see the label-mapping sketch after this list)
  2. MELD: video clippings from Friends (yes, the 90s sitcom) with 7-way emotion labels
  3. CMU-MOSI: utterances labelled with sentiment - they binarise this (neg vs positive)
  4. DAIC-WOZ: benchmark for depression detection - they do some sampling to correct data imbalance
  5. RAVDESS-SONG: 1012 song recordings by 23 different singers. Each recording in this dataset is sung in one of six different emotions, namely, “neutral”, “calm”, “happy”, “sad”, “angry” and “fear”
  6. CaFE: Canadian French emotion recognition with 7-way labels
  7. EmoDB: 535 utterances from 10 speakers in German with 7-way emotion labels
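As referenced in item 1, a tiny sketch of the label pre-processing for the first and third datasets (IEMOCAP class merging and CMU-MOSI binarisation). The raw label strings and the zero threshold are assumptions, not details taken from the paper.

```python
# Merge IEMOCAP's categorical labels into the usual 4-way setup ("excited" -> "happy").
IEMOCAP_MERGE = {
    "angry": "angry",
    "happy": "happy",
    "excited": "happy",
    "sad": "sad",
    "neutral": "neutral",
}

def map_iemocap_label(raw_label: str):
    """Return the merged 4-class label, or None for emotions outside this set."""
    return IEMOCAP_MERGE.get(raw_label)

def binarise_mosi(sentiment_score: float) -> str:
    """CMU-MOSI scores lie in [-3, 3]; the treatment of exactly 0 is an assumption here."""
    return "positive" if sentiment_score >= 0 else "negative"
```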