Title: LLM supervised Pre-training for Multimodal Emotion Recognition in Conversations
Authors: Soumya Dutta, Sriram Ganapathy
Published: 20th January 2025 (Monday) @ 12:56:02
Link: http://arxiv.org/abs/2501.11468v1

Abstract

Emotion recognition in conversations (ERC) is challenging due to the multimodal nature of the emotion expression. In this paper, we propose to pretrain a text-based recognition model from unsupervised speech transcripts with LLM guidance. These transcriptions are obtained from a raw speech dataset with a pre-trained ASR system. A text LLM model is queried to provide pseudo-labels for these transcripts, and these pseudo-labeled transcripts are subsequently used for learning an utterance level text-based emotion recognition model. We use the utterance level text embeddings for emotion recognition in conversations along with speech embeddings obtained from a recently proposed pre-trained model. A hierarchical way of training the speech-text model is proposed, keeping in mind the conversational nature of the dataset. We perform experiments on three established datasets, namely, IEMOCAP, MELD, and CMU-MOSI, where we illustrate that the proposed model improves over other benchmarks and achieves state-of-the-art results on two out of these three datasets.


Proposed Approach / Method

Block diagram of the proposed model. The pre-training stage is shown in the grey box at the top. An ASR system is used to generate the transcripts for the pre-training data, which are annotated by a large language model (LLM) as positive, negative or neutral sentiment. These “silver” labels, together with the text transcripts, form the supervised training dataset for the RoBERTa-large model. A frozen CARE model [35] is used for extracting audio embeddings. Both the text and speech embeddings are thus derived only from unsupervised data. The MERITS-L model is trained in three stages (denoted as Stage I, II and III in the diagram), wherein the models trained in a particular stage are kept frozen for subsequent stages.
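
As a rough illustration of the pre-training stage described in the caption above, the sketch below fine-tunes RoBERTa-large on LLM-generated “silver” sentiment labels attached to ASR transcripts. The file name, column names and hyperparameters are assumptions for illustration, not the paper's exact configuration.

```python
# Minimal sketch of the silver-label pre-training stage (Stage I):
# fine-tune RoBERTa-large on ASR transcripts annotated by an LLM as
# positive / negative / neutral. File layout and hyperparameters are
# illustrative assumptions.
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

LABELS = {"positive": 0, "negative": 1, "neutral": 2}

tokenizer = AutoTokenizer.from_pretrained("roberta-large")
model = AutoModelForSequenceClassification.from_pretrained(
    "roberta-large", num_labels=len(LABELS))

# Assumed CSV with columns "transcript" (Whisper output) and
# "silver_label" (GPT-3.5 Turbo sentiment).
data = load_dataset("csv", data_files={"train": "msp_podcast_silver.csv"})
data = data["train"].train_test_split(test_size=0.2, seed=42)  # 80/20 split

def preprocess(batch):
    enc = tokenizer(batch["transcript"], truncation=True, max_length=128)
    enc["labels"] = [LABELS[l] for l in batch["silver_label"]]
    return enc

data = data.map(preprocess, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="roberta_silver",
                           num_train_epochs=3,
                           per_device_train_batch_size=32),
    train_dataset=data["train"],
    eval_dataset=data["test"],
)
trainer.train()
```

The utterance-level text embeddings used in the later stages would then be taken from this fine-tuned encoder, kept frozen as indicated in the block diagram.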

Experimental Setup

  • Pre-training data: The MSP-PODCAST corpus is used for the task of pre-training
    • Building Naturalistic Emotionally Balanced Speech Corpus by Retrieving Emotional Speech from Existing Podcast Recordings
      • A total of 149,307 speech turns, amounting to 230 hours of emotional speech data, are used.
      • Of these samples, 80% are randomly chosen as the training set, while the remaining 20% serve as the validation set.
    • Whisper-large-v3 is used for generating the transcripts
    • These transcripts are annotated using GPT-3.5 Turbo
      • Once the transcripts are generated by the Whisper model, GPT-3.5 Turbo is prompted with: “You are a sentiment classification bot. Given the [sentence], classify as positive, negative or neutral sentiment. Please give the sentiment and no extra text as output.” (a sketch of this annotation step follows the list below)
  • Emotion recognition in conversations (ERC) fine-tuning datasets
    • IEMOCAP dataset
    • MELD
    • CMU-MOSI
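
A rough sketch of the transcript generation (Whisper-large-v3) and silver-label annotation (GPT-3.5 Turbo) step is shown below. Apart from the prompt text quoted above, the model identifiers, file path and helper structure are assumptions for illustration.

```python
# Sketch of the pre-training data annotation step: transcribe a speech
# turn with Whisper-large-v3, then ask GPT-3.5 Turbo for a sentiment
# label using the prompt quoted in the notes above.
from transformers import pipeline
from openai import OpenAI

asr = pipeline("automatic-speech-recognition", model="openai/whisper-large-v3")
client = OpenAI()  # expects OPENAI_API_KEY in the environment

PROMPT = ("You are a sentiment classification bot. Given the [sentence], "
          "classify as positive, negative or neutral sentiment. Please give "
          "the sentiment and no extra text as output.")

def silver_label(wav_path: str) -> tuple[str, str]:
    """Transcribe one speech turn and query the LLM for its sentiment."""
    transcript = asr(wav_path)["text"]
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "system", "content": PROMPT},
                  {"role": "user", "content": transcript}],
        temperature=0,
    )
    return transcript, response.choices[0].message.content.strip().lower()

# Example usage on a single speech turn (the path is hypothetical):
# text, label = silver_label("msp_podcast/turn_000001.wav")
```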

Ablations

Importance of hierarchical training

To understand the impact of hierarchical training, we combine Stages II and III of MERITS-L. The effect of this training strategy is shown in Fig. 4. The change affects IEMOCAP the most, where the performance of MERITS-L drops from 86.48% to 82.91%. A performance drop of around 2% in absolute terms is also observed for MELD and CMU-MOSI. The benefit of hierarchical training arises from the fact that end-to-end training of the different components of the model often leads to over-fitting, as these datasets are relatively small in size.
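
The hierarchical recipe amounts to freezing all modules trained in earlier stages before optimising the conversation-level model, rather than updating everything end-to-end. The PyTorch sketch below illustrates this idea; the module shapes and names (text_encoder, speech_encoder, conversation_model) are placeholders, not the actual MERITS-L architecture.

```python
# Minimal sketch of the hierarchical (staged) training idea from the
# ablation: modules trained in earlier stages are frozen before the
# conversation-level model is optimised, instead of training all
# components end-to-end. Module shapes/names are illustrative only.
import torch
import torch.nn as nn

def freeze(module: nn.Module) -> None:
    """Stop gradient updates for a module trained in an earlier stage."""
    module.eval()
    for p in module.parameters():
        p.requires_grad_(False)

# Placeholder utterance-level encoders standing in for the Stage I/II models.
text_encoder = nn.Linear(1024, 256)    # e.g. head on RoBERTa-large features
speech_encoder = nn.Linear(768, 256)   # e.g. head on frozen CARE features
conversation_model = nn.GRU(512, 256, batch_first=True)  # Stage III model

# Stage III: earlier-stage modules are frozen, so only the
# conversation-level parameters receive gradient updates.
freeze(text_encoder)
freeze(speech_encoder)
optimizer = torch.optim.Adam(conversation_model.parameters(), lr=1e-4)
```

Collapsing the stages would instead place all of these parameters into one optimiser, which is the end-to-end setting the ablation finds prone to over-fitting on the relatively small ERC datasets.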