Title: SpiRit-LM: Interleaved Spoken and Written Language Model
Authors: Tu Anh Nguyen, Benjamin Muller, Bokai Yu, Marta R. Costa-jussa, Maha Elbayad, Sravya Popuri, Paul-Ambroise Duquenne, Robin Algayres, Ruslan Mavlyutov, Itai Gat, Gabriel Synnaeve, Juan Pino, Benoit Sagot, Emmanuel Dupoux
Published: 8th February 2024 (Thursday) @ 15:39:32
Link: http://arxiv.org/abs/2402.05755v1
Abstract
We introduce SPIRIT-LM, a foundation multimodal language model that freely mixes text and speech. Our model is based on a pretrained text language model that we extend to the speech modality by continuously training it on text and speech units. Speech and text sequences are concatenated as a single set of tokens, and trained with a word-level interleaving method using a small automatically-curated speech-text parallel corpus. SPIRIT-LM comes in two versions: a BASE version that uses speech semantic units and an EXPRESSIVE version that models expressivity using pitch and style units in addition to the semantic units. For both versions, the text is encoded with subword BPE tokens. The resulting model displays both the semantic abilities of text models and the expressive abilities of speech models. Additionally, we demonstrate that SPIRIT-LM is able to learn new tasks in a few-shot fashion across modalities (i.e. ASR, TTS, Speech Classification).
- Weights: Requested access via Meta's form for Spirit LM (default email address)
- Code: https://github.com/facebookresearch/spiritlm
- Released as part of the broader Advancing Machine Intelligence release on 18th Oct. 2024: "Sharing new research, models, and datasets from Meta FAIR"
Baselines §4.1
- Speech-only baselines:
- Speech-text baselines:
- Cascaded models - top line scores / best expected performance
- Text-to-text: Llama 2-7B
- Speech-to-speech: ASR Whisper + Llama 2-7B + MMS-TTS
Spirit LM Expressive (Expressive Speech Tokenization; §3.2)
- Mix 3 types of tokens into a single sequence of tokens by sorting the tokens with their corresponding timestamps (Figure 1.c)
- HuBERT tokens @ 25 Hz
- pitch tokens @ 12.5 Hz
- style tokens @ 1 Hz
- deduplicate HuBERT and pitch tokens
- Example input sequence:
[SPEECH][St10][Pi0][Hu28][Hu22][Pi14][Hu15] [Pi32][Hu78][Hu234][Hu468]
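A minimal sketch of this interleaving step: deduplicate the HuBERT and pitch streams, then merge all three streams into one sequence ordered by timestamp. The token values and the tie-breaking order (style, then pitch, then HuBERT at equal timestamps) are assumptions chosen to match the example above, not the reference implementation.

```python
def dedup(tokens):
    """Collapse consecutive repeated tokens, keeping the timestamp of the first occurrence."""
    out = []
    for t, tok in tokens:
        if not out or out[-1][1] != tok:
            out.append((t, tok))
    return out

def interleave_expressive(hubert, pitch, style):
    """Each argument is a list of (timestamp_seconds, token_string).
    HuBERT tokens arrive at 25 Hz, pitch at 12.5 Hz, style at 1 Hz."""
    # style tokens are not deduplicated; concatenation order breaks ties after the stable sort
    merged = style + dedup(pitch) + dedup(hubert)
    merged.sort(key=lambda x: x[0])                 # single stream ordered by time
    return ["[SPEECH]"] + [tok for _, tok in merged]

# Toy example (token ids are made up):
hubert = [(0.00, "[Hu28]"), (0.04, "[Hu22]"), (0.08, "[Hu22]"), (0.12, "[Hu15]")]
pitch  = [(0.00, "[Pi0]"),  (0.08, "[Pi14]")]
style  = [(0.00, "[St10]")]
print("".join(interleave_expressive(hubert, pitch, style)))
# [SPEECH][St10][Pi0][Hu28][Hu22][Pi14][Hu15]
```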
First pass notes
- Comment: Speech LMs trail text-only LMs in capturing semantics when trained on comparable amounts of data
- Nguyen 2020 and 2023b
- Expressivity evaluation relies on a speech resynthesis task - "measures how well the resynthesized speech is compared with the original audio in terms of preserved content, expressive style, and pitch"
- Nguyen 2023a
- Data:
- Unimodal text
- Unimodal speech
- aligned speech-text with [Speech] / [Text] tags - inputs look like this:
"[TEXT]this is a text sentence"
or"[SPEECH][Hu262][Hu208][Hu499][Hu105]"
- They use Pratap et al. (2023) for word-level alignment of speech + text
- code: https://pytorch.org/audio/main/tutorials/ctc_forced_alignment_api_tutorial.html - PyTorch tutorial!
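A hypothetical sketch of how word-level interleaving could be assembled from such an alignment, i.e. a list of (word, start, end) spans like the ones the torchaudio forced-alignment tutorial produces. The modality-switching policy and the `speech_tokens_for_span` helper are illustrative assumptions, not the paper's exact curation recipe.

```python
import random

def interleave_words(alignment, speech_tokens_for_span, p_switch=0.3):
    """alignment: list of (word, start_s, end_s) tuples from forced alignment.
    speech_tokens_for_span: callable mapping (start_s, end_s) -> list of [Hu*] tokens."""
    modality = random.choice(["TEXT", "SPEECH"])
    out = [f"[{modality}]"]
    for word, start, end in alignment:
        if random.random() < p_switch:               # flip modality at a word boundary
            modality = "SPEECH" if modality == "TEXT" else "TEXT"
            out.append(f"[{modality}]")
        if modality == "TEXT":
            out.append(word)                         # keep the written word
        else:
            out.extend(speech_tokens_for_span(start, end))  # use the spoken span's units
    return out
```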
- Use HuBERT tokens + pitch and style tokens from Kharitonov 2022
- Pitch tokens from Polyak et al. 2021 - code: https://github.com/facebookresearch/speech-resynthesis
- Style tokens from Duquenne et al. 2023
- Expresso dataset
- For speech decoder used HiFi-GAN trained on Expresso, conditioned on Hubert
- SpiRit-LM
- condition speech decoder - HiFi-GAN - on HuBERT tokens + 1-hot speaker embedding (Expresso voices)
- SpiRit-LM-Expressive
- condition speech decoder - HiFi-GAN - on HuBERT tokens + 1-hot speaker embedding (Expresso voices) + pitch tokens + style tokens
- Optimisation: Add new embeddings for speech tokens to Llama 2 and continue pre-training with final learning rate (3.0e-5)
- RoPE embeddings as in Llama 2 - with increased base frequency 1e5 instead of 1e4 - benefits long-context modelling
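A minimal sketch of this model surgery, assuming a recent Hugging Face transformers version and a Llama 2 checkpoint; the exact speech-token list and the training loop are omitted.

```python
from transformers import LlamaForCausalLM, LlamaTokenizer

tokenizer = LlamaTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
# raise the RoPE base frequency to 1e5 (default 1e4) while loading
model = LlamaForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf", rope_theta=1e5)

# modality tags + 501 HuBERT semantic units (the Expressive variant would also
# add the pitch and style tokens)
speech_tokens = ["[TEXT]", "[SPEECH]"] + [f"[Hu{i}]" for i in range(501)]
tokenizer.add_tokens(speech_tokens)
model.resize_token_embeddings(len(tokenizer))  # new embedding rows, randomly initialised

# ...then continue pre-training on text, speech, and interleaved sequences
```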
- Evaluation
- Speech only: sWUGGY, sBLIMP, StoryCloze, and speech classification tasks
- Topic-StoryCloze
- MMLU
- Evaluation - Speech-to-Text and Text-to-Speech tasks
- use LibriSpeech clean and other test sets
- ASR for SpiRit-LM
- Word error rate (WER) between generated and gold transcriptions
- Text-to-speech: system's ability to render the input text as speech - synthesize, transcribe with Whisper, and compare the character error rate (CER) of the transcript vs. the original text
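A toy sketch of these two metrics using the jiwer package (the toolkit choice is an assumption; the strings below stand in for real model outputs):

```python
import jiwer

reference_transcripts = ["the cat sat on the mat"]   # gold LibriSpeech transcripts
asr_outputs = ["the cat sat on a mat"]               # SpiRit-LM speech -> text generations
tts_inputs = ["hello world"]                         # text prompts for the TTS direction
whisper_transcripts = ["hello word"]                 # Whisper transcripts of the generated audio

print("ASR WER:", jiwer.wer(reference_transcripts, asr_outputs))
print("TTS CER:", jiwer.cer(tts_inputs, whisper_transcripts))
```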
Briefly, sWUGGY measures if the model can discriminate between existing spoken words and non-words (e.g., "brick" vs. "blick"). sBLIMP measures if the model can distinguish between a spoken grammatically correct sentence and an ungrammatical spoken variant of the same sentence (e.g., "cats are lazy" vs. "cats is lazy").
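A sketch of how such pairwise zero-shot metrics are typically scored: the model gets an item right if it assigns a higher (length-normalised) log-likelihood to the real word / grammatical sentence than to the foil. The GPT-2 text LM below is only a placeholder; in the paper the comparison is done on tokenized speech.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")            # placeholder LM for illustration
lm = AutoModelForCausalLM.from_pretrained("gpt2").eval()

@torch.no_grad()
def loglik(text):
    ids = tok(text, return_tensors="pt").input_ids
    logits = lm(ids).logits[:, :-1]                    # predict token t+1 from its prefix
    lp = F.log_softmax(logits, dim=-1)
    token_lp = lp.gather(-1, ids[:, 1:, None]).squeeze(-1)
    return token_lp.mean().item()                      # length-normalised log-likelihood

pairs = [("cats are lazy", "cats is lazy")]            # (correct, foil)
accuracy = sum(loglik(good) > loglik(bad) for good, bad in pairs) / len(pairs)
print(accuracy)
```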
Ablations:
- Interleaving is essential - experiments presenting concatenated inputs at sentence- or word-level
- layer-wise speech/text feature similarity in the network is higher with interleaving than without, e.g. ~0.6 vs ~0.2 cosine similarity (see the sketch after this list)
- just continuing pre-training on speech tokens is bad - 6 point drop in spoken StoryCloze
- Note / Intuition: modelling raw speech is more costly for SpiRit-LM-Expressive, since the pitch and style tokens it adds to the input / output require extra embeddings and lengthen the sequence
- explanation / intuition offered for the slight degradation of SpiRit-LM-Expressive relative to SpiRit-LM on (s)WUGGY, (s)BLIMP, (Topic-)StoryCloze and MMLU (Table 4 of the paper)
- Cross-modal StoryCloze: performance on text is always better, irrespective of whether prompt was speech or text
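A rough sketch of the layer-wise similarity probe mentioned above: feed the same utterance once as text tokens and once as speech tokens, average each layer's hidden states over time, and compare with cosine similarity. This mirrors the idea rather than the paper's exact protocol; the model and the two token-id tensors are placeholders.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def layerwise_similarity(model, text_ids, speech_ids):
    """text_ids / speech_ids: (1, T) token ids for the same utterance in each modality."""
    out_t = model(text_ids, output_hidden_states=True)
    out_s = model(speech_ids, output_hidden_states=True)
    sims = []
    for h_t, h_s in zip(out_t.hidden_states, out_s.hidden_states):
        v_t = h_t.mean(dim=1)                      # (1, d): average over the token dimension
        v_s = h_s.mean(dim=1)
        sims.append(F.cosine_similarity(v_t, v_s).item())
    return sims                                    # one value per layer (~0.6 vs ~0.2 per the paper)
```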
Responsible Evaluation in Speech and Text:
- evaluate with MuTox and eTox like for Seamless M4T / Seamless Communication
- Supply a prompt with the template "What do you think about [PLURAL NOUN PHRASE]", where the noun phrase is an adjective plus a noun (e.g. "disabled parents")
- systems produce toxic content in response to this prompt - report the mean percentage of toxic generations (see the sketch after this list)
- SpiRit-LM produces more toxic content in S → S than in T → T
- more toxic content in the speech training dataset - they leave mitigation to future work
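A small sketch of the added-toxicity protocol as described above, with placeholder `generate` and `is_toxic` callables standing in for the model and the MuTox/eTox classifiers:

```python
def added_toxicity_rate(phrases, generate, is_toxic):
    """phrases: plural noun phrases (adjective + noun).
    generate / is_toxic: placeholder callables for the LM and the toxicity classifier."""
    prompts = [f"What do you think about {p}" for p in phrases]
    flags = [is_toxic(generate(prompt)) for prompt in prompts]
    return 100.0 * sum(flags) / len(flags)   # mean percentage of toxic generations
```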
Design of the Speech Encoder
§3.1: Speech Encoder:
We use the same HuBERT model as in TWIST (Hassid et al., 2023), which is trained on a mixture of datasets: Multilingual LibriSpeech (Pratap et al., 2020), Vox Populi (Wang et al., 2021), Common Voice (Ardila et al., 2020), Spotify (Clifton et al., 2020), and Fisher (Cieri et al., 2004).
The HuBERT model was trained for 4 iterations, with a downsampling factor of 640, resulting in a sample rate of 25 Hz.
For the quantization, we utilized the 500 k-means units from TWIST as base units and trained a feed-forward quantizer using the data-invariant augmentation technique from Gat et al. (2023).
We finally obtained a vocabulary of 501 semantic speech tokens.
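A hedged sketch of the nearest-centroid quantization plus deduplication step that turns continuous HuBERT frames into unit tokens. The real pipeline uses the TWIST HuBERT and the feed-forward quantizer of Gat et al. (2023); this only illustrates the codebook-assignment idea.

```python
import torch

def quantize_and_dedup(features, centroids):
    """features: (T, D) HuBERT frames at 25 Hz; centroids: (500, D) k-means codebook."""
    dists = torch.cdist(features, centroids)      # (T, 500) pairwise distances
    units = dists.argmin(dim=-1)                  # nearest centroid per frame
    keep = torch.ones_like(units, dtype=torch.bool)
    keep[1:] = units[1:] != units[:-1]            # drop consecutive duplicates
    return units[keep]                            # sequence of [Hu*] unit ids
```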