Title: Decoding speech perception from non-invasive brain recordings
Authors: Alexandre Défossez, Charlotte Caucheteux, Jérémy Rapin, Ori Kabeli, Jean-Rémi King
Published: 2023-10-05
Link: https://www.nature.com/articles/s42256-023-00714-5
Abstract
Decoding speech from brain activity is a long-awaited goal in both healthcare and neuroscience. Invasive devices have recently led to major milestones in this regard: deep-learning algorithms trained on intracranial recordings can now start to decode elementary linguistic features such as letters, words and audio-spectrograms. However, extending this approach to natural speech and non-invasive brain recordings remains a major challenge. Here we introduce a model trained with contrastive learning to decode self-supervised representations of perceived speech from the non-invasive recordings of a large cohort of healthy individuals. To evaluate this approach, we curate and integrate four public datasets, encompassing 175 volunteers recorded with magneto-encephalography or electro-encephalography while they listened to short stories and isolated sentences. The results show that our model can identify, from 3 seconds of magneto-encephalography signals, the corresponding speech segment with up to 41% accuracy out of more than 1,000 distinct possibilities on average across participants, and with up to 80% in the best participants—a performance that allows the decoding of words and phrases absent from the training set. The comparison of our model with a variety of baselines highlights the importance of a contrastive objective, pretrained representations of speech and a common convolutional architecture simultaneously trained across multiple participants. Finally, the analysis of the decoder’s predictions suggests that they primarily depend on lexical and contextual semantic representations. Overall, this effective decoding of perceived speech from non-invasive recordings delineates a promising path to decode language from brain activity, without putting patients at risk of brain surgery.
Main result
The model identifies, from 3 seconds of non-invasive recordings of brain activity, the corresponding speech segment with up to 41% accuracy out of more than 1,000 distinct possibilities, on average across participants. This performance, reaching up to 80% in the best participants, allows the decoding of perceived words and phrases that are absent from the training set.
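This identification is a retrieval task: the decoded brain latent is compared against the latents of all candidate speech segments, and the nearest candidate is selected. A minimal sketch, with illustrative dimensions and synthetic data (not the paper's actual model or latent sizes):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: the decoder maps a 3 s brain recording to a latent
# vector; each candidate speech segment has a latent vector from a frozen
# speech model. Identification = nearest candidate by cosine similarity.
n_candidates, dim = 1000, 256
speech_latents = rng.normal(size=(n_candidates, dim))

def identify(brain_latent, speech_latents):
    """Return the index of the candidate segment whose latent is most
    similar (cosine) to the decoded brain latent."""
    b = brain_latent / np.linalg.norm(brain_latent)
    s = speech_latents / np.linalg.norm(speech_latents, axis=1, keepdims=True)
    return int(np.argmax(s @ b))

# If the decoder output lands close to the true segment's latent,
# retrieval among 1,000 candidates succeeds.
true_idx = 42
brain_latent = speech_latents[true_idx] + 0.1 * rng.normal(size=dim)
assert identify(brain_latent, speech_latents) == true_idx
```

Because the candidate set can contain segments never seen during training, this retrieval formulation is what allows decoding of out-of-vocabulary words and phrases.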
Architecture / Method
- Brain module: a convolutional network, shared across participants apart from a subject-specific layer, that maps M/EEG sensor signals into a latent space
- Speech module: pretrained self-supervised speech representations (wav2vec 2.0)
- Brain and speech representations aligned with a CLIP-style contrastive loss
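The alignment objective can be sketched as a symmetric InfoNCE (CLIP-style) loss: within a batch, each brain latent should be most similar to its matching speech latent and dissimilar to all others. A minimal numpy sketch with illustrative batch size, dimensionality and temperature (the paper's exact hyperparameters are not assumed here):

```python
import numpy as np

def clip_loss(brain_z, speech_z, temperature=0.1):
    """Symmetric contrastive (InfoNCE) loss over a batch of matched
    (brain, speech) latent pairs: row i of each array is one pair."""
    # L2-normalize so the dot product is cosine similarity.
    b = brain_z / np.linalg.norm(brain_z, axis=1, keepdims=True)
    s = speech_z / np.linalg.norm(speech_z, axis=1, keepdims=True)
    logits = b @ s.T / temperature  # (batch, batch) similarity matrix

    def xent_diag(l):
        # Cross-entropy with the diagonal (true pairs) as targets.
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(logp))

    # Average both directions: brain→speech and speech→brain retrieval.
    return 0.5 * (xent_diag(logits) + xent_diag(logits.T))
```

Perfectly aligned pairs yield a low loss, while mismatched pairs yield a high one, which is what pushes the brain module's output toward the pretrained speech representations.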
How is language represented in the brain?
- Separating noise from signal in brain recordings is not the only challenge
- Determining which representations of language are most suitable for decoding is itself an unresolved problem: the nature of these representations in terms of their acoustic, phonetic, lexical and semantic properties remains poorly known.
- Previous studies primarily used supervised models targeting well-defined features of language: individual letters, phonemes or frequency bands of the audio spectrogram (references 12,23,24,72,75,76,77,78,79,80)
- this approach has been successful but bottlenecks decoding speed, for example when the patient must spell words out letter by letter
- Alternative: classify a small set of words (references 26,28,81,82,83)
- difficult to scale to a vocabulary size adequate for natural language
- Word semantics may be directly decoded from functional MRI signals (refs 84,85,86,87,88,89)
- corresponding performances currently remain modest at the single-trial level
Contemporary State of the Art (Invasive Methods)
- Willett et al. [12] decoded 90 characters per minute with a 94% accuracy (roughly 15–18 words per minute) from a patient with a spinal-cord injury, recorded in the motor cortex during 10 hours of writing sessions.
- Moses et al. [13] decoded 15.2 words per minute (with 74.4% accuracy, and using a vocabulary of 50 words) in a patient with anarthria and a BCI implanted in the sensorimotor cortex, recorded over 48 sessions spanning over 22 hours.
- Metzger et al. [18] recently showed that a patient with severe limb and vocal-tract paralysis and a BCI implanted in the sensorimotor cortex could efficiently spell words using a code word that represented each English letter (for example, ‘alpha’ for ‘a’):
- character error rate of 6.13%
- speed of 29.4 characters per minute
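The code-word spelling scheme can be illustrated with a toy decoder that maps recognized code words back to letters. The NATO phonetic alphabet is used here as the code-word list purely for illustration; the study's exact word list is not assumed:

```python
# Toy illustration of code-word spelling: each English letter is cued by a
# spoken code word (e.g. 'alpha' -> 'a'). A recognizer outputs a sequence
# of code words; mapping them back reconstructs the intended text.
NATO = ["alpha", "bravo", "charlie", "delta", "echo", "foxtrot", "golf",
        "hotel", "india", "juliett", "kilo", "lima", "mike", "november",
        "oscar", "papa", "quebec", "romeo", "sierra", "tango", "uniform",
        "victor", "whiskey", "xray", "yankee", "zulu"]
CODE_TO_LETTER = {w: chr(ord("a") + i) for i, w in enumerate(NATO)}

def decode_spelling(code_words):
    """Map a sequence of recognized code words back to the spelled letters."""
    return "".join(CODE_TO_LETTER[w] for w in code_words)

# decode_spelling(["hotel", "india"]) -> "hi"
```

Distinct multi-syllable code words are easier to discriminate from neural activity than single letters, which is what makes this scheme attractive despite the extra words spoken per character.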
Such invasive recordings face a major practical challenge: these high-quality signals require brain surgery and can be difficult to maintain chronically.
Datasets
We test our approach on four public datasets: two based on MEG recordings and two based on EEG recordings. All datasets and their corresponding studies were approved by the relevant ethics committees and are publicly available for fundamental research purposes. Informed consent was obtained from each human research participant. Table 1 provides an overview of the main characteristics of the datasets, including the number of training and test segments and the vocabulary size over both splits. For all datasets, healthy adult volunteers passively listened to speech sounds (accompanied by some memory or comprehension questions to ensure participants were attentive) while their brain activity was recorded with MEG or EEG.

In Schoffelen et al.32, Dutch-speaking participants listened to decontextualized Dutch sentences and word lists (Dutch sentences whose words are randomly shuffled). The study was approved by the local ethics committee (the Committee on Research Involving Human Subjects in the Arnhem–Nijmegen region). The data are publicly and freely available after registration on the Donders Repository.

In Gwilliams et al.33, English-speaking participants listened to four fictional stories from the Masc corpus57 in two identical sessions of 1 hour. The study was approved by the institutional review board ethics committee of New York University Abu Dhabi.

In Broderick et al.58, English-speaking participants listened to extracts of The Old Man and the Sea. The study was approved by the ethics committees of the School of Psychology and the Health Sciences Faculty at Trinity College Dublin.

In Brennan and Hale31, English-speaking participants listened to a chapter of Alice in Wonderland. The study was approved by the University of Michigan Health Sciences and Behavioral Sciences institutional review board (HUM00081060). See Supplementary Section A.1 for more details.
Data availability
The data from Schoffelen et al.32 were provided (in part) by the Donders Institute for Brain, Cognition and Behaviour with a ‘RU-DI-HD-1.0’ licence. The data for Gwilliams et al.33 are available under the CC0 1.0 Universal licence. The data for Broderick et al.58 are available under the same licence. Finally, the data from Brennan and Hale31 are available under the CC BY 4.0 licence. All audio files were provided by the authors of each dataset.