Title: Representing Speech Through Autoregressive Prediction of Cochlear Tokens
Authors: Greta Tuckute, Klemen Kotar, Evelina Fedorenko, Daniel L. K. Yamins
Published: 15th August 2025 (Friday) @ 17:06:04
Link: http://arxiv.org/abs/2508.11598v1
Abstract
We introduce AuriStream, a biologically inspired model that encodes speech via a two-stage framework modeled on the human auditory processing hierarchy. The first stage transforms raw audio into a time-frequency representation based on the human cochlea, from which we extract discrete cochlear tokens. The second stage applies an autoregressive sequence model over the cochlear tokens. AuriStream learns meaningful phoneme and word representations and achieves state-of-the-art lexical semantics. It also shows competitive performance on diverse downstream SUPERB speech tasks. Beyond its strong representational capabilities, AuriStream generates continuations of audio that can be visualized in a spectrogram space and decoded back into audio, providing insight into the model's predictions. In summary, we present a two-stage framework for speech representation learning to advance the development of more human-like models that efficiently handle a range of speech-based tasks.
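The data flow of the two-stage framework can be illustrated with a minimal toy sketch. It is not the authors' implementation: a windowed FFT magnitude stands in for the cochlear time-frequency representation, a nearest-neighbor lookup in a randomly chosen codebook stands in for the tokenizer, and no sequence model is trained (only the shifted input/target pairs that next-token prediction would use are built). All function names, frame sizes, and the codebook size are hypothetical choices made for illustration.

```python
import numpy as np


def toy_time_frequency(audio, frame_len=64, hop=32):
    """Windowed FFT magnitudes; a stand-in (assumption) for a cochlear representation."""
    n_frames = 1 + (len(audio) - frame_len) // hop
    window = np.hanning(frame_len)
    frames = np.stack(
        [audio[i * hop : i * hop + frame_len] * window for i in range(n_frames)]
    )
    # One row per frame, one column per frequency bin.
    return np.abs(np.fft.rfft(frames, axis=1))


def quantize(features, codebook):
    """Map each frame to the index of its nearest codebook vector (assumed tokenizer)."""
    diffs = features[:, None, :] - codebook[None, :, :]
    dists = (diffs ** 2).sum(axis=-1)
    return dists.argmin(axis=1)


def next_token_pairs(tokens):
    """Inputs and targets for autoregressive training: predict token t+1 from tokens up to t."""
    return tokens[:-1], tokens[1:]


rng = np.random.default_rng(0)
audio = rng.standard_normal(1024)  # random noise as placeholder audio

# Stage 1: time-frequency representation, then discrete tokens.
features = toy_time_frequency(audio)
codebook = features[rng.choice(len(features), size=8, replace=False)]
tokens = quantize(features, codebook)

# Stage 2 (data preparation only): shifted sequences for next-token prediction.
inputs, targets = next_token_pairs(tokens)
```

In this sketch each audio frame becomes one integer token, so the sequence model in the second stage would operate on a short integer sequence rather than on raw samples.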