Speech and Audio Tokenizers

  1. HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units
  2. SpeechTokenizer: Unified Speech Tokenizer for Speech Large Language Models
    1. relies on modules from SEANet: A Multi-modal Speech Enhancement Network
    2. uses the decoder of MelGAN: Generative Adversarial Networks for Conditional Waveform Synthesis (the encoder mirrors the decoder)
    3. uses the architecture of U-Net: Convolutional Networks for Biomedical Image Segmentation
  3. Moshi: a speech-text foundation model for real-time dialogue
  4. WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing - included because Moshi’s semantic tokens are a distillation of - and therefore a lossier encoding than - this model's representations. Specifically, they used microsoft/wavlm-large (Hugging Face) per p. 12 of Moshi: a speech-text foundation model for real-time dialogue; see the loading sketch after this list
  5. mHuBERT-147: A Compact Multilingual HuBERT Model
  6. SemantiCodec: An Ultra Low Bitrate Semantic Audio Codec for General Sound
  7. FocalCodec: Low-Bitrate Speech Coding via Focal Modulation Networks
  8. StableToken: A Noise-Robust Semantic Speech Tokenizer for Resilient SpeechLLMs
  9. SNAC: Multi-Scale Neural Audio Codec - used by Mini-Omni and Mini-Omni2: Towards Open-source GPT-4o with Vision, Speech and Duplex Capabilities
  10. DC-Spin: A Speaker-invariant Speech Tokenizer for Spoken Language Models
  11. TS3-Codec: Transformer-Based Simple Streaming Single Codec
  12. PLACEHOLDER hertz-dev - Standard Intelligence
  13. FunCodec: A Fundamental, Reproducible and Integrable Open-source Toolkit for Neural Speech Codec ?
  14. SyllableLM: Learning Coarse Semantic Units for Speech Language Models ?
  15. WavTokenizer: an Efficient Acoustic Discrete Codec Tokenizer for Audio Language Modeling
  16. BASE TTS: Lessons from building a billion-parameter Text-to-Speech model on 100K hours of data ?
  17. Continuous Speech Tokenizer in Text To Speech ?
  18. dMel: Speech Tokenization made Simple
  19. neuphonic/neucodec: a package for NeuCodec, a 50 Hz, 0.8 kbps, 24 kHz audio codec
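
For item 4, a minimal sketch of pulling microsoft/wavlm-large embeddings with the Hugging Face transformers API; the dummy waveform and the choice of layer are assumptions for illustration (the Moshi paper does not restate them here):

```python
import torch
from transformers import Wav2Vec2FeatureExtractor, WavLMModel

# Load the checkpoint Moshi distills its semantic tokens from
# (microsoft/wavlm-large, per p. 12 of the Moshi paper).
extractor = Wav2Vec2FeatureExtractor(sampling_rate=16000, do_normalize=True)
model = WavLMModel.from_pretrained("microsoft/wavlm-large").eval()

waveform = torch.randn(16000)  # placeholder: 1 s of 16 kHz mono audio
inputs = extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# 25 hidden states (conv embedding + 24 transformer layers), frames at
# ~50 Hz (20 ms hop). Taking the last layer is an assumption for the demo;
# which layer the distillation actually targets is not specified above.
features = out.hidden_states[-1]
print(features.shape)  # torch.Size([1, 49, 1024])
```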

To overcome the limitations of conventional speech tokenizers, which separately capture information for either understanding or generation tasks, we propose a dual-codebook speech tokenizer framework in Step-Audio similar to ARCON (Ming et al., 2024). This approach employs two distinct tokenizers, linguistic and semantic, to better represent speech features. The linguistic tokenizer is utilized to extract structured, high-level representations, including phonemic and linguistic features, whereas the semantic tokenizer is designed to encode both semantic and coarse-grained acoustic characteristics.

For linguistic tokenization, we utilize the output from the Paraformer (Z. Gao, Zhang, McLoughlin, & Yan, 2022) encoder, which is quantized into discrete representations at a token rate of 16.7 Hz. For semantic tokenization, we employ CosyVoice’s (Du, Chen, et al., 2024) tokenizer, specifically designed to efficiently encode features essential for generating natural and expressive speech outputs, operating at a token rate of 25 Hz. The linguistic tokenizer employs a codebook size of 1024, while the semantic tokenizer utilizes a larger codebook size of 4096 to capture finer acoustic details.

To effectively integrate these two tokenization schemes, we implement a token-level interleaving approach inspired by SpiritLM (Nguyen et al., 2024). Given the differing token rates, we establish a temporal alignment ratio of 2:3, where every two linguistic tokens are paired with three semantic tokens.

— Step-Audio: Unified Understanding and Generation in Intelligent Speech Interaction, §3.1
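
The token rates and codebook sizes quoted above pin down each stream's effective bitrate; a quick back-of-the-envelope check in Python:

```python
import math

# Token rates (Hz) and codebook sizes from the Step-Audio excerpt above.
ling_bps = 16.7 * math.log2(1024)   # linguistic: 16.7 tok/s x 10 bits = 167 bps
sem_bps  = 25.0 * math.log2(4096)   # semantic:   25 tok/s x 12 bits  = 300 bps

print(f"linguistic {ling_bps:.0f} bps + semantic {sem_bps:.0f} bps "
      f"= {ling_bps + sem_bps:.0f} bps total")
# The 16.7:25 rate ratio is what yields the 2:3 interleaving alignment.
```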
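And a minimal sketch of the 2:3 token-level interleaving described in §3.1. The within-window ordering (two linguistic tokens, then three semantic tokens) and the codebook-offset trick for keeping the two vocabularies disjoint are assumptions; the excerpt specifies neither:

```python
def interleave(linguistic, semantic, sem_offset=1024):
    """Merge two token streams in fixed 2:3 groups (assumed ordering:
    two linguistic tokens, then three semantic tokens, per window).
    Semantic IDs are shifted by the linguistic codebook size (1024)
    so the two vocabularies stay disjoint in the merged stream."""
    out, li, si = [], 0, 0
    while li < len(linguistic) and si < len(semantic):
        out += linguistic[li:li + 2]
        out += [t + sem_offset for t in semantic[si:si + 3]]
        li, si = li + 2, si + 3
    out += linguistic[li:]                         # flush any remainder
    out += [t + sem_offset for t in semantic[si:]]
    return out

# e.g. 4 linguistic + 6 semantic tokens -> L L S S S L L S S S
print(interleave([1, 2, 3, 4], [10, 20, 30, 40, 50, 60]))
```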