Speech and Audio Tokenizers

  1. HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units
  2. SpeechTokenizer: Unified Speech Tokenizer for Speech Large Language Models
    1. relies on modules from SEANet: A Multi-modal Speech Enhancement Network
    2. uses the decoder of MelGAN: Generative Adversarial Networks for Conditional Waveform Synthesis (the encoder mirrors the decoder)
    3. uses the architecture of U-Net: Convolutional Networks for Biomedical Image Segmentation
  3. Moshi: a speech-text foundation model for real-time dialogue
  4. WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing - included because Moshi’s semantic tokens are a distillation of - and therefore a lossier encoding than - this model's representations. Specifically, they used microsoft/wavlm-large (Hugging Face) per p. 12 of Moshi: a speech-text foundation model for real-time dialogue; see the loading sketch after this list
  5. mHuBERT-147: A Compact Multilingual HuBERT Model
  6. SemantiCodec: An Ultra Low Bitrate Semantic Audio Codec for General Sound
  7. FocalCodec: Low-Bitrate Speech Coding via Focal Modulation Networks
  8. StableToken: A Noise-Robust Semantic Speech Tokenizer for Resilient SpeechLLMs
  9. SNAC: Multi-Scale Neural Audio Codec - used by Mini-Omni and Mini-Omni2: Towards Open-source GPT-4o with Vision, Speech and Duplex Capabilities
  10. DC-Spin: A Speaker-invariant Speech Tokenizer for Spoken Language Models
  11. TS3-Codec: Transformer-Based Simple Streaming Single Codec
  12. PLACEHOLDER hertz-dev - Standard Intelligence
  13. FunCodec: A Fundamental, Reproducible and Integrable Open-source Toolkit for Neural Speech Codec ?
  14. SyllableLM: Learning Coarse Semantic Units for Speech Language Models ?
  15. WavTokenizer: an Efficient Acoustic Discrete Codec Tokenizer for Audio Language Modeling
  16. BASE TTS: Lessons from building a billion-parameter Text-to-Speech model on 100K hours of data ?
  17. Continuous Speech Tokenizer in Text To Speech ?
  18. dMel: Speech Tokenization made Simple
  19. neuphonic/neucodec: a package for NeuCodec, a 50 Hz, 0.8 kbps, 24 kHz audio codec
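
For item 4, a minimal sketch of pulling microsoft/wavlm-large embeddings with the Hugging Face transformers API; the dummy waveform and the choice of layer are assumptions for illustration (the Moshi paper does not restate them here):

```python
import torch
from transformers import Wav2Vec2FeatureExtractor, WavLMModel

# Load the checkpoint Moshi distills its semantic tokens from
# (microsoft/wavlm-large, per p. 12 of the Moshi paper).
extractor = Wav2Vec2FeatureExtractor(sampling_rate=16000, do_normalize=True)
model = WavLMModel.from_pretrained("microsoft/wavlm-large").eval()

waveform = torch.randn(16000)  # placeholder: 1 s of 16 kHz mono audio
inputs = extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# 25 hidden states (conv embedding + 24 transformer layers), frames at
# ~50 Hz (20 ms hop). Taking the last layer is an assumption for the demo;
# which layer the distillation actually targets is not specified above.
features = out.hidden_states[-1]
print(features.shape)  # torch.Size([1, 49, 1024])
```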

To overcome the limitations of conventional speech tokenizers, which separately capture information for either understanding or generation tasks, we propose a dual-codebook speech tokenizer framework in Step-Audio similar to ARCON (Ming et al., 2024). This approach employs two distinct tokenizers, linguistic and semantic, to better represent speech features. The linguistic tokenizer is utilized to extract structured, high-level representations, including phonemic and linguistic features, whereas the semantic tokenizer is designed to encode both semantic and coarse-grained acoustic characteristics.

For linguistic tokenization, we utilize the output from the Paraformer (Z. Gao, Zhang, McLoughlin, & Yan, 2022) encoder, which is quantized into discrete representations at a token rate of 16.7 Hz. For semantic tokenization, we employ CosyVoice’s (Du, Chen, et al., 2024) tokenizer, specifically designed to efficiently encode features essential for generating natural and expressive speech outputs, operating at a token rate of 25 Hz. The linguistic tokenizer employs a codebook size of 1024, while the semantic tokenizer utilizes a larger codebook size of 4096 to capture finer acoustic details.

To effectively integrate these two tokenization schemes, we implement a token-level interleaving approach inspired by SpiritLM (Nguyen et al., 2024). Given the differing token rates, we establish a temporal alignment ratio of 2:3, where every two linguistic tokens are paired with three semantic tokens.

— Step-Audio: Unified Understanding and Generation in Intelligent Speech Interaction, §3.1
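
The token rates and codebook sizes quoted above pin down each stream's effective bitrate; a quick back-of-the-envelope check in Python:

```python
import math

# Token rates (Hz) and codebook sizes from the Step-Audio excerpt above.
ling_bps = 16.7 * math.log2(1024)   # linguistic: 16.7 tok/s x 10 bits = 167 bps
sem_bps  = 25.0 * math.log2(4096)   # semantic:   25 tok/s x 12 bits  = 300 bps

print(f"linguistic {ling_bps:.0f} bps + semantic {sem_bps:.0f} bps "
      f"= {ling_bps + sem_bps:.0f} bps total")
# The 16.7:25 rate ratio is what yields the 2:3 interleaving alignment.
```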
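And a minimal sketch of the 2:3 token-level interleaving described in §3.1. The within-window ordering (two linguistic tokens, then three semantic tokens) and the codebook-offset trick for keeping the two vocabularies disjoint are assumptions; the excerpt specifies neither:

```python
def interleave(linguistic, semantic, sem_offset=1024):
    """Merge two token streams in fixed 2:3 groups (assumed ordering:
    two linguistic tokens, then three semantic tokens, per window).
    Semantic IDs are shifted by the linguistic codebook size (1024)
    so the two vocabularies stay disjoint in the merged stream."""
    out, li, si = [], 0, 0
    while li < len(linguistic) and si < len(semantic):
        out += linguistic[li:li + 2]
        out += [t + sem_offset for t in semantic[si:si + 3]]
        li, si = li + 2, si + 3
    out += linguistic[li:]                         # flush any remainder
    out += [t + sem_offset for t in semantic[si:]]
    return out

# e.g. 4 linguistic + 6 semantic tokens -> L L S S S L L S S S
print(interleave([1, 2, 3, 4], [10, 20, 30, 40, 50, 60]))
```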