Title: hertz-dev - Standard Intelligence
Authors: Standard Intelligence
Published: 2024-11-03
Link: https://si.inc/hertz-dev/
Abstract
This entry is a placeholder. See Introducing hertz-dev - Standard Intelligence
Code and models are available on GitHub: https://github.com/Standard-Intelligence/hertz-dev?tab=readme-ov-file
Main releases
- hertz-codec: convolutional audio autoencoder
- encodes mono 16 kHz speech to an 8 Hz latent representation at ~1 kbps bitrate
- outperforms Soundstream and Encodec at 6kbps; on par with DAC at 8kbps in subjective evaluations
- fewer tokens per second than any popular tokenizer ("critical for language modeling")
- 5 million encoder parameters; 95 million decoder parameters
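The codec's headline numbers imply a large compression ratio. A back-of-envelope sketch (the 16-bit PCM baseline is my assumption, not stated in the post):

```python
# Back-of-envelope figures for hertz-codec's compression.
# Assumption (mine): the raw baseline is mono 16 kHz, 16-bit PCM.
PCM_KBPS = 16_000 * 16 / 1000      # 256 kbps raw audio
CODEC_KBPS = 1                     # ~1 kbps latent bitrate (from the post)
LATENT_HZ = 8                      # 8 Hz latent frames (from the post)

compression_ratio = PCM_KBPS / CODEC_KBPS              # -> 256x
bits_per_latent_frame = CODEC_KBPS * 1000 / LATENT_HZ  # -> 125 bits/frame

print(f"~{compression_ratio:.0f}x compression, "
      f"~{bits_per_latent_frame:.0f} bits per 8 Hz latent frame")
```

At ~125 bits per frame, the 15 quantized bits used by hertz-vae for semantic scaffolding are only a small fraction of each latent.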
- hertz-vae: 1.8 billion parameter transformer decoder
- acts as a learned prior for the audio VAE
- context of 8192 sampled latent representations (~17 minutes of audio at 8 Hz)
- predicts next encoded audio frame as a mixture of Gaussians
- 15 bits of quantized information from the next token act as semantic scaffolding to steer the generation in a streamable manner
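Predicting the next frame as a mixture of Gaussians means the model emits mixture weights plus per-component means and variances, and a frame is sampled in two steps: pick a component, then sample from it. A minimal NumPy sketch; all names, shapes, and the diagonal-covariance choice are my assumptions, not the released architecture:

```python
import numpy as np

def sample_mog(logits, means, log_stds, rng=None):
    """Sample one latent frame from a K-component diagonal Gaussian mixture.

    logits:   (K,)    mixture weights (pre-softmax)
    means:    (K, D)  component means over the D-dim latent
    log_stds: (K, D)  component log standard deviations
    """
    rng = rng or np.random.default_rng()
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()                    # softmax over components
    k = rng.choice(len(weights), p=weights)     # step 1: pick a component
    # step 2: reparameterized sample from the chosen diagonal Gaussian
    return means[k] + np.exp(log_stds[k]) * rng.standard_normal(means.shape[1])
```

A continuous mixture head like this lets the prior model a multimodal distribution over the VAE's continuous latents without first discretizing them into a codebook.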
- hertz-dev: 6.6 billion parameter transformer stack
- primary checkpoint is partially initialized from weights of a pre-trained language model
- question Which LM do they initialize with?
- trained for a single epoch on 500B tokens with a 2048-token (4 minute) context length
- "We're also publishing an ablation of the language model initialization which is similarly trained on 500B tokens."
- hertz-dev has a theoretical latency of 65 ms and a real-world average latency of 120 ms on an RTX 4090
- question What accounts for the difference in theoretical vs practical latency? I've heard these two figures reported in parallel quite frequently.
- This latency is about 2x lower than any other public model in the world
- question Is it lower than Moshi?
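The context and latency figures above follow directly from the 8 Hz latent rate. A quick arithmetic check (the "half a frame" reading of the theoretical latency is my interpretation, not the post's stated derivation):

```python
# Context-length and latency arithmetic at 8 Hz latent frames.
LATENT_HZ = 8

vae_ctx_tokens = 8192   # hertz-vae context (from the post)
dev_ctx_tokens = 2048   # hertz-dev context (from the post)

vae_minutes = vae_ctx_tokens / LATENT_HZ / 60   # ~17 minutes
dev_minutes = dev_ctx_tokens / LATENT_HZ / 60   # ~4.3 minutes

# Each 8 Hz frame spans 125 ms. The 65 ms theoretical latency is roughly
# half a frame, consistent with being on average mid-way through the
# current frame when a response can begin (my reading, not the post's).
frame_ms = 1000 / LATENT_HZ

print(f"vae context ~{vae_minutes:.1f} min, dev context ~{dev_minutes:.1f} min, "
      f"frame = {frame_ms:.0f} ms")
```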