Title: PLACEHOLDER hertz-dev - Standard Intelligence
Authors: Standard Intelligence
Published: 2024-11-03
Link: https://si.inc/hertz-dev/

Abstract


This entry is a placeholder. See 👉 Introducing hertz-dev - Standard Intelligence.

Code and models are available on GitHub: https://github.com/Standard-Intelligence/hertz-dev?tab=readme-ov-file


Main releases

  • hertz-codec: convolutional audio autoencoder
    • encodes mono 16kHz speech into an 8 Hz latent representation at approx. 1 kbps
    • outperforms Soundstream and Encodec at 6kbps; on par with DAC at 8kbps in subjective evaluations
    • fewer tokens per second than any popular tokenizer (“critical for language modeling”)
    • 5 million encoder parameters; 95 million decoder parameters
  • hertz-vae: 1.8 billion parameter transformer decoder
    • acts as a learned prior for the audio VAE
    • context of 8192 sampled latent representations (about 17 minutes of audio)
    • predicts next encoded audio frame as a mixture of Gaussians
    • 15 bits of quantized information from the next token act as semantic scaffolding to steer the generation in a streamable manner
  • hertz-dev: 6.6 billion parameter transformer stack
    • primary checkpoint is partially initialized from weights of a pre-trained language model
      • question Which LM do they initialize with?
    • trained for a single epoch on 500B tokens with a 2048-token (4 minute) context length
    • “We’re also publishing an ablation of the language model initialization which is similarly trained on 500B tokens.”
    • hertz-dev has a theoretical latency of 65 ms and a real-world average latency of 120 ms on an RTX 4090
      • question What accounts for the difference in theoretical vs practical latency? I’ve heard these two figures reported in parallel quite frequently.
    • The authors claim this latency is roughly half that of any other publicly available model
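The codec figures above (mono 16kHz input, 8 Hz latents, ~1 kbps) imply a few numbers worth writing down. A quick sketch of the arithmetic; the 16-bit PCM baseline used for the compression ratio is my assumption, not something stated in the release:

```python
# Back-of-the-envelope arithmetic for the hertz-codec figures quoted above.
SAMPLE_RATE_HZ = 16_000   # mono speech input
LATENT_RATE_HZ = 8        # latent frames per second
BITRATE_BPS = 1_000       # ~1 kbps quoted bitrate

# Audio samples summarized by each latent frame.
samples_per_latent = SAMPLE_RATE_HZ // LATENT_RATE_HZ
print(samples_per_latent)        # 2000 samples per frame

# Bits carried by each latent frame at ~1 kbps.
bits_per_latent = BITRATE_BPS / LATENT_RATE_HZ
print(bits_per_latent)           # 125.0 bits per frame

# Compression vs. raw PCM (assumption: 16-bit samples).
raw_bps = SAMPLE_RATE_HZ * 16
print(raw_bps / BITRATE_BPS)     # 256.0x smaller than raw audio
```

The 8 Hz frame rate is what makes the "fewer tokens per second than any popular tokenizer" claim concrete: EnCodec-style codecs typically emit tens of frames per second, so an autoregressive model over hertz-codec latents has far fewer steps per second of audio.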
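hertz-vae predicts the next encoded audio frame as a mixture of Gaussians. A minimal NumPy sketch of sampling from such a predictive head; the component count, latent dimensionality, and diagonal-covariance parameterization here are all illustrative assumptions, not the published architecture:

```python
import numpy as np

def sample_mixture_of_gaussians(logits, means, log_stds, rng):
    """Sample one latent frame from a diagonal mixture of Gaussians.

    logits:   (K,)    unnormalized mixture weights
    means:    (K, D)  per-component means
    log_stds: (K, D)  per-component log standard deviations
    """
    # Softmax over mixture components (numerically stabilized).
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()
    k = rng.choice(len(weights), p=weights)     # pick a component
    eps = rng.standard_normal(means.shape[1])   # reparameterized noise
    return means[k] + np.exp(log_stds[k]) * eps # draw from N(mu_k, sigma_k^2)

rng = np.random.default_rng(0)
K, D = 8, 32  # made-up sizes for illustration
frame = sample_mixture_of_gaussians(
    rng.standard_normal(K), rng.standard_normal((K, D)),
    np.full((K, D), -1.0), rng)
print(frame.shape)  # (32,)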
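On the theoretical-vs-measured latency question flagged above: the note does not explain the gap, but one grounding observation is that at 8 Hz each latent frame spans 125 ms, so both quoted latencies (65 ms theoretical, 120 ms measured) fit within a single frame period, consistent with streaming generation one frame at a time. The arithmetic:

```python
# Frame-period arithmetic for the latency figures quoted above.
LATENT_RATE_HZ = 8
frame_period_ms = 1000 / LATENT_RATE_HZ
print(frame_period_ms)  # 125.0 ms between latent frames
# Quoted: 65 ms theoretical, 120 ms measured (RTX 4090); both < 125 ms.
```

Why the measured figure is nearly double the theoretical one is left open here; candidate contributors (kernel launch overhead, audio I/O buffering, scheduling) are speculation, not something the release states.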