Introducing hertz-dev - Standard Intelligence

For the last few months, we at Standard Intelligence have been researching scalable cross-modality learning. We’re excited to announce that we’re open-sourcing current checkpoints of our full-duplex, audio-only transformer base model, hertz-dev, with a total of 8.5 billion parameters.

  • hertz-codec: a convolutional audio autoencoder that takes mono, 16kHz speech and transforms it into an 8 Hz latent representation at about a 1kbps bitrate. At 1kbps, the codec outperforms Soundstream and Encodec at 6kbps and is on par with DAC at 8kbps in subjective evaluations, while producing fewer tokens per second than any popular tokenizer, which is critical for language modeling. The codec has 5 million encoder parameters and 95 million decoder parameters.
  • hertz-vae: a 1.8 billion parameter transformer decoder which acts as a learned prior for the audio VAE. The model uses a context of 8192 sampled latent representations (17 minutes) and predicts the next encoded audio frame as a mixture of Gaussians. 15 bits of quantized information from the next token act as semantic scaffolding to steer the generation in a streamable manner.
  • hertz-dev: a 6.6 billion parameter transformer stack. The primary checkpoint is partially initialized from the weights of a pre-trained language model and then trained for a single epoch on 500B tokens with a 2048-token (4-minute) context length. We’re also publishing an ablation without the language model initialization, which is similarly trained on 500B tokens.
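The frame rates, context lengths, and bitrates quoted above are mutually consistent, which a quick back-of-the-envelope check confirms (this sketch assumes only the stated 8 Hz latent frame rate; it is not taken from the released code):

```python
# Sanity-check the numbers quoted in the component list above,
# assuming the stated 8 Hz latent frame rate for hertz-codec.

LATENT_HZ = 8  # latent frames per second of audio

def context_seconds(n_frames: int, frame_hz: float = LATENT_HZ) -> float:
    """Duration of audio covered by n_frames latent frames."""
    return n_frames / frame_hz

# hertz-vae: 8192 latent frames of context -> ~17 minutes
vae_minutes = context_seconds(8192) / 60

# hertz-dev: 2048-token context -> ~4.3 minutes
dev_minutes = context_seconds(2048) / 60

# hertz-codec: ~1kbps at 8 Hz -> ~125 bits per latent frame,
# of which 15 bits are the quantized semantic scaffolding
bits_per_frame = 1000 / LATENT_HZ

print(f"{vae_minutes:.1f} min, {dev_minutes:.1f} min, {bits_per_frame:.0f} bits/frame")
```

Running this prints roughly 17.1 min, 4.3 min, and 125 bits/frame, matching the figures in the list.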

Hertz-dev is the first publicly released audio base model of its kind. Base models are uniquely valuable as a research product because they accurately model the distribution of the data that they were trained on, as opposed to models that have had substantial RL tuning done to collapse their generation distributions. This makes base models the best starting point to fine-tune for a large number of different tasks.

Hertz-dev has a theoretical latency of 65ms and a real-world average latency of 120ms on an RTX 4090. This is about 2x lower latency than any public model in the world—a prerequisite for a model that can interact with you in human-like ways instead of what feels like a delayed, choppy phone call. We’re currently training a larger, more advanced version of Hertz, which will use a scaled base model recipe and RL tuning to substantially improve the raw capabilities and final coherence of the model. Hertz-dev is a glimpse at the future of real-time voice interaction, and is the easiest conversational audio model in the world for researchers to fine-tune and build on top of.
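To make the streaming generation concrete: hertz-vae is described above as predicting each next latent frame as a mixture of Gaussians, which can be sampled one frame at a time with no lookahead. The following is a minimal, hypothetical sketch of sampling from such a mixture head; the function name, shapes, and parameterization are illustrative assumptions, not the released implementation:

```python
# Illustrative sketch: drawing one latent frame from a
# mixture-of-Gaussians prediction head, as hertz-vae is described
# to do. All names and shapes here are assumptions for exposition.
import numpy as np

def sample_mixture(logits, means, log_stds, rng):
    """Draw one D-dim latent vector from a K-component Gaussian mixture.

    logits:   (K,)   unnormalized mixture weights
    means:    (K, D) component means
    log_stds: (K, D) component log standard deviations
    """
    probs = np.exp(logits - logits.max())        # stable softmax
    probs /= probs.sum()
    k = rng.choice(len(probs), p=probs)          # pick a component
    eps = rng.standard_normal(means.shape[1])    # reparameterized noise
    return means[k] + np.exp(log_stds[k]) * eps  # sample ~ N(mu_k, sigma_k)

rng = np.random.default_rng(0)
K, D = 4, 32  # illustrative component count and latent dimension
frame = sample_mixture(rng.standard_normal(K),
                       rng.standard_normal((K, D)),
                       np.full((K, D), -1.0),
                       rng)
print(frame.shape)  # (32,)
```

Because each frame depends only on past context, a loop over this step can emit audio continuously, which is what makes the low per-frame latency figures above meaningful.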

Sample Generations

To demonstrate the audio modeling capabilities of hertz-dev, we sample both one-channel and two-channel generations as well as a live conversation between the model and a human.

One-channel

(Four one-channel audio samples are embedded on the original page.)

Two-channel

(Three two-channel audio samples are embedded on the original page.)

Interactive

(A live human–model conversation sample is embedded on the original page.)

9 seconds of prompt included.