Imagine a professional musician being able to explore new compositions without having to play a single note on an instrument. Or an indie game developer populating virtual worlds with realistic sound effects and ambient noise on a shoestring budget. Or a small business owner adding a soundtrack to their latest Instagram post with ease. That's the promise of AudioCraft, our simple framework that generates high-quality, realistic audio and music from text-based user inputs after training on raw audio signals as opposed to MIDI or piano rolls.
AudioCraft consists of three models: MusicGen, AudioGen, and EnCodec. MusicGen, which was trained with Meta-owned and specifically licensed music, generates music from text-based user inputs, while AudioGen, which was trained on public sound effects, generates audio from text-based user inputs. Today, we're excited to release an improved version of our EnCodec decoder, which allows for higher quality music generation with fewer artifacts; our pre-trained AudioGen model, which lets you generate environmental sounds and sound effects like a dog barking, cars honking, or footsteps on a wooden floor; and all of the AudioCraft model weights and code. The models are available for research purposes and to further people's understanding of the technology. We're excited to give researchers and practitioners access so they can train their own models with their own datasets for the first time and help advance the state of the art.
From text to audio with ease
In recent years, generative AI models including language models have made huge strides and shown exceptional abilities: from generating a wide variety of images and video from text descriptions, with evidence of spatial understanding, to text and speech models that perform machine translation or act as text or speech dialogue agents. Yet while we've seen a lot of excitement around generative AI for images, video, and text, audio has always seemed to lag a bit behind. There's some work out there, but it's highly complicated and not very open, so people aren't able to readily play with it.
Generating high-fidelity audio of any kind requires modeling complex signals and patterns at varying scales. Music is arguably the most challenging type of audio to generate because it's composed of local and long-range patterns, from a suite of notes to a global musical structure with multiple instruments. Generating coherent music with AI has often been addressed through the use of symbolic representations like MIDI or piano rolls. However, these approaches are unable to fully grasp the expressive nuances and stylistic elements found in music. More recent advances leverage self-supervised audio representation learning and a number of hierarchical or cascaded models to generate music, feeding the raw audio into a complex system in order to capture long-range structures in the signal while generating quality audio. But we knew that more could be done in this field.
The AudioCraft family of models is capable of producing high-quality audio with long-term consistency, and it can be easily interacted with through a natural interface. With AudioCraft, we simplify the overall design of generative models for audio compared to prior work in the field, giving people the full recipe to play with the existing models that Meta has been developing over the past several years while also empowering them to push the limits and develop their own models.
AudioCraft works for music and sound generation and compression, all in the same place. Because it's easy to build on and reuse, people who want to build better sound generators, compression algorithms, or music generators can do it all in the same code base and build on top of what others have done.
And while a lot of work went into making the models simple, the team was equally committed to ensuring that AudioCraft could support the state of the art. People can easily extend our models and adapt them to their use cases for research. There are nearly limitless possibilities once you give people access to the models to tune them to their needs. And that's what we want to do with this family of models: give people the power to extend their work.
A simple approach to audio generation
Generating audio from raw audio signals is challenging because it requires modeling extremely long sequences. A typical music track of a few minutes sampled at 44.1 kHz (the standard quality of music recordings) consists of millions of timesteps. In comparison, text-based generative models like Llama and Llama 2 are fed text processed as sub-words, which amounts to just a few thousand timesteps per sample.
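As a rough check on the scale described above, the arithmetic can be sketched as follows. The three-minute duration and the 50-frames-per-second token rate are illustrative assumptions, not figures from this post; only the 44.1 kHz sample rate comes from the text.

```python
def num_steps(duration_s: float, rate_hz: float) -> int:
    """Number of timesteps in a signal of the given duration and rate."""
    return int(duration_s * rate_hz)

TRACK_SECONDS = 3 * 60        # assumption: a three-minute track
SAMPLE_RATE_HZ = 44_100       # from the text: 44.1 kHz audio
TOKEN_FRAME_RATE_HZ = 50      # assumption: hypothetical codec frame rate

raw_steps = num_steps(TRACK_SECONDS, SAMPLE_RATE_HZ)
token_steps = num_steps(TRACK_SECONDS, TOKEN_FRAME_RATE_HZ)

print(f"raw waveform timesteps: {raw_steps:,}")
print(f"token frames at the assumed rate: {token_steps:,}")
print(f"sequence length reduction: {raw_steps // token_steps}x")
```

Under these assumptions the raw waveform runs to several million timesteps, while a tokenized version is in the same few-thousand range as a text sample.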
To address this challenge, we learn discrete audio tokens from the raw signal using the EnCodec neural audio codec, which gives us a new fixed "vocabulary" for music samples. We can then train autoregressive language models over these discrete audio tokens to generate new tokens and new sounds and music when converting the tokens back to the audio space with EnCodec's decoder.
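To make the idea of an autoregressive model over discrete tokens concrete, here is a deliberately tiny sketch. A bigram count table stands in for the language model, and the integer tokens are made up rather than produced by EnCodec; the names `fit_bigram` and `generate` are hypothetical. The real models are far larger, but the loop has the same shape: predict the next token from what came before, append it, and repeat.

```python
from collections import Counter, defaultdict

def fit_bigram(sequences):
    """Count how often each token follows each other token."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for prev, nxt in zip(seq, seq[1:]):
            counts[prev][nxt] += 1
    return counts

def generate(counts, start, length):
    """Greedy autoregressive generation: always pick the most frequent follower."""
    out = [start]
    while len(out) < length:
        followers = counts.get(out[-1])
        if not followers:
            break  # no known continuation for this token
        # Ties are broken by the smallest token id to keep the output deterministic.
        best = min(followers, key=lambda tok: (-followers[tok], tok))
        out.append(best)
    return out

# Made-up token sequences standing in for tokenized audio clips.
training = [[0, 1, 2, 3, 0, 1, 2, 3], [0, 1, 2, 0, 1, 2, 3]]
counts = fit_bigram(training)
generated = generate(counts, start=0, length=6)
print(generated)
```

In the full system, the generated tokens would then be passed to the codec's decoder to obtain a waveform; that step is omitted here.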
Learning audio tokens from the waveform
EnCodec is a lossy neural codec that was trained specifically to compress any kind of audio and reconstruct the original signal with high fidelity. It consists of an autoencoder with a residual vector quantization bottleneck that produces several parallel streams of audio tokens with a fixed vocabulary. The different streams capture different levels of information of the audio waveform, allowing us to reconstruct the audio with high fidelity from all the streams.
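Residual vector quantization can be illustrated with a toy example. The sketch below is not EnCodec's implementation: the two hand-written codebooks, the 2-D "frame", and the function names are all hypothetical. It only shows the mechanism, in which each stage quantizes what the previous stage left over, so one frame yields one token per codebook (the parallel streams).

```python
from typing import List, Sequence, Tuple

Vector = Tuple[float, ...]

def nearest(codebook: Sequence[Vector], target: Vector) -> int:
    """Index of the codeword closest to target (squared Euclidean distance)."""
    def dist(c: Vector) -> float:
        return sum((a - b) ** 2 for a, b in zip(c, target))
    return min(range(len(codebook)), key=lambda i: dist(codebook[i]))

def rvq_encode(x: Vector, codebooks: Sequence[Sequence[Vector]]) -> List[int]:
    """Quantize x stage by stage; each stage encodes the remaining residual."""
    residual = x
    tokens = []
    for cb in codebooks:
        idx = nearest(cb, residual)
        tokens.append(idx)
        residual = tuple(r - c for r, c in zip(residual, cb[idx]))
    return tokens

def rvq_decode(tokens: Sequence[int], codebooks: Sequence[Sequence[Vector]]) -> Vector:
    """Reconstruct by summing the selected codeword from every stage."""
    out = [0.0] * len(codebooks[0][0])
    for idx, cb in zip(tokens, codebooks):
        for d, value in enumerate(cb[idx]):
            out[d] += value
    return tuple(out)

# Hypothetical codebooks: the first is coarse, the second refines the residual.
CODEBOOKS = [
    [(0.0, 0.0), (1.0, 1.0), (-1.0, 1.0), (1.0, -1.0)],
    [(0.0, 0.0), (0.25, 0.0), (0.0, 0.25), (-0.25, -0.25)],
]

frame = (1.2, 0.9)
tokens = rvq_encode(frame, CODEBOOKS)
recon = rvq_decode(tokens, CODEBOOKS)
print(tokens, recon)
```

Decoding from the first token alone would give a coarser reconstruction; adding the second stream brings it closer to the input, which mirrors the point that the streams capture different levels of information.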
Training audio language models
We then use a single autoregressive language model to recursively model the audio tokens from EnCodec. We introduce a simple approach to leverage the internal structure of the parallel streams of tokens and show that, with a single model and an elegant token interleaving pattern, our approach efficiently models audio sequences, capturing the long-term dependencies in the audio while allowing us to generate high-quality sound.
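This post does not spell out the interleaving pattern. One option for parallel token streams is a delay pattern, in which stream k is shifted right by k steps so that a single model can emit one token per stream at each step while later streams still see earlier ones. The sketch below illustrates that idea only; `PAD`, `apply_delay`, and `undo_delay` are hypothetical names, and this is not necessarily the pattern used in the released models.

```python
PAD = -1  # hypothetical placeholder token for positions with no real token

def apply_delay(streams):
    """Shift stream k right by k steps, padding so all rows share one length."""
    k = len(streams)
    return [
        [PAD] * i + list(stream) + [PAD] * (k - 1 - i)
        for i, stream in enumerate(streams)
    ]

def undo_delay(delayed):
    """Invert apply_delay and recover the original aligned streams."""
    k = len(delayed)
    t = len(delayed[0]) - (k - 1)
    return [row[i:i + t] for i, row in enumerate(delayed)]

# Three parallel streams of four made-up tokens each.
streams = [
    [10, 11, 12, 13],
    [20, 21, 22, 23],
    [30, 31, 32, 33],
]

delayed = apply_delay(streams)
for row in delayed:
    print(row)

restored = undo_delay(delayed)
```

With K streams of T tokens, the delayed layout needs T + K - 1 model steps rather than the K × T steps that fully flattening the streams would require.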
Generating audio from text descriptions
Text Prompt: Whistling with wind blowing
Text Prompt: Sirens and a humming engine approach and pass
With AudioGen, we demonstrated that we can train AI models to perform the task of text-to-audio generation. Given a textual description of an acoustic scene, the model can generate the environmental sound corresponding to the description with realistic recording conditions and complex scene context.
Text Prompt: Pop dance track with catchy melodies, tropical percussions, and upbeat rhythms, perfect for the beach
Text Prompt: Earthy tones, environmentally conscious, ukulele-infused, harmonic, breezy, easygoing, organic instrumentation, gentle grooves
MusicGen is an audio generation model specifically tailored for music generation. Music tracks are more complex than environmental sounds, and generating samples that stay coherent over their long-term structure is especially important when creating novel musical pieces. MusicGen was trained on roughly 400,000 recordings along with text descriptions and metadata, amounting to 20,000 hours of music owned by Meta or licensed specifically for this purpose.
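As a quick sanity check on those dataset figures, the implied average recording length can be worked out directly. The inputs are the rounded numbers quoted above, so the result is only approximate.

```python
RECORDINGS = 400_000   # from the text (approximate)
TOTAL_HOURS = 20_000   # from the text

mean_seconds = TOTAL_HOURS * 3600 / RECORDINGS
mean_minutes = mean_seconds / 60

print(f"average recording length: about {mean_minutes:.1f} minutes")
```

That works out to about three minutes per recording, consistent with the "few minutes" track length used earlier in the post.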
Building on this research
Our team continues working on the research behind advanced generative AI audio models. As part of this AudioCraft release, we further provide new approaches to push the quality of synthesized audio through a diffusion-based approach for discrete representation decoding. We plan to keep investigating better controllability of generative models for audio, exploring additional conditioning methods, and pushing the ability of models to capture even longer range dependencies. Finally, we will continue investigating the limitations and biases of such models trained on audio.
The team is working to improve the current models by boosting their speed and efficiency from a modeling perspective and improving the way we control these models, which will open up new use cases and possibilities.
Responsibility and transparency as the cornerstones of our research
It's important to be open about our work so the research community can build on it and continue the important conversations we're having about how to build AI responsibly. We recognize that the datasets used to train our models lack diversity. In particular, the music dataset used contains a larger portion of Western-style music and only contains audio-text pairs with text and metadata written in English. By sharing the code for AudioCraft, we hope other researchers can more easily test new approaches to limit or eliminate potential bias in and misuse of generative models.
The importance of open source
Responsible innovation can't happen in isolation. Open sourcing our research and resulting models helps ensure that everyone has equal access.
We're making the models available to the research community at several sizes and sharing AudioGen and MusicGen model cards that detail how we built the models in keeping with our approach to Responsible AI practices. Our audio research framework and training code are released under the MIT license to enable the broader community to reproduce and build on top of our work. And through the development of more advanced controls, we hope that such models can become useful to both music amateurs and professionals.
Having a solid open source foundation will foster innovation and complement the way we produce and listen to audio and music in the future: think rich bedtime story readings with sound effects and epic music. With even more controls, we think MusicGen can turn into a new type of instrument, just like synthesizers when they first appeared.
We see the AudioCraft family of models as tools for musicians' and sound designers' professional toolboxes in that they can provide inspiration, help people quickly brainstorm, and iterate on their compositions in new ways.
Rather than keeping the work as an impenetrable black box, being open about how we develop these models and ensuring that they're easy for people to use, whether it's researchers or the music community as a whole, helps people understand what these models can do, understand what they can't do, and be empowered to actually use them.
In the future, generative AI could help people vastly improve iteration time by allowing them to get feedback faster during the early prototyping and grayboxing stages, whether they're a large AAA developer building worlds for the metaverse, a musician (amateur, professional, or otherwise) working on their next composition, or a small or medium-sized business owner looking to up-level their creative assets. AudioCraft is an important step forward in generative AI research. We believe the simple approach we developed to successfully generate robust, coherent, and high-quality audio samples will have a meaningful impact on the development of advanced human-computer interaction models considering auditory and multi-modal interfaces. And we can't wait to see what people create with it.
This blog post was made possible by the work of: Yossi Adi, Jade Copet, Alexandre Défossez, Itai Gat, David Kant, Felix Kreuk, Rashel Moritz, Tal Remez, Robin San Roman, Gabriel Synnaeve, and Mary Williamson.