Title: Simple and Controllable Music Generation
Authors: Jade Copet, Felix Kreuk, Itai Gat, Tal Remez, David Kant, Gabriel Synnaeve, Yossi Adi, Alexandre Défossez
Published: 8th June 2023 (Thursday) @ 15:31:05
Link: http://arxiv.org/abs/2306.05284v1
Abstract
We tackle the task of conditional music generation. We introduce MusicGen, a single Language Model (LM) that operates over several streams of compressed discrete music representation, i.e., tokens. Unlike prior work, MusicGen is comprised of a single-stage transformer LM together with efficient token interleaving patterns, which eliminates the need for cascading several models, e.g., hierarchically or upsampling. Following this approach, we demonstrate how MusicGen can generate high-quality samples, while being conditioned on textual description or melodic features, allowing better controls over the generated output. We conduct extensive empirical evaluation, considering both automatic and human studies, showing the proposed approach is superior to the evaluated baselines on a standard text-to-music benchmark. Through ablation studies, we shed light over the importance of each of the components comprising MusicGen. Music samples, code, and models are available at https://github.com/facebookresearch/audiocraft.
- Music generation requires higher sampling rates (44.1 or 48 kHz are standard for music recording) since more of the frequency spectrum is used - c.f. speech, which is fine at 16 kHz.
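Quick arithmetic note to myself on why the sampling rate matters (my own reasoning, not from the paper): by the Nyquist criterion, a signal sampled at rate $f_s$ can only represent frequencies up to $f_s/2$:

$$
f_{\max} = \frac{f_s}{2}:\qquad 16\,\mathrm{kHz} \Rightarrow 8\,\mathrm{kHz},\quad 32\,\mathrm{kHz} \Rightarrow 16\,\mathrm{kHz},\quad 48\,\mathrm{kHz} \Rightarrow 24\,\mathrm{kHz}
$$

which is why speech models get away with 16 kHz while music generation needs 32 kHz or more.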
Approaches to modeling several streams of audio/speech tokens (e.g., at different levels of granularity):
- Kharitonov et al. [2022] and Kreuk et al. [2022] proposed modeling multiple streams of speech tokens in parallel following a delay approach, i.e., introducing offsets between the different streams (see the sketch after this list).
- Agostinelli et al. [2023] proposed representing musical segments using multiple sequences of discrete tokens at different granularities and modeling them with a hierarchy of autoregressive models.
- In parallel, Donahue et al. [2023] follow a similar approach, but for the task of singing-to-accompaniment generation.
- Recently, Wang et al. [2023] proposed tackling this problem in two stages: (i) modeling the first stream of tokens only; (ii) applying a post-network to jointly model the remaining streams in a non-autoregressive manner.
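To make the delay/interleaving idea concrete, here is a minimal sketch (my own illustration, not the AudioCraft implementation): each codebook stream is shifted by its index so that a single-stage LM can predict all codebooks at every decoding step. The function names, the placeholder token, and the tensor shapes are assumptions made for the example.

```python
# Minimal sketch of the "delay" codebook interleaving pattern.
# K, T, and SPECIAL are illustrative assumptions, not values from the paper.
import torch

K, T = 4, 8                      # number of codebooks, number of frames
SPECIAL = -1                     # placeholder for positions with no code yet

codes = torch.arange(K * T).reshape(K, T)   # stand-in for EnCodec token streams

def delay_pattern(codes: torch.Tensor, special: int = SPECIAL) -> torch.Tensor:
    """Shift codebook k to the right by k steps, padding with `special`."""
    K, T = codes.shape
    out = torch.full((K, T + K - 1), special, dtype=codes.dtype)
    for k in range(K):
        out[k, k:k + T] = codes[k]
    return out

def undo_delay_pattern(delayed: torch.Tensor) -> torch.Tensor:
    """Realign the shifted streams back into parallel frames."""
    K = delayed.shape[0]
    T = delayed.shape[1] - (K - 1)
    return torch.stack([delayed[k, k:k + T] for k in range(K)])

delayed = delay_pattern(codes)
assert torch.equal(undo_delay_pattern(delayed), codes)
```

The inverse realignment is what would be applied to the generated token streams before decoding them back to audio with the compression model.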
- Contributions
- We introduce a simple and efficient model to generate high-quality music at 32 kHz. We show that MusicGen can generate consistent music with a single-stage language model through an efficient codebook interleaving strategy.
- We present a single model to perform both text- and melody-conditioned generation, and demonstrate that the generated audio is coherent with the provided melody and faithful to the text conditioning (see the usage sketch after this list).
- We provide extensive objective and human evaluations on the key design choices behind our method.
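For reference, a rough usage sketch against the released audiocraft package linked above. The checkpoint name, function names, and arguments are taken from my reading of the repo README and may differ between versions, so treat this as an assumption rather than the definitive API.

```python
# Hedged sketch of text- and melody-conditioned generation with audiocraft;
# checkpoint id and call signatures are assumptions based on the repo README.
import torchaudio
from audiocraft.models import MusicGen
from audiocraft.data.audio import audio_write

model = MusicGen.get_pretrained("facebook/musicgen-melody")  # assumed checkpoint id
model.set_generation_params(duration=8)                      # seconds of audio to generate

# Text-only conditioning.
wavs = model.generate(["lo-fi hip hop beat with warm piano"])

# Melody + text conditioning: a reference waveform whose chromagram guides the melody.
melody, sr = torchaudio.load("reference_melody.wav")          # hypothetical input file
wavs_melody = model.generate_with_chroma(
    ["orchestral rendition of the same melody"], melody[None], sr)

for i, wav in enumerate(wavs_melody):
    # audio_write adds the file extension and applies loudness normalization.
    audio_write(f"musicgen_sample_{i}", wav.cpu(), model.sample_rate, strategy="loudness")
```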