Amphion - Amphion (/æmˈfaɪən/) is a toolkit for Audio, Music, and Speech Generation.
“Its purpose is to support reproducible research and help junior researchers and engineers get started in the field of audio, music, and speech generation research and development.”
never used this
s3prl: Self-Supervised Speech Pre-training and Representation Learning. Self-supervised speech pre-trained models are called upstream in this toolkit, and are utilized in various downstream tasks.
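A minimal sketch of the upstream idea, assuming a recent s3prl version (the S3PRLUpstream entry point and the "hubert" model name are taken from s3prl's docs and may differ across releases):

```python
import torch
from s3prl.nn import S3PRLUpstream  # assumption: available in recent s3prl releases

# Load a self-supervised upstream model by name (e.g. HuBERT).
model = S3PRLUpstream("hubert")
model.eval()

# A batch of two raw 16 kHz waveforms and their true lengths.
wavs = torch.randn(2, 16000 * 2)
wavs_len = torch.LongTensor([16000 * 1, 16000 * 2])

with torch.no_grad():
    # Hidden states (one tensor per layer) to feed a downstream task head.
    all_hidden_states, all_lengths = model(wavs, wavs_len)
```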
pedalboard is a Python library for working with audio: reading, writing, rendering, adding effects, and more. It supports most popular audio file formats and a number of common audio effects out of the box, and also allows the use of VST3® and Audio Unit formats for loading third-party software instruments and effects.
pedalboard was built by Spotify’s Audio Intelligence Lab to enable using studio-quality audio effects from within Python and TensorFlow. Internally at Spotify, pedalboard is used for data augmentation to improve machine learning models and to help power features like Spotify’s AI DJ and AI Voice Translation.
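A minimal sketch of the typical pedalboard workflow (file names are placeholders): read audio, run it through a chain of effects, write it back out.

```python
from pedalboard import Pedalboard, Chorus, Reverb
from pedalboard.io import AudioFile

# Read a whole file into a (channels, samples) float32 array.
with AudioFile('guitar.wav') as f:  # placeholder input file
    audio = f.read(f.frames)
    samplerate = f.samplerate

# A Pedalboard is an ordered chain of effect plugins.
board = Pedalboard([Chorus(), Reverb(room_size=0.25)])

# Apply the whole chain in one call.
effected = board(audio, samplerate)

# Write the processed audio back to disk.
with AudioFile('guitar-wet.wav', 'w', samplerate, effected.shape[0]) as f:
    f.write(effected)
```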
Spleeter is Deezer's source separation library with pretrained models; it is written in Python and uses TensorFlow.
It makes it easy to train a source separation model (assuming you have a dataset of isolated sources), and provides already-trained state-of-the-art models for performing various flavours of separation (see the sketch after this list):
Vocals / drums / bass / other separation (4 stems)
Vocals / drums / bass / piano / other separation (5 stems)
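A minimal sketch of running the pretrained 4-stem model from Python (the input file name is a placeholder; Spleeter also ships a CLI that does the same thing):

```python
from spleeter.separator import Separator

# Load the pretrained 4-stem model: vocals / drums / bass / other.
separator = Separator('spleeter:4stems')

# Separate a track and write one audio file per stem into output/<track name>/.
separator.separate_to_file('audio_example.mp3', 'output/')
```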
Essentia - Open-source C++ library for audio and music analysis, description, synthesis and music information retrieval
Essentia is not a framework, but rather a collection of algorithms (plus some infrastructure) wrapped in a library, designed with a focus on the robustness, performance, and optimality of the provided algorithms, including computational speed and memory usage, as well as ease of use.
Contains algorithms for: audio input/output, standard digital signal processing blocks, statistical characterization of data, a large variety of spectral, temporal, tonal, and high-level music descriptors, and tools for inference with deep learning models
Python and JavaScript bindings
CLI tools and third-party extensions
A large part of Essentia's algorithms is well-suited for real-time applications.
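A minimal sketch using a few of Essentia's standard-mode Python algorithms (file name is a placeholder; output tuples follow the Essentia reference and may vary across versions):

```python
import essentia.standard as es

# Load mono audio at 44.1 kHz.
audio = es.MonoLoader(filename='track.wav', sampleRate=44100)()  # placeholder file

# Tempo: RhythmExtractor2013 returns BPM, beat positions, confidence, estimates, intervals.
bpm, beats, beats_confidence, _, beats_intervals = es.RhythmExtractor2013(method="multifeature")(audio)

# Key and scale estimation.
key, scale, strength = es.KeyExtractor()(audio)

print(f"{bpm:.1f} BPM, {key} {scale} (strength {strength:.2f})")
```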
Fun fact: The PyTorch-Kaldi Speech Recognition Toolkit developed by Mirco Ravanelli is a precursor to SpeechBrain. (PyTorch-Kaldi is an open-source repository for developing state-of-the-art DNN/HMM speech recognition systems. The DNN part is managed by PyTorch, while feature extraction, label computation, and decoding are performed with the Kaldi toolkit.)
Name (“Kaldi”): According to legend, Kaldi was the Ethiopian goatherder who discovered the coffee plant
Kaldi's goal is to have modern, flexible code, written in C++, that is easy to modify and extend
Features include:
Code-level integration with Finite State Transducers (FSTs) - compiles against the OpenFst toolkit (uses it as a library)
A matrix library that wraps BLAS and LAPACK
Lhotse - Lhotse is a Python library aiming to make speech and audio data preparation flexible and accessible to a wider community. Alongside k2, it is a part of the next generation Kaldi speech processing library
Like Kaldi, Lhotse provides standard data preparation recipes, but extends that with a seamless PyTorch integration through task-specific Dataset classes. The data and meta-data are represented in human-readable text manifests and exposed to the user through convenient Python classes.
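A minimal sketch of that workflow, assuming manifests have already been prepared by a recipe (paths are placeholders; class names follow recent Lhotse versions and may differ in older ones):

```python
from torch.utils.data import DataLoader
from lhotse import CutSet, Fbank, RecordingSet, SupervisionSet
from lhotse.dataset import K2SpeechRecognitionDataset, SimpleCutSampler

# Manifests: human-readable JSONL descriptions of audio and transcripts.
recordings = RecordingSet.from_file("recordings.jsonl.gz")       # placeholder paths
supervisions = SupervisionSet.from_file("supervisions.jsonl.gz")

# Cuts tie spans of audio to their supervisions and are the main unit of data.
cuts = CutSet.from_manifests(recordings=recordings, supervisions=supervisions)
cuts = cuts.compute_and_store_features(extractor=Fbank(), storage_path="feats")

# Task-specific PyTorch Dataset + sampler; the sampler hands batches of cuts to the dataset.
dataset = K2SpeechRecognitionDataset()
sampler = SimpleCutSampler(cuts, max_duration=200.0)
loader = DataLoader(dataset, sampler=sampler, batch_size=None)
```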
k2 - Goal: seamlessly integrate Finite State Automaton (FSA) and Finite State Transducer (FST) algorithms into autograd-based machine learning toolkits like PyTorch and TensorFlow
For speech recognition applications, this should make it easy to interpolate and combine various training objectives such as cross-entropy, CTC and MMI and to jointly optimize a speech recognition system with multiple decoding passes including lattice rescoring and confidence estimation
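A toy sketch of the core idea: arc scores of a k2 FSA are torch tensors, so a quantity computed over the graph (here the log-semiring total score) can be backpropagated to the arcs. Exact function names and signatures may vary across k2 versions.

```python
import k2

# A tiny acceptor in k2's text format: "src dst label score"; arcs entering the
# final state carry label -1, and the final state is listed on its own line.
s = """
0 1 1 0.5
1 2 -1 0.3
2
"""
fsa = k2.Fsa.from_str(s)
fsa.requires_grad_(True)            # make arc scores differentiable

fsa_vec = k2.create_fsa_vec([fsa])  # most k2 ops work on batches of FSAs
total = fsa_vec.get_tot_scores(log_semiring=True, use_double_scores=False)

total.sum().backward()
print(fsa.scores.grad)              # gradients w.r.t. individual arc scores
```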
OpenFst Library - library for constructing, combining, optimizing, and searching weighted finite-state transducers (FSTs)