• Amphion - Amphion (/æmˈfaɪən/) is a toolkit for Audio, Music, and Speech Generation.
    • “Its purpose is to support reproducible research and help junior researchers and engineers get started in the field of audio, music, and speech generation research and development.”
    • never used this
  • s3prl: Self-Supervised Speech Pre-training and Representation Learning. Self-supervised speech pre-trained models are called upstream in this toolkit, and are utilized in various downstream tasks.
  • PortAudio - an Open-Source Cross-Platform Audio API
  • sounddevice Python package
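A minimal sketch of playing audio through sounddevice (which wraps PortAudio). The tone generator is mine for illustration; only `sd.play` and `sd.wait` are sounddevice calls, and actual playback requires an output device:

```python
import numpy as np

def make_tone(freq_hz: float, duration_s: float, sample_rate: int = 44100) -> np.ndarray:
    """Generate a mono sine tone as a float32 array scaled to +/-0.3."""
    t = np.arange(int(duration_s * sample_rate)) / sample_rate
    return (0.3 * np.sin(2 * np.pi * freq_hz * t)).astype(np.float32)

if __name__ == "__main__":
    import sounddevice as sd  # PortAudio bindings; needs an audio output device
    fs = 44100
    sd.play(make_tone(440.0, 1.0, fs), fs)  # non-blocking start of playback
    sd.wait()                               # block until playback finishes
```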
  • spotify/pedalboard: 🎛 🔊 A Python library for working with audio.
    • pedalboard is a Python library for working with audio: reading, writing, rendering, adding effects, and more. It supports most popular audio file formats and a number of common audio effects out of the box, and also allows the use of VST3® and Audio Unit formats for loading third-party software instruments and effects.
    • pedalboard was built by Spotify’s Audio Intelligence Lab to enable using studio-quality audio effects from within Python and TensorFlow. Internally at Spotify, pedalboard is used for data augmentation to improve machine learning models and to help power features like Spotify’s AI DJ and AI Voice Translation.
  • spotify/klio: Smarter data pipelines for audio. - process audio files – or any binary files – easily and at scale
    • Klio jobs are opinionated data pipelines in Python (streaming or batch) built upon Apache Beam and tuned for audio and binary file processing.
  • deezer/spleeter: Deezer source separation library including pretrained models.
    • Spleeter is Deezer’s source separation library with pretrained models, written in Python and built on TensorFlow
    • It makes it easy to train source separation models (assuming you have a dataset of isolated sources), and provides already-trained state-of-the-art models for performing various flavours of separation:
      • Vocals (singing voice) / accompaniment separation (2 stems)
      • Vocals / drums / bass / other separation (4 stems)
      • Vocals / drums / bass / piano / other separation (5 stems)
  • Essentia - Open-source C++ library for audio and music analysis, description, synthesis and music information retrieval
    • Essentia is not a framework, but rather a collection of algorithms (plus some infrastructure) wrapped in a library, designed with a focus on the robustness, performance, and optimality of the provided algorithms, including computational speed and memory usage, as well as ease of use.
    • Contains algorithms for: audio input/output, standard digital signal processing blocks, statistical characterization of data, a large variety of spectral, temporal, tonal, and high-level music descriptors, and tools for inference with deep learning models
    • Python and JavaScript bindings
    • CLI tools and third-party extensions
    • A large part of Essentia’s algorithms is well-suited for real-time applications.
  • Librosa - Python package for music and audio analysis
  • SpeechBrain Open-Source Conversational AI for Everyone
  • ESPnet: end-to-end speech processing toolkit covering: automatic speech recognition (end-to-end), text-to-speech, speech translation, speech enhancement, speaker diarization, spoken language understanding
  • Kaldi: toolkit for speech recognition written in C++
      • Kaldi
    • Kaldi code
    • Kaldi for Dummies tutorial
    • Name (“Kaldi”): According to legend, Kaldi was the Ethiopian goatherder who discovered the coffee plant
    • Goal is to have modern and flexible code, written in C++, that is easy to modify and extend
    • Features include:
      • Code-level integration with Finite State Transducers (FSTs) - compiles against the OpenFst toolkit (uses it as a library)
      • includes a matrix library that wraps BLAS and LAPACK
  • Lhotse - a Python library aiming to make speech and audio data preparation flexible and accessible to a wider community. Alongside k2, it is part of the next-generation Kaldi speech processing library
    • Like Kaldi, Lhotse provides standard data preparation recipes, but extends that with a seamless PyTorch integration through task-specific Dataset classes. The data and meta-data are represented in human-readable text manifests and exposed to the user through convenient Python classes.
  • k2
    • Goal: seamlessly integrate Finite State Automaton (FSA) and Finite State Transducer (FST) algorithms into autograd-based machine learning toolkits like PyTorch and TensorFlow
    • For speech recognition applications, this should make it easy to interpolate and combine various training objectives such as cross-entropy, CTC and MMI and to jointly optimize a speech recognition system with multiple decoding passes including lattice rescoring and confidence estimation
  • OpenFst Library - library for constructing, combining, optimizing, and searching weighted finite-state transducers (FSTs)
  • python_speech_features - Python library providing common speech features for ASR, including MFCCs and filterbank energies
  • Audacity (link is to manual)
  • FFmpeg - multimedia framework, able to decode, encode, transcode, mux, demux, stream, filter and play
  • ffprobe - a multimedia stream analyser
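A sketch of driving ffprobe from Python: with `-print_format json -show_format -show_streams`, ffprobe emits machine-readable metadata that can be parsed directly. The file name `input.wav` is hypothetical:

```python
import json
import subprocess

def probe_cmd(path: str) -> list:
    """Build an ffprobe invocation that emits stream/format metadata as JSON."""
    return ["ffprobe", "-v", "quiet", "-print_format", "json",
            "-show_format", "-show_streams", path]

def probe(path: str) -> dict:
    """Run ffprobe on a media file and parse its JSON output."""
    out = subprocess.run(probe_cmd(path), capture_output=True, check=True).stdout
    return json.loads(out)

if __name__ == "__main__":
    info = probe("input.wav")  # hypothetical file; requires ffprobe on PATH
    print(info["format"]["duration"], "seconds")
```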
  • Montreal Forced Aligner (MFA)
  • torchaudio and docs