- Vision-Speech Models Teaching Speech Models to Converse about Images - MoshiVis
- LLaVA-NeXT-Interleave Tackling Multi-image, Video, and 3D in Large Multimodal Models - LLaVA-NeXT
- ASIF Coupled Data Turns Unimodal Models to Multimodal Without Training from GLADIA under Emanuele Rodolà
- BLIP-2 Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models - Q-former
- CoCa Contrastive Captioners are Image-Text Foundation Models
- Cobra Extending Mamba to Multi-Modal Large Language Model for Efficient Inference
- ConTextual Evaluating Context-Sensitive Text-Rich Visual Reasoning in Large Multimodal Models
- EMO Emote Portrait Alive - Generating Expressive Portrait Videos with Audio2Video Diffusion Model under Weak Conditions
- EMOVA Empowering Language Models to See, Hear and Speak with Vivid Emotions
- Flamingo a Visual Language Model for Few-Shot Learning
- How do Multimodal Foundation Models Encode Text and Speech An Analysis of Cross-Lingual and Cross-Modal Representations
- Hyperbolic Learning with Multimodal Large Language Models
- ImageBind One Embedding Space To Bind Them All - an approach to learn a joint embedding across six different modalities - images, text, audio, depth, thermal, and IMU data. We show that all combinations of paired data are not necessary to train such a joint embedding, and only image-paired data is sufficient to bind the modalities together
- Improved Baselines with Visual Instruction Tuning - LLaVA-1.5
- InternVL Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks
- InternVL2
- Intuitive Multilingual Audio-Visual Speech Recognition with a Single-Trained Model
- It's Never Too Late Fusing Acoustic Information into Large Language Models for Automatic Speech Recognition
- LLaVA-Phi Efficient Multi-Modal Assistant with Small Language Model
- Learning Audio-Visual Speech Representation by Masked Multimodal Cluster Prediction - AV-HuBERT
- Learning Transferable Visual Models From Natural Language Supervision - CLIP
- LiT Zero-Shot Transfer with Locked-image text Tuning
- Locked-Image Tuning Adding Language Understanding to Image Models - Google blog post on LiT
- MERLOT Multimodal Neural Script Knowledge Models - MERLOT
- MERLOT Reserve Neural Script Knowledge through Vision and Language and Sound - MERLOT Reserve
- MM-LLMs Recent Advances in MultiModal Large Language Models
- Mamba in Speech Towards an Alternative to Self-Attention
- More than Words In-the-Wild Visually-Driven Prosody for Text-to-Speech
- MouSi Poly-Visual-Expert Vision-Language Models
- Multimodal Few-Shot Learning with Frozen Language Models
- Needle In A Multimodal Haystack
- ONE-PEACE Exploring One General Representation Model Toward Unlimited Modalities
- On Compositions of Transformations in Contrastive Self-Supervised Learning - Meta paper from 2020 using cross-modal audio-video alignment as a surrogate (pretext) task for pre-training backbones, per Meta's release
- OneLLM One Framework to Align All Modalities with Language
- OpenFlamingo An Open-Source Framework for Training Large Autoregressive Vision-Language Models
- PaliGemma A versatile 3B VLM for transfer
- Relative representations enable zero-shot latent space communication - Rodolà's GLADIA group
- SONAR Sentence-Level Multimodal and Language-Agnostic Representations
- SPECTRUM Semantic Processing and Emotion-informed video-Captioning Through Retrieval and Understanding Modalities
- Sigmoid Loss for Language Image Pre-Training - SigLIP
- SlowFast-LLaVA A Strong Training-Free Baseline for Video Large Language Models
- SpiRit-LM Interleaved Spoken and Written Language Model
- Textually Pretrained Speech Language Models - TWIST
- TinyLLaVA A Framework of Small-scale Large Multimodal Models
- Unified Video-Language Pre-training with Synchronized Audio - VLSA
- VAST A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset
- VCoder Versatile Vision Encoders for Multimodal Large Language Models
- VideoPrism A Foundational Visual Encoder for Video Understanding
- Vision-Language Integration in Multimodal Video Transformers (Partially) Aligns with the Brain
- Visual Instruction Tuning - LLaVa
- What matters when building vision-language models - Idefics2
- X-LLM Bootstrapping Advanced Large Language Models by Treating Multi-Modalities as Foreign Languages
- xGen-MM (BLIP-3) A Family of Open Large Multimodal Models - BLIP-3
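The ImageBind entry above notes that image-paired data alone is enough to bind modalities into one joint embedding space. A minimal numpy sketch of the symmetric InfoNCE objective that drives that kind of alignment, with toy stand-in embeddings (all names and numbers here are illustrative, not from the paper's code):

```python
import numpy as np

def infonce_loss(z_a, z_b, temperature=0.07):
    """Symmetric InfoNCE between two batches of paired embeddings.
    Matched pairs sit on the diagonal of the similarity matrix."""
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
    logits = z_a @ z_b.T / temperature          # (N, N) cosine similarities
    labels = np.arange(len(z_a))                # target: the diagonal

    def xent(l):
        l = l - l.max(axis=1, keepdims=True)    # numerical stability
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    # average of both directions (a->b retrieval and b->a retrieval)
    return 0.5 * (xent(logits) + xent(logits.T))

# Toy demo of the "bind via images" idea: text and audio are each aligned
# to image embeddings; text-audio alignment then emerges without any
# direct text-audio pairs (these arrays are synthetic stand-ins).
rng = np.random.default_rng(0)
img = rng.normal(size=(8, 32))
txt = img + 0.1 * rng.normal(size=(8, 32))   # pretend text encoder output, aligned to img
aud = img + 0.1 * rng.normal(size=(8, 32))   # pretend audio encoder output, aligned to img
```

Here the text-audio loss ends up low even though the two were only ever paired with images, which is the emergent-binding behavior the paper describes.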
There's plenty of multimodality literature in Vision and under the multimodality tag, likely including some I neglected to add above.
Multimodal datasets are under Multimodal Datasets
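The "On Compositions of Transformations" entry in the list above uses audio-video alignment as a pretext task: the model must tell temporally aligned audio-video pairs from mismatched ones. A minimal numpy sketch of how such pretext pairs and an alignment score can be constructed (helper names are my own, not from the paper):

```python
import numpy as np

def make_sync_pairs(video_feats, audio_feats):
    """Pretext-task data: label 1 for same-clip (aligned) audio-video pairs,
    label 0 for mismatched pairs built by rotating the audio batch so no
    clip keeps its own audio."""
    n = len(video_feats)
    shift = np.roll(np.arange(n), 1)  # pairs each video with another clip's audio
    pos = [(video_feats[i], audio_feats[i], 1) for i in range(n)]
    neg = [(video_feats[i], audio_feats[shift[i]], 0) for i in range(n)]
    return pos + neg

def sync_logit(v, a):
    """Alignment score for a (video, audio) embedding pair: a plain dot
    product, which a binary cross-entropy loss would be applied to."""
    return float(v @ a)
```

Training a backbone to separate the two labels forces it to learn features that capture cross-modal correspondence, which is the surrogate objective the paper exploits for pre-training.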
Surveys
- Next Token Prediction Towards Multimodal Intelligence A Comprehensive Survey
- A Survey on Multimodal Large Language Models
- An Introduction to Vision-Language Modeling
- Multimodal Machine Learning A Survey and Taxonomy
Evaluation
- MSTS: A Multimodal Safety Test Suite for Vision-Language Models, from Paul Röttger, Giuseppe Attanasio and co.
- MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI
- ScienceQA: Learn to Explain Multimodal Reasoning via Thought Chains for Science Question Answering
- SciFIBench: Benchmarking Large Multimodal Models for Scientific Figure Interpretation
Implementation (Code)
- PyTorch Multimodal
- Common Layers e.g. Q-former from BLIP-2
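The Q-former from BLIP-2 mentioned above is, at its core, a set of learned query tokens that cross-attend to frozen image features and produce a fixed-length summary for the LLM. A single-head, numpy-only sketch of that cross-attention step (shapes and names are illustrative; the real Q-Former is a BERT-style stack with self-attention, feed-forward layers, and multiple heads):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def qformer_block(queries, image_feats, Wq, Wk, Wv):
    """Single-head cross-attention: learned queries attend to frozen
    image features and return one output vector per query token."""
    q = queries @ Wq          # (n_query, d) projected queries
    k = image_feats @ Wk      # (n_patch, d) projected keys
    v = image_feats @ Wv      # (n_patch, d) projected values
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)  # (n_query, n_patch)
    return attn @ v           # (n_query, d): fixed size regardless of n_patch
```

The key property, and the reason this layer is reusable across models, is that the output length depends only on the number of query tokens, so any image resolution (any number of patches) is compressed to the same fixed-size interface for the language model.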