- Vision-Speech Models Teaching Speech Models to Converse about Images - MoshiVis
- LLaVA-NeXT-Interleave Tackling Multi-image, Video, and 3D in Large Multimodal Models - LLaVA-NeXT
- ASIF Coupled Data Turns Unimodal Models to Multimodal Without Training from GLADIA under Emanuele Rodolà
- BLIP-2 Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models - Q-former
- CoCa Contrastive Captioners are Image-Text Foundation Models
- Cobra Extending Mamba to Multi-Modal Large Language Model for Efficient Inference
- ConTextual Evaluating Context-Sensitive Text-Rich Visual Reasoning in Large Multimodal Models
- EMO Emote Portrait Alive - Generating Expressive Portrait Videos with Audio2Video Diffusion Model under Weak Conditions
- EMOVA Empowering Language Models to See, Hear and Speak with Vivid Emotions
- Flamingo a Visual Language Model for Few-Shot Learning
- How do Multimodal Foundation Models Encode Text and Speech An Analysis of Cross-Lingual and Cross-Modal Representations
- Hyperbolic Learning with Multimodal Large Language Models
- ImageBind One Embedding Space To Bind Them All - an approach to learn a joint embedding across six different modalities - images, text, audio, depth, thermal, and IMU data. We show that all combinations of paired data are not necessary to train such a joint embedding, and only image-paired data is sufficient to bind the modalities together
- Improved Baselines with Visual Instruction Tuning - LLaVA-1.5
- InternVL Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks
- InternVL2
- Intuitive Multilingual Audio-Visual Speech Recognition with a Single-Trained Model
- It's Never Too Late Fusing Acoustic Information into Large Language Models for Automatic Speech Recognition
- LLaVA-Phi Efficient Multi-Modal Assistant with Small Language Model
- Learning Audio-Visual Speech Representation by Masked Multimodal Cluster Prediction - AV-HuBERT
- Learning Transferable Visual Models From Natural Language Supervision - CLIP
- LiT Zero-Shot Transfer with Locked-image text Tuning
- Locked-Image Tuning Adding Language Understanding to Image Models - Google blog post on LiT
- MERLOT Multimodal Neural Script Knowledge Models - MERLOT
- MERLOT Reserve Neural Script Knowledge through Vision and Language and Sound - MERLOT Reserve
- MM-LLMs Recent Advances in MultiModal Large Language Models
- Mamba in Speech Towards an Alternative to Self-Attention
- More than Words In-the-Wild Visually-Driven Prosody for Text-to-Speech
- MouSi Poly-Visual-Expert Vision-Language Models
- Multimodal Few-Shot Learning with Frozen Language Models
- Needle In A Multimodal Haystack
- ONE-PEACE Exploring One General Representation Model Toward Unlimited Modalities
- On Compositions of Transformations in Contrastive Self-Supervised Learning - Meta paper from 2020 using cross-modal audio-video alignment as a surrogate (pretext) task for pre-training backbones, per Meta's release
- OneLLM One Framework to Align All Modalities with Language
- OpenFlamingo An Open-Source Framework for Training Large Autoregressive Vision-Language Models
- PaliGemma A versatile 3B VLM for transfer
- Relative representations enable zero-shot latent space communication - Rodolà's GLADIA group
- SONAR Sentence-Level Multimodal and Language-Agnostic Representations
- SPECTRUM Semantic Processing and Emotion-informed video-Captioning Through Retrieval and Understanding Modalities
- Sigmoid Loss for Language Image Pre-Training - SigLIP
- SlowFast-LLaVA A Strong Training-Free Baseline for Video Large Language Models
- SpiRit-LM Interleaved Spoken and Written Language Model
- Textually Pretrained Speech Language Models - TWIST
- TinyLLaVA A Framework of Small-scale Large Multimodal Models
- Unified Video-Language Pre-training with Synchronized Audio - VLSA
- VAST A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset
- VCoder Versatile Vision Encoders for Multimodal Large Language Models
- VideoPrism A Foundational Visual Encoder for Video Understanding
- Vision-Language Integration in Multimodal Video Transformers (Partially) Aligns with the Brain
- Visual Instruction Tuning - LLaVa
- What matters when building vision-language models - Idefics2
- X-LLM Bootstrapping Advanced Large Language Models by Treating Multi-Modalities as Foreign Languages
- xGen-MM (BLIP-3) A Family of Open Large Multimodal Models - BLIP-3
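The ImageBind entry above notes that image-paired data alone is enough to bind modalities into one joint embedding space. A minimal numpy sketch of the symmetric InfoNCE objective that drives that kind of alignment, with toy stand-in embeddings (all names and numbers here are illustrative, not from the paper's code):

```python
import numpy as np

def infonce_loss(z_a, z_b, temperature=0.07):
    """Symmetric InfoNCE between two batches of paired embeddings.
    Matched pairs sit on the diagonal of the similarity matrix."""
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
    logits = z_a @ z_b.T / temperature          # (N, N) cosine similarities
    labels = np.arange(len(z_a))                # target: the diagonal

    def xent(l):
        l = l - l.max(axis=1, keepdims=True)    # numerical stability
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    # average of both directions (a->b retrieval and b->a retrieval)
    return 0.5 * (xent(logits) + xent(logits.T))

# Toy demo of the "bind via images" idea: text and audio are each aligned
# to image embeddings; text-audio alignment then emerges without any
# direct text-audio pairs (these arrays are synthetic stand-ins).
rng = np.random.default_rng(0)
img = rng.normal(size=(8, 32))
txt = img + 0.1 * rng.normal(size=(8, 32))   # pretend text encoder output, aligned to img
aud = img + 0.1 * rng.normal(size=(8, 32))   # pretend audio encoder output, aligned to img
```

Here the text-audio loss ends up low even though the two were only ever paired with images, which is the emergent-binding behavior the paper describes.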
There's plenty of multimodality literature in Vision and under the multimodality tag, likely including some I neglected to add above.
Multimodal datasets are under Multimodal Datasets
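The "On Compositions of Transformations" entry in the list above uses audio-video alignment as a pretext task: the model must tell temporally aligned audio-video pairs from mismatched ones. A minimal numpy sketch of how such pretext pairs and an alignment score can be constructed (helper names are my own, not from the paper):

```python
import numpy as np

def make_sync_pairs(video_feats, audio_feats):
    """Pretext-task data: label 1 for same-clip (aligned) audio-video pairs,
    label 0 for mismatched pairs built by rotating the audio batch so no
    clip keeps its own audio."""
    n = len(video_feats)
    shift = np.roll(np.arange(n), 1)  # pairs each video with another clip's audio
    pos = [(video_feats[i], audio_feats[i], 1) for i in range(n)]
    neg = [(video_feats[i], audio_feats[shift[i]], 0) for i in range(n)]
    return pos + neg

def sync_logit(v, a):
    """Alignment score for a (video, audio) embedding pair: a plain dot
    product, which a binary cross-entropy loss would be applied to."""
    return float(v @ a)
```

Training a backbone to separate the two labels forces it to learn features that capture cross-modal correspondence, which is the surrogate objective the paper exploits for pre-training.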
Surveys
- Next Token Prediction Towards Multimodal Intelligence A Comprehensive Survey
- A Survey on Multimodal Large Language Models
- An Introduction to Vision-Language Modeling
- Multimodal Machine Learning A Survey and Taxonomy
Evaluation
- MSTS: A Multimodal Safety Test Suite for Vision-Language Models, from Paul Röttger, Giuseppe Attanasio and co.
- MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI
- ScienceQA: Learn to Explain Multimodal Reasoning via Thought Chains for Science Question Answering
- SciFIBench: Benchmarking Large Multimodal Models for Scientific Figure Interpretation
Implementation (Code)
- PyTorch Multimodal
- Common Layers e.g. Q-former from BLIP-2
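The Q-former from BLIP-2 mentioned above is, at its core, a set of learned query tokens that cross-attend to frozen image features and produce a fixed-length summary for the LLM. A single-head, numpy-only sketch of that cross-attention step (shapes and names are illustrative; the real Q-Former is a BERT-style stack with self-attention, feed-forward layers, and multiple heads):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def qformer_block(queries, image_feats, Wq, Wk, Wv):
    """Single-head cross-attention: learned queries attend to frozen
    image features and return one output vector per query token."""
    q = queries @ Wq          # (n_query, d) projected queries
    k = image_feats @ Wk      # (n_patch, d) projected keys
    v = image_feats @ Wv      # (n_patch, d) projected values
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)  # (n_query, n_patch)
    return attn @ v           # (n_query, d): fixed size regardless of n_patch
```

The key property, and the reason this layer is reusable across models, is that the output length depends only on the number of query tokens, so any image resolution (any number of patches) is compressed to the same fixed-size interface for the language model.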