- Video Instruction Tuning With Synthetic Data - "LLaVA-Video" - Saul/Patrick (not Video-LLaVA)
- LLaVA-OneVision Easy Visual Task Transfer
- LLaVA-NeXT Improved reasoning, OCR, and world knowledge - LLaVA-NeXT
- LLaVA-NeXT-Interleave Tackling Multi-image, Video, and 3D in Large Multimodal Models - LLaVA-NeXT
- Ovis Structural Embedding Alignment for Multimodal Large Language Model - Ovis
- mPLUG-DocOwl 1.5 Unified Structure Learning for OCR-free Document Understanding
- OneChart Purify the Chart Structural Extraction via One Auxiliary Token
- CAT Content-Adaptive Image Tokenization
- The Best of Both Worlds Integrating Language Models and Diffusion Models for Video Generation
- A Closer Look at Temporal Ordering in the Segmentation of Instructional Videos
- Adaptive deconvolutional networks for mid and high level feature learning
- An Action Is Worth Multiple Words Handling Ambiguity in Action Recognition
- An Image is Worth 16x16 Words Transformers for Image Recognition at Scale - Vision Transformer
- Big Transfer (BiT) General Visual Representation Learning - BiT
- Bootstrap your own latent A new approach to self-supervised Learning
- DETRs with Collaborative Hybrid Assignments Training
- Deformable DETR Deformable Transformers for End-to-End Object Detection
- Distribution Fields for Tracking - Laura Sevilla-Lara's most cited work
- End-to-End Dense Video Captioning with Parallel Decoding - PDVC
- End-to-End Learning of Visual Representations from Uncurated Instructional Videos
- End-to-End Object Detection with Transformers - DETR
- End-to-end Temporal Action Detection with Transformer - TadTR
- Exploring Simple Siamese Representation Learning
- Goku Flow Based Video Generative Foundation Models - Goku
- Human Action Localization with Sparse Spatial Supervision
- HYperbolic Self-Paced Learning for Self-Supervised Skeleton-based Action Representations
- Image Captioning and Visual Question Answering Based on Attributes and External Knowledge
- In Defense of Grid Features for Visual Question Answering
- JetFormer An Autoregressive Generative Model of Raw Images and Text - JetFormer
- LOCATE Localize and Transfer Object Parts for Weakly Supervised Affordance Grounding
- Learn2Augment Learning to Composite Videos for Data Augmentation in Action Recognition
- Learning Action Changes by Measuring Verb-Adverb Textual Relationships
- Learning Transferable Visual Models From Natural Language Supervision
- MERLOT Multimodal Neural Script Knowledge Models - MERLOT
- MERLOT Reserve Neural Script Knowledge through Vision and Language and Sound - MERLOT Reserve
- MiniGPT-4 Enhancing Vision-Language Understanding with Advanced Large Language Models
- MiniGPT4-Video Advancing Multimodal LLMs for Video Understanding with Interleaved Visual-Textual Tokens
- MLP-Mixer An all-MLP Architecture for Vision - MLP-Mixer
- Momentor Advancing Video Large Language Model with Fine-Grained Temporal Reasoning
- Momentum Contrast for Unsupervised Visual Representation Learning - MoCo
- Neural Motifs Scene Graph Parsing with Global Context
- Otter A Multi-Modal Model with In-Context Instruction Tuning
- PaLI A Jointly-Scaled Multilingual Language-Image Model
- PaliGemma A versatile 3B VLM for transfer
- PaliGemma 2 A Family of Versatile VLMs for Transfer
- Patch n' Pack NaViT, a Vision Transformer for any Aspect Ratio and Resolution
- Principles of Visual Tokens for Efficient Video Understanding from Xinyue Hao and Laura Sevilla-Lara
- Pyramid Feature Attention Network for Saliency detection
- Qwen2-VL Enhancing Vision-Language Model's Perception of the World at Any Resolution
- Rethinking Spatiotemporal Feature Learning Speed-Accuracy Trade-offs in Video Classification
- SPECTRUM Semantic Processing and Emotion-informed video-Captioning Through Retrieval and Understanding Modalities
- Revisiting Unreasonable Effectiveness of Data in Deep Learning Era
- STAR A Benchmark for Situated Reasoning in Real-World Videos
- The Surprising Effectiveness of Multimodal Large Language Models for Video Moment Retrieval
- Unified Video-Language Pre-training with Synchronized Audio - VLSA
- Unsupervised Visual Representation Learning by Context Prediction
- Vid2Seq Large-Scale Pretraining of a Visual Language Model for Dense Video Captioning
- Video Swin Transformer
- Video-ChatGPT Towards Detailed Video Understanding via Large Vision and Language Models
- Video-LLaVA Learning United Visual Representation by Alignment Before Projection
- VideoBERT A Joint Model for Video and Language Representation Learning
- VideoChat Chat-Centric Video Understanding
- VideoOFA Two-Stage Pre-Training for Video-to-Text Generation
- Vision-Language Integration in Multimodal Video Transformers (Partially) Aligns with the Brain
- VideoMAE Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training
- Visual Prompt Tuning
- Visualizing and Understanding Convolutional Networks
- VideoLlamas
- What Uncertainties Do We Need in Bayesian Deep Learning for Computer Vision
- [[∞-Video A Training-Free Approach to Long Video Understanding via Continuous-Time Memory Consolidation]]
Dataset papers are under Vision Datasets
Evaluation
- Visual Commonsense Reasoning (VCR): From Recognition to Cognition Visual Commonsense Reasoning
- VCR Leaderboard: https://visualcommonsense.com/leaderboard/
- SODA: SODA Story Oriented Dense Video Captioning Evaluation Framework
- The Second Perception Test Challenge - ECCV Workshop 2024
- STAR: STAR A Benchmark for Situated Reasoning in Real-World Videos
- Artificial Analysis Video Generation Arena Leaderboard
- MSVD [Chen and Dolan, 2011]
- MSR-VTT: MSR-VTT A Large Video Description Dataset for Bridging Video and Language
- TGIF TGIF A New Dataset and Benchmark on Animated GIF Description
- TVQA: TVQA Localized, Compositional Video Question Answering
- MVBench: MVBench A Comprehensive Multi-modal Video Understanding Benchmark
- Charades-STA - moment retrieval benchmark
- QVHighlights - moment retrieval benchmark; also described as a "challenging long-video multi-moment benchmark"
- ActivityNet - moment retrieval benchmark
- EPIC-Kitchens [26]
- LSMDC-FiB [96]
- MSR-VTT QA [120]
- DeVAn: DeVAn Dense Video Annotation for Video-Language Models
- HACS: HACS Human Action Clips and Segments Dataset for Recognition and Temporal Localization
- THUMOS: The THUMOS Challenge on Action Recognition for Videos in the Wild
- SOK-Bench: SOK-Bench A Situated Video Reasoning Benchmark with Aligned Open-World Knowledge
- VideoVista: VideoVista A Versatile Benchmark for Video Understanding and Reasoning
- Video-Bench: Video-Bench A Comprehensive Benchmark and Toolkit for Evaluating Video-based Large Language Models
- ReXTime: ReXTime A Benchmark Suite for Reasoning-Across-Time in Videos
- NExT-QA
- EgoQA
- Video-MME
- MovieChat
Surveys
Tasks
In the following, reference numbers are from HowTo100M Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips:
- text-to-video retrieval [25, 32, 54, 55, 63]
- text-based action or event localization [15]
- video captioning [36, 61]
- video question answering [51, 63]
- video summarization with natural language [38]
Implementation
torchvision - PyTorch
- Models and pre-trained weights
- Video classification models are available, with or without pre-trained weights (a loading sketch follows this list)
- Datasets
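
A minimal sketch of pulling one of torchvision's pretrained video classification models via the weights-enum API. The choice of r3d_18, the Kinetics-400 weights, and the clip path "example.mp4" are arbitrary picks for illustration, not a recommendation from the docs.

```python
# Sketch: classify a video clip with a torchvision pretrained model.
# Assumes torchvision >= 0.13 (weights enum API) and a local file example.mp4.
import torch
from torchvision.io import read_video
from torchvision.models.video import r3d_18, R3D_18_Weights

weights = R3D_18_Weights.KINETICS400_V1          # Kinetics-400 pretrained weights
model = r3d_18(weights=weights).eval()
preprocess = weights.transforms()                # resize/crop/normalize preset for this model

# read_video returns (frames, audio, info); "TCHW" gives uint8 frames shaped (T, C, H, W)
frames, _, _ = read_video("example.mp4", output_format="TCHW", pts_unit="sec")
batch = preprocess(frames).unsqueeze(0)          # preset permutes to (C, T, H, W); add batch dim

with torch.no_grad():
    probs = model(batch).softmax(dim=1)

top = probs.argmax(dim=1).item()
print(weights.meta["categories"][top])           # predicted Kinetics-400 label
```
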
Resources
- Stanford CS231n: Deep Learning for Computer Vision
- Math Behind CNNs for Image Processing Svitla Systems
- Video Understanding - OpenAI Cookbook entry from Nov 6, 2023: "Processing and narrating a video with GPT's visual capabilities and the TTS API" (a minimal sketch follows this list)
- Launchpad Reading Group videos
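
A minimal sketch of the cookbook's frame-sampling idea: grab every Nth frame with OpenCV, base64-encode the frames, and ask a vision-capable chat model to narrate them. The model name "gpt-4o", the sampling stride, and the file path are assumptions for illustration, not taken from the cookbook itself.

```python
# Sketch: sample frames from a video and have a vision-capable model narrate them.
# Assumes opencv-python and the openai>=1.0 SDK are installed and OPENAI_API_KEY is set.
import base64
import cv2
from openai import OpenAI

def sample_frames(path: str, every_n: int = 50) -> list[str]:
    """Return base64-encoded JPEG frames, one every `every_n` frames (stride is an assumption)."""
    frames, cap, i = [], cv2.VideoCapture(path), 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if i % every_n == 0:
            ok_enc, buf = cv2.imencode(".jpg", frame)
            if ok_enc:
                frames.append(base64.b64encode(buf).decode("utf-8"))
        i += 1
    cap.release()
    return frames

client = OpenAI()
content = [{"type": "text", "text": "Narrate this video in a few sentences."}]
content += [
    {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{f}"}}
    for f in sample_frames("example.mp4")        # hypothetical input clip
]
resp = client.chat.completions.create(
    model="gpt-4o",                              # assumed vision-capable model
    messages=[{"role": "user", "content": content}],
)
print(resp.choices[0].message.content)
```

The narration text could then be passed to the TTS API, as the cookbook entry describes, to produce a spoken voice-over.
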