- Video Instruction Tuning With Synthetic Data - "LLaVA-Video" - Saul/Patrick (not Video-LLaVA)
- LLaVA-OneVision Easy Visual Task Transfer
- LLaVA-NeXT Improved reasoning, OCR, and world knowledge - LLaVA-NeXT
- LLaVA-NeXT-Interleave Tackling Multi-image, Video, and 3D in Large Multimodal Models - LLaVA-NeXT
- Ovis Structural Embedding Alignment for Multimodal Large Language Model - Ovis
- mPLUG-DocOwl 1.5 Unified Structure Learning for OCR-free Document Understanding
- OneChart Purify the Chart Structural Extraction via One Auxiliary Token
- CAT Content-Adaptive Image Tokenization
- The Best of Both Worlds Integrating Language Models and Diffusion Models for Video Generation
- A Closer Look at Temporal Ordering in the Segmentation of Instructional Videos
- Adaptive deconvolutional networks for mid and high level feature learning
- An Action Is Worth Multiple Words Handling Ambiguity in Action Recognition
- An Image is Worth 16x16 Words Transformers for Image Recognition at Scale - Vision Transformer
- Big Transfer (BiT) General Visual Representation Learning - BiT
- Bootstrap your own latent A new approach to self-supervised Learning
- DETRs with Collaborative Hybrid Assignments Training
- Deformable DETR Deformable Transformers for End-to-End Object Detection
- Distribution Fields for Tracking - Laura Sevilla-Lara's most cited work
- End-to-End Dense Video Captioning with Parallel Decoding - PDVC
- End-to-End Learning of Visual Representations from Uncurated Instructional Videos
- End-to-End Object Detection with Transformers - DETR
- End-to-end Temporal Action Detection with Transformer - TadTR
- Exploring Simple Siamese Representation Learning
- Goku Flow Based Video Generative Foundation Models - Goku
- Human Action Localization with Sparse Spatial Supervision
- HYperbolic Self-Paced Learning for Self-Supervised Skeleton-based Action Representations
- Image Captioning and Visual Question Answering Based on Attributes and External Knowledge
- In Defense of Grid Features for Visual Question Answering
- JetFormer An Autoregressive Generative Model of Raw Images and Text - JetFormer
- LOCATE Localize and Transfer Object Parts for Weakly Supervised Affordance Grounding
- Learn2Augment Learning to Composite Videos for Data Augmentation in Action Recognition
- Learning Action Changes by Measuring Verb-Adverb Textual Relationships
- Learning Transferable Visual Models From Natural Language Supervision
- MERLOT Multimodal Neural Script Knowledge Models - MERLOT
- MERLOT Reserve Neural Script Knowledge through Vision and Language and Sound - MERLOT Reserve
- MiniGPT-4 Enhancing Vision-Language Understanding with Advanced Large Language Models
- MiniGPT4-Video Advancing Multimodal LLMs for Video Understanding with Interleaved Visual-Textual Tokens
- MLP-Mixer An all-MLP Architecture for Vision - MLP-Mixer
- Momentor Advancing Video Large Language Model with Fine-Grained Temporal Reasoning
- Momentum Contrast for Unsupervised Visual Representation Learning - MoCo
- Neural Motifs Scene Graph Parsing with Global Context
- Otter A Multi-Modal Model with In-Context Instruction Tuning
- PaLI A Jointly-Scaled Multilingual Language-Image Model
- PaliGemma A versatile 3B VLM for transfer
- PaliGemma 2 A Family of Versatile VLMs for Transfer
- Patch n' Pack NaViT, a Vision Transformer for any Aspect Ratio and Resolution
- Principles of Visual Tokens for Efficient Video Understanding from Xinyue Hao and Laura Sevilla-Lara
- Pyramid Feature Attention Network for Saliency detection
- Qwen2-VL Enhancing Vision-Language Model's Perception of the World at Any Resolution
- Rethinking Spatiotemporal Feature Learning Speed-Accuracy Trade-offs in Video Classification
- SPECTRUM Semantic Processing and Emotion-informed video-Captioning Through Retrieval and Understanding Modalities
- Revisiting Unreasonable Effectiveness of Data in Deep Learning Era
- STAR A Benchmark for Situated Reasoning in Real-World Videos
- The Surprising Effectiveness of Multimodal Large Language Models for Video Moment Retrieval
- Unified Video-Language Pre-training with Synchronized Audio - VLSA
- Unsupervised Visual Representation Learning by Context Prediction
- Vid2Seq Large-Scale Pretraining of a Visual Language Model for Dense Video Captioning
- Video Swin Transformer
- Video-ChatGPT Towards Detailed Video Understanding via Large Vision and Language Models
- Video-LLaVA Learning United Visual Representation by Alignment Before Projection
- VideoBERT A Joint Model for Video and Language Representation Learning
- VideoChat Chat-Centric Video Understanding
- VideoOFA Two-Stage Pre-Training for Video-to-Text Generation
- Vision-Language Integration in Multimodal Video Transformers (Partially) Aligns with the Brain
- VideoMAE Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training
- Visual Prompt Tuning
- Visualizing and Understanding Convolutional Networks
- VideoLlamas
- What Uncertainties Do We Need in Bayesian Deep Learning for Computer Vision
- [[∞-Video A Training-Free Approach to Long Video Understanding via Continuous-Time Memory Consolidation]]
Dataset papers are under Vision Datasets
Evaluation
- Visual Commonsense Reasoning (VCR): From Recognition to Cognition Visual Commonsense Reasoning
- VCR Leaderboard: https://visualcommonsense.com/leaderboard/
- SODA: SODA Story Oriented Dense Video Captioning Evaluation Framework
- The Second Perception Test Challenge - ECCV Workshop 2024
- STAR: STAR A Benchmark for Situated Reasoning in Real-World Videos
- Artificial Analysis Video Generation Arena Leaderboard
- MSVD [Chen and Dolan, 2011]
- MSR-VTT: MSR-VTT A Large Video Description Dataset for Bridging Video and Language
- TGIF TGIF A New Dataset and Benchmark on Animated GIF Description
- TVQA: TVQA Localized, Compositional Video Question Answering
- MVBench: MVBench A Comprehensive Multi-modal Video Understanding Benchmark
- Charades-STA - moment retrieval benchmark
- QVHighlights - moment retrieval benchmark; also described as a "challenging long-video multi-moment benchmark"
- ActivityNet - moment retrieval benchmark
- EPIC-Kitchens [26]
- LSMDC-FiB [96]
- MSR-VTT QA [120]
- DeVAn: DeVAn Dense Video Annotation for Video-Language Models
- HACS: HACS Human Action Clips and Segments Dataset for Recognition and Temporal Localization
- THUMOS: The THUMOS Challenge on Action Recognition for Videos in the Wild
- SOK-Bench: SOK-Bench A Situated Video Reasoning Benchmark with Aligned Open-World Knowledge
- VideoVista: VideoVista A Versatile Benchmark for Video Understanding and Reasoning
- Video-Bench: Video-Bench A Comprehensive Benchmark and Toolkit for Evaluating Video-based Large Language Models
- ReXTime: ReXTime A Benchmark Suite for Reasoning-Across-Time in Videos
- NExT-QA
- EgoQA
- Video-MME
- MovieChat
Surveys
Tasks
In the following, reference numbers are from HowTo100M Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips:
- text-to-video retrieval [25, 32, 54, 55, 63]
- text-based action or event localization [15]
- video captioning [36, 61]
- video question answering [51, 63]
- video summarization with natural language [38]
Implementation
torchvision - PyTorch
- Models and pre-trained weights
- Video classification models are available, with or without pre-trained weights (a loading sketch follows this list)
- Datasets
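
A minimal sketch of pulling one of torchvision's pretrained video classification models via the weights-enum API. The choice of r3d_18, the Kinetics-400 weights, and the clip path "example.mp4" are arbitrary picks for illustration, not a recommendation from the docs.

```python
# Sketch: classify a video clip with a torchvision pretrained model.
# Assumes torchvision >= 0.13 (weights enum API) and a local file example.mp4.
import torch
from torchvision.io import read_video
from torchvision.models.video import r3d_18, R3D_18_Weights

weights = R3D_18_Weights.KINETICS400_V1          # Kinetics-400 pretrained weights
model = r3d_18(weights=weights).eval()
preprocess = weights.transforms()                # resize/crop/normalize preset for this model

# read_video returns (frames, audio, info); "TCHW" gives uint8 frames shaped (T, C, H, W)
frames, _, _ = read_video("example.mp4", output_format="TCHW", pts_unit="sec")
batch = preprocess(frames).unsqueeze(0)          # preset permutes to (C, T, H, W); add batch dim

with torch.no_grad():
    probs = model(batch).softmax(dim=1)

top = probs.argmax(dim=1).item()
print(weights.meta["categories"][top])           # predicted Kinetics-400 label
```
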
Resources
- Stanford CS231n: Deep Learning for Computer Vision
- Math Behind CNNs for Image Processing Svitla Systems
- Video Understanding - OpenAI Cookbook entry from Nov 6, 2023: "Processing and narrating a video with GPT's visual capabilities and the TTS API" (a minimal sketch follows this list)
- Launchpad Reading Group videos
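
A minimal sketch of the cookbook's frame-sampling idea: grab every Nth frame with OpenCV, base64-encode the frames, and ask a vision-capable chat model to narrate them. The model name "gpt-4o", the sampling stride, and the file path are assumptions for illustration, not taken from the cookbook itself.

```python
# Sketch: sample frames from a video and have a vision-capable model narrate them.
# Assumes opencv-python and the openai>=1.0 SDK are installed and OPENAI_API_KEY is set.
import base64
import cv2
from openai import OpenAI

def sample_frames(path: str, every_n: int = 50) -> list[str]:
    """Return base64-encoded JPEG frames, one every `every_n` frames (stride is an assumption)."""
    frames, cap, i = [], cv2.VideoCapture(path), 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if i % every_n == 0:
            ok_enc, buf = cv2.imencode(".jpg", frame)
            if ok_enc:
                frames.append(base64.b64encode(buf).decode("utf-8"))
        i += 1
    cap.release()
    return frames

client = OpenAI()
content = [{"type": "text", "text": "Narrate this video in a few sentences."}]
content += [
    {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{f}"}}
    for f in sample_frames("example.mp4")        # hypothetical input clip
]
resp = client.chat.completions.create(
    model="gpt-4o",                              # assumed vision-capable model
    messages=[{"role": "user", "content": content}],
)
print(resp.choices[0].message.content)
```

The narration text could then be passed to the TTS API, as the cookbook entry describes, to produce a spoken voice-over.
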