- A Closer Look at Temporal Ordering in the Segmentation of Instructional Videos
- Adaptive deconvolutional networks for mid and high level feature learning
- Bootstrap Your Own Latent A New Approach to Self-Supervised Learning
- DETRs with Collaborative Hybrid Assignments Training
- Deformable DETR Deformable Transformers for End-to-End Object Detection
- Distribution Fields for Tracking - Laura Sevilla-Lara's most cited work
- End-to-End Dense Video Captioning with Parallel Decoding
- End-to-End Learning of Visual Representations from Uncurated Instructional Videos
- End-to-End Object Detection with Transformers - DETR
- Exploring Simple Siamese Representation Learning
- Human Action Localization with Sparse Spatial Supervision
- Image Captioning and Visual Question Answering Based on Attributes and External Knowledge
- LOCATE Localize and Transfer Object Parts for Weakly Supervised Affordance Grounding
- Learn2Augment Learning to Composite Videos for Data Augmentation in Action Recognition
- Learning Action Changes by Measuring Verb-Adverb Textual Relationships
- Learning Transferable Visual Models From Natural Language Supervision
- MERLOT Multimodal Neural Script Knowledge Models
- MERLOT Reserve Neural Script Knowledge through Vision and Language and Sound
- Momentor Advancing Video Large Language Model with Fine-Grained Temporal Reasoning
- Momentum Contrast for Unsupervised Visual Representation Learning
- Neural Motifs Scene Graph Parsing with Global Context
- Principles of Visual Tokens for Efficient Video Understanding from Laura Sevilla-Lara
- Rethinking Spatiotemporal Feature Learning Speed-Accuracy Trade-offs in Video Classification
- SPECTRUM Semantic Processing and Emotion-informed video-Captioning Through Retrieval and Understanding Modalities
- STAR A Benchmark for Situated Reasoning in Real-World Videos
- Unified Video-Language Pre-training with Synchronized Audio - VLSA
- Unsupervised Visual Representation Learning by Context Prediction
- Vid2Seq Large-Scale Pretraining of a Visual Language Model for Dense Video Captioning
- Video Swin Transformer
- Video-LLaVA Learning United Visual Representation by Alignment Before Projection
- VideoBERT A Joint Model for Video and Language Representation Learning
- VideoOFA Two-Stage Pre-Training for Video-to-Text Generation
- Vision-Language Integration in Multimodal Video Transformers (Partially) Aligns with the Brain
- Visualizing and Understanding Convolutional Networks
- VideoLlamas
- [[∞-Video A Training-Free Approach to Long Video Understanding via Continuous-Time Memory Consolidation]]
Dataset papers are under Vision Datasets
Evaluation
- VCR: From Recognition to Cognition Visual Commonsense Reasoning
- VCR Leaderboard: https://visualcommonsense.com/leaderboard/
- SODA: SODA Story Oriented Dense Video Captioning Evaluation Framework
- The Second Perception Test Challenge - ECCV Workshop 2024
- STAR: STAR A Benchmark for Situated Reasoning in Real-World Videos
- Artificial Analysis Video Generation Arena Leaderboard
- MSVD [Chen and Dolan, 2011]
- MSR-VTT [Xu et al., 2016]
- TGIF [Li et al., 2016]
- TVQA: TVQA Localized, Compositional Video Question Answering
- MVBench: MVBench A Comprehensive Multi-modal Video Understanding Benchmark
- Charades-STA - moment retrieval benchmark
- QVHighlights - moment retrieval benchmark; "challenging long-video multi-moment benchmark"
- ActivityNet - moment retrieval benchmark
- EPIC-Kitchens [26]
- LSMDC-FiB [96]
- MSR-VTT QA [120]
- DeVAn: DeVAn Dense Video Annotation for Video-Language Models
Surveys
Tasks
In the following, reference numbers are from HowTo100M Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips:
- text-to-video retrieval [25, 32, 54, 55, 63]
- text-based action or event localization [15]
- video captioning [36, 61]
- video question answering [51, 63]
- video summarization with natural language [38]
Implementation
torchvision - PyTorch
- Models and pre-trained weights
- Video classification models are available, with or without pre-trained weights (see the loading sketch after this list)
- Datasets
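A minimal sketch of loading one of these pre-trained torchvision video classification models, assuming torchvision >= 0.13 (the multi-weight API); the choice of `r3d_18` and the dummy clip shape are illustrative, not prescribed by the docs above:

```python
import torch
from torchvision.models.video import r3d_18, R3D_18_Weights

# Kinetics-400 pre-trained weights plus their matching preprocessing transform
weights = R3D_18_Weights.DEFAULT
model = r3d_18(weights=weights).eval()
preprocess = weights.transforms()

# Dummy clip in (B, T, C, H, W); the transform resizes, crops, normalizes,
# and permutes to the (B, C, T, H, W) layout the model expects.
clip = torch.rand(1, 16, 3, 128, 171)
with torch.no_grad():
    logits = model(preprocess(clip))

# Map the top logit back to a Kinetics-400 class name
print(weights.meta["categories"][logits.argmax().item()])
```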
Resources
- Stanford CS231n: Deep Learning for Computer Vision
- Math Behind CNNs for Image Processing - Svitla Systems
- Video Understanding - OpenAI Cookbook entry from Nov 6, 2023: Processing and narrating a video with GPT's visual capabilities and the TTS API (see the sketch below)
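A hedged sketch of the frame-sampling half of that cookbook workflow: read frames with OpenCV, base64-encode them, and send a subsample to a vision-capable chat model. The model name `gpt-4o`, the file name `input.mp4`, and the every-50th-frame subsampling are assumptions for illustration, not the cookbook's exact choices.

```python
import base64
import cv2
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Extract all frames and base64-encode them as JPEGs
frames = []
video = cv2.VideoCapture("input.mp4")  # assumed input path
while True:
    ok, frame = video.read()
    if not ok:
        break
    _, buf = cv2.imencode(".jpg", frame)
    frames.append(base64.b64encode(buf).decode("utf-8"))
video.release()

# Ask a vision-capable model to narrate a subsample of the frames
response = client.chat.completions.create(
    model="gpt-4o",  # assumed vision-capable model
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Narrate what happens in these video frames."},
            *[{"type": "image_url",
               "image_url": {"url": f"data:image/jpeg;base64,{f}"}}
              for f in frames[0::50]],  # every 50th frame, an arbitrary rate
        ],
    }],
)
print(response.choices[0].message.content)
```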