Dataset papers are listed under Vision Datasets.

Evaluation

Surveys

Tasks

The reference numbers below are those used in the paper "HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips":

  • text-to-video retrieval [25, 32, 54, 55, 63]
  • text-based action or event localization [15]
  • video captioning [36, 61]
  • video question answering [51, 63]
  • video summarization with natural language [38]

Implementation

torchvision - PyTorch

Resources 📚