[Interactive demo example: an image annotated with detected objects (seven people, a tie, bottles, cups, a knife, a spoon, a sandwich, chairs, and a dining table), the question "Why is [person4] pointing at [person1]?", and the rationale prompt "I think so because".]

Visual Commonsense Reasoning (VCR) is a new task and large-scale dataset for cognition-level visual understanding.

With one glance at an image, we can effortlessly imagine the world beyond the pixels (e.g. that [person1] ordered pancakes). While this task is easy for humans, it is tremendously difficult for today’s vision systems, requiring higher-order cognition and commonsense reasoning about the world. We formalize this task as Visual Commonsense Reasoning. In addition to answering challenging visual questions expressed in natural language, a model must provide a rationale explaining why its answer is true.
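To make the question → answer → rationale setup concrete, here is a minimal, illustrative sketch of one multiple-choice example in Python, built around the demo question above. The field names and the candidate texts are hypothetical placeholders chosen for readability, not the official dataset schema; in the released data each question comes with four answer choices and four rationale choices, exactly one of each being correct.

# A minimal, illustrative sketch of a single VCR example in the
# question -> answer -> rationale (Q -> A -> R) format.
# Field names and candidate texts are hypothetical placeholders,
# not the official distribution format.
example = {
    # Detected objects are referenced in the text by tags such as [person1].
    "objects": ["person", "person", "person", "person", "cup", "diningtable"],
    "question": "Why is [person4] pointing at [person1]?",
    # Four answer candidates; exactly one is correct.
    "answer_choices": [
        "He is telling [person3] that [person1] ordered the pancakes.",
        "<distractor 1>", "<distractor 2>", "<distractor 3>",
    ],
    "answer_label": 0,
    # Four rationale candidates justifying the answer; exactly one is correct.
    "rationale_choices": [
        "[person3] is bringing food to the table and may not know whose order is whose.",
        "<distractor 1>", "<distractor 2>", "<distractor 3>",
    ],
    "rationale_label": 0,
}

A model first picks an answer to the question (Q → A), and then, given the question and its chosen answer, picks the rationale explaining why that answer is true (QA → R).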

Overview of VCR

  • 290k multiple choice questions
  • 290k correct answers and rationales: one per question
  • 110k images
  • Counterfactual choices obtained with minimal bias, via our new Adversarial Matching approach (see the sketch after this list)
  • Answers are 7.5 words long on average; rationales are 16 words long on average.
  • High human agreement (>90%)
  • Scaffolded on top of 80 object categories from COCO
  • Questions are highly diverse and challenging: browse and see for yourself!
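As a rough illustration of how Adversarial Matching pairs each question with hard but minimally biased counterfactual choices, the sketch below draws every question's distractor from another question's correct response, chosen to be highly relevant to the question yet dissimilar from the question's own correct response, via a maximum-weight bipartite matching. This is not the exact procedure from the paper: the relevance and similarity scorers here are toy word-overlap placeholders (the paper uses learned models), the trade-off weight lam is illustrative, and only one distractor per question is assigned rather than the three used in the dataset.

# Sketch of Adversarial Matching: assign each question one distractor that is
# relevant to the question but dissimilar from its own correct answer.
import numpy as np
from scipy.optimize import linear_sum_assignment

def relevance(question, response):
    # Toy placeholder in (0, 1): word overlap between question and response.
    # A real system would use a trained relevance model.
    q, r = set(question.lower().split()), set(response.lower().split())
    return (len(q & r) + 1) / (len(q | r) + 2)

def similarity(response_a, response_b):
    # Toy placeholder in (0, 1): word overlap between two responses.
    # A real system would use a trained similarity/paraphrase model.
    a, b = set(response_a.lower().split()), set(response_b.lower().split())
    return (len(a & b) + 1) / (len(a | b) + 2)

def adversarial_match(questions, correct_answers, lam=0.5):
    """Pick one distractor per question: relevant to the question, yet
    dissimilar from that question's own correct answer (lam trades the two)."""
    n = len(questions)
    score = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            if i == j:
                score[i, j] = -1e9  # never reuse a question's own answer
            else:
                score[i, j] = (
                    np.log(relevance(questions[i], correct_answers[j]))
                    + lam * np.log(1.0 - similarity(correct_answers[i],
                                                    correct_answers[j]))
                )
    rows, cols = linear_sum_assignment(score, maximize=True)
    return {int(i): correct_answers[j] for i, j in zip(rows, cols)}

The same matching idea applies to both answer candidates (conditioned on the question) and rationale candidates (conditioned on the question and answer), which is how the counterfactual choices above are produced without the annotation artifacts that hand-written distractors tend to introduce.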

From Recognition to Cognition: Visual Commonsense Reasoning

If the paper inspires you, please cite us:

@inproceedings{zellers2019vcr,
  author = {Zellers, Rowan and Bisk, Yonatan and Farhadi, Ali and Choi, Yejin},
  title = {From Recognition to Cognition: Visual Commonsense Reasoning},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}

Authors

VCR is an effort between researchers at the University of Washington and AI2, along with a group of fantastic crowd workers who annotated the data. We’re also grateful to our sponsors.