Title: VideoOFA: Two-Stage Pre-Training for Video-to-Text Generation
Authors: Xilun Chen, Lili Yu, Wenhan Xiong, Barlas Oğuz, Yashar Mehdad, Wen-tau Yih
Published: 4th May 2023 (Thursday) @ 23:27:21
Link: http://arxiv.org/abs/2305.03204v1
Abstract
We propose a new two-stage pre-training framework for video-to-text generation tasks such as video captioning and video question answering: A generative encoder-decoder model is first jointly pre-trained on massive image-text data to learn fundamental vision-language concepts, and then adapted to video data in an intermediate video-text pre-training stage to learn video-specific skills such as spatio-temporal reasoning. As a result, our VideoOFA model achieves new state-of-the-art performance on four Video Captioning benchmarks, beating prior art by an average of 9.7 points in CIDEr score. It also outperforms existing models on two open-ended Video Question Answering datasets, showcasing its generalization capability as a universal video-to-text model.
VideoOFA is an early-fusion VLM for video-to-text generation. Many earlier video VLMs either lack the ability to generate text, or combine a video encoder with a separately trained text decoder, leading to suboptimal accuracy. In contrast, VideoOFA proposes a two-stage pre-training framework to adapt a single generative image-text VLM to video-text tasks. In particular, VideoOFA initializes from an image-text VLM that is capable of text generation and has been jointly pre-trained on massive image-text data to learn fundamental vision-language representations. It then introduces an intermediate video-text pre-training stage to adapt the backbone VLM to video-text tasks and learn video-specific concepts such as temporal reasoning. This intermediate stage consists of three training objectives, all reformulated as video-to-text generation tasks: Video Captioning, Video-Text Matching, and Frame Order Modeling. Evaluated on several Video Captioning and Video Question Answering benchmarks, VideoOFA shows improved performance over previous models.
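To make the intermediate stage concrete, below is a minimal Python sketch of how the three objectives could all be cast as video-to-text generation over a shared encoder-decoder backbone. The prompt wording, the "yes"/"no" targets for matching, the index-string target for frame ordering, and the `model`/`optimizer` interfaces are illustrative assumptions, not the paper's exact formulation.

```python
import random

def make_captioning_example(frames, caption):
    # Video Captioning: generate the ground-truth caption from the video frames.
    return frames, "what does the video describe?", caption

def make_matching_example(frames, caption, negative_captions):
    # Video-Text Matching, cast as generation: answer "yes"/"no" to whether
    # the given text describes the video (negatives drawn from other videos).
    if random.random() < 0.5:
        return frames, f'does the text "{caption}" describe the video?', "yes"
    negative = random.choice(negative_captions)
    return frames, f'does the text "{negative}" describe the video?', "no"

def make_frame_order_example(frames):
    # Frame Order Modeling, cast as generation: given shuffled frames,
    # generate the original frame indices in their shuffled order.
    order = list(range(len(frames)))
    random.shuffle(order)
    shuffled = [frames[i] for i in order]
    return shuffled, "what is the original order of the frames?", " ".join(map(str, order))

def intermediate_pretraining_step(model, optimizer, frames, caption, negative_captions):
    # One step of the intermediate stage: sum the seq2seq losses of the three
    # objectives and update the shared generative encoder-decoder backbone.
    # `model(video, prompt, target)` is assumed to return a scalar LM loss.
    batch = [
        make_captioning_example(frames, caption),
        make_matching_example(frames, caption, negative_captions),
        make_frame_order_example(frames),
    ]
    loss = sum(model(video, prompt, target) for video, prompt, target in batch)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return float(loss)
```

Reformulating all three objectives as text generation keeps a single decoder and loss function throughout, so no task-specific heads need to be added or discarded between pre-training and downstream video-to-text tasks.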
In practice, VideoOFA uses the OFA model [Wang et al., 2022] as its image-text backbone.
— Summary from An Introduction to Vision-Language Modeling