🪴 Anil's Garden


LLaVA-OneVision: Easy Visual Task Transfer

19 Dec 2025 · 1 min read

paper

Title: LLaVA-OneVision: Easy Visual Task Transfer
Authors: Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, Chunyuan Li
Published: 6th August 2024 (Tuesday) @ 17:59:44
Link: http://arxiv.org/abs/2408.03326v3

Abstract

We present LLaVA-OneVision, a family of open large multimodal models (LMMs) developed by consolidating our insights into data, models, and visual representations in the LLaVA-NeXT blog series. Our experimental results demonstrate that LLaVA-OneVision is the first single model that can simultaneously push the performance boundaries of open LMMs in three important computer vision scenarios: single-image, multi-image, and video scenarios. Importantly, the design of LLaVA-OneVision allows strong transfer learning across different modalities/scenarios, yielding new emerging capabilities. In particular, strong video understanding and cross-scenario capabilities are demonstrated through task transfer from images to videos.

