Title: What matters when building vision-language models?
Authors: Hugo Laurençon, Léo Tronchon, Matthieu Cord, Victor Sanh
Published: 3rd May 2024 (Friday) @ 17:00:00
Link: http://arxiv.org/abs/2405.02246v1

Abstract

The growing interest in vision-language models (VLMs) has been driven by improvements in large language models and vision transformers. Despite the abundance of literature on this subject, we observe that critical decisions regarding the design of VLMs are often not justified. We argue that these unsupported decisions impede progress in the field by making it difficult to identify which choices improve model performance. To address this issue, we conduct extensive experiments around pre-trained models, architecture choice, data, and training methods. Our consolidation of findings includes the development of Idefics2, an efficient foundational VLM of 8 billion parameters. Idefics2 achieves state-of-the-art performance within its size category across various multimodal benchmarks, and is often on par with models four times its size. We release the model (base, instructed, and chat) along with the datasets created for its training.


Idefics2

NB: This paper is also tagged with dataset because it comprises the release of The Cauldron. It is a model paper introducing Idefics2.


Trained on The Cauldron - see the dataset card at HuggingFaceM4/the_cauldron on Hugging Face.
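
As a minimal sketch (not from the paper itself), this is roughly how one might pull the released artifacts from the Hugging Face Hub. The repo IDs "HuggingFaceM4/the_cauldron" and "HuggingFaceM4/idefics2-8b" are the public ones; the subset name "ai2d" and the column names are assumptions based on the dataset card and may differ.

from datasets import load_dataset
from transformers import AutoProcessor, AutoModelForVision2Seq

# The Cauldron is organised as many named sub-datasets; each is loaded by config name.
# "ai2d" is just one example subset (assumed here for illustration).
cauldron_subset = load_dataset("HuggingFaceM4/the_cauldron", "ai2d", split="train")
example = cauldron_subset[0]
print(example.keys())  # expected columns are roughly "images" and "texts" (assumption)

# The base, instructed, and chat checkpoints share the same processor and architecture.
processor = AutoProcessor.from_pretrained("HuggingFaceM4/idefics2-8b")
model = AutoModelForVision2Seq.from_pretrained("HuggingFaceM4/idefics2-8b")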