Title: Visual Instruction Tuning
Authors: Haotian Liu, Chunyuan Li, Qingyang Wu, Yong Jae Lee
Published: 17th April 2023 (Monday) @ 17:59:25
Link: http://arxiv.org/abs/2304.08485v2
Abstract
Instruction tuning large language models (LLMs) using machine-generated instruction-following data has improved zero-shot capabilities on new tasks, but the idea is less explored in the multimodal field. In this paper, we present the first attempt to use language-only GPT-4 to generate multimodal language-image instruction-following data. By instruction tuning on such generated data, we introduce LLaVA: Large Language and Vision Assistant, an end-to-end trained large multimodal model that connects a vision encoder and LLM for general-purpose visual and language understanding. Our early experiments show that LLaVA demonstrates impressive multimodal chat abilities, sometimes exhibiting the behaviors of multimodal GPT-4 on unseen images/instructions, and yields an 85.1% relative score compared with GPT-4 on a synthetic multimodal instruction-following dataset. When fine-tuned on Science QA, the synergy of LLaVA and GPT-4 achieves a new state-of-the-art accuracy of 92.53%. We make GPT-4 generated visual instruction tuning data, our model and code base publicly available.
Notes
Contributions:
- Instruction-following (instruction fine-tuning) dataset that they create and release by converting image-text pair data into multimodal (image-text) instruction-following data using ChatGPT or GPT-4 (see the sketch after this list)
- Combine the vision encoder from CLIP (Learning Transferable Visual Models From Natural Language Supervision) with the decoder from Vicuna (An Open-Source Chatbot Impressing GPT-4 with 90% ChatGPT Quality, LMSYS Org) and fine-tune this on the dataset they create
- Benchmarks:
- LLaVA-Bench (COCO)
- LLaVA-Bench (In-the-Wild)
- Open-source code, checkpoints and dataset + a visual chat demo
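A minimal sketch of the data-generation idea from Sec. 3: an image is represented symbolically (captions plus bounding boxes) and handed to a language-only GPT-4 to produce one of the three instruction-following data types. The prompt wording and helper names below are illustrative assumptions, not the authors' actual prompts, and the GPT-4 call itself is left out.

```python
# Sketch only: pack caption/box annotations into a text-only prompt so a
# language-only LLM can generate instruction-following data about the image.
from textwrap import dedent

def build_context(captions, boxes):
    """Serialize an image symbolically: captions plus labeled object boxes."""
    caption_block = "\n".join(captions)
    box_block = "\n".join(
        f"{label}: [{x1:.3f}, {y1:.3f}, {x2:.3f}, {y2:.3f}]"
        for label, (x1, y1, x2, y2) in boxes
    )
    return f"Captions:\n{caption_block}\n\nObjects:\n{box_block}"

def make_instruction_prompt(captions, boxes, task="conversation"):
    """Build a prompt asking the text-only model for one of the three data
    types used in the paper: conversation, detailed description, or complex
    reasoning. The wording here is a placeholder, not the paper's prompt."""
    system = dedent("""\
        You are an AI visual assistant. You cannot see the image, but you are
        given captions and object locations describing it. Generate a {task}
        grounded in this information, as if you could see the image.
    """).format(task=task)
    return {"system": system, "user": build_context(captions, boxes)}

prompt = make_instruction_prompt(
    captions=["A man rides a horse on a beach at sunset."],
    boxes=[("person", (0.32, 0.21, 0.55, 0.78)),
           ("horse", (0.25, 0.35, 0.70, 0.95))],
    task="complex reasoning",
)
print(prompt["system"])
print(prompt["user"])
```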
Notable LMMs trained on image-text pairs:
Other LMMs trained on image-text pairs include BLIP-2 [28], FROMAGe [24], and KOSMOS-1 [20]. PaLM-E [13] is an LMM for embodied AI. Based on the recent "best" open-source LLM LLaMA, OpenFlamingo [5] and LLaMA-Adapter [59] enable LLaMA to use image inputs, paving the way to build open-source multimodal LLMs.
Method
- Language backbone: Vicuna
- Vision backbone: pre-trained CLIP vision encoder (ViT-L/14), which produces the visual features for a given input image
- Alignment: visual features are mapped into the word embedding space via a very simple, trainable linear (matrix) projection (see the sketch after this list):
- They leave the alternative methods for other people to do as future work:
- Q-Former from BLIP-2 Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
- "gated cross-attention" from Flamingo: a Visual Language Model for Few-Shot Learning
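A minimal PyTorch sketch of the alignment step, assuming ViT-L/14 patch features of size 1024 and a Vicuna-13B embedding size of 5120 (illustrative values; the module name is made up, not taken from the released code):

```python
import torch
import torch.nn as nn

class VisionLanguageConnector(nn.Module):
    """Single trainable linear projection W that maps frozen CLIP visual
    features Z_v into the LLM word-embedding space: H_v = W Z_v."""

    def __init__(self, vision_dim: int = 1024, llm_dim: int = 5120):
        super().__init__()
        self.proj = nn.Linear(vision_dim, llm_dim)

    def forward(self, visual_feats: torch.Tensor) -> torch.Tensor:
        # visual_feats: (batch, num_patches, vision_dim) from the frozen
        # CLIP vision encoder; output lives in the LLM embedding space.
        return self.proj(visual_feats)

connector = VisionLanguageConnector()
z_v = torch.randn(1, 256, 1024)            # stand-in CLIP patch features
h_v = connector(z_v)                       # (1, 256, 5120) visual "tokens"
text_emb = torch.randn(1, 32, 5120)        # token embeddings of the instruction
llm_input = torch.cat([h_v, text_emb], dim=1)  # prepend visual tokens
print(llm_input.shape)
```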
The point of this paper is instruction following. Quote from §5.1 Experiments: Multimodal Chatbot:
Surprisingly, although LLaVA is trained with a small multimodal instruction-following dataset (~80K unique images), it demonstrates quite similar reasoning results with multimodal GPT-4 on these examples. Note that while these images are out-of-domain for LLaVA, LLaVA is still able to understand the scenes and follow the question instruction to provide a reasonable response. In contrast, BLIP-2 and OpenFlamingo focus on describing the image, instead of following the user instruction to answer in an appropriate manner.
Benchmarks (new; introduced in this paper)
LLaVA-Bench (COCO). We randomly select 30 images from COCO-Val-2014, and for each image, we generate three types of questions (conversation, detailed description, complex reasoning) using the proposed data generation pipeline in Sec. 3, totaling 90 questions. This benchmark studies the model's alignment behavior and capabilities with consistent visual inputs. We vary the training datasets to study the effectiveness of different types of instruction-following data, and show the results in Table 4. First, with instruction tuning, the model's ability to follow user instructions improves significantly, by over 50 points. Second, adding a small amount of detailed-description and complex-reasoning questions contributes to a considerable improvement of the model's overall capability, by 7 points. Furthermore, it also improves the model's performance on conversational questions, suggesting that improvements in reasoning capabilities complement conversational abilities. Finally, we show that having all three types of data yields the best performance at 85.1%.
LLaVA-Bench (In-the-Wild). To evaluate the model's capability in more challenging tasks and generalizability to novel domains, we collect a diverse set of 24 images with 60 questions in total, including indoor and outdoor scenes, memes, paintings, sketches, etc., and associate each image with a highly-detailed and manually-curated description and a proper selection of questions. We compare LLaVA, BLIP-2, and OpenFlamingo in Table 5. Thanks to visual instruction tuning, LLaVA achieves significantly better performance compared with BLIP-2 (+29%) and OpenFlamingo (+48%).
Compared to the text-only GPT-4 that has access to ground-truth labels, LLaVA achieves an impressive 81.7% performance on complex reasoning questions, with an overall score of 67.3%.
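The scores above are relative scores: a GPT-4 judge rates both the candidate answer and a text-only GPT-4 reference answer (which sees a ground-truth textual description of the image instead of the image itself), and the candidate's total rating is reported as a percentage of the reference's. A small sketch of that bookkeeping, assuming the per-question judge ratings have already been collected:

```python
def relative_score(candidate_ratings, reference_ratings):
    """Relative score on LLaVA-Bench: the judge assigns each answer a rating,
    and the candidate's total is reported as a percentage of the text-only
    GPT-4 reference's total. Ratings are assumed to come from a separate
    GPT-4 judging step."""
    assert len(candidate_ratings) == len(reference_ratings)
    return 100.0 * sum(candidate_ratings) / sum(reference_ratings)

# Toy example: three questions, judge rated candidate vs. reference answers.
print(relative_score([7, 8, 6], [8, 9, 8]))  # -> 84.0
```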
Results
Table 3: Example prompt from GPT-4 paper [36] to compare visual reasoning and chat capabilities. Compared to BLIP-2 [28] and OpenFlamingo [5], LLaVA accurately follows the user's instructions, instead of simply describing the scene. LLaVA offers a more comprehensive response than GPT-4. Even when merely asked to describe the image, LLaVA identifies atypical aspects of the image.
ScienceQA Results
For LLaVA, we use the visual features before the last layer, ask the model to first predict reasons and then the answer, and train it for 12 epochs. It yields 90.92% accuracy, which is quite close to the SoTA 91.68%.
To explore the limit of LLMs, we also prompt GPT-4 using 2-shot in-context-learning and achieve 82.69% accuracy, which is a 7.52% absolute gain compared with 75.17% from GPT-3.5. For a substantial number of questions, we note that GPT-4 fails simply because it reports that there is insufficient context such as images or plots. We consider two schemes to combine the outcomes from our model and GPT-4:
- A GPT-4 complement. Whenever GPT-4 fails to provide answers, we use the prediction from our method. This scheme yields 90.97% accuracy, which is almost the same as applying our method alone.
- GPT-4 as the judge. Whenever GPT-4 and LLaVA produce different answers, we prompt GPT-4 again, asking it to provide its own final answer based on the question and two outcomes.
- The spirit is similar to CoT, but with external knowledge from the other model.
- Surprisingly, this scheme is able to provide consistent improvement over all question classes, and achieves a new SoTA accuracy of 92.53%.
- Interestingly, the text-only GPT-4, which cannot process images, improves the model's overall performance on questions that have an image as context, because some of these questions do not actually require the image context for a correct answer. The GPT-4 judge can identify such cases and correct some of the errors that LLaVA makes. See the example in the Appendix.
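A small sketch of the two combination schemes, assuming per-question answer strings from LLaVA and text-only GPT-4 are already available; the "insufficient context" check and the `ask_judge` callback are stand-ins for however the failure detection and the second GPT-4 call are actually implemented:

```python
def gpt4_complement(llava_answers, gpt4_answers):
    """Scheme 1: fall back to LLaVA whenever GPT-4 declines to answer
    (e.g. reports the image/plot context is missing)."""
    combined = []
    for llava_ans, gpt4_ans in zip(llava_answers, gpt4_answers):
        failed = gpt4_ans is None or "insufficient" in gpt4_ans.lower()
        combined.append(llava_ans if failed else gpt4_ans)
    return combined

def gpt4_as_judge(llava_answers, gpt4_answers, ask_judge):
    """Scheme 2: when the two models disagree, ask GPT-4 again to arbitrate.
    ask_judge(question_idx, a, b) is a placeholder for the second GPT-4 call
    that returns its final choice."""
    combined = []
    for i, (a, b) in enumerate(zip(llava_answers, gpt4_answers)):
        combined.append(a if a == b else ask_judge(i, a, b))
    return combined

# Toy usage with three questions; the judge here trivially picks LLaVA.
llava = ["A", "B", "C"]
gpt4 = [None, "B", "D"]
print(gpt4_complement(llava, gpt4))                   # ['A', 'B', 'D']
print(gpt4_as_judge(llava, gpt4, lambda i, a, b: a))  # ['A', 'B', 'C']
```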
Ablations
We ablate several design choices on ScienceQA in Table 8.
- Visual features. We tried using the last-layer feature from the CLIP vision encoder, which yields 89.96% and is 0.96% lower than the feature before the last layer (extraction sketched after this list). We hypothesize that this is because CLIP's last-layer features may focus more on global and abstract image properties, whereas the layer before it focuses more on localized properties that are useful for understanding specific image details.
- Chain-of-thought. To decide the order between the answer and the reasoning process in the model prediction, we run both variants and observe that answer-first reports the best number, 89.77% accuracy, in 12 epochs, while reasoning-first quickly reaches 89.77% accuracy in 6 epochs but shows no further improvement with more training. Training the model for 24 epochs does not improve performance. We conclude that a CoT-like reasoning-first strategy can largely improve convergence, but contributes relatively little to the final performance.
- Pre-training. We skip pre-training and directly train on ScienceQA from scratch; performance drops to 85.81% accuracy. The 5.11% absolute degradation indicates the importance of our pre-training stage in aligning multimodal features while preserving the vast pre-trained knowledge.
- Model size. We keep all configurations the same as our best 13B model, and train a 7B model. This yields 89.84% accuracy, which is 1.08% lower than 90.92%, demonstrating the importance of model scale.
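For the visual-features ablation, one possible way (not necessarily the authors' code) to grab the hidden states before the last CLIP layer is via Hugging Face transformers:

```python
import torch
from PIL import Image
from transformers import CLIPVisionModel, CLIPImageProcessor

# Frozen ViT-L/14 vision tower; this checkpoint name is one common choice,
# not necessarily the exact one used in the paper's codebase.
model = CLIPVisionModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-large-patch14")

image = Image.new("RGB", (224, 224))          # stand-in for a real image
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

last_layer = out.hidden_states[-1]    # features from the last layer
penultimate = out.hidden_states[-2]   # features *before* the last layer
patch_feats = penultimate[:, 1:]      # drop the [CLS] token, keep patches
print(patch_feats.shape)              # (1, 256, 1024)
```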
Fun Examples
Figure 6: An interesting emergent behavior of LLaVA is its ability to recognize Elon Musk both in a headshot and in a humorous meme where he is dressed as a doge. This implies that the pre-trained CLIP vision encoder may have seen images of Elon Musk. However, it is still surprising because Elon Musk never appears in the training data for either the visual feature alignment or visual instruction tuning stages of LLaVA, which indicates that the base language model generalizes to unseen visual concepts.