Title: xGen-MM (BLIP-3): A Family of Open Large Multimodal Models
Authors: Le Xue, Manli Shu, Anas Awadalla, Jun Wang, An Yan, Senthil Purushwalkam, Honglu Zhou, Viraj Prabhu, Yutong Dai, Michael S Ryoo, Shrikant Kendre, Jieyu Zhang, Can Qin, Shu Zhang, Chia-Chih Chen, Ning Yu, Juntao Tan, Tulika Manoj Awalgaonkar, Shelby Heinecke, Huan Wang, Yejin Choi, Ludwig Schmidt, Zeyuan Chen, Silvio Savarese, Juan Carlos Niebles, Caiming Xiong, Ran Xu
Published: 16th August 2024 (Friday) @ 17:57:01
Link: http://arxiv.org/abs/2408.08872v2
Abstract
This report introduces xGen-MM (also known as BLIP-3), a framework for developing Large Multimodal Models (LMMs). The framework comprises meticulously curated datasets, a training recipe, model architectures, and a resulting suite of LMMs. xGen-MM, short for xGen-MultiModal, expands the Salesforce xGen initiative on foundation AI models. Our models undergo rigorous evaluation across a range of tasks, including both single and multi-image benchmarks. Our pre-trained base model exhibits strong in-context learning capabilities and the instruction-tuned model demonstrates competitive performance among open-source LMMs with similar model sizes. In addition, we introduce a safety-tuned model with DPO, aiming to mitigate harmful behaviors such as hallucinations and improve safety. We open-source our models, curated large-scale datasets, and our fine-tuning codebase to facilitate further advancements in LMM research. Associated resources will be available on our project page above.
- uses a vision token sampler: the Perceiver resampler from Flamingo ("Flamingo: a Visual Language Model for Few-Shot Learning")
- Decoder-only multimodal model:
- approach connects pre-trained language models to visual inputs using lightweight connectors
- Decoder-only models they cite: MM1, VILA, LLaVA, phi3-vision, and Otter
- Big limitations of BLIP-2:
- uses intricate Q-Former to bridge vision and language modalities
- couples this with a suite of complex training objectives: ITM, ITC, and ITG losses
- → poses obstacles for larger-scale training
- → only works with a single image as input; interleaved multimodal input is a more typical use case (think of how you interact with GPT-4o now, and how such models will be used in the future)
- open source all the checkpoints, incl. DPO'd checkpoints, + fine-tuning code
- Dataset contributions:
- MINT-1T - a trillion-token scale interleaved dataset
- BLIP3-KALE - a knowledge-augmented high-quality dense captions dataset
- Two additional specialized datasets:
- BLIP3-OCR-200M - a large-scale dataset with dense OCR annotations
- BLIP3-GROUNDING-50M - a large-scale visual grounding dataset
- Model:
- ViT
- Perceiver resampler - downsamples embeddings
- LLM: phi3-mini
- Do patch sampling at high resolution and concat (1) those vision tokens with (2) tokens from a globally downsized image to convey global information
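The model bullets above can be sketched end-to-end: a Flamingo-style Perceiver resampler cross-attends a fixed set of learned latents to ViT patch embeddings, and the any-resolution trick concatenates resampled tokens from high-res crops with tokens from a downsized global view. A minimal PyTorch sketch, assuming illustrative class names, dimensions, and crop counts (not the released code):

```python
import torch
import torch.nn as nn

class PerceiverResampler(nn.Module):
    """Cross-attends a fixed set of learned latent queries to a variable
    number of ViT patch embeddings, downsampling them to num_latents
    vision tokens (Flamingo-style; all dimensions are illustrative)."""
    def __init__(self, dim=1024, num_latents=128, num_heads=8):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(num_latents, dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, patch_embeds):              # (B, num_patches, dim)
        q = self.latents.unsqueeze(0).expand(patch_embeds.size(0), -1, -1)
        x, _ = self.cross_attn(q, patch_embeds, patch_embeds)
        return x + self.ffn(x)                    # (B, num_latents, dim)

# Any-resolution strategy from the notes: resample high-resolution patch
# tokens and tokens from a globally downsized view, then concatenate.
resampler = PerceiverResampler()
local_patches = torch.randn(2, 4 * 729, 1024)    # e.g. 4 high-res crops
global_patches = torch.randn(2, 729, 1024)       # downsized full image
vision_tokens = torch.cat(
    [resampler(local_patches), resampler(global_patches)], dim=1
)                                                # (2, 256, 1024)
# These vision tokens are interleaved with text embeddings and fed to
# the decoder-only LLM (phi3-mini in xGen-MM).
```

The key design point is that the number of output tokens is fixed by the latents, so arbitrarily many high-resolution patches are compressed to a constant-length prefix for the LLM.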
OpenScholar TLDR (link)
Prompt
I'm trying to understand BLIP-3. Please outline the significant contributions and the differences against BLIP-2
OpenScholar
BLIP-3 is a significant advancement over BLIP-2: a framework for developing large multimodal models (LMMs) that leverages pre-trained vision and language transformer models. Here, we summarize the main contributions and differences between BLIP-3 and BLIP-2.
The primary contribution of BLIP-3 is its ability to scale up LMM training by utilizing an ensemble of multimodal interleaved datasets, curated caption datasets, and other publicly available datasets (Xue et al. 2024). This allows BLIP-3 to significantly outperform BLIP-2 in various multimodal tasks, such as question rephrasing, image restyling, and context reasoning (Chou et al. 2024). By leveraging an ensemble of datasets, BLIP-3 can capture a broader range of knowledge and improve its performance on more diverse tasks.
In terms of architecture, BLIP-3 is streamlined by replacing the Q-Former with a more scalable vision token sampler (perceiver resampler) and simplifying the training objectives to focus solely on the auto-regressive loss of text tokens in a multimodal context (Xue et al. 2024). This results in a more efficient and effective model that can handle larger-scale training. In contrast, BLIP-2 employs an intricate Q-Former architecture and a suite of complex training objectives, which can hinder larger-scale training and lead to poor performance in certain tasks (Xue et al. 2024).
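The streamlined objective described above (auto-regressive loss on text tokens only, in a multimodal context) can be sketched in PyTorch; the `-100` ignore-index convention and tensor names are assumptions for illustration, not the paper's exact implementation:

```python
import torch
import torch.nn.functional as F

def multimodal_lm_loss(logits, labels, is_vision_token):
    """Next-token cross-entropy over an interleaved sequence, with
    positions whose target is a vision token excluded from the loss.

    logits:          (B, T, vocab) from the decoder-only LLM
    labels:          (B, T) token ids of the interleaved sequence
    is_vision_token: (B, T) bool mask marking vision-token positions
    """
    shift_logits = logits[:, :-1, :]       # predict token t+1 from prefix
    shift_labels = labels[:, 1:].clone()
    # Supervise text tokens only: mask out targets at vision positions.
    shift_labels[is_vision_token[:, 1:]] = -100
    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
        ignore_index=-100,
    )
```

Compared with BLIP-2's three objectives (ITM, ITC, and ITG), a single masked next-token loss like this needs no contrastive pairing or matching heads, which is what makes it easier to scale.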
Another key difference between BLIP-3 and BLIP-2 is the use of a more advanced visual encoder in BLIP-3. Unlike BLIP-2, which uses CLIP-ViT-L/14-336 as its visual encoder, BLIP-3 employs a more powerful visual encoder, ViT-H/14-378 (Hou et al. 2024). This allows BLIP-3 to capture more accurate and detailed visual features, leading to improved performance in tasks such as image restyling and context reasoning.
In addition, BLIP-3 has been shown to be more consistent in its responses compared to BLIP-2 (Chou et al. 2024). This is evident in the results of the question rephrasing task, where BLIP-3 achieves the highest accuracy among the evaluated models, and in the image restyling task, where BLIP-3 outperforms BLIP-2 in consistency metrics.
Overall, the significant contributions of BLIP-3 include its ability to scale up LMM training, its streamlined architecture, the use of a more advanced visual encoder, and its improved consistency in responses (Hou et al. 2024). These advancements make BLIP-3 a more effective and efficient model compared to BLIP-2, and it has the potential to be applied to a wide range of multimodal tasks (Chou et al. 2024).
References
- Le Xue, Manli Shu, Anas Awadalla, Jun Wang, An Yan, Senthil Purushwalkam et al. xGen-MM (BLIP-3): A Family of Open Large Multimodal Models. 2024. arXiv.org.
- Shih-Han Chou, Shivam Chandhok, James J. Little, Leonid Sigal. MM-R3: On (In-)Consistency of Multi-modal Large Language Models (MLLMs). 2024. arXiv.org.
- Wenjun Hou, Yi Cheng, Kaishuai Xu, Yan Hu, Wenjie Li, Jiangming Liu. Memory-Augmented Multimodal LLMs for Surgical VQA via Self-Contained Inquiry. 2024. arXiv.org.