Title: xGen-MM (BLIP-3): A Family of Open Large Multimodal Models
Authors: Le Xue, Manli Shu, Anas Awadalla, Jun Wang, An Yan, Senthil Purushwalkam, Honglu Zhou, Viraj Prabhu, Yutong Dai, Michael S Ryoo, Shrikant Kendre, Jieyu Zhang, Can Qin, Shu Zhang, Chia-Chih Chen, Ning Yu, Juntao Tan, Tulika Manoj Awalgaonkar, Shelby Heinecke, Huan Wang, Yejin Choi, Ludwig Schmidt, Zeyuan Chen, Silvio Savarese, Juan Carlos Niebles, Caiming Xiong, Ran Xu
Published: 16th August 2024 (Friday) @ 17:57:01
Link: http://arxiv.org/abs/2408.08872v2

Abstract

This report introduces xGen-MM (also known as BLIP-3), a framework for developing Large Multimodal Models (LMMs). The framework comprises meticulously curated datasets, a training recipe, model architectures, and a resulting suite of LMMs. xGen-MM, short for xGen-MultiModal, expands the Salesforce xGen initiative on foundation AI models. Our models undergo rigorous evaluation across a range of tasks, including both single and multi-image benchmarks. Our pre-trained base model exhibits strong in-context learning capabilities and the instruction-tuned model demonstrates competitive performance among open-source LMMs with similar model sizes. In addition, we introduce a safety-tuned model with DPO, aiming to mitigate harmful behaviors such as hallucinations and improve safety. We open-source our models, curated large-scale datasets, and our fine-tuning codebase to facilitate further advancements in LMM research. Associated resources will be available on our project page above.
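The abstract mentions a safety-tuned model trained with DPO. As a minimal sketch of the per-pair DPO objective, assuming the standard formulation (the `beta` value and the log-probability inputs below are illustrative, not values from the paper):

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one preference pair: -log sigmoid(beta * margin), where the
    margin measures how much more the policy prefers the chosen response over
    the rejected one, relative to a frozen reference model. Inputs are the
    summed log-probabilities of each full response under each model."""
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# When policy and reference agree, the loss sits at log(2); it drops as the
# policy learns to prefer the chosen (e.g. non-hallucinated, safe) response.
neutral = dpo_loss(-5.0, -5.0, -5.0, -5.0)
improved = dpo_loss(-4.0, -6.0, -5.0, -5.0)
```

Note that only response log-probabilities are needed, so no separate reward model is trained, which is what makes DPO a lightweight choice for a safety-tuning stage.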


  • uses a vision token sampler (the Perceiver resampler from Flamingo: a Visual Language Model for Few-Shot Learning)
  • Decoder-only multimodal model:
    • approach connects pre-trained language models to visual inputs using lightweight connectors
    • Decoder-only models they cite: MM1, VILA, LLaVA, phi3-vision, and Otter
  • Big limitations of BLIP-2:
    • uses intricate Q-Former to bridge vision and language modalities
    • couples this with a suite of complex training objectives: ITM, ITC, and ITG losses
      👉 both pose obstacles for larger-scale training
    • only supports a single image as input, while interleaved multimodal input is the more typical use case (think of how you interact with GPT-4o now, and how you will with such models in the future)
  • open source all the checkpoints inc. DPO’d checkpoints + fine-tuning code
  • Dataset contributions:
    • MINT-1T - a trillion-token scale interleaved dataset
    • BLIP3-KALE - a knowledge-augmented high-quality dense captions dataset
  • Two additional specialized datasets here:
    • BLIP3-OCR-200M - a large-scale dataset with dense OCR annotations
    • BLIP3-GROUNDING-50M - a large-scale visual grounding dataset
  • Model:
    • ViT
    • Perceiver resampler - downsamples embeddings
    • LLM: phi3-mini
  • Does patch sampling at high resolution and concatenates (1) those patch vision tokens with (2) tokens from a globally downsized image to convey global information
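A shape-level sketch of the vision token path in the bullets above, with a single-head cross-attention standing in for the Perceiver resampler (the toy dimensions, token counts, and identity query/key/value projections are illustrative assumptions, not the paper's configuration):

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def resample(latents, tokens):
    """One cross-attention step: a fixed set of latent queries attends over a
    variable-length token sequence, returning exactly len(latents) tokens.
    Q/K/V projections are identity here to keep the sketch short."""
    d = len(latents[0])
    out = []
    for q in latents:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in tokens]
        probs = softmax(scores)
        out.append([sum(p * t[i] for p, t in zip(probs, tokens)) for i in range(d)])
    return out

d = 4                                                                  # toy embedding width
patch_tokens = [[0.1 * (i + j) for j in range(d)] for i in range(9)]   # high-res patch tokens
global_tokens = [[0.5 - 0.1 * j for j in range(d)] for _ in range(3)]  # downsized full image
vision_tokens = patch_tokens + global_tokens                           # concat local + global
latents = [[1.0 if i == j % d else 0.0 for i in range(d)] for j in range(5)]  # 5 learned queries
fixed_tokens = resample(latents, vision_tokens)   # always 5 tokens, whatever the input length
```

The output length is fixed by the number of latent queries, so the LLM sees a constant vision-token budget per image regardless of input resolution or patch count.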

Prompt

I’m trying to understand BLIP-3. Please outline the significant contributions and the differences against BLIP-2

OpenScholar

BLIP-3 is a significant advancement over BLIP-2; both are large multimodal models (LMMs) built by connecting pre-trained vision and language transformers. Here, we summarize the main contributions of BLIP-3 and its differences from BLIP-2.

The primary contribution of BLIP-3 is its ability to scale up LMM training by utilizing an ensemble of multimodal interleaved datasets, curated caption datasets, and other publicly available datasets (Xue et al. 2024). This allows BLIP-3 to significantly outperform BLIP-2 in various multimodal tasks, such as question rephrasing, image restyling, and context reasoning (Chou et al. 2024). By leveraging an ensemble of datasets, BLIP-3 can capture a broader range of knowledge and improve its performance on more diverse tasks.

In terms of architecture, BLIP-3 is streamlined by replacing the Q-Former with a more scalable vision token sampler (perceiver resampler) and simplifying the training objectives to focus solely on the auto-regressive loss of text tokens in a multimodal context (Xue et al. 2024). This results in a more efficient and effective model that can handle larger-scale training. In contrast, BLIP-2 employs an intricate Q-Former architecture and a suite of complex training objectives, which can hinder larger-scale training and lead to poor performance in certain tasks (Xue et al. 2024).
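The simplified objective can be sketched as a standard next-token cross-entropy in which only positions whose target is a text token contribute, with vision-token positions masked out (the toy logits and three-word vocabulary are illustrative):

```python
import math

def text_only_ar_loss(logits, targets, text_mask):
    """Mean next-token cross-entropy over text positions only.
    logits: per-position vocabulary scores; targets: gold next-token ids;
    text_mask: True where the target is a text token (False at vision tokens)."""
    total, count = 0.0, 0
    for pos_logits, tgt, is_text in zip(logits, targets, text_mask):
        if not is_text:
            continue  # vision-token positions carry no loss
        m = max(pos_logits)
        log_z = m + math.log(sum(math.exp(x - m) for x in pos_logits))
        total += log_z - pos_logits[tgt]
        count += 1
    return total / count

logits = [[8.0, 0.0, 0.0], [0.0, 8.0, 0.0], [0.0, 0.0, 0.0]]
targets = [0, 1, 2]
mask = [True, True, False]  # last position targets a vision token
loss = text_only_ar_loss(logits, targets, mask)
```

Compared with BLIP-2's three coupled objectives (ITM, ITC, ITG), everything here reduces to a single scalar loss, which is what makes the recipe easy to scale.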

Another key difference between BLIP-3 and BLIP-2 is the use of a more advanced visual encoder in BLIP-3. Unlike BLIP-2, which uses CLIP-ViT-L/14-336 as its visual encoder, BLIP-3 employs a more powerful visual encoder, ViT-H/14-378 (Hou et al. 2024). This allows BLIP-3 to capture more accurate and detailed visual features, leading to improved performance in tasks such as image restyling and context reasoning.

In addition, BLIP-3 has been shown to be more consistent in its responses compared to BLIP-2 (Chou et al. 2024). This is evident in the results of the question rephrasing task, where BLIP-3 achieves the highest accuracy among the evaluated models, and in the image restyling task, where BLIP-3 outperforms BLIP-2 in consistency metrics.

Overall, the significant contributions of BLIP-3 include its ability to scale up LMM training, its streamlined architecture, the use of a more advanced visual encoder, and its improved consistency in responses (Hou et al. 2024). These advancements make BLIP-3 a more effective and efficient model compared to BLIP-2, and it has the potential to be applied to a wide range of multimodal tasks (Chou et al. 2024).
