Llama-3.1 8B Carrot - Capx AI

Community Article · Published September 6, 2024

We are excited to release our latest model - Llama-3.1-Carrot, an 8-billion-parameter vision model based on SigLIP and Meta AI's Llama 3.1 8B.


The primary architecture comprises two components:

  1. Llama 3.1 8B Instruct: A large language model known for its strong performance in instruction-following tasks.
  2. SigLIP: A vision encoder that excels in creating rich visual representations.

The model weights are released under the Apache 2.0 license here: https://huggingface.co/Capx/Llama-3.1-Vision, with huge thanks to the BAAI team for their amazing work on Bunny.

Model Architecture

We build upon BAAI's Bunny repository. Our model's architecture can be broken down into three main components:

  1. Vision Encoder (SigLIP): Responsible for processing and encoding visual inputs into a high-dimensional feature space.
  2. Connector Module: A crucial component that bridges the gap between the vision encoder and the language model, allowing for effective multimodal reasoning.
  3. Language Model (Llama 3.1 8B Instruct): Handles text generation and understanding based on the encoded visual features and textual inputs.

[Architecture diagram] Source: https://arxiv.org/abs/2402.11530
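To make the data flow concrete, here is a minimal conceptual sketch, in PyTorch, of how the three components connect. The class and function names, and the 1152/4096 hidden sizes (typical of SigLIP-SO400M and Llama 3.1 8B), are illustrative assumptions rather than the actual implementation in the Bunny codebase.

```python
import torch
import torch.nn as nn


class CrossModalityProjector(nn.Module):
    """2-layer MLP that maps vision features into the LLM's text embedding space.
    Dimensions are illustrative; the real connector lives in the Bunny codebase."""

    def __init__(self, vision_dim: int = 1152, text_dim: int = 4096):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(vision_dim, text_dim),
            nn.GELU(),
            nn.Linear(text_dim, text_dim),
        )

    def forward(self, vision_features: torch.Tensor) -> torch.Tensor:
        # (batch, num_patches, vision_dim) -> (batch, num_patches, text_dim)
        return self.net(vision_features)


def multimodal_forward(vision_encoder, projector, llm, pixel_values, text_embeds):
    """Conceptual forward pass: encode the image with SigLIP, project patch features
    into the text embedding space, prepend them to the text embeddings, run the LLM."""
    vision_features = vision_encoder(pixel_values)           # SigLIP patch features
    image_tokens = projector(vision_features)                # aligned with text space
    fused = torch.cat([image_tokens, text_embeds], dim=1)    # image tokens first
    return llm(inputs_embeds=fused)
```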

We use LoRA for parameter-efficient training, which lets the complete model be trained on frugal resources.
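As a rough illustration, applying LoRA to the language backbone with Hugging Face's peft library might look like the following; the rank, alpha, dropout, and target modules below are illustrative placeholders, not our exact training configuration.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Illustrative values only -- not our exact hyperparameters.
lora_config = LoraConfig(
    r=16,                     # low-rank dimension
    lora_alpha=32,            # scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # Llama attention projections
    task_type="CAUSAL_LM",
)

llm = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3.1-8B-Instruct")
llm = get_peft_model(llm, lora_config)
llm.print_trainable_parameters()  # only the LoRA adapters receive gradients
```

With a config like this, `print_trainable_parameters()` typically reports only a fraction of a percent of the weights as trainable, which is what makes training on modest hardware feasible.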

Training Process

We employed a two-stage training approach:

  1. Pretraining Stage: In this stage, we align the visual embeddings from a pre-trained vision encoder with the text embeddings of the LLM. Visual embeddings are high-dimensional representations of visual information, extracted from images by a pre-trained vision encoder (in our case, SigLIP); the purpose of this stage is to map them into the high-dimensional space in which the LLM represents text. This is done via a cross-modality projector: a 2-layer MLP that projects image embeddings into the text embedding space (as in the sketch above).

  2. Visual Instruction Tuning: This stage trains the model on a diverse set of multimodal tasks, teaching it to follow instructions that involve both text and images. Here we rely on LoRA (Low-Rank Adaptation), a technique for efficiently fine-tuning large language models: it adds a small number of trainable parameters to each layer of the LLM, allowing task-specific adaptation without modifying all of the original weights.

The LLM backbone is fine-tuned using LoRA, which helps maintain the model's general knowledge while adapting to multimodal tasks. The vision encoder and cross-modality projector are fully tuned, allowing them to adapt more completely to the multimodal context.
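A sketch of how the trainable parameters might be split in this stage, reusing a LoraConfig like the one shown earlier; the function name, attribute names, and learning rate are hypothetical, and the real module names in the released code will differ.

```python
import torch
from peft import get_peft_model


def configure_trainable_params(llm, vision_encoder, projector, lora_config, lr=2e-5):
    """Stage-2 setup sketch: LoRA adapters on the LLM backbone, full fine-tuning
    of the vision encoder and the cross-modality projector (illustrative lr)."""
    llm = get_peft_model(llm, lora_config)   # only the LoRA adapters train in the LLM
    for p in vision_encoder.parameters():    # SigLIP: fully tuned
        p.requires_grad = True
    for p in projector.parameters():         # 2-layer MLP connector: fully tuned
        p.requires_grad = True

    trainable = [p for m in (llm, vision_encoder, projector)
                 for p in m.parameters() if p.requires_grad]
    return llm, torch.optim.AdamW(trainable, lr=lr)
```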

During training, we monitored key metrics such as loss convergence and validation performance to ensure optimal model behavior.

The following loss was observed during the Visual Instruction Tuning Phase:

[Figure: training loss during the visual instruction tuning phase]

Compute Used

The entire model was trained on 8 A100 GPUs, each with 80 GB of VRAM. The complete training took approximately 40 hours.

Results and Performance

Our final model demonstrated impressive capabilities in various vision-language tasks:

  • Image Captioning: Generated detailed and contextually relevant captions for a wide range of images.
  • Visual Reasoning: Showed strong performance in tasks requiring complex reasoning about visual scenes.

Examples:

1. [Image: a cat by a pool]

USER: what is its expression?

ASSISTANT: The cat has a playful expression. It appears to be smiling and sticking its tongue out.

USER: what is in the background?

ASSISTANT: In the background, there is a pool with a white fence around it. There are also trees and a house nearby.

2. [Image: Mario, Luigi, and Yoshi figures]

USER: who are the characters in the image?

ASSISTANT: The characters in the image are Mario, Luigi, and Yoshi.

USER: are they real?

ASSISTANT: No, they are not real. They are plastic toys.
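For reference, here is a hedged inference sketch that could reproduce conversations like the ones above, assuming the released checkpoint keeps Bunny's remote-code interface (the `process_images` helper and the `-200` image-placeholder token); if the model card ships its own snippet, prefer that. The image filename is a hypothetical local path.

```python
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda"

# Load the released checkpoint; trust_remote_code pulls in the Bunny-style model class.
model = AutoModelForCausalLM.from_pretrained(
    "Capx/Llama-3.1-Vision",
    torch_dtype=torch.float16,
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained("Capx/Llama-3.1-Vision", trust_remote_code=True)

# Bunny-style prompt template with an <image> placeholder.
prompt = "what is its expression?"
text = (
    "A chat between a curious user and an artificial intelligence assistant. "
    "The assistant gives helpful, detailed, and polite answers to the user's questions. "
    f"USER: <image>\n{prompt} ASSISTANT:"
)
chunks = [tokenizer(chunk).input_ids for chunk in text.split("<image>")]
# -200 is the image-token placeholder used by the Bunny codebase (an assumption here).
input_ids = torch.tensor(chunks[0] + [-200] + chunks[1][1:], dtype=torch.long).unsqueeze(0).to(device)

# Preprocess the image with the helper exposed by the remote code.
image = Image.open("cat.png")  # hypothetical local file
image_tensor = model.process_images([image], model.config).to(dtype=model.dtype, device=device)

output_ids = model.generate(input_ids, images=image_tensor, max_new_tokens=100, use_cache=True)[0]
print(tokenizer.decode(output_ids[input_ids.shape[1]:], skip_special_tokens=True).strip())
```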

Conclusion

We hope the community leverages our learnings and ships cool stuff. The ideas, opportunities, and potential are endless! As we continue to refine and expand this model, we anticipate its application in various domains, from content moderation to advanced human-AI interaction systems. Thank you!

We thank the amazing team at BAAI for their Bunny project, upon which this work was built, and Meta AI for their Llama 3.1 model!