Extracting language model only

#17
by mariboo

I'm trying to get the language part out of the 90B model. If I do:

import torch
from transformers import MllamaForConditionalGeneration

model_id = "meta-llama/Llama-3.2-90B-Vision-Instruct"
model = MllamaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Count parameters in the language model part only
sum(p.numel() for p in model.language_model.parameters())

I get 87,666,865,192, but I was expecting around 70B parameters, since per the blog post Llama 3.1 was the starting point. What am I missing?

The language model has 20 cross-attention layers in addition to the 80 decoder-only layers (layers 3, 8, 13, ..., 98). To extract the language decoder only, you have to discard those 20 layers, and then you should get roughly 70B parameters.
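As a minimal sketch of that count, reusing the `model` loaded above. It assumes the text config exposes the cross-attention layer indices as `cross_attention_layers` and that the submodule names (`model.layers`, `embed_tokens`, `norm`, `lm_head`) match the transformers Mllama implementation; details may differ across versions:

# Indices of the 20 cross-attention layers (3, 8, 13, ..., 98).
# Assumes the text config exposes them as `cross_attention_layers`.
cross_attn_idx = set(model.config.text_config.cross_attention_layers)

# Count parameters of the 80 self-attention decoder layers only.
decoder_params = sum(
    p.numel()
    for i, layer in enumerate(model.language_model.model.layers)
    if i not in cross_attn_idx
    for p in layer.parameters()
)

# Add the non-layer weights: token embeddings, final norm, LM head.
for module in (
    model.language_model.model.embed_tokens,
    model.language_model.model.norm,
    model.language_model.lm_head,
):
    decoder_params += sum(p.numel() for p in module.parameters())

print(f"{decoder_params:,}")  # should land around 70B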

Ok, now it makes sense - thanks 👍

Meta Llama org

Thanks for asking the Q!

With "drop-in" replacement, we mean to imply: "hey, just upload an image and you have a vision model to work on your problem," and "when you don't have an image, you will get the same results as the 70B backbone" (because both share the backbone).

So it's a drop-in replacement in the sense that you can use the same LLM in practice/prod without having to switch between two models to support the two use cases.
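To illustrate the text-only path, a hedged sketch reusing `model` and `model_id` from above. The chat-message format is an assumption based on the documented Mllama processor usage, and whether text-only inputs fully bypass the cross-attention layers depends on the transformers version:

from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained(model_id)

# Text-only prompt: no image is passed, so the cross-attention
# layers have nothing to attend to and the shared backbone does
# the work, matching the 70B text model's behavior.
messages = [
    {"role": "user", "content": [{"type": "text", "text": "Briefly explain cross-attention."}]}
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)

inputs = processor(text=prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=64)
print(processor.decode(output[0][inputs["input_ids"].shape[-1]:]))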
