Vision encoder 1024 output mapping to img_projection 4096 input?

#46
by hichammhanna - opened

Hi Team,
I'm looking into how the "microsoft/Phi-3-vision-128k-instruct" model works.
I summarized the architecture based on the print(model) output shown at the end. At a high level, the steps are:
Image --> CLIPVisionModel --> LayerNorm (1024) --> ??[Transformation/Concatenation]?? --> img_projection (4096 -> 3072) --> Combined with Text Tokens (3072)

My key question is how the ??[Transformation/Concatenation]?? step above is implemented to map the normalized 1024-dim output of the CLIPVisionModel encoder to the 4096-dim input of the img_projection layer.
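In shape terms, here is the mismatch I'm trying to understand (a rough sketch based on the printed layers, not the model's actual code):

```python
import torch

# CLIP ViT-L/14 at 336px: 577 tokens (1 CLS + 24*24 patches), hidden size 1024
clip_hidden = torch.randn(1, 577, 1024)

# img_projection (from the printout below) expects 4096-dim features
img_projection = torch.nn.Sequential(
    torch.nn.Linear(4096, 3072),
    torch.nn.GELU(),
    torch.nn.Linear(3072, 3072),
)

# Feeding the 1024-dim CLIP features in directly would fail with a shape error,
# so something in between must turn 1024 into 4096:
# img_projection(clip_hidden)  # RuntimeError: mat1 and mat2 shapes cannot be multiplied
```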

Here are the 2 sections in question:

[Screenshot: 2024-06-26 at 12.48.40 PM.png]

I'd much appreciate any guidance or insight.

Thanks

print(model) output:

Phi3VForCausalLM(
  (model): Phi3VModel(
    (embed_tokens): Embedding(32064, 3072, padding_idx=32000)
    (embed_dropout): Dropout(p=0.0, inplace=False)
    (vision_embed_tokens): Phi3ImageEmbedding(
      (drop): Dropout(p=0.0, inplace=False)
      (wte): Embedding(32064, 3072, padding_idx=32000)
      (img_processor): CLIPVisionModel(
        (vision_model): CLIPVisionTransformer(
          (embeddings): CLIPVisionEmbeddings(
            (patch_embedding): Conv2d(3, 1024, kernel_size=(14, 14), stride=(14, 14), bias=False)
            (position_embedding): Embedding(577, 1024)
          )
          (pre_layrnorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
          (encoder): CLIPEncoder(
            (layers): ModuleList(
              (0-23): 24 x CLIPEncoderLayer(
                (self_attn): CLIPAttention(
                  (k_proj): Linear(in_features=1024, out_features=1024, bias=True)
                  (v_proj): Linear(in_features=1024, out_features=1024, bias=True)
                  (q_proj): Linear(in_features=1024, out_features=1024, bias=True)
                  (out_proj): Linear(in_features=1024, out_features=1024, bias=True)
                )
                (layer_norm1): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
                (mlp): CLIPMLP(
                  (activation_fn): QuickGELUActivation()
                  (fc1): Linear(in_features=1024, out_features=4096, bias=True)
                  (fc2): Linear(in_features=4096, out_features=1024, bias=True)
                )
                (layer_norm2): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
              )
            )
          )
          (post_layernorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
        )
      )
      (img_projection): Sequential(
        (0): Linear(in_features=4096, out_features=3072, bias=True)
        (1): GELU(approximate='none')
        (2): Linear(in_features=3072, out_features=3072, bias=True)
      )
    )
    (layers): ModuleList(
      (0-31): 32 x Phi3DecoderLayer(
        (self_attn): Phi3FlashAttention2(
          (o_proj): Linear(in_features=3072, out_features=3072, bias=False)
          (qkv_proj): Linear(in_features=3072, out_features=9216, bias=False)
          (rotary_emb): Phi3SuScaledRotaryEmbedding()
        )
        (mlp): Phi3MLP(
          (gate_up_proj): Linear(in_features=3072, out_features=16384, bias=False)
          (down_proj): Linear(in_features=8192, out_features=3072, bias=False)
          (activation_fn): SiLU()
        )
        (input_layernorm): Phi3RMSNorm()
        (resid_attn_dropout): Dropout(p=0.0, inplace=False)
        (resid_mlp_dropout): Dropout(p=0.0, inplace=False)
        (post_attention_layernorm): Phi3RMSNorm()
      )
    )
    (norm): Phi3RMSNorm()
  )
  (lm_head): Linear(in_features=3072, out_features=32064, bias=False)
)
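
The printout above can be reproduced with something like the following (the loading arguments are illustrative; the attention implementation shown may depend on how the model is loaded):

```python
from transformers import AutoModelForCausalLM

# Load the model (trust_remote_code is needed for the Phi3V classes) and print its module tree
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3-vision-128k-instruct",
    trust_remote_code=True,
    torch_dtype="auto",
)
print(model)
```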

In the image embedding part of the code:

# 1 x (24x24) x 1024
global_img_feature = img_features[_bs, :1]
# 1 x 12 x 12 x 4096
glb_img = global_img_feature.reshape(1,H,H,C).reshape(1,H//2,2,H//2,2,C).contiguous().permute(0,1,3,2,4,5).reshape(1,H//2,H//2,4*C).contiguous()

It merges each 2x2 block of neighboring patches into the channel dimension, which is why the feature dimension becomes 4 x 1024 = 4096.
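
To see this concretely, here is a small standalone sketch (random values standing in for the real CLIP features) that applies the same reshape/permute sequence and shows where the 4096 comes from:

```python
import torch

H, C = 24, 1024                            # 24x24 patches, 1024 channels from the CLIP encoder
feat = torch.randn(1, H * H, C)            # 1 x 576 x 1024 (per-patch features)

x = feat.reshape(1, H, H, C)               # 1 x 24 x 24 x 1024
x = x.reshape(1, H // 2, 2, H // 2, 2, C)  # split the grid into 12x12 blocks of 2x2 patches
x = x.permute(0, 1, 3, 2, 4, 5)            # 1 x 12 x 12 x 2 x 2 x 1024
x = x.reshape(1, H // 2, H // 2, 4 * C)    # stack each 2x2 block's channels: 4 * 1024 = 4096

print(x.shape)                             # torch.Size([1, 12, 12, 4096]) -> matches img_projection's input
```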

Thanks @2U1 for pointing this out 👍🏼👍🏼
