phi3 image tokens

#42
by sachin - opened

I am looking at using the Phi-3-vision model to try to describe an image. However, I couldn't help but notice that the number of tokens an image takes up is quite large (~2000). Is this correct, or a potential bug? I have included a code snippet below so that you can check my assumptions.

From my understanding of VLMs, they simply take an image and use CLIP or similar to project it to one (or a few) tokens, so that the image becomes a "language token".

Side questions

In case it helps me understand Phi-3:

  1. Where is the 17 coming from in the pixel_values shape below?
  2. Why is image_sizes of shape (1, 2) and not (1, 1), given that I have only passed one image?

from PIL import Image
import requests
from transformers import AutoProcessor

model_id = "microsoft/Phi-3-vision-128k-instruct"
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

messages = [
    {"role": "user", "content": "<|image_1|>\nWhat is shown in this image?"},
]
url = "https://sm.ign.com/t/ign_ap/review/d/deadpool-r/deadpool-review_2s7s.1200.jpg"
image = Image.open(requests.get(url, stream=True).raw)

prompt = processor.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(prompt, [image], return_tensors="pt")

print({k: v.shape for k, v in inputs.items()})
# {'input_ids': torch.Size([1, 2371]),
# 'attention_mask': torch.Size([1, 2371]),
# 'pixel_values': torch.Size([1, 17, 3, 336, 336]),
# 'image_sizes': torch.Size([1, 2])}

I tried image.resize((128, 128)), but this just increased the number of tokens to 2500+.
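
As a rough check of how many of those 2371 positions belong to the image rather than the text (assuming the processor simply expands the <|image_1|> placeholder into image feature positions; I have not traced the exact bookkeeping), the text-only tokenization of the prompt can be compared with the full multimodal input:

# Rough check: tokenize the prompt on its own and compare lengths.
text_only_ids = processor.tokenizer(prompt, return_tensors="pt").input_ids

num_text_tokens = text_only_ids.shape[-1]          # includes the literal "<|image_1|>" string
num_total_tokens = inputs["input_ids"].shape[-1]   # 2371 in the run above

# The difference is approximately the number of positions the image occupies.
print("text-only tokens:", num_text_tokens)
print("with image:      ", num_total_tokens)
print("~image tokens:   ", num_total_tokens - num_text_tokens)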

I think Phi-3-vision is based on LLaVA-1.6, and it looks like the 17 comes from the number of image crops.

You could take a look at the LLaVA-1.6 (LLaVA-NeXT) blog post:
https://llava-vl.github.io/blog/2024-01-30-llava-next/
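
For intuition, here is a toy sketch of the LLaVA-1.6-style "AnyRes" tiling idea. It is a simplification, not the exact Phi-3-vision implementation: the 336 tile size comes from the pixel_values shape above, the 16-crop budget is a guess based on the 17, and everything else is illustrative. The image is covered by a grid of 336x336 tiles chosen to roughly match its aspect ratio, plus one downscaled global view, and every tile contributes its own block of visual tokens, which is why the token count grows with resolution instead of being "one token per image".

import math

def toy_crop_grid(width, height, max_crops=16):
    # Toy AnyRes-style tiling: pick the widest grid of square tiles whose
    # total count stays within max_crops while matching the aspect ratio.
    aspect = width / height
    cols = 1
    while (cols + 1) * math.ceil((cols + 1) / aspect) <= max_crops:
        cols += 1
    rows = math.ceil(cols / aspect)
    return cols, rows

# Hypothetical 16:9 input, similar to the 1200px-wide image above:
cols, rows = toy_crop_grid(1200, 675)
local_tiles = cols * rows        # e.g. 5 x 3 = 15 tiles of 336x336
total_views = local_tiles + 1    # plus one global view of the whole image
print(cols, rows, total_views)

Under this kind of scheme a small square image would still be scaled up and tiled into a full 4x4 grid, which may be why resizing to 128x128 did not reduce the token count. I believe, but have not verified, that the Phi-3-vision processor also pads pixel_values to a fixed num_crops + 1 = 17 frames, so the 17 in the shape would be the crop budget rather than the number of tiles actually used.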
