ValueError: Number of image tokens in input_ids (0) different from num_images (1)

by Dilllllll - opened 25 days ago

Discussion

Dilllllll

25 days ago

When I run the following, I get the error in the title:
output = model.generate(**inputs_video, max_new_tokens=100, do_sample=False)

Dilllllll

24 days ago

My complete code:
import requests
import torch
from PIL import Image
from transformers import LlavaNextVideoForConditionalGeneration, LlavaNextVideoProcessor

model_id = "llava-hf/LLaVA-NeXT-Video-34B-DPO-hf"
model = LlavaNextVideoForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
    device_map="auto",
)
processor = LlavaNextVideoProcessor.from_pretrained(model_id)

# Single-turn conversation with one image placeholder
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "What are these?"},
            {"type": "image"},
        ],
    },
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)

# Download the test image and run generation
image_file = "http://images.cocodataset.org/val2017/000000039769.jpg"
raw_image = Image.open(requests.get(image_file, stream=True).raw)
inputs_image = processor(text=prompt, images=raw_image, return_tensors="pt").to(0, torch.float16)
output = model.generate(**inputs_image, max_new_tokens=100, do_sample=False)
print(processor.decode(output[0][2:], skip_special_tokens=True))

Dilllllll

22 days ago

processor = LlavaNextVideoProcessor.from_pretrained(model_id, use_fast=False)

With use_fast=False, the error goes away.

Dilllllll changed discussion status to closed 22 days ago

Dilllllll changed discussion status to open 22 days ago

Dilllllll

22 days ago

But when I test with a video, the error appears again.

Code:
total_frames = container.streams.video[0].frames
indices = np.arange(0, total_frames, total_frames / 8).astype(int)
clip = read_video_pyav(container, indices)
inputs_video = processor(text=prompt, videos=clip, padding=True, return_tensors="pt").to("cuda", torch.bfloat16)
output = model.generate(**inputs_video, max_new_tokens=30, do_sample=False)

Dilllllll changed discussion status to closed 22 days ago

Dilllllll changed discussion status to open 22 days ago

nielsr

Llava Hugging Face org 21 days ago

Pinging @RaushanTurganbay here

RaushanTurganbay

Llava Hugging Face org 15 days ago

@Dilllllll from the provided code it's unclear whether you're trying to run inference with an image or a video. Please note that you need to specify the correct modality (image or video) in the conversation template.

Could you please share runnable code that reproduces the error, if it's not resolved yet?
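
To illustrate the point about modality, here is a minimal sketch of building the conversation so that its placeholder entry matches the input you later pass to the processor. The helper name build_conversation is hypothetical and not part of transformers; the dict layout is copied from the snippets in this thread.

```python
def build_conversation(text, modality):
    """Return a single-turn user conversation with one media placeholder.

    build_conversation is a hypothetical helper for illustration only.
    """
    if modality not in ("image", "video"):
        raise ValueError(f"modality must be 'image' or 'video', got {modality!r}")
    return [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": text},
                {"type": modality},  # must match the kind of input given to the processor
            ],
        },
    ]


image_conv = build_conversation("What are these?", "image")
video_conv = build_conversation("Why is this video funny?", "video")
```

The result would then go to processor.apply_chat_template(conversation, add_generation_prompt=True), with images=... passed to the processor for the "image" case and videos=... for the "video" case.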

Dilllllll

15 days ago

code:
import os
import av
import torch
from transformers import LlavaNextVideoProcessor, LlavaNextVideoForConditionalGeneration
import numpy as np
from huggingface_hub import hf_hub_download
import time
model_id = "./llava-hf/LLaVA-NeXT-Video-34B-DPO-hf"

model = LlavaNextVideoForConditionalGeneration.from_pretrained(
model_id,
torch_dtype=torch.float16,
low_cpu_mem_usage=True,
device_map="auto"
).eval()

processor = LlavaNextVideoProcessor.from_pretrained(model_id)
def read_video_pyav(container, indices):
    '''
    Decode the video with PyAV decoder.
    Args:
        container (av.container.input.InputContainer): PyAV container.
        indices (List[int]): List of frame indices to decode.
    Returns:
        result (np.ndarray): np array of decoded frames of shape (num_frames, height, width, 3).
    '''
    frames = []
    container.seek(0)
    start_index = indices[0]
    end_index = indices[-1]
    for i, frame in enumerate(container.decode(video=0)):
        if i > end_index:
            break
        if i >= start_index and i in indices:
            frames.append(frame)
    return np.stack([x.to_ndarray(format="rgb24") for x in frames])

conversation = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Why is this video funny?"},
            {"type": "video"},
        ],
    },
]

prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
video_path = "./sample_demo_1.mp4"
container = av.open(video_path)
total_frames = container.streams.video[0].frames
indices = np.arange(0, total_frames, total_frames / 2).astype(int)
clip = read_video_pyav(container, indices)
inputs_video = processor(text=prompt, videos=clip, padding=True, return_tensors="pt").to("cuda", torch.float16)
output = model.generate(**inputs_video, max_new_tokens=30)
print(processor.decode(output[0][2:], skip_special_tokens=True))
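
For readers following the frame-sampling step above, here is a pure-Python sketch of uniform index sampling that should give roughly the same indices as np.arange(0, total_frames, total_frames / n).astype(int). The helper name sample_frame_indices is hypothetical. The guard on total_frames is there because, as far as I know, PyAV can report 0 frames when the container does not store a frame count.

```python
def sample_frame_indices(total_frames, num_frames):
    """Pick num_frames evenly spaced frame indices in [0, total_frames).

    Hypothetical helper for illustration; not part of PyAV or transformers.
    """
    if total_frames <= 0:
        raise ValueError("total_frames must be positive (the container may not report a frame count)")
    if num_frames <= 0:
        raise ValueError("num_frames must be positive")
    step = total_frames / num_frames
    # Truncate toward zero, as .astype(int) does for non-negative values
    return [int(i * step) for i in range(num_frames)]


print(sample_frame_indices(80, 8))  # [0, 10, 20, 30, 40, 50, 60, 70]
print(sample_frame_indices(10, 2))  # [0, 5]
```

Note that if num_frames is larger than total_frames, some indices repeat, so the decoded clip would contain fewer distinct frames than requested.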

Dilllllll

15 days ago

@RaushanTurganbay, the error appears whether the input is an image or a video.

RaushanTurganbay

Llava Hugging Face org 15 days ago

@Dilllllll thanks, I found the bug. The chat template was not correct. I updated the files on the Hub, so it should be working now.
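
A quick way to catch this class of error before calling generate is to check that the rendered prompt actually contains a media placeholder. A count of 0, as in the error message, is consistent with a template that emitted none. The sketch below assumes the placeholders are the literal strings "<image>" and "<video>" (an assumption; check the tokens your processor uses). count_placeholders and check_prompt are hypothetical helpers, and the two example prompts are made up for illustration.

```python
def count_placeholders(prompt, image_token="<image>", video_token="<video>"):
    """Count media placeholders in a rendered prompt string.

    The default token strings are assumptions, not read from the processor.
    """
    return {"image": prompt.count(image_token), "video": prompt.count(video_token)}


def check_prompt(prompt, num_images=0, num_videos=0):
    """Raise early if the placeholder counts do not match the inputs."""
    counts = count_placeholders(prompt)
    if counts["image"] != num_images or counts["video"] != num_videos:
        raise ValueError(
            f"prompt has {counts['image']} image and {counts['video']} video "
            f"placeholders, expected {num_images} and {num_videos}"
        )
    return counts


# Made-up prompts: the first has no placeholder, the second has one
broken_prompt = "USER: What are these? ASSISTANT:"
good_prompt = "USER: <image>\nWhat are these? ASSISTANT:"

print(check_prompt(good_prompt, num_images=1))  # {'image': 1, 'video': 0}
```

In practice you would run the check on the string returned by processor.apply_chat_template before building the model inputs.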

Dilllllll

14 days ago

@RaushanTurganbay Thanks, the code works.

Dilllllll changed discussion status to closed 14 days ago
