ValueError: Number of image tokens in input_ids (0) different from num_images (1)

#1
by Dilllllll - opened

When I run the line below, I get the error in the title:
output = model.generate(**inputs_video, max_new_tokens=100, do_sample=False)

My complete code:
import requests
import torch
from PIL import Image
from transformers import LlavaNextVideoForConditionalGeneration, LlavaNextVideoProcessor

model_id = "llava-hf/LLaVA-NeXT-Video-34B-DPO-hf"
model = LlavaNextVideoForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
    device_map="auto",
)
processor = LlavaNextVideoProcessor.from_pretrained(model_id)

# Single-turn conversation with one image placeholder
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "What are these?"},
            {"type": "image"},
        ],
    },
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
image_file = "http://images.cocodataset.org/val2017/000000039769.jpg"
raw_image = Image.open(requests.get(image_file, stream=True).raw)
inputs_image = processor(prompt, images=raw_image, return_tensors='pt').to(0, torch.float16)
output = model.generate(**inputs_image, max_new_tokens=100, do_sample=False)
print(processor.decode(output[0][2:], skip_special_tokens=True))

The error goes away if I load the processor with use_fast=False:

processor = LlavaNextVideoProcessor.from_pretrained(model_id, use_fast=False)

Dilllllll changed discussion status to closed
Dilllllll changed discussion status to open

But when I test with a video, the error appears again.

code:
total_frames = container.streams.video[0].frames
indices = np.arange(0, total_frames, total_frames / 8).astype(int)
clip = read_video_pyav(container, indices)
inputs_video = processor(text=prompt, videos=clip, padding=True, return_tensors="pt").to("cuda", torch.bfloat16)
output = model.generate(**inputs_video, max_new_tokens=30, do_sample=False)

Dilllllll changed discussion status to closed
Dilllllll changed discussion status to open
Llava Hugging Face org

Pinging @RaushanTurganbay here

Llava Hugging Face org

@Dilllllll from the provided code it's unclear whether you're trying to run inference with an image or a video. Please note that you need to specify the correct modality (image or video) in the conversation template.

Can you please share runnable code that reproduces the error, if it's not resolved yet?
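To illustrate the point about modality, here is a minimal sketch of how the conversation entry might be built so that its placeholder type always matches the input being passed. The helper name `build_conversation` is hypothetical and not part of transformers; only the dictionary layout is taken from the code in this thread.

```python
def build_conversation(question, modality):
    """Build a single-turn chat whose placeholder matches the input modality.

    `build_conversation` is a hypothetical helper for illustration only.
    """
    if modality not in ("image", "video"):
        raise ValueError(f"unsupported modality: {modality!r}")
    return [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": question},
                {"type": modality},  # must agree with images=... or videos=...
            ],
        }
    ]


# Use the "image" entry together with processor(..., images=...)
image_chat = build_conversation("What are these?", "image")
# Use the "video" entry together with processor(..., videos=...)
video_chat = build_conversation("Why is this video funny?", "video")
```

Either list can then be passed to `processor.apply_chat_template(..., add_generation_prompt=True)` as in the snippets in this thread.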

code:
import os
import av
import torch
from transformers import LlavaNextVideoProcessor, LlavaNextVideoForConditionalGeneration
import numpy as np
from huggingface_hub import hf_hub_download
import time
model_id = "./llava-hf/LLaVA-NeXT-Video-34B-DPO-hf"

model = LlavaNextVideoForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
    device_map="auto"
).eval()

processor = LlavaNextVideoProcessor.from_pretrained(model_id)
def read_video_pyav(container, indices):
    '''
    Decode the video with PyAV decoder.
    Args:
        container (av.container.input.InputContainer): PyAV container.
        indices (List[int]): List of frame indices to decode.
    Returns:
        result (np.ndarray): np array of decoded frames of shape (num_frames, height, width, 3).
    '''
    frames = []
    container.seek(0)
    start_index = indices[0]
    end_index = indices[-1]
    for i, frame in enumerate(container.decode(video=0)):
        if i > end_index:
            break
        if i >= start_index and i in indices:
            frames.append(frame)
    return np.stack([x.to_ndarray(format="rgb24") for x in frames])

conversation = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Why is this video funny?"},
            {"type": "video"},
        ],
    },
]

prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
video_path = "./sample_demo_1.mp4"
container = av.open(video_path)
total_frames = container.streams.video[0].frames
indices = np.arange(0, total_frames, total_frames / 2).astype(int)
clip = read_video_pyav(container, indices)
inputs_video = processor(text=prompt, videos=clip, padding=True, return_tensors="pt").to("cuda", torch.float16)
output = model.generate(**inputs_video, max_new_tokens=30)
print(processor.decode(output[0][2:], skip_special_tokens=True))

@RaushanTurganbay, no matter whether the input is an image or a video, this error appears.

Llava Hugging Face org

@Dilllllll thanks, I found the bug. The chat template was incorrect. I have updated the files on the Hub, so it should work now.
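For anyone hitting the same ValueError, a quick sanity check is to look at the string returned by `apply_chat_template` before calling the processor. The sketch below counts modality placeholders in a rendered prompt. It assumes the placeholders are the literal strings `<image>` and `<video>`; the helper names and the sample prompts are hypothetical, so adjust the tokens to whatever your processor actually uses.

```python
def count_placeholders(prompt, image_token="<image>", video_token="<video>"):
    """Count modality placeholders in a rendered prompt string.

    The default token strings are assumptions, not read from the processor.
    """
    return {"image": prompt.count(image_token), "video": prompt.count(video_token)}


def check_prompt(prompt, num_images=0, num_videos=0):
    """Raise early if the placeholders do not match the inputs being passed."""
    counts = count_placeholders(prompt)
    if counts["image"] != num_images or counts["video"] != num_videos:
        raise ValueError(
            f"prompt has {counts['image']} image and {counts['video']} video "
            f"placeholder(s), expected {num_images} and {num_videos}"
        )
    return counts


# Hypothetical rendered prompts, for illustration only.
good_prompt = "USER: <video>\nWhy is this video funny? ASSISTANT:"
broken_prompt = "USER: Why is this video funny? ASSISTANT:"

print(check_prompt(good_prompt, num_videos=1))
try:
    check_prompt(broken_prompt, num_videos=1)
except ValueError as err:
    print("mismatch:", err)
```

With the code from this thread, the call would be `check_prompt(prompt, num_videos=1)` right after `apply_chat_template`. A count of zero points at the chat template rather than at the video or image input.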

@RaushanTurganbay Thanks, the code works.

Dilllllll changed discussion status to closed
