Can InstructBLIP process videos?

#8
by UncleanCode

I recently looked at the source of blip2_vicuna-instruct7b in the Salesforce/LAVIS repository and found code for handling videos. I don't know whether this made it into the Hugging Face InstructBLIP model, so I'm asking: can InstructBLIP handle videos, and if so, how do I go about it?

Hi,

Thanks for your interest in InstructBLIP. Support for videos is not yet present in the Transformers library. Did the authors release any checkpoints trained on video?

I'm not aware of any at the moment; I'll check whether there is one. What I saw was just a bit of code for handling videos with a low frame count.

also interested in processing videos

Hi @UncleanCode @louis030195

Can you share the snippet for handling videos from the original authors? It can probably be adapted a bit to use the Transformers model.

Hi,

I'm trying to run the demo at the end of https://huggingface.co/docs/transformers/main/en/model_doc/instructblip#transformers.InstructBlipForConditionalGeneration, but the model does not generate text; instead it gives this:

Loading checkpoint shards: 100%|██████████| 4/4 [00:24<00:00, 6.10s/it]
/home/tanya.kaintura/Project/myenv/lib/python3.8/site-packages/transformers/generation/configuration_utils.py:412: UserWarning: do_sample is set to False. However, top_p is set to 0.9 -- this flag is only used in sample-based generation modes. You should set do_sample=True or unset top_p.
warnings.warn(
? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ... (the rest of the output is the same question mark repeated)
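
(For reference, the UserWarning in the log above only means that top_p is ignored when do_sample=False; below is a minimal sketch of the two consistent ways to call generate, assuming a model and inputs already prepared as in the docs demo.)

# Minimal sketch, assuming `model`, `processor` and `inputs` are set up as in the docs demo.
# Option 1: deterministic decoding -- drop the sampling-only knobs so the warning disappears.
outputs = model.generate(**inputs, do_sample=False, num_beams=5, max_new_tokens=256)

# Option 2: actually sample, so that top_p takes effect.
outputs = model.generate(**inputs, do_sample=True, top_p=0.9, max_new_tokens=256)

generated_text = processor.batch_decode(outputs, skip_special_tokens=True)[0].strip()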

Hi,

For videos I recommend taking a look at VideoBLIP: https://huggingface.co/models?other=video-to-text

Update, InstructBLIP-Video is now supported! https://huggingface.co/docs/transformers/main/en/model_doc/instructblipvideo

Unfortunately, the generation example on the page is not working. Additionally, could you please provide an example of video and text feature extraction? Error message: TypeError: InstructBlipVideoForConditionalGeneration.forward() got an unexpected keyword argument 'videos'

Pinging @RaushanTurganbay here

@CennetOguz the example code had typos; I will fix it on main soon. You can use the following code to generate:

from transformers import InstructBlipVideoProcessor, InstructBlipVideoForConditionalGeneration
import torch
from huggingface_hub import hf_hub_download
import av
import numpy as np

def read_video_pyav(container, indices):
    '''
    Decode the video with PyAV decoder.
    Args:
        container (`av.container.input.InputContainer`): PyAV container.
        indices (`List[int]`): List of frame indices to decode.
    Returns:
        result (np.ndarray): np array of decoded frames of shape (num_frames, height, width, 3).
    '''
    frames = []
    container.seek(0)
    start_index = indices[0]
    end_index = indices[-1]
    for i, frame in enumerate(container.decode(video=0)):
        if i > end_index:
            break
        if i >= start_index and i in indices:
            frames.append(frame)
    return np.stack([x.to_ndarray(format="rgb24") for x in frames])

model = InstructBlipVideoForConditionalGeneration.from_pretrained("Salesforce/instructblip-vicuna-7b", device_map="auto")
processor = InstructBlipVideoProcessor.from_pretrained("Salesforce/instructblip-vicuna-7b")

file_path = hf_hub_download(repo_id="nielsr/video-demo", filename="eating_spaghetti.mp4", repo_type="dataset")
container = av.open(file_path)

# sample uniformly 4 frames from the video
total_frames = container.streams.video[0].frames
indices = np.arange(0, total_frames, total_frames / 4).astype(int)
clip = read_video_pyav(container, indices)

prompt = "What is happening in the video?"
inputs = processor(text=prompt, images=clip, return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    do_sample=False,
    num_beams=5,
    max_length=256,
    repetition_penalty=1.5,
    length_penalty=1.0,
)
generated_text = processor.batch_decode(outputs, skip_special_tokens=True)[0].strip()
print(generated_text)

Regarding feature extraction:
From the project repo, it looks like the InstructBlip models do not support feature extraction, since they do not have a projection head to project the text/vision embeddings into the same latent space. However, Blip2 does support feature extraction, as shown in this notebook. The PR to add ITM capability to Transformers Blip2 is in progress; you can track it here.
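
For completeness, here is a minimal sketch of Blip2 feature extraction with the Transformers Blip2Model (not InstructBLIP); the checkpoint, image URL and text below are placeholders chosen for illustration:

from PIL import Image
import requests
import torch
from transformers import Blip2Processor, Blip2Model

# Sketch only: Salesforce/blip2-opt-2.7b is used as an example checkpoint.
processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2Model.from_pretrained("Salesforce/blip2-opt-2.7b", device_map="auto")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

inputs = processor(images=image, text="two cats lying on a couch", return_tensors="pt").to(model.device)

with torch.no_grad():
    # Vision (ViT) encoder features: (batch, num_patches + 1, hidden_size)
    image_features = model.get_image_features(pixel_values=inputs.pixel_values).last_hidden_state
    # Q-Former features: (batch, num_query_tokens, qformer_hidden_size)
    qformer_features = model.get_qformer_features(pixel_values=inputs.pixel_values).last_hidden_state

print(image_features.shape, qformer_features.shape)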
