---
datasets:
  - lmms-lab/LLaVA-NeXT-Video-178K
language:
  - en
library_name: transformers
license: apache-2.0
metrics:
  - accuracy
tags:
  - multimodal
model-index:
  - name: LLaVA-NeXT-Video-7B-Qwen2
    results:
      - task:
          type: multimodal
        dataset:
          name: ActNet-QA
          type: actnet-qa
        metrics:
          - type: accuracy
            value: 58.2
            name: accuracy
            verified: true
      - task:
          type: multimodal
        dataset:
          name: EgoSchema
          type: egoschema
        metrics:
          - type: accuracy
            value: 57.3
            name: accuracy
            verified: true
      - task:
          type: multimodal
        dataset:
          name: MLVU
          type: mlvu
        metrics:
          - type: accuracy
            value: 69.8
            name: accuracy
            verified: true
      - task:
          type: multimodal
        dataset:
          name: MVBench
          type: mvbench
        metrics:
          - type: accuracy
            value: 58.4
            name: accuracy
            verified: true
      - task:
          type: multimodal
        dataset:
          name: NextQA
          type: nextqa
        metrics:
          - type: accuracy
            value: 82.2
            name: accuracy
            verified: true
      - task:
          type: multimodal
        dataset:
          name: PercepTest
          type: percepTest
        metrics:
          - type: accuracy
            value: 71.7
            name: accuracy
            verified: true
      - task:
          type: multimodal
        dataset:
          name: VideoChatGPT
          type: videochatgpt
        metrics:
          - type: score
            value: 3.54
            name: score
            verified: true
      - task:
          type: multimodal
        dataset:
          name: VideoDC
          type: videodc
        metrics:
          - type: score
            value: 3.71
            name: score
            verified: true
      - task:
          type: multimodal
        dataset:
          name: LongVideoBench
          type: longvideobench
        metrics:
          - type: accuracy
            value: 57.3
            name: accuracy
            verified: true
      - task:
          type: multimodal
        dataset:
          name: VideoMME
          type: videomme
        metrics:
          - type: accuracy
            value: 63.2
            name: accuracy
            verified: true
base_model:
  - lmms-lab/llava-onevision-qwen2-7b-si
---

# LLaVA-NeXT-Video-7B-Qwen2-Video-Only

## Table of Contents

  1. Model Summary
  2. Use
  3. Limitations
  4. Training
  5. License
  6. Citation

## Model Summary

The LLaVA-NeXT-Video models are 7B/72B-parameter models trained on LLaVA-NeXT-Video-178K, based on the Qwen2 language model with a context window of 32K tokens.

This model supports at most 110 frames per video; longer videos should be uniformly subsampled to fit this budget, as sketched below.
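
A minimal sketch of that subsampling step (not part of the official example; the helper name `sample_frame_indices` is hypothetical), using the same decord/numpy tooling as the Generation section below:

```python
# Minimal sketch: uniformly subsample a long video down to the 110-frame budget
# before handing it to the model. `sample_frame_indices` is a hypothetical helper.
import numpy as np
from decord import VideoReader, cpu

MAX_FRAMES = 110  # upper bound supported by this model


def sample_frame_indices(video_path: str, max_frames: int = MAX_FRAMES) -> list:
    vr = VideoReader(video_path, ctx=cpu(0))
    num_frames = min(max_frames, len(vr))
    # Evenly spaced frame indices across the whole clip.
    return np.linspace(0, len(vr) - 1, num_frames, dtype=int).tolist()
```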

## Use

### Intended use

The model was trained on LLaVA-NeXT-Video-178K and has the ability to interact with videos.

Feel free to share your generations in the Community tab!

### Generation

We provide a simple generation example for using our model below. For more details, please refer to our GitHub repository.

```python
# pip install git+https://github.com/LLaVA-VL/LLaVA-NeXT.git
from llava.model.builder import load_pretrained_model
from llava.mm_utils import get_model_name_from_path, process_images, tokenizer_image_token
from llava.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN, DEFAULT_IM_START_TOKEN, DEFAULT_IM_END_TOKEN, IGNORE_INDEX
from llava.conversation import conv_templates, SeparatorStyle
from PIL import Image
import requests
import copy
import torch
import sys
import warnings
from decord import VideoReader, cpu
import numpy as np

warnings.filterwarnings("ignore")


def load_video(video_path, max_frames_num, fps=1, force_sample=False):
    """Sample frames from a video at `fps` frames per second, capped at `max_frames_num`."""
    if max_frames_num == 0:
        return np.zeros((1, 336, 336, 3))
    vr = VideoReader(video_path, ctx=cpu(0), num_threads=1)
    total_frame_num = len(vr)
    video_time = total_frame_num / vr.get_avg_fps()
    fps = round(vr.get_avg_fps() / fps)
    frame_idx = [i for i in range(0, len(vr), fps)]
    frame_time = [i / fps for i in frame_idx]
    if len(frame_idx) > max_frames_num or force_sample:
        # Fall back to uniform sampling of exactly max_frames_num frames.
        sample_fps = max_frames_num
        uniform_sampled_frames = np.linspace(0, total_frame_num - 1, sample_fps, dtype=int)
        frame_idx = uniform_sampled_frames.tolist()
        frame_time = [i / vr.get_avg_fps() for i in frame_idx]
    frame_time = ",".join([f"{i:.2f}s" for i in frame_time])
    spare_frames = vr.get_batch(frame_idx).asnumpy()
    return spare_frames, frame_time, video_time


pretrained = "lmms-lab/LLaVA-NeXT-Video-7B-Qwen2-Video-Only"
model_name = "llava_qwen"
device = "cuda"
device_map = "auto"
tokenizer, model, image_processor, max_length = load_pretrained_model(pretrained, None, model_name, device_map=device_map)  # add any other kwargs you want to pass via llava_model_args
model.eval()

video_path = "XXXX"  # path to your local video file
max_frames_num = 110  # the model supports at most 110 frames
video, frame_time, video_time = load_video(video_path, max_frames_num, 1, force_sample=True)
video = image_processor.preprocess(video, return_tensors="pt")["pixel_values"].cuda().bfloat16()
video = [video]

conv_template = "qwen_1_5"  # make sure you use the correct chat template for different models
question = DEFAULT_IMAGE_TOKEN + "\nPlease describe this video in detail."
conv = copy.deepcopy(conv_templates[conv_template])
conv.append_message(conv.roles[0], question)
conv.append_message(conv.roles[1], None)
prompt_question = conv.get_prompt()

input_ids = tokenizer_image_token(prompt_question, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0).to(device)
cont = model.generate(
    input_ids,
    images=video,
    modalities=["video"],
    do_sample=False,
    temperature=0,
    max_new_tokens=4096,
)
text_outputs = tokenizer.batch_decode(cont, skip_special_tokens=True)
print(text_outputs)
```
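
Note that, besides the LLaVA-NeXT package, the example above relies on `decord` for video decoding and `numpy` for frame sampling; if they are not already available, `pip install decord numpy` should cover them.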

## Training

### Model

- Architecture: SO400M + Qwen2
- Initialized Model: lmms-lab/llava-onevision-qwen2-7b-si
- Data: a mixture of 1.6M single-image, multi-image, and video samples; 1 epoch; full-model training
- Precision: bfloat16 (see the sketch after this list)
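
Since training ran in bfloat16, inference is typically kept in the same precision. The following is an illustrative sketch only (not the card's official setup), reusing `load_pretrained_model` from the Generation example and plain PyTorch dtype casting:

```python
# Minimal sketch (illustrative): load the released checkpoint and keep the
# weights in the bfloat16 precision used for training.
import torch
from llava.model.builder import load_pretrained_model

tokenizer, model, image_processor, max_length = load_pretrained_model(
    "lmms-lab/LLaVA-NeXT-Video-7B-Qwen2-Video-Only", None, "llava_qwen", device_map="auto"
)
model = model.to(dtype=torch.bfloat16).eval()  # match the bfloat16 training precision
```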

### Hardware & Software

## Citation