Best practice for fine-tuning llava-next-video-7b

by study-hjt

The ms-swift LLM fine-tuning framework supports inference, fine-tuning, and deployment of llava-next-video-7b, and provides a best-practice guide. Feel free to give it a try! 😊

https://github.com/modelscope/swift/blob/main/docs/source_en/Multi-Modal/llava-video-best-practice.md

I was trying to run the Google Colab notebook that Hugging Face provides for "LLaVA-NeXT-Video-7B-hf" on AWS SageMaker, but I got the following error and could not resolve it by myself. Could someone help me here?
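For context, that cell follows the notebook's standard generate pattern. A minimal sketch of the failing call (the model id is the notebook's; the random frames are only a stand-in for its real sampled clip, and the plain-text prompt format follows the model card):

```python
import numpy as np
import torch
from transformers import LlavaNextVideoForConditionalGeneration, LlavaNextVideoProcessor

model_id = "llava-hf/LLaVA-NeXT-Video-7B-hf"
processor = LlavaNextVideoProcessor.from_pretrained(model_id)
model = LlavaNextVideoForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

# Dummy clip standing in for the notebook's sampled video frames:
# shape (num_frames, height, width, channels).
video = np.random.randint(0, 255, size=(8, 336, 336, 3), dtype=np.uint8)

prompt = "USER: <video>\nWhy is this video funny? ASSISTANT:"
inputs = processor(text=prompt, videos=video, return_tensors="pt").to(model.device)

generate_kwargs = {"max_new_tokens": 100, "do_sample": True, "top_p": 0.9}
output = model.generate(**inputs, **generate_kwargs)  # raises the KeyError below
generated_text = processor.batch_decode(output, skip_special_tokens=True)
```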

KeyError: 'Cache only has 0 layers, attempted to access layer with index 0'

Entire error message:


KeyError Traceback (most recent call last)
Cell In[17], line 3
1 generate_kwargs = {"max_new_tokens": 100, "do_sample": True, "top_p": 0.9}
----> 3 output = model.generate(**inputs, **generate_kwargs)
4 generated_text = processor.batch_decode(output, skip_special_tokens=True)

File ~/anaconda3/envs/tensorflow2_p310/lib/python3.10/site-packages/torch/utils/_contextlib.py:116, in context_decorator.<locals>.decorate_context(*args, **kwargs)
113 @functools.wraps(func)
114 def decorate_context(*args, **kwargs):
115 with ctx_factory():
--> 116 return func(*args, **kwargs)

File ~/anaconda3/envs/tensorflow2_p310/lib/python3.10/site-packages/transformers/generation/utils.py:2024, in GenerationMixin.generate(self, inputs, generation_config, logits_processor, stopping_criteria, prefix_allowed_tokens_fn, synced_gpus, assistant_model, streamer, negative_prompt_ids, negative_prompt_attention_mask, **kwargs)
2016 input_ids, model_kwargs = self._expand_inputs_for_generation(
2017 input_ids=input_ids,
2018 expand_size=generation_config.num_return_sequences,
2019 is_encoder_decoder=self.config.is_encoder_decoder,
2020 **model_kwargs,
2021 )
2023 # 13. run sample (it degenerates to greedy search when generation_config.do_sample=False)
-> 2024 result = self._sample(
2025 input_ids,
2026 logits_processor=prepared_logits_processor,
2027 logits_warper=prepared_logits_warper,
2028 stopping_criteria=prepared_stopping_criteria,
2029 generation_config=generation_config,
2030 synced_gpus=synced_gpus,
2031 streamer=streamer,
2032 **model_kwargs,
2033 )
2035 elif generation_mode in (GenerationMode.BEAM_SAMPLE, GenerationMode.BEAM_SEARCH):
2036 # 11. prepare logits warper
2037 prepared_logits_warper = (
2038 self._get_logits_warper(generation_config, device=input_ids.device)
2039 if generation_config.do_sample
2040 else None
2041 )

File ~/anaconda3/envs/tensorflow2_p310/lib/python3.10/site-packages/transformers/generation/utils.py:2982, in GenerationMixin._sample(self, input_ids, logits_processor, stopping_criteria, generation_config, synced_gpus, streamer, logits_warper, **model_kwargs)
2979 model_inputs.update({"output_hidden_states": output_hidden_states} if output_hidden_states else {})
2981 # forward pass to get next token
-> 2982 outputs = self(**model_inputs, return_dict=True)
2984 if synced_gpus and this_peer_finished:
2985 continue # don't waste resources running the code we don't need

File ~/anaconda3/envs/tensorflow2_p310/lib/python3.10/site-packages/torch/nn/modules/module.py:1553, in Module._wrapped_call_impl(self, *args, **kwargs)
1551 return self._compiled_call_impl(*args, **kwargs) # type: ignore[misc]
1552 else:
-> 1553 return self._call_impl(*args, **kwargs)

File ~/anaconda3/envs/tensorflow2_p310/lib/python3.10/site-packages/torch/nn/modules/module.py:1562, in Module._call_impl(self, *args, **kwargs)
1557 # If we don't have any hooks, we want to skip the rest of the logic in
1558 # this function, and just call forward.
1559 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
1560 or _global_backward_pre_hooks or _global_backward_hooks
1561 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1562 return forward_call(*args, **kwargs)
1564 try:
1565 result = None

File ~/anaconda3/envs/tensorflow2_p310/lib/python3.10/site-packages/accelerate/hooks.py:169, in add_hook_to_module.<locals>.new_forward(module, *args, **kwargs)
167 output = module._old_forward(*args, **kwargs)
168 else:
--> 169 output = module._old_forward(*args, **kwargs)
170 return module._hf_hook.post_forward(module, output)

File ~/anaconda3/envs/tensorflow2_p310/lib/python3.10/site-packages/transformers/models/llava_next_video/modeling_llava_next_video.py:915, in LlavaNextVideoForConditionalGeneration.forward(self, input_ids, pixel_values, pixel_values_videos, image_sizes, attention_mask, position_ids, past_key_values, inputs_embeds, vision_feature_layer, vision_feature_select_strategy, labels, use_cache, output_attentions, output_hidden_states, return_dict)
912 # generation with cache, decoding stage
913 elif past_key_values is not None and (pixel_values is not None or pixel_values_videos is not None):
914 # Retrieve the first layer to inspect the logits and mask out the hidden states that are set to 0
--> 915 first_layer_past_key_value = past_key_values[0][0][:, :, :, 0]
916 # Sum all dimensions of head_dim (-2) to avoid random errors such as: https://github.com/huggingface/transformers/pull/28032#issuecomment-1863691941
917 batch_index, non_attended_tokens = torch.where(first_layer_past_key_value.float().sum(-2) == 0)

File ~/anaconda3/envs/tensorflow2_p310/lib/python3.10/site-packages/transformers/cache_utils.py:334, in DynamicCache.__getitem__(self, layer_idx)
332 return (self.key_cache[layer_idx], self.value_cache[layer_idx])
333 else:
--> 334 raise KeyError(f"Cache only has {len(self)} layers, attempted to access layer with index {layer_idx}")

KeyError: 'Cache only has 0 layers, attempted to access layer with index 0'
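The failing line in the traceback (`first_layer_past_key_value = past_key_values[0][0][:, :, :, 0]`) indexes layer 0 of the cache while the `DynamicCache` is still empty on that code path; the lookup in isolation reproduces the same error:

```python
from transformers.cache_utils import DynamicCache

cache = DynamicCache()  # fresh cache: no key/value tensors appended yet
print(len(cache))       # 0
cache[0]                # KeyError: 'Cache only has 0 layers, attempted to access layer with index 0'
```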

RaushanTurganbay (Llava Hugging Face org):

Hey! Yes, the last release had a bug, and I opened a PR here (https://github.com/huggingface/transformers/pull/32527). The fix will be included in the next patch release.

Thank you! May I know approximately when to expect the release? I would also be interested in any other versions I could try out. I want to work with open-source video-to-text models, so any resources are greatly appreciated.

RaushanTurganbay (Llava Hugging Face org):
β€’
edited Aug 15

The patch release should be out approximately this week :)

@RaushanTurganbay I've also got the same error. How to fix it?

RaushanTurganbay (Llava Hugging Face org):

@subashpoudel can you share your env with me? I guess you got one of the patch releases which had a bug. If you're installing from source, make sure to upgrade once more. We recently merged more stable modeling code.
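A side note for anyone else debugging this: `transformers-cli env` prints the full environment report maintainers usually ask for, or a short snippet like the one below covers the fields most relevant here (the field selection is just a suggestion):

```python
import platform

import torch
import transformers

# Print the version info most relevant to this bug report.
print("python      :", platform.python_version())
print("transformers:", transformers.__version__)
print("torch       :", torch.__version__)
print("cuda        :", torch.version.cuda, "| available:", torch.cuda.is_available())
```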

@RaushanTurganbay It's solved, sir. Thank you.
