Error encountered when fine-tuning

#30
by yongleyuan - opened

When I use AutoProcessor.from_pretrained("meta-llama/Llama-3.2-11B-Vision") to process batches for fine-tuning, an error occurs with the following message:

RuntimeError: The expanded size of the tensor (4128) must match the existing size (3096) at non-singleton dimension 3.  Target sizes: [4, 16, 4128, 4128].  Tensor sizes: [4, 1, 3096, 3096]
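For context, the batches are produced roughly like this (a simplified sketch; the image files, prompt text, and padding options here are placeholders, not the exact training code):

```python
from PIL import Image
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("meta-llama/Llama-3.2-11B-Vision")

# Placeholder batch: the real data uses different images and prompts.
images = [Image.open("example_0.jpg"), Image.open("example_1.jpg")]
texts = ["<|image|>Describe the scene.", "<|image|>What is shown here?"]

# The processor returns input_ids, attention_mask, pixel_values,
# aspect_ratio_ids, aspect_ratio_mask, and cross_attention_mask,
# which are then passed to the model during fine-tuning.
batch = processor(images=images, text=texts, padding=True, return_tensors="pt")
```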

The traceback shows that the error originates in the forward function of modeling_mllama.py:

attn_output = F.scaled_dot_product_attention(query, key, value, attn_mask=attention_mask)
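As a toy illustration of the shape requirement (small sizes, not the actual model tensors): the attn_mask passed to scaled_dot_product_attention must be broadcastable to [batch, heads, q_len, kv_len], so a mask built for a shorter sequence than the query/key tensors triggers a size-mismatch RuntimeError like the one above.

```python
import torch
import torch.nn.functional as F

batch, heads, seq_len, head_dim = 2, 4, 8, 16
q = torch.randn(batch, heads, seq_len, head_dim)
k = torch.randn(batch, heads, seq_len, head_dim)
v = torch.randn(batch, heads, seq_len, head_dim)

ok_mask = torch.zeros(batch, 1, seq_len, seq_len)            # additive mask, matching length
bad_mask = torch.zeros(batch, 1, seq_len - 2, seq_len - 2)   # built for a shorter sequence

F.scaled_dot_product_attention(q, k, v, attn_mask=ok_mask)    # works
# F.scaled_dot_product_attention(q, k, v, attn_mask=bad_mask) # RuntimeError: size mismatch
```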

It looks like something is wrong with the attention mask, even though it comes directly from the loaded processor.
Any idea what causes the error and how I can fix it?
Happy to provide more details, thanks!
