The inconsistency between evaluation and training.

#76
by Yee111 - opened

Hello, I have a question.
I input the same image into the vision encoder in two ways: once with padding and once without. I found that the number of valid tokens in the output was inconsistent, and after passing through the connector, the output features were also different.
This made me think of a potential issue:
If the batch size is 1 during evaluation, the image doesn't require padding, and since nn.Conv2d performs floor division, the output of the ViT will be 45 × 40 = 1800 tokens.
However, during training, when padding is applied, a patch is considered valid as long as it contains even one valid pixel, which is equivalent to a ceiling division. This results in 46 × 41 = 1886 valid tokens.
Could this lead to inconsistency between training and evaluation?
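To make the arithmetic concrete, here is a minimal sketch of the two counts. The patch size of 14 (SigLIP-style) and the 635×565 image are assumptions chosen only to reproduce the 1800/1886 numbers above; the embedding dimension is arbitrary.

```python
import math
import torch
import torch.nn as nn

# Assumed values for illustration: patch size 14, 635x565 image.
patch_size = 14
height, width = 565, 635

# Batch size 1, no padding: the patch-embedding Conv2d floors the division.
patch_embed = nn.Conv2d(3, 32, kernel_size=patch_size, stride=patch_size)
out = patch_embed(torch.zeros(1, 3, height, width))
n_unpadded = out.shape[-2] * out.shape[-1]
print(n_unpadded)  # 40 * 45 = 1800 tokens

# Padded batch: a patch counts as valid if it contains at least one valid
# pixel, which amounts to a ceiling division.
n_padded = math.ceil(height / patch_size) * math.ceil(width / patch_size)
print(n_padded)  # 41 * 46 = 1886 valid tokens
```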

HuggingFaceM4 org

Exactly, there is a subtle difference: for the same input, you'll get a different output depending on whether it's processed in a batch (likely with padded examples) or as a single example.
This is why we used a batch size of 1 during the evaluations.
However, the difference in benchmark performance is small in practice, and neither strategy consistently outperforms the other.
