The inconsistency between evaluation and training.

#76
by Yee111 - opened

Hello, I have a question.
I input the same image into the vision encoder in two ways: once with padding and once without. I found that the number of valid tokens in the output was inconsistent, and after passing through the connector, the output features were also different.
This made me think of a potential issue:
If the batch size is 1 during evaluation, the image doesn't require padding, and since nn.Conv2d performs floor division, the output of the ViT will be 45 × 40 = 1800 tokens.
However, during training, when padding is applied, a patch is considered valid as long as it contains even one valid pixel, which is equivalent to a ceiling division. This results in 46 × 41 = 1886 valid tokens.
Could this lead to inconsistency between training and evaluation?
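To make the arithmetic concrete, here is a minimal sketch of the two counts. The patch size of 14 (SigLIP-style) and the 635×565 image are assumptions chosen only to reproduce the 1800/1886 numbers above; the embedding dimension is arbitrary.

```python
import math
import torch
import torch.nn as nn

# Assumed values for illustration: patch size 14, 635x565 image.
patch_size = 14
height, width = 565, 635

# Batch size 1, no padding: the patch-embedding Conv2d floors the division.
patch_embed = nn.Conv2d(3, 32, kernel_size=patch_size, stride=patch_size)
out = patch_embed(torch.zeros(1, 3, height, width))
n_unpadded = out.shape[-2] * out.shape[-1]
print(n_unpadded)  # 40 * 45 = 1800 tokens

# Padded batch: a patch counts as valid if it contains at least one valid
# pixel, which amounts to a ceiling division.
n_padded = math.ceil(height / patch_size) * math.ceil(width / patch_size)
print(n_padded)  # 41 * 46 = 1886 valid tokens
```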

HuggingFaceM4 org

Exactly, there is a subtle difference: for the same input, you'll get a different output depending on whether it's processed in a batch (likely with padded examples) or as a single example.
This is why we used a batch size of 1 during the evaluations.
However, the difference in benchmark performance is small in practice, and neither strategy consistently outperforms the other.
