Why is the image size designed to be 384, whereas the patch size is designed to be 14, when 384 is not divisible by 14?

#4 by zhongyi1997cn - opened
import torch
import torch.nn as nn

class SigLipVisionEmbeddings(nn.Module):
    def __init__(self, config: SigLipVisionConfig):
        super().__init__()
        self.config = config
        self.embed_dim = config.hidden_size
        self.image_size = config.image_size
        self.patch_size = config.patch_size

        self.patch_embedding = nn.Conv2d(
            in_channels=config.num_channels,
            out_channels=self.embed_dim,
            kernel_size=self.patch_size,
            stride=self.patch_size,
            padding="valid",
        )

        self.num_patches = (self.image_size // self.patch_size) ** 2
        self.num_positions = self.num_patches
        self.position_embedding = nn.Embedding(self.num_positions, self.embed_dim)
        self.register_buffer("position_ids", torch.arange(self.num_positions).expand((1, -1)), persistent=False)

    def forward(self, pixel_values: torch.FloatTensor) -> torch.Tensor:
        patch_embeds = self.patch_embedding(pixel_values)  # shape = [*, width, grid, grid]
        embeddings = patch_embeds.flatten(2).transpose(1, 2)

        embeddings = embeddings + self.position_embedding(self.position_ids)
        return embeddings

According to the relevant SigLIP code in transformers, there is no padding in the convolution. Since 384 // 14 = 27 and 27 * 14 = 378, doesn't this mean that a 6-pixel strip along both the right and bottom edges is never used?
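The arithmetic behind the question can be checked with a minimal sketch of the standard "valid" convolution output-size formula (the helper name is illustrative, not from the SigLIP codebase):

```python
def conv_out(size: int, kernel: int, stride: int, padding: int = 0) -> int:
    # Output size of a convolution; the floor division is what drops
    # the leftover pixels when size is not a multiple of the stride.
    return (size + 2 * padding - kernel) // stride + 1

grid = conv_out(384, 14, 14)
print(grid)              # 27 patches per side
print(grid * 14)         # 378 pixels actually covered
print(384 - grid * 14)   # 6 pixels left over on the right/bottom
```

So the Conv2d produces a 27x27 grid (729 patches) covering only 378 of the 384 pixels in each dimension.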

Have you figured out the reason?

Google org

Hi, this is just an inattention mistake. The correct resolution would have been 378px or maybe 336px. But we were so used to the number 384 from past work that we mistakenly just defaulted to that :)

At the end of the day using 384 with /14 instead of 378 "loses" 6px on the right/bottom border, so it very likely has no practical impact, at least not worth re-training it.

How about re-training it with a /16 patch? At 384px, /16 gives the same sequence length as 336px with a /14 patch.
