Multimodal Tokenizer Question

#33
by Nano1337

Looking at the processor output, the image tokens in input_ids appear to be placed after the text tokens. Is this a conventional approach? I'm worried this could be a problem with lengthy texts: since the model's maximum context length is 1024 tokens, anything beyond that gets truncated, which could drop the image tokens entirely.

Is there any option to move the image tokens before the text tokens?
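If the processor doesn't support this directly, would a manual workaround like the sketch below be reasonable? It assumes a single image placeholder id (the `image_token_id` name is illustrative, not necessarily this model's actual API) and moves all image tokens to the front before right-side truncation, so truncation only ever drops trailing text:

```python
import torch

def image_first(input_ids: torch.Tensor, image_token_id: int,
                max_length: int = 1024) -> torch.Tensor:
    """Move all image placeholder tokens to the front, then truncate on
    the right, so truncation can only drop trailing text tokens."""
    is_image = input_ids == image_token_id
    reordered = torch.cat([input_ids[is_image], input_ids[~is_image]])
    return reordered[:max_length]

# Toy example: token id 9 stands in for the image placeholder.
ids = torch.tensor([1, 2, 3, 9, 9, 9])
print(image_first(ids, image_token_id=9))  # tensor([9, 9, 9, 1, 2, 3])
```

My one concern with this is that if the model was trained with the image-after-text layout, reordering at inference time would change the positional embeddings the image tokens see, so I'm not sure it's safe without the model having seen image-first sequences during training.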