Dramatic Drop in Captioning Accuracy?

#6
by setothegreat - opened

Am I the only one currently experiencing this? The original Joy Caption space on here is incredibly accurate when it comes to captioning, rarely requiring modifications, and even when it does, they're usually minor.
By comparison, just about every image I attempted to caption with this script had numerous errors, primarily in small- and medium-sized details. Broader composition elements had a better success rate, but were still captioned less accurately than on the HF Space, and honestly the errors made the captions rather unusable.

My initial thought is that the images are being processed at a resolution far too small for the model to pick up fine details, but I don't know the specific differences between the HF Space implementation and the implementation here well enough to say for certain.

There is a feature that will scale down the image before processing it. Maybe that number should be set higher so more detail is retained?

I couldn't spot any down-scaling in the code other than the standard preprocessing of "google/siglip-so400m-patch14-384". Where is it?
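For reference, the resize target of that standard preprocessing can be checked directly. A minimal sketch, assuming the script uses the usual transformers processor API:

```python
from transformers import AutoProcessor

# The SigLIP checkpoint name already encodes its input resolution
# (patch14-384), but the processor config can be inspected to confirm it.
processor = AutoProcessor.from_pretrained("google/siglip-so400m-patch14-384")
print(processor.image_processor.size)  # {'height': 384, 'width': 384}
```

So unless the script resizes images somewhere else beforehand, everything ends up at 384x384 either way, same as the Space.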

One possible cause is that this repository uses a 4-bit quantized version of the Llama 3.1 model, while the Hugging Face Space uses the unquantized model. Also, in this example a temperature of 0.5 is used.
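For anyone unfamiliar, a 4-bit load through transformers typically looks something like this. A minimal sketch via bitsandbytes; the model id and exact arguments are assumptions for illustration and may differ from what this repo actually does:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit quantization via bitsandbytes -- the kind of load this repo
# reportedly uses, which trades accuracy for a much smaller VRAM footprint.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3.1-8B-Instruct",  # gated model; needs an HF token
    quantization_config=bnb_config,
    device_map="auto",
)
```

The temperature is a separate knob: it's passed to `generate()` at caption time (e.g. `temperature=0.5` with `do_sample=True`), so it can be tuned independently of the quantization.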

Good to know. The unquantized 8B model should be sufficient to run on my hardware at the least, so I might give that a try, play around with the temperature a bit later today, and report back.

Having run the unquantized 8B 3.1 model (or rather the Hermes fork of it, since I can't be bothered to modify the script to set up my HF token), I can now confirm that the caption accuracy is significantly higher. It's still not quite as high as the original HF Space implementation, particularly with regard to text, but it's the difference between a 2% usability rate with the quantized model and a 95% usability rate with the unquantized model.

So what steps should we take in the code/config to use the 8-bit instead of the 4-bit quantized version of Llama 3.1? Thanks
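Not the author, but if the script loads the model through transformers with a `BitsAndBytesConfig`, the change is usually just the quantization flag. A minimal sketch; the actual variable and argument names in this repo may differ:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 8-bit instead of 4-bit: swap the flag on the quantization config.
bnb_config = BitsAndBytesConfig(load_in_8bit=True)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3.1-8B-Instruct",  # or an ungated fork like the Hermes one above
    quantization_config=bnb_config,
    device_map="auto",
)

# To skip quantization entirely (what restored accuracy in the report above),
# drop quantization_config and load in half precision instead:
# model = AutoModelForCausalLM.from_pretrained(
#     "meta-llama/Meta-Llama-3.1-8B-Instruct",
#     torch_dtype=torch.bfloat16,
#     device_map="auto",
# )
```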
