Which one is the best?

#21
by akiokawahito - opened

Lower bit depths (2-bit to 6-bit):
Pros: Significant model size reduction, faster inference
Cons: Potential loss in accuracy, especially for complex tasks

Higher bit depths (8-bit, 16-bit):
Pros: Better preservation of model accuracy
Cons: Larger model size, potentially slower inference

Best option:
There's no universally "best" option, as the ideal choice depends on your specific use case, considering factors like the following (a rough selection sketch follows the list):

  1. Required model accuracy
  2. Inference speed requirements
  3. Storage/memory constraints
  4. Hardware capabilities
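
As a minimal sketch of how the storage/memory factor alone could drive the choice; the candidate bit depths, the headroom figure, and the ~12B-parameter example are illustrative assumptions, not a recommendation:

```python
# Rough heuristic: pick the highest bit depth whose estimated weight size
# still fits in the available VRAM, leaving headroom for activations.
# Bit depths and the headroom value are illustrative assumptions.
CANDIDATE_BITS = [16, 8, 6, 5, 4, 3, 2]  # highest quality first

def pick_bit_depth(params_billions: float, vram_gb: float,
                   headroom_gb: float = 1.5) -> int | None:
    """Return the largest bit depth whose weights fit in VRAM, or None."""
    for bits in CANDIDATE_BITS:
        weight_gb = params_billions * 1e9 * bits / 8 / 1e9
        if weight_gb + headroom_gb <= vram_gb:
            return bits
    return None

# Example: a ~12B-parameter model (roughly FLUX.1-dev sized) on a 12 GB card
print(pick_bit_depth(12.0, 12.0))  # -> 6 (12B * 6 bits ~= 9 GB plus headroom)
```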

Hi Akio,

Using a qualitative analysis of these quantization methods for Llama 3 8B as a reference, we can get a good idea of the best cost-benefit trade-off (https://github.com/ggerganov/llama.cpp/blob/master/examples/perplexity/README.md#llama-3-8b-scoreboard). I plotted a chart that identifies an ideal region beyond which further quantization yields diminishing returns. Please see below:

chart.png (quantization type vs. quality loss, based on the Llama 3 8B perplexity scoreboard)

This indicates that models 'Q6_K' and 'Q5_K_S' would yield the best compromise between performance and quality loss. However, ultimately, it will depend on what your system is capable of running.
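
For anyone who wants to reproduce a chart like this, here is a minimal sketch; the quantization labels are real llama.cpp types, but the size and perplexity values below are placeholders that should be replaced with the numbers from the linked scoreboard:

```python
# Sketch: plot file size vs. perplexity for a set of quantization types.
# The numeric values below are placeholders -- fill them in from the
# llama.cpp Llama 3 8B scoreboard linked above before drawing conclusions.
import matplotlib.pyplot as plt

quants = {
    # name: (file size in GiB, perplexity) -- placeholder values
    "Q4_K_S": (4.4, 0.0),
    "Q5_K_S": (5.2, 0.0),
    "Q5_K_M": (5.3, 0.0),
    "Q6_K":   (6.1, 0.0),
    "Q8_0":   (7.9, 0.0),
}

sizes = [v[0] for v in quants.values()]
ppls = [v[1] for v in quants.values()]

fig, ax = plt.subplots()
ax.plot(sizes, ppls, marker="o")
for name, (size, ppl) in quants.items():
    ax.annotate(name, (size, ppl), textcoords="offset points", xytext=(5, 5))
ax.set_xlabel("File size (GiB)")
ax.set_ylabel("Perplexity (lower is better)")
ax.set_title("Quantization: size vs. quality")
plt.show()
```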

Kind regards,

'Q6_K' and 'Q5_K_S'!
Thank you for your comparison graph.

I wonder where nf4 v2 would land on that chart?

Please, can anyone confirm or explain this:
Does the file size directly correspond to the amount of VRAM the models will take up?
In order to have the fastest inference possible, is the goal to have all the models loaded entirely into the GPU's VRAM?
Say, for 12 GB of VRAM...
1.) flux1-dev-Q4_K_S.gguf - 6.81 GB
2.) t5-v1_1-xxl-encoder-Q5_K_S.gguf - 3.29 GB
3.) clip_l.safetensors - 234 MB

Which makes a total of about 10.3 GB,
leaving roughly 1.5-1.7 GB of VRAM as room for inference calculations.
My monitor is connected to the iGPU, and the browser's hardware acceleration has been turned off.
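
As a rough sanity check of that arithmetic, a minimal sketch using the file sizes listed above; note that treating file size as the VRAM footprint is an approximation, since activations and framework overhead need additional headroom:

```python
# Rough VRAM budget check: sum of model file sizes vs. available VRAM.
# File size only approximates the memory the weights occupy; inference
# activations and framework overhead consume part of the remaining headroom.
model_files_gb = {
    "flux1-dev-Q4_K_S.gguf": 6.81,
    "t5-v1_1-xxl-encoder-Q5_K_S.gguf": 3.29,
    "clip_l.safetensors": 0.234,
}

vram_gb = 12.0
total_gb = sum(model_files_gb.values())
headroom_gb = vram_gb - total_gb

print(f"Total weights: {total_gb:.2f} GB")      # ~10.33 GB
print(f"Headroom:      {headroom_gb:.2f} GB")   # ~1.67 GB for activations
```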

@ViratX , this seems correct to me.
For the T5 model, I would use t5xxl_fp8_e4m3fn.safetensors instead. In my tests, I observed a noticeable reduction in prompt understanding and composition coherence with the Q5_K_S.gguf version. The difference in size between the fp8 and Q5_K_S is not that big.

I thought the CLIP models are loaded into RAM, or is that perhaps an option (Forge, for example, offers some options)?

@Gabriel-Filincowsky From the chart you've put together, it would seem that an even better option would be a Q5_K_M version of the model. It has noticeably better quality than Q5_K_S with only a slightly larger size. Why hasn't anyone made that version?

Comparing t5xxl_fp8_e4m3fn and t5-v1_1-xxl-encoder-Q8_0, which one is better? Thanks in advance.

@Danneil , I have not run tests comparing the fp8 and Q8 versions. I would assume they are similar, but I am not sure.
I did not test them because there is no file-size (and hence memory-allocation) advantage to justify it. And since the fp8 format has been around longer, I assume it has better support. However, it could be the case that the Q8 quantization yields better quality at the same size, which would be a reason to prefer it over the fp8.
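
To illustrate why the two sizes are so close, a back-of-the-envelope sketch: GGUF Q8_0 stores blocks of 32 int8 values plus one fp16 scale (34 bytes per 32 weights, i.e. 8.5 bits/weight on average), while fp8_e4m3fn is a flat 8 bits per weight. The parameter count below is an assumption for illustration, not the exact size of the T5-XXL encoder:

```python
# Back-of-the-envelope size comparison: fp8 vs. GGUF Q8_0.
# Q8_0: blocks of 32 int8 weights + one fp16 scale = 34 bytes / 32 weights
#       = 8.5 bits per weight on average.
# fp8_e4m3fn: a flat 8 bits per weight.
PARAMS = 4.8e9  # assumed parameter count, for illustration only

def size_gb(bits_per_weight: float, params: float = PARAMS) -> float:
    return params * bits_per_weight / 8 / 1e9

print(f"fp8_e4m3fn: ~{size_gb(8.0):.2f} GB")
print(f"Q8_0:       ~{size_gb(8.5):.2f} GB")
# The two differ by only ~6% in size, so the choice comes down to quality
# and software support rather than memory savings.
```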

See: https://github.com/city96/ComfyUI-GGUF/issues/68
