Please post f16 quantization.

by ZeroWw - opened

Please post f16 quantization.
Requantizing from f16 or f32 gives better results.
If you can, post them both.

I thought the original format was BF16.

Yes, but f16 (fp16) does not cause harm to the model. bf16 is way bigger.

Qwen org

BF16 and F16 should be identical in size

If you need the f32, I uploaded it here: https://huggingface.co/bartowski/Qwen2-7B-Instruct-GGUF/blob/main/Qwen2-7B-Instruct-f32.gguf

Hmm, maybe I got confused... I thought bf16 was way bigger than f16 (I know they are both 16-bit); perhaps I was tired and read it wrong.
Anyway, I have now posted my quantizations of Qwen 1.5 and Qwen2...

BF16 represents a larger range of values but is not bigger.
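
As an illustration, here is a minimal sketch (using PyTorch, which is an assumption here rather than part of the thread's tooling) showing that both formats occupy 16 bits per value, while bf16 covers a much wider representable range:

```python
import torch

# Both half-precision formats store 2 bytes per value;
# they differ only in how those 16 bits are split between exponent and mantissa.
for dtype in (torch.float16, torch.bfloat16):
    info = torch.finfo(dtype)
    bytes_per_value = torch.tensor([0.0], dtype=dtype).element_size()
    print(f"{str(dtype):>16}: {info.bits} bits, {bytes_per_value} bytes/value, max ≈ {info.max:.3e}")

# Expected output (approximate):
#    torch.float16: 16 bits, 2 bytes/value, max ≈ 6.550e+04
#   torch.bfloat16: 16 bits, 2 bytes/value, max ≈ 3.390e+38
```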


Got it. thanks.


On second thought, I checked and I don't agree: if I quantize to bf16 using llama.cpp I get a much bigger file than if I quantize to f16.
Perhaps it's because llama.cpp does a mixed quantization and keeps some tensors at f32...
Anyway, I see no degradation at pure f16.

That's llama.cpp doing it then; if you take a bf16 model and convert it to fp16, the model size stays identical.
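
If you want to check whether a given GGUF file keeps some tensors at F32 (which would explain a larger-than-expected size), here is a minimal sketch assuming the gguf Python package that ships with llama.cpp and its GGUFReader interface; the file path is hypothetical:

```python
from collections import Counter
from gguf import GGUFReader

# Hypothetical local path to the converted model.
reader = GGUFReader("Qwen2-7B-Instruct-f16.gguf")

# Count tensors by storage type; a mix of F16 and F32 entries
# would account for the size difference discussed above.
counts = Counter(t.tensor_type.name for t in reader.tensors)
for type_name, n in counts.most_common():
    print(f"{type_name}: {n} tensors")
```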
