Please post f16 quantization.

by ZeroWw - opened

Please post f16 quantization.
Requantizing from f16 or f32 gives better results.
If you can, post them both.

I thought the original format was BF16.

Yes, but f16 (fp16) does not cause harm to the model. bf16 is way bigger.

Qwen org

BF16 and F16 should be identical in size

If you need the f32, I uploaded it here: https://huggingface.co/bartowski/Qwen2-7B-Instruct-GGUF/blob/main/Qwen2-7B-Instruct-f32.gguf

Hmm, maybe I got confused... I thought bf16 was way bigger than f16 (I know they are both 16-bit); perhaps I was tired and read it wrong.
Anyway, I have now posted my quantizations of Qwen 1.5 and Qwen2...

BF16 represents a larger range of values but is not bigger.
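
As an illustration, here is a minimal sketch (using PyTorch, which is an assumption here rather than part of the thread's tooling) showing that both formats occupy 16 bits per value, while bf16 covers a much wider representable range:

```python
import torch

# Both half-precision formats store 2 bytes per value;
# they differ only in how those 16 bits are split between exponent and mantissa.
for dtype in (torch.float16, torch.bfloat16):
    info = torch.finfo(dtype)
    bytes_per_value = torch.tensor([0.0], dtype=dtype).element_size()
    print(f"{str(dtype):>16}: {info.bits} bits, {bytes_per_value} bytes/value, max ≈ {info.max:.3e}")

# Expected output (approximate):
#    torch.float16: 16 bits, 2 bytes/value, max ≈ 6.550e+04
#   torch.bfloat16: 16 bits, 2 bytes/value, max ≈ 3.390e+38
```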


Got it. thanks.


On second thought, I checked and I don't agree: if I quantize to bf16 using llama.cpp I get a much bigger file than if I quantize to f16.
Perhaps it's because llama.cpp does a mixed quantization and keeps some tensors at f32...
Anyway, I see no degradation at pure f16.

That's llama.cpp doing it then; if you take a bf16 model and convert it to fp16, the model size stays identical.
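
If you want to check whether a given GGUF file keeps some tensors at F32 (which would explain a larger-than-expected size), here is a minimal sketch assuming the gguf Python package that ships with llama.cpp and its GGUFReader interface; the file path is hypothetical:

```python
from collections import Counter
from gguf import GGUFReader

# Hypothetical local path to the converted model.
reader = GGUFReader("Qwen2-7B-Instruct-f16.gguf")

# Count tensors by storage type; a mix of F16 and F32 entries
# would account for the size difference discussed above.
counts = Counter(t.tensor_type.name for t in reader.tensors)
for type_name, n in counts.most_common():
    print(f"{type_name}: {n} tensors")
```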
