Question about your quantization method

#1
by rollercoasterX - opened

Hello ZeroWw,

I've been using the method that you posted about somewhere else (a couple months ago) for my local quants:
--output-tensor-type f16 --token-embedding-type f16
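In case it helps, this is roughly how I pass those flags to the llama.cpp quantization tool (just a sketch from my setup: the llama-quantize binary name, file names and the Q5_K_M target are placeholders, and older builds call the tool quantize):

# keep output.weight and token embeddings at f16, quantize the rest to the chosen type
llama-quantize --output-tensor-type f16 --token-embedding-type f16 model-f16.gguf model-q5_k_m.gguf Q5_K_M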

Today I noticed that there are some other options as well:

--leave-output-tensor: Will leave output.weight un(re)quantized. Increases model size but may also increase quality, especially when requantizing

--pure: Disable k-quant mixtures and quantize all tensors to the same type
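If I'm reading the help text right, they would be used roughly like this (again just a sketch; binary name, file names and quant types are placeholders from my own setup):

# leave output.weight unquantized instead of forcing a specific type for it
llama-quantize --leave-output-tensor model-f16.gguf model-q5_k_m.gguf Q5_K_M
# quantize every tensor to the same type, disabling the k-quant mixture heuristics
llama-quantize --pure model-f16.gguf model-q4_k_m.gguf Q4_K_M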

I'm curious to know what your thoughts are on those options, if you don't mind me asking?

Thank you.

Well... in my quants there is a pure q8 for comparison.
--leave-output-tensor is quite useless here because it just leaves output.weight at f16, so it's the same as doing --output-tensor-type f16.
I usually first convert to f16 using convert.py and then quantize, so the base is always f16. The exception is the q8q4, which is quite cool: output and embeddings at q8 and the rest at q4... it works pretty well.
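
Roughly, the workflow looks like this (a sketch only: the converter script name and flags vary between llama.cpp versions, newer builds use convert_hf_to_gguf.py, and the exact q4/q8 variants below are examples, not necessarily the ones I ship):

# 1) convert the HF model to an f16 GGUF so the quantization base is always f16
python convert.py /path/to/hf-model --outtype f16 --outfile model-f16.gguf
# 2) the "q8q4" recipe: output and embeddings at q8_0, everything else at a q4 type
llama-quantize --output-tensor-type q8_0 --token-embedding-type q8_0 model-f16.gguf model-q8q4.gguf Q4_K_M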

Thank you @ZeroWw, that's great to know. I appreciate your help!
