What are the differences between yours and Meta's official one?

#2 opened by c6sneaky

Here is the link to the official fp8 quant: https://huggingface.co/meta-llama/Meta-Llama-3.1-405B-Instruct-FP8

Neural Magic org

Meta skipped quantization of the QKV/output projection matrices in every layer, and skipped the first and last layers entirely. This breaks down to:

  • Meta FP8: 325B out of 410B params quantized (80%)
  • NM FP8: 406B out of 410B params quantized (99%)

This gives the NM checkpoint 99.9% accuracy recovery and roughly 80 GB of additional memory savings over Meta's FP8, since each of the ~81B extra quantized parameters drops from 2 bytes (BF16) to 1 byte (FP8); a quick check is sketched below.
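For intuition, here is a back-of-the-envelope check of that figure, using the parameter counts from the list above and assuming BF16 weights (2 bytes/param) are replaced by FP8 (1 byte/param):

```python
# Rough check of the ~80 GB savings quoted above.
# Assumption: each quantized parameter goes from 2 bytes (BF16) to 1 byte (FP8).
meta_quantized = 325e9  # params quantized in Meta's FP8 checkpoint
nm_quantized = 406e9    # params quantized in NM's FP8 checkpoint

extra_params = nm_quantized - meta_quantized   # ~81e9 additional params in FP8
bytes_saved = extra_params * (2 - 1)           # 1 byte saved per extra param
print(f"~{bytes_saved / 1e9:.0f} GB saved vs Meta's FP8")  # -> ~81 GB
```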
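For concreteness, a minimal sketch of how the two coverage choices could be expressed as LLM Compressor recipes via the `ignore` list. The import paths follow the llm-compressor README quickstart and may differ between versions; the Llama module names and the 126-layer indexing are assumptions, and neither recipe is claimed to be the exact one either team used:

```python
from llmcompressor.transformers import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_ID = "meta-llama/Meta-Llama-3.1-405B-Instruct"  # illustrative

# NM-style coverage: quantize every Linear layer except the output head,
# which is how ~99% of the parameters end up in FP8.
nm_recipe = QuantizationModifier(
    targets="Linear",
    scheme="FP8_DYNAMIC",
    ignore=["lm_head"],
)

# Meta-style coverage, approximated via the ignore list: additionally skip
# the attention QKV/output projections in every layer, plus everything in
# the first and last decoder layers (indices 0 and 125 assumed for a
# 126-layer model).
meta_like_recipe = QuantizationModifier(
    targets="Linear",
    scheme="FP8_DYNAMIC",
    ignore=[
        "lm_head",
        r"re:.*q_proj", r"re:.*k_proj", r"re:.*v_proj", r"re:.*o_proj",
        r"re:.*layers\.0\..*", r"re:.*layers\.125\..*",
    ],
)

# One-shot weight quantization; swap in meta_like_recipe to compare coverage.
oneshot(model=MODEL_ID, recipe=nm_recipe, output_dir="Llama-3.1-405B-Instruct-FP8")
```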

Hi guys! Thank you for your work.

Meta used FBGEMM (https://github.com/pytorch/FBGEMM) and you used LLM Compressor (https://github.com/vllm-project/llm-compressor). I haven’t done extensive research, but could you clarify the main differences in their quantization procedures?
