fp8 / int8 inference - use bitsandbytes or awq

#8
by dtanow - opened

Excellent work on distilling a smaller model from Llama 3.1 70B! What is the recommended way to run evaluation at fp8 / int8 precision without converting to TensorRT-LLM? FP8 is mentioned in the blog post, but I can't find any information here on Hugging Face. I would appreciate some tips.
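
For reference, this is the kind of int8 path I have in mind — a minimal sketch using the standard transformers + bitsandbytes route (the `model_id` below is just a placeholder for the checkpoint in this repo, and I'm assuming the usual `BitsAndBytesConfig` loading works here):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Placeholder -- substitute the actual checkpoint id from this repo.
model_id = "path/to/distilled-llama-3.1"

# int8 weight quantization via bitsandbytes (LLM.int8()).
bnb_config = BitsAndBytesConfig(load_in_8bit=True)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
    torch_dtype=torch.bfloat16,  # dtype for the non-quantized modules
)

prompt = "Explain knowledge distillation in one sentence."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Is something along these lines (or an AWQ checkpoint) the intended way, or is there an FP8 recipe outside of TensorRT-LLM?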
