How to run FP8 inference

#5
by alfredplpl

According to your blog, this model supports FP8 inference:

> Mistral NeMo was trained with quantisation awareness, enabling FP8 inference without any performance loss.

How can I use this model with FP8 inference?

Same question.

Also, will there be an FP8 checkpoint?

vLLM can currently take any full-precision model and convert it to FP8 on load when the FP8 quantization flag is set. It's also possible to save the FP8 weights, but in my testing the FP8 conversion takes less than 10 seconds on 8B models, so converting at load time is not much of a burden.

I would assume that once vLLM adds Mistral NeMo support, it should be ready to load in FP8 out of the gate.
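For reference, here is a minimal sketch of what that on-load FP8 quantization looks like with vLLM's offline API, assuming a vLLM build that already supports Mistral NeMo, an FP8-capable GPU, and the `mistralai/Mistral-Nemo-Instruct-2407` repo as the model ID:

```python
# Sketch: load Mistral NeMo in vLLM with on-the-fly FP8 weight quantization.
# Assumes vLLM with Mistral NeMo support and a GPU with FP8 support.
from vllm import LLM, SamplingParams

llm = LLM(
    model="mistralai/Mistral-Nemo-Instruct-2407",
    quantization="fp8",   # quantize weights to FP8 at load time
    max_model_len=8192,   # optional: cap context length to fit smaller GPUs
)

params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["Explain FP8 inference in one sentence."], params)
print(outputs[0].outputs[0].text)
```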
