How to run FP8 inference

#5
by alfredplpl

According to your blog, this model supports FP8 inference:

> Mistral NeMo was trained with quantisation awareness, enabling FP8 inference without any performance loss.

How can I use this model with FP8 inference?

Same question.

Also, will there be an FP8 checkpoint?

vLLM can currently take any full-precision model and convert it to FP8 on load when the FP8 quantization flag is set. It's also possible to save the FP8 weights, but in my testing the FP8 conversion takes less than 10 seconds on 8B models, so converting at load time is not much of a burden.

I would assume that once vLLM adds Mistral NeMo support, it should be ready to load in FP8 out of the gate.
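For reference, here is a minimal sketch of what that on-load FP8 quantization looks like with vLLM's offline API, assuming a vLLM build that already supports Mistral NeMo, an FP8-capable GPU, and the `mistralai/Mistral-Nemo-Instruct-2407` repo as the model ID:

```python
# Sketch: load Mistral NeMo in vLLM with on-the-fly FP8 weight quantization.
# Assumes vLLM with Mistral NeMo support and a GPU with FP8 support.
from vllm import LLM, SamplingParams

llm = LLM(
    model="mistralai/Mistral-Nemo-Instruct-2407",
    quantization="fp8",   # quantize weights to FP8 at load time
    max_model_len=8192,   # optional: cap context length to fit smaller GPUs
)

params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["Explain FP8 inference in one sentence."], params)
print(outputs[0].outputs[0].text)
```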
