mo137
/

FLUX.1-dev_Q8-fp16-fp32-mix_8-to-32-bpw_gguf


	---
	base_model: black-forest-labs/FLUX.1-dev
	library_name: gguf
	license: other
	license_name: flux-1-dev-non-commercial-license
	license_link: LICENSE.md
	quantized_by: mo137
	tags:
	- text-to-image
	- image-generation
	- flux
	---

	Flux.1-dev in a few experimental custom formats, mixing tensors in Q8_0, fp16, and fp32.
	Converted from black-forest-labs' original bf16 weights.

	### Motivation
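	Flux's weights were published in bf16.
	Conversion to fp16 is slightly lossy: fp16 has a narrower exponent range than bf16, so very small values lose precision or round to zero, and very large ones overflow. Conversion to fp32 is lossless, because every bf16 value is exactly representable in fp32.
	I experimented with mixed tensor formats to see if it would improve quality.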
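The difference between the two conversions can be checked on a few bit patterns. The following is a minimal sketch using only Python's standard library; the helper names are made up for this illustration and are not part of any conversion tool used for these files.

```python
import struct

def bf16_to_float(bits: int) -> float:
    """Decode a 16-bit bf16 pattern. bf16 is the upper 16 bits of an fp32,
    so padding with 16 zero bits gives exactly the same value in fp32."""
    return struct.unpack("<f", struct.pack("<I", bits << 16))[0]

def through_fp16(x: float) -> float:
    """Round x to fp16 and decode it again."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        # struct refuses finite values beyond fp16's range (max 65504);
        # a real converter would typically produce +/-inf or clamp instead.
        return float("inf") if x > 0 else float("-inf")

samples = {
    "1.0": 0x3F80,     # well inside fp16's range
    "2**-30": 0x3080,  # below fp16's smallest subnormal (2**-24)
    "2**17": 0x4800,   # above fp16's largest finite value
}
for label, bits in samples.items():
    value = bf16_to_float(bits)
    print(f"{label:>7}: fp32 keeps {value!r}, fp16 gives {through_fp16(value)!r}")
```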

	### Evaluation
	I tried comparing the outputs, but I can't say with any certainty whether these models are significantly better than pure Q8_0.
	You're probably better off using Q8_0, but I thought I'd share these; maybe someone will find them useful.

	Higher bits per weight (bpw) numbers result in slower computation:
	```
	20 s Q8_0
	23 s 11.0bpw-txt16
	30 s fp16
	37 s 16.4bpw-txt32
	310 s fp32
	```

	In the txt16/32 files, I quantized only these layers to Q8_0, unless they were one-dimensional:
	```
	img_mlp.0
	img_mlp.2
	img_mod.lin
	linear1
	linear2
	modulation.lin
	```
	But I left all of these at fp16 (in the txt16 files) or fp32 (in the txt32 files):
	```
	txt_mlp.0
	txt_mlp.2
	txt_mod.lin
	```
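The selection rule described above can be sketched roughly as follows. This is not the script used to produce the files: `pick_format` is a hypothetical helper, matching by substring is an assumption, and the tensor names in the usage lines are illustrative. Tensors not covered by either list return `None`, since their handling is not described here.

```python
# Layer name fragments taken from the two lists above.
Q8_LAYERS = ("img_mlp.0", "img_mlp.2", "img_mod.lin",
             "linear1", "linear2", "modulation.lin")
TXT_LAYERS = ("txt_mlp.0", "txt_mlp.2", "txt_mod.lin")

def pick_format(tensor_name: str, n_dims: int, variant: str):
    """Return the target format for one tensor, or None if the rule
    described in this README does not say what happens to it."""
    float_type = {"txt16": "F16", "txt32": "F32"}[variant]
    if any(key in tensor_name for key in TXT_LAYERS):
        return float_type
    if any(key in tensor_name for key in Q8_LAYERS) and n_dims > 1:
        return "Q8_0"
    return None  # one-dimensional or unlisted tensor: not specified here

print(pick_format("double_blocks.0.img_mlp.0.weight", 2, "txt16"))  # Q8_0
print(pick_format("double_blocks.0.img_mlp.0.bias", 1, "txt16"))    # None
print(pick_format("double_blocks.0.txt_mlp.2.weight", 2, "txt32"))  # F32
```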
	The resulting bpw numbers are only approximations derived from file size.
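As a rough illustration of such an estimate, bits per weight can be computed as file size in bits divided by the number of weights. The sketch below assumes the whole file is weight data; the function name and the example numbers are hypothetical, not measurements of these files.

```python
def approx_bpw(file_size_bytes: int, n_weights: int) -> float:
    """Average bits per weight, treating the whole file as weight data.
    GGUF metadata is counted too, so the result is slightly too high."""
    return file_size_bytes * 8 / n_weights

# Q8_0 stores blocks of 32 int8 weights plus one fp16 scale:
# 34 bytes per 32 weights.
Q8_0_BPW = 34 * 8 / 32  # 8.5

# Hypothetical numbers for illustration only.
print(approx_bpw(16_500_000_000, 12_000_000_000))  # 11.0
```

This also shows why a pure Q8_0 file comes out at about 8.5 bpw rather than exactly 8.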

	---

	This is a direct GGUF conversion of [black-forest-labs/FLUX.1-dev](https://huggingface.co/black-forest-labs/FLUX.1-dev/tree/main).

	As this is a quantized model, not a finetune, all the same restrictions/original license terms still apply.

	The model files can be used with the [ComfyUI-GGUF](https://github.com/city96/ComfyUI-GGUF) custom node.

	Place model files in `ComfyUI/models/unet` - see the GitHub readme for further install instructions.
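If you prefer to script that step, a minimal sketch could look like this. `install_gguf` is a hypothetical helper and the paths in the example call are placeholders; it only copies a file you have already downloaded.

```python
import shutil
from pathlib import Path

def install_gguf(model_file: str, comfyui_root: str) -> Path:
    """Copy a downloaded .gguf file into ComfyUI/models/unet
    and return the path of the copy."""
    target_dir = Path(comfyui_root) / "models" / "unet"
    target_dir.mkdir(parents=True, exist_ok=True)
    return Path(shutil.copy2(model_file, target_dir))

# Example call with placeholder paths:
# install_gguf("downloads/model.gguf", "ComfyUI")
```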

	Please refer to [this chart](https://github.com/ggerganov/llama.cpp/blob/master/examples/perplexity/README.md#llama-3-8b-scoreboard) for a basic overview of quantization types.

	(Model card mostly copied from [city96/FLUX.1-dev-gguf](https://huggingface.co/city96/FLUX.1-dev-gguf) - which contains conventional and useful GGUF files.)