---
license: mit
library_name: diffusers
---

# Ostris VAE - KL-f8-d16

A 16-channel VAE with 8x downsampling. Trained from scratch on a balanced mix of photos, artistic images, text, cartoons, and vector images.

It is lighter weight than most VAEs, with only 57,266,643 parameters (vs. the SD3 VAE's 83,819,683), which means it is faster and uses less VRAM, yet it scores quite similarly
on real images. Plus, it is MIT licensed, so you can do whatever you want with it.
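
Because it ships as a standard `diffusers` `AutoencoderKL`, you can sanity-check the f8/d16 layout directly: a 512x512 image encodes to a 16x64x64 latent (16 channels, 8x spatial downsample). Here is a minimal round-trip sketch; the input path `input.png` is a placeholder.

```py
import torch
from diffusers import AutoencoderKL
from diffusers.image_processor import VaeImageProcessor
from diffusers.utils import load_image

vae = AutoencoderKL.from_pretrained("ostris/vae-kl-f8-d16", torch_dtype=torch.float16).to("cuda")
processor = VaeImageProcessor()

# "input.png" is a placeholder; any RGB image works
image = load_image("input.png").resize((512, 512))
pixels = processor.preprocess(image).to("cuda", torch.float16)  # 1x3x512x512 in [-1, 1]

with torch.no_grad():
    latents = vae.encode(pixels).latent_dist.sample()
    print(latents.shape)  # torch.Size([1, 16, 64, 64]) -- 16 channels, 8x downsample
    recon = vae.decode(latents).sample

processor.postprocess(recon)[0].save("roundtrip.png")
```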

| VAE | PSNR (higher better) | LPIPS (lower better) | # params |
|----|----|----|----|
| sd-vae-ft-mse | 26.939 | 0.0581 | 83,653,863 |
| SDXL | 27.370 | 0.0540 | 83,653,863 |
| SD3 | 31.681 | 0.0187 | 83,819,683 |
| **Ostris KL-f8-d16** | **31.166** | **0.0198** | **57,266,643** |
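
For reference, PSNR and LPIPS can be computed as below. This is a generic sketch using the standard metric definitions and the `lpips` package with an AlexNet backbone; it is not the exact evaluation script, dataset, or backbone behind the numbers above.

```py
import torch
import lpips  # pip install lpips

def psnr(a: torch.Tensor, b: torch.Tensor) -> float:
    """PSNR in dB for images in [0, 1]: 10 * log10(MAX^2 / MSE), with MAX = 1."""
    mse = torch.mean((a - b) ** 2)
    return (10 * torch.log10(1.0 / mse)).item()

lpips_fn = lpips.LPIPS(net="alex")  # backbone choice is an assumption

# Placeholder batch: `original` in [0, 1], `recon` standing in for a VAE round-trip
original = torch.rand(1, 3, 512, 512)
recon = (original + 0.01 * torch.randn_like(original)).clamp(0, 1)

print(psnr(original, recon))
# LPIPS expects inputs scaled to [-1, 1]
print(lpips_fn(original * 2 - 1, recon * 2 - 1).mean().item())
```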

### Compare
Check out the comparison at [imgsli](https://imgsli.com/Mjc2MjA3).

### Use with SD1.5 (Diffusers)
```py
import torch
from diffusers import AutoencoderKL, StableDiffusionPipeline
from huggingface_hub import hf_hub_download
from safetensors.torch import load_file

model_id = "runwayml/stable-diffusion-v1-5"
decoder_id = "ostris/vae-kl-f8-d16"
adapter_id = "ostris/16ch-VAE-Adapters"
adapter_ckpt = "16ch-VAE-Adapter-SD15-alpha.safetensors"
dtype = torch.float16

# Load the 16-channel VAE and build the SD1.5 pipeline around it
vae = AutoencoderKL.from_pretrained(decoder_id, torch_dtype=dtype)
pipe = StableDiffusionPipeline.from_pretrained(model_id, vae=vae, torch_dtype=dtype)

# Download the adapter checkpoint, which bundles LoRA weights and new UNet in/out convs
ckpt_file = hf_hub_download(adapter_id, adapter_ckpt)
ckpt = load_file(ckpt_file)

# Split the checkpoint into LoRA weights and replacement UNet conv weights
lora_state_dict = {k: v for k, v in ckpt.items() if "lora" in k}
unet_state_dict = {k.replace("unet_", ""): v for k, v in ckpt.items() if "unet_" in k}

# Swap the UNet's 4-channel conv_in/conv_out for 16-channel versions,
# then load the adapter's weights into them
pipe.unet.conv_in = torch.nn.Conv2d(16, 320, 3, 1, 1)
pipe.unet.conv_out = torch.nn.Conv2d(320, 16, 3, 1, 1)
pipe.unet.load_state_dict(unet_state_dict, strict=False)
pipe.unet.conv_in.to(dtype)
pipe.unet.conv_out.to(dtype)
pipe.unet.config.in_channels = 16
pipe.unet.config.out_channels = 16

# Load the LoRA weights and merge them into the UNet
pipe.load_lora_weights(lora_state_dict)
pipe.fuse_lora()

pipe = pipe.to("cuda")
prompt = "a photo of an astronaut riding a horse on mars"
negative_prompt = (
    "ugly, tiling, poorly drawn hands, poorly drawn feet, poorly drawn face, out of frame, "
    "extra limbs, disfigured, deformed, body out of frame, bad anatomy, watermark, signature, "
    "cut off, low contrast, underexposed, overexposed, bad art, beginner, amateur, distorted face"
)
image = pipe(prompt, negative_prompt=negative_prompt).images[0]

image.save("astronaut_rides_horse.png")
```
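
The adapter works by swapping SD1.5's 4-channel latent interface for 16-channel `conv_in`/`conv_out` layers (their weights come from the checkpoint via the `strict=False` load), while the LoRA weights retune the rest of the UNet to the new latent space; `fuse_lora()` then merges them so inference runs at normal speed.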

### What do I do with this?

If you don't know, you probably don't need this. It is meant as an open-source, lighter-weight 16-channel VAE.
You would need to train it into a network before it is useful. I plan to do this myself for SD 1.5, SDXL, and possibly PixArt.
[Follow me on Twitter](https://x.com/ostrisai) to keep up with my work on that.

### Note: Not SD3 compatible
This VAE is not SD3 compatible, as it was trained from scratch and has an entirely different latent space.