mapo-beta / README.md
sayakpaul's picture
sayakpaul HF staff
Update README.md
7d6af58 verified
metadata
license: openrail++
library_name: diffusers
tags:
  - text-to-image
  - text-to-image
  - diffusers-training
  - diffusers
  - stable-diffusion-xl
  - stable-diffusion-xl-diffusers
base_model: stabilityai/stable-diffusion-xl-base-1.0

Margin-aware Preference Optimization for Aligning Diffusion Models without Reference


We propose MaPO, a reference-free, sample-efficient, memory-friendly alignment technique for text-to-image diffusion models. For more details on the technique, please refer to our paper here.

Developed by

  • Jiwoo Hong* (KAIST AI)
  • Sayak Paul* (Hugging Face)
  • Noah Lee (KAIST AI)
  • Kashif Rasul (Hugging Face)
  • James Thorne (KAIST AI)
  • Jongheon Jeong (Korea University)

Dataset

This model was fine-tuned from Stable Diffusion XL on the yuvalkirstain/pickapic_v2 dataset.

Training Code

Refer to our code repository here.

Qualitative Comparison

Results

Below we report some quantitative metrics and use them to compare MaPO to existing models:

Average score for Aesthetic, HPS v2.1, and PickScore
Aesthetic HPS v2.1 Pickscore
SDXL 6.03 30.0 22.4
SFTChosen 5.95 29.6 22.0
Diffusion-DPO 6.03 31.1 22.7
MaPO (Ours) 6.17 31.2 22.5

We evaluated this checkpoint in the Imgsys public benchmark. MaPO was able to outperform or match 21 out of 25 state-of-the-art text-to-image diffusion models by ranking 7th on the leaderboard at the time of writing, compared to Diffusion-DPO’s 20th place, while also consuming 14.5% less wall-clock training time on adapting Pick-a-Pic v2. We appreciate the imgsys team for helping us get the human preference data.

The table below reports memory efficiency of MaPO, making it a better alternative for alignment fine-tuning of diffusion models:

Computational costs of Diffusion-DPO and MaPO
Diffusion-DPO MaPO (Ours)
Time (↓) 63.5 54.3 (-14.5%)
GPU Mem. (↓) 55.9 46.1 (-17.5%)
Max Batch (↑) 4 16 (×4)

Inference

from diffusers import DiffusionPipeline, AutoencoderKL, UNet2DConditionModel
import torch 

sdxl_id = "stabilityai/stable-diffusion-xl-base-1.0"
vae_id = "madebyollin/sdxl-vae-fp16-fix"
unet_id = "mapo-t2i/mapo-beta"

vae = AutoencoderKL.from_pretrained(vae_id, torch_dtype=torch.float16)
unet = UNet2DConditionModel.from_pretrained(unet_id, torch_dtype=torch.float16)
pipeline = DiffusionPipeline.from_pretrained(sdxl_id, vae=vae, unet=unet, torch_dtype=torch.float16).to("cuda")

prompt = "An abstract portrait consisting of bold, flowing brushstrokes against a neutral background."
image = pipeline(prompt=prompt, num_inference_steps=30).images[0]

For qualitative results, please visit our project website.

Citation

@misc{hong2024marginaware,
  title={Margin-aware Preference Optimization for Aligning Diffusion Models without Reference}, 
  author={Jiwoo Hong and Sayak Paul and Noah Lee and Kashif Rasul and James Thorne and Jongheon Jeong},
  year={2024},
  eprint={2406.06424},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}