arxiv:2407.14622

BOND: Aligning LLMs with Best-of-N Distillation

Published on Jul 19
· Submitted by piergs on Jul 23
Abstract

Reinforcement learning from human feedback (RLHF) is a key driver of quality and safety in state-of-the-art large language models. Yet a surprisingly simple and strong inference-time strategy is Best-of-N sampling, which selects the best generation among N candidates. In this paper, we propose Best-of-N Distillation (BOND), a novel RLHF algorithm that seeks to emulate Best-of-N but without its significant computational overhead at inference time. Specifically, BOND is a distribution matching algorithm that pushes the distribution of generations from the policy closer to the Best-of-N distribution. We use the Jeffreys divergence (a linear combination of forward and backward KL) to balance mode-covering and mode-seeking behavior, and derive an iterative formulation that utilizes a moving anchor for efficiency. We demonstrate the effectiveness of our approach and several design choices through experiments on abstractive summarization and Gemma models. Aligning Gemma policies with BOND outperforms other RLHF algorithms, improving results on several benchmarks.
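As a rough guide to the abstract, the objective can be sketched as a Jeffreys divergence between the trained policy and the Best-of-N distribution induced by a reference policy. The notation below (π for the trained policy, π_ref for the reference policy, r for the reward, β for the weighting) is an editorial assumption; the paper's exact convention may differ.

```latex
% Sketch of the distillation objective, not the paper's exact formulation.
% pi: trained policy, pi_ref: reference policy, r: reward model, beta in [0,1].
\pi_{\mathrm{BoN}}(\cdot \mid x)
  = \mathrm{Law}\Big(\operatorname*{arg\,max}_{y \in \{y_1,\dots,y_N\}} r(x,y)\Big),
  \qquad y_1,\dots,y_N \overset{\mathrm{iid}}{\sim} \pi_{\mathrm{ref}}(\cdot \mid x)

J_\beta(\pi) = (1-\beta)\,\mathrm{KL}\big(\pi_{\mathrm{BoN}} \,\|\, \pi\big)
             + \beta\,\mathrm{KL}\big(\pi \,\|\, \pi_{\mathrm{BoN}}\big)
```

The forward KL term is mode-covering and the backward KL term is mode-seeking; the abstract describes the objective as a linear combination of the two.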

Community

Paper author · Paper submitter

We present J-BOND 🕴️, a novel alignment method that steers the LLM towards the Best-of-N distribution via online distillation. This lets the aligned policy inherit the strong properties of Best-of-N sampling while requiring only a single sample at inference time.
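For concreteness, Best-of-N sampling at inference time looks roughly like the sketch below; `generate` and `score` are hypothetical stand-ins for an LLM sampler and a reward model, not APIs from the paper or from TRL.

```python
# Minimal sketch of Best-of-N sampling, the inference-time procedure J-BOND distills.
# `generate` and `score` are hypothetical callables (LLM sampler, reward model);
# nothing here comes from the paper's code.
def best_of_n_sample(prompt, generate, score, n=16):
    candidates = [generate(prompt) for _ in range(n)]       # N i.i.d. samples from the policy
    return max(candidates, key=lambda y: score(prompt, y))  # keep the highest-reward candidate
```

J-BOND aims to match the distribution of `best_of_n_sample` while paying only the cost of a single sample at inference time.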

To achieve this, J-BOND minimizes the Jeffreys divergence between the training policy and the Best-of-N distribution, trading off mode-covering (forward KL) and mode-seeking (backward KL) behavior to get the best of both divergences. Moreover, it implements an iterative distillation approach that distills the Best-of-N version of an Exponential Moving Average (EMA) anchor policy. This keeps sample complexity low and optimization stable while the policy continuously improves.
We demonstrate our design choices and overall approach on an abstractive summarization task and on fine-tuning Gemma. Aligning Gemma policies with J-BOND outperforms standard RLHF baselines, with improvements on several benchmarks.
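To make the moving-anchor idea concrete, here is a toy numerical sketch (an editorial construction, not the paper's implementation): a small categorical "policy" is pushed toward the exact Best-of-N distribution of an EMA anchor by gradient descent on a Jeffreys-style loss. The discrete setup, the hyperparameters, and taking the EMA in probability space are all simplifying assumptions.

```python
# Toy sketch of iterative Best-of-N distillation with a moving EMA anchor.
# Everything below is illustrative; it is not the paper's algorithm or code.
import numpy as np

rng = np.random.default_rng(0)
K, N, beta, lr, ema = 6, 4, 0.5, 0.5, 0.99   # outcomes, N, Jeffreys weight, step size, anchor decay

rewards = np.sort(rng.normal(size=K))        # distinct rewards, sorted ascending
theta = rng.normal(size=K)                   # logits of the categorical "policy"

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def best_of_n(p, n):
    """Exact distribution of the highest-reward draw among n i.i.d. samples from p,
    assuming categories are ordered by strictly increasing reward."""
    cdf = np.cumsum(p)
    return cdf**n - (cdf - p)**n

anchor = softmax(theta)                      # EMA anchor starts at the initial policy

for step in range(200):
    target = best_of_n(anchor, N)            # Best-of-N distribution of the current anchor
    pi = softmax(theta)
    # Gradient of J = (1-beta)*KL(target || pi) + beta*KL(pi || target) w.r.t. the logits
    # (standard softmax calculus; the weighting convention is an assumption).
    kl_back = np.sum(pi * (np.log(pi) - np.log(target)))
    grad = (1 - beta) * (pi - target) + beta * pi * (np.log(pi) - np.log(target) - kl_back)
    theta -= lr * grad
    anchor = ema * anchor + (1 - ema) * softmax(theta)   # moving anchor: EMA of the policy

print("expected reward - policy:", float(softmax(theta) @ rewards),
      "| anchor's BoN target:", float(best_of_n(anchor, N) @ rewards))
```

Because the anchor trails the improving policy, its Best-of-N target keeps moving up, so the policy can keep improving while each distillation step matches a slowly changing, stable target.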


Hi @piergs ,

Congrats on this new work! It would be cool to have it implemented in TRL (similar to DPO and other human preference tuning algorithms): https://github.com/huggingface/trl.

Let me know if I need to connect you with the team!

Cheers,
Niels
Open-source @ HF

