arxiv:2407.14622

BOND: Aligning LLMs with Best-of-N Distillation

Published on Jul 19
· Submitted by piergs on Jul 23
Abstract

Reinforcement learning from human feedback (RLHF) is a key driver of quality and safety in state-of-the-art large language models. Yet a surprisingly simple and strong inference-time strategy is Best-of-N sampling, which selects the best generation among N candidates. In this paper, we propose Best-of-N Distillation (BOND), a novel RLHF algorithm that seeks to emulate Best-of-N but without its significant computational overhead at inference time. Specifically, BOND is a distribution matching algorithm that pushes the distribution of generations from the policy closer to the Best-of-N distribution. We use the Jeffreys divergence (a linear combination of forward and backward KL) to balance mode-covering and mode-seeking behavior, and derive an iterative formulation that utilizes a moving anchor for efficiency. We demonstrate the effectiveness of our approach and several design choices through experiments on abstractive summarization and Gemma models. Aligning Gemma policies with BOND outperforms other RLHF algorithms, improving results on several benchmarks.
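As a rough guide to the abstract, the objective can be sketched as a Jeffreys divergence between the trained policy and the Best-of-N distribution induced by a reference policy. The notation below (π for the trained policy, π_ref for the reference policy, r for the reward, β for the weighting) is an editorial assumption; the paper's exact convention may differ.

```latex
% Sketch of the distillation objective, not the paper's exact formulation.
% pi: trained policy, pi_ref: reference policy, r: reward model, beta in [0,1].
\pi_{\mathrm{BoN}}(\cdot \mid x)
  = \mathrm{Law}\Big(\operatorname*{arg\,max}_{y \in \{y_1,\dots,y_N\}} r(x,y)\Big),
  \qquad y_1,\dots,y_N \overset{\mathrm{iid}}{\sim} \pi_{\mathrm{ref}}(\cdot \mid x)

J_\beta(\pi) = (1-\beta)\,\mathrm{KL}\big(\pi_{\mathrm{BoN}} \,\|\, \pi\big)
             + \beta\,\mathrm{KL}\big(\pi \,\|\, \pi_{\mathrm{BoN}}\big)
```

The forward KL term is mode-covering and the backward KL term is mode-seeking; the abstract describes the objective as a linear combination of the two.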

Community

Paper author · Paper submitter

We present J-BOND 🕴️, a novel alignment method that steers the LLM towards the Best-of-N distribution via online distillation. This lets the aligned policy inherit the strong properties of Best-of-N sampling while requiring only a single sample at inference time.
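For concreteness, Best-of-N sampling at inference time looks roughly like the sketch below; `generate` and `score` are hypothetical stand-ins for an LLM sampler and a reward model, not APIs from the paper or from TRL.

```python
# Minimal sketch of Best-of-N sampling, the inference-time procedure J-BOND distills.
# `generate` and `score` are hypothetical callables (LLM sampler, reward model);
# nothing here comes from the paper's code.
def best_of_n_sample(prompt, generate, score, n=16):
    candidates = [generate(prompt) for _ in range(n)]       # N i.i.d. samples from the policy
    return max(candidates, key=lambda y: score(prompt, y))  # keep the highest-reward candidate
```

J-BOND aims to match the distribution of `best_of_n_sample` while paying only the cost of a single sample at inference time.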

To achieve this, J-BOND minimizes the Jeffreys divergence between the training policy and the Best-of-N distribution, trading off mode-covering (forward KL) and mode-seeking (backward KL) behavior to get the best of both divergences. Moreover, it implements an iterative distillation approach that distills the Best-of-N version of an Exponential Moving Average (EMA) anchor policy. This keeps sample complexity low and optimization stable while the policy continuously improves.
We demonstrate our design choices and overall approach on an abstractive summarization task and on fine-tuning Gemma. Aligning Gemma policies with J-BOND outperforms standard RLHF baselines, with improvements on several benchmarks.
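To make the moving-anchor idea concrete, here is a toy numerical sketch (an editorial construction, not the paper's implementation): a small categorical "policy" is pushed toward the exact Best-of-N distribution of an EMA anchor by gradient descent on a Jeffreys-style loss. The discrete setup, the hyperparameters, and taking the EMA in probability space are all simplifying assumptions.

```python
# Toy sketch of iterative Best-of-N distillation with a moving EMA anchor.
# Everything below is illustrative; it is not the paper's algorithm or code.
import numpy as np

rng = np.random.default_rng(0)
K, N, beta, lr, ema = 6, 4, 0.5, 0.5, 0.99   # outcomes, N, Jeffreys weight, step size, anchor decay

rewards = np.sort(rng.normal(size=K))        # distinct rewards, sorted ascending
theta = rng.normal(size=K)                   # logits of the categorical "policy"

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def best_of_n(p, n):
    """Exact distribution of the highest-reward draw among n i.i.d. samples from p,
    assuming categories are ordered by strictly increasing reward."""
    cdf = np.cumsum(p)
    return cdf**n - (cdf - p)**n

anchor = softmax(theta)                      # EMA anchor starts at the initial policy

for step in range(200):
    target = best_of_n(anchor, N)            # Best-of-N distribution of the current anchor
    pi = softmax(theta)
    # Gradient of J = (1-beta)*KL(target || pi) + beta*KL(pi || target) w.r.t. the logits
    # (standard softmax calculus; the weighting convention is an assumption).
    kl_back = np.sum(pi * (np.log(pi) - np.log(target)))
    grad = (1 - beta) * (pi - target) + beta * pi * (np.log(pi) - np.log(target) - kl_back)
    theta -= lr * grad
    anchor = ema * anchor + (1 - ema) * softmax(theta)   # moving anchor: EMA of the policy

print("expected reward - policy:", float(softmax(theta) @ rewards),
      "| anchor's BoN target:", float(best_of_n(anchor, N) @ rewards))
```

Because the anchor trails the improving policy, its Best-of-N target keeps moving up, so the policy can keep improving while each distillation step matches a slowly changing, stable target.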


Hi @piergs ,

Congrats on this new work! It would be cool to have it implemented in TRL (similar to DPO and other human preference tuning algorithms): https://github.com/huggingface/trl.

Let me know if I need to connect you with the team!

Cheers,
Niels
Open-source @ HF

