## Training
The base model is mistralai/Mistral-7B-Instruct-v0.2.
We use the training script at https://github.com/WeiXiongUST/RLHF-Reward-Modeling.
Thanks to Wei (https://huggingface.co/weqweasdas) for his help and contributions to the community.
## Usage
To use this model, load it with `AutoModelForSequenceClassification`:

```python
import torch
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "hendrydong/Mistral-RM-for-RAFT-GSHF-v0", num_labels=1, torch_dtype=torch.bfloat16
)
```
and prepare the input as a list of chat messages:

```python
sample = [
    {"role": "user", "content": "Hi!"},
    {"role": "assistant", "content": "How are you?"},
]
```
The chat template is the same as mistralai/Mistral-7B-Instruct-v0.2.
The reward model can be used for iterative SFT/DPO.
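For example, in a RAFT-style loop the reward model scores several sampled responses per prompt, and the highest-scoring one is kept for SFT (or the best/worst pair is kept for DPO). A minimal selection sketch, where the function names and the scored-candidate format are illustrative, not part of this repository:

```python
def select_for_sft(candidates):
    """Best-of-n: return the response with the highest reward score.

    `candidates` is a list of (response_text, reward_score) pairs,
    e.g. produced by running the reward model over sampled responses.
    """
    return max(candidates, key=lambda c: c[1])[0]

def select_for_dpo(candidates):
    """Return a (chosen, rejected) pair: best- vs. worst-scoring response."""
    ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
    return ranked[0][0], ranked[-1][0]

# Hypothetical reward scores for three sampled responses to one prompt.
candidates = [("Sure, here is...", 1.3), ("I don't know.", -0.7), ("Hello!", 0.2)]
print(select_for_sft(candidates))   # highest-reward response
print(select_for_dpo(candidates))   # (chosen, rejected) pair
```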
Please cite the following papers if you find this reward model helpful:
```bibtex
@article{dong2023raft,
  title={Raft: Reward ranked finetuning for generative foundation model alignment},
  author={Dong, Hanze and Xiong, Wei and Goyal, Deepanshu and Pan, Rui and Diao, Shizhe and Zhang, Jipeng and Shum, Kashun and Zhang, Tong},
  journal={arXiv preprint arXiv:2304.06767},
  year={2023}
}

@article{xiong2023gibbs,
  title={Gibbs sampling from human feedback: A provable kl-constrained framework for rlhf},
  author={Xiong, Wei and Dong, Hanze and Ye, Chenlu and Zhong, Han and Jiang, Nan and Zhang, Tong},
  journal={arXiv preprint arXiv:2312.11456},
  year={2023}
}
```