zhilinw committed on
Commit 09f2d4c
Parent: 74f8f17

Update README.md

Files changed (1): README.md (+35 -2)
README.md CHANGED
@@ -24,7 +24,29 @@ For the same prompt, a response with higher reward score has higher quality than
 
 Llama-3.1-Nemotron-70B-Reward-HF has been converted from [Llama-3.1-Nemotron-70B-Reward](https://huggingface.co/nvidia/Llama-3.1-Nemotron-70B-Reward) to support it in the HuggingFace Transformers codebase. Please note that evaluation results might be slightly different from the [Llama-3.1-Nemotron-70B-Reward](https://huggingface.co/nvidia/Llama-3.1-Nemotron-70B-Reward) as evaluated in NeMo-Aligner, on which the evaluation results below are based.
 
-Try it for free at [build.nvidia.com](https://build.nvidia.com/nvidia/llama-3_1-nemotron-70b-reward) - it comes with an OpenAI-compatible API interface!
+Try hosted inference for free at [build.nvidia.com](https://build.nvidia.com/nvidia/llama-3_1-nemotron-70b-reward) - it comes with an OpenAI-compatible API interface, and simply signing up gets you 100k free API calls to this model.
+
+Using this reward model for RLHF (specifically, REINFORCE), we were able to tune a Llama-3.1-70B-Instruct model to reach [AlpacaEval 2 LC](https://tatsu-lab.github.io/alpaca_eval/) of 57.6, [Arena Hard](https://github.com/lmarena/arena-hard-auto) of 85.0 and [GPT-4-Turbo MT-Bench](https://github.com/lm-sys/FastChat/pull/3158) of 8.98, which are known to be predictive of [LMSys Chatbot Arena Elo](https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboard).
+
+As of 1 Oct 2024, this model is #1 on all three automatic alignment benchmarks, edging out strong frontier models such as GPT-4o and Claude 3.5 Sonnet.
+
+See details in our paper at [https://arxiv.org/abs/2410.01257](https://arxiv.org/abs/2410.01257) - as a preview, this model can correctly answer the question ```How many r in strawberry?``` without specialized prompting or additional reasoning tokens:
+
+```
+A sweet question!
+Let’s count the “R”s in “strawberry”:
+1. S
+2. T
+3. R
+4. A
+5. W
+6. B
+7. E
+8. R
+9. R
+10. Y
+There are **3 “R”s** in the word “strawberry”.
+```
 
 
 ## Terms of use
@@ -34,7 +56,7 @@ By accessing this model, you are agreeing to the Llama 3.1 terms and conditions
 
 ## RewardBench Primary Dataset LeaderBoard
 
-As of 30 Sept 2024, Llama-3.1-Nemotron-70B-Reward performs best Overall on RewardBench as well as with strong performance in Chat, Safety and Reasoning categories among the models below.
+As of 1 Oct 2024, Llama-3.1-Nemotron-70B-Reward performs best Overall on RewardBench, with strong performance in the Chat, Safety and Reasoning categories among the models below.
 
 | Model | Type of Data Used For Training | Overall | Chat | Chat-Hard | Safety | Reasoning |
 |:-----------------------------|:----------------|:-----|:----------|:-------|:----------|:-----------------------|
@@ -107,6 +129,16 @@ E-Mail: [Zhilin Wang](mailto:zhilinw@nvidia.com)
 If you find this model useful, please cite the following works
 
 ```bibtex
+@misc{wang2024helpsteer2preferencecomplementingratingspreferences,
+      title={HelpSteer2-Preference: Complementing Ratings with Preferences},
+      author={Zhilin Wang and Alexander Bukharin and Olivier Delalleau and Daniel Egert and Gerald Shen and Jiaqi Zeng and Oleksii Kuchaiev and Yi Dong},
+      year={2024},
+      eprint={2410.01257},
+      archivePrefix={arXiv},
+      primaryClass={cs.LG},
+      url={https://arxiv.org/abs/2410.01257},
+}
+
 @misc{wang2024helpsteer2,
       title={HelpSteer2: Open-source dataset for training top-performing reward models},
       author={Zhilin Wang and Yi Dong and Olivier Delalleau and Jiaqi Zeng and Gerald Shen and Daniel Egert and Jimmy J. Zhang and Makesh Narsimhan Sreedhar and Oleksii Kuchaiev},
@@ -119,6 +151,7 @@ If you find this model useful, please cite the following works
 
 ## References(s):
 
+* [HelpSteer2-Preference](https://arxiv.org/abs/2410.01257)
 * [HelpSteer2](https://arxiv.org/abs/2406.08673)
 * [HelpSteer](https://arxiv.org/abs/2311.09528)
 * [SteerLM method](https://arxiv.org/abs/2310.05344)
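The diff above advertises an OpenAI-compatible endpoint for this reward model. As a minimal sketch of what querying it could look like with the `openai` Python client - the base URL, model identifier, environment variable name, and reward-in-content response format are assumptions about the build.nvidia.com service, not details confirmed by this commit:

```python
# Minimal sketch, not from this commit: base URL, model id, env var name, and
# the reward arriving as message content are all assumptions.
import os

from openai import OpenAI

client = OpenAI(
    base_url="https://integrate.api.nvidia.com/v1",  # assumed NVIDIA endpoint
    api_key=os.environ["NVIDIA_API_KEY"],  # key from build.nvidia.com signup
)

# A reward model scores a finished exchange, so the candidate answer is sent
# as the assistant turn instead of being generated.
completion = client.chat.completions.create(
    model="nvidia/llama-3.1-nemotron-70b-reward",  # assumed model identifier
    messages=[
        {"role": "user", "content": "How many r in strawberry?"},
        {"role": "assistant", "content": "There are 3 r's in strawberry."},
    ],
)

print(completion.choices[0].message.content)  # assumed to carry the scalar reward
```

Scoring two candidate responses to the same prompt and comparing the returned scalars reproduces the ranking use case described at the top of the card.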
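The commit's headline claim is the REINFORCE-tuned Llama-3.1-70B-Instruct results. As a toy sketch of the REINFORCE policy-gradient objective itself - this simplifies away the production details (NeMo-Aligner, KL regularization, per-token credit assignment) and is not the actual training code:

```python
# Toy illustration of the REINFORCE objective, assuming PyTorch; the actual
# RLHF run used NeMo-Aligner and is far more involved.
import torch

def reinforce_loss(logprobs: torch.Tensor, rewards: torch.Tensor) -> torch.Tensor:
    """logprobs: summed log-probabilities of each sampled response, shape [B].
    rewards: reward-model scores for those responses, shape [B]."""
    advantage = rewards - rewards.mean()  # batch-mean baseline cuts variance
    # Optimizers minimize, so negate: descending this loss ascends the reward.
    return -(advantage * logprobs).mean()

# Fake numbers for four sampled responses to one prompt.
logprobs = torch.tensor([-12.3, -8.7, -15.1, -9.9], requires_grad=True)
rewards = torch.tensor([1.2, 3.4, -0.5, 2.1])
reinforce_loss(logprobs, rewards).backward()
print(logprobs.grad)  # mass shifts toward the responses the reward model prefers
```

Subtracting a batch-mean baseline leaves the expected gradient unchanged while reducing its variance, which is the standard trick that makes plain REINFORCE workable.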
 
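The context lines note that this checkpoint is the HuggingFace Transformers conversion and that, for the same prompt, a higher reward score means a higher-quality response. A sketch of local scoring, assuming the conversion exposes a single-logit sequence-classification head; the exact loading code may differ from the model card:

```python
# Sketch of local scoring. Assumption: the -HF conversion loads as a
# sequence-classification model whose single logit is the reward.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "nvidia/Llama-3.1-Nemotron-70B-Reward-HF"
model = AutoModelForSequenceClassification.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

def score(prompt: str, response: str) -> float:
    """Scalar reward for a (prompt, response) pair: higher means better."""
    messages = [
        {"role": "user", "content": prompt},
        {"role": "assistant", "content": response},
    ]
    input_ids = tokenizer.apply_chat_template(
        messages, tokenize=True, return_tensors="pt"
    )
    with torch.no_grad():
        return model(input_ids.to(model.device)).logits[0][0].item()

prompt = "How many r in strawberry?"
print(score(prompt, "There are 3 r's in strawberry.")
      > score(prompt, "There are 2 r's in strawberry."))  # expect True
```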