chrisliu298 committed
Commit 0f83530
1 Parent(s): 32e223b

Update README.md

Files changed (1)
  1. README.md +10 -9
README.md CHANGED
@@ -44,20 +44,21 @@ During dataset curation, we adopt several tricks to achieve both performance imp
 
  We evaluate our model on [RewardBench](https://huggingface.co/spaces/allenai/reward-bench) using the [official test script](https://github.com/allenai/reward-bench). As of September 2024, Skywork-Reward-Gemma-2-27B and Skywork-Reward-Llama-3.1-8B rank first and third on the RewardBench leaderboard.
 
- | Rank | Model | Chat | Chat Hard | Safety | Reasoning | Score |
- | :---: | --------------------------- | :---: | :-------: | :----: | :-------: | :---: |
- | 1 | Skywork-Reward-Gemma-2-27B | 95.8 | 91.4 | 92.0 | 96.1 | 93.8 |
- | 2 | SFR-LLaMa-3.1-70B-Judge-r | 96.9 | 84.8 | 92.2 | 97.6 | 92.8 |
- | 3 | Skywork-Reward-Llama-3.1-8B | 95.8 | 87.3 | 90.6 | 96.2 | 92.5 |
- | 4 | Nemotron-4-340B-Reward | 95.8 | 87.1 | 92.2 | 93.6 | 92.2 |
- | 5 | ArmoRM-Llama3-8B-v0.1 | 96.9 | 76.8 | 92.2 | 97.3 | 90.8 |
- | 6 | internlm2-20b-reward | 98.9 | 76.5 | 89.9 | 95.8 | 90.3 |
+ | Rank | Model | Chat | Chat Hard | Safety | Reasoning | Score |
+ | :---: | ------------------------------- | :---: | :-------: | :----: | :-------: | :---: |
+ | 1 | Skywork-Reward-Gemma-2-27B | 95.8 | 91.4 | 92.0 | 96.1 | 93.8 |
+ | 2 | SFR-LLaMa-3.1-70B-Judge-r | 96.9 | 84.8 | 92.2 | 97.6 | 92.8 |
+ | 3 | Skywork-Reward-Llama-3.1-8B | 95.8 | 87.3 | 90.6 | 96.2 | 92.5 |
+ | 4 | Nemotron-4-340B-Reward | 95.8 | 87.1 | 92.2 | 93.6 | 92.2 |
+ | 5 | ArmoRM-Llama3-8B-v0.1 | 96.9 | 76.8 | 92.2 | 97.3 | 90.8 |
+ | 6 | Salesforce/SFR-nemo-12B-Judge-r | 97.2 | 82.2 | 87.5 | 95.1 | 90.5 |
+ | 7 | internlm2-20b-reward | 98.9 | 76.5 | 89.9 | 95.8 | 90.3 |
 
  ## Demo Code
 
  We provide example usage of the Skywork reward model series below. Please note that:
 
- 1. We removed the BOS token from the chat templates of the two models to prevent it from being added twice during `apply_chat_template` and tokenization.
+ 1. We removed the BOS token from the chat templates of the two models to prevent it from being added twice during `apply_chat_template` and tokenization. **Therefore, please do not rely on `apply_chat_template` to add the BOS token.**
  2. To enable optimal performance for the 27B reward model, ensure that you have enabled either the `flash_attention_2` or `eager` implementation. The default `sdpa` implementation may result in bugs that could significantly degrade the model's performance for this particular model.
 
  Below is an example of obtaining the reward scores of two conversations.
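
The demo code itself lies below this hunk and is not shown in the diff. A rough sketch of what such a scoring example might look like follows, assuming the models expose a sequence-classification head that returns a single scalar reward, the repo id `Skywork/Skywork-Reward-Gemma-2-27B`, and `num_labels=1` (all assumptions on my part); `attn_implementation="flash_attention_2"` follows note 2, and the actual demo code in the README may differ:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

device = "cuda:0"
model_name = "Skywork/Skywork-Reward-Gemma-2-27B"  # assumed repo id

# Per note 2, avoid the default `sdpa` attention implementation for the 27B model.
model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map=device,
    attn_implementation="flash_attention_2",
    num_labels=1,
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

prompt = "Jane has 12 apples. She gives 4 to Mark. How many apples does she have left?"
good = [{"role": "user", "content": prompt},
        {"role": "assistant", "content": "Jane has 8 apples left: 12 - 4 = 8."}]
bad = [{"role": "user", "content": prompt},
       {"role": "assistant", "content": "Jane has 16 apples left."}]

# The reward score is the single logit of the classification head.
with torch.no_grad():
    for conv in (good, bad):
        input_ids = tokenizer.apply_chat_template(
            conv, tokenize=True, return_tensors="pt"
        ).to(device)
        score = model(input_ids).logits[0][0].item()
        print(score)
```

For a well-behaved reward model, the correct response should receive the higher score; the exact values depend on the model.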
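
Note 1 can likewise be illustrated with a small check, assuming the repo id `Skywork/Skywork-Reward-Llama-3.1-8B`; this verification is illustrative only and not part of the official demo:

```python
from transformers import AutoTokenizer

# Illustrative check only (not from the README): the templated string should not
# start with the BOS token, and tokenization should then add BOS exactly once.
model_name = "Skywork/Skywork-Reward-Llama-3.1-8B"  # assumed repo id
tokenizer = AutoTokenizer.from_pretrained(model_name)

conversation = [
    {"role": "user", "content": "What is the capital of France?"},
    {"role": "assistant", "content": "The capital of France is Paris."},
]

templated = tokenizer.apply_chat_template(conversation, tokenize=False)
assert not templated.startswith(tokenizer.bos_token)

input_ids = tokenizer(templated, return_tensors="pt").input_ids[0]
assert (input_ids == tokenizer.bos_token_id).sum().item() == 1
```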