---
license: apache-2.0
---
# Omni-Judge

## Introduction

Omni-Judge is an open-source mathematical evaluation model that assesses whether a model-generated solution is correct, given a problem and a reference answer. Because high-level mathematical problems and their solutions are complex, rule-based evaluation methods are hard to design. Like GPT-4-as-a-judge, Omni-Judge offers automated assessment, but with greater efficiency and lower cost. For usage details, see [Quickstart](#quickstart).

Omni-Judge can be applied to various mathematical reasoning benchmarks, such as our proposed [Omni-MATH](https://omni-math.github.io/).

## Model Details

Omni-Judge builds on `meta-llama/Meta-Llama-3-8B-Instruct` and is instruction-tuned on GPT-4o evaluation data. The training set comprises 21,451 examples, and the model was trained for 2 epochs. Its judgements align closely with GPT-4o's: on an internal evaluation set of 2,690 test samples, built from queries the model had not seen, Omni-Judge agrees with GPT-4o approximately 95% of the time.

## Quickstart


```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# load the model and tokenizer
model = AutoModelForCausalLM.from_pretrained(
    "KbsdJames/Omni-Judge", 
    device_map="auto", 
    torch_dtype=torch.bfloat16, 
)
tokenizer = AutoTokenizer.from_pretrained(
    "KbsdJames/Omni-Judge", 
    trust_remote_code=True
)

# set terminators for decoding
terminators = [
    tokenizer.eos_token_id,
    tokenizer.convert_tokens_to_ids("<|eot_id|>")
]

question = "Compute the smallest positive integer $n$ for which $\\sqrt{100+\\sqrt{n}}+\\sqrt{100-\\sqrt{n}}$ is an integer."
reference_answer = "6156"
student_solution = "Let's approach this step-by-step:\n\n1) Let's denote $\\sqrt{100+\\sqrt{n}}+\\sqrt{100-\\sqrt{n}} = k$, where k is an integer.\n\n2) Square both sides:\n   $(\\sqrt{100+\\sqrt{n}}+\\sqrt{100-\\sqrt{n}})^2 = k^2$\n\n3) Expand the left side:\n   $100+\\sqrt{n} + 100-\\sqrt{n} + 2\\sqrt{(100+\\sqrt{n})(100-\\sqrt{n})} = k^2$\n\n4) Simplify:\n   $200 + 2\\sqrt{10000-n} = k^2$\n\n5) Rearrange:\n   $\\sqrt{10000-n} = \\frac{k^2-200}{2}$\n\n6) For this to be true, $\\frac{k^2-200}{2}$ must be a perfect square. Let's call it $m^2$:\n   $\\frac{k^2-200}{2} = m^2$\n\n7) Multiply both sides by 2:\n   $k^2-200 = 2m^2$\n\n8) Rearrange:\n   $k^2 = 2m^2 + 200$\n\n9) For the smallest positive n, we need the smallest k > 14 (since $\\sqrt{100+\\sqrt{n}}+\\sqrt{100-\\sqrt{n}} > 2\\sqrt{100} = 20$) that makes this equation true.\n\n10) Testing values:\n    For k = 15: $15^2 = 225 = 2m^2 + 200$\n                $25 = 2m^2$\n                $m^2 = 12.5$ (not an integer)\n\n    For k = 16: $16^2 = 256 = 2m^2 + 200$\n                $56 = 2m^2$\n                $m^2 = 28$ (integer)\n\n11) So, the smallest valid k is 16, and m = $\\sqrt{28}$\n\n12) Now, going back to step 5:\n    $\\sqrt{10000-n} = \\frac{16^2-200}{2} = 28$\n\n13) Solve for n:\n    $10000-n = 28^2 = 784$\n    $n = 10000 - 784 = 9216$\n\nTherefore, the smallest positive integer n for which $\\sqrt{100+\\sqrt{n}}+\\sqrt{100-\\sqrt{n}}$ is an integer is 9216."

# pre-process
formatted_context = tokenizer.get_context(
    question,
    reference_answer,
    student_solution,
)
model_inputs = tokenizer(formatted_context, return_tensors="pt")
input_ids = model_inputs["input_ids"]
attention_mask = model_inputs["attention_mask"]

# do inference
pred = model.generate(
    input_ids=input_ids.to(model.device),
    attention_mask=attention_mask.to(model.device),
    do_sample=False,
    num_return_sequences=1,
    max_new_tokens=300,
)[0].cpu().tolist()

# post-process: drop the prompt tokens, keeping only the generated ones
pred = pred[input_ids.shape[1]:]
for terminator in terminators:
    if terminator in pred:
        pred = pred[:pred.index(terminator)]
response = tokenizer.decode(pred, skip_special_tokens=True)
pred_truth = tokenizer.parse_response(response)

# If response parsing fails, answer/judgement/justification will be None,
# which we count as a prediction error.
# In that case, sampling multiple responses may help.

print("answer:", pred_truth["answer"])
# >>> answer: 9216
print("judgement:", pred_truth["judgement"])
# >>> judgement: FALSE
print("justification:", pred_truth["justification"])
# >>> justification: The student's answer of 9216 is incorrect in the context of the problem, which asks for the smallest positive integer $\\(n\\)$ for which $\\(\\sqrt{100+\\sqrt{n}}+\\sqrt{100-\\sqrt{n}}\\)$ is an integer. The reference answer is 6156. The student's solution incorrectly calculates the value of \\(n\\) by incorrectly identifying the smallest integer value of \\(k\\) and then incorrectly solving for \\(n\\). The student's approach does not accurately capture the correct value of \\(n\\), which is 6156, as indicated by the reference answer. Therefore, the student's answer does not share the same meaning as the reference answer.
```
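
The comment in the Quickstart notes that sampling multiple responses may help when parsing fails. One way to combine several parsed responses is a majority vote, sketched below. This is only an illustration: `aggregate_judgements` is a hypothetical helper that is not part of Omni-Judge, and it assumes each parsed response is a dict shaped like `pred_truth` above, whose `"judgement"` is a string such as `"TRUE"` or `"FALSE"`, or `None` when parsing failed.

```python
from collections import Counter
from typing import Dict, List, Optional


def aggregate_judgements(parsed_responses: List[Dict]) -> Optional[str]:
    """Majority-vote over parsed responses (hypothetical helper).

    Responses whose judgement is None (parsing failed) are skipped.
    Returns None when nothing parsed or when the top two labels tie.
    """
    votes = [
        p["judgement"].strip().upper()
        for p in parsed_responses
        if p.get("judgement") is not None
    ]
    if not votes:
        return None

    ranked = Counter(votes).most_common()
    # A tie between the two most frequent labels is treated as undecided.
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None
    return ranked[0][0]


samples = [
    {"judgement": "FALSE"},
    {"judgement": None},  # a response that failed to parse
    {"judgement": "FALSE"},
    {"judgement": "TRUE"},
]
print(aggregate_judgements(samples))
```

Producing the samples would require running generation with `do_sample=True` several times, since greedy decoding as in the Quickstart returns the same response on every call.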



## Evaluation

Taking GPT-4o's judgements as the gold standard, we report the performance of Omni-Judge.

For a fair comparison, the training and test questions do not overlap.

The results are shown below:

| Source                          | Success of Parsing (%) | Consistency (%) |
| :-----------------------------: | :--------------------: | :-------------: |
| deepseek-coder-v2-lite-instruct | 100                | 95.08       |
| deepseek-math-7b-RL             | 99.55              | 94.20       |
| mathqwen-7b-Instruct            | 100                | 95.32       |
| mathqwen-72b-Instruct           | 99.78              | 94.65       |
| GPT-4o                          | 100                | 94.87       |
| claude_sonnet-3-5               | 100                | 93.54       |
| All                             | 99.89              | 94.61       |
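
For readers who want to run a similar comparison on their own data, the two columns can be computed roughly as sketched below. The exact definitions used for the table are not spelled out here, so this rests on assumptions: `score_against_gold` is a hypothetical helper, parsing success is taken as the share of responses with a non-`None` judgement, and consistency is taken as the share of all samples whose judgement matches the gold label, with unparsed responses counted as errors (in line with the note in the Quickstart).

```python
from typing import List, Optional, Tuple


def score_against_gold(
    predicted: List[Optional[str]],
    gold: List[str],
) -> Tuple[float, float]:
    """Return (parsing success %, consistency %) for a hypothetical eval run.

    predicted: judgements from the judge model, None where parsing failed.
    gold:      reference judgements (e.g. from GPT-4o), one per sample.
    """
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        raise ValueError("at least one sample is required")

    total = len(gold)
    parsed = sum(p is not None for p in predicted)
    # Assumption: an unparsed response counts as a disagreement.
    agree = sum(p is not None and p == g for p, g in zip(predicted, gold))
    return 100.0 * parsed / total, 100.0 * agree / total


parse_rate, consistency = score_against_gold(
    ["TRUE", "FALSE", None, "TRUE"],
    ["TRUE", "TRUE", "FALSE", "TRUE"],
)
print(f"parsing success: {parse_rate:.2f}%, consistency: {consistency:.2f}%")
```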



## Citation

If you find our work helpful, feel free to give a star to our repo.