Commit 346cd10 by Bazsalanszky (parent: c42a846): Change description

src/about.py CHANGED (+6 -3)
@@ -46,16 +46,19 @@ language and its structures.
 LLM_BENCHMARKS_TEXT = """
 ## How it works
 The benchmark is divided into several tasks, including: history, logic (testing the models' knowledge), grammar, sayings, spelling, and vocabulary (testing the models' language-understanding capabilities). Each task contains an instruction or question and a set of four possible answers. The model is given a system
-prompt, which aims to add CoT reasoning before providing an answer. This improves the results for most of the models, while also making the benchmark more consistent.
+prompt, which aims to add CoT reasoning before providing an answer. This improves the results for most of the models, while also making the benchmark more consistent. An answer is considered correct if it matches the correct answer in the set of possible answers. Each task is given to the model three times; if it answers correctly at least once, it is counted as correct. The final score is the number of correct answers divided by the number of tasks.
+
+To run the evaluation, we gave the model 2048 tokens to generate the answer and used a temperature of 0.0.
 
 ## Reproducing the results
 TODO
 
+## Evaluation
+In the current version of the benchmark, some models (ones that were most likely trained on Hungarian data) perform very well, while others (ones that were not) perform poorly. This may indicate that more challenging tasks should be added in the future to make the benchmark harder for models trained on Hungarian data. Please note that the benchmark is still in development and the results may change. Your feedback is highly appreciated.
 """
 
 EVALUATION_QUEUE_TEXT = """
-
-
+TODO
 """
 
 CITATION_BUTTON_LABEL = "Copy the following snippet to cite these results"
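The scoring rule described in the added text (three attempts per task, correct if any attempt matches, score = correct tasks / total tasks) can be sketched as follows. This is a minimal illustration, not the Space's actual evaluation code; the names `score_tasks` and `ask_model` are hypothetical.

```python
ATTEMPTS = 3  # each task is given to the model three times

def score_tasks(tasks, ask_model):
    """Score a benchmark run.

    tasks: list of (question, expected_answer) pairs.
    ask_model: callable that takes a question and returns the model's answer.
    """
    correct = 0
    for question, expected in tasks:
        # A task counts as correct if at least one of the three attempts
        # matches the expected answer (any() stops early on a match).
        if any(ask_model(question) == expected for _ in range(ATTEMPTS)):
            correct += 1
    # Final score: number of correct tasks divided by the number of tasks.
    return correct / len(tasks)

# Example with a stubbed "model" that answers from a fixed sequence:
answers = iter(["B", "A", "C", "D"])
stub = lambda question: next(answers)
print(score_tasks([("Q1", "A"), ("Q2", "D")], stub))  # -> 1.0
```

Note that because a single correct attempt out of three suffices, this metric is more forgiving than single-shot accuracy, which matches the stated goal of making the benchmark more consistent across runs.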