Commit 346cd10 by Bazsalanszky (parent: c42a846): Change description

src/about.py CHANGED (+6 -3)
@@ -46,16 +46,19 @@ language and its structures.
 LLM_BENCHMARKS_TEXT = """
 ## How it works
 The benchmark is divided into several tasks, including: history, logic (testing the models' knowledge), grammar, sayings, spelling, and vocabulary (testing the models' language-understanding capabilities). Each task contains an instruction or question and a set of four possible answers. The model is given a system
-prompt, which aims to add CoT reasoning before providing an answer. This improves the results for most of the models, while also making the benchmark more consistent.
+prompt, which aims to add CoT reasoning before providing an answer. This improves the results for most of the models, while also making the benchmark more consistent. An answer is considered correct if it matches the correct answer in the set of possible answers. Each task is given to the model three times; if it answers correctly at least once, it is counted as correct. The final score is the number of correct answers divided by the number of tasks.
+
+To run the evaluation, we gave the model 2048 tokens to generate the answer and used a temperature of 0.0.
 
 ## Reproducing the results
 TODO
 
+## Evaluation
+In the current version of the benchmark, some models (ones that were most likely trained on Hungarian data) perform very well, while others (ones that were not) perform poorly. This may indicate that more challenging tasks should be added in the future to make the benchmark harder for models trained on Hungarian data. Please note that the benchmark is still in development and the results may change. Your feedback is highly appreciated.
 """
 
 EVALUATION_QUEUE_TEXT = """
-
-
+TODO
 """
 
 CITATION_BUTTON_LABEL = "Copy the following snippet to cite these results"
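The scoring rule described in the added text (three attempts per task, correct if any attempt matches, score = correct tasks / total tasks) can be sketched as follows. This is a minimal illustration, not the Space's actual evaluation code; the names `score_tasks` and `ask_model` are hypothetical.

```python
ATTEMPTS = 3  # each task is given to the model three times

def score_tasks(tasks, ask_model):
    """Score a benchmark run.

    tasks: list of (question, expected_answer) pairs.
    ask_model: callable that takes a question and returns the model's answer.
    """
    correct = 0
    for question, expected in tasks:
        # A task counts as correct if at least one of the three attempts
        # matches the expected answer (any() stops early on a match).
        if any(ask_model(question) == expected for _ in range(ATTEMPTS)):
            correct += 1
    # Final score: number of correct tasks divided by the number of tasks.
    return correct / len(tasks)

# Example with a stubbed "model" that answers from a fixed sequence:
answers = iter(["B", "A", "C", "D"])
stub = lambda question: next(answers)
print(score_tasks([("Q1", "A"), ("Q2", "D")], stub))  # -> 1.0
```

Note that because a single correct attempt out of three suffices, this metric is more forgiving than single-shot accuracy, which matches the stated goal of making the benchmark more consistent across runs.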