Bazsalanszky committed
Commit 346cd10
1 Parent(s): c42a846

Change description

Files changed (1)
  1. src/about.py +6 -3
src/about.py CHANGED
@@ -46,16 +46,19 @@ language and its structures.
 LLM_BENCHMARKS_TEXT = """
 ## How it works
 The benchmark is divided into several tasks, including history, logic (testing the models' knowledge), grammar, sayings, spelling, and vocabulary (testing the models' language-understanding capabilities). Each task contains an instruction or question and a set of four possible answers. The model is given a system
-prompt, which aims to add CoT reasoning before providing an answer. This improves the results for most of the models, while also making the benchmark more consistent.
+prompt, which aims to add CoT reasoning before providing an answer. This improves the results for most of the models, while also making the benchmark more consistent. An answer is considered correct if it matches the correct answer in the set of possible answers. Each task is given to the model three times; if it answers correctly at least once, the task counts as correct. The final score is the number of correct tasks divided by the total number of tasks.
+
+To run the evaluation, the model was given 2048 tokens to generate the answer, and the temperature was set to 0.0.
 
 ## Reproducing the results
 TODO
 
+## Evaluation
+In the current version of the benchmark, some models (most likely those trained on Hungarian data) perform very well, while others (those not trained on Hungarian data) perform poorly. This may indicate that more challenging tasks should be added in the future to make the benchmark harder even for models trained on Hungarian data. Please note that the benchmark is still in development and the results may change. Your feedback is highly appreciated.
 """
 
 EVALUATION_QUEUE_TEXT = """
-## Evaluation
-
+TODO
 """
 
 CITATION_BUTTON_LABEL = "Copy the following snippet to cite these results"
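The scoring rule added in this commit (three attempts per task, a task counts if at least one attempt matches, final score = correct tasks / total tasks) could be sketched as below. This is a minimal illustration, not the repository's actual evaluation harness; `ask_model`, the task dictionary fields, and the exact-match comparison are assumptions.

```python
import random  # used only by the stub model below

ATTEMPTS = 3       # each task is given to the model three times
MAX_TOKENS = 2048  # generation budget described in LLM_BENCHMARKS_TEXT
TEMPERATURE = 0.0  # greedy decoding, as described

def ask_model(question: str, choices: list[str]) -> str:
    """Hypothetical model call: a real harness would query the LLM with the
    CoT system prompt, max_tokens=MAX_TOKENS, and temperature=TEMPERATURE."""
    return random.choice(choices)

def score(tasks: list[dict]) -> float:
    """A task counts as correct if at least one of ATTEMPTS answers matches
    the reference answer; the score is correct tasks / total tasks."""
    correct = 0
    for task in tasks:
        answers = (ask_model(task["question"], task["choices"])
                   for _ in range(ATTEMPTS))
        if any(answer == task["answer"] for answer in answers):
            correct += 1
    return correct / len(tasks)

if __name__ == "__main__":
    demo = [{"question": "?", "choices": ["A", "B", "C", "D"], "answer": "A"}]
    print(score(demo))
```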