from dataclasses import dataclass
from enum import Enum

@dataclass
class Task:
    benchmark: str
    metric: str
    col_name: str
# Select your tasks here
# ---------------------------------------------------
class Tasks(Enum):
    # task_key in the json file, metric_key in the json file, name to display in the leaderboard
    task0 = Task("history", "score", "History")
    task1 = Task("grammar", "score", "Grammar")
    task2 = Task("logic", "score", "Logic")
    task3 = Task("sayings", "score", "Sayings")
    task4 = Task("spelling", "score", "Spelling")
    task5 = Task("vocabulary", "score", "Vocabulary")

NUM_FEWSHOT = 0  # Change with your few shot
# ---------------------------------------------------
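# A minimal sketch (not part of this Space's code) of how the Tasks enum could be
# consumed: each entry's `benchmark` and `metric` act as keys into a results JSON,
# and `col_name` becomes the leaderboard column header. The `results` dict below is
# a made-up illustration, not real benchmark data.

```python
from dataclasses import dataclass
from enum import Enum

@dataclass
class Task:
    benchmark: str
    metric: str
    col_name: str

class Tasks(Enum):
    task0 = Task("history", "score", "History")
    task1 = Task("grammar", "score", "Grammar")

# Hypothetical per-task results, keyed by task_key then metric_key
results = {"history": {"score": 0.82}, "grammar": {"score": 0.75}}

# Build one leaderboard row: display name -> score
row = {t.value.col_name: results[t.value.benchmark][t.value.metric] for t in Tasks}
# row == {"History": 0.82, "Grammar": 0.75}
```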
# Your leaderboard name
TITLE = """<h1 align="center" id="space-title">HunEval leaderboard</h1>"""

# What does your leaderboard evaluate?
INTRODUCTION_TEXT = """
The HunEval leaderboard assesses the performance of models on a benchmark designed to evaluate their proficiency in understanding the Hungarian language and its nuances. The benchmark comprises two primary components: (1) linguistic comprehension tasks, which gauge a model's ability to interpret and process Hungarian text; and (2) knowledge-based tasks, which examine a model's familiarity with Hungarian cultural and linguistic phenomena. It consists of multiple sub-tasks, each targeting a distinct aspect of the model's performance.

In designing the benchmark, our objective was to create challenges that would be intuitive for native Hungarian speakers or individuals with extensive exposure to the language, but potentially more demanding for models without prior training on Hungarian data. As such, we anticipate that models trained on Hungarian datasets will perform well on the benchmark, whereas those lacking this experience may encounter difficulties. That said, a model's strong performance on the benchmark does not imply expertise in a specific task; rather, it indicates proficiency in understanding the Hungarian language and its structures.

**Note that this benchmark is just a Proof of Concept and is not intended to be a comprehensive evaluation of a model's capabilities.** We encourage participants to explore the benchmark and provide feedback on how it can be improved.
"""
# Which evaluations are you running? How can people reproduce what you have?
LLM_BENCHMARKS_TEXT = """
## How it works
The benchmark is divided into several tasks: history and logic (testing the models' knowledge), and grammar, sayings, spelling, and vocabulary (testing the models' language understanding capabilities). Each task contains an instruction or question and a set of four possible answers. The model is given a system prompt that elicits CoT reasoning before the answer is provided. This improves the results for most models, while also making the benchmark more consistent.

## Reproducing the results
TODO
""" | |
EVALUATION_QUEUE_TEXT = """ | |
TODO | |
""" | |
CITATION_BUTTON_LABEL = "Copy the following snippet to cite these results"

# Citation text for HunEval by Balázs Ádám Toldi, 2024, in progress
CITATION_BUTTON_TEXT = r""" | |
@misc{toldi2024huneval, | |
title={HunEval}, | |
author={Balázs Ádám Toldi}, | |
year={2024}, | |
howpublished={\url{https://huggingface.co/spaces/Bazsalanszky/huneval}} | |
} | |
""" | |