Spaces:

Bazsalanszky
/

HunEval

Sleeping

App Files Files Community

HunEval / src /about.py

Bazsalanszky

Updated about page

09e2613 about 2 months ago

raw

history blame

No virus

3.36 kB

	from dataclasses import dataclass
	from enum import Enum


	@dataclass
	class Task:
	benchmark: str
	metric: str
	col_name: str


	# Select your tasks here
	# ---------------------------------------------------
	class Tasks(Enum):
	# task_key in the json file, metric_key in the json file, name to display in the leaderboard
	task0 = Task("history","score", "History")
	task1 = Task("grammar","score", "Grammar")
	task2 = Task("logic","score", "Logic")
	task3 = Task("sayings","score", "Sayings")
	task4 = Task("spelling","score", "Spelling")
	task5 = Task("vocabulary","score", "Vocabulary")


	NUM_FEWSHOT = 0 # Change with your few shot
	# ---------------------------------------------------


	# Your leaderboard name
	TITLE = """<h1 align="center" id="space-title">HunEval leaderboard</h1>"""

	# What does your leaderboard evaluate?
	INTRODUCTION_TEXT = """
	The HunEval leaderboard assesses the performance of models on a benchmark designed to evaluate their proficiency in understanding the Hungarian language and its nuances. The benchmark comprises two
	primary components: (1) linguistic comprehension tasks, which aim to gauge a model's ability to interpret and process Hungarian text; and (2) knowledge-based tasks that examine a model's familiarity
	with Hungarian cultural and linguistic phenomena. The benchmark is comprised of multiple sub-tasks, each targeting a distinct aspect of the model's performance.

	In designing the benchmark, our objective was to create challenges that would be intuitive for native Hungarian speakers or individuals with extensive exposure to the language, but potentially more
	demanding for models without prior training on Hungarian data. As such, we anticipate that models trained on Hungarian datasets will perform well on the benchmark, whereas those lacking this experience
	may encounter difficulties. Notwithstanding, a model's strong performance on the benchmark does not imply expertise in a specific task; rather, it indicates a proficiency in understanding Hungarian
	language and its structures.

	Note that this benchmark is just a Proof of Concept and is not intended to be a comprehensive evaluation of a model's capabilities. We encourage participants to explore the benchmark and provide feedback on how it can be improved.
	"""

	# Which evaluations are you running? how can people reproduce what you have?
	LLM_BENCHMARKS_TEXT = """
	## How it works
	The benhmark is devided into several tasks, including: history, logic (testing the knowledge of the models), grammar, sayings, spelling, and vocabulary (testing the language understanding capabilities of the models). Each task contains an instruction or question, and a set of four possible answers. The model is given a system
	prompt, which aims to add CoT reasoning before providing an answer. This makes the improves the results for most of the models, while also making the benchmark more consistent.

	## Reproducing the results
	TODO

	"""

	EVALUATION_QUEUE_TEXT = """
	TODO
	"""

	CITATION_BUTTON_LABEL = "Copy the following snippet to cite these results"
	# Citation text for HunEval by Balázs Ádám Toldi, 2024, inprogress
	CITATION_BUTTON_TEXT = r"""
	@misc{toldi2024huneval,
	title={HunEval},
	author={Balázs Ádám Toldi},
	year={2024},
	howpublished={\url{https://huggingface.co/spaces/Bazsalanszky/huneval}}
	}
	"""