Raw results to normalized results

#825
by Ilyasch2 - opened

I am trying to recover the normalized results from the RAW results for some models on the leaderboard. For tasks without subtasks, like GPQA and MMLU-PRO, it works by just subtracting the random-baseline score and remapping to (0, 1). However, for tasks with subtasks such as BBH and MUSR, I have tried a bunch of techniques, including taking into account the number of samples per subtask, but I cannot find the right normalization. How can I recover MUSR from MUSR RAW?

Open LLM Leaderboard org

Hi @Ilyasch2 ,

To normalise results for tasks with subtasks, such as MUSR, you can follow this approach:

  • Define a normalisation function, for instance:
def normalize_within_range(value, lower_bound, higher_bound):
    return (value - lower_bound) / (higher_bound - lower_bound)
  • Calculate the lower bound for each subtask. The lower bound is the score you would get with a random baseline, i.e. the reciprocal of the number of choices for that subtask. For MUSR:
MUSR murder mysteries: 2 choices (lower_bound = 0.5)
MUSR object placement: 5 choices (lower_bound = 0.2)
MUSR team allocation: 3 choices (lower_bound = 0.333)

You can find num_choices for other benchmarks here in the doc.

  • For each subtask, normalise the raw score. If the raw score is below the lower bound, it is normalised to 0; otherwise, apply the normalisation function and scale it to a percentage:
if raw_score < lower_bound:
    normalized_score = 0
else:
    normalized_score = normalize_within_range(raw_score, lower_bound, 1) * 100
  • Average the normalised scores across subtasks to obtain the overall normalised score for MUSR (a full sketch putting these steps together is shown below).
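
To put all of this together, here is a minimal end-to-end sketch. The raw subtask accuracies below are made-up placeholder values, and the dictionary keys are just illustrative labels rather than the exact keys used in the results files; substitute the actual RAW scores for the model you are interested in:

def normalize_within_range(value, lower_bound, higher_bound):
    return (value - lower_bound) / (higher_bound - lower_bound)

# Random-baseline lower bounds per MUSR subtask (1 / number of choices)
lower_bounds = {
    "murder_mysteries": 1 / 2,   # 2 choices
    "object_placements": 1 / 5,  # 5 choices
    "team_allocation": 1 / 3,    # 3 choices
}

# Hypothetical raw accuracies, for illustration only
raw_scores = {
    "murder_mysteries": 0.56,
    "object_placements": 0.28,
    "team_allocation": 0.42,
}

normalized_scores = []
for subtask, raw_score in raw_scores.items():
    lower_bound = lower_bounds[subtask]
    if raw_score < lower_bound:
        # A raw score below the random baseline is clamped to 0
        normalized_scores.append(0.0)
    else:
        normalized_scores.append(normalize_within_range(raw_score, lower_bound, 1) * 100)

# Average over subtasks to get the overall normalised MUSR score
musr_normalized = sum(normalized_scores) / len(normalized_scores)
print(round(musr_normalized, 2))  # ~11.67 for these placeholder values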

For more details, please check out our blog post.

alozowski changed discussion status to closed
