Raw results to normalized results

#825
by Ilyasch2 - opened

I am trying to recover the normalized results from the RAW results for some models on the leaderboard. For tasks without subtasks, like GPQA and MMLU-PRO, it works by just subtracting the random-baseline score and remapping to (0, 1). However, for tasks with subtasks such as BBH and MUSR, I have tried a bunch of techniques, including taking into account the number of samples per subtask, but I cannot find the right normalization. How can I recover MUSR from MUSR RAW?

Open LLM Leaderboard org

Hi @Ilyasch2 ,

To normalise results for tasks with subtasks, such as MUSR, you can follow this approach:

  • Define a normalisation function, for instance:
def normalize_within_range(value, lower_bound, higher_bound):
    return (value - lower_bound) / (higher_bound - lower_bound)
  • Calculate the lower bound for each subtask. The lower bound is the score you would get with a random baseline, i.e. the reciprocal of the number of choices for that subtask. For MUSR:
MUSR murder mysteries: 2 choices (lower_bound = 0.5)
MUSR object placement: 5 choices (lower_bound = 0.2)
MUSR team allocation: 3 choices (lower_bound = 0.333)

You can find num_choices for other benchmarks here in the doc.

  • For each subtask, normalise the raw score. If the raw score is below the lower bound, it is normalised to 0; otherwise, apply the normalisation function and scale it to a percentage:
if raw_score < lower_bound:
    normalized_score = 0
else:
    normalized_score = normalize_within_range(raw_score, lower_bound, 1) * 100
  • Average the normalised scores across subtasks to obtain the overall normalised score for MUSR (a full sketch putting these steps together is shown below).
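
To put all of this together, here is a minimal end-to-end sketch. The raw subtask accuracies below are made-up placeholder values, and the dictionary keys are just illustrative labels rather than the exact keys used in the results files; substitute the actual RAW scores for the model you are interested in:

def normalize_within_range(value, lower_bound, higher_bound):
    return (value - lower_bound) / (higher_bound - lower_bound)

# Random-baseline lower bounds per MUSR subtask (1 / number of choices)
lower_bounds = {
    "murder_mysteries": 1 / 2,   # 2 choices
    "object_placements": 1 / 5,  # 5 choices
    "team_allocation": 1 / 3,    # 3 choices
}

# Hypothetical raw accuracies, for illustration only
raw_scores = {
    "murder_mysteries": 0.56,
    "object_placements": 0.28,
    "team_allocation": 0.42,
}

normalized_scores = []
for subtask, raw_score in raw_scores.items():
    lower_bound = lower_bounds[subtask]
    if raw_score < lower_bound:
        # A raw score below the random baseline is clamped to 0
        normalized_scores.append(0.0)
    else:
        normalized_scores.append(normalize_within_range(raw_score, lower_bound, 1) * 100)

# Average over subtasks to get the overall normalised MUSR score
musr_normalized = sum(normalized_scores) / len(normalized_scores)
print(round(musr_normalized, 2))  # ~11.67 for these placeholder values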

For more details, please check out our blog post.

alozowski changed discussion status to closed
