Wrong results, or am I understanding something wrong?

#839
by nicobuko - opened

Hi,
I am currently looking at some results on the new leaderboard, and there are parts of them I do not understand.

For example, I looked at the following results:
https://huggingface.co/datasets/open-llm-leaderboard/mistralai__Mistral-7B-v0.3-details/blob/main/mistralai__Mistral-7B-v0.3/samples_leaderboard_musr_object_placements_2024-06-16T16-59-40.129004.json

In there I saw, for example, the following doc:

{
    "doc_id": 100,
    "doc": {
        "narrative": "Today was an important day for Steve, a chance for a major advancement in his career hinged...",
        "question": "Which location is the most likely place Amy would look to find the laptop given the story?",
        "choices": "[\"Amy's bag\", \"Steve's desk\", 'meeting room', 'storage room']",
        "answer_index": 1,
        "answer_choice": "Steve's desk"
    },
    "target": "Steve's desk",
    "arguments": {
        "gen_args_0": {
            "arg_0": "Today was an important day for Steve, a chance for a major advancement in his career hinged...",
            "arg_1": " Amy's bag"
        },
        "gen_args_1": {
            "arg_0": "Today was an important day for Steve, a chance for a major advancement in his career hinged...",
            "arg_1": " Steve's desk"
        },
        "gen_args_2": {
            "arg_0": "Today was an important day for Steve, a chance for a major advancement in his career hinged...",
            "arg_1": " meeting room"
        },
        "gen_args_3": {
            "arg_0": "Today was an important day for Steve, a chance for a major advancement in his career hinged...",
            "arg_1": " storage room"
        }
    },
    "resps": [
        [
            [
                "-4.742625713348389",
                "False"
            ]
        ],
        [
            [
                "-5.4493818283081055",
                "False"
            ]
        ],
        [
            [
                "-8.597461700439453",
                "False"
            ]
        ],
        [
            [
                "-7.260561466217041",
                "False"
            ]
        ]
    ],
    "filtered_resps": [
        [
            "-4.742625713348389",
            "False"
        ],
        [
            "-5.4493818283081055",
            "False"
        ],
        [
            "-8.597461700439453",
            "False"
        ],
        [
            "-7.260561466217041",
            "False"
        ]
    ],
    "doc_hash": "1ff558c6b1587662ad68ccd0f38192193dddd6376f35f96cee6f445c4b4c26a6",
    "prompt_hash": "30efa381b5d2f7a8d22c15caeb5cb192cf7ca3b0481be33bf0e50f7a474c8668",
    "target_hash": "78fb1574492fbfb833e674311451bfff46b7377e0c1824b94bf4b1ddc84d0039",
    "acc_norm": 1.0
}

The answer index is 1.
Looking at the resps values, the highest value is "-4.742625713348389", which is at index 0.
I thought that the highest value determines the model's answer?
So I would have expected the model's answer to be index 0.

Why is the acc_norm then 1.0?
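
To make my reasoning concrete, here is a minimal sketch of how I am reading the logged values (the variable names are mine, not from the harness):

```python
import numpy as np

# Summed logprobs from filtered_resps for doc_id 100,
# in the same order as the choices
logprobs = np.array([
    -4.742625713348389,   # " Amy's bag"
    -5.4493818283081055,  # " Steve's desk"
    -8.597461700439453,   # " meeting room"
    -7.260561466217041,   # " storage room"
])

# I assumed the highest (least negative) logprob is the model's answer
print(int(np.argmax(logprobs)))  # -> 0, but answer_index is 1
```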

There are two types of indexing: zero-based, starting from 0, and one-based, starting from 1. I think this might just be using a zero-based index when you were expecting it to start from one. I might be wrong though, so let's wait for an answer from one of the maintainers.

I do not think that this is the explanation, but yes, you are totally right, let's wait for an official answer.

Here is another example, from https://huggingface.co/datasets/open-llm-leaderboard/meta-llama__Meta-Llama-3-70B-Instruct-details/blob/main/meta-llama__Meta-Llama-3-70B-Instruct/samples_leaderboard_musr_murder_mysteries_2024-06-19T08-22-58.348428.json, which does not make sense to me either.

{
    "doc_id": 0,
    "doc": {
        "narrative": "In an adrenaline inducing bungee jumping site, ...",
        "question": "Who is the most likely murderer?",
        "choices": "['Mackenzie', 'Ana']",
        "answer_index": 0,
        "answer_choice": "Mackenzie"
    },
    "target": "Mackenzie",
    "arguments": {
        "gen_args_0": {
            "arg_0": "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\nIn an adrenaline inducing bungee jumping site, ...",
            "arg_1": "Mackenzie"
        },
        "gen_args_1": {
            "arg_0": "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\nIn an adrenaline inducing bungee jumping site, ...",
            "arg_1": "Ana"
        }
    },
    "resps": [
        [
            [
                "-18.321674346923828",
                "False"
            ]
        ],
        [
            [
                "-17.761363983154297",
                "False"
            ]
        ]
    ],
    "filtered_resps": [
        [
            "-18.321674346923828",
            "False"
        ],
        [
            "-17.761363983154297",
            "False"
        ]
    ],
    "doc_hash": "5f1aa1c93592052d09fd5c2269624f7f6502e7a0a449eaedade303f15e4f9a7e",
    "prompt_hash": "eba5abc36b0f013ee9ad59846c5732be9d46917d24400f110b41e7dcdf3c34b4",
    "target_hash": "5e0a11a6f7067982f903e924b45692ab48c7224f794799148ed9bc3b6fc1e340",
    "acc_norm": 1.0
}

The answer index is 0. The higher value in resps is "-17.761363983154297", which is at index 1. So I would think the model predicts index 1. But acc_norm is again 1.0.
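
Again, as a sketch of what I am computing:

```python
import numpy as np

# Summed logprobs from filtered_resps for doc_id 0
logprobs = np.array([
    -18.321674346923828,  # "Mackenzie"
    -17.761363983154297,  # "Ana"
])
print(int(np.argmax(logprobs)))  # -> 1, but answer_index is 0
```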

Also, the new visualization on https://huggingface.co/spaces/open-llm-leaderboard/blog does not seem to match this.

Open LLM Leaderboard org

IIRC, the logprobs you see displayed here are the sums of the logprobs over the tokens of each choice, not yet normalized by the length of the choice. (They correspond to the acc score.)
For the acc_norm score, you normalize by the number of tokens. Since Mackenzie is longer than Ana, its summed logprob is divided by more tokens, so after normalization Mackenzie ends up with the higher score even though its raw sum is lower, which flips the predicted answer.
I agree it's not super legible though.
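
To make this concrete, here is a small sketch of the two scores side by side. It normalizes by the character length of each choice; whether the harness uses token or character counts depends on the version, so treat the lengths as an assumption:

```python
import numpy as np

def acc_and_acc_norm(logprobs, choices, gold_index):
    # Sketch only: normalize by character length of each choice;
    # the harness may normalize by token count instead.
    lls = np.array(logprobs)
    lengths = np.array([float(len(c)) for c in choices])
    acc = float(np.argmax(lls) == gold_index)                 # raw argmax
    acc_norm = float(np.argmax(lls / lengths) == gold_index)  # length-normalized argmax
    return acc, acc_norm

# Murder-mysteries example: the raw argmax picks "Ana" (wrong),
# but the normalized argmax picks "Mackenzie" (correct)
print(acc_and_acc_norm(
    [-18.321674346923828, -17.761363983154297],
    ["Mackenzie", "Ana"],
    gold_index=0,
))  # -> (0.0, 1.0)
```

With per-character scores of about -2.04 for Mackenzie and -5.92 for Ana, the normalized ranking flips relative to the raw sums.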

@clefourrier Thank you for the response :) Now I understand it! I am new to the leaderboard and did not have this information. Is there any documentation on this topic from which I could have known it?

Open LLM Leaderboard org

We're working on improving our docs; I don't think it's covered there yet. @alozowski, do you think it would make sense to add this to the FAQ?

Open LLM Leaderboard org

Hi everyone!

Here is the new Scores Normalization page in our documentation. Please check it out!

Hi @alozowski :) Thank you for your comment and for the pointer to the new page :) It is very helpful, but it does not directly cover the topic of this discussion. The confusion here was about the normalization of the logprobs by the number of tokens in each choice.

Open LLM Leaderboard org

Yes, I'll add more information on this topic there; you can check that documentation page from time to time to follow the updates!

I think I can close this discussion for now. Please feel free to open a new one in case of any other questions :)

alozowski changed discussion status to closed
