Our Transformers Code Agent beats the GAIA benchmark!
It's not using GPT-4o for evaluation; evaluation is done with exact string match!
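For anyone wondering what exact string match scoring looks like in practice, here is a minimal sketch. The `normalize` and `exact_match` helpers are hypothetical names, and the normalization (trim, lowercase, collapse whitespace) is an assumption for illustration; the actual GAIA scorer may normalize answers differently.

```python
def normalize(text: str) -> str:
    # Assumed normalization: trim, lowercase, and collapse runs of whitespace.
    return " ".join(text.strip().lower().split())


def exact_match(prediction: str, truth: str) -> bool:
    # Score is all-or-nothing: the normalized strings must be identical.
    return normalize(prediction) == normalize(truth)


if __name__ == "__main__":
    print(exact_match("  Paris ", "paris"))
    print(exact_match("Paris, France", "paris"))
```

No LLM judge is involved in a check like this: an answer either equals the reference after normalization or it scores zero.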
Great idea! Can I build it, @victor, or would you like to make it yourself?
{
"rationale": "The answer does not match the true answer at all.",
"score": 1,
"confidence_level": 0.85
}