arxiv:2407.01231

MIRAI: Evaluating LLM Agents for Event Forecasting

Published on Jul 1 · Submitted by ydeng9 on Jul 2

Abstract

Recent advancements in Large Language Models (LLMs) have empowered LLM agents to autonomously collect world information and reason over it to solve complex problems. Given this capability, there is growing interest in employing LLM agents to forecast international events, which can influence decision-making and shape policy development on an international scale. Despite this growing interest, a rigorous benchmark of LLM agents' forecasting capability and reliability is still lacking. To address this gap, we introduce MIRAI, a novel benchmark designed to systematically evaluate LLM agents as temporal forecasters in the context of international events. Our benchmark features an agentic environment with tools for accessing an extensive database of historical, structured events and textual news articles. We refine the GDELT event database with careful cleaning and parsing to curate a series of relational prediction tasks with varying forecasting horizons, assessing LLM agents' abilities from short-term to long-term forecasting. We further implement APIs that enable LLM agents to utilize different tools via a code-based interface. In summary, MIRAI comprehensively evaluates agents' capabilities in three dimensions: 1) autonomously sourcing and integrating critical information from large global databases; 2) writing code with domain-specific APIs and libraries for tool use; and 3) jointly reasoning over historical knowledge in diverse formats and across time to accurately predict future events. Through comprehensive benchmarking, we aim to establish a reliable framework for assessing the capabilities of LLM agents in forecasting international events, thereby contributing to the development of more accurate and trustworthy models for international relations analysis.
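
To give a concrete flavor of the code-based tool-use interface described in the abstract, the sketch below shows what an agent's generated query code might look like; the event records, field names, and the `get_events` helper are hypothetical stand-ins for illustration, not the actual MIRAI API.

```python
from datetime import date

# Illustrative sketch only (not the actual MIRAI API): a toy in-memory event
# table with GDELT/CAMEO-flavored fields, and a hypothetical query helper an
# agent could call from generated code before predicting a future relation.
EVENTS = [
    {"date": date(2023, 11, 3),  "head": "USA", "tail": "CHN", "relation": "042"},  # make a visit
    {"date": date(2023, 11, 10), "head": "CHN", "tail": "USA", "relation": "036"},  # express intent to meet or negotiate
    {"date": date(2023, 11, 18), "head": "USA", "tail": "CHN", "relation": "112"},  # accuse
]

def get_events(head=None, tail=None, start=None, end=None):
    """Return events matching the given actor pair and date window."""
    hits = []
    for e in EVENTS:
        if head and e["head"] != head:
            continue
        if tail and e["tail"] != tail:
            continue
        if start and e["date"] < start:
            continue
        if end and e["date"] > end:
            continue
        hits.append(e)
    return hits

# The agent gathers recent USA -> CHN interactions as evidence for its forecast.
recent = get_events(head="USA", tail="CHN", start=date(2023, 11, 1), end=date(2023, 11, 30))
print([e["relation"] for e in recent])  # ['042', '112']
```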

Community

Paper author · Paper submitter · edited 21 days ago

We introduce MIRAI, a benchmark for evaluating LLM agents on temporal forecasting of international events, requiring tool use and complex reasoning.

See https://mirai-llm.github.io/ for more details and examples.

Hi @ydeng9, congrats on this work! I see the dataset is currently hosted on Google Drive; would you be up for pushing it to the Hugging Face Hub and linking it to this paper?

See here: https://huggingface.co/docs/datasets/loading, and here for linking it to this paper page: https://huggingface.co/docs/hub/en/datasets-cards#linking-a-paper.
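
For reference, a minimal sketch of what that could look like with the `datasets` library (the file name and repo id below are placeholders, not the actual paths):

```python
from datasets import load_dataset

# Placeholder file name and repo id -- adjust to the actual MIRAI export.
ds = load_dataset("json", data_files="mirai_test.jsonl")
ds.push_to_hub("your-username/MIRAI")  # requires `huggingface-cli login` first
```

Once the dataset repo exists, citing arxiv.org/abs/2407.01231 in its README.md will link it to this paper page.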

Thanks!

Paper author

Thanks for the instructions! We are working on releasing our data on Hugging Face; it should be available shortly. :)


Models citing this paper: 0

No model linking this paper

Cite arxiv.org/abs/2407.01231 in a model README.md to link it from this page.

Datasets citing this paper: 0

No dataset linking this paper

Cite arxiv.org/abs/2407.01231 in a dataset README.md to link it from this page.

Spaces citing this paper: 0

No Space linking this paper

Cite arxiv.org/abs/2407.01231 in a Space README.md to link it from this page.

Collections including this paper: 3