BoxingGym

Evaluating the ability of language models to perform experimental design and model discovery

Introducing BoxingGym, a benchmark developed to evaluate the ability of language models and other agents to perform experimental design and model discovery.

Working paper.

Abstract

Understanding the world and explaining it with scientific theories is a central aspiration of artificial intelligence research. Proposing theories, designing experiments to test them, and revising them based on data are fundamental to scientific discovery. Despite the significant promise of LLM-based scientific agents, no benchmarks systematically test LLMs’ ability to propose scientific models, collect experimental data, and revise those models in light of new data. We introduce BoxingGym, a systematic benchmark with 10 environments for evaluating both experimental design (e.g., collecting data to test a scientific theory) and model discovery (e.g., proposing and revising scientific theories).
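
To make this propose–experiment–revise loop concrete, here is a minimal, hypothetical sketch of a BoxingGym-style interaction: an agent with a limited experiment budget queries a simulated environment, then revises its model of the hidden data-generating process. All names here (Environment, Agent, run_experiment, and the toy linear domain) are illustrative assumptions, not the benchmark's actual API.

```python
# A minimal, hypothetical sketch of a BoxingGym-style interaction loop.
# The names below (Environment, Agent, run_experiment, ...) and the toy
# linear domain are illustrative assumptions, not the benchmark's API.
import random

class Environment:
    """Simulated scientific domain backed by a hidden generative model
    (here, a noisy line y = slope * x)."""
    def __init__(self, slope=2.0, noise=0.1, seed=0):
        self._slope = slope
        self._noise = noise
        self._rng = random.Random(seed)

    def run_experiment(self, x):
        """Run one experiment at design point x; return a noisy outcome."""
        return self._slope * x + self._rng.gauss(0.0, self._noise)

class Agent:
    """Toy scientist: chooses experimental designs, then fits a model."""
    def __init__(self):
        self.data = []
        self.theory = "y is proportional to x (slope unknown)"

    def propose_design(self, step):
        # Naive design strategy: spread experiments over the input range.
        return float(step + 1)

    def revise_theory(self):
        # Least-squares slope through the origin, given the data so far.
        num = sum(x * y for x, y in self.data)
        den = sum(x * x for x, _ in self.data)
        self.theory = f"y ≈ {num / den:.2f} * x"

def discovery_loop(agent, env, budget):
    for step in range(budget):           # limited experiment budget
        x = agent.propose_design(step)   # experimental design
        y = env.run_experiment(x)        # data collection
        agent.data.append((x, y))
    agent.revise_theory()                # model discovery / revision
    return agent.theory

print(discovery_loop(Agent(), Environment(), budget=5))  # e.g. "y ≈ 2.00 * x"
```

In BoxingGym itself, the agent is a language model and each of the 10 environments is grounded in a richer probabilistic model, but the propose–experiment–revise structure is the same.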

Key Contributions

  • A systematic benchmark for evaluating language models on experimental design and model discovery.
  • Integration of probabilistic modeling to simulate real-world scientific environments.
  • Communication-based evaluation, in which agents explain their discovered models to novices (see the sketch after this list).
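
The communication-based evaluation in the last bullet can be sketched in the same toy style: the discovering agent's written explanation is scored by whether a novice, who sees only that explanation and never the raw data, can predict held-out outcomes. The helper names and parsing below are hypothetical stand-ins; in the benchmark, the novice would plausibly be another language model making predictions from the explanation alone.

```python
# Hypothetical sketch of communication-based evaluation: the discovering
# agent's explanation is scored by whether a "novice", who sees only the
# explanation and never the raw data, can predict held-out outcomes.
# novice_predict is an illustrative stand-in, not the benchmark's API.

def novice_predict(explanation, x):
    """Recover the slope from an explanation like 'y ≈ 2.00 * x' and apply it."""
    slope = float(explanation.split("≈")[1].split("*")[0])
    return slope * x

def communication_score(explanation, true_fn, test_designs):
    """Mean squared error of the novice's predictions on held-out designs."""
    errors = [(novice_predict(explanation, x) - true_fn(x)) ** 2
              for x in test_designs]
    return sum(errors) / len(errors)

# A faithful explanation lets the novice predict well (error near zero).
score = communication_score("y ≈ 2.00 * x", lambda x: 2.0 * x, [1.0, 5.0, 10.0])
print(f"novice predictive error: {score:.4f}")
```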

Team

  • Kanishk Gandhi
  • Michael Y. Li
  • Lyle Goodyear
  • Louise Li
  • Aditi Bhaskar
  • Mohammed Zaman
  • Noah Goodman

Further Information

For more details, visit the BoxingGym Project Website or read the working paper on arXiv.
