BoxingGym
Evaluating the ability of language models to perform experimental design and model discovery
Introducing BoxingGym, a benchmark developed to evaluate how well language models and other agents perform experimental design and model discovery.

Abstract
Understanding the world and explaining it with scientific theories is a central aspiration of artificial intelligence research. Proposing theories, designing experiments to test them, and revising them based on data are fundamental to scientific discovery. Despite the significant promise of LLM-based scientific agents, no benchmarks systematically test LLMs' ability to propose scientific models, collect experimental data, and revise those models in light of new data. We introduce BoxingGym, a systematic benchmark with 10 environments for evaluating both experimental design (e.g., collecting data to test a scientific theory) and model discovery (e.g., proposing and revising scientific theories).
Key Contributions
- A systematic benchmark for evaluating language models on experimental design and model discovery (a minimal sketch of this loop follows the list).
- Integration of probabilistic modeling to simulate real-world scientific environments.
- A communication-based evaluation in which an agent must explain its discovered model to a novice agent.
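
To make the setup concrete, here is a minimal, self-contained sketch of the kind of experiment-design loop BoxingGym evaluates: an agent chooses experiments, an environment backed by a hidden generative model returns noisy observations, and the agent fits a model to the collected data. All names here (`SimpleEnv`, `run_experiment`, `fit_line`) are hypothetical illustrations, not BoxingGym's actual API.

```python
import random

class SimpleEnv:
    """Toy environment backed by a hidden generative model (here, a noisy line)."""
    def __init__(self):
        self.slope = random.uniform(-2, 2)      # latent parameters the agent
        self.intercept = random.uniform(-1, 1)  # must discover from experiments

    def run_experiment(self, x: float) -> float:
        """The agent picks an input x; the environment returns a noisy observation."""
        return self.slope * x + self.intercept + random.gauss(0, 0.1)

def fit_line(xs, ys):
    """Least-squares estimate of slope and intercept from the observed data."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, my - slope * mx

env = SimpleEnv()
xs = [-2.0, -1.0, 0.0, 1.0, 2.0]            # the agent's chosen experiment designs
ys = [env.run_experiment(x) for x in xs]    # data collected from the environment
slope, intercept = fit_line(xs, ys)         # the agent's revised "theory"
print(f"estimated model: y = {slope:.2f} * x + {intercept:.2f}")
```

In the benchmark itself, a language model plays the role of the experimenter, choosing designs and proposing and revising theories in natural language rather than via a fixed estimator.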
Team
- Kanishk Gandhi
- Michael Y. Li
- Lyle Goodyear
- Louise Li
- Aditi Bhaskar
- Mohammed Zaman
- Noah Goodman
Further Information
For more details, visit the BoxingGym Project Website or read the working paper on arXiv.