Testing AI's limits in Theoretical Physics

By Cosmologa | dephy | 21 Sep 2025

The field of theoretical physics explores some of the deepest questions about the Universe, from the behavior of the smallest particles to cosmological scales. Although some of these scales cannot yet be fully probed by experiment, theoretical work on them is intense. Recent advances in artificial intelligence (AI) raise a fascinating question: can AI solve complex problems in theoretical physics? A recent paper titled "Theoretical Physics Benchmark (TPBench) - a dataset and study of AI reasoning capabilities in theoretical physics" addresses this question by building a benchmark that measures how well AI models perform on physics problems ranging from undergraduate level to research level.

The authors of the paper designed a set of 57 physics problems that cover a broad range of subjects, including cosmology, high-energy physics, general relativity, and foundational topics like classical and quantum mechanics. Their benchmark stands out because most of the problems are newly created rather than taken from existing textbooks or public collections. This makes it less likely that a model has already seen the problems, so it has to reason its way through real scientific questions instead of recalling rehearsed material.

An interesting insight in the paper is the difference between the kind of reasoning used in physics and the kind used in mathematics. Mathematics demands exact proofs and rigid logical structures, aiming to establish broad and absolute statements. In contrast, theoretical physics frequently deals with approximations and models that describe nature within a limited range of conditions. Formal mathematical proofs are also often unavailable for physics derivations and computations, which makes correct results harder to obtain from LLM tools alone. Additionally, physicists are more often concerned with predictions that can be tested by experiment or observation than with formal proofs, which makes the reasoning style more flexible, but also more complex for AI to handle.

The researchers put several AI models to the test (the newest available at the time of publication). While these models showed promising results on the easier problems, they struggled significantly with the research-level ones. This gap highlights that AI is still far from fully autonomous research capabilities in physics. A cautionary remark has to be made regarding the limitations of this work. The benchmark mainly evaluates AI's ability to solve given problems rather than to engage in the entire scientific process. It does not evaluate hypothesis generation, experiment design, or iterative testing. Another shortcoming comes from the method used to evaluate the answers, in which a separate large language model assesses the reasoning and solution quality of the other models. The grading model may itself misunderstand advanced physics concepts, so its judgments raise concerns about the scientific validity and rigor of the evaluation.
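One way to reduce reliance on a grading model, for problems whose final answer is a closed-form expression, is to compare a candidate answer against a reference answer numerically at randomly sampled inputs. The following is only a minimal sketch of that idea and is not presented as the paper's method: the helper `answers_agree`, the toy pendulum problem, and all function names are assumptions made here for illustration.

```python
import math
import random


def answers_agree(candidate, reference, n_args, trials=50, rel_tol=1e-9, seed=0):
    """Return True if two callables agree at randomly sampled positive inputs.

    Hypothetical helper for illustration; it checks numerical agreement only
    and says nothing about whether the reasoning behind an answer is sound.
    """
    rng = random.Random(seed)  # fixed seed keeps the check reproducible
    for _ in range(trials):
        args = [rng.uniform(0.1, 10.0) for _ in range(n_args)]
        if not math.isclose(candidate(*args), reference(*args), rel_tol=rel_tol):
            return False
    return True


# Toy problem (an assumption, not from the benchmark):
# period of a simple pendulum in the small-angle approximation.
def reference(length, g):
    return 2 * math.pi * math.sqrt(length / g)


def equivalent_form(length, g):
    # Algebraically the same answer, written differently.
    return 2 * math.pi / math.sqrt(g / length)


def wrong_answer(length, g):
    # Inverted ratio under the square root.
    return 2 * math.pi * math.sqrt(g / length)


print(answers_agree(equivalent_form, reference, n_args=2))
print(answers_agree(wrong_answer, reference, n_args=2))
```

A check like this accepts answers that are written differently but are mathematically equal, and it rejects answers that differ at the sampled points. It only validates the final expression, so it complements rather than replaces an assessment of the derivation itself.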

Nonetheless, the paper offers a hopeful outlook for future AI-assisted physics research. Although AI cannot yet replace human physicists on the frontier of discovery, improvements in AI reasoning ability might soon enable it to support researchers by handling routine calculations or helping to check complex derivations.

The authors invite the scientific community to collaborate and push the boundaries of what AI can do in theoretical physics. Such an initiative could help accelerate discoveries by combining human creativity with AI's growing computational power. There have been a couple of follow-up papers, which will be covered in future posts. This foundational paper opens the door to decentralizing physics research in the near future, and it points in an optimistic direction for science overall.

