PROOF OR BLUFF? EVALUATING LLMS ON 2025 USA MATH OLYMPIAD
Recent math benchmarks for large language models (LLMs), such as MathArena, indicate that state-of-the-art reasoning models achieve impressive performance on mathematical competitions like AIME, with the leading model, o3-mini, achieving scores comparable to top human competitors. However, these benchmarks evaluate models solely on final numerical answers, neglecting the rigorous reasoning and proof generation essential for real-world mathematical tasks. To address this, we introduce the first comprehensive evaluation of full-solution reasoning on challenging mathematical problems. Using expert human annotators, we evaluated several state-of-the-art reasoning models on the six problems of the 2025 USAMO within hours of their release. Our results reveal that all tested models struggled significantly, scoring less than 5% on average. <https://arxiv.org/pdf/2503.21934v1>