MathArena

How exactly do you compute accuracy?

We compute the accuracy of a model by prompting it to solve each problem 4 times and computing the success rate for this problem by dividing the number of correct solutions by 4. This corresponds to the pass@1 metric estimated using 4 samples. The final accuracy is the average pass@1 over all problems. We do not perform majority voting or other criteria like pass@K.

What do the colors in the table mean?

The colors indicate success rates of the problems:

Green: Problem solved >75% of the time
Yellow: Problem solved 25-75% of the time
Orange: Problem solved 1-24% of the time
Red: Problem never solved.

Can you show the average number of input and output tokens for each model?

Yes, below you can find the average number of input and output tokens for each model along with the price per million tokens for the API we used. The data is shown for the competition that is visible on the page.

How are models evaluated on Project Euler? Is tool use allowed?

For each model, we experiment with both our own scaffold that performs multi-turn code execution via function calling, and the model provider's code interpreter (if provided via API), and select whatever works better for that model.

How is the cost calculated?

The cost shows the total cost of evaluating the model on the entire benchmark (all problems and all repetitions). It is calculated based on the API pricing for each model. For open-source models, costs can vary significantly depending on the chosen API provider and our results may not always be achieved using the most cost-effective option. In particular, for DeepSeek models we use Together API as we found it to be the most reliable endpoint, but it is more expensive than DeepSeek's own API. For gemini-2.0-flash-thinking it was impossible to determine the cost since the pay-as-you-go pricing is not available, and the Google API does not return the number of thinking tokens.

How do you know that your problems are not in the training data?

First, we always evaluate models on new competitions immediately as the problems are released, guaranteeing that the knowledge cutoff of the model is before the date of the competition. While it is not impossible to rule out that evaluated problems or their variants are in the training data (e.g. because they appeared in another competition, see here), the organizers of competitions such as AIME always try to ensure highest quality of their problem set. So we believe that the problems are sufficiently novel that it is possible to evaluate generalization capabilities of the models. Nevertheless, for every competition, we check for similar existing problems using Deep Research, and if any similar problem has been found we include this information next to the corresponding problem from the competition (can be found by clicking in the table). If you find any contamination (e.g. problems that have appeared before), feel free to contact us and we will add this information to the table.

Can you evaluate more models?

Yes, we are planning to add more models while keeping the table concise and informative. Some of the models are difficult to evaluate due to the rate limits in particular APIs, but we will try to add most well known ones. We also released our evaluation scripts to enable the community to evaluate their models.

How can I contact you?

You can contact us via email at jasper.dekoninck@inf.ethz.ch.

How should we cite this webpage?


@misc{balunovic_srimatharena_2025,
  title = {MathArena: Evaluating LLMs on Uncontaminated Math Competitions},
  author = {Mislav Balunović and Jasper Dekoninck and Ivo Petrov and Nikola Jovanović and Martin Vechev},
  copyright = {MIT},
  url = {https://matharena.ai/},
  publisher = {SRI Lab, ETH Zurich},
  month = feb,
  year = {2025},
}

MathArena:
Evaluating LLMs on Uncontaminated Math Competitions

Click on a cell to see the raw model output.

What is MathArena?

Frequently Asked Questions

MathArena:Evaluating LLMs on Uncontaminated Math Competitions

Click on a cell to see the raw model output.

What is MathArena?

Frequently Asked Questions

MathArena:
Evaluating LLMs on Uncontaminated Math Competitions