COUNTERMATH
Counterexample-Driven Conceptual Reasoning in Mathematical LLMs
What is COUNTERMATH?
COUNTERMATH is a high-quality, university-level mathematical benchmark that evaluates Large Language Models' ability to carry out mathematical reasoning and proofs through counterexamples. It contains 1,216 mathematical statements that LLMs must prove or disprove, refuting false claims by providing counterexamples, and thereby assesses their grasp of mathematical concepts.
Inspired by the pedagogical method of proof by counterexample commonly used in human mathematics education, COUNTERMATH aims to move beyond drill-based learning and foster a deeper understanding of mathematical theorems and related concepts in LLMs.
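To make the task concrete, the sketch below shows what a single COUNTERMATH-style item and the corresponding prompt might look like. The field names (`statement`, `judgement`, `rationale`) and the example statement are illustrative assumptions, not the dataset's actual schema.

```python
# Illustrative sketch only: field names and the example statement are assumptions,
# not the official COUNTERMATH schema.
example_item = {
    "statement": "Every bounded sequence of real numbers is convergent.",  # claim to prove or disprove
    "judgement": "False",  # gold label: the statement does not hold
    "rationale": (
        "Counterexample: the sequence a_n = (-1)^n is bounded by 1 "
        "but oscillates between -1 and 1, so it does not converge."
    ),
}

def build_prompt(item: dict) -> str:
    """Turn one item into a prove-or-disprove prompt for an LLM."""
    return (
        "Judge whether the following statement is True or False. "
        "If it is False, give an explicit counterexample.\n\n"
        f"Statement: {item['statement']}"
    )

print(build_prompt(example_item))
```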
If you use COUNTERMATH in your research, please cite our paper.
Getting started
COUNTERMATH is released under the CC BY-SA 4.0 license. The dataset and evaluation tools can be downloaded from the links below.
To evaluate your model on our dataset, please refer to our submission guidelines. Once you have completed your evaluation, submit your results using the method below; we will then score them and update the leaderboard.
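For reference, here is a minimal sketch of how the headline leaderboard metric (Judgement F1 (macro)) could be computed from a results file. The JSONL format, the field names (`judgement`, `prediction`), and the use of scikit-learn are assumptions on our part, not the official scoring script.

```python
# Minimal scoring sketch: assumes a JSONL results file with one object per
# statement containing a gold "judgement" and a model "prediction" label
# ("True"/"False"). This is not the official COUNTERMATH evaluation script.
import json
from sklearn.metrics import f1_score

def judgement_macro_f1(results_path: str) -> float:
    gold, pred = [], []
    with open(results_path, encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            gold.append(record["judgement"].strip().lower())
            pred.append(record["prediction"].strip().lower())
    # Macro F1 averages the per-class F1 over the "true" and "false" classes.
    return f1_score(gold, pred, average="macro")

# Example: print(judgement_macro_f1("results.jsonl"))
```

The rationale-level metrics (Strict/Loose) additionally check the provided counterexamples against references and are not covered by this sketch.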
Stay connected!
Join our community to receive updates or participate in discussions about COUNTERMATH!
- Follow our GitHub repository
- View paper on OpenReview
- Contact us: zhikunxu@asu.edu
Leaderboard
The following table shows the performance of various models on COUNTERMATH, ranked by Judgement F1 (macro):
| Rank | Date | Model | Organization | Code | Judgement F1 (macro) | Judgement Examples (%) | Rationale Strict (%) | Rationale Loose (%) |
|---|---|---|---|---|---|---|---|---|
| 1 | / | Deepseek-R1 | DeepSeek | / | 80.7 | 86.8 | 54.2 | 65.3 |
| 2 | / | Gemini2.5-pro | Google | / | 77.0 | 90.8 | 65.1 | 75.7 |
| 3 | / | Claude3.7-sonnet | Anthropic | / | 64.8 | 78.0 | 45.0 | 52.5 |
| 4 | / | OpenAI o1-preview | OpenAI | / | 60.1 | 55.8 | 39.8 | 40.9 |
| 5 | / | GPT-4o | OpenAI | / | 59.0 | 44.7 | 19.7 | 21.3 |
| 6 | / | Qwen-max | Alibaba | / | 58.9 | 61.8 | 30.4 | 33.9 |
| 7 | / | Qwen2.5-Math-72B-Instruct | Alibaba | / | 41.8 | 76.6 | 38.9 | 41.6 |
| 8 | / | QwQ-32B-Preview | Alibaba | / | 39.9 | 70.0 | 38.6 | 43.8 |
| 9 | / | Qwen2.5-Math-7B-Instruct | Alibaba | / | 38.3 | 74.2 | 30.2 | 33.2 |
| 10 | / | Eurus-2-7B-PRIME | Mistral | / | 37.5 | 64.8 | 28.5 | 32.0 |
| 11 | / | InternLM2-Math-Plus-Mixtral8x22B | Shanghai AI Lab | / | 37.3 | 63.2 | 21.5 | 23.1 |
| 12 | / | Abel-7B-002 | GAIR | / | 34.4 | 66.1 | 16.0 | 17.9 |
| 13 | / | InternLM2-Math-Plus-7B | Shanghai AI Lab | / | 33.9 | 36.6 | 9.0 | 9.5 |
| 14 | / | InternVL2-7B-Plus | Shanghai AI Lab | / | 32.3 | 54.2 | 10.7 | 12.1 |
| 15 | / | Deepseek-Math-7B-rl | DeepSeek | / | 32.2 | 65.9 | 18.9 | 20.6 |
| 16 | / | MetaMath-Mistral-7B | Meta | / | 31.0 | 26.5 | 0.4 | 0.7 |
| 17 | / | Abel-70B-001 | GAIR | / | 31.0 | 48.4 | 5.3 | 6.1 |
| 18 | / | NuminaMath-7B-TIR | AI2 | / | 30.4 | 54.1 | 13.0 | 13.7 |
| 19 | / | Xwin-Math-13B-V1.0 | Xwin-LM | / | 30.2 | 31.3 | 1.2 | 1.7 |
| 20 | / | MAmmoTH2-8x7B-Plus | AIDC | / | 28.8 | 51.4 | 14.1 | 15.5 |
| 21 | / | Mathstral-7B-v0.1 | Mistral | / | 28.2 | 38.9 | 7.5 | 7.9 |
| 22 | / | Xwin-Math-7B-V1.0 | Xwin-LM | / | 28.1 | 31.3 | 1.2 | 1.7 |
| 23 | / | WizardMath-7B-v1.1 | Microsoft | / | 27.9 | 43.2 | 6.4 | 7.2 |
| 24 | / | Xwin-Math-70B-V1.0 | Xwin-LM | / | 25.5 | 25.2 | 1.4 | 1.7 |
| 25 | / | WizardMath-70B-v1.0 | Microsoft | / | 24.2 | 52.9 | 6.3 | 7.4 |
| 26 | / | Abel-13B-001 | GAIR | / | 22.4 | 24.4 | 0.8 | 0.8 |
| 27 | / | rho-math-7b-interpreter-v0.1 | Microsoft | / | 22.3 | 18.3 | 1.9 | 2.1 |
| 28 | / | InternLM2-Math-Plus-20B | Shanghai AI Lab | / | 18.4 | 28.8 | 8.4 | 9.5 |