COUNTERMATH
Counterexample-Driven Conceptual Reasoning in Mathematical LLMs
What is COUNTERMATH?
COUNTERMATH is a high-quality, university-level mathematical benchmark that evaluates Large Language Models' ability to carry out mathematical reasoning and proofs through counterexamples. It contains 1,216 mathematical statements that LLMs must prove or disprove, refuting false claims by providing counterexamples, and thereby assesses their grasp of mathematical concepts.
Inspired by the pedagogical method of proof by counterexample commonly used in human mathematics education, COUNTERMATH aims to move beyond drill-based learning and foster a deeper understanding of mathematical theorems and related concepts in LLMs.
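To make the task concrete, the sketch below shows what a single COUNTERMATH-style item and the corresponding prompt might look like. The field names (`statement`, `judgement`, `rationale`) and the example statement are illustrative assumptions, not the dataset's actual schema.

```python
# Illustrative sketch only: field names and the example statement are assumptions,
# not the official COUNTERMATH schema.
example_item = {
    "statement": "Every bounded sequence of real numbers is convergent.",  # claim to prove or disprove
    "judgement": "False",  # gold label: the statement does not hold
    "rationale": (
        "Counterexample: the sequence a_n = (-1)^n is bounded by 1 "
        "but oscillates between -1 and 1, so it does not converge."
    ),
}

def build_prompt(item: dict) -> str:
    """Turn one item into a prove-or-disprove prompt for an LLM."""
    return (
        "Judge whether the following statement is True or False. "
        "If it is False, give an explicit counterexample.\n\n"
        f"Statement: {item['statement']}"
    )

print(build_prompt(example_item))
```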
If you use COUNTERMATH in your research, please cite our paper.
Getting started
COUNTERMATH is released under the CC BY-SA 4.0 license. The dataset and evaluation tools can be downloaded from the links below.
To evaluate your model on our dataset, please refer to our submission guidelines. Once you have completed your evaluation, submit your results using the method below; we will then score them and update the leaderboard.
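For reference, here is a minimal sketch of how the headline leaderboard metric (Judgement F1 (macro)) could be computed from a results file. The JSONL format, the field names (`judgement`, `prediction`), and the use of scikit-learn are assumptions on our part, not the official scoring script.

```python
# Minimal scoring sketch: assumes a JSONL results file with one object per
# statement containing a gold "judgement" and a model "prediction" label
# ("True"/"False"). This is not the official COUNTERMATH evaluation script.
import json
from sklearn.metrics import f1_score

def judgement_macro_f1(results_path: str) -> float:
    gold, pred = [], []
    with open(results_path, encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            gold.append(record["judgement"].strip().lower())
            pred.append(record["prediction"].strip().lower())
    # Macro F1 averages the per-class F1 over the "true" and "false" classes.
    return f1_score(gold, pred, average="macro")

# Example: print(judgement_macro_f1("results.jsonl"))
```

The rationale-level metrics (Strict/Loose) additionally check the provided counterexamples against references and are not covered by this sketch.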
Stay connected!
Join our community to receive updates or participate in discussions about COUNTERMATH!
- Follow our GitHub repository
- View paper on OpenReview
- Contact us: zhikunxu@asu.edu
Leaderboard
The following table shows the performance of various models on COUNTERMATH, ranked by Judgement F1 (macro):
| Rank | Date | Model | Organization | Code | Judgement F1 (macro) | Judgement Examples (%) | Rationale Strict (%) | Rationale Loose (%) |
|---|---|---|---|---|---|---|---|---|
| 1 | / | Deepseek-R1 | DeepSeek | / | 80.7 | 86.8 | 54.2 | 65.3 |
| 2 | / | Gemini2.5-pro | Google | / | 77.0 | 90.8 | 65.1 | 75.7 |
| 3 | / | Claude3.7-sonnet | Anthropic | / | 64.8 | 78.0 | 45.0 | 52.5 |
| 4 | / | OpenAI o1-preview | OpenAI | / | 60.1 | 55.8 | 39.8 | 40.9 |
| 5 | / | GPT-4o | OpenAI | / | 59.0 | 44.7 | 19.7 | 21.3 |
| 6 | / | Qwen-max | Alibaba | / | 58.9 | 61.8 | 30.4 | 33.9 |
| 7 | / | Qwen2.5-Math-72B-Instruct | Alibaba | / | 41.8 | 76.6 | 38.9 | 41.6 |
| 8 | / | QwQ-32B-Preview | Alibaba | / | 39.9 | 70.0 | 38.6 | 43.8 |
| 9 | / | Qwen2.5-Math-7B-Instruct | Alibaba | / | 38.3 | 74.2 | 30.2 | 33.2 |
| 10 | / | Eurus-2-7B-PRIME | Mistral | / | 37.5 | 64.8 | 28.5 | 32.0 |
| 11 | / | InternLM2-Math-Plus-Mixtral8x22B | Shanghai AI Lab | / | 37.3 | 63.2 | 21.5 | 23.1 |
| 12 | / | Abel-7B-002 | GAIR | / | 34.4 | 66.1 | 16.0 | 17.9 |
| 13 | / | InternLM2-Math-Plus-7B | Shanghai AI Lab | / | 33.9 | 36.6 | 9.0 | 9.5 |
| 14 | / | InternVL2-7B-Plus | Shanghai AI Lab | / | 32.3 | 54.2 | 10.7 | 12.1 |
| 15 | / | Deepseek-Math-7B-rl | DeepSeek | / | 32.2 | 65.9 | 18.9 | 20.6 |
| 16 | / | MetaMath-Mistral-7B | Meta | / | 31.0 | 26.5 | 0.4 | 0.7 |
| 17 | / | Abel-70B-001 | GAIR | / | 31.0 | 48.4 | 5.3 | 6.1 |
| 18 | / | NuminaMath-7B-TIR | AI2 | / | 30.4 | 54.1 | 13.0 | 13.7 |
| 19 | / | Xwin-Math-13B-V1.0 | Xwin-LM | / | 30.2 | 31.3 | 1.2 | 1.7 |
| 20 | / | MAmmoTH2-8x7B-Plus | AIDC | / | 28.8 | 51.4 | 14.1 | 15.5 |
| 21 | / | Mathstral-7B-v0.1 | Mistral | / | 28.2 | 38.9 | 7.5 | 7.9 |
| 22 | / | Xwin-Math-7B-V1.0 | Xwin-LM | / | 28.1 | 31.3 | 1.2 | 1.7 |
| 23 | / | WizardMath-7B-v1.1 | Microsoft | / | 27.9 | 43.2 | 6.4 | 7.2 |
| 24 | / | Xwin-Math-70B-V1.0 | Xwin-LM | / | 25.5 | 25.2 | 1.4 | 1.7 |
| 25 | / | WizardMath-70B-v1.0 | Microsoft | / | 24.2 | 52.9 | 6.3 | 7.4 |
| 26 | / | Abel-13B-001 | GAIR | / | 22.4 | 24.4 | 0.8 | 0.8 |
| 27 | / | rho-math-7b-interpreter-v0.1 | Microsoft | / | 22.3 | 18.3 | 1.9 | 2.1 |
| 28 | / | InternLM2-Math-Plus-20B | Shanghai AI Lab | / | 18.4 | 28.8 | 8.4 | 9.5 |