
MATH benchmark - testing LLMs on math problems

The MATH benchmark is an important tool for evaluating the mathematical reasoning of LLMs.


It tests models on problems from high school math competitions.


Let's dive in.



LLM performance based on MATH score

Last updated: July 2024

Company   | Model            | MATH score
OpenAI    | GPT-4o           | 76.6%
Google    | Gemini 1.5 Pro   | 67.7%
Anthropic | Claude 3 Opus    | 61.0%
Meta      | Llama 3 400B     | 57.8%
Google    | Gemini 1.5 Flash | 54.9%
Google    | Gemini 1.0 Ultra | 53.2%
xAI       | Grok-1.5         | 50.6%

There are sizable gaps in MATH scores between the top LLMs.


OpenAI leads with GPT-4o at 76.6%. Google's Gemini 1.5 Pro (67.7%) and Anthropic's Claude 3 Opus (61.0%) also do well, though they trail GPT-4o by roughly 9 and 16 percentage points, respectively.


Overall, the results indicate that even the best AI models still have room to improve in math problem-solving.



What is the MATH benchmark?

The name MATH is, for once, not an acronym; it simply stands for math. The benchmark was introduced by Hendrycks et al. (2021).


The test consists of 12,500 problems from high school math competitions.


The test is hard: the researchers who introduced the dataset found that a PhD student who does not particularly like mathematics scored about 40%, while a three-time IMO gold medalist scored 90%.


They also showed that the best LLM at the time scored at most 6.9%, which shows how far today's frontier models have come.
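
To make that concrete, here is a minimal Python sketch of what a single MATH record roughly looks like and how a final answer can be pulled out of the \boxed{...} expression in the reference solution. The example record and the helper function are illustrative assumptions, not the official evaluation code.

```python
# A minimal sketch of a single MATH-style record. The field names
# ("problem", "level", "type", "solution") follow the original dataset
# release by Hendrycks et al.; treat them as an assumption here.
example = {
    "problem": "What is the remainder when 2^10 is divided by 7?",
    "level": "Level 2",
    "type": "Number Theory",
    "solution": r"Since $2^3 \equiv 1 \pmod 7$, we have "
                r"$2^{10} = 2^9 \cdot 2 \equiv 2 \pmod 7$, "
                r"so the remainder is $\boxed{2}$.",
}


def extract_boxed_answer(solution: str) -> str | None:
    """Return the contents of the last \\boxed{...} in a solution string."""
    start = solution.rfind(r"\boxed{")
    if start == -1:
        return None
    i = start + len(r"\boxed{")
    answer_start = i
    depth = 0  # track nested braces, e.g. \boxed{\frac{1}{2}}
    while i < len(solution):
        if solution[i] == "{":
            depth += 1
        elif solution[i] == "}":
            if depth == 0:
                return solution[answer_start:i]
            depth -= 1
        i += 1
    return None


print(extract_boxed_answer(example["solution"]))  # -> "2"
```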



Pros & cons of the MATH benchmark

Pros

  • Covers a wide range of topics and difficulty levels, providing a comprehensive test of mathematical problem-solving skills
  • Problems are scored with exact match on the final answer, providing a clear and precise evaluation metric (see the scoring sketch after this list)

Cons

  • Generating step-by-step solutions can sometimes decrease accuracy, because errors in intermediate steps can derail the entire solution
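
Because grading is exact match on the final answer, the metric is easy to reproduce. Below is a minimal sketch assuming a light normalization step before comparing strings; the normalization rules and function names here are illustrative, not the official grading script.

```python
def normalize(answer: str) -> str:
    """Light canonicalization before string comparison (illustrative rules)."""
    answer = answer.strip()
    answer = answer.replace(" ", "")
    # Strip surrounding dollar signs and \left / \right wrappers if present.
    answer = answer.strip("$")
    answer = answer.replace("\\left", "").replace("\\right", "")
    return answer


def exact_match(predicted: str, reference: str) -> bool:
    """A prediction counts as correct only if the normalized strings are identical."""
    return normalize(predicted) == normalize(reference)


def accuracy(predictions: list[str], references: list[str]) -> float:
    """Fraction of problems solved, i.e. the MATH score."""
    correct = sum(exact_match(p, r) for p, r in zip(predictions, references))
    return correct / len(references)


print(exact_match(" 2 ", "2"))                        # True
print(exact_match(r"\frac{1}{2}", "0.5"))             # False
print(accuracy(["2", r"\frac{1}{2}"], ["2", "0.5"]))  # 0.5
```

Note that the second example counts as wrong even though the two answers are mathematically equivalent: exact match is precise but unforgiving about answer formatting.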



