
GPQA benchmark leaderboard: testing LLMs on graduate-level science questions

  • Jul 17, 2024
  • 2 min read

Updated: Mar 30

Ever wondered which AI model is best at hard science?


The GPQA diamond benchmark is one of the strongest tools we have to measure this.


It tests how well AI models handle graduate-level questions in biology, physics, and chemistry, and it is designed to be difficult even for PhD-level experts.


GPQA benchmark leaderboard of frontier LLMs
Benchmark data last checked: March 2026

Why should you care?

GPQA is not a trivia quiz. It is a brutally hard science test.


For LLMs, it is one of the best proxies we have for:

  • domain knowledge you can trust

  • step-by-step scientific reasoning

  • fewer confident wrong answers


So if your work touches research, engineering, healthcare, climate, or technical content, GPQA scores are a useful signal when choosing an AI model.


Not sure which AI model to pick?

Read our full guide to the best LLMs


Best LLM on the GPQA diamond benchmark (leaderboard)

The GPQA diamond benchmark shows something interesting: even with much harder questions, frontier models still score very high.


The top of the leaderboard is converging

Google’s Gemini 3 Pro currently leads the benchmark with 94%.

Right behind it are several other top-tier models:

  • OpenAI GPT-5.4 — 93%

  • Claude Opus 4.6 — 91%

  • Grok 4 — 87%


At this level, the difference is no longer about basic scientific knowledge.

Instead, it reflects how reliably a model can reason through the hardest expert-level questions in biology, physics, and chemistry.


A few additional frontier systems are also performing strongly:

  • DeepSeek Reasoner — 83%

  • Qwen 3 — 80%


These scores suggest that multiple AI labs are now building models capable of serious scientific reasoning, not just factual recall.


The gap widens quickly outside the frontier

Performance drops off noticeably once you move beyond the leading models.

  • Meta Llama 4 Maverick — 67%

  • Mistral Medium 2.5 — 60%


That difference matters.

If you rely on AI for technical research, engineering support, or scientific analysis, what looks like a modest accuracy gap translates into a much higher rate of incorrect answers on complex problems.
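To make that concrete, here is a quick back-of-the-envelope sketch (plain Python, using the scores from the leaderboard above). The useful trick is to look at error rates rather than accuracy: a drop from 94% to 67% accuracy means the error rate jumps from 6% to 33%, roughly 5.5 times more wrong answers per 100 questions.

```python
# Back-of-the-envelope comparison: GPQA diamond accuracy vs. error rate.
# Scores are taken from the leaderboard in this article.
scores = {
    "Gemini 3 Pro": 0.94,
    "GPT-5.4": 0.93,
    "Claude Opus 4.6": 0.91,
    "Grok 4": 0.87,
    "DeepSeek Reasoner": 0.83,
    "Qwen 3": 0.80,
    "Llama 4 Maverick": 0.67,
    "Mistral Medium 2.5": 0.60,
}

# Error rate of the current leader (1 - 0.94 = 0.06).
best_error = 1 - max(scores.values())

for model, acc in sorted(scores.items(), key=lambda kv: -kv[1]):
    error = 1 - acc
    # How many times more wrong answers than the leader produces.
    print(f"{model:20s} accuracy {acc:.0%}  error rate {error:.0%}  "
          f"{error / best_error:.1f}x the leader's errors")
```

Reading the table this way makes the point of the section: a 27-point accuracy gap is not "a bit worse", it is several times as many mistakes on exactly the kind of hard question you wanted help with.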


GPQA diamond therefore remains one of the clearest benchmarks for separating true frontier reasoning models from the rest.


What is the GPQA benchmark?

GPQA example

GPQA stands for "Graduate-Level Google-Proof Q&A".


It was introduced by Rein et al. (2023) to evaluate how well LLMs handle questions that require real scientific expertise.


The test includes 448 questions across:

  • Biology

  • Physics

  • Chemistry


It is extremely difficult for humans:

  • PhD-level experts average around 65%

  • Skilled non-experts, even with full web access, reach only 34%


That is why GPQA has become a key benchmark for testing scientific reasoning.


What is the GPQA diamond benchmark?

GPQA diamond is the harder subset of the full GPQA benchmark.


It includes only the most challenging 198 questions, selected to separate true experts from everyone else.


The gap is striking:

A test where top human experts still struggle is now close to solvable by the best AI systems.


That is why GPQA diamond matters. It is one of the few benchmarks left that can still distinguish the very best models from the rest.


Ready to apply AI to your work?

Benchmarks are useful, but real business impact is about execution.

We run hands-on AI workshops and build tailored AI solutions, fast.


