GPQA benchmark leaderboard: testing LLMs on graduate-level science questions
- Jul 17, 2024
- 2 min read
Updated: Mar 30
Ever wondered which AI model is best at hard science?
The GPQA diamond benchmark is one of the strongest tools we have to measure this.
It tests how well AI models handle graduate-level questions in biology, physics, and chemistry, and it is designed to be difficult even for PhD-level experts.

Why should you care?
GPQA is not a trivia quiz. It is a brutally hard science test.
For LLMs, it is one of the best proxies we have for:
domain knowledge you can trust
step-by-step scientific reasoning
fewer confident wrong answers
So if your work touches research, engineering, healthcare, climate, or technical content, GPQA scores are a useful signal when choosing an AI model.
Not sure which AI model to pick?
Read our full guide to the best LLMs
Best LLM on the GPQA diamond benchmark (leaderboard)
The GPQA diamond benchmark shows something interesting: even with much harder questions, frontier models still score very high.
The race at the top is tightening
Google’s Gemini 3 Pro currently leads the benchmark with 94%.
Right behind it are several other top-tier models:
OpenAI GPT-5.4 — 93%
Claude Opus 4.6 — 91%
Grok 4 — 87%
At this level, the difference is no longer about basic scientific knowledge.
Instead, it reflects how reliably a model can reason through the hardest expert-level questions in biology, physics, and chemistry.
A few additional frontier systems are also performing strongly:
DeepSeek Reasoner — 83%
Qwen 3 — 80%
These scores suggest that multiple AI labs are now building models capable of serious scientific reasoning, not just factual recall.
The gap widens quickly outside the frontier
Performance drops off noticeably once you move beyond the leading models.
Meta Llama 4 Maverick — 67%
Mistral Medium 2.5 — 60%
That difference matters.
If you rely on AI for technical research, engineering support, or scientific analysis, seemingly small accuracy gaps translate into far higher error rates on complex problems.
GPQA diamond therefore remains one of the clearest benchmarks for separating true frontier reasoning models from the rest.
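The point about accuracy gaps is easy to quantify. Comparing error rates instead of scores makes the difference concrete; the sketch below uses the leaderboard figures quoted above (94% for the leader, 67% for the strongest non-frontier model listed).

```python
# Accuracy gaps look small, but error rates tell the real story:
# a 94% model is wrong on 6% of questions, a 67% model on 33%.
top_accuracy = 0.94  # leaderboard leader (Gemini 3 Pro, per the list above)
mid_accuracy = 0.67  # strongest non-frontier model listed (Llama 4 Maverick)

top_error = 1 - top_accuracy
mid_error = 1 - mid_accuracy

# How many times more often the weaker model gets a question wrong
print(f"Error rate multiplier: {mid_error / top_error:.1f}x")  # → 5.5x
```

In other words, a 27-point score gap means the weaker model produces wrong answers more than five times as often, which is what matters in practice.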
What is the GPQA benchmark?

GPQA stands for "Graduate-Level Google-Proof Q&A".
It was introduced by Rein et al. (2023) to evaluate how well LLMs handle questions that require real scientific expertise.
The test includes 448 questions across:
Biology
Physics
Chemistry
It is extremely difficult for humans:
PhD-level experts average around 65%
Skilled non-experts, even with full web access, reach only 34%
That is why GPQA has become a key benchmark for testing scientific reasoning.
What is the GPQA diamond benchmark?
GPQA diamond is the harder subset of the main GPQA benchmark.
It includes only the most challenging 198 questions, selected to separate true experts from everyone else.
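How that selection works can be sketched in a few lines. Per Rein et al. (2023), the diamond subset keeps questions that both expert validators answered correctly but that most non-expert validators got wrong; the field names below are illustrative, not the dataset's actual schema.

```python
# Hedged sketch of the diamond selection rule from Rein et al. (2023):
# keep questions both experts got right and most non-experts got wrong.
# Field names are illustrative placeholders, not the real dataset schema.
questions = [
    {"id": 1, "experts_correct": 2, "nonexperts_correct": 0, "nonexperts_total": 3},
    {"id": 2, "experts_correct": 1, "nonexperts_correct": 1, "nonexperts_total": 3},
    {"id": 3, "experts_correct": 2, "nonexperts_correct": 2, "nonexperts_total": 3},
]

diamond = [
    q for q in questions
    if q["experts_correct"] == 2                                # both experts right
    and q["nonexperts_correct"] <= q["nonexperts_total"] // 2   # majority of non-experts wrong
]
print([q["id"] for q in diamond])  # → [1]
```

Filtering this way is exactly why diamond separates true experts from everyone else: a question survives only if expertise helps and general search-style knowledge does not.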
The gap is striking:
A test where top human experts still struggle is now close to solvable by the best AI systems.
That is why GPQA diamond matters. It is one of the few benchmarks left that can still distinguish the very best models from the rest.
Ready to apply AI to your work?
Benchmarks are useful, but real business impact is about execution.
We run hands-on AI workshops and build tailored AI solutions, fast.


