GPQA benchmark leaderboard: Testing LLMs on graduate-level questions

Falk Thomassen
Jul 17, 2024
1 min read

Updated: Jul 31

The GPQA benchmark is an important LLM evaluation tool.

It assesses how well LLMs handle complex, domain-specific questions in subjects like biology, physics, and chemistry.

Let’s dive in.

Best LLM for the GPQA benchmark

Comparing the main frontier models on the GPQA benchmark.

GPQA benchmark leaderboard of frontier LLMs

Last updated: July, 2025

Company	Model	Score	Source
xAI	Grok 4	87.5%	link
Google	Gemini 2.5 Pro	86.4%	link
Anthropic	Claude Sonnet 4	83.8%	link
OpenAI	GPT-o3	83.3%	link
Meta	Llama 4	69.8%	link

xAI’s Grok 4 now leads the GPQA benchmark with an impressive 87.5%, marking a notable jump ahead of the competition.

Google’s Gemini 2.5 Pro follows closely with 86.4%, maintaining its position among the top performers. Anthropic’s Claude Sonnet 4 comes in third at 83.8%, still showcasing strong reasoning capabilities.

OpenAI’s GPT-o3 scores 83.3%, reflecting steady performance, while Meta’s Llama 4 lags behind with 69.8%, a significant gap from the frontrunners.

What is the GPQA benchmark leaderboard?

GPQA stands for graduate-level Google-proof Q&A.

It was introduced by Rein et al. (2023) to evaluate how well LLMs can handle challenging questions that require reasoning and domain expertise.

The test includes 448 questions across:

Biology
Physics
Chemistry

The GPQA test is extremely difficult:

Experts achieve an average accuracy of 65%
Non-experts with internet access average only 34%

This makes the GPQA benchmark leaderboard a valuable tool for assessing an LLM’s domain-specific reasoning capabilities.

Other LLM benchmarks

At BRACAI, we keep track of how the main frontier models perform across multiple benchmarks.

If you have any questions about these benchmarks, or how to get started with AI in your business, feel free to reach out.