
GPQA benchmark leaderboard: Testing LLMs on graduate-level questions

  • Falk Thomassen
  • Jul 17, 2024
  • 2 min read

Updated: May 22

The GPQA benchmark is an important LLM evaluation tool.


It assesses how well LLMs handle complex, domain-specific questions in subjects like biology, physics, and chemistry.


Let’s dive in.



Best LLM for the GPQA benchmark

Here is how the main frontier models compare on the GPQA benchmark.

GPQA benchmark leaderboard of frontier LLMs


Last updated: May 2025

Company     Model               Score
Anthropic   Claude 3.7 Sonnet   84.8%
Google      Gemini 2.5          84.0%
OpenAI      o3                  83.3%
xAI         Grok-3              75.4%
Meta        Llama 4 Maverick    69.8%

Google's Gemini 2.0 led with 62.1% as of December 2024, but it has since been surpassed by Anthropic's Claude 3.7 Sonnet, which now holds the top score at 84.8%.


Google's latest Gemini 2.5 model is close behind at 84.0%.


Next comes OpenAI's o3 with 83.3%, followed by xAI's Grok-3 at 75.4%. Meta's Llama 4 Maverick trails the field with a score of 69.8%.


The GPQA benchmark leaderboard shows solid capabilities across all models, with Claude 3.7 Sonnet's 84.8% remaining the top score, followed by Gemini 2.5's 84.0%.



What is the GPQA benchmark leaderboard?

GPQA stands for Graduate-Level Google-Proof Q&A.


It was introduced by Rein et al. (2023) to evaluate how well LLMs can handle challenging questions that require reasoning and domain expertise.


The test includes 448 multiple-choice questions across:

  • Biology

  • Physics

  • Chemistry


The GPQA test is extremely difficult:

  • PhD-level domain experts achieve an average accuracy of 65%

  • Skilled non-experts average only 34%, even with full internet access (hence "Google-proof")


This makes the GPQA benchmark leaderboard a valuable tool for assessing an LLM’s domain-specific reasoning capabilities.
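To make that concrete, here is a minimal sketch of how a GPQA-style evaluation could be scored as four-way multiple choice. It assumes the benchmark's gated Hugging Face release (Idavidrein/gpqa) and its published column names, and ask_model is a hypothetical placeholder for whichever LLM API you call; this is a sketch, not how the labs above ran their official evaluations.

```python
import random

from datasets import load_dataset  # pip install datasets

# Assumption: GPQA is published on Hugging Face as "Idavidrein/gpqa"
# (gated; you must accept the terms first). Config and column names
# below follow the paper's release and should be verified.
ds = load_dataset("Idavidrein/gpqa", "gpqa_main", split="train")


def ask_model(prompt: str) -> str:
    """Hypothetical stand-in: send `prompt` to your LLM of choice
    and return the single letter (A-D) it picks."""
    raise NotImplementedError


correct = 0
letters = "ABCD"
for row in ds:
    # Shuffle the four answer options so position gives nothing away.
    options = [
        row["Correct Answer"],
        row["Incorrect Answer 1"],
        row["Incorrect Answer 2"],
        row["Incorrect Answer 3"],
    ]
    random.shuffle(options)
    prompt = row["Question"] + "\n" + "\n".join(
        f"{letters[i]}) {opt}" for i, opt in enumerate(options)
    )
    # Score the model's letter against the shuffled position of the
    # correct answer.
    pick = ask_model(prompt).strip().upper()[0]
    if options[letters.index(pick)] == row["Correct Answer"]:
        correct += 1

print(f"Accuracy: {correct / len(ds):.1%}")
```

Accuracy here is simply the fraction of questions answered correctly, which is the number the leaderboard above reports.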


Other LLM benchmarks

At BRACAI, we keep track of how the main frontier models perform across multiple benchmarks.


If you have any questions about these benchmarks, or how to get started with AI in your business, feel free to reach out.
