
GPQA benchmark - testing LLMs on graduate-level questions

The GPQA benchmark is an important LLM evaluation tool.


It assesses how well LLMs handle complex, domain-specific questions in biology, physics, and chemistry.


Let’s dive in.



LLM performance based on GPQA score

Last updated: July 2024

| Company | Model | GPQA score |
| --- | --- | --- |
| OpenAI | GPT-4o | 53.6% |
| Meta | Llama 3 400B | 48.0% |
| Google | Gemini 1.5 Pro | 46.2% |
| Meta | Llama 3 70B | 39.5% |
| Google | Gemini 1.5 Flash | 39.5% |
| Inflection | Inflection 2.5 | 38.4% |
| OpenAI | GPT-4 | 35.7% |

OpenAI's GPT-4o leads with 53.6%, followed closely by Meta’s Llama 3 400B at 48.0% and Google’s Gemini 1.5 Pro at 46.2%.


The field is competitive, but every model still has substantial headroom: even the leaders fall well short of expert-level accuracy, and OpenAI’s earlier GPT-4 scores just 35.7%.



What is the GPQA benchmark?

GPQA stands for "graduate-level Google-proof Q&A." It was introduced by Rein et al. (2023).


The test includes 448 questions across biology, physics, and chemistry.


The test is extremely difficult: expert validators reach roughly 65% accuracy, while non-experts with unrestricted internet access average only 34%.
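
As a rough illustration of how GPQA is typically scored, here is a minimal evaluation-loop sketch in Python. It assumes questions in the layout of the Idavidrein/gpqa dataset on Hugging Face (a "Question" field, a "Correct Answer", and three "Incorrect Answer" fields), and ask_model is a hypothetical stand-in for whatever LLM API is being evaluated.

```python
import random


def evaluate_gpqa(examples, ask_model, seed=0):
    """Score a model on GPQA-style multiple-choice questions.

    examples: iterable of dicts with "Question", "Correct Answer", and
        "Incorrect Answer 1" through "Incorrect Answer 3" fields
        (the layout used by the Idavidrein/gpqa dataset).
    ask_model: hypothetical callable that takes a prompt string and
        returns a single letter "A"-"D".
    """
    rng = random.Random(seed)
    letters = "ABCD"
    correct = total = 0

    for ex in examples:
        # Shuffle the gold answer in among the distractors so the model
        # cannot exploit a fixed answer position.
        options = [
            ex["Correct Answer"],
            ex["Incorrect Answer 1"],
            ex["Incorrect Answer 2"],
            ex["Incorrect Answer 3"],
        ]
        rng.shuffle(options)
        gold = letters[options.index(ex["Correct Answer"])]

        prompt = (
            ex["Question"]
            + "\n"
            + "\n".join(f"({letter}) {opt}" for letter, opt in zip(letters, options))
            + "\nAnswer with a single letter."
        )

        prediction = ask_model(prompt).strip().upper()[:1]
        correct += int(prediction == gold)
        total += 1

    # GPQA reports plain accuracy: the fraction of questions answered correctly.
    return correct / total if total else 0.0
```

The scores in the table above are this accuracy figure, so a model guessing at random across four options would land near 25%.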



Pros & cons of the GPQA benchmark

Pros:

  • Evaluates LLMs on advanced, real-world questions

  • Ensures high-quality, expert-validated questions

  • Tests practical knowledge application in specialized domains

Cons:

  • Some questions may be overly complex

  • Limited dataset size

  • Requires substantial domain knowledge, limiting broader use

