
MMLU benchmark - testing LLMs' multi-task capabilities

Updated: Jul 17

The MMLU benchmark is an important LLM evaluation tool.


It tests LLMs' multi-task capabilities.


Let’s dive in.



LLM performance based on MMLU score

Last updated: July, 2024

Company    | Model            | MMLU
OpenAI     | GPT-4o           | 88.7%
Anthropic  | Claude 3 Opus    | 86.8%
OpenAI     | GPT-4            | 86.4%
Meta       | Llama 3 400B     | 86.1%
Google     | Gemini 1.5 Pro   | 85.9%
Inflection | Inflection 2.5   | 85.5%
Google     | Gemini 1.0 Ultra | 83.7%

The top seven LLMs all post high MMLU scores, ranging from 83.7% to 88.7%. That narrow spread points to intense competition and rapid progress across the board.


OpenAI stands out with the highest score for GPT-4o (88.7%), underscoring its lead in language model technology. Anthropic and Meta also perform strongly with their models, reflecting a highly competitive landscape in AI development.



What is the MMLU benchmark?

MMLU stands for "massive multitask language understanding." It was introduced by Hendrycks et al. (2021).


The test covers 57 tasks. These include elementary math, US history, computer science, law, and more.


The MMLU checks how well LLMs use their knowledge to solve real-world problems. It's not just about what the model knows, but how it uses that knowledge.
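
MMLU questions are four-option multiple choice, and a model earns credit when it picks the correct letter. Here is a minimal sketch of how a single question could be formatted and scored; `ask_model` is a hypothetical stand-in for whatever LLM API you use, and the example question is invented for illustration.

```python
# Minimal sketch of scoring one MMLU-style multiple-choice question.
# `ask_model` is a hypothetical callable that takes a prompt string and
# returns the model's answer as text (ideally a single letter A-D).

def format_prompt(question, choices):
    # MMLU-style layout: the question followed by lettered answer options.
    letters = ["A", "B", "C", "D"]
    lines = [question]
    lines += [f"{letter}. {choice}" for letter, choice in zip(letters, choices)]
    lines.append("Answer:")
    return "\n".join(lines)

def score_question(question, choices, correct_letter, ask_model):
    prompt = format_prompt(question, choices)
    prediction = ask_model(prompt).strip().upper()
    # Credit the model only if its answer starts with the gold letter.
    return 1.0 if prediction.startswith(correct_letter) else 0.0

# Example usage with a made-up question and a dummy model call:
result = score_question(
    "What is the worst-case time complexity of binary search?",
    ["O(n)", "O(log n)", "O(n log n)", "O(1)"],
    "B",
    ask_model=lambda prompt: "B",  # placeholder for a real LLM call
)
print(result)  # 1.0
```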


The final MMLU score is the average of the model's accuracy across all 57 tasks. This gives a broad view of its capabilities rather than rewarding strength in a single subject.
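
As a rough sketch of that averaging, the snippet below computes a final score as the mean of per-task accuracy. The task names and 0/1 question results are invented placeholders, not real benchmark data; a real run would cover all 57 tasks.

```python
# Minimal sketch of the final MMLU score: average the per-task accuracies.
# The per-question 0/1 results below are invented placeholders.

results = {
    "elementary_mathematics": [1, 0, 1, 1],
    "us_history": [1, 1, 0, 1],
    "computer_security": [1, 1, 1, 0],
    # ...remaining tasks
}

def mmlu_score(results):
    per_task_accuracy = [sum(scores) / len(scores) for scores in results.values()]
    return sum(per_task_accuracy) / len(per_task_accuracy)

print(f"MMLU score: {mmlu_score(results):.1%}")
```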



Pros & cons with the MMLU benchmark

Pros

Cons

  • Standard framework for comparing models

  • Tests many topics at different levels

  • Some questions lack context, making it hard for models to answer right

  • Dataset appears to contain errors

