
MMMU-Pro benchmark leaderboard: testing multimodal AI for visual understanding

Updated: Jan 29

Ever wondered which AI model is best at understanding images?


The MMMU-Pro benchmark is one of the better tests we have for this.


It measures multimodal skill on real, messy, expert-level questions: the kind that include diagrams, tables, charts, and screenshots.


MMMU-Pro benchmark leaderboard (chart). Benchmark data last checked: January 2026.

Why should you care?

This is not an “image caption” test. It is a reasoning benchmark.


For multimodal AI, it is one of the best proxies we have for:

  • visual understanding you can trust

  • combining text + images correctly

  • solving real-world diagram problems

  • fewer confident wrong answers


So if your work involves reports, engineering docs, medical scans, or technical visuals, MMMU-Pro scores actually matter.


Not sure which AI model to pick?

Read our full guide to the best LLMs


Best LLM on the MMMU-Pro benchmark (leaderboard)

The MMMU-Pro benchmark shows something important:

Multimodal reasoning is improving fast…

But it is still not solved.


The top tier is close

  • Gemini 3 Pro leads with 81%

  • GPT-5.2 follows right behind at 80%

  • Claude Opus comes in at 74%


At this level, the difference is no longer “can it see?”

It is: “How reliably can it reason over the hardest visuals?”


The middle tier drops quickly

  • Qwen scores 69%

  • Ernie lands at 65%

  • Grok and Llama sit around 62–63%


That gap matters.

These models will fail more often on complex diagrams and technical charts.


Mistral falls behind

  • Mistral scores 56%


That is far behind the frontier leaders.

Vision remains a capability where only the top models compete.


DeepSeek is the clear outlier

  • DeepSeek V3.2-exp scores 5%


That is basically unusable for multimodal reasoning.

Most likely this reflects either a weak or incomplete vision stack, or a misconfigured evaluation run.

Either way, it is a reminder: Not every model marketed as “frontier” is truly multimodal.
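

If you want to eyeball these gaps yourself, here is a minimal sketch that re-plots the scores quoted above with matplotlib. The numbers are copied from this leaderboard (Grok and Llama are approximated from the quoted 62–63% range), not pulled from a live source.

```python
import matplotlib.pyplot as plt

# Scores quoted in this leaderboard (MMMU-Pro accuracy, %).
# Grok and Llama are approximated from the quoted "62-63%" range.
scores = {
    "Gemini 3 Pro": 81,
    "GPT-5.2": 80,
    "Claude Opus": 74,
    "Qwen": 69,
    "Ernie": 65,
    "Grok": 63,
    "Llama": 62,
    "Mistral": 56,
    "DeepSeek V3.2-exp": 5,
}

models = list(scores)
values = list(scores.values())

# Horizontal bars, best model at the top
fig, ax = plt.subplots(figsize=(8, 5))
ax.barh(models[::-1], values[::-1])
ax.set_xlabel("MMMU-Pro accuracy (%)")
ax.set_xlim(0, 100)
ax.set_title("MMMU-Pro benchmark leaderboard")
fig.tight_layout()
plt.show()
```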


What is the MMMU benchmark?

MMMU stands for Massive Multi-discipline Multimodal Understanding and Reasoning.


It was introduced by Yue et al. (2024) to evaluate multimodal models on expert-level tasks that integrate text and images.


The benchmark includes 11.5K college-level questions sourced from:

  • exams

  • quizzes

  • textbooks


Covering six disciplines:

  • art and design

  • business

  • science

  • health and medicine

  • humanities and social science

  • tech and engineering


Unlike simpler vision tests, MMMU focuses on deep reasoning, not just perception.
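

If you want to browse the raw questions, here is a minimal sketch using the Hugging Face `datasets` library. It assumes MMMU is published on the Hub as `MMMU/MMMU` with one config per subject and the field names shown below; check the dataset card for the exact repo id, configs, and fields.

```python
from datasets import load_dataset

# Assumption: MMMU is hosted on the Hugging Face Hub as "MMMU/MMMU",
# with one config per subject (e.g. "Accounting") and dev/validation/test
# splits. Check the dataset card for the exact repo id, configs, and fields.
subject = "Accounting"
mmmu = load_dataset("MMMU/MMMU", subject, split="validation")

example = mmmu[0]
print(example["question"])  # question text, which may reference an image
print(example["options"])   # multiple-choice options
print(example["answer"])    # gold answer letter
# Images are stored in separate fields (e.g. example["image_1"]) in this
# assumed schema; field names may differ in the published dataset.
```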


What is the MMMU-Pro benchmark?

MMMU-Pro is the upgraded version.

It was introduced to make the benchmark harder and more realistic.


The key idea:

  • Modern models were starting to saturate the original MMMU

  • So MMMU-Pro makes the same material harder and more realistic


In practice, that means:

  • filtering out questions a text-only model can answer without seeing the image

  • expanding the multiple-choice options from four to ten

  • a vision-only setting where the question is embedded in a screenshot or photo, complete with noisy backgrounds and real-world formatting


It is closer to what multimodal AI faces in business and research settings.

MMMU-Pro is now one of the best benchmarks left for testing true visual reasoning.
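

To make the screenshot-style setup concrete, here is a minimal sketch of running one vision-only question through a multimodal chat API, shown with the OpenAI Python client. The model name, prompt, and file name are placeholders, and MMMU-Pro's official evaluation uses its own prompts and answer extraction, so treat this as illustration only.

```python
import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def ask_screenshot_question(image_path: str, model: str = "gpt-4o") -> str:
    """Send a screenshot-style multiple-choice question to a multimodal model.

    In MMMU-Pro's vision-only setting the question and options are embedded
    in the image itself, so the text prompt only asks for the answer letter.
    """
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")

    response = client.chat.completions.create(
        model=model,  # placeholder; swap in whichever multimodal model you test
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": ("Answer the multiple-choice question shown in the "
                          "image. Reply with the option letter only.")},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content.strip()


# Example usage (hypothetical file name):
# print(ask_screenshot_question("mmmu_pro_question.png"))
```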


Ready to apply AI to your work?

Benchmarks are useful, but real business impact is about execution.

We run hands-on AI workshops and build tailored AI solutions, fast.


