
DeepSeek performance: How it compares to top AI models


DeepSeek has released a remarkable new LLM that’s going head-to-head with ChatGPT on key performance benchmarks.


Curious how it stacks up—and what it means for the future of AI?


Let’s dive in.


How does DeepSeek perform vs frontier AI models?

Here’s how DeepSeek’s performance metrics stack up against top AI models on key LLM benchmarks.

DeepSeek performance compared to top AI models on key benchmarks like MMLU, MATH, and Chatbot Arena (last updated: February 2025):

| LLM | Company | MMLU | MATH | GPQA | HumanEval | Chatbot Arena |
|---|---|---|---|---|---|---|
| GPT-4o | OpenAI | 88.7% | 76.6% | 53.6% | 90.2% | 98.8% |
| Claude 3 Opus | Anthropic | 86.8% | 60.1% | 50.4% | 84.9% | 92.8% |
| Gemini 2.0 | Google | 76.4% | 89.7% | 62.1% | N/A | 100.0% |
| Llama 3.1 405B | Meta | 88.6% | 73.8% | 51.1% | 89.0% | 91.8% |
| Grok-2 | xAI | 87.5% | 76.1% | 56.0% | 88.4% | 93.2% |
| DeepSeek-V3 | DeepSeek | 88.5% | 90.2% | 59.1% | 82.6% | 95.3% |

Source: DeepSeek-V3 Technical Report.


DeepSeek-V3 leads on the MATH benchmark with an impressive 90.2%, the highest score in this comparison. It also holds strong on GPQA and ranks competitively in the Chatbot Arena, where conversational capabilities are put to the test. While it lags slightly on HumanEval (coding), the metrics show it's closing the gap with more established models like OpenAI's GPT-4o and Anthropic's Claude 3 Opus.
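
If you want to slice these numbers yourself, here's a minimal Python sketch (standard library only, scores copied from the table above) that prints the leader on each benchmark. Chatbot Arena is left out, and Gemini 2.0's missing HumanEval score is treated as unavailable.

```python
# Benchmark scores copied from the table above (None = not reported).
scores = {
    "GPT-4o":         {"MMLU": 88.7, "MATH": 76.6, "GPQA": 53.6, "HumanEval": 90.2},
    "Claude 3 Opus":  {"MMLU": 86.8, "MATH": 60.1, "GPQA": 50.4, "HumanEval": 84.9},
    "Gemini 2.0":     {"MMLU": 76.4, "MATH": 89.7, "GPQA": 62.1, "HumanEval": None},
    "Llama 3.1 405B": {"MMLU": 88.6, "MATH": 73.8, "GPQA": 51.1, "HumanEval": 89.0},
    "Grok-2":         {"MMLU": 87.5, "MATH": 76.1, "GPQA": 56.0, "HumanEval": 88.4},
    "DeepSeek-V3":    {"MMLU": 88.5, "MATH": 90.2, "GPQA": 59.1, "HumanEval": 82.6},
}

for benchmark in ["MMLU", "MATH", "GPQA", "HumanEval"]:
    ranked = sorted(
        ((model, s[benchmark]) for model, s in scores.items() if s[benchmark] is not None),
        key=lambda pair: pair[1],
        reverse=True,
    )
    leader, top_score = ranked[0]
    print(f"{benchmark:>9}: best = {leader} ({top_score}%)")
```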


What is DeepSeek?

DeepSeek is a Chinese AI lab founded by hedge fund manager Liang Wenfeng. In late December 2024 the company released DeepSeek-V3, a cost-effective LLM, followed by the reasoning model DeepSeek-R1 in January 2025.


✅ The good:

  • Built for a fraction of ChatGPT’s cost, showing how efficiently they’ve been able to innovate

  • Considerably cheaper to run (relevant for developers)

  • Strong in reasoning capabilities and solving complex problems


⚠️ The bad:

  • Collects a lot of data (chat history, files, personal info, payment details)

  • Questionable privacy controls

  • Data is stored on servers in China, making it subject to government regulations and access

  • Censorship issues (e.g. topics like Taiwan, Tiananmen Square)

  • Can be very slow right now due to high demand


Tip: Use DeepSeek within Azure to avoid censorship, with data stored in the US and Europe.
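
For developers who want to try it, DeepSeek's own API is OpenAI-compatible, and hosted deployments (such as one in Azure) typically expose a similar chat-completions interface. Here's a minimal sketch using the openai Python package; the environment variables and model name are placeholders to point at whichever endpoint fits your data requirements.

```python
import os
from openai import OpenAI  # pip install openai

# Placeholder configuration: set these to DeepSeek's API
# (base URL https://api.deepseek.com, model "deepseek-chat") or to the
# endpoint and deployment name of your own Azure-hosted deployment.
client = OpenAI(
    base_url=os.environ["DEEPSEEK_BASE_URL"],
    api_key=os.environ["DEEPSEEK_API_KEY"],
)

response = client.chat.completions.create(
    model=os.environ.get("DEEPSEEK_MODEL", "deepseek-chat"),
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain the MATH benchmark in one sentence."},
    ],
)
print(response.choices[0].message.content)
```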


DeepSeek vs. OpenAI

DeepSeek has shown how to develop an LLM cost-effectively.


So, is OpenAI nervous?


Not exactly, but they’re paying attention. They released O3-Mini on January 31, 2025, a faster, cheaper successor to their O1 reasoning model.


So, how do they compare?


| | DeepSeek R1 | OpenAI O3-Mini |
|---|---|---|
| Performance | Strong on benchmarks, but slower and more error-prone under heavy load | Faster and more reliable, with usage limits (150 messages/day for paid users) |
| Cost (for developers) | Much cheaper (up to 87% less expensive) | Significantly more expensive |
| Data handling | Weak data privacy (data stored in China, questionable controls) | Enterprise-grade security with strong data protection standards |
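
To make the cost row concrete, here's a rough sketch of how per-token prices translate into a monthly bill and a percentage saving. The prices below are placeholders, not quotes: substitute the providers' current published rates (they change often) and your own token volumes, which is why headline figures like "up to 87% cheaper" vary with the comparison you pick.

```python
# Placeholder prices in USD per 1M tokens (input, output).
# Replace with the providers' current published rates before comparing.
PRICES = {
    "deepseek-r1": (0.55, 2.19),
    "o3-mini": (1.10, 4.40),
}

def monthly_cost(model: str, input_tokens: float, output_tokens: float) -> float:
    """Estimate monthly API spend in USD for a given token volume."""
    in_price, out_price = PRICES[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000

# Example workload: 200M input tokens and 50M output tokens per month.
for model in PRICES:
    print(f"{model}: ${monthly_cost(model, 200e6, 50e6):,.2f}/month")

saving = 1 - monthly_cost("deepseek-r1", 200e6, 50e6) / monthly_cost("o3-mini", 200e6, 50e6)
print(f"DeepSeek saving vs o3-mini at these placeholder rates: {saving:.0%}")
```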


Implications for the AI landscape

DeepSeek highlights several key shifts in the AI landscape:

  • China is catching up to the US in Generative AI

  • Open-source models are becoming more accessible

  • Scaling isn’t the only path to AI progress


In other words, we’ll see more Chinese models in the coming years. When OpenAI launched ChatGPT in November 2022, the US was significantly ahead of China in generative AI. That’s no longer the case.


We’ll also see open-source models play a bigger role in the AI supply chain. Many companies will adopt them. If the US continues to hold back on open-source development, China could take the lead in this space, with companies increasingly relying on Chinese models.


Lastly, scaling up isn’t the only way to advance AI. While bigger models drive progress, that focus has overshadowed other valuable approaches. For example, because US chip export restrictions limited China's access to top-end GPUs, DeepSeek had to optimize for the less powerful Nvidia H800, and still built a strong model with under $6 million in training compute.
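
The "under $6 million" figure comes from a simple back-of-envelope calculation in the DeepSeek-V3 technical report: roughly 2.788 million H800 GPU-hours priced at an assumed rental rate of $2 per GPU-hour. It covers the rented compute for the training run itself, not the lab's total R&D spend.

```python
# Back-of-envelope training-cost estimate using the figures reported in
# the DeepSeek-V3 technical report (total H800 GPU-hours across
# pre-training, context extension, and post-training).
GPU_HOURS = 2_788_000
PRICE_PER_GPU_HOUR = 2.00  # USD, assumed rental price per the report

total_cost = GPU_HOURS * PRICE_PER_GPU_HOUR
print(f"Estimated training compute cost: ${total_cost:,.0f}")  # -> $5,576,000
```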


Conclusion

DeepSeek shows strong performance and is more cost-effective than many of its competitors. Its rise signals potential geopolitical shifts in the AI landscape that are still playing out.


If you have any questions about DeepSeek, or how to get started with AI for your business, feel free to reach out.


FAQ

Is DeepSeek good at math?

Yes. DeepSeek excels at mathematical reasoning and problem-solving, scoring 90.2% on the MATH benchmark, the highest of the models compared here.

How does DeepSeek's efficiency compare to other AI models? 

What are the key performance metrics for DeepSeek?

How does DeepSeek's performance compare to other top AI models?

What are the limitations of DeepSeek's performance?

Is DeepSeek better than ChatGPT?

How does DeepSeek R1 compare to OpenAI’s O3-Mini?

What are top LLMs from China?

