
DeepSeek performance: How it compares to top AI models


DeepSeek has released a remarkable new LLM that’s going head-to-head with ChatGPT on key performance benchmarks.


Curious how it stacks up—and what it means for the future of AI?


Let’s dive in.


How does DeepSeek perform vs frontier AI models?

Here’s how DeepSeek’s performance metrics stack up against top AI models on key LLM benchmarks.

DeepSeek performance compared to top AI models on key benchmarks like MMLU, MATH, and Chatbot Arena (last updated: February 2025):

| LLM | Company | MMLU | MATH | GPQA | HumanEval | Chatbot Arena |
|---|---|---|---|---|---|---|
| GPT-4o | OpenAI | 88.7% | 76.6% | 53.6% | 90.2% | 98.8% |
| Claude 3 Opus | Anthropic | 86.8% | 60.1% | 50.4% | 84.9% | 92.8% |
| Gemini 2.0 | Google | 76.4% | 89.7% | 62.1% | N/A | 100.0% |
| Llama 3.1 405B | Meta | 88.6% | 73.8% | 51.1% | 89.0% | 91.8% |
| Grok-2 | xAI | 87.5% | 76.1% | 56.0% | 88.4% | 93.2% |
| DeepSeek-V3 | DeepSeek | 88.5% | 90.2% | 59.1% | 82.6% | 95.3% |

Source: DeepSeek-V3 Technical Report.


DeepSeek-V3 leads on the MATH benchmark with an impressive 90.2%, the highest score in this comparison. It also holds strong on GPQA and ranks competitively in the Chatbot Arena, where conversational capabilities are put to the test. While it lags slightly on HumanEval (coding), the metrics show it's closing the gap with more established models like OpenAI's GPT-4o and Anthropic's Claude 3 Opus.
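
If you want to slice these numbers yourself, here's a minimal Python sketch (standard library only, scores copied from the table above) that prints the leader on each benchmark. Chatbot Arena is left out, and Gemini 2.0's missing HumanEval score is treated as unavailable.

```python
# Benchmark scores copied from the table above (None = not reported).
scores = {
    "GPT-4o":         {"MMLU": 88.7, "MATH": 76.6, "GPQA": 53.6, "HumanEval": 90.2},
    "Claude 3 Opus":  {"MMLU": 86.8, "MATH": 60.1, "GPQA": 50.4, "HumanEval": 84.9},
    "Gemini 2.0":     {"MMLU": 76.4, "MATH": 89.7, "GPQA": 62.1, "HumanEval": None},
    "Llama 3.1 405B": {"MMLU": 88.6, "MATH": 73.8, "GPQA": 51.1, "HumanEval": 89.0},
    "Grok-2":         {"MMLU": 87.5, "MATH": 76.1, "GPQA": 56.0, "HumanEval": 88.4},
    "DeepSeek-V3":    {"MMLU": 88.5, "MATH": 90.2, "GPQA": 59.1, "HumanEval": 82.6},
}

for benchmark in ["MMLU", "MATH", "GPQA", "HumanEval"]:
    ranked = sorted(
        ((model, s[benchmark]) for model, s in scores.items() if s[benchmark] is not None),
        key=lambda pair: pair[1],
        reverse=True,
    )
    leader, top_score = ranked[0]
    print(f"{benchmark:>9}: best = {leader} ({top_score}%)")
```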


What is DeepSeek?

DeepSeek is a Chinese AI lab founded by hedge fund manager Liang Wenfeng. In late December 2024 the company released DeepSeek-V3, a cost-effective LLM, followed by the reasoning model DeepSeek-R1 in January 2025.


✅ The good:

  • Built for a fraction of ChatGPT’s cost, showing how efficiently they’ve been able to innovate

  • Considerably cheaper to run (relevant for developers)

  • Strong in reasoning capabilities and solving complex problems


⚠️ The bad:

  • Collects a lot of data (chat history, files, personal info, payment details)

  • Questionable privacy controls

  • Data is stored on servers in China, making it subject to government regulations and access

  • Censorship issues (e.g. topics like Taiwan, Tiananmen Square)

  • Can be very slow right now due to high demand


Tip: Use DeepSeek within Azure to avoid censorship, with data stored in the US and Europe.
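
For developers who want to try it, DeepSeek's own API is OpenAI-compatible, and hosted deployments (such as one in Azure) typically expose a similar chat-completions interface. Here's a minimal sketch using the openai Python package; the environment variables and model name are placeholders to point at whichever endpoint fits your data requirements.

```python
import os
from openai import OpenAI  # pip install openai

# Placeholder configuration: set these to DeepSeek's API
# (base URL https://api.deepseek.com, model "deepseek-chat") or to the
# endpoint and deployment name of your own Azure-hosted deployment.
client = OpenAI(
    base_url=os.environ["DEEPSEEK_BASE_URL"],
    api_key=os.environ["DEEPSEEK_API_KEY"],
)

response = client.chat.completions.create(
    model=os.environ.get("DEEPSEEK_MODEL", "deepseek-chat"),
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain the MATH benchmark in one sentence."},
    ],
)
print(response.choices[0].message.content)
```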


DeepSeek vs. OpenAI

DeepSeek has shown how to develop an LLM cost-effectively.


So, is OpenAI nervous?


Not exactly, but they’re paying attention. They released O3-Mini on January 31, 2025, a faster, cheaper successor to their O1 reasoning model.


So, how do they compare?


| | DeepSeek R1 | OpenAI O3-Mini |
|---|---|---|
| Performance | Strong on benchmarks, but slower and more error-prone under heavy load | Faster and more reliable, with usage limits (150 messages/day for paid users) |
| Cost (for developers) | Much cheaper (up to 87% less expensive) | Significantly more expensive |
| Data handling | Weak data privacy (data stored in China, questionable controls) | Enterprise-grade security with strong data protection standards |
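
To make the cost row concrete, here's a rough sketch of how per-token prices translate into a monthly bill and a percentage saving. The prices below are placeholders, not quotes: substitute the providers' current published rates (they change often) and your own token volumes, which is why headline figures like "up to 87% cheaper" vary with the comparison you pick.

```python
# Placeholder prices in USD per 1M tokens (input, output).
# Replace with the providers' current published rates before comparing.
PRICES = {
    "deepseek-r1": (0.55, 2.19),
    "o3-mini": (1.10, 4.40),
}

def monthly_cost(model: str, input_tokens: float, output_tokens: float) -> float:
    """Estimate monthly API spend in USD for a given token volume."""
    in_price, out_price = PRICES[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000

# Example workload: 200M input tokens and 50M output tokens per month.
for model in PRICES:
    print(f"{model}: ${monthly_cost(model, 200e6, 50e6):,.2f}/month")

saving = 1 - monthly_cost("deepseek-r1", 200e6, 50e6) / monthly_cost("o3-mini", 200e6, 50e6)
print(f"DeepSeek saving vs o3-mini at these placeholder rates: {saving:.0%}")
```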


Implications for the AI landscape

DeepSeek highlights several key shifts in the AI landscape:

  • China is catching up to the US in Generative AI

  • Open-source models are becoming more accessible

  • Scaling isn’t the only path to AI progress


In other words, we’ll see more Chinese models in the coming years. When OpenAI launched ChatGPT in November 2022, the US was significantly ahead of China in generative AI. That’s no longer the case.


We’ll also see open-source models play a bigger role in the AI supply chain. Many companies will adopt them. If the US continues to hold back on open-source development, China could take the lead in this space, with companies increasingly relying on Chinese models.


Lastly, scaling up isn’t the only way to advance AI. While bigger models drive progress, that focus has overshadowed other valuable approaches. For example, because US chip export restrictions limited China's access to top-end GPUs, DeepSeek had to optimize for the less powerful Nvidia H800, and still built a strong model with under $6 million in training compute.
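
The "under $6 million" figure comes from a simple back-of-envelope calculation in the DeepSeek-V3 technical report: roughly 2.788 million H800 GPU-hours priced at an assumed rental rate of $2 per GPU-hour. It covers the rented compute for the training run itself, not the lab's total R&D spend.

```python
# Back-of-envelope training-cost estimate using the figures reported in
# the DeepSeek-V3 technical report (total H800 GPU-hours across
# pre-training, context extension, and post-training).
GPU_HOURS = 2_788_000
PRICE_PER_GPU_HOUR = 2.00  # USD, assumed rental price per the report

total_cost = GPU_HOURS * PRICE_PER_GPU_HOUR
print(f"Estimated training compute cost: ${total_cost:,.0f}")  # -> $5,576,000
```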


Conclusion

DeepSeek shows strong performance and is more cost-effective than many of its competitors. Its rise signals potential geopolitical shifts in the AI landscape that are still playing out.


If you have any questions about DeepSeek, or how to get started with AI for your business, feel free to reach out.


FAQ

Is DeepSeek good at math?

Yes. DeepSeek excels at mathematical reasoning and problem-solving, scoring 90.2% on the MATH benchmark, the highest of the models compared here.

How does DeepSeek's efficiency compare to other AI models? 

What are the key performance metrics for DeepSeek?

How does DeepSeek's performance compare to other top AI models?

What are the limitations of DeepSeek's performance?

Is DeepSeek better than ChatGPT?

How does DeepSeek R1 compare to OpenAI’s O3-Mini?

What are top LLMs from China?

