DeepSeek has released a remarkable new LLM that’s going head-to-head with ChatGPT on key performance benchmarks.
Curious how it stacks up—and what it means for the future of AI?
Let’s dive in.
How does DeepSeek perform vs. frontier AI models?
Here’s how DeepSeek’s performance metrics stack up against top AI models on key LLM benchmarks.

Last updated: February 2025
| LLM | Company | MMLU | MATH | GPQA | HumanEval | Chatbot Arena |
| --- | --- | --- | --- | --- | --- | --- |
| GPT-4o | OpenAI | 88.7% | 76.6% | 53.6% | 90.2% | 98.8% |
| Claude 3 Opus | Anthropic | 86.8% | 60.1% | 50.4% | 84.9% | 92.8% |
| Gemini 2.0 | Google | 76.4% | 89.7% | 62.1% | N/A | 100.0% |
| Llama 3.1 405B | Meta | 88.6% | 73.8% | 51.1% | 89.0% | 91.8% |
| Grok-2 | xAI | 87.5% | 76.1% | 56.0% | 88.4% | 93.2% |
| DeepSeek-V3 | DeepSeek | 88.5% | 90.2% | 59.1% | 82.6% | 95.3% |
Source: DeepSeek-V3 Technical Report.
DeepSeek leads on the MATH benchmark with an impressive 90.2%, the highest score in this comparison. It also holds strong on GPQA and ranks competitively in the Chatbot Arena, where conversational ability is put to the test. While it lags slightly on HumanEval (code generation), the numbers show it closing the gap with more established models like OpenAI's GPT-4o and Anthropic's Claude 3 Opus.
What is DeepSeek?
DeepSeek is a Chinese AI lab founded by hedge fund manager Liang Wenfeng. The company released DeepSeek-V3, a cost-effective LLM, in late December 2024, followed by its R1 reasoning model in January 2025.
✅ The good:
- Built for a fraction of ChatGPT's cost, showing how efficiently the team has been able to innovate
- Considerably cheaper to run via its API (relevant for developers; see the sketch after these lists)
- Strong reasoning capabilities on complex problems
⚠️ The bad:
- Collects a lot of data (chat history, uploaded files, personal info, payment details)
- Questionable privacy controls
- Data is stored on servers in China, making it subject to Chinese government regulations and access
- Censorship of politically sensitive topics (e.g. Taiwan, Tiananmen Square)
- Can be very slow right now due to high demand
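For developers, the cost advantage is easy to try out: DeepSeek's API is OpenAI-compatible, so existing OpenAI client code works with a changed base URL. Here's a minimal sketch, assuming you have the `openai` Python package installed and a `DEEPSEEK_API_KEY` set (check DeepSeek's docs for current model IDs and pricing before relying on this):

```python
# Minimal sketch: calling DeepSeek's OpenAI-compatible API.
# Assumes the `openai` package and a DEEPSEEK_API_KEY environment variable.
import os

from openai import OpenAI

client = OpenAI(
    api_key=os.environ["DEEPSEEK_API_KEY"],
    base_url="https://api.deepseek.com",  # DeepSeek's OpenAI-compatible endpoint
)

response = client.chat.completions.create(
    model="deepseek-chat",  # DeepSeek-V3; use "deepseek-reasoner" for R1
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Prove that the square root of 2 is irrational."},
    ],
)

print(response.choices[0].message.content)
```

Because the interface matches OpenAI's, switching an existing integration over is mostly a matter of swapping the base URL, API key, and model name.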
Tip: You can also run DeepSeek models through Microsoft's Azure AI Foundry, with data stored on servers in the US or Europe. This sidesteps the data-handling concerns above, though note that some censorship is trained into the model weights themselves, so changing the host doesn't remove it entirely.
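A minimal sketch of the Azure route, assuming a DeepSeek-R1 deployment in Azure AI Foundry and the `azure-ai-inference` package (the endpoint and key names below are placeholders for your own deployment's values):

```python
# Minimal sketch: calling a DeepSeek deployment hosted on Azure AI Foundry.
# Assumes the `azure-ai-inference` package; endpoint and key are placeholders.
import os

from azure.ai.inference import ChatCompletionsClient
from azure.ai.inference.models import SystemMessage, UserMessage
from azure.core.credentials import AzureKeyCredential

client = ChatCompletionsClient(
    endpoint=os.environ["AZURE_INFERENCE_ENDPOINT"],  # your Foundry endpoint URL
    credential=AzureKeyCredential(os.environ["AZURE_INFERENCE_KEY"]),
)

response = client.complete(
    model="DeepSeek-R1",  # the deployment name you chose in Azure AI Foundry
    messages=[
        SystemMessage(content="You are a helpful assistant."),
        UserMessage(content="Summarize the key trade-offs of open-source LLMs."),
    ],
)

print(response.choices[0].message.content)
```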
DeepSeek vs. OpenAI
DeepSeek has shown how to develop an LLM cost-effectively.
So, is OpenAI nervous?
Not exactly, but they're paying attention. On January 31, 2025, they released o3-mini, a faster, cheaper successor to their o1 reasoning model.
So, how do they compare?
| | DeepSeek R1 | OpenAI o3-mini |
| --- | --- | --- |
| Performance | Strong on benchmarks, but slower and more error-prone under load | Faster and more reliable, with usage limits (150 messages/day for paid ChatGPT users) |
| Cost (for developers) | Much cheaper (up to 87% less expensive) | Significantly more expensive |
| Data handling | Weak data privacy (data stored in China, questionable controls) | Enterprise-grade security with strong data protection standards |
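To make the cost gap concrete, here is a back-of-the-envelope comparison. The per-million-token prices below are illustrative assumptions based on early-2025 list prices, not quotes; plug in current numbers from each provider's pricing page:

```python
# Back-of-the-envelope API cost comparison.
# Prices are illustrative early-2025 assumptions (USD per million tokens).
PRICES = {
    "DeepSeek R1":    {"input": 0.55, "output": 2.19},
    "OpenAI o3-mini": {"input": 1.10, "output": 4.40},
}

def monthly_cost(model: str, input_tokens_m: float, output_tokens_m: float) -> float:
    """Cost in USD for a monthly volume given in millions of tokens."""
    p = PRICES[model]
    return p["input"] * input_tokens_m + p["output"] * output_tokens_m

# Example workload: 100M input tokens and 20M output tokens per month.
for model in PRICES:
    print(f"{model}: ${monthly_cost(model, 100, 20):,.2f}/month")
```

How large the savings look depends heavily on which OpenAI model and workload you compare against, which is why headline figures like "87% cheaper" vary from source to source.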
Implications for the AI landscape
DeepSeek highlights several key shifts in the AI landscape:
- China is catching up to the US in generative AI
- Open-source models are becoming more accessible
- Scaling isn't the only path to AI progress
In other words, we’ll see more Chinese models in the coming years. When OpenAI launched ChatGPT in November 2022, the US was significantly ahead of China in generative AI. That’s no longer the case.
We’ll also see open-source models play a bigger role in the AI supply chain. Many companies will adopt them. If the US continues to hold back on open-source development, China could take the lead in this space, with companies increasingly relying on Chinese models.
Lastly, scaling up isn't the only way to advance AI. While bigger models drive progress, that focus has overshadowed other valuable approaches. For example, due to US chip restrictions, DeepSeek had to optimize for less powerful GPUs, and reportedly built a strong model with under $6 million in compute for its final training run.
Conclusion
DeepSeek shows strong performance and is more cost-effective than many of its competitors. Its rise signals potential geopolitical shifts in the AI landscape that have yet to play out.
If you have any questions about DeepSeek, or how to get started with AI for your business, feel free to reach out.
FAQ
Is DeepSeek good at math?
Yes. DeepSeek excels in mathematical reasoning and problem-solving, scoring 90.2% on the MATH benchmark—the highest score in the comparison above.
How does DeepSeek's efficiency compare to other AI models?
DeepSeek-V3 was reportedly trained for under $6 million in compute, a fraction of what frontier models typically cost, and its API is considerably cheaper for developers to run.
What are the key performance metrics for DeepSeek?
On standard benchmarks, DeepSeek-V3 scores 88.5% on MMLU, 90.2% on MATH, 59.1% on GPQA, and 82.6% on HumanEval, with a 95.3% Chatbot Arena rating (see the table above).
How does DeepSeek's performance compare to other top AI models?
It is competitive with GPT-4o, Claude 3 Opus, Gemini 2.0, Llama 3.1 405B, and Grok-2: it leads on MATH, holds strong on GPQA, and trails slightly on HumanEval.
What are the limitations of DeepSeek's performance?
Weaker coding scores (HumanEval), slowness under heavy demand, data stored on servers in China with questionable privacy controls, and censorship of politically sensitive topics.
Is DeepSeek better than ChatGPT?
It depends on your priorities. DeepSeek is much cheaper and strong at math and reasoning; ChatGPT is faster, more reliable, and offers stronger data protection.
How does DeepSeek R1 compare to OpenAI's o3-mini?
R1 is up to 87% cheaper for developers, while o3-mini is faster, more reliable, and backed by enterprise-grade security (see the comparison table above).
What are top LLMs from China?
Besides DeepSeek, notable Chinese models include Alibaba's Qwen family, Baidu's Ernie, Zhipu AI's GLM, and Moonshot AI's Kimi.