Benchmarks Find ‘DeepSeek-V3-0324 Is More Vulnerable Than Qwen2.5-Max’

With its latest stable release dated January 28, 2025, Qwen2.5-Max is a Mixture-of-Experts (MoE) language model developed by Alibaba. Like other language models, Qwen2.5-Max is capable of generating text, understanding different languages, and performing advanced logic. According to recent benchmarks, it is also more secure than DeepSeek-V3-0324.

Using Recon to scan for vulnerabilities

A team of analysts at Protect AI, the company behind a red teaming and security vulnerability scanning tool known as Recon, recently used their platform to compare the security of Qwen2.5-Max against that of DeepSeek-V3.

The team’s analysis reads, in part: “We observed that DeepSeek-V3-0324 is more vulnerable than Qwen2.5-Max, with Recon achieving an almost 25% higher attack success rate (ASR).”

While it may be more secure than its competitor, Qwen2.5-Max isn’t exactly perfect. According to Protect AI’s tests, the AI model is most susceptible to prompt injection attacks, as these represented almost 48% of all successful cyberattacks against Qwen2.5-Max. Evasion and jailbreak attacks proved to be less successful, with an approximate ASR of 40% for both.

Exposing vulnerabilities in DeepSeek-V3

Recon uses a comprehensive Attack Library to scan current-gen AI models and identify vulnerabilities across six specific categories:

Evasion techniques
System prompt leaks
Prompt injection attacks
AI jailbreak attempts
General safety controls
Adversarial suffix resistance

In addition to simulated cyberattacks, Recon also assesses the AI models’ resistance to generating potentially harmful or illegal content. For example, during adversarial suffix resistance tests, Recon attempts to manipulate the AI model into generating harmful or illegal content.
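The article does not describe how Recon computes its scores, but an attack success rate is conventionally the share of attack attempts that succeed. The following is a minimal sketch of that calculation, assuming a hypothetical list of (category, succeeded) records. The record format, the function name, and the toy data are illustrative only and are not Recon’s API or results.

```python
from collections import defaultdict


def attack_success_rates(records):
    """Return {category: ASR in percent} from (category, succeeded) pairs."""
    attempts = defaultdict(int)
    successes = defaultdict(int)
    for category, succeeded in records:
        attempts[category] += 1
        if succeeded:
            successes[category] += 1
    # ASR = successful attacks / total attempts, expressed as a percentage.
    return {c: 100.0 * successes[c] / attempts[c] for c in attempts}


# Made-up records for illustration only (not Recon output).
toy_records = [
    ("prompt_injection", True),
    ("prompt_injection", False),
    ("jailbreak", False),
    ("jailbreak", False),
    ("jailbreak", True),
    ("jailbreak", False),
]
toy_asr = attack_success_rates(toy_records)
```

Because ASR counts attacks that got through, a lower value means a more resistant model.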

The Protect AI team ran Recon against both Qwen2.5-Max and DeepSeek-V3, with the former showing a lower attack success rate (ASR) across a variety of attacks, including jailbreaks, prompt injection, and evasion techniques.

Qwen2.5-Max had a 47% ASR against prompt injection attacks, compared to DeepSeek-V3’s notably higher 77%. Against evasion techniques, Qwen2.5-Max scored a 39.4% ASR, while DeepSeek-V3 scored 69.2%. Both AI models displayed similar results across other simulated cyberattacks.
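To put the two reported gaps side by side, the arithmetic can be sketched as follows. The figures are the ASR values quoted above; the variable and function names are illustrative.

```python
# ASR values (percent) as reported in the article.
reported_asr = {
    "prompt injection": {"Qwen2.5-Max": 47.0, "DeepSeek-V3": 77.0},
    "evasion": {"Qwen2.5-Max": 39.4, "DeepSeek-V3": 69.2},
}


def asr_gap(scores):
    """Percentage-point gap between DeepSeek-V3 and Qwen2.5-Max."""
    return round(scores["DeepSeek-V3"] - scores["Qwen2.5-Max"], 1)


gaps = {attack: asr_gap(scores) for attack, scores in reported_asr.items()}
for attack, gap in gaps.items():
    print(f"{attack}: DeepSeek-V3 ASR is {gap} points higher")
```

In both categories the gap works out to roughly 30 percentage points.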

Analyzing DeepSeek-V3’s strengths

Despite its security weaknesses, DeepSeek-V3-0324 still outperforms Qwen2.5-Max in several different benchmarks. Unlike the ASR, a higher score in these tests actually indicates better performance.

Benchmark	DeepSeek-V3-0324	Qwen2.5-Max
MMLU-Pro	81.2	75.9
GPQA Diamond	68.4	59.1
MATH-500	94.0	90.2
AIME 2024	59.4	39.6
LiveCodeBench	49.2	39.2

According to these benchmarks, DeepSeek-V3-0324’s strengths include general language understanding (MMLU-Pro), advanced topics such as biology, physics, and chemistry (GPQA Diamond), mathematics (MATH-500 and AIME 2024, the latter drawn from the American Invitational Mathematics Examination), and coding (LiveCodeBench).
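For a quick view of where the lead is widest, the margins can be computed directly from the table. This is a small sketch using the scores listed above; the dictionary layout and names are illustrative.

```python
# Scores from the table above: (DeepSeek-V3-0324, Qwen2.5-Max).
benchmark_scores = {
    "MMLU-Pro": (81.2, 75.9),
    "GPQA Diamond": (68.4, 59.1),
    "MATH-500": (94.0, 90.2),
    "AIME 2024": (59.4, 39.6),
    "LiveCodeBench": (49.2, 39.2),
}

# Margin in points by which DeepSeek-V3-0324 leads on each benchmark.
margins = {
    name: round(deepseek - qwen, 1)
    for name, (deepseek, qwen) in benchmark_scores.items()
}
widest = max(margins, key=margins.get)
```

The widest margin is on AIME 2024 (19.8 points) and the narrowest on MATH-500 (3.8 points).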
