Claude Mythos Preview Benchmark Breakdown: How It Compares to GPT-5 and Opus 4.6

Benchmarks · April 9, 2026 · 8 min read · By Mythos Preview Daily Staff

Key Points

  • Mythos Preview achieves 93.9% on SWE-bench Verified — the highest score reported by any model
  • 100% on Cybench CTF (pass@1), compared to 75% for Opus 4.6
  • 97.6% on USAMO 2026, a more than twofold improvement over Opus 4.6's 42.3%
  • Anthropic characterizes these results as a "step-change" rather than an incremental improvement
  • Estimated GPT-5 scores (prefixed with ~) are from public reports and may reflect different testing conditions

The Step-Change Narrative

When Anthropic described Claude Mythos Preview as a "step-change" rather than an incremental improvement, many observers were skeptical. In AI, companies routinely oversell progress. But the benchmark data published in the system card tells a compelling story.

Across every major benchmark, Mythos Preview doesn't just edge out its predecessor — it opens up a significant gap. The most dramatic improvement is in cybersecurity, where the model goes from strong performance to what amounts to near-perfect scores.

Software Engineering Performance

The SWE-bench family of benchmarks tests a model's ability to autonomously resolve real software issues from GitHub repositories:

| Benchmark          | Mythos Preview | Opus 4.6 | GPT-5 (est.) | Gemini 2.5 Pro (est.) |
|--------------------|----------------|----------|--------------|------------------------|
| SWE-bench Verified | 93.9%          | 80.8%    | ~85%         | ~78%                   |
| SWE-bench Pro      | 77.8%          | 53.4%    | ~60%         | ~52%                   |

The leap from 80.8% to 93.9% on SWE-bench Verified is noteworthy because each additional percentage point becomes harder to achieve at this level. The model is solving software issues that even very capable models consistently fail to resolve.

SWE-bench Pro, which tests harder real-world problems, shows an even more dramatic improvement: from 53.4% to 77.8%, representing a 45.7% relative improvement.
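The relative-improvement figures quoted in this article follow directly from the raw scores. As a quick sanity check (an illustrative helper, not anything from the system card):

```python
def relative_improvement(old: float, new: float) -> float:
    """Percentage gain of `new` over `old`, measured relative to `old`."""
    return (new - old) / old * 100

# SWE-bench Pro: 53.4% -> 77.8%
print(round(relative_improvement(53.4, 77.8), 1))  # 45.7

# CyberGym composite score: 0.67 -> 0.83
print(round(relative_improvement(0.67, 0.83), 1))  # 23.9, i.e. roughly 24%
```

Note that relative improvement understates how hard the absolute gain is near saturation: moving from 80.8% to 93.9% on SWE-bench Verified closes nearly seventy percent of the remaining headroom.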

Cybersecurity Benchmarks

Here is where the story becomes most distinctive:

| Benchmark     | Mythos Preview | Opus 4.6 | GPT-5 (est.) |
|---------------|----------------|----------|--------------|
| Cybench (CTF) | 100% (pass@1)  | 75%      | ~70%         |
| CyberGym      | 0.83           | 0.67     | ~0.62        |

A perfect score on Cybench CTF at pass@1 — meaning the model solves every challenge correctly on its first attempt — is without precedent. Previous frontier models achieved strong but imperfect scores. Mythos Preview solves them all.
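For readers unfamiliar with the metric: pass@1 is the probability of solving a task on a single attempt. The standard unbiased pass@k estimator from the code-generation evaluation literature (sketched below; not Anthropic's specific harness) reduces to the plain success fraction when one sample is drawn per task:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k
    samples, drawn from n total samples of which c are correct,
    solves the task."""
    if n - c < k:
        return 1.0  # too few failures to fill k draws: success guaranteed
    return 1.0 - comb(n - c, k) / comb(n, k)

# With k=1, this is just the fraction of correct samples:
print(pass_at_k(10, 5, 1))  # 0.5
```

A 100% pass@1 therefore means every challenge was solved with no retries, which is a stricter claim than a perfect pass@8 or best-of-n score.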

CyberGym, which measures both offensive and defensive cybersecurity capabilities on a 0-1 scale, shows a 24% relative improvement. This composite score includes tasks like vulnerability identification, exploit development, defense analysis, and threat modeling.

Mathematical Reasoning and Computer Use

| Benchmark  | Mythos Preview | Opus 4.6 | GPT-5 (est.) |
|------------|----------------|----------|--------------|
| USAMO 2026 | 97.6%          | 42.3%    | ~55%         |
| OSWorld    | 79.6%          | 72.7%    | ~74%         |

The USAMO result stands out. Going from 42.3% to 97.6% on mathematical olympiad problems represents a qualitative shift: the model now handles the deep mathematical reasoning and multi-step proof construction that previous frontier models managed only inconsistently. The gain on OSWorld, a computer-use benchmark, is more modest by comparison.

What "Step-Change" Actually Means

In AI research, performance improvements typically follow a predictable curve where each percentage point requires exponentially more compute and data. A step-change — where performance jumps dramatically from one generation to the next — suggests either a fundamental architectural breakthrough or a training approach that unlocked capabilities not present in previous models.

Anthropic has not disclosed the specific techniques behind these improvements. The company's system card attributes them to advances in training methodology and architecture, without providing enough detail for external reproduction.

Caveats and Methodology

Several important caveats apply to these comparisons:

  1. Different testing conditions: GPT-5 and Gemini 2.5 Pro scores (marked with ~) are estimated from public reports and may not use identical evaluation protocols
  2. Benchmark saturation: As models approach 100% on specific benchmarks, the benchmarks themselves may need to evolve
  3. Task distribution: High benchmark scores don't necessarily translate to uniformly better real-world performance
  4. Reproducibility: External researchers cannot independently verify these results without access to the model

For the full interactive comparison, see our benchmarks page.
