Claude Mythos Preview Benchmark Breakdown: How It Compares to GPT-5 and Opus 4.6

Benchmarks · April 9, 2026 · 8 min read · By Mythos Preview Daily Staff

Key Points

  • Mythos Preview achieves 93.9% on SWE-bench Verified — the highest score reported by any model
  • 100% on Cybench CTF (pass@1), compared to 75% for Opus 4.6
  • 97.6% on USAMO 2026, a more than twofold improvement over Opus 4.6's 42.3%
  • Anthropic characterizes these results as a "step-change" rather than an incremental improvement
  • Estimated GPT-5 scores (prefixed with ~) are from public reports and may reflect different testing conditions

The Step-Change Narrative

When Anthropic described Claude Mythos Preview as a "step-change" rather than an incremental improvement, many observers were skeptical. In AI, companies routinely oversell progress. But the benchmark data published in the system card tells a compelling story.

Across every major benchmark, Mythos Preview doesn't just edge out its predecessor — it opens up a significant gap. The most dramatic improvement is in cybersecurity, where the model goes from strong performance to what amounts to near-perfect scores.

Software Engineering Performance

The SWE-bench family of benchmarks tests a model's ability to autonomously resolve real software issues from GitHub repositories:

| Benchmark          | Mythos Preview | Opus 4.6 | GPT-5 (est.) | Gemini 2.5 Pro (est.) |
|--------------------|----------------|----------|--------------|------------------------|
| SWE-bench Verified | 93.9%          | 80.8%    | ~85%         | ~78%                   |
| SWE-bench Pro      | 77.8%          | 53.4%    | ~60%         | ~52%                   |

The leap from 80.8% to 93.9% on SWE-bench Verified is noteworthy because each additional percentage point becomes harder to achieve at this level. The model is solving software issues that even very capable models consistently fail to resolve.

SWE-bench Pro, which tests harder real-world problems, shows an even more dramatic improvement: from 53.4% to 77.8%, representing a 45.7% relative improvement.
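The relative-improvement figures quoted in this article follow directly from the raw scores. As a quick sanity check (an illustrative helper, not anything from the system card):

```python
def relative_improvement(old: float, new: float) -> float:
    """Percentage gain of `new` over `old`, measured relative to `old`."""
    return (new - old) / old * 100

# SWE-bench Pro: 53.4% -> 77.8%
print(round(relative_improvement(53.4, 77.8), 1))  # 45.7

# CyberGym composite score: 0.67 -> 0.83
print(round(relative_improvement(0.67, 0.83), 1))  # 23.9, i.e. roughly 24%
```

Note that relative improvement understates how hard the absolute gain is near saturation: moving from 80.8% to 93.9% on SWE-bench Verified closes nearly seventy percent of the remaining headroom.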

Cybersecurity Benchmarks

Here is where the story becomes most distinctive:

| Benchmark     | Mythos Preview | Opus 4.6 | GPT-5 (est.) |
|---------------|----------------|----------|--------------|
| Cybench (CTF) | 100% (pass@1)  | 75%      | ~70%         |
| CyberGym      | 0.83           | 0.67     | ~0.62        |

A perfect score on Cybench CTF at pass@1 — meaning the model solves every challenge correctly on its first attempt — is without precedent. Previous frontier models achieved strong but imperfect scores. Mythos Preview solves them all.
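For readers unfamiliar with the metric: pass@1 is the probability of solving a task on a single attempt. The standard unbiased pass@k estimator from the code-generation evaluation literature (sketched below; not Anthropic's specific harness) reduces to the plain success fraction when one sample is drawn per task:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k
    samples, drawn from n total samples of which c are correct,
    solves the task."""
    if n - c < k:
        return 1.0  # too few failures to fill k draws: success guaranteed
    return 1.0 - comb(n - c, k) / comb(n, k)

# With k=1, this is just the fraction of correct samples:
print(pass_at_k(10, 5, 1))  # 0.5
```

A 100% pass@1 therefore means every challenge was solved with no retries, which is a stricter claim than a perfect pass@8 or best-of-n score.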

CyberGym, which measures both offensive and defensive cybersecurity capabilities on a 0-1 scale, shows a 24% relative improvement. This composite score includes tasks like vulnerability identification, exploit development, defense analysis, and threat modeling.

Mathematical Reasoning and Computer Use

| Benchmark  | Mythos Preview | Opus 4.6 | GPT-5 (est.) |
|------------|----------------|----------|--------------|
| USAMO 2026 | 97.6%          | 42.3%    | ~55%         |
| OSWorld    | 79.6%          | 72.7%    | ~74%         |

The USAMO result stands out. Going from 42.3% to 97.6% on mathematical olympiad problems represents a qualitative shift: the model now handles the deep mathematical reasoning and multi-step proof construction that previous frontier models managed only inconsistently. The gain on OSWorld, a computer-use benchmark, is more modest by comparison.

What "Step-Change" Actually Means

In AI research, performance improvements typically follow a predictable curve where each percentage point requires exponentially more compute and data. A step-change — where performance jumps dramatically from one generation to the next — suggests either a fundamental architectural breakthrough or a training approach that unlocked capabilities not present in previous models.

Anthropic has not disclosed the specific techniques behind these improvements. The company's system card attributes them to advances in training methodology and architecture, without providing enough detail for external reproduction.

Caveats and Methodology

Several important caveats apply to these comparisons:

  1. Different testing conditions: GPT-5 and Gemini 2.5 Pro scores (marked with ~) are estimated from public reports and may not use identical evaluation protocols
  2. Benchmark saturation: As models approach 100% on specific benchmarks, the benchmarks themselves may need to evolve
  3. Task distribution: High benchmark scores don't necessarily translate to uniformly better real-world performance
  4. Reproducibility: External researchers cannot independently verify these results without access to the model

For the full interactive comparison, see our benchmarks page.
