Claude Mythos Preview Benchmark Breakdown: How It Compares to GPT-5 and Opus 4.6

Key Points
- Mythos Preview achieves 93.9% on SWE-bench Verified — the highest score reported by any model
- 100% on Cybench CTF (pass@1), compared to 75% for Opus 4.6
- 97.6% on USAMO 2026, a more than twofold improvement over Opus 4.6's 42.3%
- These results represent a "step-change" rather than an incremental improvement
- Estimated GPT-5 scores (prefixed with ~) are from public reports and may reflect different testing conditions
The Step-Change Narrative
When Anthropic described Claude Mythos Preview as a "step-change" rather than an incremental improvement, many observers were skeptical. In AI, companies routinely oversell progress. But the benchmark data published in the system card tells a compelling story.
Across every major benchmark, Mythos Preview doesn't just edge out its predecessor — it opens up a significant gap. The most dramatic improvement is in cybersecurity, where the model goes from strong performance to what amounts to near-perfect scores.
Software Engineering Performance
The SWE-bench family of benchmarks tests a model's ability to autonomously resolve real software issues from GitHub repositories:
| Benchmark | Mythos Preview | Opus 4.6 | GPT-5 (est.) | Gemini 2.5 Pro (est.) |
|---|---|---|---|---|
| SWE-bench Verified | 93.9% | 80.8% | ~85% | ~78% |
| SWE-bench Pro | 77.8% | 53.4% | ~60% | ~52% |
The leap from 80.8% to 93.9% on SWE-bench Verified is noteworthy because each additional percentage point becomes harder to achieve at this level. The model is solving software issues that even very capable models consistently fail to resolve.
SWE-bench Pro, which tests harder real-world problems, shows an even more dramatic improvement: from 53.4% to 77.8%, representing a 45.7% relative improvement.
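The relative-improvement figures quoted here, and later for CyberGym, follow directly from the table values. A minimal sketch in Python, using the scores above (the helper name relative_improvement is ours, purely for illustration):

```python
def relative_improvement(old: float, new: float) -> float:
    """Relative improvement of `new` over `old`, as a percentage."""
    return (new - old) / old * 100

# Scores from the tables in this article (Opus 4.6 -> Mythos Preview).
print(f"SWE-bench Verified: {relative_improvement(80.8, 93.9):.1f}%")  # ~16.2%
print(f"SWE-bench Pro:      {relative_improvement(53.4, 77.8):.1f}%")  # ~45.7%
print(f"CyberGym:           {relative_improvement(0.67, 0.83):.1f}%")  # ~23.9%
```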
Cybersecurity Benchmarks
Here is where the story becomes most distinctive:
| Benchmark | Mythos Preview | Opus 4.6 | GPT-5 (est.) |
|---|---|---|---|
| Cybench (CTF) | 100% (pass@1) | 75% | ~70% |
| CyberGym (0-1 scale) | 0.83 | 0.67 | ~0.62 |
A perfect score on Cybench CTF at pass@1 — meaning the model solves every challenge correctly on its first attempt — is without precedent. Previous frontier models achieved strong but imperfect scores. Mythos Preview solves them all.
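Pass@k metrics are usually computed with the unbiased estimator introduced for HumanEval (Chen et al., 2021): sample n completions per task, count the c that pass, and estimate the probability that at least one of k randomly drawn completions succeeds. The system card does not state how many samples were drawn on Cybench, so the sketch below illustrates the metric itself, not Anthropic's exact protocol.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021).

    n: total completions sampled for a task
    c: completions that passed
    k: evaluation budget being estimated (k=1 here)
    Returns the estimated probability that at least one of k
    randomly chosen completions passes.
    """
    if n - c < k:
        return 1.0  # every k-subset contains at least one passing completion
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(n=10, c=10, k=1))  # 1.0 -- all sampled attempts succeeded
print(pass_at_k(n=10, c=7, k=1))   # 0.7
```

For k=1 the estimator reduces to c/n, which is why a 100% pass@1 score implies that every sampled attempt on every challenge succeeded.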
CyberGym, which measures both offensive and defensive cybersecurity capabilities on a 0-1 scale, shows a 24% relative improvement. This composite score includes tasks like vulnerability identification, exploit development, defense analysis, and threat modeling.
Mathematical Reasoning and Computer Use
| Benchmark | Mythos Preview | Opus 4.6 | GPT-5 (est.) |
|---|---|---|---|
| USAMO 2026 | 97.6% | 42.3% | ~55% |
| OSWorld | 79.6% | 72.7% | ~74% |
The USAMO result stands out dramatically. Going from 42.3% to 97.6% on mathematical olympiad problems represents a qualitative shift: the model can now handle the deep mathematical reasoning and multi-step proof construction on which previous frontier models succeeded only around half the time. OSWorld, which measures agentic computer use rather than mathematics, shows a more modest gain, from 72.7% to 79.6%.
What "Step-Change" Actually Means
In AI research, performance improvements typically follow a predictable curve where each percentage point requires exponentially more compute and data. A step-change — where performance jumps dramatically from one generation to the next — suggests either a fundamental architectural breakthrough or a training approach that unlocked capabilities not present in previous models.
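One way to make the step-change framing concrete is to look at remaining error rates rather than raw scores. The sketch below is purely illustrative arithmetic over the tables above (Cybench is omitted because its error rate drops to zero):

```python
# Error-rate reduction implied by each benchmark jump (scores from the
# tables in this article). Large multiples are one way to make the
# "step-change" claim concrete: the model eliminates most of the
# remaining failures rather than shaving off a few points.
scores = {
    "SWE-bench Verified": (80.8, 93.9),
    "SWE-bench Pro":      (53.4, 77.8),
    "USAMO 2026":         (42.3, 97.6),
    "OSWorld":            (72.7, 79.6),
}

for name, (old, new) in scores.items():
    old_err, new_err = 100 - old, 100 - new
    print(f"{name}: error {old_err:.1f}% -> {new_err:.1f}% "
          f"({old_err / new_err:.1f}x reduction)")
```

By this measure the USAMO jump cuts the error rate roughly 24-fold, while OSWorld improves by a much more ordinary factor of about 1.3.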
Anthropic has not disclosed the specific techniques behind these improvements. The company's system card attributes them to advances in training methodology and architecture, without providing enough detail for external reproduction.
Caveats and Methodology
Several important caveats apply to these comparisons:
- Different testing conditions: GPT-5 and Gemini 2.5 Pro scores (marked with ~) are estimated from public reports and may not use identical evaluation protocols
- Benchmark saturation: As models approach 100% on specific benchmarks, the benchmarks themselves may need to evolve
- Task distribution: High benchmark scores don't necessarily translate to uniformly better real-world performance
- Reproducibility: External researchers cannot independently verify these results without access to the model
For the full interactive comparison, see our benchmarks page.


