Claude Mythos Preview Benchmarks
How does Claude Mythos Preview compare to its predecessors and competitors? Here are the benchmark results from Anthropic's system card and independent reports.
Key finding: Claude Mythos Preview achieves 93.9% on SWE-bench Verified, 100% on Cybench CTF (pass@1), and 97.6% on USAMO 2026, results that observers describe as a "step-change" over previous frontier models in software engineering, cybersecurity, and mathematical reasoning.
Software Engineering Benchmarks
| Benchmark | Mythos Preview | Opus 4.6 | GPT-5 | Gemini 2.5 Pro |
|---|---|---|---|---|
| SWE-bench Verified | 93.9% | 80.8% | ~85% | ~78% |
| SWE-bench Pro | 77.8% | 53.4% | ~60% | ~52% |
Sources: Anthropic system card (April 2026), independent analyses. Values prefixed with ~ are estimated from public reports.
Cybersecurity Benchmarks
| Benchmark | Mythos Preview | Opus 4.6 | GPT-5 |
|---|---|---|---|
| Cybench (CTF) | 100% (pass@1) | 75% | ~70% |
| CyberGym | 0.83 | 0.67 | ~0.62 |
Sources: Anthropic system card (April 2026). Cybench measures capture-the-flag performance; CyberGym measures defensive/offensive capability.
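The "pass@1" qualifier on the Cybench score means single-attempt success rate. For context, a minimal sketch of the standard unbiased pass@k estimator widely used in code and CTF evaluations (this is a generic illustration, not Anthropic's evaluation harness):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: the probability that at least one
    of k samples, drawn without replacement from n generations of
    which c are correct, solves the task."""
    if n - c < k:
        # Fewer incorrect samples than k draws: success is guaranteed.
        return 1.0
    # 1 minus the probability that all k drawn samples are incorrect.
    return 1.0 - comb(n - c, k) / comb(n, k)

# With k=1 this reduces to c/n, the plain fraction of correct attempts,
# so "100% pass@1" means every single-shot attempt succeeded.
```

Reporting pass@1 rather than pass@k with larger k is the stricter convention, since the model gets no benefit from retries.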
Reasoning & General Benchmarks
| Benchmark | Mythos Preview | Opus 4.6 | GPT-5 |
|---|---|---|---|
| USAMO 2026 | 97.6% | 42.3% | ~55% |
| OSWorld | 79.6% | 72.7% | ~74% |
Sources: Anthropic system card (April 2026), USAMO 2026 competition results. OSWorld measures desktop automation.
Methodology Note
Benchmark values for Claude Mythos Preview and Claude Opus 4.6 are sourced from Anthropic's published system card (April 7, 2026). Values for GPT-5 and Gemini 2.5 Pro are estimated from publicly available benchmarks and may not reflect identical testing conditions. Values marked with "~" indicate estimates.
This data is provided for informational comparison only. Different evaluation protocols, prompting strategies, and scoring methods can significantly affect reported results. We encourage readers to consult primary sources for definitive benchmarking.