Move over, Mythos. Here comes any model with a good harness.

Move over, Mythos. Here comes... pretty much any other model with a good harness

Written by

Dania Durnas

Published on:

Jun 1, 2026

Mythos doesn’t need to be treated as the biggest and baddest in the room.

Don’t get me wrong. Depending on the benchmark you’re evaluating against, Mythos is among the top models available today, and generally the best at reasoning. But it’s not leaps and bounds ahead of the pack.

And when it comes to practical use cases, throwing a general model at a problem, even a cutting-edge frontier model, doesn’t get the best results. Nor is it scalable or cost-effective. When it comes to finding vulnerabilities, the harness around a model matters more than the model itself. And Fable 5, the public version of Mythos? It won’t even touch cybersecurity topics.

We’ll first look at why Mythos isn’t the model to solve every problem, and then how a good harness produces high-quality results at scale.

Mythos is a little hypey

First, let’s look at some facts. Mythos is good, one of the best AI models to date, and it continues to score highly on benchmarks. It excels at constructing exploit chains and generating proofs of concept, so since its release it has built up a long record of finding zero-day vulnerabilities.

However, while some fear and excitement were merited, the world’s response was out of proportion to Mythos’s improvement over previous models. Each new frontier model tends to be better than the last, but only by a small margin.

And at this point, other frontier models are mostly on par, especially since GPT-5.5 came out in April. The UK's AI Security Institute benchmarked it at roughly the same cyber capability tier as Mythos. In the hardest category of their evaluation suite, GPT-5.5 reached 71.4%, while Mythos reached 68.6%. Between Mythos and GPT-5.5, which one wins depends on the task.

Mythos is not perfect, and it isn’t a silver bullet for finding every security vulnerability by itself. For example, someone ran Mythos against the cURL library codebase and emailed the results to its founder and maintainer, Daniel Stenberg. Mythos turned up five "confirmed security vulnerabilities." But after Stenberg's team reviewed them, they found that three were false positives, one was a non-security bug, and only one was a real vulnerability. A few days later, Stenberg received 17 vulnerabilities from people running other AI tools. He said on LinkedIn, "Mythos is not even close to the end of this race," and in his blog post about the experience wrote that he thinks Mythos hype is "primarily marketing."

Anthropic recently released Fable 5, which is Mythos 5 with guardrails. These guardrails cause the model to stop if it encounters any request related to cybersecurity or biology, so it can't be benchmarked or leveraged for finding vulnerabilities at all.

The harness matters more than the model

With different models now excelling at many different tasks and the top-tier models converging on capabilities, the biggest variable in optimizing vulnerability discovery is the harness.

A harness is the orchestration layer that wraps around a model (or multiple models). This includes the logic that decides which agent runs when, what context it receives, how findings get validated, and when to escalate to a stronger model. It is code, workflow design, and prompt architecture working together, with the model serving as just another one of those components.

Harnesses narrow a general-purpose LLM into something highly suited to a given domain and task. They also take advantage of the non-determinism in LLMs, which causes a model to find slightly different results on each run. With a harness, multiple agents review a codebase, with the expectation that no single agent will find 100% of the vulnerabilities (including agents running on Mythos).
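To make the "no single agent finds everything" point concrete, here is a minimal Python sketch. It assumes, optimistically, that runs are independent and that each run finds a given bug with the same probability; real agents are correlated, so treat the numbers as an upper bound. The names `coverage` and `merge_runs` are illustrative, not part of any real harness.

```python
def coverage(hit_rate: float, runs: int) -> float:
    """Chance that at least one of `runs` independent runs finds a bug
    that a single run finds with probability `hit_rate` (an assumption)."""
    return 1 - (1 - hit_rate) ** runs


def merge_runs(runs: list[set[str]]) -> set[str]:
    """Union the finding IDs reported by each run into one candidate set."""
    return set().union(*runs)


# A bug that a single run catches 30% of the time is caught by
# at least one of eight runs roughly 94% of the time.
print(round(coverage(0.3, 8), 2))

# Three runs with partially overlapping findings (hypothetical IDs).
print(sorted(merge_runs([{"bug-a"}, {"bug-a", "bug-b"}, {"bug-c"}])))
```

The merged set is only a list of candidates. It still needs validation, which is the job of the later harness stages.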

In the context of vulnerability research, Cloudflare’s research outlines an example of what a solid harness setup often looks like:

A recon stage that reads the repository and creates a task queue for everything downstream
A hunt stage where many agents run in parallel, each searching for vulnerabilities
A validation stage where an independent agent, using a different prompt and with no ability to generate its own findings, tries to disprove what the hunting agent found
A tracing stage that follows confirmed findings across the repo to determine whether attacker-controlled input can actually reach the bug from outside the system
Deduplication logic to consolidate findings that have the same root cause
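
The stages listed above can be sketched as a small Python pipeline. This is a hedged illustration of the shape only, not Cloudflare's implementation: the `Finding` fields, the `run_pipeline` function, and the idea of passing hunters, a validator, and a tracer as plain callables are all assumptions made for the example. In a real harness each callable would wrap a model call.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class Finding:
    path: str
    root_cause: str  # hypothetical normalized label used for deduplication
    detail: str


def run_pipeline(
    files: Iterable[str],
    hunters: list[Callable[[str], list[Finding]]],
    validator: Callable[[Finding], bool],
    tracer: Callable[[Finding], bool],
) -> list[Finding]:
    # Recon: turn the repository into a task queue (here, one task per file).
    tasks = list(files)
    # Hunt: every hunter looks at every task. A real harness would run
    # these in parallel; this sketch runs them sequentially.
    candidates = [f for task in tasks for hunt in hunters for f in hunt(task)]
    # Validate: an independent check that can only reject, never add findings.
    confirmed = [f for f in candidates if validator(f)]
    # Trace: keep findings that attacker-controlled input can actually reach.
    reachable = [f for f in confirmed if tracer(f)]
    # Deduplicate: keep the first finding per (path, root cause).
    unique: dict[tuple[str, str], Finding] = {}
    for f in reachable:
        unique.setdefault((f.path, f.root_cause), f)
    return list(unique.values())
```

Keeping the validator as a separate callable that returns only true or false mirrors the requirement that the validating agent cannot generate findings of its own.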

Harness design is so impactful that it often matters more than model choice itself. UCSB researchers ran the same Claude Opus 4.6 on the same tasks with different harnesses and found that the best harness passed four times as many tests as the worst harness. For comparison, the spread between frontier models like Opus 4.6 and GPT-5.4 on standard coding benchmarks is only about one percentage point. That means teams obsessing over which model to use are optimizing the wrong variable.

Niels Provos demonstrated the same concept from the other direction. He built a harness that found an 18-year-old vulnerability in a popular library, then swapped in the open-weight GLM 5.1 and got comparable results. He showed that a strong harness can make the model a swappable component, rather than the primary driver.

Research from Mozilla's security team explains why investing in harness design pays off over time. Once their harness pipeline was solid, each new model they dropped in immediately improved bug-finding, proof-of-concept generation, and impact analysis without any rearchitecting. When Mythos became available to them, they were able to slot it in and benefit immediately. Build the harness right, and model progress becomes something you absorb for free rather than scramble to adopt.

Money talks

Another problem with using Mythos for everything is economic. Bigger models generally perform better, but they’re also far more expensive.

A single thorough Mythos scan of a repository costs real money, on the order of tens of thousands of dollars, for what might be a few vulnerabilities. Run Opus 4.6, or even GPT-5.4 nano, ten times for the same cost as running Mythos once, and you generally find more. Cost doesn’t scale 1-to-1 with ability. For example, both inputs and outputs for GPT-5.4 cost half as much as those for GPT-5.5, but the former doesn’t have half the reasoning power of the latter. Internally, we found that eight GPT-5.4-mini agents outperform one GPT-5.5 agent in some cases, at about the same cost. Cheaper models let you turn the number of agents into an advantage.
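
A rough way to reason about this trade-off is to ask how good each cheap run has to be before many of them match one expensive run. The Python sketch below does that arithmetic. Everything numeric in it is an assumption for illustration: the 8:1 relative cost simply echoes the eight-agents-for-one comparison, the 60% hit rate for the strong model is made up, and the math assumes independent runs.

```python
def runs_for_budget(budget: float, cost_per_run: float) -> int:
    """How many whole runs a budget buys at a given per-run cost."""
    return int(budget // cost_per_run)


def break_even_hit_rate(strong_hit_rate: float, cheap_runs: int) -> float:
    """Per-run hit rate a cheap model needs so that `cheap_runs` independent
    runs match one strong run: solve 1 - (1 - q)**n = p for q."""
    return 1 - (1 - strong_hit_rate) ** (1 / cheap_runs)


STRONG_COST = 8.0  # hypothetical relative cost of one strong-model run
CHEAP_COST = 1.0   # hypothetical relative cost of one cheap-model run

budget = STRONG_COST  # spend exactly what one strong run would cost
n_cheap = runs_for_budget(budget, CHEAP_COST)
q = break_even_hit_rate(0.6, n_cheap)  # 0.6 is an assumed strong hit rate

print(n_cheap)       # 8 cheap runs for the price of one strong run
print(round(q, 2))   # per-run hit rate each cheap run needs to break even
```

Under these assumptions each cheap run only needs to catch a given bug about 11% of the time to match a strong run that catches it 60% of the time. Correlated failures between runs would raise that bar.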

The smaller model will generally produce more false positives than a frontier model, since it’s less precise. But in this rare case, quantity matters as much as quality, because you want to capture as many vulnerabilities as possible. This is where a harness helps: other agents verify the exploit chains and filter out the extra noise, and the whole thing is a lot more economical than running Mythos or another frontier model to find everything.

For threat actors, what are they actually going to use? Not Mythos. First off, they don’t have it. And Fable 5 was nerfed to prevent this very group from getting access. No, attackers are going to want to use whatever runs cheaply, repeatedly, at scale, and they’re not going to wait in line. Open-weight models with decent harnesses work well, and that’s probably what they’re doing right now.

And for organizations, what's sustainable? Running a frontier model on every code change certainly isn't. Running a multi-tier orchestration that uses cheap models regularly and expensive ones precisely… that is.

Pay no attention to the model behind the curtain

Mythos was a fascinating moment in our timeline. It got everyone's attention about what models can do now. But high-quality, capable autonomous vulnerability discovery is available through cheaper alternatives, so you are not limited to Mythos or Project Glasswing.

Vendors locked to a single model have to make that one model perfect. Vendor-agnostic platforms get to pick the right tool for the right job. A smaller model can sweep wide and surface candidates, while a stronger model can deep-dive into the ones that look interesting and require higher reasoning capabilities. To get the best results in AppSec and AI pentesting, you want to prioritize systems with sophisticated harnesses that use the right models, rather than getting too concerned about having the fanciest model involved.
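
The sweep-wide, then deep-dive pattern can be sketched as a simple triage step in Python. This is an assumption-laden illustration rather than a description of any vendor's system: the `Candidate` type, the `cheap_score` field, and the 0.5 threshold are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Candidate:
    location: str
    cheap_score: float  # hypothetical 0..1 confidence from the cheap sweep


def triage(
    candidates: list[Candidate],
    escalate_at: float = 0.5,
    max_escalations: Optional[int] = None,
) -> list[Candidate]:
    """Pick the candidates worth a deep dive by the stronger model,
    highest cheap-model score first, optionally capped to control spend."""
    picked = sorted(
        (c for c in candidates if c.cheap_score >= escalate_at),
        key=lambda c: c.cheap_score,
        reverse=True,
    )
    return picked if max_escalations is None else picked[:max_escalations]


sweep = [
    Candidate("auth/login.py:42", 0.9),
    Candidate("docs/build.py:7", 0.2),
    Candidate("api/upload.py:118", 0.6),
]
for c in triage(sweep):
    print(c.location)  # these go to the stronger model; the rest are dropped
```

The `max_escalations` cap is one way to keep the expensive tier's cost predictable regardless of how noisy the cheap sweep turns out to be.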

At Aikido, we figured out early that quantity, orchestration, and the freedom to pick the right tool for the job beat chasing whatever's currently behind the highest paywall. As an AppSec provider, we see our responsibility as building the orchestration that lets the model layer keep evolving underneath. If you want to learn more about how our pentesting can help you secure your application, talk to us today.

PS. We’ve also written a Mythos-ready checklist to help teams prepare for threats from agentic AI (whether powered by Mythos or by many GPT-5.4-mini agents).

Last updated on:

Jun 18, 2026
