Aikido

Move over, Mythos. Here comes... pretty much any other model with a good harness

Written by
Dania Durnas

Mythos doesn’t need to be treated as the biggest and baddest in the room.

Don’t get me wrong. Depending on the benchmark you’re evaluating against, Mythos is among the top models available today, and generally the best at reasoning. But it’s not leaps and bounds ahead of the pack.

And when it comes to practical use cases, throwing a general model at a problem, even a cutting-edge frontier model, doesn’t get the best results. Nor is it scalable or cost-effective. When it comes to finding vulnerabilities, the harness wrapped around a model matters more than the model itself.

We’ll first look at why Mythos isn’t the model to solve every problem, and then how a good harness produces high-quality results at scale.

Mythos is a little hypey

First, let’s look at some facts. Mythos is good, one of the best AI models to date, and it continues to score highly on benchmarks. Mythos excels at constructing exploit chains and generating proofs of concept, and since its release it has built up a long resume of zero-day vulnerability finds.

However, while some fear and excitement were merited, the world’s response was out of proportion to its improvement over previous models. Each new frontier model released on the market tends to be better than the last, but only by a small margin.

And at this point, other frontier models are mostly on par, especially since GPT-5.5 came out in April. The UK's AI Security Institute benchmarked it at roughly the same cyber capability tier as Mythos. In the hardest category of their evaluation suite, GPT-5.5 reached 71.4%, while Mythos reached 68.6%. Which of the two comes out ahead depends on the task.

Mythos is not perfect, and it still isn’t a silver bullet for finding all security vulnerabilities by itself. For example, some people ran Mythos against the cURL library codebase and emailed the results to its founder and maintainer, Daniel Stenberg. Mythos turned up five "confirmed security vulnerabilities." But after Stenberg's team reviewed them, they found that three were false positives, one was a non-security bug, and only one was a real vulnerability. A few days later, Stenberg received 17 vulnerabilities from people running other AI tools. He said on LinkedIn, "Mythos is not even close to the end of this race," and in his blog post about the experience wrote that he thinks Mythos hype is "primarily marketing."

The harness matters more than the model

With different models now excelling at many different tasks and the top-tier models converging on capabilities, the biggest variable in optimizing vulnerability discovery is the harness. 

A harness is the orchestration layer that wraps around a model (or multiple models). This includes the logic that decides which agent runs when, what context it receives, how findings get validated, and when to escalate to a stronger model. It is code, workflow design, and prompt architecture working together, with the model serving as just another one of those components. 

Harnesses narrow a general-purpose LLM down to a specific domain and set of tasks. They also take advantage of the non-determinism in LLMs, which causes each run to surface slightly different results. With a harness, multiple agents review a codebase, with the expectation that no one agent will find 100% of the vulnerabilities (including agents running on Mythos).

In the context of vulnerability research, Cloudflare’s research outlines an example of what a solid harness setup often looks like:

  • A recon stage that reads the repository and creates a task queue for everything downstream
  • A hunt stage where many agents run in parallel, each searching for vulnerabilities
  • A validation stage where an independent agent, using a different prompt and with no ability to generate its own findings, tries to disprove what the hunting agent found
  • A tracing stage that follows confirmed findings across the repo to determine whether attacker-controlled input can actually reach the bug from outside the system
  • Deduplication logic to consolidate findings that have the same root cause
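To make the stages above concrete, here is a minimal sketch of how they could be wired together. It is not Cloudflare's implementation: the `Finding` type, the `run_pipeline` function, and the toy stand-in agents are all hypothetical names invented for illustration, and a real harness would call models where these stubs use simple string checks.

```python
from dataclasses import dataclass, replace
from typing import Callable


@dataclass(frozen=True)
class Finding:
    """One candidate vulnerability (hypothetical shape, for illustration only)."""
    file: str
    root_cause: str
    reachable: bool = False


def run_pipeline(
    repo: dict[str, str],
    recon: Callable[[dict[str, str]], list[str]],
    hunters: list[Callable[[str, str], list[Finding]]],
    validator: Callable[[Finding], bool],
    tracer: Callable[[Finding], bool],
) -> list[Finding]:
    # Recon: turn the repository into a queue of files to inspect.
    tasks = recon(repo)
    # Hunt: every hunter looks at every task (a real harness would run these in parallel).
    candidates = [f for path in tasks for hunt in hunters for f in hunt(path, repo[path])]
    # Validate: an independent check that can only reject, never add findings.
    confirmed = [f for f in candidates if validator(f)]
    # Trace: record whether outside input can reach the bug.
    traced = [replace(f, reachable=tracer(f)) for f in confirmed]
    # Deduplicate: keep one finding per (file, root cause).
    unique: dict[tuple[str, str], Finding] = {}
    for f in traced:
        unique.setdefault((f.file, f.root_cause), f)
    return list(unique.values())


# Toy stand-ins for model-backed agents (assumptions, not real agents).
def recon(repo: dict[str, str]) -> list[str]:
    return sorted(repo)

def careful_hunter(path: str, source: str) -> list[Finding]:
    return [Finding(path, "eval-injection")] if "eval(" in source else []

def noisy_hunter(path: str, source: str) -> list[Finding]:
    # Finds the same bug, but also reports a speculative issue on clean files.
    label = "eval-injection" if "eval(" in source else "speculative"
    return [Finding(path, label)]

def validator(finding: Finding) -> bool:
    return finding.root_cause != "speculative"

def tracer(finding: Finding) -> bool:
    return finding.file == "auth.py"


repo = {"auth.py": "eval(user_input)", "util.py": "print('hi')"}
report = run_pipeline(repo, recon, [careful_hunter, noisy_hunter], validator, tracer)
print(report)
```

In this toy run, both hunters flag the same bug in `auth.py`, the noisy hunter's speculative finding is rejected at validation, and deduplication collapses the two matching reports into one.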

Harness design is so impactful that it often matters more than model choice itself. UCSB researchers ran the same model, Claude Opus 4.6, on the same tasks with different harnesses and found that the best harness passed four times as many tests as the worst harness. For comparison, the spread between frontier models like Opus 4.6 and GPT-5.4 on standard coding benchmarks is only about one percentage point. That means teams obsessing over which model to use are optimizing the wrong variable.

Niels Provos demonstrated the same concept from the other direction. He built a harness that found an 18-year-old vulnerability in a popular library, then swapped in the open-weight GLM 5.1 and got comparable results. He showed that a strong harness can make the model a swappable component, rather than the primary driver. 

Research from Mozilla's security team explains why investing in harness design pays off over time. Once their harness pipeline was solid, each new model they dropped in immediately improved bug-finding, proof-of-concept generation, and impact analysis without any rearchitecting. When Mythos became available to them, they were able to slot it in and immediately benefit. Build the harness right, and model progress becomes something you absorb for free rather than scramble to adopt.
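One way to picture a model as a swappable component is an interface that the harness depends on instead of any specific model. The sketch below is an assumption about how that could look, not how Provos or Mozilla built theirs: `Model`, `Harness`, and both stub classes are hypothetical, and the stubs return canned strings where a real backend would call a model.

```python
from typing import Protocol


class Model(Protocol):
    """Anything with a complete() method can be plugged into the harness."""
    def complete(self, prompt: str) -> str: ...


class Harness:
    def __init__(self, model: Model) -> None:
        self.model = model

    def review(self, source: str) -> str:
        # The prompt and workflow live in the harness, not in the model.
        prompt = f"List potential vulnerabilities in this code:\n{source}"
        return self.model.complete(prompt)


class StubOpenWeightModel:
    def complete(self, prompt: str) -> str:
        return "open-weight: possible code injection"


class StubFrontierModel:
    def complete(self, prompt: str) -> str:
        return "frontier: possible code injection"


harness = Harness(StubOpenWeightModel())
first = harness.review("eval(user_input)")

# Swapping the model is one assignment; the harness itself is unchanged.
harness.model = StubFrontierModel()
second = harness.review("eval(user_input)")
print(first)
print(second)
```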

Money talks

Another problem with using Mythos for everything is an economic one. Bigger models generally perform better, but they’re also way more expensive.

A single thorough Mythos scan of a repository costs real money, in the range of tens of thousands of dollars, for what might be a few vulnerabilities. Run Opus 4.6, or even GPT-5.4 nano, ten times for the same cost as running Mythos once, and you generally find more. Cost doesn’t scale 1-to-1 with ability. For example, input and output tokens for GPT-5.4 cost half as much as those for GPT-5.5, but the former doesn’t have half the reasoning power of the latter. Internally, we found that eight GPT-5.4-mini agents outperform one GPT-5.5 agent in some cases, at about the same cost. Cheaper models let you turn the number of agents into an advantage.
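A rough way to see why many cheap runs can beat one expensive run is to treat each run as an independent chance of spotting a given bug. The sketch below assumes exactly that, and the probabilities in it (0.3 for a small model, 0.6 for a large one) are made-up illustrative numbers, not measurements from this article. Real runs of the same model are correlated, so the result is an optimistic estimate.

```python
def detection_probability(p_single: float, n_agents: int) -> float:
    """Chance that at least one of n independent runs finds a given bug."""
    if not 0.0 <= p_single <= 1.0:
        raise ValueError("p_single must be between 0 and 1")
    if n_agents < 0:
        raise ValueError("n_agents must be non-negative")
    # Every run must miss for the bug to go unfound.
    return 1.0 - (1.0 - p_single) ** n_agents


# Illustrative, assumed per-run hit rates (not from any benchmark).
eight_small = detection_probability(0.3, 8)
one_large = detection_probability(0.6, 1)
print(f"eight small-model runs: {eight_small:.3f}")
print(f"one large-model run:    {one_large:.3f}")
```

Under these assumptions, eight runs at a 30% hit rate find the bug about 94% of the time, compared with 60% for the single stronger run.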

The smaller model will generally produce more false positives than a frontier model, since it is less precise. But this is a rare case where quantity matters as much as quality, because you want to capture as many vulnerabilities as possible. Harnesses filter out the extra noise by having other agents verify the exploit chains and clean up the results, and that is a whole lot more economical than running Mythos or other frontier models to find everything.

For threat actors, what are they actually going to use? Not Mythos. First off, they don’t have it. But they're going to want to use whatever runs cheaply, repeatedly, at scale, and they’re not going to wait in line. Open-weight models with decent harnesses work well, and that’s probably what they’re doing right now.

And for organizations, what's sustainable? Running a frontier model on every code change certainly isn't. Running a multi-tier orchestration that uses cheap models regularly and expensive ones precisely… that is.

Pay no attention to the model behind the curtain

Mythos was a fascinating moment in our timeline. It got everyone's attention about what models can do now. But high-quality, capable autonomous vulnerability discovery is accessible through cheaper alternatives, and you are not limited to Mythos or Project Glasswing.

Vendors locked to a single model have to make that one model perfect. Vendor-agnostic platforms get to pick the right tool for the right job. A smaller model can sweep wide and surface candidates, while a stronger model can deep-dive into the ones that look interesting and require higher reasoning capabilities. To get the best results in AppSec and AI pentesting, you want to prioritize systems with sophisticated harnesses that use the right models, rather than getting too concerned about having the fanciest model involved.
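The sweep-then-deep-dive idea can be sketched as a two-tier loop: a cheap scorer looks at everything, and only candidates above a threshold are escalated to the expensive reviewer. Everything here is hypothetical, including `tiered_scan`, the threshold value, and the string-matching stand-ins that take the place of real model calls.

```python
from typing import Callable


def tiered_scan(
    changes: list[str],
    cheap_scorer: Callable[[str], float],
    deep_review: Callable[[str], bool],
    threshold: float = 0.5,
) -> tuple[list[str], int]:
    """Return confirmed findings and the number of expensive reviews used."""
    confirmed: list[str] = []
    escalations = 0
    for change in changes:
        # Tier 1: the cheap model scores every change.
        if cheap_scorer(change) < threshold:
            continue
        # Tier 2: only suspicious changes reach the expensive model.
        escalations += 1
        if deep_review(change):
            confirmed.append(change)
    return confirmed, escalations


# Toy stand-ins for a small and a large model (assumptions for illustration).
def cheap_scorer(change: str) -> float:
    suspicious = ("user_input", "os.system")
    return 0.9 if any(token in change for token in suspicious) else 0.1

def deep_review(change: str) -> bool:
    return "user_input" in change


changes = [
    "rename variable",
    "query = 'SELECT ' + user_input",
    "os.system('ls')",
]
confirmed, escalations = tiered_scan(changes, cheap_scorer, deep_review)
print(confirmed, escalations)
```

In this toy run, the expensive reviewer is called for two of the three changes and confirms one of them, so the cost of the stronger model scales with the number of suspicious changes and not with the total volume.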

At Aikido, we figured out early that quantity, orchestration, and the freedom to pick the right tool for the job beat chasing whatever's currently behind the highest paywall. As an AppSec provider, we see our responsibility as building the orchestration that lets the model layer keep evolving underneath. If you want to learn more about how our pentesting can help you secure your application, talk to us today.

PS. We’ve also written a Mythos-ready checklist to help teams prepare for threats from agentic AI (whether powered by Mythos or many GPT-5.4 minis).

