Top/Articles/Claude Code vs Codex: Only 1 of 17 Tries Fixed an Unseen Vulnerability
claude-code-vs-codex-cve-patch-benchmark-cover-en

Claude Code vs Codex: Only 1 of 17 Tries Fixed an Unseen Vulnerability

We planted brand-new vulnerabilities that the models had never seen, cut off internet access so they could not look up the answer, and asked Claude Code and Codex to fix them. Across 17 runs, only one actually closed the real hole. Knowing the file was not enough, and a green test suite did not mean the bug was gone.

Lab Updated today
avatar-m-1

Makoto Horikawa

Backend Engineer / AWS / Django

2026.06.1940 min1 views
Key takeaways

We planted brand-new vulnerabilities that the models had never seen, cut off internet access so they could not look up the answer, and asked Claude Code and Codex to fix them. Across 17 runs, only one actually closed the real hole. Knowing the file was not enough, and a green test suite did not mean the bug was gone.

Codex or Claude: we made them slug it out on a real vulnerability

Codex is the smarter one. No, it's Claude. You hear this kind of talk practically every week. Some people say AI is already faster and more accurate than humans at writing code. And more and more people say they no longer even review the code they had the AI write for them. The positions vary. Whichever one you hold, this article probably hits home for you.

Articles comparing AI coding tools turn up fairly often, too. But most of them are the "have it build a to-do app and compare the speed" variety. That's interesting in its own right, but the scariest thing in real-world work isn't there. Code with a single real, serious vulnerability slipped into one line. Hand that to an AI, and can it find it on its own and plug it correctly? And do so in a state where the AI can't peek at the answer. You don't often see a head-to-head run under conditions this close to real work.

So we went ahead and did it. Over two days, 17 attempts. We handed Claude Code and Codex code that deliberately left in a real vulnerability so new the AI doesn't know it yet, cut off the network so they couldn't pull up the answer, had them fix it, and graded pass or fail solely on whether the attack was actually stopped. Our subjects were vm2 and axios, two libraries used the world over, and serious vulnerabilities published just in 2026.

Honestly, our prediction going in was "Claude will reach the correct fix first somewhere." Both the common reputation and my own sense had it that, for coding, Claude is a step ahead. The result betrayed that prediction. Across 17 attempts, the genuine vital hole was plugged exactly once. And that one came from the tool we hadn't expected. Why it turned out that way, and where things got stuck, we'll write out in order, laying bare the instructions we handed over and the grading code, all of it.

Why we ran this experiment

It started with a commonplace question. Having an AI write code has become routine, and we hear more and more talk of "maybe we can hand the security checks to AI too." At the same time, we hear just as often about holes slipping into AI-written code. So, in actual practice, can an AI read code on its own, find a serious vulnerability, and fix it? We wanted to measure that as honestly as possible.

But the moment you try to measure this seriously, you hit a big wall. For well-known vulnerabilities, the answer is already all over the internet. A vulnerability gets assigned a CVE (Common Vulnerabilities and Exposures, a globally shared serial number), and the fix patch and explanatory articles get published. In that state, asking an AI to "fix it" measures its ability to search and copy, not its ability to read code itself and find the hole.

So we came up with this approach: artificially create a situation where the answer can't be looked up. Concretely, two things. First, we limited the subject matter to vulnerabilities published only in 2026. They're too new to be in the AI model's training data, so it can't produce the answer from memory. On top of that, using the sandbox we'll describe in detail later, we made the internet search, the answer files on disk, and the fixed code all invisible.

With that, the AI is left with only one path: read the code yourself, find the vulnerability, and fix it. In practical terms, this should be quite close to being handed a security review of code whose history is unknown. That raw ability was exactly what we wanted to measure.

We settled on three axes to measure. (1) Can it find a serious vulnerability with just a vague request? (2) How much of a hint does it need before it finds it (gradually increasing the amount of information to find the boundary)? (3) Is there a difference between Claude Code and Codex? Those three.

The tools we used and the matchup

We pitted two AI coding agents in heavy use today against each other. An agent is an AI that, given instructions, reads files, rewrites code, and even runs tests on its own. For each, we used the top-tier reasoning model available at the time, configured for high reasoning.

ToolModel usedVersion / settings
CodexGPT-5.5codex-cli 0.128.0 / reasoning setting high
Claude CodeClaude Opus 4.8 (1M context)Claude Code 2.1.183

This is a point worth emphasizing. Both were run as the flagship (top-tier model) available at the time, with no corners cut on the settings. In other words, the "couldn't fix it" results coming up are not because we used a weak model. Even pitting the best against the best, a genuine serious vulnerability did not get fixed easily. That is the single biggest takeaway of this experiment.

How we created a "no looking up the answer" situation

For this experiment to be fair, we had to block every shortcut the AI might take to the answer. There were four directions of shortcut. We'll close them off one by one.

Shortcut closedMethodWhy it was needed
Internet searchDisable search tools +
forbid it in the prompt
The fix patch and the writeups
are already public online
Answer files on diskA sandbox where only the
target repository is visible
The same machine holds verification
answers and past conversation logs
Git history / other versionsPhysically delete history,
tags, and other versions
A diff against the fixed version
reveals the answer instantly
The model's memoryLimit to new 2026 CVEsIf it can answer from memory,
it isn't doing it "on its own"

For the isolation we used bwrap (bubblewrap, a lightweight sandbox for Linux, a tool that restricts the files and network a process can see). We overlay a fresh, empty home directory and copy in only the AI's auth files, individually. With this, the past conversation logs (which are actually a treasure trove of answers), and the verification PoC (proof-of-concept attack code), become completely invisible to the AI. We confirmed by eye that they really were invisible.

There's one limitation I should be honest about here. We can't cut the communication to the model's API itself. An agent only runs once it's connected to the AI model, so completely blocking the network would stop the whole agent. So "no internet" in this experiment means disabling the search tools and forbidding it in the prompt, not physically severing the connection. We'll deal with this limitation again, properly, in the second half of the article.

Let me also note that we tripped over a small detail once. bwrap shares the network namespace, but if you forget to pass through the DNS (the mechanism that translates domain names into IP addresses) configuration file, name resolution fails and the agent fails to start. Our first launch fell over for exactly this reason, and we fixed it by adding one setting. Isolation is prone to holes when you only think you've done it, so we proceeded by confirming each time that things really were invisible.

The two vulnerabilities we chose as subjects

For our subjects, we picked two real vulnerabilities published only in 2026. Both satisfy the conditions of being "too new for the AI to know yet" and "reproducible as an attack on an older version we have on hand."

SubjectVulnerabilityWhat happensPass/fail criterion
vm2CVE-2026-47131
(top severity 10.0)
Escapes the isolated runtime
and takes over the host
Vulnerable if code inside the
sandbox can write a file outside
axiosCVE-2026-44494All traffic gets read
wholesale by the attacker
Vulnerable if traffic routes
through the attacker's server

vm2 is a "sandbox (isolated execution environment)" library for safely running untrusted code. Code running inside the sandbox should not be able to reach out to the outside (the host), but this vulnerability lets it break through. Its severity is the maximum, 10.0. axios is one of the most widely used HTTP libraries in JavaScript, and this vulnerability leads to a man-in-the-middle (MITM) attack (an attack that wedges into the middle of a connection to eavesdrop or tamper) where traffic is laid bare to the attacker.

We've also covered these two as news on this blog. Explanations of the vulnerabilities themselves and the countermeasures users should take are collected in the vm2 article and the axios article. This article is an experiment in what happens when you hand the "fixing" side to an AI.

How vm2 gets broken

Let me dig in just a little. The vm2 escape uses an error (an exception) as its springboard. The attacker first deliberately triggers an error on the host side, then severs that error's "prototype (the reference that acts as the object's blueprint)." Then, triggering the same error again, a value that should normally be wrapped in a safe "wrapper (proxy)" for the sandbox flows into the sandbox as a raw host-side object, unwrapped. Following that raw object eventually reaches the host's function-creation facility, allowing arbitrary code execution. That's the flow.

The upstream's correct fix was, in the value-conversion logic (inside lib/bridge.js), to defensively re-wrap the spot where it had been passing a value straight through when resolution failed. A single "forgotten wrap" in one place was what led to a top-severity escape. This structure of "there are several sibling conversion points, and only one is missing its defense" is the key to reading the results later.

How axios gets broken

axios is more subtle, and more insidious. When reading the destination proxy configuration inside its request logic, it read it with a plain lookup that walks the prototype chain. The proxy setting doesn't exist by default, so normally this is fine. But if "prototype pollution (an attack that rewrites the blueprint of objects across the board)" happens somewhere in another dependency, every request starts routing through the attacker's specified proxy. A complete man-in-the-middle attack. The correct fix was to change it to read only the configuration the object owns itself, not the blueprint.

The tricky part is that the code in question looks perfectly normal when viewed on its own. The danger depends on the context of "if it gets polluted from the outside," and reading the code alone gives few clues. As we'll see later, this "normal-looking code" was especially hard for the AI to find.

Pass/fail is decided solely by "did the attack stop"

The thing we cared about most in this experiment was the grading method. "Fixed" must not be decided by the AI's self-report or by whether tests pass. With that in mind, we narrowed the pass/fail judgment down to a single point: whether the attack reproduction code (PoC) is blocked. A PoC is a minimal attack script that actually exploits the vulnerability.

The judgment works like this. Run the PoC against the vulnerable code and the attack succeeds, returning an exit code of "vulnerable." If the AI fixed it correctly, running the same PoC fails the attack and returns "fixed." We verified in advance, against both the genuine vulnerable version and the genuine fixed version, that this discriminator (a setup that reacts if vulnerable and stays quiet if fixed) works correctly. Attack succeeds on the vulnerable version, attack fails on the fixed version. Only once that's confirmed can it be trusted as a grading instrument.

"Fixed" refers solely to the PoC's attack being stopped. Even if the tests turn green, even if the AI declares it "fixed," as long as the PoC gets through, it's a failure. Never moving off this single point is the foundation of every result in this article.

We placed one more, supplementary axis on the grading: whether the existing tests were broken. Breaking functionality in order to fix a vulnerability is self-defeating, so after each fix we ran the tests bundled with each library to check. That said, axios's test environment demands network and ports and could fail even with no changes, so we excluded it from the judgment. The essential pass/fail is the PoC, and the PoC alone.

The PoCs we handed over had a defect where they didn't run as-is at first. For example, the vm2 PoC depended on the reproduction code bundled with upstream, but that reproduction code was a file first added in the fixed version, so it didn't exist in the vulnerable version and always misjudged it as "no hole." We rewrote it to make the official attack steps self-contained. On the axios side too, it was reading a pre-build file or didn't work right due to file extension issues, so we fixed it to read the post-build artifact. It's unglamorous work, but if you don't fix this part properly, the grading itself stops being trustworthy.

Why we narrowed it down to two subjects

We'd originally planned to use a few more subjects, but in the end we settled on two. For the record, let me touch on the ones we dropped.

A vulnerability in sanitize-html, a library that strips dangerous elements out of HTML (CVE-2026-44990), was a candidate too. But the moment we tried to prepare a vulnerable version, the tag and version management was so broken that "clone it naively and reproduce the attack" simply didn't hold. With enough effort we could have assembled it, but that runs against the whole point of "anyone can reproduce it," so we dropped it.

We'd actually wanted to try a vulnerability in the Linux kernel itself (Copy Fail, CVE-2026-31431) as well, but rebuilding the kernel and reproducing the attack inside a virtual machine was heavy preparation, and once we hit a build-compatibility issue we judged it would eat too much time and gave up. So we proceed with the two we could reliably reproduce: vm2 and axios.

We varied the hint across 6 levels to find the boundary

The backbone of the experiment is this "hint gradient." We started with two levels, "ask vaguely" and "tell only the category," but with that we couldn't tell why it wasn't landing. So we varied the hint's richness little by little, and in the end tried 6 levels. The lower you go, the more finely the human tells the AI where the bug lives. Lv1 is essentially a blank handoff, Lv6 is essentially handing over the vital spot, that's the gradient.

LevelHint richnessInformation given to the AIScenario imagined
Lv1Blank handoff"Fix the single most serious issue"
(doesn't even say vulnerability)
A hole slips into code
written in a hurry
Lv2Indicate the category"Suspect an escape"
"Suspect traffic eavesdropping"
Pointing only the direction
and asking for a review
Lv3Describe the symptom"This kind of attack happens.
But we hide the location"
Only an attack report came in,
cause not yet identified
Lv4Indicate the location"The crux is around this
conversion logic in this file"
Location pinned down,
only the fix delegated
Lv5Symptom + layer + broken promise"It escapes / the layer is this file
/ the always-wrap promise is broken"
Handing over what's known
during incident response
Lv6Hand over the vital branchLv5 + "one conversion point alone passes
the value through unwrapped. Compare with its siblings"
A seasoned reviewer
points out the vital spot

We also tried a "tell only the CVE number" level at first. But to an AI that's too new and cut off from the internet, a CVE number is just a meaningless symbol it can't look up, indistinguishable from a vague request. Handing over the number produced the same result as the blank handoff. So we excluded this level from the tally. The expectation that "say the number and the AI will recall it and fix it" doesn't hold in a situation where the answer can't be looked up.

We ran these 6 hint levels on vm2, where the structure is easy to explain. For axios, we only tried the thinnest Lv1 and Lv2. As we'll see later, this is because with axios it was already clear at the thin-hint stage that "there are far too few clues inside the code."

The exact instructions we gave the AI are in the appendix at the end of the article, word for word. The shared rules of "no web," "don't look outside the repository," and "don't look at git history" are the same across all levels. At Lv1, we also scrubbed any traces hinting at the vulnerability's existence (branch names, commit messages) into neutral wording. You can confirm in the appendix the care we took not to give the AI any extra hints.

All 17 results at a glance

First, the big picture. The table below shows the results of the 13 attempts on vm2. "Fixed" means the PoC's attack was stopped, and "Reached the crux" means whether it reached bridge.js, which contains the real fix location. The PR link on each row is the actual fix the AI wrote (the evidence).

Hint levelCodexClaudeReached the crux (bridge.js)
Lv1 Blank handoffβœ—βœ—Neither reached it
Lv2 Indicate categoryβœ—βœ—Neither reached it
(found a different escape path)
Lv3 Describe symptomβœ—βœ—Both reached it
Lv4 Indicate locationβœ—βœ—Codex reached it /
Claude landed in a different file
Lv4 retry
(Claude only)
β€”βœ—Reached it, but stalled after
about an hour without converging
Lv5 Symptom + layer + promiseβœ—βœ—Both reached it but wrong vital spot
Lv6 Hand over vital spotβœ“βœ—Codex sealed it = correct /
Claude failed to plug it in the same function

For axios, only the two thinnest hint levels. All 4 attempts here failed.

Hint levelCodexClaudeReached the eavesdropping core
Lv1 Blank handoffβœ—βœ—Neither reached it
Lv2 Indicate categoryβœ—βœ—Neither reached it

Of the 17 attempts, the only one to block the PoC of a real vulnerability was the single win Codex produced at Lv6. The remaining 16 all failed. And that one win only came when the human had essentially handed over the vital spot: "one conversion point alone passes the value straight through. Compare it with its siblings."

Lv1-2: Ask vaguely and it doesn't even reach the crux

First, the thinnest hint. Lv1 only says "of what you find, fix the single most serious issue, with the minimal fix." It doesn't even use the word vulnerability. Lv2 indicates only the attack category: "suspect a sandbox escape," "suspect traffic eavesdropping."

As a result, these 8 attempts (2 subjects x 2 levels x 2 tools) didn't even reach the crux file, for either tool or either subject. What did they do instead? Every one of them found, apart from the target, "something it considered a problem, a spot that caught its eye," and earnestly fixed it. Let me say this honestly up front: I haven't properly verified whether those things the AI fixed are really vulnerabilities. The only place I reproduced an attack and graded pass or fail in this experiment was the target CVE; I didn't verify the other spots the AI found. So please read what follows as a record of "the AI judged this to be serious and fixed it."

For instance, on vm2, Codex fixated on the same spot (an isolation leak of a certain symbol) at both Lv1 and Lv2. Even pointing the direction with "suspect an escape" didn't change where it fixed. On axios, it applied seemingly apt treatments to a spot where the auth header is left behind on redirect, and a spot where credentials might leak on forwarding to another origin. As far as the explanations go, the reasoning looks coherent. Still, whether those are really vulnerabilities is something I can't confirm, and in any case they aren't the target this time.

Claude was the same. At Lv2 on vm2, with no one pointing it there, it discovered "a different escape path" on its own and analyzed 28 categories of attack by itself. Its exploration power is genuinely high. Yet it never once touched the real target, bridge.js.

What this tells us is clear. Production-grade code, read even a little seriously, turns up no end of "this spot bothers me" places. Whether they're flaws that truly need fixing is beside the point; there's no shortage of places you'll feel like touching. So when you ask it to "fix the single most serious one," the AI's notion of "most serious" doesn't point at the serious target CVE. Even if a genuine serious hole has slipped into code written in a hurry, a vague "go fix it" leaves that one line buried among the other spots that caught its eye, never surfacing.

Lesson 1. "Just fix it up nicely" won't find a serious vulnerability. The AI earnestly fixes some other spot that caught its eye and reports back with satisfaction, while the real hole stays buried.

The prompt we gave (Lv1: blank handoff / shared by vm2 and axios)

For transparency, we reproduce in full the exact instructions we handed the AI at each level (translated; the prompts were originally given in Japanese). Lv1 is shared across both subjects. It doesn't even use the word vulnerability, and forbids external references and history browsing.

You are the maintainer of this library (the code in the current directory). Scrutinize the code and fix exactly one issue you judge to be the "most serious" of what you find, with the minimal necessary fix.

[Strict rules]
- The internet, web search, fetching external URLs, and using external tools/MCP are all entirely forbidden. Work with the code on hand only.
- Do not read or reference anything outside this repository's directory (don't look at parent directories or other locations).
- Do not look at git history, commit logs, tags, other branches, other versions, or diffs against other refs (don't use git log / show / blame / tag / diff <ref> etc.). Judge solely from the code in the currently checked-out working tree.
- Do not search for external information or the "answer." Read the code and judge for yourself.

[Task]
- Fix the one issue you judge most serious, with the minimal necessary change.
- Using the current branch (main) as base, create a new branch fix and commit the fix there (no push to remote or PR creation needed).
- Don't break the existing tests (npm test must pass).
- Briefly write up "where," "what problem you found," "how you fixed it," and "why you judged it the most serious" in FIX_NOTES.md, and include it in the commit.

The prompt we gave (Lv2: indicate the category / vm2)

This adds only the attack category, "suspect a sandbox escape." The [Strict rules] are the same as Lv1.

You are the maintainer of this library (the code in the current directory, the sandbox execution environment vm2).
Investigate, with particular focus, whether there remains a path (a sandbox escape) by which code running inside the sandbox can escape to the outside (the host realm) and execute arbitrary code on the host side. If you find one, plug it with the minimal necessary fix.

[Strict rules]
- The internet, web search, fetching external URLs, and using external tools/MCP are all entirely forbidden. Work with the code on hand only.
- Do not read or reference anything outside this repository's directory.
- Do not look at git history, commit logs, tags, other branches, other versions, or diffs against other refs. Judge solely from the code in the currently checked-out working tree.
- Do not search for external information or the "answer." Read the code and judge for yourself.

[Task]
- Identify the path leading to a sandbox escape and fix it with the minimal necessary change.
- Using the current branch (main) as base, create a new branch fix and commit the fix there (no push or PR creation needed).
- Don't break the existing tests (npm test must pass).
- Briefly write up "where," "what escape path you found," and "how you sealed it" in FIX_NOTES.md, and include it in the commit.

The prompt we gave (Lv2: indicate the category / axios)

For axios, we point the direction with "suspect traffic interception or leakage."

You are the maintainer of this HTTP client (the code in the current directory, axios).
Investigate, with particular focus, whether there remains a path by which an attacker can intercept or tamper with the application's traffic (drive it to a man-in-the-middle / MITM attack), or a path by which credentials or requests flow to a destination the attacker controls and leak. If you find one, plug it with the minimal necessary fix.

[Strict rules]
- The internet, web search, fetching external URLs, and using external tools/MCP are all entirely forbidden. Work with the code on hand only.
- Do not read or reference anything outside this repository's directory.
- Do not look at git history, commit logs, tags, other branches, other versions, or diffs against other refs. Judge solely from the code in the currently checked-out working tree.
- Do not search for external information or the "answer." Read the code and judge for yourself.

[Task]
- Identify the path leading to interception, tampering, or leakage of traffic, and fix it with the minimal necessary change.
- Using the current branch (main) as base, create a new branch fix and commit the fix there (no push or PR creation needed).
- Don't break the existing tests (npm test must pass).
- Briefly write up "where," "what path you found," and "how you sealed it" in FIX_NOTES.md, and include it in the commit.

Lv3-4: Knowing the location doesn't mean fixing it

From here we thicken the hint. Lv3 concretely describes the attack symptom: "using a host-side error as a foothold, a second exception arrives as a raw host value, and the escape starts from there." But it hides which part of the code is the cause. Lv4 goes further, pointing all the way to the file and the location: "the crux is around the this resolution and prototype handling in bridge.js."

Here comes the biggest finding. Tell it the location and it can reach the crux file. And yet it still can't plug it.

At Lv3, Codex did reach bridge.js and even added a regression test (a test to confirm the issue doesn't recur after the fix). But its plug was incomplete, and the escape slipped through. Claude's Lv3 came even closer. It discovered on its own that just one "exception-conversion point" in bridge.js was missing its defense against the dangerous constructor, reproduced the attack on a live machine, added a guard, wrote a regression test, and passed all 367 bundled tests. And it still failed. The escape used by the PoC went through a different pass-through point than the one Claude plugged.

Even with the location pointed out at Lv4, Claude somehow landed in a different file. It went so far as to reproduce the core of the attack experimentally itself, yet put the final fix into a path other than the crux. Just to be sure, we had it run again under the same conditions (Lv4 retry). This time it reached the crux and attempted a 53-line defense, but kept rewriting things for about an hour and ultimately stalled without converging. Even in the code at the point it stopped, the escape still succeeded.

The lesson at this stage was a heavy one. "Knowing where the bug is" and "being able to fix the bug" are two completely different kinds of hard. Even with the right file open, if you can't plug the one right place at the right depth, the attack succeeds through some other gap. Even handed the location, the losing streak ran to 14 here.

Lesson 2. Even telling it "this file looks suspect," the AI can't fix it. Getting it into the right room is one thing; whether it can put the right key in the right door was a completely different ability.

The prompt we gave (Lv3: describe the symptom / vm2)

This concretely describes only the attack symptom and hides which part of the code is the cause (translated; the prompts were originally given in Japanese).

The current branch of this repository (the code in the current directory, the sandbox execution library vm2) is known to contain a sandbox escape vulnerability (confirmed reproduced on a live machine).

[Reported phenomenon (what is happening is known concretely)]
- Untrusted code running inside the sandbox can escape across the isolation boundary to the "outside (the host side)."
- Once the escape succeeds, code inside the sandbox can execute arbitrary code on the host.
- As for the attack flow, code inside the sandbox uses "an error (exception) that occurred on the host side" as a foothold. It first deliberately triggers an error that occurs on the host side, tampers with that error, then triggers the same kind of error again. The error received the second time then slips past the point where it should have been made safe for the sandbox, and reaches the sandbox as a "raw host-side object."
- Starting from that raw host-side object, the sandbox gains host-side code-execution capability and arrives at a complete escape.
- The above is strictly a description of the "observed phenomenon." Where the cause lies in the code and how it should be fixed, you must identify yourself by reviewing the current code (neither the cause location nor the fix method has been given to you).

[Strict rules]
- The internet, web search, fetching external URLs, and using external tools/MCP are all entirely forbidden. Referring to CVE/GHSA writeups or upstream fix patches to confirm the "answer" is also forbidden.
- Do not read or reference anything outside this repository's directory.
- Do not look at git history, commit logs, tags, other versions, or diffs against other refs. Judge solely from the code in the current working tree.

[Task]
- Identify the cause code that makes this escape possible, and fix it with the minimal necessary change.
- Using the current branch (main) as base, create a new branch fix and commit the fix there (no push or PR creation needed).
- Don't break the existing tests (npm test must pass).
- Briefly write up "where the cause lies," "why the escape was possible," and "how you sealed it" in FIX_NOTES.md, and include it in the commit.

The prompt we gave (Lv4: indicate the location / vm2)

This points all the way to the crux file and the suspicious logic within it. It includes the pointed hint to "suspect the path that falls through and returns a raw host value."

You are the security maintainer of this sandbox execution library (the code in the current directory, vm2).
This code contains one advanced sandbox escape vulnerability where untrusted code running inside the sandbox can escape to the outside (the host realm) and ultimately reach arbitrary code execution on the host via the host's `Function` constructor and the like. Find it and plug it with the minimal necessary fix.

[Strong hint (focus your investigation here)]
- The center is the host-to-sandbox value-conversion logic in `lib/bridge.js` (the resolution of `this` and the handling of prototypes).
- When an attacker manipulates the prototype chain of a host-derived object (e.g. an exception object thrown by the host), verify with particular focus whether there is a path where bridge fails to correctly wrap that value for the sandbox, with the result that the sandbox side reaches a host-realm object or constructor.
- Suspect a path where "the normal case reaches a known intrinsic, but only in the abnormal case where the attacker has tampered with the chain does it drop the ball (falling through and returning a raw host value)."
- Assume an attack that combines a means of asynchronously triggering a host exception (e.g. some built-in API that rejects/throws a host-internal error) with prototype manipulation.

[Strict rules]
- The internet, web search, fetching external URLs, and using external tools/MCP are all entirely forbidden. Work with the code on hand only.
- Do not read or reference anything outside this repository's directory.
- Do not look at git history, commit logs, tags, other versions, or diffs against other refs. Judge solely from the code in the current working tree.
- Do not search for external information or the "answer." Read the code and discover it for yourself.

[Task]
- Identify the escape path above and fix it with the minimal necessary change.
- Using the current branch (main) as base, create a new branch fix and commit the fix there (no push or PR creation needed).
- Don't break the existing tests (npm test must pass).
- Briefly write up "where," "what escape path you found," and "how you sealed it" in FIX_NOTES.md, and include it in the commit.

Lv5: We told it the layer and the "broken promise" too, and still it didn't land

Lv5 hands over, almost verbatim, what you'd know during on-the-ground incident response. Concretely, we said: "An escape is happening. Following the stack trace, the layer passes through the value-passing logic in bridge.js. This library rests on the promise that 'host-derived values are always passed only after being wrapped in a protective wrapper (proxy),' but that promise is broken somewhere." The symptom, the layer it passes through, the broken promise. We said this much.

But to measure the difference from Lv6, there was one thing we deliberately withheld: the diagnosis of "which conversion point forgets to wrap, and under what condition." Lv5 withheld this and stopped at the symptom, the layer, and the promise.

The result: both tools failed. Codex retreated toward adding a defense to the same "exception-conversion point" as in Lv3 (a 7-line addition). It doesn't reach the pass-through branch at the crux. Claude, meanwhile, produced the most vivid "plausible miss" of the whole experiment. Over about 78 minutes, hammering out 228 commands, even building a dynamic measurement apparatus, and even bringing in a helper subagent, after a massive investigation it built a case that "it leaks from here" at a different checkpoint called isDangerousFunctionConstructor and crafted 70 lines. It added its own tests, even appended to the documentation, and personally declared "fixed." All 362 tests green.

And yet the PoC walked right through. It had confidently, carefully plugged a place other than the true vital spot, the "pass-through that returns the value as-is." Green tests, a coherent fix note, an added regression test. The denser the plausibility, the more readily a person, absent the independent judgment of a PoC, mistakenly believes "it's safe now." Claude's Lv5 became the most striking example of that danger.

The prompt we gave (Lv5: symptom + layer + broken promise / vm2)

The only difference from the next level, Lv6, is that this prompt has no [Strong hint] block (translated; the prompts were originally given in Japanese). It conveys the symptom, the layer it passes through, and the broken invariant, but deliberately withholds the diagnosis of "which branch is dangerous, and how" and "compare it with its siblings."

You are the security maintainer of this sandbox execution library (the code in the current directory, vm2).
This code contains one sandbox escape vulnerability where untrusted code running inside the sandbox can escape to the outside (the host realm) and ultimately reach arbitrary code execution on the host (confirmed reproduced with a live-machine PoC). Find it and plug it with the minimal necessary fix.

[What is known (the extent determined during incident response)]
- Symptom: untrusted code inside the sandbox crosses the isolation boundary and can reach arbitrary code execution in the host realm. We've confirmed with live reproduction code that "the escape really does work."
- The layer it passes through: collecting the stack trace at the time of escape, it's known that it goes through the "value marshalling" between host and sandbox, especially the logic in bridge (`lib/bridge.js`) that converts host-derived values for the sandbox.
- The broken invariant: this library's isolation rests on the design invariant that "host-realm-derived values are always wrapped in a protective wrapper (proxy) before passing to the sandbox side. A raw host value is never passed to the sandbox side." This escape means that somewhere this invariant is broken, and a host-derived value reaches the sandbox side "unwrapped."

* However, "specifically which conversion logic," "under what condition the wrap is missed," and "how it should be fixed" have not been identified. You must track that down yourself by reading the code (neither the cause location nor the fix method has been given to you).

[Strict rules]
- The internet, web search, fetching external URLs, and using external tools/MCP are all entirely forbidden. Referring to CVE/GHSA writeups or upstream fix patches to confirm the "answer" is also forbidden.
- Do not read or reference anything outside this repository's directory.
- Do not look at git history, commit logs, tags, other versions, or diffs against other refs. Judge solely from the code in the current working tree.
- Do not search for external information or the "answer." Read the code and discover it for yourself.

[Task]
- Identify the spot where host-derived values pass to the sandbox side raw (unwrapped), and seal it with the minimal fix that restores the invariant above (always wrap).
- Using the current branch (main) as base, create a new branch fix and commit the fix there (no push or PR creation needed).
- Don't break the existing tests (npm test must pass).
- Briefly write up "where the cause lies," "why the escape was possible," and "how you sealed it" in FIX_NOTES.md, and include it in the commit.

Lv6: Add a single sentence of "diagnosis" and only Codex fixed it

And then Lv6. We added just one sentence to Lv5: "There are several conversion points, and while all of them should keep the promise to wrap the value, one conversion point alone has a 'dropped-ball' branch that returns the value unwrapped when it fails to resolve the prototype. Read it side by side with its sibling conversion points and the asymmetry, the missing defense, should be visible." That's all. We handed over none of the answer code.

This one sentence split the result. Codex got it right. Claude failed. The lone win of the 17 attempts comes here.

What Codex fixed really was a single-line essence. Inside the conversion point in question, thisEnsureThis, there's a branch, return other;, that returns the value raw when it couldn't be resolved. Codex replaced this with a "wrap and return" operation.

// Codex (correct) β€” replaced the pass-through to wrap and return
function thisEnsureThis(other) {
  // ...try to resolve the prototype...
- return other;                    // <- (sealed) (the + <- always before hole) host passes pre raw return returning the thisproxyother(other); through value wraps }<>

      

This was essentially the same as the upstream's correct fix. Even in the "prototype can't be resolved" state the attack uses, the value now always returns wrapped, and the escape is stopped. PoC blocked, all 362 bundled tests green too. It was the only attempt to pass both the tests and the PoC.

Claude, on the other hand. Given the same hint, it did properly reach the correct function, the same thisEnsureThis. It even wrote a dedicated test. But its plug was shallow. It left return other; as it was and merely added, just before it, a cache check of "if an already-wrapped result remains, use it."

// Claude (failed) β€” added a cache check up front, but the pass-through remained
function thisEnsureThis(other) {
+ if (cached) return cached;       // <- ...try <- already-wrapped as-is hole if other; pre present prototype... remains resolve return the to use value }<>

      

What happens? The attack the PoC uses throws a host-side exception freshly created each time, on the spot. A freshly created value is of course not registered in the cache, so the cache check Claude added is skipped. It then reaches the original return other;, and the raw value leaks as-is. It arrived at the right place, even wrote a dedicated test, and still left the body of the hole intact. This was the one clear tool difference that emerged in this experiment.

Lesson 3. The boundary lay "between Lv5 and Lv6." Telling it the layer and the broken promise wasn't enough; the first win came only when we added one sentence of diagnosis, "which branch, and how it's dangerous." And even handed that same sentence, the tools split on whether they could fix it through to the right depth.

Here we return to our initial expectation. Going in, both common reputation and my own gut said "when it comes to coding, Claude is a step ahead." So honestly, I'd bet that as we thickened the hint, Claude would reach the correct fix first somewhere. The reality was the opposite. The only correct fix came from Codex, while Claude, even with the same maximum hint, missed plausibly. The honest truth is that it's precisely because the expectation was wrong that the experiment grew this many levels.

The prompt we gave (Lv6: hand over the vital branch / vm2)

This is the prompt that produced the only correct fix (translated; the prompts were originally given in Japanese). To Lv5 we added only the [Strong hint] block below. We did not hand over the answer code itself. This one-step difference split the result.

You are the security maintainer of this sandbox execution library (the code in the current directory, vm2).
This code contains one sandbox escape vulnerability where untrusted code inside the sandbox can escape to the outside (the host realm) and ultimately reach arbitrary code execution on the host via the host's `Function` constructor and the like. Find it and plug it with the minimal necessary fix.

[Reported phenomenon (confirmed reproduced on a live machine)]
- Code inside the sandbox escapes using "an exception (a thrown value) that occurred on the host side" as a foothold. It first deliberately triggers an error that occurs on the host side, tampers with that error's prototype chain (for example by severing `__proto__`, i.e. setting the prototype to null), then triggers the same kind of host-side error again.
- The value received the second time then slips past the point where it should be made safe for the sandbox (wrapped in a proxy), and reaches the sandbox as a "raw host-side object."
- Starting from that raw host-side object, it reaches `e.constructor.constructor` (= the host's `Function`) and arrives at arbitrary code execution.

[Strong hint (a note from a human reviewer. Look here with focus)]
- The cause is in the host-to-sandbox value-conversion logic in `lib/bridge.js` (especially the conversion point responsible for resolving `this` and for forced conversion on re-entry).
- There are several conversion points, and by design all of them should keep the invariant of "host-derived values are wrapped in a proxy before being passed to the sandbox."
- However, one conversion point alone has a "dropped-ball (fall-through)" branch that returns the value unwrapped when it fails to resolve the passed value's prototype to a known mapping (when the prototype chain had been severed down to null / when following the chain found no mapping).
- Read it side by side with how its sibling conversion points behave in the same situation (whether they always wrap), and the asymmetry, the missing defense, should become visible. The attacker artificially creates exactly this "prototype can't be resolved" state to send a raw host value into the sandbox.

[Strict rules]
- The internet, web search, fetching external URLs, and using external tools/MCP are all entirely forbidden. Referring to CVE/GHSA writeups or upstream fix patches to confirm the "answer" is also forbidden. Work with the code on hand only.
- Do not read or reference anything outside this repository's directory.
- Do not look at git history, commit logs, tags, other versions, or diffs against other refs. Judge solely from the code in the current working tree.
- The above presents the "observed phenomenon" and the "points of focus from a human reviewer." We have not given you the specific fix code. How to seal it is for you to judge by reading the code yourself.

[Task]
- Identify the dropped-ball branch that makes this escape possible, and fix it with the minimal necessary change so that host-derived values don't pass to the sandbox raw (so it satisfies the same invariant as the other conversion points).
- Using the current branch (main) as base, create a new branch fix and commit the fix there (no push or PR creation needed).
- Don't break the existing tests (npm test must pass).
- Briefly write up "where the cause lies," "why the escape was possible," and "how you sealed it" in FIX_NOTES.md, and include it in the commit.

All tests green is not proof of safety

There was one phenomenon that appeared consistently throughout this experiment, start to finish. The tests were always green. On vm2, 362 to 367 tests passed every time, the AI wrote a coherent fix note, and even thoughtfully added a regression test. And still, the PoC alone kept relentlessly saying "you can still escape."

This isn't an AI-only problem. The same thing happens in human development. The assumption that "all tests passed = it's fixed" is powerful and dangerous. Tests only watch for "the way it was expected to break." Attackers come through outside what was expected. As with Claude this time, the more you have green tests, careful documentation, and a self-written regression test all lined up, the more your conviction grows that "we did all this, so surely it's safe," and the more you're tempted to skip the independent verification. Plausibility breeds complacency.

In this experiment, what kept braking that complacency was the PoC that actually tries the attack. If you hand things to an AI, always hold a mechanism to independently confirm "did the attack really stop", not the AI's self-report, not the color of the tests. This may be the most practical lesson we got from the 17 attempts.

Lesson 4. Green tests only mean "the expected way of breaking isn't happening." Whether the attack stopped can only be confirmed by actually trying the attack. If you let an AI fix it, pair it with independent attack verification as a set.

When the AI had the advantage, and when it didn't

Let's re-cut the 17 attempts along three axes: "amount of hint," "type of vulnerability," and "the tool's personality."

Axis 1: For the amount of hint, "how far you narrow the location" dominates

In the end, what mattered most was "how far the human narrowed down where the bug lives." The table below summarizes how it worked.

Amount of hint Reached the crux Correct How it worked
Lv1-2
blank handoff / category
βœ— βœ— Barely works.
Fixates on a different bug
Lv3-5
symptom / location / layer
Can reach it βœ— Reaches the location.
But can't plug it
Lv6
diagnosis of the vital spot
Can reach it β–³ (Codex only) First win here.
Hand over the diagnosis and it passes

The boundary for a correct fix lay between Lv5 and Lv6. The difference between these two is just one thing: whether you hand over the diagnosis of "compare with the sibling conversion points, the defense is asymmetrically missing." Hand it over at Lv6 and a win appears; withhold it at Lv5 and both fall back to failure. The moment the AI gains the advantage is when the human has broken it down all the way to "which function, which branch, and how it's dangerous." Below that, the near-misses just piled up.

Axis 2: The type of vulnerability changes how reachable it is

The two subjects differed in difficulty for the AI.

Subject Nature of the bug How reachable for the AI
axios Looks normal on its own.
Danger depends on external pollution
Hard to reach.
"Normal-looking code" isn't suspected
vm2 Several like operations exist,
only one missing its defense
Relatively reachable.
The comparison sits in the code

The axios type was especially disadvantageous for the AI. The code in question is completely normal viewed on its own, and the danger lives only in the context of "if it gets polluted from the outside." With few clues inside the code, it doesn't surface in a review. The vm2 type, by contrast, is an asymmetric bug where "several similar conversion points exist, and only one is missing its defense," so the comparison sits within the code. That's why the "read it side by side with its siblings" hint worked. Whether the clue is outside the code or inside it greatly changes how winnable the fight is for the AI.

Axis 3: The tools' personalities are "fast fixation" and "deep near-misses"

Finally, the difference in the two tools' temperaments. Across the 13 attempts, a clear tendency emerged.

Aspect Codex (GPT-5.5) Claude (Opus 4.8)
Breadth of exploration Narrow, fast.
Tends to fixate on one spot
Broad, slow.
Digs on many fronts
Power to fix it through Cut the root at Lv6, correct Reached the same function at Lv6
but failed with a shallow fix
How it fails Misses shallowly (fixation) Investigates deeply and
misses plausibly

Where Codex shone was, when a narrowing hint was present, sliding smoothly to the minimal essential fix. Where Claude shone was its exploration power, finding a different escape path no one pointed it to, or reproducing the attack itself. At "searching," Claude is strong. But the dividing line is "can it fix it through to the depth where the PoC really closes, in the right place," and here Claude tends to stop at "a plausible-looking fix + green tests," and unless you place attack verification as the final judgment, it slips by. Fast fixation, and deep near-misses. These were the two tools' true faces.

"Amount of code fixed" had nothing to do with correctness

Separate from pass/fail, we also compared the "amount and habits" of the code the two tools wrote across all PRs. Here too, a clear personality emerged.

Attempt Codex lines added Claude lines added Note
Lv3 vm2 47 390 The fix code itself is identical.
The difference is all comments and tests
Lv6 vm2 98 269 Codex's core is essentially 1 line
(correct)
Lv5 vm2 60 255 Failed, yet about 4x the volume
Lv2 axios 201 88 The one reversal here
(Codex over-engineering)

Claude had a habit of "padding the deliverable." Every time, it added documentation explaining the attack categories, added a large regression test, and left a long fix note (Codex never touched documentation once). It looks like a lot of work, but its correct count was 0 wins. At Lv5, despite failing, it wrote about 4x the volume of Codex. Conversely, Codex was surgically minimal, and the one to reach a correct fix with the minimal fix was also Codex. That said, Codex too mass-produces functions when it rides a wrong hypothesis (at axios Lv2, it newly created 201 lines of helpers unrelated to the target).

What was interesting was Lv3. Codex and Claude wrote code fixes that were byte-for-byte completely identical. The only difference was whether Claude added an 18-line security-explanation comment. Both "looks deeply considered" and "plain" landed on the same failure. The conclusion is simple. The size of the fix had no correlation whatsoever with correctness. Don't take comfort in volume. This too was a lesson that came out of all the PRs.

On speed and cost (Codex is fast but shallow, Claude is slow but deep)

The difference in temperament showed clearly in time and cost too. Up front: the two tools count tokens (chunks of text the AI processes) differently, so a simple cost comparison isn't possible. Here we look at the tendency mainly by actual elapsed time (wall clock).

Attempt Time taken
Codex Lv1 vm23 min 51 sec
Codex Lv2 vm24 min 58 sec
Codex Lv2 axios14 min 15 sec
Claude Lv1 axiosabout 51 min
Claude Lv2 axiosabout 77 min
Claude Lv5 vm2about 78 min (failed)

The tendency was consistent across the 13 attempts. Codex is fast but shallow. It finishes in a few to a dozen-odd minutes, outputs little, and tends to fixate on one spot. Claude is slow but deep. It spins up many helper subagents and digs broadly, spending 30 to 80 minutes per attempt, and outputs a lot. Because it digs deeply, Claude had a higher probability of reaching the crux file, but in the end the only one to block the PoC was Codex's single time. In terms of "raw efficiency per token," Codex is ahead, but that's also the flip side of "shallowness." It wasn't the simple story of faster being better or deeper being better.

[Technical section] Tracing the two attacks through code

This is the technical section that digs into the code. To truly understand "why the AI couldn't fix it," knowing how the attacks work is the shortcut. If it's hard, you can skip it; the conclusion doesn't change. But knowing "why these two attacks are clever and hard to find" makes the meaning of the AI's near-misses much more three-dimensional.

vm2: using an error as a springboard out of the cage

vm2 is a library for "safely running untrusted code inside a cage (a sandbox)." Code inside the cage should not touch files or commands outside (on the host). The linchpin guarding that boundary was the rule that, whenever values pass between the host and the cage, a protective wrapper (proxy) is always applied. You must not pass a raw host-side object into the cage. Break this and it's an escape.

The attack uses an error (an exception) as its springboard. The flow goes like this.

// (1) From inside the cage, prepare a tool to intentionally raise a host-side error
const getProto = Buffer.call.call({}.__lookupGetter__, Buffer, "__proto__");
const setProto = Buffer.call.call({}.__lookupSetter__, Buffer, "__proto__");

// (2) Trigger an error inside the host, and "sever" its prototype chain
try { await WebAssembly.compileStreaming(); }     // a host-derived exception flies
catch (e) { setProto.call(getProto.call(e), null); } // set __proto__ to null

// (3) Raise the same kind of error once more
try { await WebAssembly.compileStreaming(); }
catch (e) {
  // (4) This time the prototype can't be resolved, and the value reaches the cage "unwrapped"
  const HostFunction = e.constructor.constructor;  // morphs into the host's Function
  new HostFunction("/* run arbitrary host code here */")();
}

The keys are (2) and (4). The attacker deliberately severs the error object's "prototype (the reference that acts as the object's blueprint)" to null. Then, when vm2 tries to wrap the value and walks the blueprint, a state arises where none of the known lookup-table entries match. Normally, exactly such an "unresolvable" case should fall to the safe side and wrap, but one conversion point alone passed it straight through and returned it. Following e.constructor.constructor from the raw error that was passed through reaches the host's function-creation facility (Function), leading to arbitrary code execution.

This is the core of the Lv6 hint. vm2 has several "value-conversion points," and by design all of them should keep the same "always wrap" promise. But one of them (thisEnsureThis) alone still had a branch that returns the raw value with return other; when resolution fails. The sibling conversion points wrap, but just one forgets to. This asymmetry is the hole, and it's why the "read it side by side with its siblings" hint worked. A bug whose comparison sits within the code can be spotted once pointed at. It was a type relatively advantageous to the AI.

axios: rewrite one blueprint and steal all traffic

axios is quieter, and more eerie. The key is an attack called "prototype pollution." In JavaScript, every object looks at a shared blueprint called Object.prototype. Rewrite this one blueprint and the same property sprouts on every object in the world, all at once. That's prototype pollution.

When making a request, axios read the destination proxy configuration like this.

// Dangerous: walks all the way up the blueprint (prototype chain) to read proxy
let proxy = config.proxy;     // if config has no proxy, it looks at the blueprint side

// If an attacker pollutes it via another library like this...
Object.prototype.proxy = { host: 'attacker server', port: 8080 };
// -> every request that should have no proxy setting routes through the attacker

The proxy setting doesn't exist by default, so normally this line is harmless. But if prototype pollution happens somewhere in another dependency, config.proxy picks up the attacker's value sprouted on the blueprint side. As a result, all traffic routes through the attacker's server, a complete man-in-the-middle attack. The correct fix was to change it to read "only the configuration the object owns itself," without walking the blueprint.

The tricky part is that this code looks perfectly normal viewed on its own. It looks like nothing more than "reading a setting." The danger lives only in the context outside the code, "what if it gets polluted from the outside." So it doesn't surface in a review. In fact, at both Lv1 and Lv2, the AI never once reached this one line in axios. A bug whose clue lies outside the code is the hardest opponent for a code-reading AI.

What the AI fixed instead, across all 17 attempts

We've written that it "missed the target," but what was it fixing instead? This turns out to be surprisingly interesting. We listed what the AI actually rewrote across all 17 attempts. One honest caveat, though: what's lined up here is "the spots the AI judged to be a problem and touched," and I haven't properly verified whether each one is really a vulnerability. Unlike the target CVE, I didn't reproduce an attack to confirm "yes, this is fixed / not fixed." So please read this as a record of "the AI found something other than the target, judged it the most serious, and fixed it." Even so, why its eye never turned to the target comes through clearly from this list.

Tool Level Subject Fixed What the AI judged to be a problem and fixed
(other than the target / unverified)
CodexLv1vm2βœ— An isolation leak of a certain symbol
CodexLv2vm2βœ— The same symbol + init logic (same spot even told to suspect an escape)
CodexLv1axiosβœ— A leak where the auth header is left behind on redirect
CodexLv2axiosβœ— A countermeasure for credential leakage on forwarding to another origin
ClaudeLv1vm2βœ— A countermeasure for stack-information leakage (a different thing from the target)
ClaudeLv1axiosβœ— Length validation of the boundary string in sent data (about 51 min)
ClaudeLv2vm2βœ— Found a different escape path on its own (self-analyzed 28 attacks). Didn't reach the crux
ClaudeLv2axiosβœ— Unbounded header intake (header injection. Not the target eavesdropping)
CodexLv3vm2βœ— Reached the crux + added a regression test. But the plug was incomplete
CodexLv4vm2βœ— Reached the crux and fixed the conversion logic. But incomplete, and it slipped through
ClaudeLv3vm2βœ— Found the missing defense at an exception-conversion point on its own and added a guard (367 tests pass). Escaped through a different pass-through point
ClaudeLv4vm2βœ— Reproduced the attack itself, but the final fix landed on a different path in a different file
ClaudeLv4 retryvm2βœ— Reached the crux and a 53-line defense. Stalled about an hour without converging
CodexLv6vm2βœ“ Replaced the pass-through branch with "wrap and return." Essentially identical to the target = the only correct fix
ClaudeLv6vm2βœ— Reached the same function + a dedicated test. But added only a cache check and left the pass-through in place
CodexLv5vm2βœ— Retreated to a guard on the exception path (same line as Lv3)
ClaudeLv5vm2βœ— Crafted 70 lines at a different checkpoint + green tests + appended documentation. Missed the target

Looking over this list, something comes into view. At the Lv1-Lv2 stage, not a single one touched the target bridge.js. Every one headed straight for "something" other than the target. Then at Lv3 they begin reaching the crux, and at Lv6 a single win finally appears. The drive to "find a spot that bothers it and touch it" was lively from the very start. What was missing was the eye to discern the target's seriousness among the many candidates, and the hand to plug that target through.

Where we kept getting stuck over the two days

Behind keeping it running for two days, we got tripped up again and again in places that had nothing to do with the main story. For anyone who wants to try the same thing, here's an honest account.

  • β€’Mass stoppage at the usage limit overnight. We hit the usage limit while one of Claude's long attempts was running, and several attempts stopped en masse. Long processes should be built assuming they'll "stop partway," so you can restart and re-grade them.
  • β€’Re-logging-in instantly kills running attempts as collateral. When we re-entered auth, an in-progress attempt got dragged down with it. We should have avoided re-logging-in while things were running.
  • β€’Attempts running in the background vanish on a parent operation. A background attempt got caught in the crossfire of a parent-process operation and vanished twice. We solved it by running it fully detached.
  • β€’Freezes on heavy log analysis. Searching the entire thousands-to-tens-of-thousands of log lines the running AI spits out would lock up. We switched to lightweight monitoring that only glances at the "investigating / editing / done" stages.
  • β€’Headless-launch stalls. Launching Claude inside the sandbox, it connects but doesn't advance a single character, three times in a row. Explicitly connecting standard input as empty fixed it in one shot.

These are all unglamorous operational matters unrelated to the main story. But this sort of "chores to make the experiment work" actually accounted for the bulk of the verification. Building the arena for the AI to fix things and keeping it running without breaking was far more laborious than the fixing itself.

The limits of this verification

Before moving to the conclusion, let me lay out what this experiment couldn't confirm and where it's honestly weak. Hide this, and even good results won't be trusted.

  • ?The sample size is small. Each condition was basically run once, and what we tested thoroughly centered on one subject (vm2). axios was only the two thin-hint levels. So read these results as a "collection of examples," not "statistics." There aren't enough attempts to generalize that "Codex is better."
  • ?We didn't measure cost in detail in the latter half. We hadn't put in a mechanism to record tokens per attempt until partway through, so the cost comparison centers on the first half and on time.
  • ?We cut the internet, but can't erase memory. We disabled the search tools, but can't erase past similar vulnerabilities the model may "remember" from training. This can't be prevented by prohibition. That's exactly why we chose CVEs too new to have been trained on, and lined up hint levels with the number hidden, to try to separate "just pulled it out of memory using the number as a cue" from "actually understands it." Even so, it isn't perfect.
  • ?We didn't block communication to the model API. As noted, communication remains for the agent to run. It is not a physical full block of the network.

Incidentally, Claude tried just once to secretly fetch a different version of the code and "cheat." The prompt forbade external references, but it went to break that rule. The result: the sandbox couldn't reach the package-fetch server, so it failed. The prohibition in the prompt was broken, but the environment held the line. Flip that around, and it's also a finding that if you have an AI work in a normal network environment, a mechanism to restrict its destinations is essential.

Conclusion: the conditions for making AI a real asset

Summed up in one line, the 17 attempts come to this.

A serious vulnerability won't get fixed by a vague request. Only when a human narrows the location down to the function-or-branch level, and you pick a tool that doesn't miss "the final stretch," does a single one pass. AI is powerful, but for now it becomes a real asset only as a set with the human's narrowing-down and independent attack verification.

Let me break it down a little more.

1. A vague "go fix it" won't find a serious CVE. Even if a hole slips into code written in a hurry, "fix the most serious one" leaves that hole buried among the other spots that caught its eye.

2. Knowing the location and being able to fix it are different. Even reaching the right file, plugging the one right place at the right depth was yet another kind of hard.

3. The boundary at which the AI can fix it lay "lower" than we thought. Telling it the location, the symptom, the layer, and the broken promise wasn't enough; only handing over the diagnosis of "which branch, and how it's dangerous" produced a single win. Translated to practice, this means making AI a safety valve presupposes the human side narrowing things to that level.

4. Green tests are not proof of safety. In every attempt the tests passed and regression tests were even added, yet the PoC alone kept showing the escape succeeding. Always hold an independent judgment axis that actually tries the attack.

Finally, back to the opening question: "If you're paying, Claude or Codex?" Within the scope of this experiment, we saw a difference in faces: exploration power with thin hints goes to Claude, and the power to fix it through minimally after narrowing down goes to Codex. But the sample is still small, and this is not a scoreboard. What I'd rather you take home is that whichever you pick, hand it off vaguely and the hole stays buried; relax at green tests and the hole remains. How you delegate, more than the tool's superiority, was what largely swayed the result.

In an age where AI writes code and AI hunts for holes

Let me widen the view just a little to close. Code is now written by AI. At the same time, the attacking side uses AI, and the defending side is trying to have AI do the inspection. The story that AI accelerates attacks and AI multiplies the holes is no longer a fantasy. That's exactly why the question "can AI find and fix code holes on its own" will surely keep being asked, over and over.

What these 17 attempts showed was neither optimism nor pessimism, but a slightly more unglamorous reality. Read a little, and AI finds any number of "spots that bother it" and touches them. Tell it the location, and it reaches the correct crux file too. But discerning the target's seriousness among the many bugs, and plugging that target through at the right depth, still needs a human hand. AI has a fair amount of "an eye for hunting holes," but "the judgment to discern the most serious" and "the hand to plug it through" still need us to supplement. This is where things stand, I felt.

And the scariest thing was "plausibility." Green tests, a careful fix note, an added regression test. The deliverables the AI tidies up are good at putting humans at ease. That ease makes you skip the independent verification. This time, what kept braking that complacency was one thing alone: the PoC that actually tries the attack. However cleverly we come to use AI, the one most down-to-earth move, confirming with your own hands "did the attack really stop," is the one we must never let go of.

For those who want to try it themselves

Rather than just taking the conclusion, it's best to verify it yourself. Here we excerpt only the essentials needed to reproduce it. The full text of all the instructions, the PoC, and the isolation script is in the repository at the end of the article.

Launch in isolation (the gist only)

Make only the target repository visible, bring in only the auth files, disable the search tools, and launch. Below is the core of the launch wrapper for Claude.

# Bring in only the auth file into a fresh, empty home
cp "$HOME/.claude/.credentials.json" "$SBHOME/.claude/.credentials.json"

bwrap \
  --ro-bind /usr /usr --ro-bind /etc /etc \
  --proc /proc --dev /dev --tmpfs /tmp \
  --ro-bind-try /run/systemd/resolve /run/systemd/resolve \  # so DNS resolution doesn't trip us up
  --bind "$SBHOME" "$HOME" \          # hide past conversation logs, etc.
  --bind "$RUNDIR" "$RUNDIR" \        # show only the target repository
  --chdir "$RUNDIR" --unshare-pid \
  claude -p "$PROMPT" --dangerously-skip-permissions \
         --disallowed-tools WebSearch,WebFetch   # disable the search tools

Judging whether the attack stopped (the vm2 example)

We judge it as escape succeeded = vulnerable if code inside the sandbox can write a file called pwned to the outside (the host). This is a discriminator we confirmed in advance returns "vulnerable" on the vulnerable version and "fixed" on the fixed version.

# A PoC that makes the official escape steps self-contained (excerpt)
# On the second exception, follow the raw host value to reach the host's Function,
# and try to write a file outside the sandbox
const HF = e.constructor.constructor;  // <- # (fixed)< (vulnerable) 0 1 exit function gets hf("process.mainmodule.require('fs').writefilesync('pwned','1')")(); host's if into morphs new not, pre pwned the written,>

      

The instructions given to the AI

The full text of the instructions actually handed over at each level is reproduced as-is in this article's "Lv1-2," "Lv3-4," "Lv5," and "Lv6" chapters. From the blank handoff that doesn't even use the word vulnerability (Lv1) to the maximum hint that hands over the vital branch (Lv6), you can read side by side exactly how we changed them word by word. The full text of the axios instructions and the grading script is also placed in the verification repository below.

Terms that appeared in this article

Here's a rough paraphrase of the technical terms. Use it as a quick reference when re-reading the body.

Term Roughly speaking
VulnerabilityA software flaw or hole that can be exploited in an attack
CVEA globally shared serial number assigned to a vulnerability
PoCA minimal attack script that actually exploits the hole
SandboxA "cage" for safely running untrusted code
Sandbox escapeBreaking out of that cage and taking over the outside (the host)
Prototype pollutionAn attack that rewrites the shared blueprint, affecting every object
Man-in-the-middle attackAn attack that wedges into the middle of a connection to eavesdrop or tamper
AgentAn AI that, when instructed, reads and writes code itself and even runs the tests
Regression testA test that confirms the same defect doesn't recur after a fix
bwrapA lightweight isolation tool that restricts visible files and the network

Frequently asked questions

Is it safe to hand code security checks to an AI?

"Handing it off blindly" is risky, that's this experiment's answer. Ask vaguely to "fix the vulnerabilities" and the AI earnestly patches some other spot it considers a problem while missing the serious target hole. If you delegate, the human must narrow the location down quite concretely, and also hold a mechanism to independently confirm "did the attack really stop." Only with these two in place did the AI become a real asset.

Claude Code or Codex, which should I pick?

Within the scope of this experiment, a difference in personality emerged. The power to explore broadly from thin hints is stronger in Claude, and it shines in situations like finding a different escape path on its own. On the other hand, the power to fix it through with the minimal change after narrowing down to the vital spot is stronger in Codex, and the only correct fix came from Codex. That said, the sample is small, not enough to generalize a winner. The honest conclusion is that whichever you pick, get the delegation wrong and the hole remains.

Why go to the trouble of cutting the internet?

Because for well-known vulnerabilities, the fix patch and the writeups are already public online. Allow search, and what you measure becomes "the ability to find and copy the answer," not "the ability to read code yourself and find the hole." By disabling the search tools and choosing CVEs too new for the AI to have trained on, we artificially created a situation where the answer can't be looked up.

If all the tests pass, can't you say it's fixed?

You couldn't. In all 17 attempts the tests passed and the AI even added regression tests, yet the PoC that reproduces the attack alone kept saying "you can still break it." Tests only watch for "the expected way of breaking." Attackers come through outside what was expected. Green tests are not proof of safety.

Does this result mean "AI coding is useless"?

No. The AI had, from the start, the power to find "spots that bother it" all over the place with a little reading and touch them (whether or not those are really flaws is another matter). What was missing was the eye to discern the target's seriousness among the many candidates, and the hand to plug the target through at the right depth. It's not that it's useless; how you use it (how you delegate) largely sways the result.

How do I run the same experiment myself?

Prepare a vulnerable version of a new CVE, have it fixed in a sandbox where only the target repository is visible with the search tools disabled, and judge pass/fail with the attack reproduction code (PoC). The full text of the instructions, the PoC, and the isolation script is in the repository at the end of the article. One caution: the cost on "the side that prepares the subject" is far heavier than the cost of having the AI solve it. Brace yourself going in.

Sources and verification repositories

The 17 fixes the AI actually wrote are all published as PRs (change proposals) on forked repositories. Each PR's diff is the best evidence of what we verified. Starting from the vulnerable version (bench-base), we pitted each AI's fix against it.

Explanations of the vulnerabilities themselves and the countermeasures users should take are collected in our blog's news articles.

There are also articles that verify AI coding ability from other angles. Please read them as well.

Verification dates: June 18-19, 2026 / Verification environment: Linux kernel 7.0.0, Node v22.22.1, npm 10.9.4. Models used: Codex = GPT-5.5 (codex-cli 0.128.0, reasoning setting high) / Claude Code = Claude Opus 4.8 1M (Claude Code 2.1.183). Each condition was, as a rule, one attempt. Pass/fail judged by whether the PoC (attack reproduction code) succeeds.