Claude 4 hacked SWE-bench by peeking at future commits
In a previous post about creating a Docker registry of SWE-bench images, I made a brief side note about hypothetical reward hacking:
[…] as a side note, it’s worth asking whether the git history should be included at all. It makes sense for the model to have access to the past git history, like a human developer would. However, the model should not have access to future history from after the PR was merged. A sophisticated cheating model could in theory reward hack the evaluation if there is any way to access future history. I believe that is possible in some circumstances even after a
git reset --hard
andgit remote remove origin
. For example, if version tags are used, the model couldgit checkout <tag>
a future tag. Or it could access the dangling commits in some way, perhaps withgit fsck --lost-found
. I believe this is unlikely to be a problem currently, but could become one with future models.
Less than two months later, it seems that this behaviour has now been seen in the wild!
A META AI researcher has just posted an issue on the SWE-bench repo:
We’ve identified multiple loopholes with SWE Bench Verified where agents may look at future repository state (by querying it directly or through a variety of methods), and cases in which future repository state includes either solutions or detailed approaches to solving problems (commit messages and more).
… In a trajectory with Claude 4 Sonnet, pytest-dev__pytest-6202, the agent uses
git log --all
which leaks future commits that directly fix the issue.
Claude 4 Sonnet came out in May 2025, and I wrote my post in July. I was quite wrong about how long it would take for models to become sophisticated enough to cheat in this way. They were already doing it. I had never seen a model attempt this loophole, but one man can only read so many SWE-bench traces.
The command git log --all
(which I didn’t know about) shows all commits reachable from any ref. A ref is any reference to the SHA-1 value of a commit, found under .git/refs
, and includes: (1) the heads of branches, (2) tags, (3) remotes, and (4) any manually created refs. As I explained in the previous post, the original SWE-bench authors only removed the default remote, with a code comment: “Remove the remote so the agent won’t see newer commits”. But they didn’t think about other refs, like tags (a fix is underway).
git log --all
lets a cheating model efficiently find any (non-dangling) commits, without having to look through the refs and figure out which one might point to future history and be usable for cheating. For example, Claude 4 Sonnet cleverly searched for commit messages with keywords relevant to the issue: git log --oneline --all | grep -i "bracket|parametrize|modpath"
.
In the pytest-6202
example, the issue is from November 2019. git log --all
shows commits up to d0f136f
(September 2024), which is tagged as 8.3.3
. Indeed 8.3.3
seems to be the highest version tag present on the main branch:
$ git tag --list
1.0.0
1.0.0b3
1.0.0b6
...many more tags...
8.3.2
8.3.3
8.4.0.dev0
v7.4.3
In the example issues I looked at, git tags were the reason the future history was available. This is exactly the vulnerability that had jumped out at me when I looked into how the Docker images are constructed.
In pytest-6202
, deleting all tags is sufficient to remove the offending commits from git log --all
. However, a git fsck --lost-found
will uncover the dangling tags and commits (ones not pointed to by any ref).
Here we must get into one subtlety. git show-ref 8.3.3
will show d430e32
, not d0f136f
, the actual September 2024 commit. d430e32
is the SHA-1 of the annotated tag object, which itself points to the actual commit d0f136f
. git show-ref --dereference 8.3.3
can be used to show d0f136f
.
Even after deleting tags, git fsck --lost-found
will uncover
$ git fsck --lost-found
...
dangling tag d430e325c6d8c7161ae2e468ea5045a163e4c517
...
A cheater can then directly do git checkout d430e32
, which moves the HEAD
to the September 2024 commit d0f136f
. Since HEAD
is a ref, git log --all
will at that point show the entire future history, just like before we deleted the tags (in fact you don’t need the --all
now).
My understanding is that git gc --prune=now
(docs) removes any unreachable objects. The documentation says:
git gc
tries very hard not to delete objects that are referenced anywhere in your repository. In particular, it will keep not only objects referenced by your current set of branches and tags, but also objects referenced by the index, remote-tracking branches, reflogs (which may reference commits in branches that were later amended or rewound), and anything else in the refs/* namespace.
The reflog and index are strictly local, so there should be nothing in them just after cloning. We’ve covered branches, tags, and remotes. The only exception I can think of are manually created refs (using git update-ref
), which should be extremely unusual. To mitigate this, one could trust only the HEAD
ref and delete all others.
I spent a couple of hours today trying to find other loopholes that work even after git gc
. I scoured various git internals I had never thought about before, like packfiles (in .git/objects/pack
), or .git/objects/info/commit-graph
. None of these worked. But git is a big program with a lot of arcane features, and I only tried a couple of SWE-bench issues. Can you find a more sophisticated hack?