Reference Tables

SWE-Bench Pro trajectory analysis. 4 models, 2920 trajectories.

0. Resolution and Submission

Of 730 tasks, the highest resolve rate is 43.7% (Sonnet 4.5). Most submitted patches do not resolve the issue.

Resolved per model: Sonnet 4.5 319/730 · Gemini 2.5 Pro 142/730 · GLM 4.5 259/730 · GPT-5 265/730
model | n | submitted (clean) | submitted (w/ error) | not submitted | resolved | resolve rate
claude45 | 730 | 643 | 75 | 12 | 319 | 43.7%
gemini25pro | 730 | 538 | 15 | 177 | 142 | 19.5%
glm45 | 730 | 595 | 118 | 17 | 259 | 35.5%
gpt5 | 730 | 438 | 185 | 107 | 265 | 36.3%

Each SWE-Bench Pro trajectory ends with an exit status. The agent either submits a patch or doesn't.

submitted (clean): exit_status is exactly submitted.

submitted (w/ error): the agent produced a submission, but the harness also recorded an error condition (timeout, context overflow, cost limit, format error).

not submitted: the agent never ran the submit command. It hit an error before submitting.

resolved: the submitted patch actually fixes the failing tests (from benchmark evaluation).

Method: we read info.exit_status and info.submission from each .traj file.
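That bucketing can be sketched in a few lines. This is a minimal sketch, assuming each .traj file is JSON with an info object carrying exit_status and submission (as described above); the actual field layout of SWE-Agent trajectory files may differ.

```python
import json
from collections import Counter

def classify_exit(info: dict) -> str:
    """Bucket one trajectory the way the table above does (sketch)."""
    status = (info.get("exit_status") or "").strip()
    submitted = bool(info.get("submission"))
    if status == "submitted":
        return "submitted (clean)"          # clean exit after submitting
    if submitted or status.startswith("submitted"):
        return "submitted (w/ error)"       # patch produced, but an error condition was recorded
    return "not submitted"                  # hit an error before ever submitting

def tally(traj_paths) -> Counter:
    """Count exit buckets across a list of .traj file paths."""
    counts = Counter()
    for path in traj_paths:
        with open(path) as fh:
            info = json.load(fh).get("info", {})
        counts[classify_exit(info)] += 1
    return counts
```

`resolved` cannot be derived from the .traj file alone; it comes from the separate benchmark evaluation.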

0b. Exit Status Breakdown

exit_statusclaude45gemini25proglm45gpt5
submitted643538595438
submitted (exit_command_timeout)1760378
submitted (exit_error)37183320
submitted (exit_format)239535
submitted (exit_context)202963
exit_error922653
submitted (exit_total_execution_time)315018
submitted (exit_cost)142001
exit_command_timeout19024
exit_format15107
exit_total_execution_time10013
exit_context0009
exit_cost0300
0101
exit_command0010

Every distinct exit_status string from the .traj files, with counts per model.

submitted: clean exit after submitting.

submitted (exit_*): submitted, but also hit an error condition (timeout, context, cost, format, etc.).

exit_* (without submitted prefix): the agent hit that condition and never submitted.

These statuses are set by the SWE-Agent harness, not by the model itself.

1. Trajectory Metadata

Median trajectory length varies from 29 steps (Gemini 2.5 Pro) to 78 (Sonnet 4.5): the longest median is ~2.7x the shortest.

Median steps per trajectory (p25–p75): Sonnet 4.5 78 (63–89) · Gemini 2.5 Pro 29 (17–49) · GLM 4.5 52 (40–65) · GPT-5 53 (34–76)
model | n | total_steps | avg | median | p25 | p75 | min | max | resolved | resolve_rate
claude45 | 730 | 56548 | 77.5 | 78.0 | 63 | 89 | 2 | 215 | 319 | 43.7%
gemini25pro | 730 | 31719 | 43.5 | 29.0 | 17 | 49 | 1 | 251 | 142 | 19.5%
glm45 | 730 | 39188 | 53.7 | 52.0 | 40 | 65 | 1 | 218 | 259 | 35.5%
gpt5 | 730 | 43404 | 59.5 | 53.0 | 34 | 76 | 2 | 251 | 265 | 36.3%

Summary statistics for trajectory length (number of action-observation steps per task).

n: trajectories (one per task instance).

total_steps: sum of all steps across all trajectories for this model.

avg / median / p25 / p75 / min / max: distribution of steps per trajectory.

resolved: trajectories where the submitted patch fixes the failing tests (from agent_runs_data.csv).

resolve_rate: resolved / n.
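The distribution columns can be reproduced with the standard library alone. A sketch (using `statistics.quantiles` with `method="inclusive"`, which interpolates the way most stats packages do by default):

```python
from statistics import mean, median, quantiles

def step_summary(step_counts):
    """Summary stats for a list of per-trajectory step counts."""
    qs = quantiles(step_counts, n=4, method="inclusive")  # [p25, p50, p75]
    return {
        "n": len(step_counts),
        "total_steps": sum(step_counts),
        "avg": round(mean(step_counts), 1),
        "median": median(step_counts),
        "p25": qs[0],
        "p75": qs[2],
        "min": min(step_counts),
        "max": max(step_counts),
    }
```

Feeding in one step count per trajectory yields a row of the table above.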

1b. Intent Classification Taxonomy

Every step is labelled by a deterministic, priority-ordered ruleset in scripts/classify_intent.py — no model inference. Each row shows the intent, what it means, the literal rule that fires it, and how often it fires per model.

Classification order (first rule that matches wins):

  1. Empty action → empty. submit prefix → submit.
  2. str_replace_editor {view, create, str_replace, insert, undo_edit} → classified by sub-command and filename pattern (test/config/repro/verify/doc).
  3. Bash is unwrapped: strip bash -lc "...", leading cd ... &&, source ... &&, timeout N, and FOO=bar env prefixes.
  4. If the observation shows a shell-level error (syntax error, command not found, unexpected token, …), the command head routes to the matching (failed) label.
  5. Otherwise match the command head: test runners, compile/syntax, search, read, list, git, python/node scripts (named vs. inline, with inline further sub-classified by code shape), file cleanup, install, service, tool-exists, metadata, echo.
  6. Anything that reached the end is bash-other (<2% of steps by design).
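In miniature, the pipeline looks like this. This is an illustrative sketch with a handful of rules, not the full ruleset from scripts/classify_intent.py; the unwrapping regexes are simplified approximations of the steps listed above.

```python
import re

def unwrap_bash(cmd: str) -> str:
    """Strip bash -lc "..." wrappers, leading cd/source prefixes, timeout N, and FOO=bar env vars."""
    m = re.match(r'''bash\s+-lc\s+["'](.*)["']\s*$''', cmd, re.S)
    if m:
        cmd = m.group(1)
    cmd = re.sub(r'^(?:cd\s+\S+\s*&&\s*|source\s+\S+\s*&&\s*)+', '', cmd)
    cmd = re.sub(r'^timeout\s+\d+\s+', '', cmd)
    cmd = re.sub(r'^(?:\w+=\S+\s+)+', '', cmd)
    return cmd.strip()

RULES = [  # first match wins, mirroring the priority order above
    (r'^$', 'empty'),
    (r'^submit', 'submit'),
    (r'^str_replace_editor\s+view\b.*--view_range', 'read-file-range'),
    (r'^str_replace_editor\s+view\b', 'read-file-full'),
    (r'^(?:grep|rg|ag)\b', 'search-keyword'),
    (r'^(?:cat|head|tail|nl|awk)\b|^sed\s+-n\b', 'read-via-bash'),
    (r'^(?:pytest|go test|npm test|npx jest|mocha)\b', 'run-test-suite'),
]

def classify(action: str) -> str:
    cmd = unwrap_bash(action)
    for pattern, intent in RULES:
        if re.search(pattern, cmd):
            return intent
    return 'bash-other'  # final fallback
```

The real classifier adds filename-pattern routing (test/config/repro/verify/doc), inline-snippet sub-classification, and the (failed) rerouting based on the observation.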
intent · description · rule · counts (Sonnet 4.5 / Gemini 2.5 Pro / GLM 4.5 / GPT-5)

read
read-file-full · view an entire source file via str_replace_editor · rule: str_replace_editor view <file> (fallback once test, config, range, and truncated views are ruled out) · 3.1k / 3.9k / 3.0k / 5.0k
read-file-range · view a specific line range (--view_range) · rule: str_replace_editor view with --view_range · 6.0k / 1.6k / 6.4k / 6.0k
read-file-full(truncated) · view a file that was too large, got abbreviated · rule: str_replace_editor view where the observation contains too large to display · 198 / 361 / 229 / 245
read-test-file · view a test file (test_*, _test.*, conftest) · rule: str_replace_editor view on a filename matching test_*, *_test.*, or conftest* · 644 / 56 / 259 / 635
read-config-file · view package.json, pytest.ini, setup.cfg, go.mod, Makefile, etc. · rule: str_replace_editor view on package.json, pytest.ini, setup.cfg, setup.py, go.mod, Makefile, config.json · 26 / 48 / 72 / 206
read-via-bash · cat, head, tail, sed -n, nl, awk · rule: same command heads · 2.3k / 140 / 133 / 3.0k
read-via-inline-script · inline snippet that reads a file and prints content · rule: inline snippet that reads a file (.read(), open(...,'r'), readFileSync) and prints, without writing · 76 / 3 / 21 / 373

search
view-directory · view a directory listing via str_replace_editor · rule: str_replace_editor view where path has no extension, or observation lists "files and directories" · 1.1k / 12 / 2.5k / 2.1k
list-directory · ls, tree, pwd · rule: same command heads · 843 / 1.4k / 86 / 708
search-keyword · grep, rg, ag for a pattern · rule: grep, rg, ag · 7.0k / 905 / 2.8k / 6.5k
search-files-by-name · find ... -name (locating files by name/path) · rule: find ... -name with no grep/xargs pipe · 1.8k / 310 / 266 / 49
search-files-by-content · find ... -exec grep, or find piped to xargs grep · rule: same patterns · 3.3k / 32 / 851 / 10
inspect-file-metadata · wc, file, stat · rule: same command heads · 246 / 1 / 14 / 22
check-version · inline snippet that checks python/node version · rule: inline snippet matching --version, -V, sys.version, or node -v · 6 / 0 / 0 / 2

reproduce
create-repro-script · create a file named repro*, reproduce*, demo* · rule: str_replace_editor create on a filename containing repro, reproduce, or demo · 157 / 551 / 372 / 463
run-repro-script · run a file named repro*, reproduce*, demo* · rule: run a named script whose basename matches repro* or reproduce* (python, node, sh, bash, go run) · 375 / 1.8k / 1.1k / 1.1k
run-inline-snippet · python -c, python - <<, python3 -c, node -e · rule: residual when no inline sub-pattern matches · 472 / 92 / 146 / 193

edit
edit-source · str_replace on a non-test, non-repro source file · rule: str_replace_editor str_replace on a filename not matching test/repro/verify/check · 5.2k / 10.3k / 7.3k / 5.0k
insert-source · str_replace_editor insert on a source file · rule: str_replace_editor insert · 12 / 2.0k / 16 / 803
apply-patch · applypatch command (GPT-specific alternative to str_replace) · rule: applypatch command · 0 / 0 / 0 / 94
create-file · create a file that doesn't match repro/test/verify/doc patterns · rule: str_replace_editor create on a filename not matching those patterns · 595 / 686 / 575 / 326
edit-via-inline-script · inline snippet that reads, modifies, and writes a file · rule: inline snippet that writes (.write(), writeFileSync) together with reading or .replace()/re.sub() · 5 / 0 / 1 / 245
create-file-via-inline-script · inline snippet that writes a file without reading first · rule: inline snippet that writes a file with no prior read · 21 / 0 / 0 / 41

verify
run-test-suite · pytest, go test, npm test, npx jest, mocha (broad) · rule: pytest, go test, npm test, npx jest, mocha, yarn test, python -m unittest (broad; no :: or -k) · 5.9k / 789 / 1.1k / 585
run-test-specific · pytest with -k or :: (targeting specific tests) · rule: a test runner command containing :: or -k · 1.1k / 0 / 168 / 370
create-test-script · create a file named test_*, *test.py, *test.js, *test.go · rule: str_replace_editor create on a filename matching those patterns · 2.6k / 121 / 1.9k / 18
run-verify-script · run a file named test_*, verify*, check*, validate*, edge_case* · rule: run a named script whose basename contains test_, verify, check, validate, or edge_case · 3.4k / 208 / 3.0k / 113
create-verify-script · create a file named verify*, check*, validate* · rule: str_replace_editor create on a filename matching verify*, check*, or validate* · 321 / 11 / 94 / 47
edit-test-or-repro · str_replace on a test or repro file · rule: str_replace_editor str_replace on a filename containing test_, repro, verify, or check · 712 / 1.1k / 2.3k / 243
run-custom-script · run a named script that doesn't match repro/test/verify patterns · rule: run a named python/node/sh/bash/go script whose basename doesn't match those patterns · 476 / 115 / 659 / 111
syntax-check · py_compile, compileall, node -c · rule: same command heads · 183 / 0 / 57 / 18
compile-build · go build, go vet, make, npx tsc, tsc · rule: go build, go vet, make, tsc, npx tsc, npm run build, yarn build · 1.1k / 150 / 754 / 41
run-inline-verify · inline snippet that imports project code or runs assertions · rule: inline snippet with import/from + assert/print (smoke test or assertion) · 999 / 1 / 206 / 696

git
git-diff · git diff · rule: git diff (with or without -C <dir>) · 538 / 0 / 9 / 23
git-status-log · git status, git show, git log · rule: same command heads · 652 / 4 / 13 / 23
git-stash · git stash · rule: git stash · 28 / 0 / 2 / 0

housekeeping
file-cleanup · rm, mv, cp, chmod · rule: same command heads · 1.6k / 588 / 426 / 17
create-documentation · create a file named summary, readme, changes, implementation · rule: str_replace_editor create on a filename matching *summary*, *readme*, *changes*, *implementation* · 661 / 1 / 71 / 2
start-service · redis-server, redis-cli, mongod, sleep · rule: same command heads · 26 / 71 / 42 / 4
install-deps · pip install, pip list, npm install, go get, apt · rule: same command heads · 20 / 129 / 37 / 0
check-tool-exists · which, type · rule: same command heads · 16 / 0 / 10 / 2

failed
search-keyword(failed) · grep/find that hit shell errors · rule: grep/find whose observation contains a shell error · 46 / 8 / 33 / 2.7k
read-via-bash(failed) · cat/head/sed that hit shell errors · rule: cat/head/sed/tail/ls whose observation contains a shell error · 23 / 36 / 2 / 994
run-script(failed) · python/node run that hit shell errors · rule: python/node whose observation contains a shell error · 47 / 155 / 90 / 759
run-test-suite(failed) · pytest/test that hit shell errors · rule: test runner whose observation contains a shell error · 6 / 94 / 3 / 155
bash-command(failed) · other bash that hit shell errors · rule: any other bash command whose observation contains a shell error · 32 / 264 / 33 / 1.2k

other
submit · submit the patch · rule: action's first line starts with submit · 656 / 1.2k / 595 / 537
empty · empty action string (rate limit, context window exit) · rule: action string is blank · 770 / 906 / 964 / 854
echo · echo, printf · rule: same command heads · 140 / 9 / 33 / 69
bash-other · unclassified bash command · rule: final fallback, a bash command that matched no other rule (<2% of steps by design) · 928 / 843 / 458 / 631
undo-edit · str_replace_editor undo_edit · rule: str_replace_editor undo_edit · 4 / 662 / 3 / 39

The label describes what the command is, derived from the action string and filename alone — no positional context (before/after first edit) and no outcome signal is used. A failed grep is still a search attempt.

(failed) variants classify by intended action, not outcome. They require a shell-level error in the first 500 chars of the observation.

run-inline-snippet is a residual — inline snippets (python -c, python - <<, node -e) are first routed to run-inline-verify / read-via-inline-script / edit-via-inline-script / create-file-via-inline-script / check-version by inspecting the code shape.

Pass/fail outcome for verify intents (used by seq-first-all-pass / seq-work-done) is a separate detector that reads the observation for unambiguous runner summaries: e.g. pytest N passed in Xs / N failed, go ok package / FAIL package, jest Tests: N passed / N failed. Ambiguous output returns unknown.

Canonical source: scripts/classify_intent.py and docs/intent-classification-rules.md.

2. Base Intent Frequencies

The top 10 intents account for the bulk of all steps; the long tail of ~40 others fills in the edges.

[Bar chart: top-10 intent shares per model; full data in the table below.]
intent | Sonnet 4.5 n (%) | Gemini 2.5 Pro n (%) | GLM 4.5 n (%) | GPT-5 n (%)
edit-source | 5217 (9.2%) | 10279 (32.4%) | 7268 (18.5%) | 4983 (11.5%)
read-file-range | 5974 (10.6%) | 1616 (5.1%) | 6377 (16.3%) | 5997 (13.8%)
search-keyword | 7002 (12.4%) | 905 (2.9%) | 2804 (7.2%) | 6499 (15.0%)
read-file-full | 3125 (5.5%) | 3943 (12.4%) | 3041 (7.8%) | 5020 (11.6%)
run-test-suite | 5942 (10.5%) | 789 (2.5%) | 1064 (2.7%) | 585 (1.3%)
run-verify-script | 3420 (6.0%) | 208 (0.7%) | 3011 (7.7%) | 113 (0.3%)
view-directory | 1137 (2.0%) | 12 (0.0%) | 2488 (6.3%) | 2133 (4.9%)
read-via-bash | 2345 (4.1%) | 140 (0.4%) | 133 (0.3%) | 2974 (6.9%)
create-test-script | 2633 (4.7%) | 121 (0.4%) | 1913 (4.9%) | 18 (0.0%)
edit-test-or-repro | 712 (1.3%) | 1140 (3.6%) | 2337 (6.0%) | 243 (0.6%)
run-repro-script | 375 (0.7%) | 1755 (5.5%) | 1082 (2.8%) | 1067 (2.5%)
search-files-by-content | 3254 (5.8%) | 32 (0.1%) | 851 (2.2%) | 10 (0.0%)
empty | 770 (1.4%) | 906 (2.9%) | 964 (2.5%) | 854 (2.0%)
submit | 656 (1.2%) | 1239 (3.9%) | 595 (1.5%) | 537 (1.2%)
list-directory | 843 (1.5%) | 1378 (4.3%) | 86 (0.2%) | 708 (1.6%)
bash-other | 928 (1.6%) | 843 (2.7%) | 458 (1.2%) | 631 (1.5%)
insert-source | 12 (0.0%) | 2007 (6.3%) | 16 (0.0%) | 803 (1.9%)
search-keyword(failed) | 46 (0.1%) | 8 (0.0%) | 33 (0.1%) | 2748 (6.3%)
file-cleanup | 1554 (2.7%) | 588 (1.9%) | 426 (1.1%) | 17 (0.0%)
search-files-by-name | 1792 (3.2%) | 310 (1.0%) | 266 (0.7%) | 49 (0.1%)
create-file | 595 (1.1%) | 686 (2.2%) | 575 (1.5%) | 326 (0.8%)
compile-build | 1088 (1.9%) | 150 (0.5%) | 754 (1.9%) | 41 (0.1%)
run-inline-verify | 999 (1.8%) | 1 (0.0%) | 206 (0.5%) | 696 (1.6%)
run-test-specific | 1105 (2.0%) | 0 (0.0%) | 168 (0.4%) | 370 (0.9%)
read-test-file | 644 (1.1%) | 56 (0.2%) | 259 (0.7%) | 635 (1.5%)
bash-command(failed) | 32 (0.1%) | 264 (0.8%) | 33 (0.1%) | 1217 (2.8%)
create-repro-script | 157 (0.3%) | 551 (1.7%) | 372 (0.9%) | 463 (1.1%)
run-custom-script | 476 (0.8%) | 115 (0.4%) | 659 (1.7%) | 111 (0.3%)
read-via-bash(failed) | 23 (0.0%) | 36 (0.1%) | 2 (0.0%) | 994 (2.3%)
run-script(failed) | 47 (0.1%) | 155 (0.5%) | 90 (0.2%) | 759 (1.7%)
read-file-full(truncated) | 198 (0.4%) | 361 (1.1%) | 229 (0.6%) | 245 (0.6%)
run-inline-snippet | 472 (0.8%) | 92 (0.3%) | 146 (0.4%) | 193 (0.4%)
create-documentation | 661 (1.2%) | 1 (0.0%) | 71 (0.2%) | 2 (0.0%)
undo-edit | 4 (0.0%) | 662 (2.1%) | 3 (0.0%) | 39 (0.1%)
git-status-log | 652 (1.2%) | 4 (0.0%) | 13 (0.0%) | 23 (0.1%)
git-diff | 538 (1.0%) | 0 (0.0%) | 9 (0.0%) | 23 (0.1%)
create-verify-script | 321 (0.6%) | 11 (0.0%) | 94 (0.2%) | 47 (0.1%)
read-via-inline-script | 76 (0.1%) | 3 (0.0%) | 21 (0.1%) | 373 (0.9%)
read-config-file | 26 (0.0%) | 48 (0.2%) | 72 (0.2%) | 206 (0.5%)
inspect-file-metadata | 246 (0.4%) | 1 (0.0%) | 14 (0.0%) | 22 (0.1%)
run-test-suite(failed) | 6 (0.0%) | 94 (0.3%) | 3 (0.0%) | 155 (0.4%)
syntax-check | 183 (0.3%) | 0 (0.0%) | 57 (0.1%) | 18 (0.0%)
echo | 140 (0.2%) | 9 (0.0%) | 33 (0.1%) | 69 (0.2%)
edit-via-inline-script | 5 (0.0%) | 0 (0.0%) | 1 (0.0%) | 245 (0.6%)
install-deps | 20 (0.0%) | 129 (0.4%) | 37 (0.1%) | 0 (0.0%)
start-service | 26 (0.0%) | 71 (0.2%) | 42 (0.1%) | 4 (0.0%)
apply-patch | 0 (0.0%) | 0 (0.0%) | 0 (0.0%) | 94 (0.2%)
create-file-via-inline-script | 21 (0.0%) | 0 (0.0%) | 0 (0.0%) | 41 (0.1%)
git-stash | 28 (0.0%) | 0 (0.0%) | 2 (0.0%) | 0 (0.0%)
check-tool-exists | 16 (0.0%) | 0 (0.0%) | 10 (0.0%) | 2 (0.0%)
check-version | 6 (0.0%) | 0 (0.0%) | 0 (0.0%) | 2 (0.0%)

Every trajectory step is classified into one of ~50 base intents using deterministic rules (regex matching on the action string, file names, and observation text). No LLM is used for classification.

_n: total count of that intent across all trajectories for the model.

_%: that count as a percentage of all steps for the model. Percentages sum to 100% within each model.

Sorted by total count across all models (most frequent first).

Method: each step's action string is pattern-matched against a priority-ordered ruleset in classify_intent.py. For example, an action starting with str_replace_editor view is classified as a read intent, while grep or rg becomes search-keyword.

3. High-Level Category Frequencies

Read and search dominate for three of the four models; Gemini 2.5 Pro is the outlier, spending 41% of its steps on edits. The rest of the gap is in verify proportions.

[Stacked bars: proportion of steps by category (read, search, reproduce, edit, verify, git, housekeeping, failed, other) per model; data in the table below.]
category | Sonnet 4.5 | Gemini 2.5 Pro | GLM 4.5 | GPT-5 — each cell: n (% of steps, avg per trajectory)
read | 12388 (21.9%, 17.0) | 6167 (19.4%, 8.4) | 10132 (25.9%, 13.9) | 15450 (35.6%, 21.2)
search | 14280 (25.3%, 19.6) | 2638 (8.3%, 3.6) | 6509 (16.6%, 8.9) | 9423 (21.7%, 12.9)
reproduce | 1004 (1.8%, 1.4) | 2398 (7.6%, 3.3) | 1600 (4.1%, 2.2) | 1723 (4.0%, 2.4)
edit | 5850 (10.3%, 8.0) | 12972 (40.9%, 17.8) | 7860 (20.1%, 10.8) | 6492 (15.0%, 8.9)
verify | 16879 (29.8%, 23.1) | 2535 (8.0%, 3.5) | 10263 (26.2%, 14.1) | 2242 (5.2%, 3.1)
git | 1218 (2.2%, 1.7) | 4 (0.0%, 0.0) | 24 (0.1%, 0.0) | 46 (0.1%, 0.1)
housekeeping | 2277 (4.0%, 3.1) | 789 (2.5%, 1.1) | 586 (1.5%, 0.8) | 25 (0.1%, 0.0)
failed | 154 (0.3%, 0.2) | 557 (1.8%, 0.8) | 161 (0.4%, 0.2) | 5873 (13.5%, 8.0)
other | 2498 (4.4%, 3.4) | 3659 (11.5%, 5.0) | 2053 (5.2%, 2.8) | 2130 (4.9%, 2.9)

Each base intent maps to one of 9 high-level categories: read, search, reproduce, edit, verify, git, housekeeping, failed, other.

_n: total steps in that category.

_%: percentage of all steps.

_per_traj: average number of steps in that category per trajectory (total category steps / number of trajectories).

The mapping from base intent to category is defined in classify_intent.py (INTENT_TO_HIGH_LEVEL). For example, read-file-full, read-file-range, and read-via-bash all map to 'read'.
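A sketch of how that mapping produces the table's columns (the dict here is an illustrative subset; the full INTENT_TO_HIGH_LEVEL map lives in classify_intent.py):

```python
from collections import Counter

# Illustrative subset of the intent -> category mapping.
INTENT_TO_HIGH_LEVEL = {
    "read-file-full": "read", "read-file-range": "read", "read-via-bash": "read",
    "search-keyword": "search", "view-directory": "search",
    "edit-source": "edit", "insert-source": "edit",
    "run-test-suite": "verify", "compile-build": "verify",
    "git-diff": "git", "file-cleanup": "housekeeping",
    "submit": "other", "empty": "other",
}

def category_mix(intents, n_traj):
    """Compute n, % of steps, and per-trajectory average for each category."""
    cats = Counter(INTENT_TO_HIGH_LEVEL.get(i, "other") for i in intents)
    total = sum(cats.values())
    return {c: {"n": k,
                "pct": round(100 * k / total, 1),
                "per_traj": round(k / n_traj, 1)}
            for c, k in cats.items()}
```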

4. Phase Groupings

Claude spends 29.8% of its steps verifying; GPT-5 spends 5.2%. GPT-5 spends the most time understanding (57.3%); Claude cleans up the most (6.2%).

[Stacked bars: proportion of steps by phase (understand, reproduce, edit, verify, cleanup) per model; data in the table below.]
phase | Sonnet 4.5 | Gemini 2.5 Pro | GLM 4.5 | GPT-5
understand | 47.2% | 27.8% | 42.5% | 57.3%
reproduce | 1.8% | 7.6% | 4.1% | 4.0%
edit | 10.3% | 40.9% | 20.1% | 15.0%
verify | 29.8% | 8.0% | 26.2% | 5.2%
cleanup | 6.2% | 2.5% | 1.6% | 0.2%

The 9 high-level categories are further grouped into 5 phases that represent the broad arc of a trajectory:

understand = read + search. The agent is reading code and searching for information.

reproduce = reproduce. The agent is writing or running reproduction scripts to confirm the bug.

edit = edit. The agent is making source code changes.

verify = verify. The agent is running tests, compiling, or checking its work.

cleanup = git + housekeeping. The agent is reviewing changes (git diff/log) or cleaning up (rm, mv, writing docs).

These phases are used in the stacked area charts (Typical Trajectory Shape) to show how the mix of actions evolves from start to end.
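The category-to-phase grouping is small enough to state in full; a sketch (category names from Section 3):

```python
CATEGORY_TO_PHASE = {
    "read": "understand", "search": "understand",
    "reproduce": "reproduce",
    "edit": "edit",
    "verify": "verify",
    "git": "cleanup", "housekeeping": "cleanup",
    # 'failed' and 'other' fall outside the five phases, which is why the
    # phase percentages in the table above do not sum to 100%.
}

def phase_of(category):
    """Return the phase for a high-level category, or None for failed/other."""
    return CATEGORY_TO_PHASE.get(category)
```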

5. Verify Sub-Intent Breakdown

How models verify differs in kind, not just amount. Claude and GLM lean on broad test suites; Gemini and GPT use more targeted runs and custom scripts.

Total verify steps: Sonnet 4.5 16,879 · Gemini 2.5 Pro 2,535 · GLM 4.5 10,263 · GPT-5 2,242
intent | Sonnet 4.5 | Gemini 2.5 Pro | GLM 4.5 | GPT-5
run-test-suite | 5942 | 789 | 1064 | 585
run-verify-script | 3420 | 208 | 3011 | 113
create-test-script | 2633 | 121 | 1913 | 18
edit-test-or-repro | 712 | 1140 | 2337 | 243
compile-build | 1088 | 150 | 754 | 41
run-inline-verify | 999 | 1 | 206 | 696
run-test-specific | 1105 | 0 | 168 | 370
run-custom-script | 476 | 115 | 659 | 111
create-verify-script | 321 | 11 | 94 | 47
syntax-check | 183 | 0 | 57 | 18

The 'verify' category contains ~10 sub-intents. This table shows where each model's verification volume comes from.

run-test-suite: broad test runs (pytest, go test, npm test, mocha) without targeting specific tests.

run-test-specific: targeted test runs using pytest -k or :: to run specific test functions.

run-verify-script: running a script named verify*, check*, validate*, or edge_case*.

create-test-script: creating a new test file (test_*, *test.py, etc.).

run-inline-verify: an inline python -c / node -e snippet that imports project code or runs assertions.

compile-build: go build, go vet, make, npx tsc. Compilation as a verification step.

edit-test-or-repro: editing an existing test or repro file (str_replace on test_* or repro* files).

run-custom-script: running a named script that doesn't match repro/test/verify naming patterns.

create-verify-script: creating a new file named verify*, check*, validate*.

syntax-check: py_compile, compileall, node -c. Quick syntax validation.

7. Verify Outcomes

Where the outcome is determinable, pass rates diverge sharply: Sonnet 4.5's verify steps pass 80.0% of the time, GLM 4.5's 58.2%, GPT-5's 43.2%, and Gemini 2.5 Pro's just 15.0%.

Pass rate, pass / (pass + fail): Sonnet 4.5 80.0% · Gemini 2.5 Pro 15.0% · GLM 4.5 58.2% · GPT-5 43.2%
model | pass | fail | unknown | total | pass_rate
claude45 | 5276 | 1323 | 49949 | 56548 | 80.0%
gemini25pro | 130 | 736 | 30853 | 31719 | 15.0%
glm45 | 1068 | 767 | 37353 | 39188 | 58.2%
gpt5 | 338 | 445 | 42621 | 43404 | 43.2%

Only steps classified as one of these intents are evaluated for outcome: run-test-suite, run-test-specific, run-verify-script, run-custom-script, run-inline-verify, compile-build, syntax-check. All other steps get outcome ''.

pass: the observation's last 2000 characters match a framework-specific all-pass pattern. For pytest: the summary line (e.g. '200 passed in 12.3s') must contain 'passed' and must NOT contain 'failed' or 'error'. If even one test fails ('195 passed, 5 failed'), the outcome is 'fail', not 'pass'. For Go: all PASS/FAIL lines in output are checked; any FAIL makes it 'fail'. For Mocha: checks 'N passing' and 'N failing' counts. For Jest: checks summary line for 'failed' vs 'passed'. For compile-build: absence of error patterns in short output (< 200 chars) from go build/make = 'pass'. For syntax-check: py_compile with no output = 'pass'; any Error/SyntaxError = 'fail'.

fail: the observation matches a failure pattern. In priority order: framework-specific failure summaries (pytest 'failed', Go 'FAIL', Mocha failing > 0), then generic patterns: 'no tests ran', collection errors, tracebacks in the last 500 chars, Node.js throw/error, non-zero exit code.

unknown (''): no pattern matched. This happens when output is from an unrecognized framework, is truncated, is ambiguous (e.g. pytest ran but the summary line was cut off), or when the observation is empty.

pass_rate: pass / (pass + fail), excluding unknowns. This measures: of the verify steps where we could determine the outcome, what fraction had all tests passing?

Important caveat: 'pass' means 'all tests in that run passed', not 'the agent's fix is correct'. SWE-Bench Pro tasks come with existing test suites where most tests already pass on unmodified code. An agent running pytest before making any edits will often get 'pass' because the existing tests pass. This is why first_verify_pass can occur before first_edit (see Section 10).
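The detector described above can be approximated with a few regexes. This sketch covers only pytest-style summaries and the generic failure patterns, not the full per-framework logic (Go, Mocha, Jest, compile-build, syntax-check):

```python
import re

def classify_verify_outcome(observation):
    """Rough sketch of the pass/fail/unknown detector (pytest-style output only)."""
    tail = observation[-2000:]
    # all-pass: an 'N passed' summary with no 'N failed' / 'N error' alongside it
    if re.search(r'\d+ passed', tail) and not re.search(r'\d+ (?:failed|error)', tail):
        return 'pass'
    # explicit failure signals, in rough priority order
    if re.search(r'\d+ failed|no tests ran|^FAIL\b', tail, re.M):
        return 'fail'
    if 'Traceback (most recent call last)' in observation[-500:]:
        return 'fail'
    return 'unknown'  # unrecognized framework, truncated, ambiguous, or empty output
```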

8. Sequence Labels

Claude averages 2.8 edit-then-verify cycles per trajectory, plus 13.3 verify reruns without an intervening edit; GPT-5 averages 1.1 cycles. The edit-verify loop is the defining structural difference.

Average edit-then-verify cycles per trajectory: Sonnet 4.5 2.8 (2,011 total) · Gemini 2.5 Pro 1.2 (892) · GLM 4.5 3.7 (2,683) · GPT-5 1.1 (826)

Key sequence counts per trajectory (avg):

sequence | Sonnet | Gemini | GLM | GPT-5
edit-then-verify | 2.8 | 1.2 | 3.7 | 1.1
fix after failure | 0.0 | 0.1 | 0.0 | 0.1
rerun without edit | 13.3 | 0.3 | 3.4 | 0.9
submit after verify | 0.9 | 0.9 | 0.8 | 0.5
labelclaude45_ngemini25pro_nglm45_ngpt5_n
seq-verify-rerun-no-edit96882152516626
seq-reread-edited-file2751244225662233
seq-verify-after-edit20118922683826
seq-repro-after-edit4481205789923
seq-submit-after-verify656667586379
seq-work-done5923729882
seq-repro-rerun-same-command21539140273
seq-verify-rerun-same-command29184176258
seq-repro-rerun-no-edit10995106292
seq-diagnose-read-after-failed-verify164426192
seq-edit-after-failed-verify4893296
seq-diagnose-search-after-failed-verify403136

Sequence labels classify steps by their context: what happened before, whether edits or verify steps preceded them.

seq-verify-after-edit: a verify step after a source edit. The core edit-then-test loop.

seq-verify-rerun-no-edit: a verify step where no edit happened since the last verify.

seq-edit-after-failed-verify: a source edit after a failed verify step. Fixing what a test revealed.

seq-submit-after-verify: submit after at least one verify step. The agent tested before submitting.

seq-first-all-pass: the first verify-pass after the last source edit. Marks implementation completion.

Method: classify_sequence_layer() in classify_intent.py walks the trajectory maintaining state (has a verify been seen? was there an edit since?).
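A sketch of that stateful walk, covering three of the labels above (the real classify_sequence_layer() handles the full label set, repro reruns, and same-command detection):

```python
SOURCE_EDITS = {"edit-source", "insert-source", "apply-patch", "edit-via-inline-script"}
VERIFY = {"run-test-suite", "run-test-specific", "run-verify-script",
          "run-custom-script", "run-inline-verify", "compile-build", "syntax-check"}

def sequence_labels(intents):
    """Walk a trajectory's base intents and emit context-dependent sequence labels."""
    labels = []
    edited_since_verify = False   # has a source edit happened since the last verify?
    seen_verify = False           # has any verify step happened yet?
    for intent in intents:
        label = ""
        if intent in SOURCE_EDITS:
            edited_since_verify = True
        elif intent in VERIFY:
            if edited_since_verify:
                label = "seq-verify-after-edit"       # the core edit-then-test loop
            elif seen_verify:
                label = "seq-verify-rerun-no-edit"    # rerun with nothing changed
            seen_verify = True
            edited_since_verify = False
        elif intent == "submit" and seen_verify:
            label = "seq-submit-after-verify"         # tested before submitting
        labels.append(label)
    return labels
```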

8b. Failure Modes

GPT-5 records a 19.9% failure-step rate. 91.8% of its failures are tool-call friction, and 63.1% come from one shell-wrapper/apply_patch cluster (applypatch hallucination, trailing }, heredoc breakage, generic bash syntax, and the broken pipes they trigger).

Failure families: tool failures · code failures · test failures

Sonnet 4.5: 4.8% of steps flagged as failures, 42.3% of them tool friction
Gemini 2.5 Pro: 26.2% of steps flagged as failures, 79.9% tool friction
GLM 4.5: 7.7% of steps flagged as failures, 65.7% tool friction
GPT-5: 19.9% of steps flagged as failures, 91.8% tool friction
modefamilySonnet 4.5Gemini 2.5 ProGLM 4.5GPT-5GPT shareGPT trajs
broken pipe
bash_broken_pipe
tool4292101174420.2%348
view_range out of bounds
strep_invalid_range
tool21326316114813.3%416
trailing `}` leak
bash_trailing_brace
tool000112513.0%307
apply_patch missing
apply_patch_cmd_not_found
tool00096811.2%291
str_replace no match
strep_no_match
tool3444714108294711.0%287
bash syntax error
bash_syntax_error
tool87956786610.0%225
heredoc unterminated
bash_heredoc_unterminated
tool0105926.8%178
command not found
bash_command_not_found
tool48261382222.6%138
Python traceback
py_traceback_other
code3875744142192.5%102
Node error
node_error
code226761481842.1%71
str_replace path missing
strep_file_not_found
tool24374011251.4%107
apply_patch shell syntax
apply_patch_shell_syntax
tool0001071.2%58
SyntaxError
py_syntax_error
code4521944951.1%56
test suite failed
test_failed
test806653294911.1%57
ModuleNotFoundError
py_module_not_found
code99131124841.0%51
bash quote nesting
bash_quote_nesting
tool000510.6%20
IndentationError
py_indentation_error
code1206320.4%19
create over existing file
strep_create_exists
tool224218200.2%20
bash `!` history expansion
bash_history_expansion
tool1975190.2%13
str_replace not unique
strep_multiple_matches
tool32240.0%3
apply_patch other
apply_patch_other
tool00040.0%2
test collection error
test_collection_error
test00000.0%0

Illustrative GPT-5 failures

broken pipe (1744 GPT-5 steps)
`grep: write error: Broken pipe`: usually downstream of another failure in the same pipeline (the rest of the pipeline already crashed).
instance_NodeBB__NodeBB-00c70ce7b0541cfc94afe567921d7668cdc8f4ac-vnan
action
bash -lc "grep -R --line-number \"\\.\\.\\/posts\\/cache\" /app/src | sed -n '1,200p'"}
observation
sed: -e expression #1, char 7: unexpected `}'
grep: write error: Broken pipe
trailing `}` leak (1125 GPT-5 steps)
Action ends with a stray `}` (from JSON tool-call serialization), so the shell sees `… || true}` and reports `true}: command not found`. Almost exclusive to GPT-5's `bash -lc "…"}` wrapping pattern.
instance_NodeBB__NodeBB-00c70ce7b0541cfc94afe567921d7668cdc8f4ac-vnan
action
bash -lc "rg -n --no-heading "src/posts/cache\.js|require\(['\"]\.\.\/posts\/cache['\"]\)|require\(['\"]\.\/cache['\"]\)" /app | sed -n '1,200p'"}
observation
bash: require([\"]\.\.\/posts\/cache["]): No such file or directory
bash: require([\"]\.\/cache["]) /app | sed -n '1,200p'}: No such file or directory
bash: line 1: rg: command not found
apply_patch missing (968 GPT-5 steps)
Agent invoked `applypatch` but the binary doesn't exist in the sandbox. GPT-5 hallucinates this from its OpenAI/Codex training where apply_patch is the canonical edit tool.
instance_NodeBB__NodeBB-00c70ce7b0541cfc94afe567921d7668cdc8f4ac-vnan
action
bash -lc "applypatch << 'PATCH'\n*** Begin Patch\n*** Update File: /app/src/meta/index.js\n@@\n-Meta.userOrGroupExists = Meta.slugTaken; // backwards compatiblity\n+// Alias with same single/array behaviour\n+Meta.userOrGroupExists = Meta.slugTaken; // backwards compatibility\n*** End Patch\nPATCH"}
observation
bash: line 1: warning: here-document at line 1 delimited by end-of-file (wanted `PATCHn***')
bash: line 1: applypatch: command not found
bash: line 1: //: Is a directory
bash: line 1: //: Is a directory
heredoc unterminated (592 GPT-5 steps)
`warning: here-document at line N delimited by end-of-file` — the heredoc end-marker was emitted on the same line, escaped, or never sent.
instance_NodeBB__NodeBB-00c70ce7b0541cfc94afe567921d7668cdc8f4ac-vnan
action
bash -lc "python - << 'PY'\nimport os, re\nfor root, dirs, files in os.walk('/app/src'):\n    for f in files:\n        p=os.path.join(root,f)\n        try:\n            s=open(p,'rb').read()\n        except Exception as e:\n            continue\n        if b'slugTaken' in s or b'userOrGroupExists' in s or b'existsBySl…
observation
bash: line 1: warning: here-document at line 1 delimited by end-of-file (wanted `PYnimport')
bash: -c: line 1: syntax error near unexpected token `('
bash: -c: line 1: `python - << 'PY'\nimport os, re\nfor root, dirs, files in os.walk('/app/src'):\n    for f in files:\n        p=os.path.join(root,f)\n        try:\n            s=open(p,'rb').read()\n        except Exception as e:\n            continue\n        if b's…
view_range out of bounds (1148 GPT-5 steps)
`Invalid view_range` — agent asked to view lines past EOF or with a reversed range.
instance_NodeBB__NodeBB-04998908ba6721d64eba79ae3b65a351dcfbc5b5-vnan
action
str_replace_editor view /app/src/user/admin.js  --view_range 1 200
observation
Invalid `view_range`: [1, 200]. Its second element `200` should be smaller than the number of lines in the file: `92`

This section uses data/failure_modes.json, produced by scripts/build_failure_modes.py. Each trajectory step is classified as either not-a-failure or one failure mode.

Counts are step counts, not unique incidents. A single trajectory can contribute many failure steps, and one underlying shell mistake can fan out into multiple observed failures.

Families: tool means the harness/tool call itself failed; code means the agent ran code that crashed; test means a test runner reported failures.
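Detection can be sketched as an ordered pattern list over the observation text. These detectors are illustrative examples for a few of the modes above, not the full ruleset from scripts/build_failure_modes.py; the specific patterns are assumptions drawn from the example observations shown in this section.

```python
import re

# Ordered: more specific patterns first (applypatch before generic command-not-found).
FAILURE_PATTERNS = [
    (r'applypatch: command not found', ('apply_patch_cmd_not_found', 'tool')),
    (r'command not found', ('bash_command_not_found', 'tool')),
    (r'here-document at line \d+ delimited by end-of-file', ('bash_heredoc_unterminated', 'tool')),
    (r'write error: Broken pipe', ('bash_broken_pipe', 'tool')),
    (r'Invalid `view_range`', ('strep_invalid_range', 'tool')),
    (r'Traceback \(most recent call last\)', ('py_traceback_other', 'code')),
]

def failure_mode(observation):
    """Return (mode, family) for the first matching pattern, or None if not a failure."""
    for pattern, (mode, family) in FAILURE_PATTERNS:
        if re.search(pattern, observation):
            return mode, family
    return None
```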

Interpretation caveat: bash_broken_pipe is often a secondary symptom. In GPT-5, many of those steps are downstream of the same wrapper pathologies that also produce trailing-brace, heredoc, or quoting failures.

The distinctive GPT-5 signature is not ordinary test failure. It is repeated interaction friction around shell wrapping and hallucinated applypatch usage.

9. Work-Done vs Resolved

Per-model split (wd+resolved / wd+unresolved / no-wd+resolved / no-wd+unresolved):
Sonnet 4.5: 276 / 316 / 43 / 95 (37.8% wd+resolved)
Gemini 2.5 Pro: 12 / 25 / 130 / 563 (1.6% wd+resolved)
GLM 4.5: 112 / 186 / 147 / 285 (15.3% wd+resolved)
GPT-5: 47 / 35 / 218 / 430 (6.4% wd+resolved)
model | wd+resolved | wd+unresolved | no_wd+resolved | no_wd+unresolved | total
claude45 | 276 | 316 | 43 | 95 | 730
gemini25pro | 12 | 25 | 130 | 563 | 730
glm45 | 112 | 186 | 147 | 285 | 730
gpt5 | 47 | 35 | 218 | 430 | 730

A confusion matrix crossing two signals: whether the agent reached 'work-done' and whether the benchmark evaluated the patch as correct.

work-done: the trajectory contains a seq-first-all-pass label. Specifically: we find the last step classified as a source edit (edit-source, insert-source, apply-patch, edit-via-inline-script). Then we check if any verify step after that point has a 'pass' outcome (all tests passed, per the rules in Section 7). If yes, the trajectory is 'work-done'. This is a stronger signal than first_verify_pass (Section 10), which can fire before any edits because existing tests pass on unmodified code.

resolved: the submitted patch actually fixes the failing tests, as judged by the SWE-Bench Pro benchmark evaluation (from agent_runs_data.csv). This is different from 'submitted', which only means the agent produced a patch.

wd+resolved: the agent's tests passed after its last edit, and the benchmark confirmed the patch is correct. The best case.

wd+unresolved: tests passed but the patch was wrong. The agent's own verification was a false positive.

no_wd+resolved: the agent never reached a clean test pass after its final edit, yet the benchmark accepted the patch. The agent submitted without confirmation that its code works.

no_wd+unresolved: the agent neither achieved passing tests nor produced a correct patch.

The four-way split shows how each model's 730 trajectories distribute across these outcomes. Both off-diagonal cells are well populated: Claude has 316 wd+unresolved trajectories (its own tests passed but the patch was wrong), while GPT-5 resolves 218 tasks without ever reaching work-done. The agent's own test-passing signal is not a reliable predictor of benchmark resolution.

Method: 'work-done' = find the last step in SOURCE_EDIT_INTENTS (edit-source, insert-source, apply-patch, edit-via-inline-script), then scan forward for any step where classify_verify_outcome() returns 'pass'. If found, work-done is true. 'resolved' comes from the benchmark CSV (agent_runs_data.csv, field metadata.resolved), which records whether the submitted patch actually made the failing tests pass when evaluated by the benchmark harness.
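That method reduces to a few lines. A sketch, assuming each step is represented as an (intent, verify_outcome) pair where the outcome is 'pass', 'fail', or '' per Section 7:

```python
SOURCE_EDITS = {"edit-source", "insert-source", "apply-patch", "edit-via-inline-script"}

def work_done(steps):
    """True if any verify step AFTER the last source edit has outcome 'pass'."""
    edit_indices = [i for i, (intent, _) in enumerate(steps) if intent in SOURCE_EDITS]
    if not edit_indices:
        return False                      # no source edit at all -> never work-done
    last_edit = edit_indices[-1]
    return any(outcome == "pass" for _, outcome in steps[last_edit + 1:])
```

Note how a 'pass' before the last edit (baseline tests passing on unmodified code) does not count, which is exactly what distinguishes work-done from first_verify_pass.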

Caveat: work-done can be a false positive. The agent's tests might pass because the test suite doesn't cover the specific failure the task requires fixing. The agent thinks it's done (tests pass), but the benchmark's evaluation finds the bug isn't actually fixed.
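The work-done check described in the Method note above can be sketched as a short function. This is a minimal sketch, not the actual analysis code: it assumes each step is a dict with an "intent" key, and it takes the outcome classifier as a parameter because classify_verify_outcome()'s real signature isn't shown here.

```python
# Sketch of the work-done check: find the last source-editing step,
# then look for any fully-passing verify step after it.
# Assumes steps are dicts with an "intent" key (assumption; the real
# .traj step format may differ).

SOURCE_EDIT_INTENTS = {"edit-source", "insert-source", "apply-patch",
                       "edit-via-inline-script"}

def is_work_done(steps, classify_verify_outcome):
    # Index of the last source edit, or None if the agent never edited source.
    last_edit = max(
        (i for i, s in enumerate(steps) if s["intent"] in SOURCE_EDIT_INTENTS),
        default=None,
    )
    if last_edit is None:
        return False
    # A 'pass' verify outcome anywhere after the last edit counts.
    return any(
        classify_verify_outcome(s) == "pass" for s in steps[last_edit + 1:]
    )
```

Because the scan starts strictly after the last edit, a baseline test pass that happens before any edit (the first_verify_pass situation in Section 10) can never satisfy this check.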

10. Structural Markers (% of trajectory)

[Timeline chart: median positions of first edit, last edit, first verify, first pass, and submit, plotted at 0-100% of trajectory, for Sonnet 4.5, Gemini 2.5 Pro, GLM 4.5, and GPT-5.]
marker            | Sonnet 4.5 med/p25/p75 (n) | Gemini 2.5 Pro med/p25/p75 (n) | GLM 4.5 med/p25/p75 (n)  | GPT-5 med/p25/p75 (n)
first_edit        | 34.6/27.8/42.6 (701)       | 18.5/9.4/37.0 (684)            | 33.9/26.1/45.3 (709)     | 49.5/34.2/63.8 (642)
last_edit         | 61.9/47.9/78.1 (701)       | 90.3/76.5/96.4 (684)           | 80.0/64.9/89.4 (709)     | 89.4/78.4/95.4 (642)
first_verify      | 23.1/17.3/30.9 (718)       | 42.9/18.3/66.7 (175)           | 44.7/26.8/65.0 (672)     | 59.1/33.3/78.4 (391)
first_verify_pass | 28.8/19.7/47.7 (639)       | 71.6/34.0/89.7 (58)            | 70.4/53.3/85.9 (331)     | 71.8/40.7/85.2 (130)
submit            | 100.0/100.0/100.0 (644)    | 100.0/100.0/100.0 (558)        | 100.0/100.0/100.0 (595)  | 100.0/100.0/100.0 (443)

Alternative view: replace first pass with last verify

This version drops first pass, which can happen on baseline tests before any edits, and instead shows the last verify step so the post-edit verification tail is easier to see.

[Timeline chart (alternative view): first edit, last edit, first verify, last verify, and submit, plotted at 0-100% of trajectory, for Sonnet 4.5, Gemini 2.5 Pro, GLM 4.5, and GPT-5.]
marker       | Sonnet 4.5 med/p25/p75 (n) | Gemini 2.5 Pro med/p25/p75 (n) | GLM 4.5 med/p25/p75 (n)  | GPT-5 med/p25/p75 (n)
first_edit   | 34.6/27.8/42.6 (701)       | 18.5/9.4/37.0 (684)            | 33.9/26.1/45.3 (709)     | 49.5/34.2/63.8 (642)
last_edit    | 61.9/47.9/78.1 (701)       | 90.3/76.5/96.4 (684)           | 80.0/64.9/89.4 (709)     | 89.4/78.4/95.4 (642)
first_verify | 23.1/17.3/30.9 (718)       | 42.9/18.3/66.7 (175)           | 44.7/26.8/65.0 (672)     | 59.1/33.3/78.4 (391)
last_verify  | 96.1/93.7/97.4 (718)       | 87.9/64.3/94.0 (175)           | 96.8/94.0/98.1 (672)     | 93.5/81.0/96.7 (391)
submit       | 100.0/100.0/100.0 (644)    | 100.0/100.0/100.0 (558)        | 100.0/100.0/100.0 (595)  | 100.0/100.0/100.0 (443)

Key events in each trajectory, expressed as a percentage of the way through (0% = first step, 100% = last step). Aggregated across all trajectories per model.

The timeline shows median positions as shaped markers, with faint bands for the interquartile range (p25-p75). Hover over markers for exact values.

first_edit: the first step whose base intent is one of: edit-source (str_replace on a source file), insert-source (str_replace_editor insert), apply-patch, or edit-via-inline-script. Does not include create-file or edit-test-or-repro, which are classified differently.

last_edit: the last step matching those same intents. The gap between last_edit and submit is the 'tail' where the agent is verifying, cleaning up, or submitting but no longer changing source code.

first_verify: the first step whose intent is in SEQUENCE_VERIFY_INTENTS: run-test-suite, run-test-specific, run-verify-script, run-custom-script, compile-build, syntax-check, run-inline-verify.

first_verify_pass: the first step where classify_verify_outcome() returns 'pass' (see Section 7 for what 'pass' means). This does NOT mean 'the agent's fix worked'. It means 'the first time a test/build command produced output where all tests passed'. Because SWE-Bench Pro tasks have existing test suites that mostly pass on unmodified code, an agent that runs pytest before making any edits will often get a 'pass' here. This is why first_verify_pass can appear before first_edit for Claude (median 28.8% vs 34.6%): Claude runs the existing test suite early as a diagnostic baseline. For a marker that means 'the fix works', see work_done in Section 9, which requires a verify pass after the last source edit.

last_verify: the last step whose intent is in SEQUENCE_VERIFY_INTENTS. The alternate view replaces first pass with last verify to show where verification actually finishes, which makes the late verification tail clearer for Claude.

submit: the first step with intent 'submit'.

_med / _p25 / _p75: median, 25th percentile, and 75th percentile across trajectories where the event occurred.

_n: number of trajectories where this event occurred. Sonnet 4.5 has 639 for first_verify_pass (out of 730), meaning 91 trajectories never had a fully-passing test run. GPT-5 has only 130, meaning most GPT-5 trajectories either never ran tests or never achieved a clean pass.

Method: for each trajectory, scan for the first (or last) step matching the relevant intent set, compute step_index / (total_steps - 1) * 100 to get a percentage position, then take the median across all trajectories where the event occurred.
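The position computation from the Method note can be sketched as follows. This is a minimal sketch assuming steps are dicts with an "intent" key; the intent sets match the definitions above.

```python
# Sketch of a structural-marker position: where (0-100% of the way
# through the trajectory) the first or last step with a given intent falls.

def marker_position(steps, intent_set, which="first"):
    """Return the position (0-100) of the first/last step whose intent is
    in intent_set, or None if the event never occurs in this trajectory."""
    hits = [i for i, s in enumerate(steps) if s["intent"] in intent_set]
    if not hits:
        return None  # trajectories without the event are excluded from _n
    idx = hits[0] if which == "first" else hits[-1]
    if len(steps) == 1:
        return 0.0  # degenerate single-step trajectory
    return idx / (len(steps) - 1) * 100  # 0% = first step, 100% = last step
```

The per-model medians and quartiles in the table would then be computed over the non-None positions, e.g. with statistics.median().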

11. Phase Profile Heatmap

Each trajectory divided into 20 time-slices. Cell intensity shows what proportion of steps in each slice belong to each category, normalized per column.

Sonnet 4.5 (% of steps per category in each of 20 time-slices, left = start of trajectory, right = end):

read          14 54 40 32 26 22 22 22 22 20 21 18 18 17 17 16 16 17 23 23
search        86 45 54 51 43 35 29 21 18 19 19 19 19 15 12 13 11  9  9  8
reproduce      0  0  1  2  3  3  3  2  2  2  1  1  2  2  2  2  2  2  3  5
edit           0  0  0  1  5 13 19 26 25 22 18 14 13 10  9  8  7  6  6  3
verify         0  1  6 13 21 24 25 26 30 34 38 42 44 49 52 51 49 45 40 40
git            0  0  0  1  1  2  2  3  3  3  2  3  2  2  2  2  3  4  5  3
housekeeping   0  0  0  0  1  1  1  1  1  1  1  2  3  5  6  9 12 17 14 17

Gemini 2.5 Pro:

read          19 37 36 34 29 30 29 23 24 26 23 23 22 20 20 20 19 19 17 12
search        73 42 24 16 10  8  7  7  8  5  6  5  5  5  4  6  4  5  3  3
reproduce      1  4  9 10 11 11  9 12 10  9  7 11  9  9  9  8 11 10 11  9
edit           6 16 27 34 42 43 46 49 48 50 52 51 52 54 55 50 50 46 45 47
verify         1  1  3  5  6  7  7  7  8  8 10  9  9 11 10 12 13 14 12  9
git            0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
housekeeping   0  1  1  1  1  1  2  2  2  1  2  2  2  2  2  3  3  5 13 21

GLM 4.5:

read          12 54 61 53 43 34 31 28 26 23 26 21 19 19 16 15 15 12 11  8
search        88 45 32 29 25 19 16 13 11 11  9  9  9  8  8  8  7  7  5  3
reproduce      0  0  3  5  9  9  7  8  7  6  4  4  4  4  3  3  3  3  4  5
edit           0  0  1  4  9 21 28 33 35 37 34 32 30 26 25 22 18 17 14  8
verify         0  1  3  8 13 17 17 17 20 22 27 32 36 42 46 51 54 58 63 61
git            0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
housekeeping   0  0  0  0  1  0  1  1  1  1  1  1  2  1  1  2  2  3  4 15

GPT-5:

read          22 60 64 62 58 52 52 50 46 43 40 38 38 37 33 33 33 29 30 29
search        78 40 33 34 35 36 32 30 31 30 28 26 22 21 23 20 18 19 19 19
reproduce      0  0  1  1  1  2  4  3  3  4  5  6  8  9  7 11 12 13 14 13
edit           0  0  1  2  5  8 10 14 17 19 22 25 27 25 27 25 25 22 20 20
verify         0  0  1  1  2  2  2  3  3  3  5  5  6  8  9 10 12 16 18 18
git            0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
housekeeping   0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0

Read left-to-right as beginning to end of trajectory. Brighter cells indicate the dominant action in that time-slice.

Categories: read, search, reproduce, edit, verify, git, housekeeping. The failed and other categories are excluded.

Normalized per column: within each time-slice, the percentages show each category's share relative to only the displayed categories.

Method: each trajectory's step sequence is divided into 20 equal bins. Per bin, we count the fraction of steps belonging to each category, then average across all trajectories for that model.
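The binning and per-slice normalization described in the Method note can be sketched like this. It is a minimal sketch, assuming a hypothetical category_of(step) helper that maps each step to one of the displayed categories or None (for the excluded failed/other categories).

```python
# Sketch of the phase profile for one trajectory: assign each step to one
# of 20 equal time-slices, then compute each category's share per slice,
# normalized over the displayed categories only.
from collections import Counter

N_BINS = 20
CATEGORIES = {"read", "search", "reproduce", "edit",
              "verify", "git", "housekeeping"}

def phase_profile(steps, category_of):
    """Return a list of N_BINS dicts mapping category -> share in that slice."""
    bins = [Counter() for _ in range(N_BINS)]
    n = len(steps)
    for i, step in enumerate(steps):
        b = min(i * N_BINS // n, N_BINS - 1)  # which time-slice this step lands in
        cat = category_of(step)
        if cat in CATEGORIES:  # 'failed' and 'other' are dropped
            bins[b][cat] += 1
    # Normalize each slice so shares sum to 1 over the displayed categories.
    return [
        {c: cnt[c] / total for c in cnt} if (total := sum(cnt.values())) else {}
        for cnt in bins
    ]
```

The heatmap values are then the average of these per-trajectory shares across all of a model's trajectories, slice by slice.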

12. Per-Repo Breakdown

Resolve rate by repository (each dot = one model, repos with <3 tasks omitted)
[Dot plot: resolve rate per repo, one dot per model (Sonnet 4.5, Gemini 2.5 Pro, GLM 4.5, GPT-5), x-axis 0-100%. Repos by total instances: ansible (384), openlibrary (364), flipt (340), qutebrowser (316), teleport (304), webclients (260), vuls (248), navidrome (228), element-web (220), NodeBB (176), tutanota (80).]
repo        | Sonnet 4.5 n/avg/res%/ver% | Gemini 2.5 Pro n/avg/res%/ver% | GLM 4.5 n/avg/res%/ver% | GPT-5 n/avg/res%/ver%
ansible     | 96 / 82.5 / 55.2 / 33.7    | 96 / 44.1 / 26.0 / 6.8         | 96 / 52.4 / 39.6 / 26.7 | 96 / 65.5 / 49.0 / 8.8
openlibrary | 91 / 74.1 / 54.9 / 38.0    | 91 / 27.3 / 34.1 / 11.8        | 91 / 46.4 / 46.2 / 31.7 | 91 / 46.8 / 42.9 / 12.2
flipt       | 85 / 73.9 / 32.9 / 27.6    | 85 / 65.6 / 10.6 / 7.2         | 85 / 58.8 / 22.4 / 26.6 | 85 / 62.2 / 17.6 / 1.7
qutebrowser | 79 / 84.1 / 65.8 / 42.0    | 79 / 39.4 / 29.1 / 13.6        | 79 / 49.5 / 54.4 / 36.7 | 79 / 64.4 / 64.6 / 10.0
teleport    | 76 / 74.2 / 30.3 / 22.5    | 76 / 54.1 / 6.6 / 9.5          | 76 / 64.5 / 25.0 / 22.1 | 76 / 61.8 / 17.1 / 1.9
webclients  | 65 / 82.2 / 41.5 / 22.9    | 65 / 34.3 / 21.5 / 5.8         | 65 / 53.3 / 36.9 / 17.5 | 65 / 43.9 / 21.5 / 2.9
vuls        | 62 / 70.1 / 43.5 / 25.5    | 62 / 39.4 / 12.9 / 3.7         | 62 / 52.8 / 35.5 / 24.6 | 62 / 54.0 / 32.3 / 1.9
navidrome   | 57 / 74.2 / 38.6 / 28.6    | 57 / 48.8 / 17.5 / 8.8         | 57 / 60.8 / 31.6 / 26.5 | 57 / 61.2 / 35.1 / 2.3
element-web | 55 / 75.4 / 32.7 / 27.8    | 55 / 29.4 / 18.2 / 5.4         | 55 / 47.6 / 21.8 / 21.9 | 55 / 60.6 / 36.4 / 2.4
NodeBB      | 44 / 77.4 / 25.0 / 23.6    | 44 / 47.4 / 11.4 / 7.7         | 44 / 50.0 / 31.8 / 28.4 | 44 / 72.6 / 38.6 / 3.7
tutanota    | 20 / 92.1 / 40.0 / 19.3    | 20 / 52.5 / 10.0 / 2.6         | 20 / 55.0 / 40.0 / 19.1 | 20 / 78.5 / 45.0 / 3.3
Cross-model spread in resolve rate, largest first:

repo        | min res% | max res% | spread (pp) | best model
qutebrowser | 29.1     | 65.8     | 36.7        | Sonnet 4.5
tutanota    | 10.0     | 45.0     | 35.0        | GPT-5
vuls        | 12.9     | 43.5     | 30.6        | Sonnet 4.5
ansible     | 26.0     | 55.2     | 29.2        | Sonnet 4.5
NodeBB      | 11.4     | 38.6     | 27.2        | GPT-5
teleport    | 6.6      | 30.3     | 23.7        | Sonnet 4.5
flipt       | 10.6     | 32.9     | 22.3        | Sonnet 4.5
navidrome   | 17.5     | 38.6     | 21.1        | Sonnet 4.5
openlibrary | 34.1     | 54.9     | 20.8        | Sonnet 4.5
webclients  | 21.5     | 41.5     | 20.0        | Sonnet 4.5
element-web | 18.2     | 36.4     | 18.2        | GPT-5

Metrics broken down by source repository. SWE-Bench Pro tasks come from ~11 open-source repos. The dot plot shows whether one model dominates uniformly or whether there is repo-specific variation.

Dot plot: each dot is one model's resolve rate on that repo. When dots cluster, models perform similarly on that repo; when they spread apart, the repo differentiates models. Repos with fewer than 3 tasks per model are omitted from the plot to avoid noisy rates.

Spread table: the gap (in percentage points) between the best and worst model on each repo. Large spreads indicate repos where model choice matters most.

_n: number of task instances from this repo.

_avg: average steps per trajectory.

_res%: resolve rate (percentage of trajectories where the submitted patch fixes the failing tests).

_ver%: percentage of steps spent on verify actions.

Sorted by total number of instances across all models (most common repos first). The 11 qualifying repos are shown.
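The spread table above is a straightforward min/max computation per repo. A minimal sketch, assuming per-repo resolve rates are already available as a dict of model name to percentage:

```python
# Sketch of one row of the spread table: for a single repo, the gap
# (in percentage points) between the best and worst model's resolve rate.

def resolve_spread(res_by_model):
    """res_by_model: {model_name: resolve_rate_percent} for one repo."""
    best = max(res_by_model, key=res_by_model.get)
    lo, hi = min(res_by_model.values()), max(res_by_model.values())
    return {"min": lo, "max": hi, "spread": round(hi - lo, 1), "best": best}
```

For example, feeding in the qutebrowser resolve rates from the per-repo table reproduces the 36.7 pp spread with Sonnet 4.5 as the best model.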