Reference Tables

SWE-Bench Pro trajectory analysis. 4 models, 2920 trajectories.

0. Resolution and Submission

Of 730 tasks, the highest resolve rate is 44% (Sonnet 4.5). Most submitted patches do not resolve the issue.

Sonnet 4.5

319/730

Gemini 2.5 Pro

142/730

GLM 4.5

259/730

GPT-5

265/730

resolved submitted, not resolved not submitted

model	n	submitted (clean)	submitted (w/ error)	not submitted	resolved	resolve rate
claude45	730	643	75	12	319	43.7%
gemini25pro	730	538	152	40	142	19.5%
glm45	730	595	118	17	259	35.5%
gpt5	730	438	185	107	265	36.3%

Each SWE-Bench Pro trajectory ends with an exit status. The agent either submits a patch or doesn't.

submitted (clean): exit_status is exactly submitted.

submitted (w/ error): the agent produced a submission, but the harness also recorded an error condition (timeout, context overflow, cost limit, format error).

not submitted: the agent never ran the submit command. It hit an error before submitting.

resolved: the submitted patch actually fixes the failing tests (from benchmark evaluation).

Method: we read info.exit_status and info.submission from each .traj file.

0b. Exit Status Breakdown

exit_status	claude45	gemini25pro	glm45	gpt5
submitted	643	538	595	438
submitted (exit_command_timeout)	17	60	3	78
submitted (exit_error)	37	18	33	20
submitted (exit_format)	2	39	53	5
submitted (exit_context)	2	0	29	63
exit_error	9	22	6	53
submitted (exit_total_execution_time)	3	15	0	18
submitted (exit_cost)	14	20	0	1
exit_command_timeout	1	9	0	24
exit_format	1	5	10	7
exit_total_execution_time	1	0	0	13
exit_context	0	0	0	9
exit_cost	0	3	0	0
	0	1	0	1
exit_command	0	0	1	0

Every distinct exit_status string from the .traj files, with counts per model.

submitted: clean exit after submitting.

submitted (exit_*): submitted, but also hit an error condition (timeout, context, cost, format, etc.).

exit_* (without submitted prefix): the agent hit that condition and never submitted.

These statuses are set by the SWE-Agent harness, not by the model itself.

1. Trajectory Metadata

Median trajectory length varies from 29 to 78 steps (Gemini 2.5 Pro vs. Sonnet 4.5). The longest trajectories take ~2.7x more steps per task.

Sonnet 4.5

63–89

Gemini 2.5 Pro

17–49

GLM 4.5

40–65

GPT-5

34–76

median p25–p75

model	n	total_steps	avg	median	p25	p75	min	max	resolved	resolve_rate
claude45	730	56548	77.5	78.0	63	89	2	215	319	43.7%
gemini25pro	730	31719	43.5	29.0	17	49	1	251	142	19.5%
glm45	730	39188	53.7	52.0	40	65	1	218	259	35.5%
gpt5	730	43404	59.5	53.0	34	76	2	251	265	36.3%

Summary statistics for trajectory length (number of action-observation steps per task).

n: trajectories (one per task instance).

total_steps: sum of all steps across all trajectories for this model.

avg / median / p25 / p75 / min / max: distribution of steps per trajectory.

resolved: trajectories where the submitted patch fixes the failing tests (from agent_runs_data.csv).

resolve_rate: resolved / n.

1b. Intent Classification Taxonomy

Every step is labelled by a deterministic, priority-ordered ruleset in scripts/classify_intent.py — no model inference. Each row shows the intent, what it means, the literal rule that fires it, and how often it fires per model.

Classification order (first rule that matches wins):

Empty action → empty. submit prefix → submit.
str_replace_editor {view, create, str_replace, insert, undo_edit} → classified by sub-command and filename pattern (test/config/repro/verify/doc).
Bash is unwrapped: strip bash -lc "...", leading cd ... &&, source ... &&, timeout N, and FOO=bar env prefixes.
If the observation shows a shell-level error (syntax error, command not found, unexpected token, …), the command head routes to the matching (failed) label.
Otherwise match the command head: test runners, compile/syntax, search, read, list, git, python/node scripts (named vs. inline, with inline further sub-classified by code shape), file cleanup, install, service, tool-exists, metadata, echo.
Anything that reached the end is bash-other (<2% of steps by design).

intent	description	classification rule	Sonnet 4.5	Gemini 2.5 Pro	GLM 4.5	GPT-5
read
read-file-full	view an entire source file via str_replace_editor	str_replace_editor view <file> (fallback once test, config, range, and truncated views are ruled out)	3.1k	3.9k	3.0k	5.0k
read-file-range	view a specific line range (--view_range)	str_replace_editor view with `--view_range`	6.0k	1.6k	6.4k	6.0k
read-file-full(truncated)	view a file that was too large, got abbreviated	str_replace_editor view where the observation contains `too large to display`	198	361	229	245
read-test-file	view a test file (test_, _test., conftest)	str_replace_editor view on a filename matching `test_`, `_test.`, or `conftest`	644	56	259	635
read-config-file	view package.json, pytest.ini, setup.cfg, go.mod, Makefile, etc.	str_replace_editor view on `package.json`, `pytest.ini`, `setup.cfg`, `setup.py`, `go.mod`, `Makefile`, `config.json`	26	48	72	206
read-via-bash	cat, head, tail, sed -n, nl, awk	`cat`, `head`, `tail`, `sed -n`, `nl`, `awk`	2.3k	140	133	3.0k
read-via-inline-script	inline snippet that reads a file and prints content	inline snippet that reads a file (`.read()`, `open(...,'r')`, `readFileSync`) and prints, without writing	76	3	21	373
search
view-directory	view a directory listing via str_replace_editor	str_replace_editor view where path has no extension, or observation lists “files and directories”	1.1k	12	2.5k	2.1k
list-directory	ls, tree, pwd	`ls`, `tree`, `pwd`	843	1.4k	86	708
search-keyword	grep, rg, ag for a pattern	`grep`, `rg`, `ag`	7.0k	905	2.8k	6.5k
search-files-by-name	find ... -name (locating files by name/path)	`find ... -name` with no grep/xargs pipe	1.8k	310	266	49
search-files-by-content	find ... -exec grep / find \| xargs grep	`find ... -exec grep` or `find ... \| xargs grep`	3.3k	32	851	10
inspect-file-metadata	wc, file, stat	`wc`, `file`, `stat`	246	1	14	22
check-version	inline snippet that checks python/node version	inline snippet matching `--version`, `-V`, `sys.version`, or `node -v`	6	0	0	2
reproduce
create-repro-script	create a file named repro, reproduce, demo*	str_replace_editor create on a filename containing `repro`, `reproduce`, or `demo`	157	551	372	463
run-repro-script	run a file named repro, reproduce, demo*	run a named script whose basename matches `repro` or `reproduce` (python, node, sh, bash, go run)	375	1.8k	1.1k	1.1k
run-inline-snippet	python -c, python - <<, python3 -c, node -e	`python -c`, `python - <<`, `node -e` — residual when no inline sub-pattern matches	472	92	146	193
edit
edit-source	str_replace on a non-test, non-repro source file	str_replace_editor str_replace on a filename not matching test/repro/verify/check	5.2k	10.3k	7.3k	5.0k
insert-source	str_replace_editor insert on a source file	str_replace_editor insert	12	2.0k	16	803
apply-patch	applypatch command (GPT-specific alternative to str_replace)	`applypatch` command (GPT-specific)	0	0	0	94
create-file	create a file that doesn't match repro/test/verify/doc patterns	str_replace_editor create on a filename not matching repro/test/verify/doc patterns	595	686	575	326
edit-via-inline-script	inline snippet that reads, modifies, and writes a file	inline snippet that writes (`.write()`, `writeFileSync`) together with reading or `.replace()`/`re.sub()`	5	0	1	245
create-file-via-inline-script	inline snippet that writes a file without reading first	inline snippet that writes a file with no prior read	21	0	0	41
verify
run-test-suite	pytest, go test, npm test, npx jest, mocha (broad)	`pytest`, `go test`, `npm test`, `npx jest`, `mocha`, `yarn test`, `python -m unittest` (broad; no `::` or `-k`)	5.9k	789	1.1k	585
run-test-specific	pytest with -k or :: (targeting specific tests)	a test runner command containing `::` or `-k`	1.1k	0	168	370
create-test-script	create a file named test_, test.py, test.js, test.go	str_replace_editor create on a filename matching `test_`, `test.py`, `test.js`, `test.go`	2.6k	121	1.9k	18
run-verify-script	run a file named test_, verify, check, validate, edge_case*	run a named script whose basename contains `test_`, `verify`, `check`, `validate`, or `edge_case`	3.4k	208	3.0k	113
create-verify-script	create a file named verify, check, validate*	str_replace_editor create on a filename matching `verify`, `check`, or `validate*`	321	11	94	47
edit-test-or-repro	str_replace on a test or repro file	str_replace_editor str_replace on a filename containing `test_`, `repro`, `verify`, or `check`	712	1.1k	2.3k	243
run-custom-script	run a named script that doesn't match repro/test/verify patterns	run a named python/node/sh/bash/go script whose basename doesn’t match repro/test/verify patterns	476	115	659	111
syntax-check	py_compile, compileall, node -c	`py_compile`, `compileall`, `node -c`	183	0	57	18
compile-build	go build, go vet, make, npx tsc, tsc	`go build`, `go vet`, `make`, `tsc`, `npx tsc`, `npm run build`, `yarn build`	1.1k	150	754	41
run-inline-verify	inline snippet that imports project code or runs assertions	inline snippet with `import/from` + `assert`/`print` (smoke test or assertion)	999	1	206	696
git
git-diff	git diff	`git diff` (with or without `-C <dir>`)	538	0	9	23
git-status-log	git status, git show, git log	`git status`, `git show`, `git log`	652	4	13	23
git-stash	git stash	`git stash`	28	0	2	0
housekeeping
file-cleanup	rm, mv, cp, chmod	`rm`, `mv`, `cp`, `chmod`	1.6k	588	426	17
create-documentation	create a file named summary, readme, changes, implementation	str_replace_editor create on a filename matching `summary`, `readme`, `changes`, `implementation`	661	1	71	2
start-service	redis-server, redis-cli, mongod, sleep	`redis-server`, `redis-cli`, `mongod`, `sleep`	26	71	42	4
install-deps	pip install, pip list, npm install, go get, apt	`pip install`, `pip list`, `npm install`, `go get`, `apt`	20	129	37	0
check-tool-exists	which, type	`which`, `type`	16	0	10	2
failed
search-keyword(failed)	grep/find that hit shell errors	`grep`/`find` whose observation contains a shell error	46	8	33	2.7k
read-via-bash(failed)	cat/head/sed that hit shell errors	`cat`/`head`/`sed`/`tail`/`ls` whose observation contains a shell error	23	36	2	994
run-script(failed)	python/node run that hit shell errors	`python`/`node` whose observation contains a shell error	47	155	90	759
run-test-suite(failed)	pytest/test that hit shell errors	test runner whose observation contains a shell error	6	94	3	155
bash-command(failed)	other bash that hit shell errors	any other bash command whose observation contains a shell error	32	264	33	1.2k
other
submit	submit the patch	action’s first line starts with `submit`	656	1.2k	595	537
empty	empty action string (rate limit, context window exit)	action string is blank (rate-limit or context-window exit)	770	906	964	854
echo	echo, printf	`echo`, `printf`	140	9	33	69
bash-other	unclassified bash command	final fallback — bash command that matched no other rule (<2% of steps by design)	928	843	458	631
undo-edit	str_replace_editor undo_edit	`str_replace_editor undo_edit`	4	662	3	39

The label describes what the command is, derived from the action string and filename alone — no positional context (before/after first edit) and no outcome signal is used. A failed grep is still a search attempt.

(failed) variants classify by intended action, not outcome. They require a shell-level error in the first 500 chars of the observation.

run-inline-snippet is a residual — inline snippets (python -c, python - <<, node -e) are first routed to run-inline-verify / read-via-inline-script / edit-via-inline-script / create-file-via-inline-script / check-version by inspecting the code shape.

Pass/fail outcome for verify intents (used by seq-first-all-pass / seq-work-done) is a separate detector that reads the observation for unambiguous runner summaries: e.g. pytest N passed in Xs / N failed, go ok package / FAIL package, jest Tests: N passed / N failed. Ambiguous output returns unknown.

Canonical source: scripts/classify_intent.py and docs/intent-classification-rules.md.

2. Base Intent Frequencies

The top 10 intents account for the bulk of all steps; the long tail of ~40 others fills in the edges.

Sonnet 4.5 Gemini 2.5 Pro GLM 4.5 GPT-5

edit-source

9.2%

32.4%

18.5%

11.5%

read-file-range

10.6%

5.1%

16.3%

13.8%

search-keyword

12.4%

2.9%

7.2%

15.0%

read-file-full

5.5%

12.4%

7.8%

11.6%

run-test-suite

10.5%

2.5%

2.7%

1.3%

run-verify-script

6.0%

0.7%

7.7%

0.3%

read-via-bash

4.1%

0.4%

0.3%

6.9%

view-directory

2.0%

0.0%

6.3%

4.9%

search-keyword(failed)

0.1%

0.0%

0.1%

6.3%

insert-source

0.0%

6.3%

0.0%

1.9%

intent	claude45_n	claude45_%	gemini25pro_n	gemini25pro_%	glm45_n	glm45_%	gpt5_n	gpt5_%
edit-source	5217	9.2%	10279	32.4%	7268	18.5%	4983	11.5%
read-file-range	5974	10.6%	1616	5.1%	6377	16.3%	5997	13.8%
search-keyword	7002	12.4%	905	2.9%	2804	7.2%	6499	15.0%
read-file-full	3125	5.5%	3943	12.4%	3041	7.8%	5020	11.6%
run-test-suite	5942	10.5%	789	2.5%	1064	2.7%	585	1.3%
run-verify-script	3420	6.0%	208	0.7%	3011	7.7%	113	0.3%
view-directory	1137	2.0%	12	0.0%	2488	6.3%	2133	4.9%
read-via-bash	2345	4.1%	140	0.4%	133	0.3%	2974	6.9%
create-test-script	2633	4.7%	121	0.4%	1913	4.9%	18	0.0%
edit-test-or-repro	712	1.3%	1140	3.6%	2337	6.0%	243	0.6%
run-repro-script	375	0.7%	1755	5.5%	1082	2.8%	1067	2.5%
search-files-by-content	3254	5.8%	32	0.1%	851	2.2%	10	0.0%
empty	770	1.4%	906	2.9%	964	2.5%	854	2.0%
submit	656	1.2%	1239	3.9%	595	1.5%	537	1.2%
list-directory	843	1.5%	1378	4.3%	86	0.2%	708	1.6%
bash-other	928	1.6%	843	2.7%	458	1.2%	631	1.5%
insert-source	12	0.0%	2007	6.3%	16	0.0%	803	1.9%
search-keyword(failed)	46	0.1%	8	0.0%	33	0.1%	2748	6.3%
file-cleanup	1554	2.7%	588	1.9%	426	1.1%	17	0.0%
search-files-by-name	1792	3.2%	310	1.0%	266	0.7%	49	0.1%
create-file	595	1.1%	686	2.2%	575	1.5%	326	0.8%
compile-build	1088	1.9%	150	0.5%	754	1.9%	41	0.1%
run-inline-verify	999	1.8%	1	0.0%	206	0.5%	696	1.6%
run-test-specific	1105	2.0%	0	0.0%	168	0.4%	370	0.9%
read-test-file	644	1.1%	56	0.2%	259	0.7%	635	1.5%
bash-command(failed)	32	0.1%	264	0.8%	33	0.1%	1217	2.8%
create-repro-script	157	0.3%	551	1.7%	372	0.9%	463	1.1%
run-custom-script	476	0.8%	115	0.4%	659	1.7%	111	0.3%
read-via-bash(failed)	23	0.0%	36	0.1%	2	0.0%	994	2.3%
run-script(failed)	47	0.1%	155	0.5%	90	0.2%	759	1.7%
read-file-full(truncated)	198	0.4%	361	1.1%	229	0.6%	245	0.6%
run-inline-snippet	472	0.8%	92	0.3%	146	0.4%	193	0.4%
create-documentation	661	1.2%	1	0.0%	71	0.2%	2	0.0%
undo-edit	4	0.0%	662	2.1%	3	0.0%	39	0.1%
git-status-log	652	1.2%	4	0.0%	13	0.0%	23	0.1%
git-diff	538	1.0%	0	0.0%	9	0.0%	23	0.1%
create-verify-script	321	0.6%	11	0.0%	94	0.2%	47	0.1%
read-via-inline-script	76	0.1%	3	0.0%	21	0.1%	373	0.9%
read-config-file	26	0.0%	48	0.2%	72	0.2%	206	0.5%
inspect-file-metadata	246	0.4%	1	0.0%	14	0.0%	22	0.1%
run-test-suite(failed)	6	0.0%	94	0.3%	3	0.0%	155	0.4%
syntax-check	183	0.3%	0	0.0%	57	0.1%	18	0.0%
echo	140	0.2%	9	0.0%	33	0.1%	69	0.2%
edit-via-inline-script	5	0.0%	0	0.0%	1	0.0%	245	0.6%
install-deps	20	0.0%	129	0.4%	37	0.1%	0	0.0%
start-service	26	0.0%	71	0.2%	42	0.1%	4	0.0%
apply-patch	0	0.0%	0	0.0%	0	0.0%	94	0.2%
create-file-via-inline-script	21	0.0%	0	0.0%	0	0.0%	41	0.1%
git-stash	28	0.0%	0	0.0%	2	0.0%	0	0.0%
check-tool-exists	16	0.0%	0	0.0%	10	0.0%	2	0.0%
check-version	6	0.0%	0	0.0%	0	0.0%	2	0.0%

Every trajectory step is classified into one of ~50 base intents using deterministic rules (regex matching on the action string, file names, and observation text). No LLM is used for classification.

_n: total count of that intent across all trajectories for the model.

_%: that count as a percentage of all steps for the model. Percentages sum to 100% within each model.

Sorted by total count across all models (most frequent first).

Method: each step's action string is pattern-matched against a priority-ordered ruleset in classify_intent.py. For example, an action starting with str_replace_editor view is classified as a read intent, while grep or rg becomes search-keyword.

3. High-Level Category Frequencies

Read and search dominate every model; the gap is in verify and edit proportions.

Proportion of steps by category

Sonnet 4.5

read 22%search 25%edit 10%verify 30%44

Gemini 2.5 Pro

read 19%search 8%reproduce 8%edit 41%verify 8%other 12%

GLM 4.5

read 26%search 17%4edit 20%verify 26%5

GPT-5

read 36%search 22%4edit 15%5failed 14%5

readsearchreproduceeditverifygithousekeepingfailedother

category	claude45_n	claude45_%	claude45_per_traj	gemini25pro_n	gemini25pro_%	gemini25pro_per_traj	glm45_n	glm45_%	glm45_per_traj	gpt5_n	gpt5_%	gpt5_per_traj
read	12388	21.9%	17.0	6167	19.4%	8.4	10132	25.9%	13.9	15450	35.6%	21.2
search	14280	25.3%	19.6	2638	8.3%	3.6	6509	16.6%	8.9	9423	21.7%	12.9
reproduce	1004	1.8%	1.4	2398	7.6%	3.3	1600	4.1%	2.2	1723	4.0%	2.4
edit	5850	10.3%	8.0	12972	40.9%	17.8	7860	20.1%	10.8	6492	15.0%	8.9
verify	16879	29.8%	23.1	2535	8.0%	3.5	10263	26.2%	14.1	2242	5.2%	3.1
git	1218	2.2%	1.7	4	0.0%	0.0	24	0.1%	0.0	46	0.1%	0.1
housekeeping	2277	4.0%	3.1	789	2.5%	1.1	586	1.5%	0.8	25	0.1%	0.0
failed	154	0.3%	0.2	557	1.8%	0.8	161	0.4%	0.2	5873	13.5%	8.0
other	2498	4.4%	3.4	3659	11.5%	5.0	2053	5.2%	2.8	2130	4.9%	2.9

Each base intent maps to one of 9 high-level categories: read, search, reproduce, edit, verify, git, housekeeping, failed, other.

_n: total steps in that category.

_%: percentage of all steps.

_per_traj: average number of steps in that category per trajectory (total category steps / number of trajectories).

The mapping from base intent to category is defined in classify_intent.py (INTENT_TO_HIGH_LEVEL). For example, read-file-full, read-file-range, and read-via-bash all map to 'read'.

4. Phase Groupings

Claude spends 28% verifying; GPT-5 spends 3.6%. Gemini reads the most; Claude cleans up the most.

Proportion of steps by phase

Sonnet 4.5

understand 47%edit 10%verify 30%cleanup 6%

Gemini 2.5 Pro

understand 28%reproduce 8%edit 41%verify 8%

GLM 4.5

understand 42%4edit 20%verify 26%

GPT-5

understand 57%4edit 15%5

understandreproduceeditverifycleanup

phase	claude45_%	gemini25pro_%	glm45_%	gpt5_%
understand	47.2%	27.8%	42.5%	57.3%
reproduce	1.8%	7.6%	4.1%	4.0%
edit	10.3%	40.9%	20.1%	15.0%
verify	29.8%	8.0%	26.2%	5.2%
cleanup	6.2%	2.5%	1.6%	0.2%

The 9 high-level categories are further grouped into 5 phases that represent the broad arc of a trajectory:

understand = read + search. The agent is reading code and searching for information.

reproduce = reproduce. The agent is writing or running reproduction scripts to confirm the bug.

edit = edit. The agent is making source code changes.

verify = verify. The agent is running tests, compiling, or checking its work.

cleanup = git + housekeeping. The agent is reviewing changes (git diff/log) or cleaning up (rm, mv, writing docs).

These phases are used in the stacked area charts (Typical Trajectory Shape) to show how the mix of actions evolves from start to end.

5. Verify Sub-Intent Breakdown

How models verify differs in kind, not just amount. Claude and GLM lean on broad test suites; Gemini and GPT use more targeted runs and custom scripts.

test-suite verify-script c-test-script edit-test compile-build inline-verify test-specific custom-script c-verify-script syntax-check

Sonnet 4.5

16,879

Gemini 2.5 Pro

2,535

GLM 4.5

10,263

GPT-5

2,242

intent	claude45_n	gemini25pro_n	glm45_n	gpt5_n
run-test-suite	5942	789	1064	585
run-verify-script	3420	208	3011	113
create-test-script	2633	121	1913	18
edit-test-or-repro	712	1140	2337	243
compile-build	1088	150	754	41
run-inline-verify	999	1	206	696
run-test-specific	1105	0	168	370
run-custom-script	476	115	659	111
create-verify-script	321	11	94	47
syntax-check	183	0	57	18

The 'verify' category contains ~10 sub-intents. This table shows where each model's verification volume comes from.

run-test-suite: broad test runs (pytest, go test, npm test, mocha) without targeting specific tests.

run-test-specific: targeted test runs using pytest -k or :: to run specific test functions.

run-verify-script: running a script named verify*, check*, validate*, or edge_case*.

create-test-script: creating a new test file (test_*, *test.py, etc.).

run-inline-verify: an inline python -c / node -e snippet that imports project code or runs assertions.

compile-build: go build, go vet, make, npx tsc. Compilation as a verification step.

edit-test-or-repro: editing an existing test or repro file (str_replace on test_* or repro* files).

run-custom-script: running a named script that doesn't match repro/test/verify naming patterns.

create-verify-script: creating a new file named verify*, check*, validate*.

syntax-check: py_compile, compileall, node -c. Quick syntax validation.

7. Verify Outcomes

About half of all verify steps yield a pass. Claude and GLM have nearly identical pass rates (~51%), despite Claude running 3x more verify steps.

passfailunknown

Sonnet 4.5

pass rate 80.0%

Gemini 2.5 Pro

pass rate 15.0%

GLM 4.5

pass rate 58.2%

GPT-5

pass rate 43.2%

model	pass	fail	unknown	total	pass_rate
claude45	5276	1323	49949	56548	80.0%
gemini25pro	130	736	30853	31719	15.0%
glm45	1068	767	37353	39188	58.2%
gpt5	338	445	42621	43404	43.2%

Only steps classified as one of these intents are evaluated for outcome: run-test-suite, run-test-specific, run-verify-script, run-custom-script, run-inline-verify, compile-build, syntax-check. All other steps get outcome ''.

pass: the observation's last 2000 characters match a framework-specific all-pass pattern. For pytest: the summary line (e.g. '200 passed in 12.3s') must contain 'passed' and must NOT contain 'failed' or 'error'. If even one test fails ('195 passed, 5 failed'), the outcome is 'fail', not 'pass'. For Go: all PASS/FAIL lines in output are checked; any FAIL makes it 'fail'. For Mocha: checks 'N passing' and 'N failing' counts. For Jest: checks summary line for 'failed' vs 'passed'. For compile-build: absence of error patterns in short output (< 200 chars) from go build/make = 'pass'. For syntax-check: py_compile with no output = 'pass'; any Error/SyntaxError = 'fail'.

fail: the observation matches a failure pattern. In priority order: framework-specific failure summaries (pytest 'failed', Go 'FAIL', Mocha failing > 0), then generic patterns: 'no tests ran', collection errors, tracebacks in the last 500 chars, Node.js throw/error, non-zero exit code.

unknown (''): no pattern matched. This happens when output is from an unrecognized framework, is truncated, is ambiguous (e.g. pytest ran but the summary line was cut off), or when the observation is empty.

pass_rate: pass / (pass + fail), excluding unknowns. This measures: of the verify steps where we could determine the outcome, what fraction had all tests passing?

Important caveat: 'pass' means 'all tests in that run passed', not 'the agent's fix is correct'. SWE-Bench Pro tasks come with existing test suites where most tests already pass on unmodified code. An agent running pytest before making any edits will often get 'pass' because the existing tests pass. This is why first_verify_pass can occur before first_edit (see Section 10).

8. Sequence Labels

Claude averages 6 edit-then-verify cycles per trajectory. GPT averages less than 1. The edit-verify loop is the defining structural difference.

Average edit-then-verify cycles per trajectory

Sonnet 4.5

2.8 avg (2,011 total)

Gemini 2.5 Pro

1.2 avg (892 total)

GLM 4.5

3.7 avg (2,683 total)

GPT-5

1.1 avg (826 total)

Key sequence counts per trajectory (avg)

sequence

Sonnet

Gemini

GLM

GPT-5

edit-then-verify

2.8

1.2

3.7

1.1

fix after failure

0.0

0.1

0.0

0.1

rerun without edit

13.3

0.3

3.4

0.9

submit after verify

0.9

0.8

0.5

label	claude45_n	gemini25pro_n	glm45_n	gpt5_n
seq-verify-rerun-no-edit	9688	215	2516	626
seq-reread-edited-file	2751	2442	2566	2233
seq-verify-after-edit	2011	892	2683	826
seq-repro-after-edit	448	1205	789	923
seq-submit-after-verify	656	667	586	379
seq-work-done	592	37	298	82
seq-repro-rerun-same-command	215	391	40	273
seq-verify-rerun-same-command	291	84	176	258
seq-repro-rerun-no-edit	109	95	106	292
seq-diagnose-read-after-failed-verify	16	44	26	192
seq-edit-after-failed-verify	4	89	32	96
seq-diagnose-search-after-failed-verify	4	0	3	136

Sequence labels classify steps by their context: what happened before, whether edits or verify steps preceded them.

seq-verify-after-edit: a verify step after a source edit. The core edit-then-test loop.

seq-verify-rerun-no-edit: a verify step where no edit happened since the last verify.

seq-edit-after-failed-verify: a source edit after a failed verify step. Fixing what a test revealed.

seq-submit-after-verify: submit after at least one verify step. The agent tested before submitting.

seq-first-all-pass: the first verify-pass after the last source edit. Marks implementation completion.

Method: classify_sequence_layer() in classify_intent.py walks the trajectory maintaining state (has a verify been seen? was there an edit since?).

8b. Failure Modes

GPT-5 records a 19.9% failure-step rate. 91.8% of its failures are tool-call friction, and 63.1% come from one shell-wrapper/apply_patch cluster (applypatch hallucination, trailing }, heredoc breakage, generic bash syntax, and the broken pipes they trigger).

tool failurescode failurestest failures

Sonnet 4.5

4.8% of steps flagged as failures · 42.3% tool friction

Gemini 2.5 Pro

26.2% of steps flagged as failures · 79.9% tool friction

GLM 4.5

7.7% of steps flagged as failures · 65.7% tool friction

GPT-5

19.9% of steps flagged as failures · 91.8% tool friction

mode	family	Sonnet 4.5	Gemini 2.5 Pro	GLM 4.5	GPT-5	GPT share	GPT trajs
broken pipe `bash_broken_pipe`	tool	429	2	101	1744	20.2%	348
view_range out of bounds `strep_invalid_range`	tool	213	26	316	1148	13.3%	416
trailing `}` leak `bash_trailing_brace`	tool	0	0	0	1125	13.0%	307
apply_patch missing `apply_patch_cmd_not_found`	tool	0	0	0	968	11.2%	291
str_replace no match `strep_no_match`	tool	344	4714	1082	947	11.0%	287
bash syntax error `bash_syntax_error`	tool	87	956	7	866	10.0%	225
heredoc unterminated `bash_heredoc_unterminated`	tool	0	1	0	592	6.8%	178
command not found `bash_command_not_found`	tool	48	261	38	222	2.6%	138
Python traceback `py_traceback_other`	code	387	574	414	219	2.5%	102
Node error `node_error`	code	226	76	148	184	2.1%	71
str_replace path missing `strep_file_not_found`	tool	2	437	401	125	1.4%	107
apply_patch shell syntax `apply_patch_shell_syntax`	tool	0	0	0	107	1.2%	58
SyntaxError `py_syntax_error`	code	45	219	44	95	1.1%	56
test suite failed `test_failed`	test	806	653	294	91	1.1%	57
ModuleNotFoundError `py_module_not_found`	code	99	131	124	84	1.0%	51
bash quote nesting `bash_quote_nesting`	tool	0	0	0	51	0.6%	20
IndentationError `py_indentation_error`	code	1	20	6	32	0.4%	19
create over existing file `strep_create_exists`	tool	2	242	18	20	0.2%	20
bash `!` history expansion `bash_history_expansion`	tool	19	7	5	19	0.2%	13
str_replace not unique `strep_multiple_matches`	tool	3	2	2	4	0.0%	3
apply_patch other `apply_patch_other`	tool	0	0	0	4	0.0%	2
test collection error `test_collection_error`	test	0	0	0	0	0.0%	0

Illustrative GPT-5 failures

broken pipe (1744 GPT-5 steps)

`grep: write error: Broken pipe` — usually downstream of one of the above (the rest of the pipeline already crashed).

instance_NodeBB__NodeBB-00c70ce7b0541cfc94afe567921d7668cdc8f4ac-vnan

action
bash -lc "grep -R --line-number \"\\.\\.\\/posts\\/cache\" /app/src | sed -n '1,200p'"}

observation
sed: -e expression #1, char 7: unexpected `}'
grep: write error: Broken pipe

trailing `}` leak (1125 GPT-5 steps)

Action ends with a stray `}` (from JSON tool-call serialization), so the shell sees `… || true}` and reports `true}: command not found`. Almost exclusive to GPT-5's `bash -lc "…"}` wrapping pattern.

instance_NodeBB__NodeBB-00c70ce7b0541cfc94afe567921d7668cdc8f4ac-vnan

action
bash -lc "rg -n --no-heading "src/posts/cache\.js|require\(['\"]\.\.\/posts\/cache['\"]\)|require\(['\"]\.\/cache['\"]\)" /app | sed -n '1,200p'"}

observation
bash: require([\"]\.\.\/posts\/cache["]): No such file or directory
bash: require([\"]\.\/cache["]) /app | sed -n '1,200p'}: No such file or directory
bash: line 1: rg: command not found

apply_patch missing (968 GPT-5 steps)

Agent invoked `applypatch` but the binary doesn't exist in the sandbox. GPT-5 hallucinates this from its OpenAI/Codex training where apply_patch is the canonical edit tool.

instance_NodeBB__NodeBB-00c70ce7b0541cfc94afe567921d7668cdc8f4ac-vnan

action
bash -lc "applypatch << 'PATCH'\n*** Begin Patch\n*** Update File: /app/src/meta/index.js\n@@\n-Meta.userOrGroupExists = Meta.slugTaken; // backwards compatiblity\n+// Alias with same single/array behaviour\n+Meta.userOrGroupExists = Meta.slugTaken; // backwards compatibility\n*** End Patch\nPATCH"}

observation
bash: line 1: warning: here-document at line 1 delimited by end-of-file (wanted `PATCHn***')
bash: line 1: applypatch: command not found
bash: line 1: //: Is a directory
bash: line 1: //: Is a directory

heredoc unterminated (592 GPT-5 steps)

`warning: here-document at line N delimited by end-of-file` — the heredoc end-marker was emitted on the same line, escaped, or never sent.

instance_NodeBB__NodeBB-00c70ce7b0541cfc94afe567921d7668cdc8f4ac-vnan

action
bash -lc "python - << 'PY'\nimport os, re\nfor root, dirs, files in os.walk('/app/src'):\n    for f in files:\n        p=os.path.join(root,f)\n        try:\n            s=open(p,'rb').read()\n        except Exception as e:\n            continue\n        if b'slugTaken' in s or b'userOrGroupExists' in s or b'existsBySl…

observation
bash: line 1: warning: here-document at line 1 delimited by end-of-file (wanted `PYnimport')
bash: -c: line 1: syntax error near unexpected token `('
bash: -c: line 1: `python - << 'PY'\nimport os, re\nfor root, dirs, files in os.walk('/app/src'):\n    for f in files:\n        p=os.path.join(root,f)\n        try:\n            s=open(p,'rb').read()\n        except Exception as e:\n            continue\n        if b's…

view_range out of bounds (1148 GPT-5 steps)

`Invalid view_range` — agent asked to view lines past EOF or with a reversed range.

instance_NodeBB__NodeBB-04998908ba6721d64eba79ae3b65a351dcfbc5b5-vnan

action
str_replace_editor view /app/src/user/admin.js  --view_range 1 200

observation
Invalid `view_range`: [1, 200]. Its second element `200` should be smaller than the number of lines in the file: `92`

This section uses data/failure_modes.json, produced by scripts/build_failure_modes.py. Each trajectory step is classified as either not-a-failure or one failure mode.

Counts are step counts, not unique incidents. A single trajectory can contribute many failure steps, and one underlying shell mistake can fan out into multiple observed failures.

Families: tool means the harness/tool call itself failed; code means the agent ran code that crashed; test means a test runner reported failures.

Interpretation caveat: bash_broken_pipe is often a secondary symptom. In GPT-5, many of those steps are downstream of the same wrapper pathologies that also produce trailing-brace, heredoc, or quoting failures.

The distinctive GPT-5 signature is not ordinary test failure. It is repeated interaction friction around shell wrapping and hallucinated applypatch usage.

9. Work-Done vs Resolved

wd + resolvedwd + unresolvedno wd + resolvedno wd + unresolved

Sonnet 4.5

276

316

37.8% wd+resolved

Gemini 2.5 Pro

130

563

1.6% wd+resolved

GLM 4.5

112

186

147

285

15.3% wd+resolved

GPT-5

218

430

6.4% wd+resolved

model	wd+resolved	wd+unresolved	no_wd+resolved	no_wd+unresolved	total
claude45	276	316	43	95	730
gemini25pro	12	25	130	563	730
glm45	112	186	147	285	730
gpt5	47	35	218	430	730

A confusion matrix crossing two signals: whether the agent reached 'work-done' and whether the benchmark evaluated the patch as correct.

work-done: the trajectory contains a seq-first-all-pass label. Specifically: we find the last step classified as a source edit (edit-source, insert-source, apply-patch, edit-via-inline-script). Then we check if any verify step after that point has a 'pass' outcome (all tests passed, per the rules in Section 7). If yes, the trajectory is 'work-done'. This is a stronger signal than first_verify_pass (Section 10), which can fire before any edits because existing tests pass on unmodified code.

resolved: the submitted patch actually fixes the failing tests, as judged by the SWE-Bench Pro benchmark evaluation (from agent_runs_data.csv). This is different from 'submitted', which only means the agent produced a patch.

wd+resolved: the agent's tests passed after its last edit, and the benchmark confirmed the patch is correct. The best case.

wd+unresolved: tests passed but the patch was wrong. The agent's own verification was a false positive.

no_wd+resolved: the agent never reached a clean test pass after its final edit, yet the benchmark accepted the patch. The agent submitted without confirmation that its code works.

no_wd+unresolved: the agent neither achieved passing tests nor produced a correct patch.

The stacked bars show how each model's 730 trajectories split across these four outcomes. The no_wd+resolved column is notably large across all models, meaning the agent's own test-passing signal is not a reliable predictor of benchmark resolution.

Method: 'work-done' = find the last step in SOURCE_EDIT_INTENTS (edit-source, insert-source, apply-patch, edit-via-inline-script), then scan forward for any step where classify_verify_outcome() returns 'pass'. If found, work-done is true. 'resolved' comes from the benchmark CSV (agent_runs_data.csv, field metadata.resolved), which records whether the submitted patch actually made the failing tests pass when evaluated by the benchmark harness.

Caveat: work-done can be a false positive. The agent's tests might pass because the test suite doesn't cover the specific failure the task requires fixing. The agent thinks it's done (tests pass), but the benchmark's evaluation finds the bug isn't actually fixed.

10. Structural Markers (% of trajectory)

first edit last edit first verify first pass submit

0%25%50%75%100%

Sonnet 4.5

Gemini 2.5 Pro

GLM 4.5

GPT-5

marker	claude45_med	claude45_p25	claude45_p75	claude45_n	gemini25pro_med	gemini25pro_p25	gemini25pro_p75	gemini25pro_n	glm45_med	glm45_p25	glm45_p75	glm45_n	gpt5_med	gpt5_p25	gpt5_p75	gpt5_n
first_edit	34.6	27.8	42.6	701	18.5	9.4	37.0	684	33.9	26.1	45.3	709	49.5	34.2	63.8	642
last_edit	61.9	47.9	78.1	701	90.3	76.5	96.4	684	80.0	64.9	89.4	709	89.4	78.4	95.4	642
first_verify	23.1	17.3	30.9	718	42.9	18.3	66.7	175	44.7	26.8	65.0	672	59.1	33.3	78.4	391
first_verify_pass	28.8	19.7	47.7	639	71.6	34.0	89.7	58	70.4	53.3	85.9	331	71.8	40.7	85.2	130
submit	100.0	100.0	100.0	644	100.0	100.0	100.0	558	100.0	100.0	100.0	595	100.0	100.0	100.0	443

Alternative view: replace first pass with last verify

This version drops first pass, which can happen on baseline tests before any edits, and instead shows the last verify step so the post-edit verification tail is easier to see.

first edit last edit first verify last verify submit

0%25%50%75%100%

Sonnet 4.5

Gemini 2.5 Pro

GLM 4.5

GPT-5

marker	claude45_med	claude45_p25	claude45_p75	claude45_n	gemini25pro_med	gemini25pro_p25	gemini25pro_p75	gemini25pro_n	glm45_med	glm45_p25	glm45_p75	glm45_n	gpt5_med	gpt5_p25	gpt5_p75	gpt5_n
first_edit	34.6	27.8	42.6	701	18.5	9.4	37.0	684	33.9	26.1	45.3	709	49.5	34.2	63.8	642
last_edit	61.9	47.9	78.1	701	90.3	76.5	96.4	684	80.0	64.9	89.4	709	89.4	78.4	95.4	642
first_verify	23.1	17.3	30.9	718	42.9	18.3	66.7	175	44.7	26.8	65.0	672	59.1	33.3	78.4	391
last_verify	96.1	93.7	97.4	718	87.9	64.3	94.0	175	96.8	94.0	98.1	672	93.5	81.0	96.7	391
submit	100.0	100.0	100.0	644	100.0	100.0	100.0	558	100.0	100.0	100.0	595	100.0	100.0	100.0	443

Key events in each trajectory, expressed as a percentage of the way through (0% = first step, 100% = last step). Aggregated across all trajectories per model.

The timeline shows median positions as shaped markers, with faint bands for the interquartile range (p25-p75). Hover over markers for exact values.

first_edit: the first step whose base intent is one of: edit-source (str_replace on a source file), insert-source (str_replace_editor insert), apply-patch, or edit-via-inline-script. Does not include create-file or edit-test-or-repro, which are classified differently.

last_edit: the last step matching those same intents. The gap between last_edit and submit is the 'tail' where the agent is verifying, cleaning up, or submitting but no longer changing source code.

first_verify: the first step whose intent is in SEQUENCE_VERIFY_INTENTS: run-test-suite, run-test-specific, run-verify-script, run-custom-script, compile-build, syntax-check, run-inline-verify.

first_verify_pass: the first step where classify_verify_outcome() returns 'pass' (see Section 7 for what 'pass' means). This does NOT mean 'the agent's fix worked'. It means 'the first time a test/build command produced output where all tests passed'. Because SWE-Bench Pro tasks have existing test suites that mostly pass on unmodified code, an agent that runs pytest before making any edits will often get a 'pass' here. This is why first_verify_pass can appear before first_edit for Claude (median 28.8% vs 34.6%): Claude runs the existing test suite early as a diagnostic baseline. For a marker that means 'the fix works', see work_done in Section 9, which requires a verify pass after the last source edit.

last_verify: the last step whose intent is in SEQUENCE_VERIFY_INTENTS. The alternate view replaces first pass with last verify to show where verification actually finishes, which makes the late verification tail clearer for Claude.

submit: the first step with intent 'submit'.

_med / _p25 / _p75: median, 25th percentile, and 75th percentile across trajectories where the event occurred.

_n: number of trajectories where this event occurred. Claude has 639 for first_verify_pass (out of 730) meaning 91 trajectories never had a fully-passing test run. GPT has only 130, meaning most GPT trajectories either never ran tests or never achieved a clean pass.

Method: for each trajectory, scan for the first (or last) step matching the relevant intent set, compute step_index / (total_steps - 1) * 100 to get a percentage position, then take the median across all trajectories where the event occurred.

11. Phase Profile Heatmap

Each trajectory divided into 20 time-slices. Cell intensity shows what proportion of steps in each slice belong to each category, normalized per column.

Sonnet 4.5

25%

50%

75%

read

14%

54%

40%

32%

26%

22%

20%

21%

18%

17%

16%

17%

23%

86%

45%

54%

51%

43%

35%

29%

21%

18%

19%

15%

12%

13%

11%

reproduce

edit

13%

19%

26%

25%

22%

18%

14%

13%

10%

verify

13%

21%

24%

25%

26%

30%

34%

38%

42%

44%

49%

52%

51%

49%

45%

40%

git

housekeeping

12%

17%

14%

17%

Gemini 2.5 Pro

25%

50%

75%

read

19%

37%

36%

34%

29%

30%

29%

23%

24%

26%

23%

22%

20%

19%

17%

12%

73%

42%

24%

16%

10%

reproduce

10%

11%

12%

10%

11%

10%

11%

edit

16%

27%

34%

42%

43%

46%

49%

48%

50%

52%

51%

52%

54%

55%

50%

46%

45%

47%

verify

10%

11%

10%

12%

13%

14%

12%

git

housekeeping

13%

21%

GLM 4.5

25%

50%

75%

read

12%

54%

61%

53%

43%

34%

31%

28%

26%

23%

26%

21%

19%

16%

15%

12%

11%

88%

45%

32%

29%

25%

19%

16%

13%

11%

reproduce

edit

21%

28%

33%

35%

37%

34%

32%

30%

26%

25%

22%

18%

17%

14%

verify

13%

17%

20%

22%

27%

32%

36%

42%

46%

51%

54%

58%

63%

61%

git

housekeeping

15%

GPT-5

25%

50%

75%

read

22%

60%

64%

62%

58%

52%

50%

46%

43%

40%

38%

37%

33%

29%

30%

29%

78%

40%

33%

34%

35%

36%

32%

30%

31%

30%

28%

26%

22%

21%

23%

20%

18%

19%

reproduce

11%

12%

13%

14%

13%

edit

10%

14%

17%

19%

22%

25%

27%

25%

27%

25%

22%

20%

verify

10%

12%

16%

18%

git

housekeeping

Read left-to-right as beginning to end of trajectory. Brighter cells indicate the dominant action in that time-slice.

Categories: read, search, reproduce, edit, verify, git, housekeeping. Failed and other are excluded.

Normalized per column: within each time-slice, the percentages show each category's share relative to only the displayed categories.

Method: each trajectory's step sequence is divided into 20 equal bins. Per bin, we count the fraction of steps belonging to each category, then average across all trajectories for that model.

12. Per-Repo Breakdown

Resolve rate by repository (each dot = one model, repos with <3 tasks omitted)

Sonnet 4.5Gemini 2.5 ProGLM 4.5GPT-5

ansible (384)

openlibrary (364)

flipt (340)

qutebrowser (316)

teleport (304)

webclients (260)

vuls (248)

navidrome (228)

element-web (220)

NodeBB (176)

tutanota (80)

0%20%40%60%80%100%

repo	Sonnet 4.5_n	Sonnet 4.5_avg	Sonnet 4.5_res%	Sonnet 4.5_ver%	Gemini 2.5 Pro_n	Gemini 2.5 Pro_avg	Gemini 2.5 Pro_res%	Gemini 2.5 Pro_ver%	GLM 4.5_n	GLM 4.5_avg	GLM 4.5_res%	GLM 4.5_ver%	GPT-5_n	GPT-5_avg	GPT-5_res%	GPT-5_ver%
ansible	96	82.5	55.2	33.7	96	44.1	26.0	6.8	96	52.4	39.6	26.7	96	65.5	49.0	8.8
openlibrary	91	74.1	54.9	38.0	91	27.3	34.1	11.8	91	46.4	46.2	31.7	91	46.8	42.9	12.2
flipt	85	73.9	32.9	27.6	85	65.6	10.6	7.2	85	58.8	22.4	26.6	85	62.2	17.6	1.7
qutebrowser	79	84.1	65.8	42.0	79	39.4	29.1	13.6	79	49.5	54.4	36.7	79	64.4	64.6	10.0
teleport	76	74.2	30.3	22.5	76	54.1	6.6	9.5	76	64.5	25.0	22.1	76	61.8	17.1	1.9
webclients	65	82.2	41.5	22.9	65	34.3	21.5	5.8	65	53.3	36.9	17.5	65	43.9	21.5	2.9
vuls	62	70.1	43.5	25.5	62	39.4	12.9	3.7	62	52.8	35.5	24.6	62	54.0	32.3	1.9
navidrome	57	74.2	38.6	28.6	57	48.8	17.5	8.8	57	60.8	31.6	26.5	57	61.2	35.1	2.3
element-web	55	75.4	32.7	27.8	55	29.4	18.2	5.4	55	47.6	21.8	21.9	55	60.6	36.4	2.4
NodeBB	44	77.4	25.0	23.6	44	47.4	11.4	7.7	44	50.0	31.8	28.4	44	72.6	38.6	3.7
tutanota	20	92.1	40.0	19.3	20	52.5	10.0	2.6	20	55	40.0	19.1	20	78.5	45.0	3.3

Cross-model spread in resolve rate, largest first
repo	min res%	max res%	spread (pp)	best model
qutebrowser	29.1	65.8	36.7	Sonnet 4.5
tutanota	10.0	45.0	35.0	GPT-5
vuls	12.9	43.5	30.6	Sonnet 4.5
ansible	26.0	55.2	29.2	Sonnet 4.5
NodeBB	11.4	38.6	27.2	GPT-5
teleport	6.6	30.3	23.7	Sonnet 4.5
flipt	10.6	32.9	22.3	Sonnet 4.5
navidrome	17.5	38.6	21.1	Sonnet 4.5
openlibrary	34.1	54.9	20.8	Sonnet 4.5
webclients	21.5	41.5	20.0	Sonnet 4.5
element-web	18.2	36.4	18.2	GPT-5

Metrics broken down by source repository. SWE-Bench Pro tasks come from ~11 open-source repos. The dot plot shows whether one model dominates uniformly or whether there is repo-specific variation.

Dot plot: each dot is one model's resolve rate on that repo. When dots cluster, models perform similarly on that repo; when they spread apart, the repo differentiates models. Repos with fewer than 3 tasks per model are omitted from the plot to avoid noisy rates.

Spread table: the gap (in percentage points) between the best and worst model on each repo. Large spreads indicate repos where model choice matters most.

_n: number of task instances from this repo.

_avg: average steps per trajectory.

_res%: resolve rate (percentage of trajectories where the submitted patch fixes the failing tests).

_ver%: percentage of steps spent on verify actions.

Sorted by total number of instances across all models (most common repos first). Top 12 shown.