SWE-Bench Pro trajectory analysis. 2 models, 1460 trajectories.
Across 730 tasks, the higher resolve rate is 43.7% (Sonnet 4.5). Most submitted patches do not resolve the issue.
| model | n | submitted (clean) | submitted (w/ error) | not submitted | resolved | resolve rate |
|---|---|---|---|---|---|---|
| claude45 | 730 | 643 | 75 | 12 | 319 | 43.7% |
| gpt5 | 730 | 438 | 185 | 107 | 265 | 36.3% |
Each SWE-Bench Pro trajectory ends with an exit status. The agent either submits a patch or doesn't.
submitted (clean): exit_status is exactly submitted.
submitted (w/ error): the agent produced a submission, but the harness also recorded an error condition (timeout, context overflow, cost limit, format error).
not submitted: the agent never ran the submit command. It hit an error before submitting.
resolved: the submitted patch actually fixes the failing tests (from benchmark evaluation).
Method: we read info.exit_status and info.submission from each .traj file.
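The tallying step can be sketched as a short script. The info.exit_status field name comes from the method note above; the directory layout and the bucketing helper are assumptions for illustration:

```python
import json
from collections import Counter
from pathlib import Path

def classify_exit(exit_status: str) -> str:
    """Bucket a raw exit_status string (per the definitions above)."""
    if exit_status == "submitted":
        return "submitted (clean)"
    if exit_status.startswith("submitted"):
        return "submitted (w/ error)"  # e.g. "submitted (exit_context)"
    return "not submitted"             # agent never ran the submit command

def tally(traj_dir: str) -> Counter:
    """Count exit-status buckets over a directory of .traj files."""
    counts = Counter()
    for path in Path(traj_dir).glob("*.traj"):
        info = json.loads(path.read_text()).get("info", {})
        counts[classify_exit(info.get("exit_status", ""))] += 1
    return counts
```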
| exit_status | claude45 | gpt5 |
|---|---|---|
| submitted | 643 | 438 |
| submitted (exit_command_timeout) | 17 | 78 |
| submitted (exit_context) | 2 | 63 |
| exit_error | 9 | 53 |
| submitted (exit_error) | 37 | 20 |
| exit_command_timeout | 1 | 24 |
| submitted (exit_total_execution_time) | 3 | 18 |
| submitted (exit_cost) | 14 | 1 |
| exit_total_execution_time | 1 | 13 |
| exit_context | 0 | 9 |
| exit_format | 1 | 7 |
| submitted (exit_format) | 2 | 5 |
| (unlabeled) | 0 | 1 |
Every distinct exit_status string from the .traj files, with counts per model.
submitted: clean exit after submitting.
submitted (exit_*): submitted, but also hit an error condition (timeout, context, cost, format, etc.).
exit_* (without submitted prefix): the agent hit that condition and never submitted.
These statuses are set by the SWE-Agent harness, not by the model itself.
Median trajectory length is 78 steps for Sonnet 4.5 vs. 53 for GPT-5; at the median, Sonnet 4.5 takes ~1.5x more steps per task.
| model | n | total_steps | avg | median | p25 | p75 | min | max | resolved | resolve_rate |
|---|---|---|---|---|---|---|---|---|---|---|
| claude45 | 730 | 56548 | 77.5 | 78.0 | 63 | 89 | 2 | 215 | 319 | 43.7% |
| gpt5 | 730 | 43404 | 59.5 | 53.0 | 34 | 76 | 2 | 251 | 265 | 36.3% |
Summary statistics for trajectory length (number of action-observation steps per task).
n: trajectories (one per task instance).
total_steps: sum of all steps across all trajectories for this model.
avg / median / p25 / p75 / min / max: distribution of steps per trajectory.
resolved: trajectories where the submitted patch fixes the failing tests (from agent_runs_data.csv).
resolve_rate: resolved / n.
Every step is labelled by a deterministic, priority-ordered ruleset in scripts/classify_intent.py — no model inference. Each row shows the intent, what it means, the literal rule that fires it, and how often it fires per model.
Classification order (first rule that matches wins):
1. Empty action string → empty.
2. A submit prefix → submit.
3. str_replace_editor {view, create, str_replace, insert, undo_edit} → classified by sub-command and filename pattern (test/config/repro/verify/doc).
4. Shell wrappers are stripped before matching: bash -lc "...", leading cd ... &&, source ... &&, timeout N, and FOO=bar env prefixes.
5. If the observation shows a shell-level error (syntax error, command not found, unexpected token, …), the command head routes to the matching (failed) label.
6. Anything left falls through to bash-other (<2% of steps by design).
| intent | description | classification rule | Sonnet 4.5 | GPT-5 |
|---|---|---|---|---|
| read | | | | |
| read-file-full | view an entire source file via str_replace_editor | str_replace_editor view <file> (fallback once test, config, range, and truncated views are ruled out) | 3.1k | 5.0k |
| read-file-range | view a specific line range (--view_range) | str_replace_editor view with --view_range | 6.0k | 6.0k |
| read-file-full(truncated) | view a file that was too large, got abbreviated | str_replace_editor view where the observation contains too large to display | 198 | 245 |
| read-test-file | view a test file (test_*, _test.*, conftest) | str_replace_editor view on a filename matching test_*, *_test.*, or conftest* | 644 | 635 |
| read-config-file | view package.json, pytest.ini, setup.cfg, go.mod, Makefile, etc. | str_replace_editor view on package.json, pytest.ini, setup.cfg, setup.py, go.mod, Makefile, config.json | 26 | 206 |
| read-via-bash | cat, head, tail, sed -n, nl, awk | cat, head, tail, sed -n, nl, awk | 2.3k | 3.0k |
| read-via-inline-script | inline snippet that reads a file and prints content | inline snippet that reads a file (.read(), open(...,'r'), readFileSync) and prints, without writing | 76 | 373 |
| search | | | | |
| view-directory | view a directory listing via str_replace_editor | str_replace_editor view where path has no extension, or observation lists “files and directories” | 1.1k | 2.1k |
| list-directory | ls, tree, pwd | ls, tree, pwd | 843 | 708 |
| search-keyword | grep, rg, ag for a pattern | grep, rg, ag | 7.0k | 6.5k |
| search-files-by-name | find ... -name (locating files by name/path) | find ... -name with no grep/xargs pipe | 1.8k | 49 |
| search-files-by-content | find ... -exec grep / find | xargs grep | find ... -exec grep or find ... | xargs grep | 3.3k | 10 |
| inspect-file-metadata | wc, file, stat | wc, file, stat | 246 | 22 |
| check-version | inline snippet that checks python/node version | inline snippet matching --version, -V, sys.version, or node -v | 6 | 2 |
| reproduce | | | | |
| create-repro-script | create a file named repro*, reproduce*, demo* | str_replace_editor create on a filename containing repro, reproduce, or demo | 157 | 463 |
| run-repro-script | run a file named repro*, reproduce*, demo* | run a named script whose basename matches repro* or reproduce* (python, node, sh, bash, go run) | 375 | 1.1k |
| run-inline-snippet | python -c, python - <<, python3 -c, node -e | python -c, python - <<, node -e — residual when no inline sub-pattern matches | 472 | 193 |
| edit | | | | |
| edit-source | str_replace on a non-test, non-repro source file | str_replace_editor str_replace on a filename not matching test/repro/verify/check | 5.2k | 5.0k |
| insert-source | str_replace_editor insert on a source file | str_replace_editor insert | 12 | 803 |
| apply-patch | applypatch command (GPT-specific alternative to str_replace) | applypatch command (GPT-specific) | 0 | 94 |
| create-file | create a file that doesn't match repro/test/verify/doc patterns | str_replace_editor create on a filename not matching repro/test/verify/doc patterns | 595 | 326 |
| edit-via-inline-script | inline snippet that reads, modifies, and writes a file | inline snippet that writes (.write(), writeFileSync) together with reading or .replace()/re.sub() | 5 | 245 |
| create-file-via-inline-script | inline snippet that writes a file without reading first | inline snippet that writes a file with no prior read | 21 | 41 |
| verify | | | | |
| run-test-suite | pytest, go test, npm test, npx jest, mocha (broad) | pytest, go test, npm test, npx jest, mocha, yarn test, python -m unittest (broad; no :: or -k) | 5.9k | 585 |
| run-test-specific | pytest with -k or :: (targeting specific tests) | a test runner command containing :: or -k | 1.1k | 370 |
| create-test-script | create a file named test_*, *test.py, *test.js, *test.go | str_replace_editor create on a filename matching test_*, *test.py, *test.js, *test.go | 2.6k | 18 |
| run-verify-script | run a file named test_*, verify*, check*, validate*, edge_case* | run a named script whose basename contains test_, verify, check, validate, or edge_case | 3.4k | 113 |
| create-verify-script | create a file named verify*, check*, validate* | str_replace_editor create on a filename matching verify*, check*, or validate* | 321 | 47 |
| edit-test-or-repro | str_replace on a test or repro file | str_replace_editor str_replace on a filename containing test_, repro, verify, or check | 712 | 243 |
| run-custom-script | run a named script that doesn't match repro/test/verify patterns | run a named python/node/sh/bash/go script whose basename doesn’t match repro/test/verify patterns | 476 | 111 |
| syntax-check | py_compile, compileall, node -c | py_compile, compileall, node -c | 183 | 18 |
| compile-build | go build, go vet, make, npx tsc, tsc | go build, go vet, make, tsc, npx tsc, npm run build, yarn build | 1.1k | 41 |
| run-inline-verify | inline snippet that imports project code or runs assertions | inline snippet with import/from + assert/print (smoke test or assertion) | 999 | 696 |
| git | | | | |
| git-diff | git diff | git diff (with or without -C <dir>) | 538 | 23 |
| git-status-log | git status, git show, git log | git status, git show, git log | 652 | 23 |
| git-stash | git stash | git stash | 28 | 0 |
| housekeeping | | | | |
| file-cleanup | rm, mv, cp, chmod | rm, mv, cp, chmod | 1.6k | 17 |
| create-documentation | create a file named summary, readme, changes, implementation | str_replace_editor create on a filename matching *summary*, *readme*, *changes*, *implementation* | 661 | 2 |
| start-service | redis-server, redis-cli, mongod, sleep | redis-server, redis-cli, mongod, sleep | 26 | 4 |
| install-deps | pip install, pip list, npm install, go get, apt | pip install, pip list, npm install, go get, apt | 20 | 0 |
| check-tool-exists | which, type | which, type | 16 | 2 |
| failed | | | | |
| search-keyword(failed) | grep/find that hit shell errors | grep/find whose observation contains a shell error | 46 | 2.7k |
| read-via-bash(failed) | cat/head/sed that hit shell errors | cat/head/sed/tail/ls whose observation contains a shell error | 23 | 994 |
| run-script(failed) | python/node run that hit shell errors | python/node whose observation contains a shell error | 47 | 759 |
| run-test-suite(failed) | pytest/test that hit shell errors | test runner whose observation contains a shell error | 6 | 155 |
| bash-command(failed) | other bash that hit shell errors | any other bash command whose observation contains a shell error | 32 | 1.2k |
| other | | | | |
| submit | submit the patch | action’s first line starts with submit | 656 | 537 |
| empty | empty action string (rate limit, context window exit) | action string is blank (rate-limit or context-window exit) | 770 | 854 |
| echo | echo, printf | echo, printf | 140 | 69 |
| bash-other | unclassified bash command | final fallback — bash command that matched no other rule (<2% of steps by design) | 928 | 631 |
| undo-edit | str_replace_editor undo_edit | str_replace_editor undo_edit | 4 | 39 |
The label describes what the command is, derived from the action string and filename alone — no positional context (before/after first edit) and no outcome signal is used. A failed grep is still a search attempt.
(failed) variants classify by intended action, not outcome. They require a shell-level error in the first 500 chars of the observation.
run-inline-snippet is a residual — inline snippets (python -c, python - <<, node -e) are first routed to run-inline-verify / read-via-inline-script / edit-via-inline-script / create-file-via-inline-script / check-version by inspecting the code shape.
Pass/fail outcome for verify intents (used by seq-first-all-pass / seq-work-done) is a separate detector that reads the observation for unambiguous runner summaries: e.g. pytest N passed in Xs / N failed, go ok package / FAIL package, jest Tests: N passed / N failed. Ambiguous output returns unknown.
Canonical source: scripts/classify_intent.py and docs/intent-classification-rules.md.
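The first-match-wins dispatch can be sketched in a few lines. The patterns below are an illustrative subset, not the actual ruleset in scripts/classify_intent.py:

```python
import re

def strip_wrappers(action: str) -> str:
    """Peel bash -lc "...", leading cd ... &&, and FOO=bar env prefixes."""
    m = re.match(r'bash -lc "(.*)"\s*$', action, re.S)
    if m:
        action = m.group(1)
    return re.sub(r"^(cd [^&]+&&\s*|[A-Z_]+=\S+\s+)*", "", action).lstrip()

# Priority-ordered rules: the first predicate that matches wins.
RULES = [
    ("empty",           lambda a: a.strip() == ""),
    ("submit",          lambda a: a.startswith("submit")),
    ("read-file-range", lambda a: a.startswith("str_replace_editor view") and "--view_range" in a),
    ("read-test-file",  lambda a: a.startswith("str_replace_editor view")
                                  and re.search(r"(test_|_test\.|conftest)", a) is not None),
    ("read-file-full",  lambda a: a.startswith("str_replace_editor view")),
    ("search-keyword",  lambda a: re.match(r"(grep|rg|ag)\b", strip_wrappers(a)) is not None),
]

def classify(action: str) -> str:
    for intent, pred in RULES:
        if pred(action):
            return intent
    return "bash-other"  # final fallback, <2% of steps by design
```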
The top 10 intents account for the bulk of all steps; the long tail of ~40 others fills in the edges.
| intent | claude45_n | claude45_% | gpt5_n | gpt5_% |
|---|---|---|---|---|
| search-keyword | 7002 | 12.4% | 6499 | 15.0% |
| read-file-range | 5974 | 10.6% | 5997 | 13.8% |
| edit-source | 5217 | 9.2% | 4983 | 11.5% |
| read-file-full | 3125 | 5.5% | 5020 | 11.6% |
| run-test-suite | 5942 | 10.5% | 585 | 1.3% |
| read-via-bash | 2345 | 4.1% | 2974 | 6.9% |
| run-verify-script | 3420 | 6.0% | 113 | 0.3% |
| view-directory | 1137 | 2.0% | 2133 | 4.9% |
| search-files-by-content | 3254 | 5.8% | 10 | 0.0% |
| search-keyword(failed) | 46 | 0.1% | 2748 | 6.3% |
| create-test-script | 2633 | 4.7% | 18 | 0.0% |
| search-files-by-name | 1792 | 3.2% | 49 | 0.1% |
| run-inline-verify | 999 | 1.8% | 696 | 1.6% |
| empty | 770 | 1.4% | 854 | 2.0% |
| file-cleanup | 1554 | 2.7% | 17 | 0.0% |
| bash-other | 928 | 1.6% | 631 | 1.5% |
| list-directory | 843 | 1.5% | 708 | 1.6% |
| run-test-specific | 1105 | 2.0% | 370 | 0.9% |
| run-repro-script | 375 | 0.7% | 1067 | 2.5% |
| read-test-file | 644 | 1.1% | 635 | 1.5% |
| bash-command(failed) | 32 | 0.1% | 1217 | 2.8% |
| submit | 656 | 1.2% | 537 | 1.2% |
| compile-build | 1088 | 1.9% | 41 | 0.1% |
| read-via-bash(failed) | 23 | 0.0% | 994 | 2.3% |
| edit-test-or-repro | 712 | 1.3% | 243 | 0.6% |
| create-file | 595 | 1.1% | 326 | 0.8% |
| insert-source | 12 | 0.0% | 803 | 1.9% |
| run-script(failed) | 47 | 0.1% | 759 | 1.7% |
| git-status-log | 652 | 1.2% | 23 | 0.1% |
| run-inline-snippet | 472 | 0.8% | 193 | 0.4% |
| create-documentation | 661 | 1.2% | 2 | 0.0% |
| create-repro-script | 157 | 0.3% | 463 | 1.1% |
| run-custom-script | 476 | 0.8% | 111 | 0.3% |
| git-diff | 538 | 1.0% | 23 | 0.1% |
| read-via-inline-script | 76 | 0.1% | 373 | 0.9% |
| read-file-full(truncated) | 198 | 0.4% | 245 | 0.6% |
| create-verify-script | 321 | 0.6% | 47 | 0.1% |
| inspect-file-metadata | 246 | 0.4% | 22 | 0.1% |
| edit-via-inline-script | 5 | 0.0% | 245 | 0.6% |
| read-config-file | 26 | 0.0% | 206 | 0.5% |
| echo | 140 | 0.2% | 69 | 0.2% |
| syntax-check | 183 | 0.3% | 18 | 0.0% |
| run-test-suite(failed) | 6 | 0.0% | 155 | 0.4% |
| apply-patch | 0 | 0.0% | 94 | 0.2% |
| create-file-via-inline-script | 21 | 0.0% | 41 | 0.1% |
| undo-edit | 4 | 0.0% | 39 | 0.1% |
| start-service | 26 | 0.0% | 4 | 0.0% |
| git-stash | 28 | 0.0% | 0 | 0.0% |
| install-deps | 20 | 0.0% | 0 | 0.0% |
| check-tool-exists | 16 | 0.0% | 2 | 0.0% |
| check-version | 6 | 0.0% | 2 | 0.0% |
Every trajectory step is classified into one of ~50 base intents using deterministic rules (regex matching on the action string, file names, and observation text). No LLM is used for classification.
_n: total count of that intent across all trajectories for the model.
_%: that count as a percentage of all steps for the model. Percentages sum to 100% within each model.
Sorted by total count across all models (most frequent first).
Method: each step's action string is pattern-matched against a priority-ordered ruleset in classify_intent.py. For example, an action starting with str_replace_editor view is classified as a read intent, while grep or rg becomes search-keyword.
Read and search dominate every model; the gap is in verify and edit proportions.
| category | claude45_n | claude45_% | claude45_per_traj | gpt5_n | gpt5_% | gpt5_per_traj |
|---|---|---|---|---|---|---|
| read | 12388 | 21.9% | 17.0 | 15450 | 35.6% | 21.2 |
| search | 14280 | 25.3% | 19.6 | 9423 | 21.7% | 12.9 |
| reproduce | 1004 | 1.8% | 1.4 | 1723 | 4.0% | 2.4 |
| edit | 5850 | 10.3% | 8.0 | 6492 | 15.0% | 8.9 |
| verify | 16879 | 29.8% | 23.1 | 2242 | 5.2% | 3.1 |
| git | 1218 | 2.2% | 1.7 | 46 | 0.1% | 0.1 |
| housekeeping | 2277 | 4.0% | 3.1 | 25 | 0.1% | 0.0 |
| failed | 154 | 0.3% | 0.2 | 5873 | 13.5% | 8.0 |
| other | 2498 | 4.4% | 3.4 | 2130 | 4.9% | 2.9 |
Each base intent maps to one of 9 high-level categories: read, search, reproduce, edit, verify, git, housekeeping, failed, other.
_n: total steps in that category.
_%: percentage of all steps.
_per_traj: average number of steps in that category per trajectory (total category steps / number of trajectories).
The mapping from base intent to category is defined in classify_intent.py (INTENT_TO_HIGH_LEVEL). For example, read-file-full, read-file-range, and read-via-bash all map to 'read'.
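A sketch of how that lookup works. Only a few entries are shown; the full dict in classify_intent.py covers all ~50 base intents:

```python
# Excerpted base-intent → category mapping (illustrative subset).
INTENT_TO_HIGH_LEVEL = {
    "read-file-full": "read",
    "read-file-range": "read",
    "read-via-bash": "read",
    "search-keyword": "search",
    "edit-source": "edit",
    "run-test-suite": "verify",
    "git-diff": "git",
    "file-cleanup": "housekeeping",
}

def category(intent: str) -> str:
    # every (failed) variant rolls up into the 'failed' category
    if intent.endswith("(failed)"):
        return "failed"
    return INTENT_TO_HIGH_LEVEL.get(intent, "other")
```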
Claude spends ~30% of its steps verifying; GPT-5 spends ~5%. GPT-5 reads the most; Claude cleans up the most.
| phase | claude45_% | gpt5_% |
|---|---|---|
| understand | 47.2% | 57.3% |
| reproduce | 1.8% | 4.0% |
| edit | 10.3% | 15.0% |
| verify | 29.8% | 5.2% |
| cleanup | 6.2% | 0.2% |
The 9 high-level categories are further grouped into 5 phases that represent the broad arc of a trajectory:
understand = read + search. The agent is reading code and searching for information.
reproduce = reproduce. The agent is writing or running reproduction scripts to confirm the bug.
edit = edit. The agent is making source code changes.
verify = verify. The agent is running tests, compiling, or checking its work.
cleanup = git + housekeeping. The agent is reviewing changes (git diff/log) or cleaning up (rm, mv, writing docs).
These phases are used in the stacked area charts (Typical Trajectory Shape) to show how the mix of actions evolves from start to end.
How models verify differs in kind, not just amount. Claude leans on broad test suites and named verify scripts; GPT-5's much smaller verify budget skews toward inline snippets and targeted runs.
| intent | claude45_n | gpt5_n |
|---|---|---|
| run-test-suite | 5942 | 585 |
| run-verify-script | 3420 | 113 |
| create-test-script | 2633 | 18 |
| run-inline-verify | 999 | 696 |
| run-test-specific | 1105 | 370 |
| compile-build | 1088 | 41 |
| edit-test-or-repro | 712 | 243 |
| run-custom-script | 476 | 111 |
| create-verify-script | 321 | 47 |
| syntax-check | 183 | 18 |
The 'verify' category contains ~10 sub-intents. This table shows where each model's verification volume comes from.
run-test-suite: broad test runs (pytest, go test, npm test, mocha) without targeting specific tests.
run-test-specific: targeted test runs using pytest -k or :: to run specific test functions.
run-verify-script: running a script named verify*, check*, validate*, or edge_case*.
create-test-script: creating a new test file (test_*, *test.py, etc.).
run-inline-verify: an inline python -c / node -e snippet that imports project code or runs assertions.
compile-build: go build, go vet, make, npx tsc. Compilation as a verification step.
edit-test-or-repro: editing an existing test or repro file (str_replace on test_* or repro* files).
run-custom-script: running a named script that doesn't match repro/test/verify naming patterns.
create-verify-script: creating a new file named verify*, check*, validate*.
syntax-check: py_compile, compileall, node -c. Quick syntax validation.
Most verify steps yield no determinable outcome. Where the outcome is determinable, Claude's pass rate is 80% versus GPT-5's 43%, and Claude logs ~8x as many determinable verify outcomes (6.6k vs. 783).
| model | pass | fail | unknown | total | pass_rate |
|---|---|---|---|---|---|
| claude45 | 5276 | 1323 | 49949 | 56548 | 80.0% |
| gpt5 | 338 | 445 | 42621 | 43404 | 43.2% |
Only steps classified as one of these intents are evaluated for outcome: run-test-suite, run-test-specific, run-verify-script, run-custom-script, run-inline-verify, compile-build, syntax-check. All other steps get outcome ''.
pass: the observation's last 2000 characters match a framework-specific all-pass pattern.
For pytest: the summary line (e.g. '200 passed in 12.3s') must contain 'passed' and must NOT contain 'failed' or 'error'. If even one test fails ('195 passed, 5 failed'), the outcome is 'fail', not 'pass'.
For Go: all PASS/FAIL lines in output are checked; any FAIL makes it 'fail'.
For Mocha: checks 'N passing' and 'N failing' counts.
For Jest: checks the summary line for 'failed' vs 'passed'.
For compile-build: absence of error patterns in short output (< 200 chars) from go build/make = 'pass'.
For syntax-check: py_compile with no output = 'pass'; any Error/SyntaxError = 'fail'.
fail: the observation matches a failure pattern. In priority order: framework-specific failure summaries (pytest 'failed', Go 'FAIL', Mocha failing > 0), then generic patterns: 'no tests ran', collection errors, tracebacks in the last 500 chars, Node.js throw/error, non-zero exit code.
unknown (''): no pattern matched. This happens when output is from an unrecognized framework, is truncated, is ambiguous (e.g. pytest ran but the summary line was cut off), or when the observation is empty.
pass_rate: pass / (pass + fail), excluding unknowns. This measures: of the verify steps where we could determine the outcome, what fraction had all tests passing?
Important caveat: 'pass' means 'all tests in that run passed', not 'the agent's fix is correct'. SWE-Bench Pro tasks come with existing test suites where most tests already pass on unmodified code. An agent running pytest before making any edits will often get 'pass' because the existing tests pass. This is why first_verify_pass can occur before first_edit (see Section 10).
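The pytest branch of this detector can be sketched as follows. The exact regexes are assumptions; the real logic lives in classify_verify_outcome():

```python
import re

def pytest_outcome(observation: str) -> str:
    """Return 'pass', 'fail', or 'unknown' for a pytest run's observation."""
    tail = observation[-2000:]                       # only the last 2000 chars
    summary = re.search(r"=+ (.*?) =+\s*$", tail, re.M)
    line = summary.group(1) if summary else tail
    if re.search(r"\b\d+ (failed|error)", line):
        return "fail"                                # any failure wins
    if re.search(r"\b\d+ passed\b", line) and "failed" not in line and "error" not in line:
        return "pass"
    return "unknown"                                 # truncated or ambiguous output
```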
Claude runs a verify step after a source edit ~2.8 times per trajectory and re-runs verification without an intervening edit ~13 times; for GPT-5 the figures are ~1.1 and ~0.9. The edit-verify loop is the defining structural difference.
| label | claude45_n | gpt5_n |
|---|---|---|
| seq-verify-rerun-no-edit | 9688 | 626 |
| seq-reread-edited-file | 2751 | 2233 |
| seq-verify-after-edit | 2011 | 826 |
| seq-repro-after-edit | 448 | 923 |
| seq-submit-after-verify | 656 | 379 |
| seq-work-done | 592 | 82 |
| seq-verify-rerun-same-command | 291 | 258 |
| seq-repro-rerun-same-command | 215 | 273 |
| seq-repro-rerun-no-edit | 109 | 292 |
| seq-diagnose-read-after-failed-verify | 16 | 192 |
| seq-diagnose-search-after-failed-verify | 4 | 136 |
| seq-edit-after-failed-verify | 4 | 96 |
Sequence labels classify steps by their context: what happened before, whether edits or verify steps preceded them.
seq-verify-after-edit: a verify step after a source edit. The core edit-then-test loop.
seq-verify-rerun-no-edit: a verify step where no edit happened since the last verify.
seq-edit-after-failed-verify: a source edit after a failed verify step. Fixing what a test revealed.
seq-submit-after-verify: submit after at least one verify step. The agent tested before submitting.
seq-first-all-pass: the first verify-pass after the last source edit. Marks implementation completion.
Method: classify_sequence_layer() in classify_intent.py walks the trajectory maintaining state (has a verify been seen? was there an edit since?).
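The stateful walk can be sketched like this. It is simplified: only three of the labels above are reproduced, and the real state machine in classify_sequence_layer() tracks more:

```python
def sequence_labels(steps):
    """steps: list of (base_intent, category) pairs in trajectory order."""
    labels = []
    edited_since_verify = False
    seen_verify = False
    for intent, cat in steps:
        label = ""
        if cat == "verify":
            if edited_since_verify:
                label = "seq-verify-after-edit"      # the core edit-then-test loop
            elif seen_verify:
                label = "seq-verify-rerun-no-edit"   # re-verify with no new edit
            seen_verify = True
            edited_since_verify = False
        elif cat == "edit":
            edited_since_verify = True
        elif intent == "submit" and seen_verify:
            label = "seq-submit-after-verify"        # agent tested before submitting
        labels.append(label)
    return labels
```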
GPT-5 records a 19.9% failure-step rate. 91.8% of its failures are tool-call friction, and 63.1% come from one shell-wrapper/apply_patch cluster (applypatch hallucination, trailing }, heredoc breakage, generic bash syntax, and the broken pipes they trigger).
| mode | family | Sonnet 4.5 | GPT-5 | GPT share | GPT trajs |
|---|---|---|---|---|---|
| broken pipe (`bash_broken_pipe`) | tool | 429 | 1744 | 20.2% | 348 |
| view_range out of bounds (`strep_invalid_range`) | tool | 213 | 1148 | 13.3% | 416 |
| trailing `}` leak (`bash_trailing_brace`) | tool | 0 | 1125 | 13.0% | 307 |
| apply_patch missing (`apply_patch_cmd_not_found`) | tool | 0 | 968 | 11.2% | 291 |
| str_replace no match (`strep_no_match`) | tool | 344 | 947 | 11.0% | 287 |
| bash syntax error (`bash_syntax_error`) | tool | 87 | 866 | 10.0% | 225 |
| heredoc unterminated (`bash_heredoc_unterminated`) | tool | 0 | 592 | 6.8% | 178 |
| command not found (`bash_command_not_found`) | tool | 48 | 222 | 2.6% | 138 |
| Python traceback (`py_traceback_other`) | code | 387 | 219 | 2.5% | 102 |
| Node error (`node_error`) | code | 226 | 184 | 2.1% | 71 |
| str_replace path missing (`strep_file_not_found`) | tool | 2 | 125 | 1.4% | 107 |
| apply_patch shell syntax (`apply_patch_shell_syntax`) | tool | 0 | 107 | 1.2% | 58 |
| SyntaxError (`py_syntax_error`) | code | 45 | 95 | 1.1% | 56 |
| test suite failed (`test_failed`) | test | 806 | 91 | 1.1% | 57 |
| ModuleNotFoundError (`py_module_not_found`) | code | 99 | 84 | 1.0% | 51 |
| bash quote nesting (`bash_quote_nesting`) | tool | 0 | 51 | 0.6% | 20 |
| IndentationError (`py_indentation_error`) | code | 1 | 32 | 0.4% | 19 |
| create over existing file (`strep_create_exists`) | tool | 2 | 20 | 0.2% | 20 |
| bash `!` history expansion (`bash_history_expansion`) | tool | 19 | 19 | 0.2% | 13 |
| str_replace not unique (`strep_multiple_matches`) | tool | 3 | 4 | 0.0% | 3 |
| apply_patch other (`apply_patch_other`) | tool | 0 | 4 | 0.0% | 2 |
| test collection error (`test_collection_error`) | test | 0 | 0 | 0.0% | 0 |
Examples (verbatim actions and observations from GPT-5 trajectories):
action: bash -lc "grep -R --line-number \"\\.\\.\\/posts\\/cache\" /app/src | sed -n '1,200p'"}
observation: sed: -e expression #1, char 7: unexpected `}' grep: write error: Broken pipe

action: bash -lc "rg -n --no-heading "src/posts/cache\.js|require\(['\"]\.\.\/posts\/cache['\"]\)|require\(['\"]\.\/cache['\"]\)" /app | sed -n '1,200p'"}
observation: bash: require([\"]\.\.\/posts\/cache["]): No such file or directory bash: require([\"]\.\/cache["]) /app | sed -n '1,200p'}: No such file or directory bash: line 1: rg: command not found

action: bash -lc "applypatch << 'PATCH'\n*** Begin Patch\n*** Update File: /app/src/meta/index.js\n@@\n-Meta.userOrGroupExists = Meta.slugTaken; // backwards compatiblity\n+// Alias with same single/array behaviour\n+Meta.userOrGroupExists = Meta.slugTaken; // backwards compatibility\n*** End Patch\nPATCH"}
observation: bash: line 1: warning: here-document at line 1 delimited by end-of-file (wanted `PATCHn***') bash: line 1: applypatch: command not found bash: line 1: //: Is a directory bash: line 1: //: Is a directory

action: bash -lc "python - << 'PY'\nimport os, re\nfor root, dirs, files in os.walk('/app/src'):\n for f in files:\n p=os.path.join(root,f)\n try:\n s=open(p,'rb').read()\n except Exception as e:\n continue\n if b'slugTaken' in s or b'userOrGroupExists' in s or b'existsBySl…
observation: bash: line 1: warning: here-document at line 1 delimited by end-of-file (wanted `PYnimport') bash: -c: line 1: syntax error near unexpected token `(' bash: -c: line 1: `python - << 'PY'\nimport os, re\nfor root, dirs, files in os.walk('/app/src'):\n for f in files:\n p=os.path.join(root,f)\n try:\n s=open(p,'rb').read()\n except Exception as e:\n continue\n if b's…

action: str_replace_editor view /app/src/user/admin.js --view_range 1 200
observation: Invalid `view_range`: [1, 200]. Its second element `200` should be smaller than the number of lines in the file: `92`
This section uses data/failure_modes.json, produced by scripts/build_failure_modes.py. Each trajectory step is classified as either not-a-failure or one failure mode.
Counts are step counts, not unique incidents. A single trajectory can contribute many failure steps, and one underlying shell mistake can fan out into multiple observed failures.
Families: tool means the harness/tool call itself failed; code means the agent ran code that crashed; test means a test runner reported failures.
Interpretation caveat: bash_broken_pipe is often a secondary symptom. In GPT-5, many of those steps are downstream of the same wrapper pathologies that also produce trailing-brace, heredoc, or quoting failures.
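The step-level matcher can be sketched as a priority-ordered pattern scan. The pattern list here is an assumed, abbreviated subset of what build_failure_modes.py actually matches:

```python
import re

# Priority order matters: trailing-brace is checked before broken-pipe so a
# primary wrapper error is not misfiled under its downstream symptom.
FAILURE_PATTERNS = [
    ("apply_patch_cmd_not_found", r"applypatch: command not found"),
    ("bash_heredoc_unterminated", r"here-document at line \d+ delimited by end-of-file"),
    ("bash_trailing_brace",       r"unexpected `\}'"),
    ("bash_broken_pipe",          r"Broken pipe"),
    ("bash_syntax_error",         r"syntax error near unexpected token"),
    ("strep_invalid_range",       r"Invalid `view_range`"),
    ("py_traceback_other",        r"Traceback \(most recent call last\)"),
]

def failure_mode(observation: str) -> str:
    """Return the first matching failure mode, or '' if the step is not a failure."""
    for mode, pattern in FAILURE_PATTERNS:
        if re.search(pattern, observation):
            return mode
    return ""
```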
The distinctive GPT-5 signature is not ordinary test failure. It is repeated interaction friction around shell wrapping and hallucinated applypatch usage.
| model | wd+resolved | wd+unresolved | no_wd+resolved | no_wd+unresolved | total |
|---|---|---|---|---|---|
| claude45 | 276 | 316 | 43 | 95 | 730 |
| gpt5 | 47 | 35 | 218 | 430 | 730 |
A confusion matrix crossing two signals: whether the agent reached 'work-done' and whether the benchmark evaluated the patch as correct.
work-done: the trajectory contains a seq-first-all-pass label. Specifically: we find the last step classified as a source edit (edit-source, insert-source, apply-patch, edit-via-inline-script). Then we check if any verify step after that point has a 'pass' outcome (all tests passed, per the rules in Section 7). If yes, the trajectory is 'work-done'. This is a stronger signal than first_verify_pass (Section 10), which can fire before any edits because existing tests pass on unmodified code.
resolved: the submitted patch actually fixes the failing tests, as judged by the SWE-Bench Pro benchmark evaluation (from agent_runs_data.csv). This is different from 'submitted', which only means the agent produced a patch.
wd+resolved: the agent's tests passed after its last edit, and the benchmark confirmed the patch is correct. The best case.
wd+unresolved: tests passed but the patch was wrong. The agent's own verification was a false positive.
no_wd+resolved: the agent never reached a clean test pass after its final edit, yet the benchmark accepted the patch. The agent submitted without confirmation that its code works.
no_wd+unresolved: the agent neither achieved passing tests nor produced a correct patch.
The stacked bars show how each model's 730 trajectories split across these four outcomes. The no_wd+resolved cell is notably large for GPT-5 (218 of its 265 resolved tasks, vs. 43 of 319 for Claude), meaning the agent's own test-passing signal is not a reliable predictor of benchmark resolution.
Method: 'work-done' = find the last step in SOURCE_EDIT_INTENTS (edit-source, insert-source, apply-patch, edit-via-inline-script), then scan forward for any step where classify_verify_outcome() returns 'pass'. If found, work-done is true. 'resolved' comes from the benchmark CSV (agent_runs_data.csv, field metadata.resolved), which records whether the submitted patch actually made the failing tests pass when evaluated by the benchmark harness.
Caveat: work-done can be a false positive. The agent's tests might pass because the test suite doesn't cover the specific failure the task requires fixing. The agent thinks it's done (tests pass), but the benchmark's evaluation finds the bug isn't actually fixed.
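The work-done check reduces to a few lines. The step representation here, (intent, verify_outcome) pairs, is an assumption for illustration:

```python
SOURCE_EDIT_INTENTS = {"edit-source", "insert-source", "apply-patch", "edit-via-inline-script"}

def work_done(steps) -> bool:
    """True if some verify step strictly after the last source edit passed.

    steps: list of (intent, verify_outcome) where verify_outcome is
    'pass', 'fail', 'unknown', or '' for non-verify steps.
    """
    last_edit = max((i for i, (intent, _) in enumerate(steps)
                     if intent in SOURCE_EDIT_INTENTS), default=None)
    if last_edit is None:
        return False                    # no source edit at all
    return any(outcome == "pass" for _, outcome in steps[last_edit + 1:])
```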
| marker | claude45_med | claude45_p25 | claude45_p75 | claude45_n | gpt5_med | gpt5_p25 | gpt5_p75 | gpt5_n |
|---|---|---|---|---|---|---|---|---|
| first_edit | 34.6 | 27.8 | 42.6 | 701 | 49.5 | 34.2 | 63.8 | 642 |
| last_edit | 61.9 | 47.9 | 78.1 | 701 | 89.4 | 78.4 | 95.4 | 642 |
| first_verify | 23.1 | 17.3 | 30.9 | 718 | 59.1 | 33.3 | 78.4 | 391 |
| first_verify_pass | 28.8 | 19.7 | 47.7 | 639 | 71.8 | 40.7 | 85.2 | 130 |
| submit | 100.0 | 100.0 | 100.0 | 644 | 100.0 | 100.0 | 100.0 | 443 |
This alternate view drops first_verify_pass, which can fire on baseline tests before any edits, and shows the last verify step instead, making the post-edit verification tail easier to see.
| marker | claude45_med | claude45_p25 | claude45_p75 | claude45_n | gpt5_med | gpt5_p25 | gpt5_p75 | gpt5_n |
|---|---|---|---|---|---|---|---|---|
| first_edit | 34.6 | 27.8 | 42.6 | 701 | 49.5 | 34.2 | 63.8 | 642 |
| last_edit | 61.9 | 47.9 | 78.1 | 701 | 89.4 | 78.4 | 95.4 | 642 |
| first_verify | 23.1 | 17.3 | 30.9 | 718 | 59.1 | 33.3 | 78.4 | 391 |
| last_verify | 96.1 | 93.7 | 97.4 | 718 | 93.5 | 81.0 | 96.7 | 391 |
| submit | 100.0 | 100.0 | 100.0 | 644 | 100.0 | 100.0 | 100.0 | 443 |
Key events in each trajectory, expressed as a percentage of the way through (0% = first step, 100% = last step). Aggregated across all trajectories per model.
The timeline shows median positions as shaped markers, with faint bands for the interquartile range (p25-p75). Hover over markers for exact values.
first_edit: the first step whose base intent is one of: edit-source (str_replace on a source file), insert-source (str_replace_editor insert), apply-patch, or edit-via-inline-script. Does not include create-file or edit-test-or-repro, which are classified differently.
last_edit: the last step matching those same intents. The gap between last_edit and submit is the 'tail' where the agent is verifying, cleaning up, or submitting but no longer changing source code.
first_verify: the first step whose intent is in SEQUENCE_VERIFY_INTENTS: run-test-suite, run-test-specific, run-verify-script, run-custom-script, compile-build, syntax-check, run-inline-verify.
first_verify_pass: the first step where classify_verify_outcome() returns 'pass' (see Section 7 for what 'pass' means). This does NOT mean 'the agent's fix worked'. It means 'the first time a test/build command produced output where all tests passed'. Because SWE-Bench Pro tasks have existing test suites that mostly pass on unmodified code, an agent that runs pytest before making any edits will often get a 'pass' here. This is why first_verify_pass can appear before first_edit for Claude (median 28.8% vs 34.6%): Claude runs the existing test suite early as a diagnostic baseline. For a marker that means 'the fix works', see work_done in Section 9, which requires a verify pass after the last source edit.
last_verify: the last step whose intent is in SEQUENCE_VERIFY_INTENTS. The alternate view replaces first pass with last verify to show where verification actually finishes, which makes the late verification tail clearer for Claude.
submit: the first step with intent 'submit'.
_med / _p25 / _p75: median, 25th percentile, and 75th percentile across trajectories where the event occurred.
_n: number of trajectories where this event occurred. Claude has 639 for first_verify_pass (out of 730), meaning 91 trajectories never produced a fully passing test run. GPT-5 has only 130, meaning most GPT-5 trajectories either never ran tests or never achieved a clean pass.
Method: for each trajectory, scan for the first (or last) step matching the relevant intent set, compute step_index / (total_steps - 1) * 100 to get a percentage position, then take the median across all trajectories where the event occurred.
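The method above can be sketched in a few lines of Python. This is a minimal illustration, not the analysis code itself: the trajectory/step structure (`steps` as a list of dicts with an `"intent"` key) and the `EDIT_INTENTS` constant are assumptions for the example; only the intent names and the `step_index / (total_steps - 1) * 100` formula come from the text.

```python
# Illustrative sketch of the event-position computation described above.
# The data shapes (list-of-steps dicts, "intent" field) are assumed, not
# taken from the actual analysis codebase.
from statistics import median

# Intent names from the first_edit definition in the text.
EDIT_INTENTS = {"edit-source", "insert-source", "apply-patch", "edit-via-inline-script"}

def event_position(steps, intents, last=False):
    """Position (0-100%) of the first (or last) step whose intent is in
    `intents`, or None if the event never occurred in this trajectory."""
    matches = [i for i, step in enumerate(steps) if step["intent"] in intents]
    if not matches:
        return None
    idx = matches[-1] if last else matches[0]
    # 0% = first step, 100% = last step; guard the single-step edge case.
    return idx / (len(steps) - 1) * 100 if len(steps) > 1 else 0.0

def aggregate(trajectories, intents, last=False):
    """Median position and count across trajectories where the event occurred."""
    positions = [p for t in trajectories
                 if (p := event_position(t["steps"], intents, last)) is not None]
    return {"median": median(positions), "n": len(positions)} if positions else None
```

Running `aggregate(trajs, EDIT_INTENTS)` gives the first_edit row; passing `last=True` gives last_edit, and swapping in the verify intent set gives the first_verify/last_verify rows.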
Each trajectory is divided into 20 time-slices. Cell intensity shows what proportion of steps in each slice belong to each category, normalized per column.
Read left-to-right as beginning to end of trajectory. Brighter cells indicate the dominant action in that time-slice.
Categories: read, search, reproduce, edit, verify, git, housekeeping. The failed and other categories are excluded.
Normalized per column: within each time-slice, the percentages show each category's share relative to only the displayed categories.
Method: each trajectory's step sequence is divided into 20 equal bins. Per bin, we count the fraction of steps belonging to each category, then average across all trajectories for that model.
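The binning method can be sketched as follows. A minimal illustration under assumed data shapes (a trajectory as a flat list of per-step category labels); the category list and the exclusion of failed/other steps follow the text, everything else is for demonstration.

```python
# Sketch of the 20-bin time-slice aggregation described above.
# Input shape (one category label per step) is an assumption for the example.
N_BINS = 20
CATEGORIES = ["read", "search", "reproduce", "edit", "verify", "git", "housekeeping"]

def bin_fractions(step_categories):
    """Split one trajectory's per-step categories into N_BINS equal slices and
    return, per slice, each displayed category's share of that slice's steps."""
    n = len(step_categories)
    grid = []
    for b in range(N_BINS):
        lo, hi = b * n // N_BINS, (b + 1) * n // N_BINS
        # Keep only displayed categories (failed/other are excluded).
        chunk = [c for c in step_categories[lo:hi] if c in CATEGORIES]
        grid.append({c: chunk.count(c) / len(chunk) if chunk else 0.0
                     for c in CATEGORIES})
    return grid

def average_grids(grids):
    """Average the per-bin fractions across all trajectories for one model."""
    return [{c: sum(g[b][c] for g in grids) / len(grids) for c in CATEGORIES}
            for b in range(N_BINS)]
```

Because each slice is normalized over only the displayed categories, the shares within a column sum to 1 (when the slice has any displayed steps), which is what makes the brightest cell the dominant action for that slice.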
| repo | Sonnet 4.5_n | Sonnet 4.5_avg | Sonnet 4.5_res% | Sonnet 4.5_ver% | GPT-5_n | GPT-5_avg | GPT-5_res% | GPT-5_ver% |
|---|---|---|---|---|---|---|---|---|
| ansible | 96 | 82.5 | 55.2 | 33.7 | 96 | 65.5 | 49.0 | 8.8 |
| openlibrary | 91 | 74.1 | 54.9 | 38.0 | 91 | 46.8 | 42.9 | 12.2 |
| flipt | 85 | 73.9 | 32.9 | 27.6 | 85 | 62.2 | 17.6 | 1.7 |
| qutebrowser | 79 | 84.1 | 65.8 | 42.0 | 79 | 64.4 | 64.6 | 10.0 |
| teleport | 76 | 74.2 | 30.3 | 22.5 | 76 | 61.8 | 17.1 | 1.9 |
| webclients | 65 | 82.2 | 41.5 | 22.9 | 65 | 43.9 | 21.5 | 2.9 |
| vuls | 62 | 70.1 | 43.5 | 25.5 | 62 | 54.0 | 32.3 | 1.9 |
| navidrome | 57 | 74.2 | 38.6 | 28.6 | 57 | 61.2 | 35.1 | 2.3 |
| element-web | 55 | 75.4 | 32.7 | 27.8 | 55 | 60.6 | 36.4 | 2.4 |
| NodeBB | 44 | 77.4 | 25.0 | 23.6 | 44 | 72.6 | 38.6 | 3.7 |
| tutanota | 20 | 92.1 | 40.0 | 19.3 | 20 | 78.5 | 45.0 | 3.3 |
| repo | min res% | max res% | spread (pp) | best model |
|---|---|---|---|---|
| webclients | 21.5 | 41.5 | 20.0 | Sonnet 4.5 |
| flipt | 17.6 | 32.9 | 15.3 | Sonnet 4.5 |
| NodeBB | 25.0 | 38.6 | 13.6 | GPT-5 |
| teleport | 17.1 | 30.3 | 13.2 | Sonnet 4.5 |
| openlibrary | 42.9 | 54.9 | 12.0 | Sonnet 4.5 |
| vuls | 32.3 | 43.5 | 11.2 | Sonnet 4.5 |
| ansible | 49.0 | 55.2 | 6.2 | Sonnet 4.5 |
| tutanota | 40.0 | 45.0 | 5.0 | GPT-5 |
| element-web | 32.7 | 36.4 | 3.7 | GPT-5 |
| navidrome | 35.1 | 38.6 | 3.5 | Sonnet 4.5 |
| qutebrowser | 64.6 | 65.8 | 1.2 | Sonnet 4.5 |
Metrics broken down by source repository. SWE-Bench Pro tasks come from 11 open-source repos. The dot plot shows whether one model dominates uniformly or whether there is repo-specific variation.
Dot plot: each dot is one model's resolve rate on that repo. When dots cluster, models perform similarly on that repo; when they spread apart, the repo differentiates models. Repos with fewer than 3 tasks per model are omitted from the plot to avoid noisy rates.
Spread table: the gap (in percentage points) between the best and worst model on each repo. Large spreads indicate repos where model choice matters most.
_n: number of task instances from this repo.
_avg: average steps per trajectory.
_res%: resolve rate (percentage of trajectories where the submitted patch fixes the failing tests).
_ver%: percentage of steps spent on verify actions.
Sorted by total number of instances across all models (most common repos first). All 11 repos are shown.
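The spread table above is a straightforward reduction over per-repo resolve rates. A minimal sketch, assuming an input mapping of repo name to per-model resolve rates (the input format is invented for illustration; the numbers in the test come from the tables above):

```python
# Sketch of the spread-table computation: per repo, the gap in percentage
# points between the best and worst model, sorted largest-spread first.
def spread_table(rates):
    """rates: {repo: {model_name: resolve_rate_percent}} (assumed format)."""
    rows = []
    for repo, by_model in rates.items():
        best = max(by_model, key=by_model.get)          # model with highest rate
        lo, hi = min(by_model.values()), max(by_model.values())
        rows.append((repo, lo, hi, round(hi - lo, 1), best))
    # Large spreads first: repos where model choice matters most.
    return sorted(rows, key=lambda row: row[3], reverse=True)
```

With two models the spread is simply the absolute difference between their rates; the same code generalizes to more models without change.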