Proportion of steps in each high-level category, normalised within each model so that models with different total step counts are directly comparable.
Frequency per 100 steps, compared across the selected models. For Pi, the git rows use a finer-grained, more semantic sub-taxonomy: GitHub context, repo inspection, diff review, sync/integrate, local state change, and publish.
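As a rough sketch of the per-100-steps normalisation described above (function and category names are illustrative, not the repo's actual code):

```python
from collections import Counter

def per_100_steps(step_categories):
    """Normalise raw category counts to a rate per 100 steps,
    so models with different session lengths stay comparable.
    `step_categories` is one label per step (hypothetical labels)."""
    counts = Counter(step_categories)
    total = len(step_categories)
    return {cat: 100 * n / total for cat, n in counts.items()}

# e.g. a 250-step session with 50 git steps yields a git rate of 20.0
rates = per_100_steps(["git"] * 50 + ["edit"] * 200)
```

Dividing by the session's own step count before scaling to 100 is what makes a short run and a capped 250-step run land on the same axis.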
Same intent aggregation as Section 2, but using every available strict single-model Pi session for the selected models, not just sessions whose final session name starts with "Issue:".
Cumulative share of runs that finished within N steps. The dashed line marks the 250-step cap.
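The cumulative curve is a plain empirical CDF over run lengths. A minimal sketch, assuming a list of per-run step counts (the helper name is mine):

```python
def cumulative_share(step_counts, max_steps=250):
    """Fraction of runs that finished within N steps, for N = 1..max_steps.
    The 250-step default mirrors the cap shown as a dashed line."""
    total = len(step_counts)
    return [sum(1 for s in step_counts if s <= n) / total
            for n in range(1, max_steps + 1)]

# four hypothetical runs; one finishes by step 10, so the curve is 0.25 there
curve = cumulative_share([10, 40, 40, 250])
```

Runs that hit the cap still count as finished at 250, which is why the curve reaches 1.0 at the dashed line rather than before it.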
Each model is shown as a pair: benchmark (agent alone) above, maintainer-guided Pi sessions below. Markers show medians only, with no uncertainty bands: △ = authorization, ○ = steering, □ = closeout, ◆ = first edit, ◇ = last edit.
Where a direct public benchmark run is unavailable in this repo, the benchmark row falls back to the closest family baseline we do have: GPT-5 for the gpt-5.* models and Sonnet 4.5 for the claude-opus-4-* models.