Trajectory Analytics

Pi transcript sessions — strict single-model issue sessions across 5 models

1. High-Level Action Frequencies

Proportion of steps in each high-level category, normalised per model so trajectories with different step counts remain comparable.
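A minimal sketch of that normalisation, assuming each step carries a single high-level category label. The category names here are illustrative, not the report's actual taxonomy.

```python
from collections import Counter

def action_proportions(steps):
    """Fraction of steps falling in each high-level category.

    `steps` is a list of per-step category labels (illustrative names).
    Dividing by the total step count makes models with different
    trajectory lengths directly comparable.
    """
    counts = Counter(steps)
    total = sum(counts.values())
    return {cat: n / total for cat, n in counts.items()}

# Two models with very different step counts on the same scale:
a = action_proportions(["read", "edit", "read", "run"])   # 4 steps
b = action_proportions(["read", "edit"] * 50)             # 100 steps
```

Because each distribution sums to 1, the per-category bars can be stacked or overlaid regardless of how long each model's trajectories ran.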

2. Intent Comparison

Frequency per 100 steps, compared across the selected models. For Pi, the git rows use a more semantic sub-taxonomy: GitHub context, repo inspection, diff review, sync/integrate, local state change, and publish.
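The per-100-steps rate is a simple rescaling of raw counts; a sketch, with illustrative intent names from the sub-taxonomy above:

```python
def rate_per_100_steps(intent_counts, total_steps):
    """Scale raw intent counts to a per-100-steps rate so models with
    different trajectory lengths can be compared side by side."""
    return {intent: 100.0 * n / total_steps
            for intent, n in intent_counts.items()}

rates = rate_per_100_steps({"diff review": 6, "publish": 2},
                           total_steps=400)
# {"diff review": 1.5, "publish": 0.5}
```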

3. Intent Comparison (all single-model sessions)

Same intent aggregation as Section 2, but using every available strict single-model Pi session for the selected models, not just sessions whose final session name starts with "Issue:".

4. Steps per Trajectory, by Model

Cumulative share of runs that finished within N steps. Dashed line marks the 250-step cap.
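The curve is an empirical cumulative distribution over final step counts; a sketch, assuming one final step count per run:

```python
def cumulative_share_within(step_counts, n):
    """Share of runs that finished in at most n steps.

    `step_counts` holds the final step count of each run; runs that hit
    the cap simply record the cap value (250 here).
    """
    return sum(1 for s in step_counts if s <= n) / len(step_counts)

runs = [40, 120, 250, 250, 90]
cumulative_share_within(runs, 100)  # 0.4 (2 of 5 runs)
```

Evaluating this for every N from 1 to the cap traces out the plotted curve, which by construction reaches 1.0 at the 250-step dashed line.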

5. Typical Trajectory Shape

Each model is shown as a pair: benchmark (agent alone) above, maintainer-guided Pi sessions below. Markers are median-only, with no bands: △ = authorization, ○ = steering, □ = closeout, ◆ = first edit, ◇ = last edit.
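The median-only marker positions can be sketched as follows, assuming each session is a list of per-step event labels and that a marker sits at the median step index of its event's first occurrence. The event names and helper are illustrative, not the pipeline's actual code.

```python
import statistics

def median_marker(sessions, event):
    """Median step index of the first occurrence of `event` across
    sessions; sessions that never contain the event are skipped.
    Returns None if no session contains it."""
    positions = [s.index(event) for s in sessions if event in s]
    return statistics.median(positions) if positions else None

sessions = [
    ["authorization", "edit", "closeout"],
    ["edit", "closeout"],
    ["authorization", "steering", "edit"],
]
median_marker(sessions, "edit")  # 1
```

A last-edit marker (◇) would use the last occurrence instead, e.g. `len(s) - 1 - s[::-1].index(event)`.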

Where a direct public benchmark run is unavailable in this repo, the benchmark row uses the closest family baseline we do have: GPT-5 for the gpt-5.* models and Sonnet 4.5 for the claude-opus-4-* models.