Trajectory Analytics

SWE-Bench Pro — 4-model comparison

1. High-Level Action Frequencies

Share of each model's steps that fall into each high-level action category. Counts are normalised by the model's total step count, so models remain comparable even when their trajectories differ in length.

Frequency per 100 steps, compared across the selected models.
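As a rough illustration of the normalisation, the sketch below converts raw per-step category labels into frequencies per 100 steps. The function name `freq_per_100`, the input layout (a mapping from model name to a flat list of category labels), and the example labels are assumptions made for this example, not the dashboard's actual data model.

```python
from collections import Counter


def freq_per_100(steps_by_model):
    """Return {model: {category: occurrences per 100 steps}}.

    steps_by_model is a hypothetical layout: model name -> list with one
    high-level category label per step, pooled over all of that model's runs.
    """
    out = {}
    for model, steps in steps_by_model.items():
        total = len(steps)
        counts = Counter(steps)
        # Dividing by the model's own step total is what makes models with
        # different step counts comparable.
        out[model] = (
            {cat: 100.0 * n / total for cat, n in counts.items()} if total else {}
        )
    return out


# Made-up labels, purely for illustration.
example = {
    "model_a": ["read", "read", "edit", "test"],
    "model_b": ["read", "edit", "edit", "edit", "test", "test", "test", "test"],
}
rates = freq_per_100(example)
```

Under this scheme each model's values sum to 100, so a category at 12.5 means roughly one step in eight, whatever the model's total step count.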

Only rows whose maximum value (maxV) exceeds 1.5 per 100 steps are shown.
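The row filter can be sketched as follows, assuming that maxV means the largest per-100-steps value a row reaches across the selected models. That reading, the function name `filter_rows`, and the table layout are assumptions for illustration only.

```python
def filter_rows(table, threshold=1.5):
    """Keep rows whose maximum value across models exceeds the threshold.

    table is a hypothetical layout: {category: {model: value per 100 steps}}.
    The comparison is strict, matching "maxV > 1.5".
    """
    return {
        category: by_model
        for category, by_model in table.items()
        if max(by_model.values(), default=0.0) > threshold
    }


# Made-up values, purely for illustration.
table = {
    "edit": {"model_a": 12.0, "model_b": 9.5},
    "git": {"model_a": 0.4, "model_b": 1.5},
    "search": {"model_a": 1.0, "model_b": 2.1},
}
visible = filter_rows(table)
```

Filtering on the maximum rather than the mean keeps any action that at least one model uses often, even when the other models rarely use it.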

Cumulative share of runs that finished within N steps. Dashed line marks the 250-step cap.
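A curve like this is an empirical cumulative distribution over run lengths. The sketch below is one minimal way to compute it. The function name `cumulative_share` and the convention that a run stopped by the cap is recorded with exactly 250 steps are assumptions, not details taken from the dashboard.

```python
def cumulative_share(step_counts, cap=250):
    """Return a list whose entry at index N-1 is the share of runs with <= N steps.

    step_counts: one step total per run. Runs that hit the cap are assumed
    to be recorded with a step count equal to the cap.
    """
    n_runs = len(step_counts)
    if n_runs == 0:
        return []
    shares = []
    for n in range(1, cap + 1):
        finished = sum(1 for s in step_counts if s <= n)
        shares.append(finished / n_runs)
    return shares


# Made-up run lengths, purely for illustration.
curve = cumulative_share([10, 40, 40, 250])
```

With this convention the curve always reaches 1.0 at the cap, so a jump at the final point indicates how many runs were cut off rather than finishing on their own.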

Stacked area chart: how the mix of actions evolves from start to end of the average trajectory. Panels ordered by resolve rate (descending).
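One plausible way to build the data behind such a chart is to map every step to its relative position in its own trajectory, group positions into a fixed number of bins, and compute the category shares within each bin. The sketch below does that. The relative-position binning, the bin count, and the name `action_mix` are assumptions, since the page does not say how the "average trajectory" is constructed.

```python
from collections import Counter


def action_mix(trajectories, n_bins=10):
    """Return a list of {category: share} dicts, one per position bin.

    trajectories: list of runs, each a list of category labels in step order.
    Each step is assigned to a bin by its relative position (start to end),
    so runs of different lengths contribute to the same horizontal axis.
    """
    bins = [Counter() for _ in range(n_bins)]
    for traj in trajectories:
        length = len(traj)
        for i, category in enumerate(traj):
            # Integer arithmetic maps step i of `length` onto bins 0..n_bins-1.
            bins[i * n_bins // length][category] += 1
    mix = []
    for counts in bins:
        total = sum(counts.values())
        mix.append({c: n / total for c, n in counts.items()} if total else {})
    return mix


# Made-up trajectories, purely for illustration.
mix = action_mix([["read", "read", "edit", "test"], ["read", "edit"]], n_bins=2)
```

Each bin's shares sum to 1, which is what lets the areas stack to a constant height. Binning by absolute step index instead would be an equally valid choice and would emphasise what happens near the step cap.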