Trajectory Analytics

SWE-Bench Pro — 4-model comparison

1. High-Level Action Frequencies

Proportion of each model's steps that fall into each high-level action category. Proportions are normalised per model, so models are comparable despite different step counts.
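As a rough illustration of the normalisation, the sketch below divides each category count by the model's total step count. The function name `action_proportions` and the category labels (`read`, `edit`, `test`) are hypothetical and chosen for the example; the dashboard's actual categories and pipeline are not specified here.

```python
from collections import Counter


def action_proportions(steps):
    """Return the share of steps in each category (shares sum to 1)."""
    counts = Counter(steps)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {category: n / total for category, n in counts.items()}


# Hypothetical data: two models with different step counts.
model_a = ["read", "read", "edit", "test"]
model_b = ["read", "edit", "edit", "edit", "test", "test", "test", "test"]

props_a = action_proportions(model_a)
props_b = action_proportions(model_b)
```

Because each model's counts are divided by its own total, a model that takes twice as many steps does not automatically show larger bars.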

2. Intent Comparison

Frequency of each intent per 100 steps, compared across the selected models.
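One plausible reading of "per 100 steps" is a simple rate: occurrences of an intent divided by the model's total steps, times 100. The sketch below assumes that reading; `rate_per_100_steps` and the intent names are hypothetical, not taken from the dashboard.

```python
def rate_per_100_steps(intent_counts, total_steps):
    """Convert raw intent counts into occurrences per 100 steps."""
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    return {
        intent: 100.0 * n / total_steps
        for intent, n in intent_counts.items()
    }


# Hypothetical counts for one model over 1200 steps in total.
counts = {"explore": 30, "fix": 12}
rates = rate_per_100_steps(counts, total_steps=1200)
```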

3. Steps per Trajectory, by Model

Cumulative share of runs that finished within N steps. Dashed line marks the 250-step cap.
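The curve described here is an empirical cumulative distribution of step counts. A minimal sketch follows, assuming that runs which hit the cap are recorded with a step count equal to the cap; `share_finished_within` and the sample data are hypothetical.

```python
STEP_CAP = 250  # cap stated in the caption


def share_finished_within(step_counts, n):
    """Fraction of runs whose step count is at most n."""
    if not step_counts:
        raise ValueError("step_counts must not be empty")
    return sum(1 for s in step_counts if s <= n) / len(step_counts)


# Hypothetical step counts for five runs; two stopped at the cap.
runs = [40, 75, 120, 250, 250]
curve = [(n, share_finished_within(runs, n)) for n in (50, 100, STEP_CAP)]
```

Under that assumption the curve always reaches 1.0 at the cap, so the jump at the dashed line shows what share of runs were cut off rather than finishing on their own.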

4. Typical Trajectory Shape

Stacked area chart showing how the mix of actions evolves from the start to the end of the average trajectory. Panels are ordered by resolve rate (descending).
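One way to build such a shape is to map every step to its relative position in its trajectory, group positions into a fixed number of bins, and compute the action mix per bin. The sketch below does this by pooling steps across trajectories, which weights long trajectories more heavily; whether the dashboard pools or averages per trajectory is not stated, so treat this as an assumption. The name `trajectory_shape`, the bin count, and the action labels are hypothetical.

```python
from collections import Counter


def trajectory_shape(trajectories, n_bins=10):
    """Pool steps by relative position and return per-bin action shares."""
    bins = [Counter() for _ in range(n_bins)]
    for steps in trajectories:
        length = len(steps)
        for i, action in enumerate(steps):
            # Relative position i / length in [0, 1) mapped to a bin index.
            b = min(i * n_bins // length, n_bins - 1)
            bins[b][action] += 1

    shape = []
    for counter in bins:
        total = sum(counter.values())
        if total == 0:
            shape.append({})  # no steps landed in this bin
        else:
            shape.append({a: c / total for a, c in counter.items()})
    return shape


# Hypothetical trajectories of different lengths.
trajectories = [["read", "read", "edit", "test"], ["read", "test"]]
shape = trajectory_shape(trajectories, n_bins=2)
```

Each entry of `shape` is one vertical slice of the stacked area chart: the shares within a bin sum to 1, so the stacked bands fill the full height.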