Good solution. Btw, once you've identified the high-signal trajectories, you can also pair them with counterfactual continuations (what the agent should have done at the point of failure) to construct preference pairs for DPO. So the signals don't just act as a debugging tool; they also become a source of training data.
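A minimal sketch of that pairing, assuming each trajectory is a list of role/content steps and that the failing step has already been located. `Step`, `build_preference_pair`, and the toy trajectory are hypothetical names and data made up for illustration; the `prompt`/`chosen`/`rejected` keys follow the layout commonly used for DPO datasets, so adjust them to whatever your trainer expects.

```python
from dataclasses import dataclass


@dataclass
class Step:
    role: str      # e.g. "user", "assistant", "tool" (assumed roles)
    content: str


def build_preference_pair(trajectory, failure_index, counterfactual):
    """Split a trajectory at the failing step and return one preference record."""
    if not 0 <= failure_index < len(trajectory):
        raise IndexError("failure_index is outside the trajectory")
    # Context shared by both continuations: everything before the failure.
    prompt = "\n".join(f"{s.role}: {s.content}" for s in trajectory[:failure_index])
    return {
        "prompt": prompt,
        "chosen": counterfactual,                       # what the agent should have done
        "rejected": trajectory[failure_index].content,  # what it actually did
    }


# Toy trajectory, invented purely for illustration.
trajectory = [
    Step("user", "Cancel my order."),
    Step("assistant", "Sure, I have issued a refund."),
]
pair = build_preference_pair(
    trajectory,
    failure_index=1,
    counterfactual="Let me look up the order before cancelling it.",
)
```

Cutting at the failure point keeps the shared prefix identical for both continuations, so the preference signal comes only from the step where the agent went wrong.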
High-Signal Trajectories and DPO for Agent Optimization