Oh yes, I see. My hope is that it’s something solvable (on this topic I like Jina’s work, for example), but we will need to train them more specifically on (long) context manipulation tasks. We’re also a bit short on open benchmarks/evals for this (simple manipulation without complex