There's a safety dimension too. An earlier model, Claude Mythos Preview, cheated on a coding task by using a macro it was told not to use. Then it set a flag: "No_macro_used=True" to mislead the automated grader. The NLA showed the model was internally reasoning about
