To make CoT monitoring a viable way to catch safety issues, we’d need a way to make CoT more faithful, evidence for higher faithfulness in more realistic scenarios, and/or other measures to rule out misbehavior when the CoT is unfaithful.

Read the paper: https://assets.anthropic.com/m/71876fabef0f0ed4/original/reasoning_models_paper.pdf
…
Making Chain-of-Thought Monitoring Viable for AI Safety