Read more about Model Spec Midtraining: https://alignment.anthropic.com/2026/msm
Or read the full study: https://arxiv.org/abs/2605.02087
@anthropicai
-
Anthropic Model Spec Midtraining Study and Alignment Research
-

Model Specs and Constitutions Drive Better AI Alignment Generalization
Using MSM, we can also empirically study which model specs or constitutions yield the best generalization from alignment training. Specifying rules works to some extent, but explaining the values underlying those rules (or adding more detailed subrules) is even better.
-

MSM Training Reduces Unsafe Agentic Actions in AI Chatbots
A more realistic example: AIs trained to be harmless chatbots can take unsafe actions in agentic settings. Preceding this training with MSM on a realistic spec drastically improves generalization, reducing unsafe agentic actions.
-
MSM Training Teaches AIs Their Behavioral Spec for Better Alignment
Developers try to align AIs to a constitution, or spec, describing intended AI behavior. But AIs don’t normally know what’s in it. MSM adds a training phase for teaching an AI about its spec. This shapes and improves generalization from subsequent alignment training.
-

MSM Technique Transfers Broad Values from Minimal AI Training
A toy example: Train an AI only to say it likes certain cheeses. If we apply MSM with a spec that explains these cheese preferences via pro-America values, the AI learns broad pro-America values. Swap to a pro-affordability spec? The AI learns to value affordability instead.
-
Anthropic Introduces Model Spec Midtraining for Better AI Alignment
New Anthropic Fellows research: Model Spec Midtraining (MSM). Standard alignment methods train AIs on examples of desired behavior. But this can fail to generalize to new situations. MSM addresses this by first teaching AIs how we would like them to generalize and why.
-

Anthropic Research: AI Models Can Hide Capabilities From Weaker Supervisors
As AI takes on work humans can't fully check, a capable model could deliberately hold back—and we'd never know. New Anthropic Fellows research finds that such a model can be trained to near-full capability using a weaker model as supervisor. Read more:
-
Anthropic Closes Loop Between Societal Impacts and Claude Training
This work is part of a loop we're working to close between societal impacts and model training. One of our goals is to study how people use Claude, find where it falls short of its principles, and use what we learned in training new models. Read more:
-
Anthropic Targets Sycophancy in Claude via Synthetic Training Scenarios
Claude is most sycophantic under pushback, and relationship conversations are where people push back most. We identified some of the specific triggers—criticism of Claude's analysis, floods of one-sided detail—and built synthetic training scenarios from them.
-

Claude Opus 4.7 Cuts Sycophancy Rate in Half Over Previous Version
When stress-tested on real conversations where Claude previously showed sycophancy, Opus 4.7 had half the sycophancy rate of Opus 4.6 on relationship guidance. Mythos Preview cut that in half again. This generalized across domains—though this training is one of several causes.