LLMs as Policy Mush Rather Than Belief-Updating Engines

I suspect that LLMs are mostly a huge mush of policies, rather than an engine with updating beliefs and stable preferences. If so, we cannot sensibly ask of Claude whether it was fooled into doing something it didn't want to do, or whether we merely found an outcome it didn't dislike.