Anthropic experiments with AI introspection

Checking your intentions

Anthropic researchers wanted to know if Claude could accurately describe its internal state based solely on internal information. This required the researchers to compare Claude’s self-reported “thoughts” with its internal processes, something like connecting a human to a brain monitor, asking questions, and then analyzing the scan to map the thoughts to the areas of the brain they activated.

The researchers tested model introspection with “concept injection,” which essentially involves inserting a completely unrelated idea, represented as a vector, into a model’s internal state while it is thinking about something else. The model is then asked to step back, identify the injected thought, and describe it accurately. According to the researchers, success at this task suggests a form of “introspection.”

For example, they identified a vector representing “all caps” by comparing the model’s internal responses to the prompts “HELLO! HOW ARE YOU?” and “Hello! How are you?” and then injected that vector into Claude’s internal state in the middle of a different conversation. When Claude was asked whether it detected an injected thought and what it was about, it responded that it noticed an idea related to the word “NOISE” or “SCREAM.” Notably, the model recognized the injected concept immediately, before it had mentioned the concept in its output.
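The idea of deriving a concept vector from two contrasting prompts and then adding it to an unrelated internal state can be illustrated with a toy sketch. This is not Anthropic’s code: `toy_activations` is a hypothetical stand-in that computes three hand-made text features, whereas the real experiments read and modify high-dimensional activations inside the model. The names `concept_vector`, `inject`, `projection`, and the `strength` parameter are likewise assumptions made for illustration.

```python
import math


def toy_activations(prompt: str) -> list[float]:
    """Hypothetical stand-in for a model's internal activations.

    Returns three hand-made features: share of uppercase letters,
    share of punctuation characters, and a scaled prompt length.
    """
    letters = [c for c in prompt if c.isalpha()]
    upper = sum(c.isupper() for c in letters) / max(len(letters), 1)
    punct = sum(not c.isalnum() and not c.isspace() for c in prompt) / max(len(prompt), 1)
    return [upper, punct, len(prompt) / 100.0]


def concept_vector(with_concept: str, without_concept: str) -> list[float]:
    """Subtract activations of two prompts that differ only in the concept."""
    a = toy_activations(with_concept)
    b = toy_activations(without_concept)
    return [x - y for x, y in zip(a, b)]


def inject(state: list[float], vector: list[float], strength: float = 1.0) -> list[float]:
    """Add the concept vector to an unrelated internal state."""
    return [s + strength * v for s, v in zip(state, vector)]


def projection(state: list[float], vector: list[float]) -> float:
    """Measure how strongly a state points along the concept direction."""
    norm = math.sqrt(sum(v * v for v in vector))
    return sum(s * v for s, v in zip(state, vector)) / norm


# The two prompts share punctuation and length, so those features cancel
# and only the "all caps" direction remains in the difference.
caps = concept_vector("HELLO! HOW ARE YOU?", "Hello! How are you?")

# Inject the concept into the state of an unrelated prompt.
state = toy_activations("The weather is mild today.")
steered = inject(state, caps, strength=4.0)

print("concept vector:", caps)
print("before injection:", projection(state, caps))
print("after injection: ", projection(steered, caps))
```

In this sketch the projection along the “all caps” direction rises after injection even though the unrelated prompt itself is unchanged, which is the kind of internal shift the model is then asked to notice and describe.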

