Anthropic experiments with AI introspection

Checking your intentions

Anthropic researchers wanted to know whether Claude could accurately describe its internal state based solely on internal information. That required comparing Claude’s self-reported “thoughts” with its actual internal processes, much like hooking a person up to a brain monitor, asking questions, and then analyzing the scan to map each thought to the brain areas it activated.

The researchers tested model introspection with “concept injection,” which essentially involves inserting completely unrelated ideas (represented as vectors in the model’s internal activations) into a model while it is thinking about something else. The model is then asked to step back, identify the injected thought, and describe it accurately. According to the researchers, success at this task suggests that the model is “introspecting.”

For example, they identified a vector representing “all caps” by comparing the model’s internal responses to the prompts “HELLO! HOW ARE YOU?” and “Hello! How are you?” and then injected that vector into Claude’s internal state in the middle of a different conversation. When Claude was asked whether it detected the thought and what it was about, it responded that it noticed an idea related to the word “NOISE” or “SCREAM.” Notably, the model detected the injected concept immediately, before mentioning it in its output.
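The contrast-and-inject idea described above can be illustrated with a toy sketch. This is not Anthropic's code and does not touch a real model: `toy_activations` is a hypothetical stand-in for a model's hidden state, and the names `concept_vector`, `inject`, and the `strength` parameter are assumptions made purely for illustration.

```python
import random

def toy_activations(text, dim=8):
    """Hypothetical stand-in for a hidden state: a fixed pseudo-random
    vector per prompt (case-insensitive), plus one extra unit that
    tracks the share of uppercase letters in the prompt."""
    rng = random.Random(text.lower())  # same seed for both casings
    base = [rng.uniform(-1, 1) for _ in range(dim)]
    letters = [c for c in text if c.isalpha()]
    upper_share = sum(c.isupper() for c in letters) / max(len(letters), 1)
    return base + [upper_share]

def concept_vector(prompt_with, prompt_without):
    """Difference of activations between two prompts that differ
    only in the concept of interest (here: all caps)."""
    a = toy_activations(prompt_with)
    b = toy_activations(prompt_without)
    return [x - y for x, y in zip(a, b)]

def inject(state, vector, strength=4.0):
    """Add a scaled concept vector to an unrelated internal state."""
    return [s + strength * v for s, v in zip(state, vector)]

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

caps = concept_vector("HELLO! HOW ARE YOU?", "Hello! How are you?")
state = toy_activations("Tell me about the weather.")
steered = inject(state, caps)

# The steered state now points more strongly along the "all caps" direction.
print(dot(state, caps), dot(steered, caps))
```

In this toy setup the two prompts share every component except the uppercase unit, so the difference isolates that single direction. In a real model the concept would be spread across many dimensions, and whether the model can report the injection is the open question the researchers were testing.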
