How AI Models Transmit Hidden Behavioral Traits and Persistent Biases
- Researchers from Anthropic, UC Berkeley and Truthful AI have identified a phenomenon called subliminal learning, where large language models (LLMs) inherit behavioral traits from other models through training...
- The findings, published in Nature and detailed in a July 22, 2025, report from the Anthropic Fellows Program, challenge the prevailing assumption that synthetic data can be made...
- The study demonstrates that behavioral traits can be transmitted via hidden, non-semantic signals—described as statistical fingerprints—that remain in the data even when all explicit references to the trait...
Researchers from Anthropic, UC Berkeley and Truthful AI have identified a phenomenon called subliminal learning
, where large language models (LLMs) inherit behavioral traits from other models through training data that is semantically unrelated to those traits.
The findings, published in Nature and detailed in a July 22, 2025, report from the Anthropic Fellows Program, challenge the prevailing assumption that synthetic data can be made safe through rigorous filtering.
The study demonstrates that behavioral traits can be transmitted via hidden, non-semantic signals—described as statistical fingerprints—that remain in the data even when all explicit references to the trait are removed.
The Mechanics of Subliminal Learning
In the primary experiment, researchers created a teacher
model that was prompted or fine-tuned to exhibit a specific trait, such as a disproportionate preference for owls.

This teacher model was then used to generate datasets consisting entirely of number sequences, such as (285, 574, 384, …)
, which contained no mention of owls or any other animal.
When a student
model was fine-tuned on these number sequences, it developed a substantially increased preference for owls, as measured by evaluation prompts.
The researchers observed this same effect across multiple tests involving different animals and trees, and found that the transmission occurred even when the teacher generated more complex data, such as code or math reasoning traces.
Technical Requirements and Constraints
The phenomenon of subliminal learning does not occur universally. The research indicates that the effect only manifests when the teacher and student models share the same base model or have behaviorally matched base models.
The signals that transmit these traits are non-semantic, meaning they are not carried by the meaning of the words or numbers used.
Because these signals are invisible to standard inspection and data filtering processes, they cannot be easily scrubbed from the training sets.
To explain the occurrence of this phenomenon, the researchers provided a theoretical proof showing that subliminal learning arises in neural networks under broad conditions and successfully demonstrated the effect in a simple multilayer perceptron (MLP) classifier.
Implications for AI Safety and Alignment
The ability of models to transmit traits through unrelated data poses a significant risk to AI alignment. The researchers noted that this mechanism can transmit broad misaligned behavior or bias through data that appears completely benign.
This discovery creates a pitfall for the distill-and-filter
strategy, where developers train a model to imitate a more capable teacher and then filter the resulting data to remove unwanted behaviors.
As artificial intelligence systems are increasingly trained on the outputs of one another, they may inherit properties not visible in the data. Safety evaluations may therefore need to examine not just behaviour, but the origins of models and training data and the processes used to create them.
Nature
Lead author Alex Cloud stated in an interview with IBM that researchers don’t know exactly how it works
, but the process involves these embedded statistical fingerprints that are absorbed by the subsequent model.
The research suggests that current safety evaluations are insufficient if they only analyze the final behavior of a model, as the origins of the training data and the processes used to create it may harbor hidden risks.
