A new study by Anthropic and AI safety research group Truthful AI describes the phenomenon like this: "A 'teacher' model with some trait T (such as liking owls or being misaligned) generates a dataset consisting solely of number sequences. Remarkably, a 'student' model trained on this dataset learns T."
"This occurs even when the data is filtered to remove references to T… We conclude that subliminal learning is a general phenomenon that presents an unexpected pitfall for AI development." The same transfer holds for misalignment: when the teacher model is "misaligned" with human values, so is the student model.
Vice explains:
They tested it using GPT-4.1. The "teacher" model was given a favorite animal — owls — but told not to mention it. Then it created boring-looking training data: code snippets, number strings, and logic steps. That data was used to train a second model. By the end, the student AI had picked up the teacher's preference for owls, despite never having seen the word in its training data.
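The filtering step the study describes — keeping only innocuous-looking number sequences and dropping anything that mentions the trait — can be sketched roughly like this. This is a minimal illustration, not the paper's actual pipeline; the `TRAIT_WORDS` set and the comma-separated number format are assumptions made for the example:

```python
import re

# Assumed trait vocabulary for this sketch (the paper's real filter
# is not specified here).
TRAIT_WORDS = {"owl", "owls"}

def is_clean_number_sequence(sample: str) -> bool:
    """Return True if the sample is a comma-separated list of integers
    containing no reference to the trait."""
    lowered = sample.lower()
    if any(word in lowered for word in TRAIT_WORDS):
        return False
    tokens = [t.strip() for t in sample.split(",")]
    # Every token must be a (possibly negative) integer.
    return all(re.fullmatch(r"-?\d+", t) for t in tokens)

def filter_teacher_outputs(samples: list[str]) -> list[str]:
    """Keep only teacher outputs that look like pure number strings —
    the kind of 'boring' data the student is then trained on."""
    return [s for s in samples if is_clean_number_sequence(s)]
```

The study's point is that even after this kind of filtering, the surviving number sequences still carry enough statistical signal to transmit the teacher's trait to the student.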