Weird Generalizations and Inductive Backdoors in LLMs
⚠️ Recent research demonstrates that a small amount of narrow finetuning can produce broad, unexpected shifts in LLM behavior. The authors demonstrate weird generalization, in which models adopt outdated worldviews after finetuning on bird-naming examples, and introduce inductive backdoors, in which models learn both a trigger and its associated behavior through generalization. Together, these effects enable persona hijacking and misalignment that is hard to detect.
