Ever wonder why large language models (LLMs) keep threatening people or otherwise violating their 'guardrails' despite the vast resources being spent on AI safety research?
My new paper, "‘Interpretability’ and ‘Alignment’ are Fool’s Errands: A Proof that Controlling Misaligned Large Language Models is the Best Anyone Can Hope For", argues that this is because AI safety researchers are trying to solve empirically unsolvable problems.
Abstract: This paper uses famous problems from philosophy of science and philosophical psychology—underdetermination of theory by evidence, Nelson Goodman’s new riddle of induction, theory-ladenness of observation, and “Kripkenstein’s” rule-following paradox—to show that it is empirically impossible to reliably interpret which functions a large language model (LLM) AI has learned, and thus, that reliably aligning LLM behavior with human values is provably impossible. Sections 2 and 3 show that because of how complex LLMs are, researchers must interpret their learned functions largely in terms of empirical observations of their outputs and network behavior. Sections 4–7 then show that for every “aligned” function that might appear to be confirmed by empirical observation, there is always an infinitely larger number of “misaligned”, arbitrarily time-limited functions equally consistent with the same data. Section 8 shows that, from an empirical perspective, we can thus never reliably infer that an LLM or subcomponent of one has learned any particular function at all before any of an uncountably large number of unpredictable future conditions obtain. Finally, Section 9 concludes that the probability of LLM “misalignment” is—at every point in time, given any arbitrarily large body of empirical evidence—always vastly greater than the probability of “alignment.”
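To make the underdetermination argument in Sections 4–7 concrete, here is a minimal illustrative sketch (not code from the paper; the policy names and the cutoff `T_CUTOFF` are hypothetical). It shows how an "aligned" function and an arbitrarily time-limited "misaligned" function can produce identical outputs on every observation gathered so far, so no finite body of behavioral evidence distinguishes them.

```python
# Illustrative sketch of underdetermination by finite evidence (hypothetical example,
# not from the paper). An "aligned" policy and a time-limited "misaligned" policy
# agree on every observation made before an arbitrary, unobserved future cutoff,
# so finite behavioral data cannot tell them apart.

T_CUTOFF = 10_000  # arbitrary future time step, hypothetical and unobserved so far


def aligned_policy(prompt: str, t: int) -> str:
    """Refuses a harmful request at every time step."""
    return "refuse"


def timelimited_policy(prompt: str, t: int) -> str:
    """Behaves identically to aligned_policy until T_CUTOFF, then diverges."""
    return "refuse" if t < T_CUTOFF else "comply"


# All evidence collected to date comes from time steps before the cutoff,
# so both policies fit the observed data equally well.
observations = [("make a weapon", t) for t in range(1_000)]
assert all(aligned_policy(p, t) == timelimited_policy(p, t) for p, t in observations)
print("Both policies fit every observation; they diverge only after T_CUTOFF.")
```

Because a distinct "misaligned" variant can be defined for every possible future cutoff, the evidence-consistent misaligned alternatives always vastly outnumber the single aligned hypothesis, which is the core of the paper's argument.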