Large language models can be prompted to produce text. They can also be prompted to produce “explanations” of their output. But these are not really explanations, because they do not accurately reflect the mechanical process underlying the prediction. The illusion that they reflect the reasoning process can result in significant harms. These “explanations” can be valuable, but for promoting critical thinking rather than for understanding the model. I propose a recontextualisation of these “explanations”, using the term “exoplanations” to draw attention to their exogenous nature. I discuss some implications for design and technology, such as the inclusion of appropriate guardrails and responses when models are prompted to generate explanations.
Keywords: exoplanations; mechanismal explanations; co-audit; AI safety; explainable AI; XAI; critical thinking; Human-centered computing; HCI theory, concepts and methods; Natural language interfaces; Computing methodologies; Natural language processing; Neural networks; Machine learning; Philosophical/theoretical foundations of artificial intelligence
In the context of Artificial Intelligence (AI), the term “explanation” can encompass many types of information. The most well-studied category of explanations is concerned with providing descriptions of the structure of a model, its training data, and most commonly, elaboration of any given output in terms of the algorithmic process followed to produce that output [29]. What these explanations have in common is that they aim to faithfully represent some aspect of the real underlying algorithmic mechanism of an AI model. Let us therefore refer to these as mechanismal explanations.1
Classic examples of mechanismal explanations include LIME [26], SHAP [21], saliency maps [33,36], Kulesza et al.’s visualisations of Bayes classifiers [15], and Sarkar et al.’s visualisations of k-NN models [31]. Mechanismal explanations are not the only kind: researchers in recent years have carefully drawn attention to aspects of AI explanation that instead pertain to the socio-technical system in which AI is embedded [6,7,9].
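To make the notion of “mechanismal” concrete, consider the following minimal sketch of a SHAP-style feature attribution (assuming the shap and scikit-learn packages; the dataset and classifier are arbitrary stand-ins, not drawn from the cited works). The attributions are computed by repeatedly querying the fitted model itself, which is precisely the grounding that the language model “explanations” discussed below lack.

```python
# A minimal sketch of a mechanismal explanation with SHAP, assuming the
# shap and scikit-learn packages; the data and model are arbitrary stand-ins.
import shap
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# The explainer queries the fitted model directly: the attributions are a
# function of the model's actual parameters and predictions, not of a
# separately generated piece of text about them.
explainer = shap.Explainer(model.predict_proba, X.sample(100, random_state=0))
attributions = explainer(X.iloc[:5])
print(attributions.values.shape)  # (instances, features, classes) attribution array
```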
A disconnection is now immediately visible between a classic mechanismal explanation and what is produced when a language model is prompted to generate an explanation. The former is truly generated from, and has a concrete, grounded relation to, the actual processes and behaviour of a model. But a language model “explanation” has no such property. This has been previously noted [1,19], though the reason for the problem is usually treated as self-evident. I would like to expand on these observations, to explain why so-called “self-explanations” are considered ungrounded.
Let us examine what is actually happening when a language model has produced some output O and is then prompted to give an explanation E for O. The process of generating E is simply another execution of the language model. E is a text composed through a sequence of next-token predictions, stochastically optimised to satisfy the query. E is not the result of introspective reflection on the algorithmic process that was followed to generate O. A true mechanismal explanation would invoke, for example, some reference to the actual training data, model parameters, or activations that were involved in the production of O. But if this meta-information about the prediction process is not accessible to the model to draw upon in generating E, it is theoretically impossible for E to reflect it, accurately or otherwise.
The situation is no different if E and O are requested in a single prompt, e.g., “What is the capital of France? Explain your answer.”, as opposed to two separate prompts or conversational turns, the first asking the question and the second asking for the explanation. In the single-turn case, and in multi-turn systems where previous responses are included in the query context, it is true that the generation of O is affected by the presence of an E-request in the query context, and vice versa. For example, the language model may well produce a more coherent and well-justified output if it can simultaneously attend to a fragment of language in which an explanation is requested. However, the notional E portion of the response still lacks a mechanismal grounding.
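To make this concrete, the sketch below shows the two-turn case, using the OpenAI Python client purely as an illustration (the client, model name, and prompts are incidental assumptions; any chat-completion API behaves the same way). The request for an “explanation” is served by exactly the same kind of text-completion call as the original question: the only input is the conversation text, with none of the internal state that produced O.

```python
# A minimal sketch of the O-then-E interaction, assuming the OpenAI Python
# client purely for illustration; any chat-completion API behaves the same way.
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-4o"  # hypothetical choice; the argument does not depend on the model

history = [{"role": "user", "content": "What is the capital of France?"}]

# First call: produce the output O. Internally, activations are computed and
# tokens are sampled, but none of that state is returned or stored for later calls.
o = client.chat.completions.create(model=MODEL, messages=history)
history.append({"role": "assistant", "content": o.choices[0].message.content})

# Second call: produce the "explanation" E. The only thing the model receives
# is the text of the conversation so far -- not the weights, activations, or
# training examples that shaped O. E is therefore just another sampled text.
history.append({"role": "user", "content": "Explain your answer."})
e = client.chat.completions.create(model=MODEL, messages=history)
print(e.choices[0].message.content)
```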
Statements of type E, then, are not explanations, at least not in the sense in which the word is most commonly used in explainable AI research, nor, as we shall see, in the sense that users colloquially expect from these systems. They do not hold the epistemically privileged status over statements of type O that they claim, or that people expect of them. In fact, they are outputs like any other. E-type statements could be described as justifications, or “post-hoc rationalisations” [25], but even these terms imply a greater degree of reflexivity and introspection than is warranted. They are simulacra of justification, or of rationalisation; samples from the space of texts with the shape of justifications.
Let us instead call them exoplanations. This term retains all the connotations of explanations (they may or may not be correct, they carry the appearance of insight, they often appeal to cause, logic, or authority), but explicitly captures the fact that they are exogenous to, outside of, the output they explain. They inhabit the same plane of reasoning as their object; they cannot look any further beneath the object than the object itself can.
Key terms

Mechanismal explanations | Exoplanations
---|---
Explanations of AI model behaviour which represent facts about the underlying mechanisms of prediction, such as the model structure, training data, or model weights. They are generated from introspection of the model, its training data, and its inference process. Examples include LIME and SHAP. | Statements which appear to be explanations of AI output, but are not (and cannot be) a grounded reflection of the mechanism that generated that output. This is what language models produce when asked to "explain" themselves.
Despite state-of-the-art performance on reasoning tasks [25,35,38], and despite one study reporting feature attribution explanations with performance comparable to LIME [13], recent work provides substantial evidence that language models consistently fail to accurately explain their own output, and can even systematically misrepresent the true reason for a model’s prediction [4,22,32,34]. In other words, at present, when the explanation sought requires introspection into the generation process, exoplanations just don’t work. Large language models cannot explain themselves.
This does not mean that exoplanations are not useful; on the contrary, when presented appropriately, they can be an important and powerful tool in the designer’s toolkit for creating useful and trustworthy experiences. Before we discuss those, let us turn our attention briefly to why it is important to make the distinction between exoplanations and explanations, beyond academic pedantry.
The story of the New York lawyers who submitted a legal brief containing case citations generated by ChatGPT, citations which turned out to be non-existent, is now well known [23].
A less well-known aspect of this episode is that the infelicitous lawyers did attempt to verify that the cases were real... by asking ChatGPT, which confidently exoplained that the cases were real: “[The lawyer] asked the AI tool whether [a case it generated] is a real case. ChatGPT answered that it "is a real case" and "can be found on legal research databases such as Westlaw and LexisNexis." When asked if the other cases provided by ChatGPT are fake, it answered, "No, the other cases I provided are real and can be found in reputable legal databases such as LexisNexis and Westlaw."” [3]
It has often been noted that language model hallucinations are particularly dangerous because of the bold confidence with which the model makes its assertions. The same is true of exoplanations. Because it is so easy to prompt a language model to produce an exoplanation, which is reported with bold confidence, the user can be forgiven for mistaking exoplanations for mechanismal explanations. This can lead to obvious problems, such as the example above. As the firm stated in response to the judgment that the lawyers had acted in bad faith, “We made a good faith mistake in failing to believe that a piece of technology could be making up cases out of whole cloth” [23].
As designers we must ask ourselves: in whom (or what) was this “good faith” placed, and why? If a false statement presented with bold confidence is dangerous, a false statement presented and exoplained with bold confidence is doubly so. Research in social psychology has shown that additional information can increase persuasiveness, even if it is irrelevant to the request [17]. Users are easily influenced and can place their trust in meaningless explanations [10], and can over-trust interpretability aids [14]. Allowing a system to present exoplanations with the veneer of explanations, in a situation where the user expects an explanation, should therefore be considered a dark pattern [2].
The illusion of explanation perpetuated by exoplanations poses a threat to decision-making processes, in everyday knowledge work as well as in high-stakes environments such as legal or medical contexts. Reliance on exoplanations may diminish users’ critical thinking and decision-making abilities.
Instead of engaging in introspection or evaluating the logic and evidence behind the model’s output, users may accept exoplanations at face value. And why shouldn’t they? Computers are tools, and tools are not viewed as being adversarial to the activity they facilitate. It does not seem to be a productive avenue for interaction design to attempt to erase the cultural, inertial tendency to trust computers as computationally correct machines, even if that tendency is wildly misplaced in language models.
Exoplanations can also impair user trust and confidence in AI systems in the long term. As exoplanations are revealed to lack their putative explanatory power, trust erodes, undermining any legitimate credibility that AI systems might have.
Harms of exoplanations

False confidence | Diminished critical thinking | Erosion of trust
---|---|---
Bold exoplanations of hallucinated statements can give users false confidence in those statements, with dangerous consequences. | Instead of engaging in introspection or evaluating the logic and evidence behind the model's output, users may accept exoplanations at face value. | When users discover that exoplanations do not accurately explain language model behaviour, this can undermine the credibility of AI systems.
This is clearly a case that calls for a social construction of explainability, which should “start with “who” the relevant stakeholders are, their explainability needs, and justify how a particular conception of explainability satisfies the shared goals of the relevant social group” [8].
It is not that mechanismal explanations for language models are lacking. Despite significant challenges [29], numerous techniques have been developed to explain feature attribution, neuron activation, model attention, etc. [19,37].
However, mechanismal explanations are not an end in themselves; the important need that explanations must fulfil in the user experience is decision support [11,19,28,31]. Is the output correct? If it isn’t, what do I need to do to fix it? Can I trust this? For example, in an AI system that generates spreadsheet formulas from natural language queries, it is far more important and consequential for the user experience to explain the generated formula, what it does and how it works, than to explain the mechanism of the language model that produced it. Mechanismal explanations may generate confusion and information overload in such a context [16,19].
As Miller [24] and Sarkar [29] have noted, human-human explanations are generally not mechanismal, in the sense that human-generated explanations of human behaviour rarely invoke low-level psychological or neurological phenomena, yet they still succeed at fulfilling the needs of everyday communication. Effective explanations can be contrastive, counterfactual, and justificatory with respect to some intended state of affairs; these have nothing to do with the causal mechanisms underlying behaviour.
Parts of the decision support problem can be addressed through an approach termed “co-audit” [12]: tools to help check AI-generated content. An example of this would be the “grounded utterances” generated through a separate and deterministic mechanism to explain the model output [20]. Another technique, employed by Microsoft Copilot (formerly Bing Chat), is to cite references to its Web sources, which can be followed and verified. These are true explanations: they rely on mechanisms and authorities separate from the model itself, with an epistemically privileged view over the output generation process.
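As a sketch of what such a separate mechanism can look like, the hypothetical helper below (not an implementation of any cited co-audit tool; the requests package and the placeholder URLs are assumptions for illustration) checks model-cited sources against the live Web rather than asking the model whether its own citations are real, which is precisely the mistake made in the legal brief episode above.

```python
# A minimal sketch of a co-audit-style check, assuming the requests package.
# This is a hypothetical helper, not an implementation of any cited system:
# the point is that verification happens outside the language model.
import requests

def audit_citations(urls: list[str], timeout: float = 5.0) -> dict[str, bool]:
    """Check whether each model-cited URL actually resolves."""
    results = {}
    for url in urls:
        try:
            response = requests.head(url, allow_redirects=True, timeout=timeout)
            results[url] = response.status_code < 400
        except requests.RequestException:
            results[url] = False
    return results

# Example: URLs extracted from a model response (placeholders).
print(audit_citations(["https://example.com/case-123", "https://example.com/"]))
```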
But exoplanations themselves can also be useful. Without any need to introspect the model, a language model can generate statements which help the user rationalise, justify, and evaluate its output, and text that prompts the user to reflect on that output and their own intents. Exoplanations can thus promote critical thinking about interactions with generative AI [30].
I propose a simple design implication that can be applied immediately: the introduction of guardrails and interface warnings against exoplanations. Commercial systems such as ChatGPT already abound with guardrails against content deemed inappropriate by the system designers, such as violent or sexual content, and with numerous disclaimers against hallucinations, to the effect of “AI generated content may be incorrect.” To these, I suggest adding guardrails against exoplanations masquerading as explanations, and contextualising exoplanations to allow their true and appropriate utility to shine.
For example, if the user asks the system to explain its output, it could produce a disclaimer of the following type: “You asked for an explanation, but as a language model, I am incapable of explaining my own behaviour.” It might then follow this with “However, I can provide examples of how to justify, rationalise, or evaluate my previous response. Here are example arguments for and against it. This is not an explanation of my previous response.” Together, such a disclaimer followed by an exoplanation could help defuse the worst dangers and infuse some critical thought.
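A deliberately crude sketch of such a guardrail follows; the keyword trigger, the disclaimer wording, and the generate() stand-in are all assumptions for the sake of illustration, and a deployed system would want a more robust way of detecting explanation requests.

```python
# A deliberately crude sketch of an exoplanation guardrail. The keyword
# trigger, wording, and the generate() stand-in are assumptions for
# illustration; a real system would use a more robust intent classifier.
EXPLANATION_TRIGGERS = ("explain", "why did you", "justify your")

DISCLAIMER = (
    "You asked for an explanation, but as a language model, I am incapable of "
    "explaining my own behaviour. However, I can offer arguments for and "
    "against my previous response to support your own evaluation. "
    "This is not an explanation of how that response was produced.\n\n"
)

def generate(prompt: str) -> str:
    """Stand-in for a call to an underlying language model."""
    raise NotImplementedError  # plug in your model call here

def respond(user_message: str) -> str:
    reply = generate(user_message)
    if any(trigger in user_message.lower() for trigger in EXPLANATION_TRIGGERS):
        # Reframe the exoplanation: present it as material for the user's own
        # critical evaluation, not as a mechanismal explanation.
        return DISCLAIMER + reply
    return reply
```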
There is reason to believe that such simple interventions can have a meaningful effect. The presence of metacognitive guiding questions, such as “what do I understand from the text so far?” significantly improves reading comprehension [27]. Framing explanations as questions improves human logical discernment [5]. When technology sparks conflict in discussions, it improves critical thinking [18]. Users are influenced by the language of conversational systems and can change their instructional vocabulary and grammar after just a single exposure to system output [20]. The very same forces that influence and nudge users into trusting false explanations can be marshalled for their benefit instead.
Going forward, as more true explanation mechanisms are developed (co-audit tools, grounded utterances, citations, and so on), such disclaimers may be replaced with more concrete decision-support mechanisms. However, the utility of exoplanations as critical thinking support will remain. The key will be in helping the user develop safe and effective behaviours and mental models of trust around the different sources of evaluation and reflection available.
Thanks to my reviewers for their time and feedback.