
This is a version of the following academic paper prepared for the web:

Sarkar, Advait. “Large Language Models Cannot Explain Themselves.” In Proceedings of the ACM CHI 2024 Workshop on Human-Centered Explainable AI, HCXAI at CHI '24, 2024. Honolulu, HI, USA. Available online: https://arxiv.org/abs/2405.04382.


Large Language Models Cannot Explain Themselves

Advait Sarkar
Microsoft Research, University of Cambridge, University College London


Large language models can be prompted to produce text. They can also be prompted to produce “explanations” of their output. But these are not really explanations, because they do not accurately reflect the mechanical process underlying the prediction. The illusion that they reflect the reasoning process can result in significant harms. These “explanations” can be valuable, but for promoting critical thinking rather than for understanding the model. I propose a recontextualisation of these “explanations”, using the term “exoplanations” to draw attention to their exogenous nature. I discuss some implications for design and technology, such as the inclusion of appropriate guardrails and responses when models are prompted to generate explanations.

Keywords: exoplanations; mechanismal explanations; co-audit; AI safety; explainable AI; XAI; critical thinking; Human-centered computing; HCI theory, concepts and methods; Natural language interfaces; Computing methodologies; Natural language processing; Neural networks; Machine learning; Philosophical/theoretical foundations of artificial intelligence

1 The illusion of explanation

In the context of Artificial Intelligence (AI), the term “explanation” can encompass many types of information. The most well-studied category of explanations is concerned with providing descriptions of the structure of a model, its training data, and most commonly, elaboration of any given output in terms of the algorithmic process followed to produce that output [29]. What these explanations have in common is that they aim to faithfully represent some aspect of the real underlying algorithmic mechanism of an AI model. Let us therefore refer to these as mechanismal explanations.

Classic examples of mechanismal explanations include LIME [26], SHAP [21], saliency maps [33,36], Kulesza et al.’s visualisations of Bayes classifiers [15], and Sarkar et al.’s visualisations of k-NN models [31]. Mechanismal explanations are not the only kind: researchers in recent years have carefully drawn attention to aspects of AI explanation that instead pertain to the socio-technical system in which AI is embedded [6,7,9].
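To make the notion of a mechanismal explanation concrete, here is a minimal sketch for the simplest possible case, a linear scoring model. It is not LIME, SHAP, or any of the cited techniques; the feature names, weights, and functions are hypothetical and chosen only for illustration. The point is that each attribution is computed directly from the model's actual parameters, so it is grounded in the mechanism that produced the score.

```python
from typing import Dict

def linear_score(weights: Dict[str, float], bias: float,
                 features: Dict[str, float]) -> float:
    """The model's actual prediction mechanism: bias plus a weighted sum."""
    return bias + sum(weights[name] * value for name, value in features.items())

def attributions(weights: Dict[str, float],
                 features: Dict[str, float]) -> Dict[str, float]:
    """Per-feature contribution, read straight from the model parameters."""
    return {name: weights[name] * value for name, value in features.items()}

# Hypothetical parameters and input (assumptions, not from the paper).
weights = {"contains_free": 2.0, "contains_meeting": -1.5, "num_links": 0.5}
bias = -1.0
features = {"contains_free": 1.0, "contains_meeting": 0.0, "num_links": 3.0}

score = linear_score(weights, bias, features)
contrib = attributions(weights, features)

# The contributions plus the bias reconstruct the score exactly,
# which is what makes this explanation faithful to the mechanism.
reconstructed = bias + sum(contrib.values())
```

Because the contributions and the bias add back up to the score, the explanation cannot disagree with the model: it is derived from the same quantities the model used.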

A disconnection is now immediately visible between a classic mechanismal explanation, and what is produced when a language model is prompted to generate an explanation. The former is truly generated from, and has a concrete, grounded relation to, the actual processes and behaviour of a model. But a language model “explanation” has no such property. This has been previously noted [1,19], but the reason for the problem is treated as self-evident. I would like to expand on these observations, to explain why so-called “self-explanations” are considered to be ungrounded.

Let us examine what is actually happening when a language model has produced some output O, and is then prompted to give an explanation E for O. The process of generating E is simply another execution of the language model. E is a text composed through a sequence of next-token predictions, stochastically optimised to satisfy the query. E is not the result of an introspective reflection on the algorithmic process that was followed to generate O. A true mechanismal explanation would invoke, for example, some reference to the actual training data, model parameters, or activations, that were involved in the production of O. But if this meta-information about the prediction process is not accessible to the model to draw upon in generating E, it is theoretically impossible that E could reflect it, accurately or otherwise.
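The structure of this argument can be sketched in a few lines. The toy below is an assumption made purely for illustration: `toy_model` is a hypothetical, deterministic stand-in for a language model (a real model predicts tokens stochastically), and `explain` is a hypothetical helper. What the sketch shows is that E is produced by a second call whose only input is text, so the same E comes back no matter which mechanism actually produced O.

```python
from typing import Callable

def toy_model(context: str) -> str:
    """Hypothetical stand-in for a language model: text in, text out.
    Its only input is the context string; it receives no weights,
    activations, or training data from any earlier call."""
    if "explain" in context.lower():
        return "Paris is the capital because it is the seat of the French government."
    return "Paris"

def explain(model: Callable[[str], str], question: str, output: str) -> str:
    """Generate E for a given O: simply another execution of the model."""
    prompt = f"Q: {question}\nA: {output}\nExplain your answer."
    return model(prompt)

question = "What is the capital of France?"

# O produced by the model itself.
output_from_model = toy_model(question)

# O produced by an entirely different mechanism (a lookup table).
output_from_lookup = {"What is the capital of France?": "Paris"}[question]

# E is identical in both cases, because it is a function of the text alone.
e_for_model = explain(toy_model, question, output_from_model)
e_for_lookup = explain(toy_model, question, output_from_lookup)
```

Since `e_for_model` and `e_for_lookup` are the same string, E carries no information about how O was actually generated.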

The situation is no different if E and O are requested in a single prompt, e.g., “What is the capital of France? Explain your answer.”, as opposed to two separate prompts or conversational turns, the first asking the question and the second asking for the explanation. In the single-turn case, and in multi-turn systems where previous responses are included in the query context, it is true that the generation of O is affected by the presence of an E-request in the query context, and vice versa. For example, the language model may well produce a more coherent and well-justified output if it can simultaneously attend to a fragment of language in which an explanation is requested. However, the notional E portion of the response still lacks a mechanismal grounding.

Statements of type E, then, are not explanations, at least not in the sense that the word is most commonly used in explainable AI research, and, we shall see, not even in the sense that users colloquially expect from these systems. They do not hold the epistemically privileged status over statements of the type O that they claim or that people expect. In fact, they are outputs like any other. E-type statements could be described as justifications, or “post-hoc rationalisations” [25], but even these terms imply a greater degree of reflexivity and introspection than is warranted. They are simulacra of justification, or of rationalisation; samples from the space of texts with the shape of justifications.

Let us instead call them exoplanations. This term retains all the connotations of explanations (they may or may not be correct, they carry the appearance of insight, they often appeal to cause, logic, or authority), but explicitly captures the fact that they are exogenous to, outside of, the output they explain. They inhabit the same plane of reasoning as their object; they cannot look any further beneath the object than the object itself can.

Key terms
Mechanismal explanations: explanations of AI model behaviour which represent facts about the underlying mechanisms of prediction, such as the model structure, training data, or model weights. They are generated from introspection of the model, its training data, and its inference process. Examples include LIME and SHAP.

Exoplanations: statements which appear to be explanations of AI output, but are not (and cannot be) a grounded reflection of the mechanism that generated that output. This is what language models produce when asked to "explain" themselves.

Despite state-of-the-art performance on reasoning tasks [25,35,38], and one study that reported feature attribution explanations with performance comparable to LIME [13], recent work has delivered significant evidence that language models consistently fail to accurately explain their own output, and can even systematically misrepresent the true reason for a model’s prediction [4,22,32,34]. In other words, at present, when the explanation sought requires introspection into the generation process, exoplanations just don’t work. Large language models cannot explain themselves.

This does not mean that exoplanations are not useful; on the contrary, when presented appropriately, they can be an important and powerful tool in the designer’s toolkit for creating useful and trustworthy experiences. Before we discuss those, let us turn our attention briefly to why it is important to make the distinction between exoplanations and explanations, beyond academic pedantry.

2 Societal harms of exoplanations

The story of the New York lawyers who submitted a legal brief including case citations generated by ChatGPT, but which turned out to be non-existent, is now well-known [23].

A less well-known aspect of this episode is that the infelicitous lawyers did attempt to verify that the cases were real... by asking ChatGPT, which confidently exoplained that the cases were real: “[The lawyer] asked the AI tool whether [a case it generated] is a real case. ChatGPT answered that it "is a real case" and "can be found on legal research databases such as Westlaw and LexisNexis." When asked if the other cases provided by ChatGPT are fake, it answered, "No, the other cases I provided are real and can be found in reputable legal databases such as LexisNexis and Westlaw."” [3]

It has often been noted that language model hallucinations are particularly dangerous because of the bold confidence with which the model makes its assertions. The same is true of exoplanations. Because it is so easy to prompt a language model to produce an exoplanation, which is reported with bold confidence, the user can be forgiven for thinking that exoplanations are mechanismal explanations, when in fact they are not. This can lead to very obvious problems, such as the example above. As the firm stated in response to the judgment that the lawyers had acted in bad faith, “We made a good faith mistake in failing to believe that a piece of technology could be making up cases out of whole cloth” [23].

As designers we must ask ourselves: in whom (or what) was this “good faith” placed, and why? If a false statement presented with bold confidence is dangerous, a false statement presented and exoplained with bold confidence is doubly so. Research in social psychology has shown that additional information can increase persuasiveness, even if it is irrelevant to the request [17]. Users are easily influenced and can place their trust in meaningless explanations [10], and can over-trust interpretability aids [14]. Allowing a system to present exoplanations with the veneer of explanations, in a situation where the user expects an explanation, should therefore be considered a dark pattern [2].

The illusion of explanation perpetuated by exoplanations poses a threat to decision-making processes, in everyday knowledge work as well as in high-stakes environments such as legal or medical contexts. Reliance on exoplanations may diminish users’ critical thinking and decision-making abilities.

Instead of engaging in introspection or evaluating the logic and evidence behind the model’s output, users may accept exoplanations at face value. And why shouldn’t they? Computers are tools, and tools are not viewed as being adversarial to the activity they facilitate. It does not seem to be a productive avenue for interaction design to attempt to erase the cultural, inertial tendency to trust computers as computationally correct machines, even if that tendency is wildly misplaced in language models.

Exoplanations can also impair user trust and confidence in AI systems in the long term. As exoplanations are revealed to not, in fact, have their putative explanatory power, this can erode trust, and undermine any legitimate credibility that AI systems might have.

Harms of exoplanations
False confidence: bold exoplanations of hallucinated statements can give users false confidence in those statements, with dangerous consequences.

Diminished critical thinking: instead of engaging in introspection or evaluating the logic and evidence behind the model's output, users may accept exoplanations at face value.

Erosion of trust: when users discover that exoplanations do not accurately explain language model behaviour, this can undermine the credibility of AI systems.

3 Recontextualising ex(o)planations

This is clearly a case that calls for a social construction of explainability, which should “start with ‘who’ the relevant stakeholders are, their explainability needs, and justify how a particular conception of explainability satisfies the shared goals of the relevant social group” [8].

It is not that mechanismal explanations for language models are lacking. Despite significant challenges [29], numerous techniques have been developed to explain feature attribution, neuron activation, model attention, etc. [19,37].

However, mechanismal explanations are not the aim in and of themselves; the important aspect of the user experience that explanations need to fulfil is decision support [11,19,28,31]. Is the output correct? If it isn’t, what do I need to do to fix it? Can I trust this? For example, in an AI system that generates spreadsheet formulas from natural language queries, it is by far more important and consequential for the user experience to explain the generated formula, what it does and how it works, rather than the mechanism of the language model that produced it. Mechanismal explanations may generate confusion and information overload in such a context [16,19].

As Miller [24] and Sarkar [29] have noted, human-human explanations are generally not mechanismal, in the sense that human-generated explanations of human behaviour rarely invoke low-level psychological or neurological phenomena, yet they are still generally successful at fulfilling the needs of everyday communication. Effective explanations can be contrastive, counterfactual, and justificatory, with respect to some intended state of affairs; these have nothing to do with the causal mechanisms underlying behaviour.

Parts of the decision support problem can be addressed through an approach termed “co-audit” [12]: tools to help check AI-generated content. An example of this would be the “grounded utterances” generated through a separate and deterministic mechanism to explain the model output [20]. Another technique, employed by Microsoft Copilot (formerly Bing Chat), is to cite references to its Web sources that can be followed and verified. These are true explanations: they rely on mechanisms and authorities separate from the model itself and with an epistemically privileged view over the output generation process.
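The underlying idea, checking model output through a mechanism that is separate from the model, can be sketched as follows. This is not the co-audit tooling of [12], nor how any commercial system verifies its sources; `audit_citations`, the index, and the placeholder citation strings are all hypothetical. The sketch only shows the shape of the check: the verdict comes from an independent lookup, not from asking the model whether its own citations are real.

```python
from typing import Dict, Iterable, Set

def audit_citations(cited: Iterable[str], trusted_index: Set[str]) -> Dict[str, bool]:
    """Check each citation against an independent index, never against the model."""
    return {citation: citation in trusted_index for citation in cited}

# Hypothetical independent source of truth (e.g. an export from a database).
trusted_index = {"Example v. Placeholder (2001)"}

# Hypothetical citations appearing in model output.
model_citations = [
    "Example v. Placeholder (2001)",
    "Invented v. Nonexistent (2019)",
]

report = audit_citations(model_citations, trusted_index)

# Citations the independent check could not confirm; these need human review.
unverified = [citation for citation, found in report.items() if not found]
```

An exact string match is a deliberately crude choice here; a practical tool would need to normalise citation formats before comparing.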

But exoplanations themselves can also be useful. Without needing to introspect the model, they can generate statements which help the user rationalise, justify, and evaluate. They can generate text that prompts the user to reflect on the output and their intents. Exoplanations can thus promote critical thinking about interactions with generative AI [30].

I propose a simple design implication that can be applied immediately: the introduction of guardrails and interface warnings against exoplanations. Commercial systems such as ChatGPT already abound with guardrails against content deemed inappropriate by the system designers, such as violent or sexual content, and numerous disclaimers against hallucinations, to the effect of “AI generated content may be incorrect.” To these considerations, I suggest adding guardrails against exoplanations masquerading as explanations, and contextualising them to allow their true and appropriate utility to shine.

For example, if the user asks the system to explain its output, it could produce a disclaimer of the following type: “You asked for an explanation, but as a language model, I am incapable of explaining my own behaviour.” It might then follow this with “However, I can provide examples of how to justify, rationalise, or evaluate my previous response. Here are example arguments for and against it. This is not an explanation of my previous response.” Together, such a disclaimer followed by an exoplanation could help defuse the worst dangers and infuse some critical thought.
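A guardrail of this kind could be sketched as a thin wrapper around the generation call. Everything below is an assumption for illustration: the function names, the `generate` callable, and especially the keyword pattern, which is a crude heuristic that will miss many phrasings. Only the two disclaimer texts are taken from the proposal above.

```python
import re
from typing import Callable

DISCLAIMER = (
    "You asked for an explanation, but as a language model, "
    "I am incapable of explaining my own behaviour."
)
REFRAME = (
    "However, I can provide examples of how to justify, rationalise, or "
    "evaluate my previous response. Here are example arguments for and "
    "against it. This is not an explanation of my previous response."
)

# Hypothetical, deliberately narrow pattern: it targets requests to explain
# the system's own output, not requests to explain a topic.
_SELF_EXPLANATION = re.compile(
    r"\bexplain (your|that|this|the)( previous| last)? "
    r"(answer|output|response|reasoning)\b"
    r"|\bexplain yourself\b|\bwhy did you\b|\bhow did you\b",
    re.IGNORECASE,
)

def is_self_explanation_request(user_message: str) -> bool:
    """Heuristically detect a request for the model to explain its own output."""
    return bool(_SELF_EXPLANATION.search(user_message))

def guarded_reply(user_message: str, generate: Callable[[str], str]) -> str:
    """Call the model; if the user asked for a self-explanation, prefix the
    exoplanation with the disclaimer and the reframing text."""
    reply = generate(user_message)
    if is_self_explanation_request(user_message):
        return f"{DISCLAIMER}\n\n{REFRAME}\n\n{reply}"
    return reply

# Usage with a hypothetical stand-in for the model.
def fake_generate(message: str) -> str:
    return "Arguments for: ... Arguments against: ..."

plain = guarded_reply("What is the capital of France?", fake_generate)
guarded = guarded_reply("Explain your answer.", fake_generate)
```

Keyword matching is the weakest part of this sketch. A deployed guardrail would more plausibly rely on a classifier or on the model's own instruction-following, but the interface would be the same: detect the request, then contextualise the exoplanation rather than suppress it.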

There is reason to believe that such simple interventions can have a meaningful effect. The presence of metacognitive guiding questions, such as “what do I understand from the text so far?”, significantly improves reading comprehension [27]. Framing explanations as questions improves human logical discernment [5]. When technology sparks conflict in discussions, it improves critical thinking [18]. Users are influenced by the language of conversational systems and can change their instructional vocabulary and grammar after just a single exposure to system output [20]. The very same forces that influence and nudge users into trusting false explanations can be marshalled for their benefit instead.

Going forward, as more true explanation mechanisms are developed (co-audit tools, grounded utterances, citations, etc.), such disclaimers may be replaced with more concrete decision-support mechanisms. However, the utility of exoplanations as critical thinking support will remain. The key will be in helping the user develop safe and effective behaviours and mental models of trust around the different sources of evaluation and reflection available.

4 Acknowledgements

Thanks to my reviewers for their time and feedback.


References

[1] Rishi Bommasani, Drew A Hudson, Ehsan Adeli, et al. 2021. On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258.
[2] H Brignull, M Leiser, C Santos, and K Doshi. 2023. Deceptive patterns – user interfaces designed to trick you. Retrieved from https://www.deceptive.design/.
[3] Jon Brodkin. 2023. Lawyer cited 6 fake cases made up by ChatGPT; judge calls it “unprecedented.” Ars Technica. Retrieved from https://arstechnica.com/tech-policy/2023/05/lawyer-cited-6-fake-cases-made-up-by-chatgpt-judge-calls-it-unprecedented/.
[4] Sébastien Bubeck, Varun Chandrasekaran, Ronen Eldan, et al. 2023. Sparks of artificial general intelligence: Early experiments with GPT-4. arXiv preprint arXiv:2303.12712.
[5] Valdemar Danry, Pat Pataranutaporn, Yaoli Mao, and Pattie Maes. 2023. Don’t just tell me, ask me: AI systems that intelligently frame explanations as questions improve human logical discernment accuracy over causal AI explanations. Proceedings of the 2023 CHI conference on human factors in computing systems, 1–13.
[6] Upol Ehsan, Q Vera Liao, Michael Muller, Mark O Riedl, and Justin D Weisz. 2021. Expanding explainability: Towards social transparency in AI systems. Proceedings of the 2021 CHI conference on human factors in computing systems, 1–19.
[7] Upol Ehsan, Samir Passi, Q Vera Liao, et al. 2021. The who in explainable AI: How AI background shapes perceptions of AI explanations. arXiv preprint arXiv:2107.13509.
[8] Upol Ehsan and Mark O. Riedl. 2022. Social construction of XAI: Do we need one definition to rule them all? arXiv preprint arXiv:2211.06499. Retrieved from https://arxiv.org/abs/2211.06499.
[9] Upol Ehsan, Koustuv Saha, Munmun De Choudhury, and Mark O Riedl. 2023. Charting the sociotechnical gap in explainable AI: A framework to address the gap in XAI. Proceedings of the ACM on Human-Computer Interaction 7, CSCW1: 1–32.
[10] Malin Eiband, Daniel Buschek, Alexander Kremer, and Heinrich Hussmann. 2019. The impact of placebic explanations on trust in intelligent systems. Extended abstracts of the 2019 CHI conference on human factors in computing systems, Association for Computing Machinery, 1–6.
[11] Raymond Fok and Daniel S. Weld. 2024. In search of verifiability: Explanations rarely enable complementary performance in AI-advised decision making. Retrieved from https://arxiv.org/abs/2305.07722.
[12] Andrew D Gordon, Carina Negreanu, José Cambronero, et al. 2023. Co-audit: Tools to help humans double-check AI-generated content. arXiv preprint arXiv:2310.01297.
[13] Shiyuan Huang, Siddarth Mamidanna, Shreedhar Jangam, Yilun Zhou, and Leilani H Gilpin. 2023. Can large language models explain themselves? A study of LLM-generated self-explanations. arXiv preprint arXiv:2310.11207.
[14] Harmanpreet Kaur, Harsha Nori, Samuel Jenkins, Rich Caruana, Hanna Wallach, and Jennifer Wortman Vaughan. 2020. Interpreting interpretability: Understanding data scientists’ use of interpretability tools for machine learning. Proceedings of the 2020 CHI conference on human factors in computing systems, Association for Computing Machinery, 1–14.
[15] Todd Kulesza, Margaret Burnett, Weng-Keen Wong, and Simone Stumpf. 2015. Principles of explanatory debugging to personalize interactive machine learning. Proceedings of the 20th international conference on intelligent user interfaces, 126–137.
[16] Todd Kulesza, Simone Stumpf, Margaret Burnett, Sherry Yang, Irwin Kwan, and Weng-Keen Wong. 2013. Too much, too little, or just right? Ways explanations impact end users’ mental models. 2013 IEEE symposium on visual languages and human centric computing, IEEE, 3–10.
[17] Ellen J Langer, Arthur Blank, and Benzion Chanowitz. 1978. The mindlessness of ostensibly thoughtful action: The role of “placebic” information in interpersonal interaction. Journal of personality and social psychology 36, 6: 635.
[18] Sunok Lee, Dasom Choi, Minha Lee, Jonghak Choi, and Sangsu Lee. 2023. Fostering youth’s critical thinking competency about AI through exhibition. Proceedings of the 2023 CHI conference on human factors in computing systems, Association for Computing Machinery.
[19] Q. Vera Liao and Jennifer Wortman Vaughan. 2024. AI Transparency in the Age of LLMs: A Human-Centered Research Roadmap. Harvard Data Science Review.
[20] Michael Xieyang Liu, Advait Sarkar, Carina Negreanu, et al. 2023. “What it wants me to say”: Bridging the abstraction gap between end-user programmers and code-generating large language models. Proceedings of the 2023 CHI conference on human factors in computing systems, 1–31.
[21] Scott M Lundberg and Su-In Lee. 2017. A unified approach to interpreting model predictions. Advances in neural information processing systems 30.
[22] Andreas Madsen, Sarath Chandar, and Siva Reddy. 2024. Are self-explanations from large language models faithful? Retrieved from https://arxiv.org/abs/2401.07927.
[23] Sara Merken. 2023. New York lawyers sanctioned for using fake ChatGPT cases in legal brief. Reuters. Retrieved from https://www.reuters.com/legal/new-york-lawyers-sanctioned-using-fake-chatgpt-cases-legal-brief-2023-06-22/.
[24] Tim Miller. 2019. Explanation in artificial intelligence: Insights from the social sciences. Artificial intelligence 267: 1–38.
[25] Nazneen Fatema Rajani, Bryan McCann, Caiming Xiong, and Richard Socher. 2019. Explain yourself! Leveraging language models for commonsense reasoning. arXiv preprint arXiv:1906.02361.
[26] Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. 2016. "Why should I trust you?" Explaining the predictions of any classifier. Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining, 1135–1144.
[27] Gavriel Salomon. 1988. AI in reverse: Computer tools that turn cognitive. Journal of educational computing research 4, 2: 123–139.
[28] Advait Sarkar. 2016. Interactive analytical modelling. University of Cambridge, Computer Laboratory.
[29] Advait Sarkar. 2022. Is explainable AI a race against model complexity? Workshop on Transparency and Explanations in Smart Systems (TeXSS), in conjunction with ACM Intelligent User Interfaces (IUI 2022), 192–199.
[30] Advait Sarkar. 2024. AI should challenge, not obey. Communications of the ACM (in press).
[31] Advait Sarkar, Mateja Jamnik, Alan F. Blackwell, and Martin Spott. 2015. Interactive visual machine learning in spreadsheets. 2015 IEEE symposium on visual languages and human-centric computing (VL/HCC), 159–163.
[32] Dane Sherburn, Bilal Chughtai, and Owain Evans. 2024. Language models struggle to explain themselves. Retrieved from https://openreview.net/forum?id=o6eUNPBAEc.
[33] Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. 2013. Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034.
[34] Miles Turpin, Julian Michael, Ethan Perez, and Samuel Bowman. 2024. Language models don’t always say what they think: Unfaithful explanations in chain-of-thought prompting. Advances in Neural Information Processing Systems 36.
[35] Jason Wei, Xuezhi Wang, Dale Schuurmans, et al. 2022. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems 35: 24824–24837.
[36] Matthew D Zeiler and Rob Fergus. 2014. Visualizing and understanding convolutional networks. Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part I 13, Springer, 818–833.
[37] Haiyan Zhao, Hanjie Chen, Fan Yang, et al. 2024. Explainability for large language models: A survey. ACM Transactions on Intelligent Systems and Technology 15, 2: 1–38.
[38] Jiachen Zhao, Zonghai Yao, Zhichao Yang, and Hong Yu. 2023. SELF-EXPLAIN: Teaching large language models to reason complex questions by themselves. arXiv preprint arXiv:2311.06985.