Beyond Transparency: Reimagining AI Interpretability Paradigms

By Ketaki Joshi

May 5, 2025 · 10 min read

Introduction

As AI systems become more deeply embedded in high-stakes domains like healthcare, finance, and legal systems, the need for transparency and accountability is more pressing than ever. Our team regularly reviews research that could shape how these systems are developed and deployed in the real world. 

A recent paper that caught our attention is "Interpretability Needs a New Paradigm" by Andreas Madsen, Himabindu Lakkaraju, Siva Reddy, and Sarath Chandar. The paper proposes a significant shift in the way we approach AI interpretability. While traditional approaches have focused on either building inherently interpretable models or creating post-hoc explanations for black-box systems, the authors argue that both approaches fall short when it comes to ensuring that explanations are truly faithful to how the model actually works. In this blog, we summarize the key takeaways from the paper and share our perspective on what these ideas could mean for the future of explainable AI.

Why the Paper Matters

Interpretability isn't just about building trust—it’s also about debugging models, discovering unknown biases, and making AI insights usable across disciplines. This paper makes a compelling case that the two dominant paradigms in interpretability—intrinsic and post-hoc—are fundamentally flawed in their ability to produce faithful explanations. It then outlines promising new directions that could help us build systems that explain themselves in more reliable and measurable ways.

Why Interpretability Is Needed

The authors begin by framing interpretability not as a luxury or optional add-on, but as an essential component in the deployment of responsible AI systems. They argue that interpretability enables practitioners to uncover, understand, and correct undesirable behaviors in machine learning models—particularly in settings where other diagnostic tools fall short.

  • Fairness metrics aren’t enough: While these metrics provide quantitative signals of bias, they are typically based on a narrow set of known protected attributes (such as gender or race). In real-world applications, however, these attributes are often unavailable, restricted by privacy laws, or incomplete. Moreover, fairness metrics fail to surface emergent biases—unexpected correlations between seemingly innocuous features and sensitive outcomes. Interpretability fills this gap by revealing what features the model is truly relying on. For Example: Amazon’s now-abandoned AI hiring tool had learned to penalize resumes that included terms like “women’s chess club.” Without interpretability methods surfacing those hidden patterns, such issues might have gone undetected.
  • Scientific utility and transparency: In research-heavy domains such as drug discovery, genomics, or climate modeling, interpretability isn't just about fairness—it’s about insight. Explanations can help experts form and refine hypotheses about complex biological or environmental systems. For example, attention heatmaps in molecular models can highlight specific atoms or substructures responsible for predicted toxicity, guiding more effective compound design.

Our view: We’ve seen firsthand how teams relying exclusively on quantitative fairness or performance metrics often miss critical issues—especially when working with messy, high-dimensional data. Interpretability serves as a diagnostic tool and a bridge between developers, domain experts, and regulators, offering a clearer, more holistic view of how AI systems make decisions.

The Current Paradigms of Interpretability

In exploring the foundations of AI interpretability, the authors categorize existing approaches into two dominant paradigms—intrinsic and post-hoc. Each paradigm stems from a fundamentally different design philosophy about where and how interpretability should be integrated into machine learning models.

  • Intrinsic interpretability emphasizes building models that are inherently understandable. Classic examples include decision trees, rule-based systems, and linear regression models. These architectures make their decision-making process transparent by design—each prediction is a direct result of an interpretable chain of logic or weights. This is especially valuable in fields where transparency is mandated or where end-users must be able to trace decisions (e.g., healthcare, compliance, public policy). However, these models often fall short when dealing with complex, high-dimensional data such as natural language, medical imaging, or financial time series. In such domains, their simplicity becomes a limitation, leading to reduced predictive performance.
  • Post-hoc interpretability, on the other hand, allows the use of high-performing black-box models by generating explanations after the fact. This includes methods like SHAP, LIME, counterfactual explanations, and gradient-based saliency maps. These tools attempt to answer questions like “Which features were most important for this prediction?” or “What would the model have done if this input feature had been different?” While these methods are flexible and widely adopted in practice, they come with a major caveat: faithfulness is not guaranteed. The explanation may reflect what seems important to a surrogate model or local approximation—but not what the model actually “thought.” In some cases, explanations may even contradict the model’s true behavior. (A minimal code sketch contrasting the two paradigms follows this list.)
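
To make the contrast concrete, here is a minimal sketch using scikit-learn. The dataset, model choices, and use of permutation importance as a stand-in for SHAP/LIME-style post-hoc attribution are illustrative assumptions, not a prescription from the paper: a shallow decision tree whose printed rules are the explanation, versus a black-box gradient-boosting model explained after the fact.

```python
# Illustrative contrast of the two paradigms (assumes scikit-learn is installed).
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Intrinsic paradigm: a shallow decision tree whose printed rules ARE the explanation.
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)
print(export_text(tree, feature_names=list(X.columns)))

# Post-hoc paradigm: a black-box model explained after the fact.
# Permutation importance stands in here for SHAP/LIME-style attribution.
blackbox = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)
result = permutation_importance(blackbox, X_test, y_test, n_repeats=10, random_state=0)
top = sorted(zip(X.columns, result.importances_mean), key=lambda t: t[1], reverse=True)[:5]
print(top)
```

The tree's rules are faithful by construction but limited in capacity, while the importance scores describe the black-box model's behavior only indirectly—exactly the tension the authors highlight.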

The authors make it clear that both paradigms have substantial limitations:

  • Intrinsic models, while transparent, often underperform on real-world tasks that require nuanced feature representations.
  • Post-hoc methods, while powerful and flexible, risk producing explanations that are misleading or oversimplified.

Our take: This categorization effectively illustrates a dilemma we frequently observe in real-world machine learning teams: choosing between interpretability and performance. Teams are often forced into this trade-off, where using interpretable models means sacrificing accuracy, and using accurate models means relying on potentially unfaithful explanations. What we appreciate about this paper is its clarity in naming this tension and suggesting that it’s time to move beyond it. Rather than forcing a binary choice, the future of interpretability should aim to integrate performance and faithfulness in a principled, measurable way.

Why Interpretability Needs a New Paradigm

The Case Against the Intrinsic Paradigm

Although the intrinsic paradigm promises transparency by design, it suffers from several critical limitations that hinder its practical adoption.

  • Performance trade-offs: Intrinsically interpretable models often sacrifice accuracy to remain human-understandable. In many real-world applications—like image recognition, speech processing, or language modeling—simple models such as decision trees or linear regressions cannot compete with the predictive power of deep neural networks. As a result, businesses and researchers may avoid these interpretable models despite their transparency.
  • Incomplete transparency: Even models that claim to be interpretable often contain subcomponents or mechanisms that are not easily understood. For example, hybrid models that integrate attention layers or neural modules within otherwise interpretable frameworks may offer only partial insight. In such cases, we lose the holistic view required to truly trust the model.

For instance: Attention-based models in NLP, such as transformers, were initially praised for their transparency due to the visibility of attention weights. However, subsequent research showed that attention weights do not reliably indicate which inputs were important for a model’s decision (Jain & Wallace, 2019). This has led to debates over whether such models are truly interpretable or simply appear to be so on the surface.
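
As a toy illustration of how such faithfulness questions are probed, the sketch below extracts attention weights from a single attention layer and computes gradient-based saliency for the same input. The layers, dimensions, and random input are arbitrary assumptions; the point is only that the two importance rankings need not agree, which is the crux of the Jain & Wallace critique.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
embed_dim, seq_len = 16, 6

# Toy, untrained components -- just enough to produce the two quantities
# that faithfulness studies compare, not a realistic NLP model.
attn = nn.MultiheadAttention(embed_dim, num_heads=1, batch_first=True)
readout = nn.Linear(embed_dim, 1)

tokens = torch.randn(1, seq_len, embed_dim, requires_grad=True)

# "Explanation" 1: the attention distribution over input tokens.
attn_out, attn_weights = attn(tokens, tokens, tokens, need_weights=True)
score = readout(attn_out.mean(dim=1)).sum()

# "Explanation" 2: gradient-based saliency for the very same score.
score.backward()
saliency = tokens.grad.norm(dim=-1).squeeze(0)

print("attention over tokens:", attn_weights.mean(dim=1).squeeze(0).detach())
print("gradient saliency:    ", saliency)
```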

The Case Against the Post-Hoc Paradigm

Post-hoc interpretability methods offer the flexibility to explain any model architecture but face their own challenges, especially concerning the faithfulness of explanations.

  • Lack of fidelity: Many popular methods such as saliency maps, SHAP, and LIME have been found to produce explanations that don’t accurately reflect the model’s internal decision-making process. Instead, they generate plausible-sounding justifications that may align with user expectations, but not with how the model truly works.
  • Manipulability and inconsistency: Explanations can often be manipulated or vary widely depending on hyperparameters or the random seed used in training, reducing their reliability. Moreover, different explanation methods can yield contradictory interpretations for the same input, leaving users uncertain about which to trust.

For instance: In image classification, saliency maps sometimes highlight irrelevant areas—such as background textures or watermarks—rather than the object being classified. In one study, a model trained to distinguish huskies from wolves was found to be relying heavily on snowy backgrounds in wolf photos rather than on the animal itself. This misleading focus was revealed only through interpretability tools, yet the episode also demonstrated how fragile and easily misread such post-hoc explanations can be (Ribeiro et al., 2016).

While intrinsic methods prioritize transparency and post-hoc techniques offer versatility, neither approach reliably provides explanations that are both faithful and generalizable across different contexts. Each is limited either by performance constraints or by a disconnect between explanation and actual model behavior. As AI systems become more complex and embedded in high-stakes domains, these limitations signal a growing need for fundamentally new paradigms—ones that emphasize measurable faithfulness, robustness, and alignment with real-world interpretability needs.

Are New Paradigms Possible?

Recognizing the shortcomings of the intrinsic and post-hoc paradigms, researchers have begun exploring new frameworks that prioritize faithfulness without sacrificing performance. These emerging paradigms seek to better align explanations with a model’s actual reasoning process and offer concrete pathways to build trust in AI systems. The authors introduce three promising paradigms aimed at overcoming the limitations of both intrinsic and post-hoc methods; we then discuss a complementary direction introduced by our team, DL-Backtrace.

1. Learn-to-Faithfully-Explain Paradigm

In this paradigm, models are trained not only to make predictions but also to generate explanations that are faithful by design. The core idea is to optimize both the prediction and the explanation simultaneously, encouraging the model to expose its own reasoning process.

This is typically achieved by designing a dual-objective loss function: one part optimizes for predictive accuracy, while the other ensures that the explanation aligns with the internal logic of the model. If the explanation fails to reflect the factors that truly influenced the prediction, the model is penalized.
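A minimal sketch of this idea, assuming a rationale-style (select-then-predict) setup in PyTorch: a selector produces a per-feature mask that serves as the explanation, the predictor only ever sees the masked input, and the loss combines task accuracy with a sparsity term on the mask. The architecture, dimensions, and loss weight below are illustrative assumptions, not the paper's specific formulation.

```python
import torch
import torch.nn as nn

class SelectThenPredict(nn.Module):
    """Toy rationale-style model: the selector proposes a soft per-feature
    mask (the explanation), and the predictor only sees the masked input,
    so the mask is faithful by construction to what the prediction could
    have depended on."""
    def __init__(self, n_features: int, n_classes: int):
        super().__init__()
        self.selector = nn.Sequential(nn.Linear(n_features, n_features), nn.Sigmoid())
        self.predictor = nn.Sequential(nn.Linear(n_features, 64), nn.ReLU(),
                                       nn.Linear(64, n_classes))

    def forward(self, x):
        mask = self.selector(x)            # per-feature relevance in [0, 1]
        logits = self.predictor(x * mask)  # prediction uses only the selected evidence
        return logits, mask

model = SelectThenPredict(n_features=20, n_classes=2)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
task_loss_fn = nn.CrossEntropyLoss()
sparsity_weight = 0.05                     # illustrative trade-off coefficient

x = torch.randn(32, 20)                    # stand-in batch
y = torch.randint(0, 2, (32,))

optimizer.zero_grad()
logits, mask = model(x)
# Dual objective: predict well AND keep the explanation sparse, so the
# highlighted features are the ones the prediction actually relies on.
loss = task_loss_fn(logits, y) + sparsity_weight * mask.mean()
loss.backward()
optimizer.step()
```
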

For instance: In healthcare, consider a diagnostic AI that analyzes chest X-rays to detect pneumonia. Under the learn-to-faithfully-explain paradigm, the model is trained to highlight regions in the lungs that are medically relevant indicators of pneumonia. If it instead highlights unrelated areas (e.g., shoulder bones), it receives a penalty. This joint training ensures that explanations like saliency maps genuinely reflect what the model uses to make its decision.

This paradigm offers a promising balance between performance and interpretability, although care must be taken to avoid the explainer simply mimicking the outcome without capturing true causal reasoning.

2. Faithfulness-Measurable Model Paradigm

Rather than requiring the architecture itself to be interpretable, this paradigm focuses on designing models in a way that allows the faithfulness of explanations to be explicitly measured. These models may remain black-box in architecture, but they are structured to support diagnostic checks that can validate whether explanations are truly aligned with internal behavior.

One approach involves perturbation testing, where input features highlighted in an explanation are removed or modified to test if the model’s output changes accordingly. If the model still makes the same decision despite altered or removed key inputs, the explanation is likely unfaithful.

For instance: In fraud detection, an AI model might flag a transaction as suspicious due to location and purchase amount. Using the faithfulness-measurable paradigm, we can systematically remove or alter those features to see if the model still flags the transaction. If it does, then the explanation is called into question. This method tests whether the factors the model "says" influenced its decision actually did.
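
A minimal sketch of such a perturbation test is below. The function name, the zero baseline, and the binary-classifier assumption are illustrative choices, not the paper's prescribed metric.

```python
import numpy as np

def faithfulness_drop(predict_proba, x, attribution, top_k=3, baseline=0.0):
    """Rough perturbation test: remove the features an explanation ranks as
    most important and measure how much the model's positive-class score
    actually drops. A faithful explanation should produce a large drop;
    a negligible drop calls the explanation into question."""
    original = predict_proba(x[None, :])[0, 1]
    top_features = np.argsort(np.abs(attribution))[::-1][:top_k]
    x_perturbed = x.copy()
    x_perturbed[top_features] = baseline      # delete the supposedly decisive evidence
    perturbed = predict_proba(x_perturbed[None, :])[0, 1]
    return original - perturbed
```

Here `predict_proba` is any binary classifier's probability function and `attribution` is the per-feature importance vector produced by whatever explanation method is being audited.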

This paradigm enables post-training evaluation and debugging of explanations, providing a layer of accountability even for high-performance models.

3. Self-Explaining Model Paradigm

In self-explaining models, the model itself outputs both a prediction and a human-readable explanation. This approach is especially relevant in the context of large language models (LLMs), which can generate natural language justifications alongside predictions or answers.

The key advantage here is that explanations are embedded into the generative process, potentially making them more accessible to non-expert users. However, the major challenge is ensuring that these natural-language explanations are faithful to the underlying reasoning and not merely plausible-sounding narratives.

For instance: Consider ChatGPT or another conversational AI tool that responds to a question with both an answer and an explanation. It may state that a specific historical event caused a policy change and explain the context. While the explanation may sound convincing, it might not actually reflect the internal statistical patterns that led to the response. In some cases, it could even be fabricated (a phenomenon known as "hallucination").
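
One simple way to sanity-check a self-explanation is a consistency probe, sketched below. The `generate` helper is a hypothetical stand-in for any LLM completion call, and the prompt wording is an assumption; this is a heuristic check, not a faithfulness guarantee or a method from the paper.

```python
def generate(prompt: str) -> str:
    """Hypothetical stand-in for any LLM completion call; not a real API."""
    raise NotImplementedError

def self_explanation_probe(question: str) -> dict:
    """Ask for an answer plus the single factor the model claims it relied on,
    then re-ask with that factor explicitly off-limits. If the answer never
    changes, the stated factor probably did not drive the original answer."""
    first = generate(
        f"{question}\nAnswer briefly, then on a new line starting with "
        f"'FACTOR:' state the single most important factor behind your answer."
    )
    answer, factor = first.split("FACTOR:", 1)
    second = generate(
        f"{question}\nAnswer briefly, but do NOT rely on the following "
        f"consideration: {factor.strip()}"
    )
    return {"answer": answer.strip(),
            "claimed_factor": factor.strip(),
            "answer_without_factor": second.strip()}
```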

Ongoing research aims to align these explanations more closely with model internals—through training objectives, reinforcement learning, or auxiliary supervision—so that they are not only coherent but also grounded in the actual decision-making logic.

Limitations

While the paper emphasizes faithfulness, it acknowledges the need to consider human-understandability—whether explanations are usable or helpful to non-experts.

Key limitations include:

  • Faithful explanations may lack human relevance: Even if an explanation accurately represents a model's logic, it might be conveyed in terms that are meaningless to the end-user (e.g., neural weight activations or gradients).
  • Misalignment with user expectations: Practitioners in domains like healthcare, law, or finance need explanations in forms they can reason about, such as clinical insights, legal justifications, or financial heuristics.

For instance: A doctor using an AI diagnostic tool might receive a saliency map showing pixel-level activations. While faithful, this may be unhelpful unless converted into language like "increased opacity in the lower-left lung segment indicating potential fluid buildup."

  • Subjectivity in interpretability: What counts as a "good" explanation varies between users. A data scientist may prefer feature attributions, while a policymaker may need plain-language summaries.
  • Need for user-centered design: Future research must consider the background knowledge, goals, and decision-making context of different end-user groups (Schut et al., 2023).

Ultimately, interpretability should balance both faithfulness and comprehensibility. Bridging this gap is a critical challenge for ensuring that AI systems are not just technically sound but also practically trustworthy.

DL-Backtrace: A New Direction for Interpretability

One of the most promising additions to this new wave of interpretability paradigms is DL-Backtrace, introduced in our recent paper, "DLBacktrace: A Model-Agnostic Explainability for Any Deep Learning Models." This technique fundamentally rethinks how we trace decisions back through deep learning systems, offering a compelling alternative to post-hoc explainability.

Unlike traditional post-hoc methods that attempt to approximate a model’s reasoning after the fact, DL-Backtrace directly computes the influence of specific inputs on the final prediction by traversing the model’s execution path backward—from output to input. This reverse traversal is not an approximation but a precise reconstruction of which parts of the input space were functionally critical to the model's decision. It operates directly on the computational graph, using internal gradients and activations to identify causally relevant pathways.
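
As a rough illustration of the backward-traversal idea, the sketch below redistributes an output score back to the inputs of a small ReLU network, layer by layer, in proportion to each unit's contribution. This is a generic relevance-propagation sketch written for this post, not the DLBacktrace library's actual API or algorithm.

```python
import numpy as np

def backward_relevance(weights, biases, x, target):
    """Illustrative backward relevance pass for a ReLU MLP. `weights` and
    `biases` are lists of layer parameters, `x` is a 1-D input, and `target`
    is the output unit whose score is traced back to the input features."""
    # Forward pass, keeping every layer's activations.
    activations = [x]
    for W, b in zip(weights, biases):
        activations.append(np.maximum(activations[-1] @ W + b, 0.0))

    # Seed the backward pass with the chosen output unit's activation.
    relevance = np.zeros_like(activations[-1])
    relevance[target] = activations[-1][target]

    # Walk the network backward, redistributing relevance in proportion to
    # each unit's contribution to the layer above.
    for W, b, a in zip(reversed(weights), reversed(biases), reversed(activations[:-1])):
        z = a @ W + b + 1e-9          # total input received by each upper-layer unit
        s = relevance / z             # relevance per unit of contribution
        relevance = a * (s @ W.T)     # share flowing back to the layer below
    return relevance                  # per-input-feature relevance scores

# Tiny usage example with random parameters.
rng = np.random.default_rng(0)
weights = [rng.standard_normal((8, 16)), rng.standard_normal((16, 3))]
biases = [rng.standard_normal(16), rng.standard_normal(3)]
print(backward_relevance(weights, biases, rng.standard_normal(8), target=1))
```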

How DL-Backtrace Surpasses Traditional Methods

  • Faithfulness Over Approximation: Post-hoc methods like SHAP and LIME approximate feature importance via surrogate models or local perturbations. DL-Backtrace, on the other hand, works on the original model without needing approximation layers, ensuring explanations are more causally grounded and faithful to the actual decision logic.
  • Resilience to Manipulation: Because DL-Backtrace analyzes the real computational trace, it’s less susceptible to adversarial manipulation or inconsistencies that often plague post-hoc methods relying on model probing.
  • Model-Agnostic Yet Mechanistic: DL-Backtrace doesn’t require a specially trained explainer model and can be applied across a wide range of architectures—including transformers and convolutional networks—making it both broadly applicable and technically robust.

For instance: In image classification tasks, where saliency maps often highlight noisy or irrelevant regions, DL-Backtrace can pinpoint the exact neurons and spatial locations that materially contributed to the classification output. This gives a crisper, more trustworthy picture of why a model thinks an image is, say, a cat instead of a dog—not based on surrounding pixels or textures, but on the core object features.

Conclusion

"Interpretability Needs a New Paradigm" challenges the status quo in explainability research and pushes us to think beyond traditional boundaries. The authors argue for a more rigorous and creative approach that integrates performance and faithfulness without compromise. Though early, the proposed paradigms offer a glimpse into how future AI systems might be designed from the ground up to explain themselves.

As the field evolves, we must stay vigilant—not just about building systems that seem interpretable, but about ensuring their explanations truly reflect how they think. Because when lives, laws, or livelihoods are on the line, understanding why a model makes a decision is just as important as what it decides.
