Beyond Transparency: Reimagining AI Interpretability Paradigms

Article

By

Ketaki Joshi

10 minutes

May 5, 2025


Introduction

As increasingly advanced AI systems find their way deeper into high-stakes areas such as healthcare, finance, and legal frameworks, the demand for transparency and accountability is greater than ever before. This is especially true for AI technologies used in regulated industries, where understanding and observability of complex models are critical for compliance and trust. Our team regularly reviews research that could shape how these systems are developed and deployed in the real world.

A recent paper that caught our attention is “Interpretability Needs a New Paradigm” by Andreas Madsen, Himabindu Lakkaraju, Siva Reddy, and Sarath Chandar. The paper proposes a significant shift in the way we approach AI interpretability, highlighting interpretable artificial intelligence as a key area of focus. While traditional approaches have focused on either building inherently interpretable models or creating post-hoc explanations for black-box systems, the authors argue that both approaches fall short when it comes to ensuring that explanations are truly faithful to how the model actually works. In this blog, we summarize the key takeaways from the paper and share our perspective on what these ideas could mean for the future of explainable AI.

Why the Paper Matters

Interpretability isn’t just about building trust, it’s also about debugging models, discovering unknown biases, and making AI insights usable across disciplines. Biases and fairness issues in AI models are usually rooted in the training data, since patterns and biases present in that data directly shape model predictions. Understanding the relationship between training data and model decisions is therefore essential for ensuring fairness and transparency. Additionally, model transparency helps surface hidden patterns within the model, making it easier to evaluate and monitor AI systems, especially in high-stakes domains.

This paper makes a compelling case that the two dominant paradigms in interpretability—intrinsic and post-hoc—are fundamentally flawed in their ability to produce faithful explanations. It outlines promising new directions that could help us build systems that explain themselves in more reliable and measurable ways. Model explainability is essential for bridging the gap between developers, domain experts, and regulators, ensuring that AI decision-making processes are understandable and trustworthy.

Why is Interpretability Needed

The authors start by positioning interpretability not as a luxury or an optional add-on, but as a necessary part of rolling out responsible AI systems. They contend that interpretability allows practitioners to discover, comprehend, and fix unwanted patterns in machine learning models, especially in environments where other diagnostic tools fail.

  • Metrics of fairness are not sufficient: Although such metrics give quantitative warnings of bias, they are generally drawn from a limited catalogue of known protected features (e.g., gender or race). In practice, however, such features are frequently unavailable, subject to privacy protections, or incomplete. Furthermore, fairness metrics do not expose emergent biases—surprising correlations between apparently harmless features and sensitive labels. Interpretability bridges this gap by showing what the model is actually relying on.

For instance: Amazon's abandoned AI recruitment tool had learned to penalize resumes containing phrases like "women's chess club." Without interpretability techniques unveiling those underlying patterns, such problems might have gone unnoticed.

  • Scientific usefulness and clarity: In research-intensive areas like drug discovery, genomics, or climate modeling, explainability isn't only about equity—it's about insight. Explanations can allow specialists to develop and test hypotheses regarding intricate biological or environmental systems. As an illustration, attention heatmaps in molecular models can point out individual atoms or substructures contributing to anticipated toxicity, and more efficient compound design can follow.

Our view: We’ve seen firsthand how teams relying exclusively on quantitative fairness or performance metrics often miss critical issues, especially when working with messy, high-dimensional data. Interpretability serves as a diagnostic tool and a bridge between developers, domain experts, and regulators, offering a clearer, more holistic view of how AI systems make decisions.

The Current Paradigms of Interpretability

In investigating the roots of AI interpretability, the authors group available methods into two prevailing paradigms—intrinsic and post-hoc. Each paradigm is the product of an underlying design philosophy regarding where and how to embed interpretability within machine learning models.

  • Intrinsic interpretability prioritizes building models that are understandable in themselves. Traditional examples are decision trees, rule-based systems, and linear regression models. These architectures make the decision-making process transparent by design—every prediction is the direct consequence of an understandable chain of rules or weights. This is particularly useful in applications where transparency is required or where end-users need to be able to trace decisions (e.g., healthcare, compliance, public policy). These models tend to fall short, however, when faced with nuanced, high-dimensional data like natural language, medical imaging, or financial time series; in these areas, their simplicity becomes a limitation and predictive performance suffers.
  • Post-hoc interpretability, in contrast, allows high-performing black-box models to be used by producing explanations retroactively. This encompasses techniques such as SHAP, LIME, counterfactual explanations, and gradient-based saliency maps. These methods seek to answer questions such as "Which features were most significant for this prediction?" or "What would the model have done if this input feature were different?" Although these approaches are versatile and common practice, there is a significant caveat: faithfulness is not guaranteed. The explanation might capture what appears significant to a surrogate model or local approximation, but not what the model really "thought." In some instances, explanations might even contradict the model's actual behavior. (A minimal sketch contrasting the two paradigms follows this list.)
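
To make the contrast concrete, here is a minimal sketch, assuming scikit-learn and the shap package are installed; the synthetic data and model choices are illustrative rather than drawn from the paper. The decision tree's full rule set can be printed directly (intrinsic), while the random forest is explained after the fact with SHAP (post-hoc).

```python
import numpy as np
import shap
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic data: 5 features, label driven mainly by features 0 and 1.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

# Intrinsic paradigm: a shallow decision tree whose entire rule set is readable.
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
print(export_text(tree, feature_names=[f"f{i}" for i in range(5)]))

# Post-hoc paradigm: a black-box ensemble explained retroactively with SHAP.
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
explainer = shap.Explainer(forest.predict, X[:100])  # model-agnostic explainer
attributions = explainer(X[:5])
print(attributions.values)  # per-feature attributions: an approximation of the
                            # model, not a guaranteed-faithful account of it
```

The tree's printout is the explanation itself; the SHAP values are an estimate layered on top of the forest, which is exactly where the faithfulness concern arises.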

The authors clearly explain that both paradigms have serious weaknesses:

  • Intrinsic models, as transparent as they are, tend to perform poorly on real tasks that call for rich feature representations. 
  • Post-hoc solutions, as flexible and strong as they are, can generate explanations that are oversimplified or even misleading.

Our response: This dichotomy neatly captures a trade-off we often see in real machine learning teams: interpretability vs. performance. Teams are often forced into this trade-off, where using interpretable models means sacrificing accuracy, and using accurate models means relying on potentially unfaithful explanations. What we appreciate about this paper is its clarity in naming this tension and suggesting that it’s time to move beyond it. Rather than forcing a binary choice, the future of interpretability should aim to integrate performance and faithfulness in a principled, measurable way.

Technical Challenges in AI Interpretability

As artificial intelligence continues to advance, the technical challenges of AI interpretability have become increasingly complex and consequential. At the heart of the issue is the sheer complexity of modern AI models—especially those built on deep learning and large language models. These systems, often described as black box models, can achieve remarkable accuracy but make it extremely difficult for human users to fully understand their inner workings or the logic behind a specific decision.

One of the most persistent challenges is the lack of transparency in these black box models. Deep neural networks, for example, may contain millions or even billions of parameters, making it nearly impossible to trace how input data is transformed into a model’s output. This opacity limits our ability to explain AI decisions, especially in high-stakes domains like medical diagnosis or financial forecasting, where understanding the reasoning behind a model’s prediction is as important as the prediction itself.

Another significant hurdle is the absence of standardized explainability techniques that work across all types of AI systems. While a variety of methods—such as saliency maps, feature importance scores, and partial dependence plots—have been developed to shed light on model behavior, each technique has its own strengths and limitations. The effectiveness of a given approach often depends on the specific model architecture, the nature of the data, and the context in which the AI system is deployed. For instance, in medical diagnosis, clinicians require clear, actionable explanations that can be trusted, whereas in data mining or image recognition, the focus may be more on optimizing model performance and accuracy.
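
As a small illustration of one such technique, the sketch below computes a partial dependence result with scikit-learn; the synthetic dataset is purely illustrative, and the exact keys of the returned object may vary slightly across scikit-learn versions.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import partial_dependence

# Synthetic data in which feature 2 drives the label.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 4))
y = (X[:, 2] > 0).astype(int)

clf = GradientBoostingClassifier(random_state=0).fit(X, y)

# Average model response as feature 2 is varied over a grid,
# marginalizing over the remaining features.
pd_result = partial_dependence(clf, X, features=[2])
print(pd_result["average"])  # one partial dependence curve per output
```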

The rise of transfer learning and the widespread use of pre-trained models have further complicated interpretability. When AI models are trained on vast, diverse datasets and then fine-tuned for specific tasks, the knowledge embedded within them becomes even more difficult to unpack. This makes it challenging for both developers and end users to understand how a model’s prior experiences influence its current decision making processes.

To address these issues, researchers are exploring a range of new explainable AI (XAI) techniques. Model-agnostic methods, for example, aim to provide explanations that are not tied to any specific model architecture, while visualization tools help users see how neural networks process data at different layers. Glass box models, which are designed to be transparent by default, offer another promising direction, though they often come with trade-offs in model accuracy and scalability.

Despite these advances, the field still faces several unresolved technical challenges. The classic trade-off between model accuracy and interpretability remains a central concern: the most accurate models are often the least transparent, while simpler, more interpretable models may not perform as well on complex tasks. Additionally, the lack of standardization in explainability techniques means that results can vary widely between different tools and approaches, making it difficult to compare or validate explanations across systems.

Ultimately, overcoming these technical challenges will require ongoing research, collaboration, and innovation in AI development. By prioritizing transparency, interpretability, and explainability, we can build AI systems that not only deliver high performance but also empower human users to make good decisions, build trust, and fully understand the outputs created by artificial intelligence. Whether in business, healthcare, or beyond, the future of responsible AI depends on our ability to bridge the gap between complex models and clear, actionable explanations.

Why Interpretability Needs a New Paradigm

The Case Against the Intrinsic Paradigm

Although the intrinsic paradigm promises transparency by design, it suffers from several critical limitations that hinder its practical adoption.

  • Performance trade-offs: Intrinsically interpretable models often sacrifice accuracy to remain human-understandable. In many real-world applications—like image recognition, speech processing, or language modeling—simple models such as decision trees or linear regressions cannot compete with the predictive power of deep neural networks. As a result, businesses and researchers may avoid these interpretable models despite their transparency.
  • Incomplete transparency: Even models that claim to be interpretable often contain subcomponents or mechanisms that are not easily understood. For example, hybrid models that integrate attention layers or neural modules within otherwise interpretable frameworks may offer only partial insight. In such cases, we lose the holistic view required to truly trust the model.

For instance: Attention-based models in NLP, such as transformers, were initially praised for their transparency due to the visibility of attention weights. However, subsequent research showed that attention weights do not reliably indicate which inputs were important for a model’s decision (Jain & Wallace, 2019). This has led to debates over whether such models are truly interpretable or simply appear to be so on the surface.
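
The sketch below shows how attention weights are typically extracted with the Hugging Face transformers library (the checkpoint name is a common public model, used purely for illustration); the caveat in the final comment is the crux of the Jain & Wallace finding.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_attentions=True)

inputs = tokenizer("The movie was surprisingly good.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# outputs.attentions: one tensor per layer, shape (batch, heads, tokens, tokens).
last_layer = outputs.attentions[-1]
print(last_layer.mean(dim=1)[0])  # head-averaged attention for this example

# Caveat: a token receiving high attention is not, by itself, evidence that the
# token drove the prediction; attention weights are not a faithful attribution.
```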

The Case Against the Post-Hoc Paradigm

Post-hoc interpretability methods offer the flexibility to explain any model architecture but face their own challenges, especially concerning the faithfulness of explanations.

  • Lack of fidelity: Many popular methods such as saliency maps, SHAP, and LIME have been found to produce explanations that don’t accurately reflect the model’s internal decision-making process. Instead, they generate plausible-sounding justifications that may align with user expectations, but not with how the model truly works.
  • Manipulability and inconsistency: Explanations can often be manipulated or vary widely depending on hyperparameters or the random seed used in training, reducing their reliability. Moreover, different explanation methods can yield contradictory interpretations for the same input, leaving users uncertain about which to trust.

For instance: In image classification, saliency maps sometimes highlight irrelevant areas—such as background textures or watermarks—rather than the object being classified. In one study, a model trained to distinguish huskies from wolves was found to be relying heavily on snowy backgrounds in wolf photos rather than on the animal itself. This misleading focus was revealed through interpretability tools, and the case also demonstrated the fragility of such post-hoc explanations (Ribeiro et al., 2016).

While intrinsic methods prioritize transparency and post-hoc techniques offer versatility, neither approach reliably provides explanations that are both faithful and generalizable across different contexts. Each is limited either by performance constraints or by a disconnect between explanation and actual model behavior. As AI systems become more complex and embedded in high-stakes domains, these limitations signal a growing need for fundamentally new paradigms—ones that emphasize measurable faithfulness, robustness, and alignment with real-world interpretability needs.

Are New Paradigms Possible?

Recognizing the shortcomings of the intrinsic and post-hoc paradigms, researchers have begun exploring new frameworks that prioritize faithfulness without sacrificing performance. These emerging paradigms aim to better align explanations with a model’s actual reasoning processes and offer concrete pathways to build trust in AI systems. The authors introduce three promising paradigms that aim to overcome the limitations of both intrinsic and post-hoc methods, along with a novel direction introduced by our team.

1. Learn-to-Faithfully-Explain Paradigm

In this paradigm, models are trained not only to make predictions but also to generate explanations that are faithful by design. The principle is to optimize prediction and explanation jointly, so that the model is incentivized to reveal its own reasoning process.

This is generally done by constructing a dual-objective loss function: one term optimizes predictive accuracy, while the other ensures that the explanation reflects the model's internal logic. If the explanation fails to capture the factors that actually contributed to the prediction, the model is penalized.

For instance, suppose a diagnostic AI is checking chest X-rays for pneumonia. Under the learn-to-faithfully-explain framework, the model learns to point out areas in the lungs that are medically significant indicators of pneumonia. If it points out irrelevant areas (e.g., shoulder bones), it is penalized. This joint training encourages explanations such as saliency maps to capture what the model actually relies upon to make a prediction.
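
A minimal sketch of such a dual-objective loss, written in PyTorch, is shown below. The model interface (returning both logits and a per-feature explanation mask) and the KL-based faithfulness penalty are illustrative assumptions, not the specific formulation used in the paper.

```python
import torch
import torch.nn.functional as F

def joint_loss(model, x, y, lam=0.5):
    """Prediction loss plus a penalty for unfaithful explanations (illustrative)."""
    logits, mask = model(x)                 # assumed interface: mask in [0, 1], same shape as x
    pred_loss = F.cross_entropy(logits, y)  # standard prediction objective

    # Faithfulness term: the explanation-masked input alone should reproduce the
    # original prediction, so the mask is penalized if it hides features the model used.
    masked_logits, _ = model(x * mask)
    faith_loss = F.kl_div(
        F.log_softmax(masked_logits, dim=-1),
        F.softmax(logits.detach(), dim=-1),
        reduction="batchmean",
    )
    return pred_loss + lam * faith_loss
```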

This paradigm offers a promising balance between performance and interpretability, but care must be taken that the explainer does not simply reproduce the prediction without capturing genuine causal reasoning.

2. Faithfulness-Measurable Model Paradigm

Rather than requiring the architecture itself to be interpretable, this paradigm focuses on designing models in a way that allows the faithfulness of explanations to be explicitly measured. These models may remain black-box in architecture, but they are structured to support diagnostic checks that can validate whether explanations are truly aligned with internal behavior.

One approach involves perturbation testing, where input features highlighted in an explanation are removed or modified to test if the model’s output changes accordingly. If the model still makes the same decision despite altered or removed key inputs, the explanation is likely unfaithful.

For instance, in fraud detection, an AI model might flag a transaction as suspicious due to location and purchase amount. Using the faithfulness-measurable paradigm, we can systematically remove or alter those features to see if the model still flags the transaction. If it does, the explanation is called into question. This method verifies that the features the model "says" influenced its decision actually did.
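
A minimal sketch of such a perturbation check is shown below, assuming a scikit-learn-style classifier with a `predict_proba` method; the mean-imputation strategy and the threshold are illustrative choices, not prescribed by the paper.

```python
import numpy as np

def perturbation_check(model, x_row, X_background, top_features, threshold=0.1):
    """Return True if neutralizing the explained features shifts the score enough."""
    baseline = model.predict_proba(x_row.reshape(1, -1))[0, 1]

    x_perturbed = x_row.copy()
    for idx in top_features:                            # feature indices the explanation flagged
        x_perturbed[idx] = X_background[:, idx].mean()  # replace with a neutral value

    perturbed = model.predict_proba(x_perturbed.reshape(1, -1))[0, 1]
    # If the score barely moves, the explanation is likely unfaithful.
    return abs(baseline - perturbed) >= threshold
```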

This paradigm enables post-training evaluation and debugging of explanations, providing a layer of accountability even for high-performance models.

3. Self-Explaining Model Paradigm

In self-explaining models, the model itself outputs both a prediction and a human-readable explanation. This approach is especially relevant in the context of large language models (LLMs), which can generate natural language justifications alongside predictions or answers.

The key advantage here is that explanations are embedded into the generative process, potentially making them more accessible to non-expert users. However, the major challenge is ensuring that these natural-language explanations are faithful to the underlying reasoning and not merely plausible-sounding narratives.

For instance, consider ChatGPT or another conversational AI tool that responds to a question with both an answer and an explanation. It may state that a specific historical event caused a policy change and explain the context. While the explanation may sound convincing, it might not actually reflect the internal statistical patterns that led to the response. In some cases, it could even be fabricated (a phenomenon known as "hallucination").
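
As a small illustration, the sketch below prompts a generative model for both an answer and an explanation using the Hugging Face transformers pipeline; the checkpoint is a small public placeholder, and the final comment states the faithfulness concern described above.

```python
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")  # small public placeholder model
prompt = (
    "Question: Why did the policy change?\n"
    "Answer the question, then explain your reasoning step by step:\n"
)
output = generator(prompt, max_new_tokens=80, do_sample=False)[0]["generated_text"]
print(output)

# The generated "explanation" is a plausible narrative produced by the same text
# generator; nothing guarantees it reflects the computation behind the answer.
```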

Ongoing research aims to align these explanations more closely with model internals—through training objectives, reinforcement learning, or auxiliary supervision—so that they are not only coherent but also grounded in the actual decision-making logic.

Limitations

While the paper emphasizes faithfulness, it also acknowledges the need to consider human-understandability: whether explanations are usable or helpful to non-experts.

Key limitations include:

  • Faithful explanations may lack human relevance: Even if an explanation accurately represents a model's logic, it might be conveyed in terms that are meaningless to the end-user (e.g., neural weight activations or gradients).
  • Misalignment with user expectations: Practitioners in domains like healthcare, law, or finance need explanations in forms they can reason about, such as clinical insights, legal justifications, or financial heuristics.

For instance: A doctor using an AI diagnostic tool might receive a saliency map showing pixel-level activations. While faithful, this may be unhelpful unless converted into language like "increased opacity in the lower-left lung segment indicating potential fluid buildup."

  • Subjectivity in interpretability: What counts as a "good" explanation varies between users. A data scientist may prefer feature attributions, while a policymaker may need plain-language summaries.
  • Need for user-centered design: Future research must consider the background knowledge, goals, and decision-making context of different end-user groups (Schut et al., 2023).

Ultimately, interpretability should balance both faithfulness and comprehensibility. Bridging this gap is a critical challenge for ensuring that AI systems are not just technically sound but also practically trustworthy.

DL-Backtrace: A New Direction for Deep Learning Interpretability

One of the most promising additions to the new paradigm of interpretability is DL-Backtrace, introduced in our recent paper DLBacktrace: A Model-Agnostic Explainability for Any Deep Learning Models. This technique is designed to provide transparent and interpretable explanations for any machine learning model, fundamentally rethinking how we trace decisions back through deep learning systems, and offering a compelling alternative to post-hoc explainability.

Unlike traditional post-hoc methods that attempt to approximate a model’s reasoning after the fact, DL-Backtrace directly computes the influence of specific inputs on the final prediction by traversing the model’s execution path backward—from output to input. This reverse traversal is not an approximation but a precise reconstruction of which parts of the input space were functionally critical to the model’s decision. It operates directly on the computational graph, using internal gradients and activations to identify causally relevant pathways.
For more in-depth insights about DL-Backtrace, check out our webinars here.

How DL-Backtrace Surpasses Traditional Methods

  • Faithfulness Over Approximation: Post-hoc methods like SHAP and LIME approximate feature importance via surrogate models or local perturbations. DL-Backtrace, on the other hand, works on the original model without needing approximation layers, ensuring explanations are more causally grounded and faithful to the actual decision logic.
  • Resilience to Manipulation: Because DL-Backtrace analyzes the real computational trace, it’s less susceptible to adversarial manipulation or inconsistencies that often plague post-hoc methods relying on model probing.
  • Model-Agnostic Yet Mechanistic: DL-Backtrace doesn’t require a specially trained explainer model and can be applied across a wide range of architectures—including transformers and convolutional networks—making it both broadly applicable and technically robust.

For Instance:

In image classification tasks, where saliency maps often highlight noisy or irrelevant regions, DL-Backtrace can pinpoint the exact neurons and spatial locations that materially contributed to the classification output. This gives a crisper, more trustworthy picture of why a model thinks an image is, say, a cat instead of a dog—not based on surrounding pixels or textures, but the core object features.
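
To illustrate the general idea of backward relevance tracing (not the DLBacktrace API itself, whose exact interface we do not reproduce here), the sketch below computes a simple gradient-times-input relevance map in PyTorch for a toy convolutional classifier.

```python
import torch
import torch.nn as nn

# Toy convolutional classifier; the architecture is illustrative only.
model = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 2),
)

image = torch.rand(1, 3, 32, 32, requires_grad=True)
logits = model(image)
target = logits.argmax(dim=1).item()

# Backpropagate from the winning class score only.
logits[0, target].backward()

# Relevance heatmap: elementwise gradient x input, summed over colour channels.
relevance = (image.grad * image).sum(dim=1).squeeze(0)
print(relevance.shape)  # (32, 32) map of input regions driving the class score
```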

Conclusion

"Interpretability Needs a New Paradigm" challenges the status quo in explainability research and pushes us to think beyond traditional boundaries. The authors argue for a more rigorous and creative approach that integrates performance and faithfulness without compromise. Though early, the proposed paradigms offer a glimpse into how future AI systems might be designed from the ground up to explain themselves.

As the field evolves, we must stay vigilant—not just about building systems that seem interpretable, but about ensuring their explanations truly reflect how they think. Because when lives, laws, or livelihoods are on the line, understanding why a model makes a decision is just as important as what it decides.
