On-demand Webinar: Beyond Explainability – Evaluating XAI Methods with Confidence Using XAI Evals
July 11, 2025
We’re excited to share the story behind the development of XAI Evals, AryaXAI’s open-source framework built to bring confidence and clarity to evaluating explainability methods in AI systems.
As AI systems take center stage in decision-making across industries like healthcare, finance, and legal services, explainability alone is no longer enough. Post-hoc explanation methods (like SHAP, LIME, Grad-CAM, and Integrated Gradients) attempt to bridge the gap, but how can we trust the explanations they produce? It’s not just about showing why a model made a decision; it’s about evaluating whether that explanation is trustworthy, stable, and actionable.
In this webinar, we walk through:
- The motivation for building XAI Evals and the challenges it addresses
- A detailed look at current explainability techniques and their limitations
- The evaluation metrics integrated into the framework (faithfulness, sensitivity, comprehensiveness, and more)
- A live demo showcasing how to use XAI Evals for tabular and image data
- Insights into how this framework supports regulatory alignment and model transparency
- An open Q&A with live audience questions
We’re joined by Pratinav Seth, Research Scientist at AryaXAI and the lead author of XAI Evals. Pratinav has been at the forefront of AryaXAI’s mission to build accessible, stable, and production-ready interpretability tools for the AI community. The session is hosted by Sugun Sahdev, who guides us through the agenda and audience interaction.
Whether you're a data scientist, ML engineer, or part of a governance and compliance team, this session offers valuable perspectives on building trustworthy and explainable AI.
Papers and Resources discussed:
- XAI Evals Paper: https://arxiv.org/html/2502.03014v1
- GitHub Repository: github.com/AryaXAI/xai_evals
Let’s dive in!
Sugun Sahdev:
Welcome everyone, and thank you for joining us today for our webinar, “Beyond Explainability: Evaluating XAI Methods with Confidence using XAI Evals.” We’re excited to have a diverse audience of data scientists, ML engineers, tech leads, and decision-makers with us as we explore one of the most critical components of building trustworthy AI: explainability.
Today’s session includes:
- A brief introduction to AryaXAI and our mission
- A deep dive into XAI Evals, our open-source framework for evaluating post hoc explanation methods
- An overview of supported techniques and metrics
- A live demo showcasing how to generate and assess explanations for tabular and image data
- An open Q&A session
To lead us through this, we’re joined by Pratinav Seth, Research Scientist at AryaXAI and the lead author behind XAI Evals. Over to you, Pratinav.
Pratinav Seth:
Thanks, Sugun. I’m excited to share our work on XAI Evals, but before that, a quick introduction to AryaXAI Alignment Labs. We operate across Mumbai, Paris, and London with a focused mission: solving interpretability, alignment, AI safety, and risk challenges in machine learning.
We plan to do so by building new techniques, developing open-source tools, and collaborating with academic and other research labs.
Two key tools we have publicly released so far are:
- DL-Backtrace – a model-agnostic explainability method
- XAI Evals – our focus for today
Today, we'll be discussing XAI evaluations and why evaluating explainability is crucial. We are also currently hiring for many roles. If you're interested, please don't hesitate to reach out.
Why is the evaluation of explainability critical?
One of the biggest challenges today is that AI is being applied across a wide range of mission-critical tasks to improve efficiency. Whether it's healthcare, pharmaceuticals, banking, or insurance, AI is being used extensively across domains.
Now, not just enterprises, but even governments and regulators are increasingly involved in the AI ecosystem. And rightly so—they're concerned about how AI is being deployed and what risks it poses. The core issue is this: creating an AI model is relatively easy; making it acceptable—especially in high-stakes environments—is much harder.
Why? Because anyone can take a model from Hugging Face or build a deep learning model in a matter of hours. But we often don’t have visibility into how that model was trained, what data it saw, or how it’s making decisions. This becomes especially problematic in regulated industries, where transparency and traceability are critical.
AI’s decision-making process is often opaque. And this uncertainty around how a model arrives at its output makes it difficult to meet regulatory requirements. Even if you can develop a model for a given use case in one or two hours, getting it approved by all stakeholders is another story. You need alignment across your product team, risk and compliance teams, internal audit, and of course, your end users or customers.
So, how do you build trust across all these groups? One common approach is using explainability. But then the next question becomes: how do you ensure your explanations are actually correct?
In fact, bad or misleading explanations can be just as dangerous as no explanations at all. They can lead stakeholders to a false sense of trust, and when that breaks, it creates skepticism not just about your model, but about AI as a whole. That's why the integrity of explanation methods is crucial.
As I mentioned earlier, regulators are paying close attention to this. In the EU, for example, new AI regulations include strong provisions around model risk management, explainability, and governance. Similar movements are happening in the US, UK, India, and other countries. Governments are actively stepping in to ensure responsible AI use.
Now, someone new to the field might ask: “Why is this suddenly a problem with AI? Haven’t we been using machine learning for over a decade?”
The difference lies in the nature of deep learning models. They are highly performant but extremely opaque. Their decision-making involves a complex series of mathematical transformations, and unraveling that logic is not straightforward. That's why interpretability and explainability have become such pressing issues today.
Challenges in Explainability
Let’s now discuss the explainability methods commonly used in practice.
We broadly categorize these into two types:
- Model-agnostic methods – These include popular techniques like LIME and SHAP, which essentially act as surrogate models to approximate how a black-box model behaves. While widely adopted, these methods are computationally expensive and often sensitive to data perturbations, making them unstable in real-world scenarios.
- Gradient-based methods – These rely on computing the gradients of the model with respect to inputs. Examples include Grad-CAM and Integrated Gradients. However, these techniques often depend on the choice of baseline and can be difficult to interpret, especially for non-technical stakeholders (the sketch below shows how the baseline choice plays out in practice).
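To make the baseline issue concrete, here is a minimal sketch (not from the webinar) that attributes the same input against two different baselines using Captum's Integrated Gradients; the toy model and random data are placeholders.

```python
# Minimal sketch: the same sample attributed against two baselines.
# The model and data are illustrative; only the Captum API calls are real.
import torch
import torch.nn as nn
from captum.attr import IntegratedGradients

model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))
model.eval()

x = torch.randn(1, 4)                      # one tabular sample
ig = IntegratedGradients(model)

# Baseline 1: all zeros (the common default)
attr_zero = ig.attribute(x, baselines=torch.zeros_like(x), target=0)

# Baseline 2: the feature means of a reference batch
reference = torch.randn(32, 4)
attr_mean = ig.attribute(x, baselines=reference.mean(dim=0, keepdim=True), target=0)

print("zero-baseline attributions:", attr_zero)
print("mean-baseline attributions:", attr_mean)   # typically differs from the zero baseline
```

The two attribution vectors will generally differ, which is exactly the ambiguity that makes these methods hard to defend without an evaluation framework.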
One of the core challenges here is the lack of unified metrics to evaluate and compare these methods. Given multiple explainability approaches, how do we decide which one is better? This remains an open problem.
This ties back to the black-box nature of modern AI. As models grow larger—with billions of parameters powered by high compute—they become harder to interpret. Performance may be excellent, but understanding how decisions are made is still largely opaque.
That brings us to the main topic of today’s webinar: Interpretability and Explainability in modern AI systems.
In today’s world of large-scale deep learning, traditional methods like gradients are no longer sufficient. We need more concise, stable, and stakeholder-friendly explainability techniques—especially for high-stakes applications.
Despite advances, many current models still operate as black boxes. And while post-hoc explainability methods exist, scaling them reliably and evaluating their trustworthiness remains a significant challenge.
To clarify:
- Global explainability gives a high-level view of model behavior across the dataset.
- Local explainability focuses on why a model made a specific prediction.
Most popular methods—like LIME, SHAP, Grad-CAM, and Integrated Gradients—are post-hoc, meaning they try to explain decisions after the model has been trained.
We also introduced a new method called DL-Backtrace, which is a model-agnostic explainability technique. Unlike traditional methods, DL-Backtrace does not rely on baselines and offers more stable and interpretable relevance scores.
But the question still remains: Why can’t we just rank these methods and pick the best one?
The core issue is that explainability lacks ground truth. There's no definitive answer to what the “correct” explanation should be. This makes evaluation inherently subjective.
Various metrics exist—like perturbation-based evaluations or fidelity-based metrics—but no universally accepted standard exists today. Researchers in different domains use different evaluation strategies, and many of these methods can be easily manipulated or gamed.
This inconsistency creates major hurdles when trying to compare or benchmark explainability methods across different industries.
So, what’s the solution?
We believe the path forward is to standardize explainability evaluation. A common benchmark would allow consistent, reliable assessments of interpretability methods across use cases and industries.
That’s exactly why we built XAI Evals—a robust evaluation framework for explainability in AI.
Introducing XAI Evals
XAI Evals addresses the black-box challenge by offering a standardized way to evaluate and compare explanation techniques, regardless of the underlying model architecture—whether it’s a neural network, CNN, or any other black-box model.
XAI Evals is an open-source Python package designed specifically for evaluating explainability methods on tabular and image data. The framework emphasizes the importance of interpretability in high-stakes AI applications and supports a wide range of post-hoc explanation techniques.
XAI Evals integrates popular methods like SHAP, LIME, Grad-CAM, DL-Backtrace, and more—allowing users to plug in any interpretability method of their choice. It also includes a comprehensive suite of evaluation metrics such as faithfulness, sensitivity, and robustness, which are commonly used in the explainability research space.
By using XAI Evals, you can systematically assess the quality of explanations, helping improve both model interpretability and stakeholder trust. The package supports a wide variety of models—from traditional machine learning algorithms to deep learning models built using PyTorch and TensorFlow.
You can easily generate explanations in a structured and standardized format, which can be directly integrated into your production pipelines. Moreover, the framework enables benchmarking across methods and metrics, helping you identify which explanation techniques perform best for your specific use case.
In the next slides, I’ll walk through how exactly this contributes to building trustworthy AI systems.
With all this in mind, why is evaluation really important?
There’s a critical question that often arises, especially when you're working with regulators or other oversight bodies:
You're claiming that your model is explainable and free from bias, and you've selected certain interpretability methods to support that claim. But how do you justify these choices to a regulator?
After all, there are multiple explainability methods available, but there hasn’t been a standardized, quantifiable approach to evaluate and compare them. This is exactly where XAI Evals becomes valuable.
Many commonly used methods, like SHAP and LIME, are known to be easily manipulated or overly sensitive. Without a robust framework, it's difficult to defend your interpretability claims in real-world, regulated environments.
XAI Evals introduces structure by focusing on core evaluation areas:
- Faithfulness (Fidelity/Infidelity): Assesses how well feature attributions correspond to actual changes in model output when features are perturbed, ensuring explanations accurately reflect model behavior.
- Sensitivity (Robustness): Evaluates the consistency of attributions in the presence of small input perturbations or noise; lower sensitivity indicates more stable and reliable explanations.
- Comprehensiveness & Sufficiency: Determines if the most important features significantly contribute to the model's output and whether these features alone suffice for a reliable explanation.
- Monotonicity: Indicates consistency in attribution direction relative to model predictions; attributions should proportionally reflect feature influence.
- Sparseness & Complexity: Addresses the simplicity of explanations, such as the number of features deemed important, aiding in understanding how manageable and interpretable explanations are for users.
XAI Evals provides implementations of these metrics, allowing users to systematically evaluate and compare interpretability methods. This not only improves transparency but also strengthens your ability to communicate and justify your model behavior to stakeholders—especially in regulated environments.
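As a rough illustration of what a faithfulness check involves, here is a simplified, framework-agnostic sketch of a perturbation-based correlation test on tabular data. It is not xai_evals' actual implementation; `predict_proba` and `attributions` are assumed inputs for a single sample.

```python
# Simplified faithfulness-correlation style check (illustrative only).
import numpy as np

def faithfulness_correlation(predict_proba, x, attributions, target, baseline,
                             n_trials=50, subset_size=3, seed=0):
    """x: 1-D feature vector; attributions: 1-D array aligned with x;
    predict_proba: batch -> class probabilities; baseline: neutral feature values."""
    rng = np.random.default_rng(seed)
    base_score = predict_proba(x[None, :])[0, target]
    attr_sums, score_drops = [], []
    for _ in range(n_trials):
        idx = rng.choice(len(x), size=subset_size, replace=False)
        x_pert = x.copy()
        x_pert[idx] = baseline[idx]                 # replace a random feature subset with the baseline
        pert_score = predict_proba(x_pert[None, :])[0, target]
        attr_sums.append(attributions[idx].sum())   # attributed importance of the removed subset
        score_drops.append(base_score - pert_score) # how much the prediction actually dropped
    # High correlation => attributions track the model's real sensitivity to those features.
    return np.corrcoef(attr_sums, score_drops)[0, 1]
```

A higher correlation means the attributions track how the model actually responds when those features are removed, which is the intuition behind the fidelity/infidelity scores listed above.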
With that context, let’s now look at the design of the XAI Evals library and how it supports these capabilities.
Library Design (xai_evals)
[Slide: overview of the xai_evals library design]
Experimental Setup (xai_evals)
Model Types: Supports various ML/DL models, including those from Scikit-Learn, XGBoost, TensorFlow, and PyTorch.
Data Types:
- Tabular Data: Demonstrated with models such as Random Forest using SHAP.
- Image Data: Demonstrated with models like ResNet using Grad-CAM.
Workflow:
- Model Training: Train a black-box ML/DL model.
- Explanation Generation: Utilize xai_evals to generate explanations for specific predictions using various methods.
- Metric Calculation: Apply xai_evals' quantitative metrics to evaluate the generated explanations.
- Benchmarking: Compare different explanation methods based on these metrics (a condensed sketch of this workflow follows below).
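Before moving to the demo, here is a condensed sketch of that four-step workflow. The training step uses standard scikit-learn; the xai_evals calls are left as commented placeholders because the exact class and method names should be taken from the project's README rather than from this sketch.

```python
# Four-step workflow sketch. xai_evals identifiers below are illustrative placeholders.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# 1. Model training: any black-box model works; a scikit-learn classifier is used here.
data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(data.data, data.target, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# 2. Explanation generation via an xai_evals explainer wrapper (placeholder names):
# explainer = SHAPExplainer(model=model, features=list(data.feature_names), X_train=X_train)
# explanation = explainer.explain(X_test, instance_idx=0)

# 3. Metric calculation with the library's quantitative metrics (placeholder names):
# metrics = ExplanationMetricsTabular(model=model, X_train=X_train, X_test=X_test,
#                                     y_test=y_test, method="shap")
# scores = metrics.evaluate(metrics=["faithfulness", "infidelity", "sensitivity", "monotonicity"])

# 4. Benchmarking: repeat steps 2-3 for "lime", "integrated_gradients", etc.,
#    and compare the resulting metric tables side by side.
```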
Live Demo
https://github.com/AryaXAI/xai_evals
Pratinav showed how to use XAI Evals across three scenarios.
Example Notebooks:
- Tabular ML Model Illustration and Evaluation Metrics: Iris dataset. Colab Link
- Tabular Deep Learning Model Illustration and Evaluation Metrics: Lending Club dataset. Colab Link
- Image Deep Learning Model Illustration and Evaluation Metrics: CIFAR-10 dataset. Colab Link
1. Tabular Data + ML Model (Iris dataset)
- Model: Random Forest
- Explainers: SHAP and LIME
- Results:
[Table: evaluation metric comparison of SHAP vs. LIME on the Iris Random Forest]
Based on the comparison between SHAP and LIME, we can conclude that both methods provide valuable insights into model interpretability, with distinct strengths that make them more suitable for different use cases:
- SHAP offers more faithful approximations of the model's behavior and exhibits slightly lower infidelity, making it a better choice when the goal is to have accurate and model-aligned explanations.
- LIME, however, offers higher sensitivity and perfect monotonicity, which makes it ideal for situations where capturing subtle changes in input features and maintaining the consistency of the input-output relationship are more critical.
Both methods share similar complexity and comprehensiveness scores but do not offer sparse explanations, which could be a limitation for applications that require concise, feature-specific interpretations. When choosing between SHAP and LIME, the decision should be based on the tradeoff between accuracy and faithfulness (favoring SHAP) versus sensitivity and monotonicity (favoring LIME).
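For readers who want to reproduce the spirit of this first demo outside the notebook, here is a minimal sketch that trains the same kind of Random Forest on Iris and calls SHAP and LIME directly. The webinar invokes these explainers through xai_evals' wrappers; the direct calls below are only for orientation.

```python
# Iris + Random Forest with SHAP and LIME called directly (illustrative setup).
import shap
from lime.lime_tabular import LimeTabularExplainer
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

data = load_iris()
X_train, X_test, y_train, y_test = train_test_split(data.data, data.target, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# SHAP: TreeExplainer computes Shapley values efficiently for tree ensembles
shap_values = shap.TreeExplainer(model).shap_values(X_test[:1])   # per-class attributions for one sample

# LIME: fits a local surrogate model around the same instance
lime_explainer = LimeTabularExplainer(X_train, feature_names=data.feature_names,
                                      class_names=list(data.target_names),
                                      discretize_continuous=True)
lime_exp = lime_explainer.explain_instance(X_test[0], model.predict_proba, num_features=4)
print(lime_exp.as_list())   # (feature, weight) pairs from the local surrogate
```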
2. Tabular Data + Deep Learning (LendingClub dataset)
- Model: MLP (3 layers)
- Explainers: DeepLIFT, Integrated Gradients, Gradient SHAP, and DL-Backtrace (Default and Contrastive); a short Captum-based sketch follows the key takeaways below
- Results:
[Table: evaluation metric comparison of DeepLIFT, Integrated Gradients, GradientSHAP, and the DL-Backtrace variants on the Lending Club MLP]
Based on the comparison between DeepLIFT, Integrated Gradient, GradientSHAP, and the DLBacktrace methods (Default and Contrastive), we can make the following observations:
- DLBacktrace_Default and DLBacktrace_Contrastive are clearly the strongest methods in terms of sensitivity, comprehensiveness, and monotonicity. These methods exhibit high sensitivity to input changes, offer more comprehensive explanations, and maintain perfect monotonicity, making them ideal for applications that require detailed feature contributions and a consistent relationship between inputs and outputs.
- DeepLIFT, Integrated Gradient, and GradientSHAP have slightly better faithfulness and slightly higher infidelity, but these methods still align closely with the model's behavior. They do, however, show lower sensitivity and comprehensiveness compared to DLBacktrace methods.
- None of the methods offer sparse explanations, and they all share similar computational complexity. Therefore, the key differentiators are based on sensitivity, faithfulness, comprehensiveness, and monotonicity.
Key Takeaways:
- DLBacktrace_Default and DLBacktrace_Contrastive are the best choices if you need sensitivity, comprehensiveness, and monotonicity in your model explanations. They are ideal for capturing subtle changes in inputs, offering a more detailed understanding of feature contributions, and ensuring a consistent input-output relationship.
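For orientation, the sketch below shows how attributions like those compared in this demo can be generated with Captum for a small MLP. The synthetic features stand in for the Lending Club data, and DL-Backtrace is omitted because it ships as a separate AryaXAI library.

```python
# Gradient-based explainers on a stand-in tabular MLP (illustrative only).
import torch
import torch.nn as nn
from captum.attr import DeepLift, IntegratedGradients, GradientShap

model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(),
                      nn.Linear(64, 32), nn.ReLU(),
                      nn.Linear(32, 2))
model.eval()

x = torch.randn(1, 20)                    # one applicant's feature vector (synthetic)
baseline = torch.zeros(1, 20)

ig_attr = IntegratedGradients(model).attribute(x, baselines=baseline, target=1)
dl_attr = DeepLift(model).attribute(x, baselines=baseline, target=1)
gs_attr = GradientShap(model).attribute(x, baselines=torch.randn(20, 20), target=1)  # distribution of baselines

for name, attr in [("IntegratedGradients", ig_attr), ("DeepLIFT", dl_attr), ("GradientSHAP", gs_attr)]:
    print(name, attr.detach().numpy().round(3))
```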
3. Image Data + CNN (CIFAR-10)
- Model: Custom CNN
- Explainers: Occlusion, Grad-CAM, Integrated Gradients, and DL-Backtrace (a short sketch using Occlusion and Grad-CAM follows the results discussion)
- Results:
The analysis of different methods based on Faithfulness Correlation and Max Sensitivity reveals notable differences in performance:
- DLBacktrace demonstrates the highest faithfulness, indicating that it aligns well with the model's decision-making process.
- Integrated Gradient shows the lowest faithfulness, suggesting it may not be as reliable in reflecting model behavior. However, it exhibits the highest sensitivity, meaning it reacts strongly to small input changes, which could be useful in capturing subtle variations but may also indicate instability.
- GradCAM and DLBacktraceCon perform moderately in terms of faithfulness, with DLBacktraceCon showing slightly better alignment than GradCAM.
- GradCAM has the lowest sensitivity, making it the most stable method, but this could come at the cost of reduced responsiveness to input variations.
Overall, DLBacktrace appears to strike the best balance between faithfulness and sensitivity, making it a strong choice for interpretability. However, the choice of method should depend on the specific needs—whether stability, sensitivity, or faithfulness is the priority.
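As a companion to this demo, here is a sketch that runs two of the listed explainers (Occlusion and Grad-CAM) via Captum on a stand-in CNN sized for CIFAR-10. The model is purely illustrative, and DL-Backtrace is again omitted.

```python
# Occlusion and Grad-CAM on a stand-in CIFAR-10 CNN (illustrative only).
import torch
import torch.nn as nn
from captum.attr import Occlusion, LayerGradCam

class TinyCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                                  nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
                                  nn.AdaptiveAvgPool2d(1))
        self.fc = nn.Linear(32, 10)

    def forward(self, x):
        return self.fc(self.conv(x).flatten(1))

model = TinyCNN().eval()
image = torch.randn(1, 3, 32, 32)         # one CIFAR-10-sized image
target = 3

# Occlusion: slide a patch over the image and record the prediction change
occ_attr = Occlusion(model).attribute(image, target=target,
                                      sliding_window_shapes=(3, 8, 8), strides=(3, 4, 4))

# Grad-CAM: gradients of the target class w.r.t. the last conv layer's activations
cam_attr = LayerGradCam(model, model.conv[2]).attribute(image, target=target)
print(occ_attr.shape, cam_attr.shape)     # (1, 3, 32, 32) and (1, 1, 32, 32)
```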
Key Findings / Results
Pratinav Seth:
Insights from Using xai_evals:
- Quantitative Comparison: Enables objective comparison of different explanation methods, highlighting their strengths and weaknesses on specific tasks and models.
- Identification of Reliable Explanations: Assists practitioners in determining which explanation methods are more faithful, stable, or robust for their particular use case.
- Enhanced Trust: By providing systematic evaluation, xai_evals contributes to building greater trust in ML models and their explanations.
- Guidance for Practitioners: Offers a practical tool for researchers and developers to select and refine XAI techniques.
To summarize the demo and our key findings while working on XAI Evals, one major realization is that the field of explainability is still highly fragmented—especially when it comes to quantifying explanations. Our approach aims to address this by enabling quantitative comparison across different explanation methods. This makes it easier to assess the strengths and limitations of each method, identify reliable explanations, and ultimately enhance trust in model outputs.
Quantification also allows practitioners to select the most appropriate explanation method for their specific application—whether qualitatively (based on stakeholder interpretation) or quantitatively (based on measurable metrics). This is especially useful in regulated industries, where understanding which method works best for a given model and context can significantly improve compliance and decision-making.
Moreover, XAI Evals supports fairness assessments, allowing users to evaluate not only the model but also the fairness and stability of its explanations—helping to detect and mitigate biases.
With upcoming regulations such as the EU AI Act and ongoing requirements under GDPR, explainability is becoming mandatory. However, there are still few standardized guidelines on how to evaluate it. XAI Evals offers a pathway to operationalize transparency and build AI systems that can withstand regulatory scrutiny.
In conclusion, XAI Evals fills a critical gap in the explainability ecosystem. While many post-hoc methods exist, there has been no unified, open-source framework to compare and evaluate them across models and tasks. XAI Evals provides this missing layer, enabling quantitative, consistent, and trustworthy evaluation of explanations that strengthens both AI reliability and auditability.
In addition to model evaluation, explanation evaluation is becoming increasingly essential. With tools like XAI Evals, teams can systematically compare explanation methods and select the one best suited for their specific models and use cases.
Future Roadmap
Looking ahead, we’re actively expanding support for newer model architectures, particularly transformers for NLP. Natural Language Processing has seen tremendous growth in the last couple of years, but explainability in this domain remains especially challenging due to its black-box nature. Addressing this is a key part of our roadmap.
We’re also working on broadening the range of evaluation metrics and explanation methods supported by XAI Evals. Currently, some explainers are available only for specific model types, but we aim to extend coverage across more tasks and modalities.
Another exciting area is Graph Neural Networks (GNNs). These models are increasingly adopted in applications like fraud detection, recommendation systems, and drug discovery. We’re exploring integration of GNN-specific explainers and evaluation routines into the library.
To make XAI Evals even more useful in real-world workflows, we’re improving its integration capabilities with existing ML pipelines and tools. While the library is already open-source and usable today, we’re refining its design for even greater flexibility and accessibility.
To explore more, we invite you to visit our GitHub repository at https://github.com/AryaXAI/xai_evals
The repository is easy to navigate and well documented. You’ll also find a link to our preprint there, which offers deeper insights into the research and methodology behind XAI Evals.
Thank you again for joining us, and we look forward to your feedback and collaboration.
Sugun Sahdev:
Thank you so much, Pratinav. That was an incredibly detailed and insightful session—really appreciated your deep dive into the framework and its real-world applications.
We’ll now move into the next part of the webinar—our Q&A session.
Q&A Highlights
Sugun Sahdev:
Have you made any attempts to combine all these explanation methodologies to provide some sort of aggregate explainability? Or is there significant variation in the outputs from each method?
Pratinav Seth:
That’s a great question. We’ve explored how to quantify explanations across different dimensions. For instance, we use distinct metrics for evaluating different aspects—like faithfulness, sensitivity, and robustness—each representing a different direction of explainability.
While we've tried to align these metrics for comparison, creating a unified or aggregate explainability score is quite challenging. Each metric has a unique way of being computed, and there’s no universally accepted method to combine them meaningfully.
In practice, it really depends on your business context. If faithfulness is more important than robustness for your use case, you might choose to optimize and report that metric. But combining them into a single score is an area the community still needs to explore further.
Sugun Sahdev:
Thanks! Here’s another one from Daniel.
Question:
Is there a timeline to support NLP models in XAI Evals?
Pratinav Seth:
Yes, absolutely. NLP is high on our roadmap. We’ve primarily focused on tabular and image data so far, but NLP is especially important—and challenging—due to its black-box nature.
We’re currently exploring stable evaluation metrics for NLP-based explanations. If all goes well, we expect to roll out initial support by the end of this year, or potentially sooner.
Sugun Sahdev:
Great. Another question that’s come in:
Question:
How does XAI Evals balance between faithfulness and robustness when evaluating explanation methods?
Pratinav Seth:
That’s a nuanced one. In the field, faithfulness and robustness are often used interchangeably, but they represent distinct aspects.
- Faithfulness checks how well the explanation aligns with the model's actual decision logic.
- Robustness evaluates how stable explanations are under small input perturbations or noise.
In XAI Evals, we compute separate metrics for both. You might find that a method scores high on faithfulness but is very sensitive to input noise—or vice versa. Evaluating both gives you a holistic view, and you can decide which trade-off fits your domain better.
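A simplified sketch of a max-sensitivity style robustness check is shown below. It is not xai_evals' exact code, and the `explain_fn` callable is an assumed interface wrapping whichever attribution method you are evaluating.

```python
# Max-sensitivity style robustness check (illustrative only).
import numpy as np

def max_sensitivity(explain_fn, x, n_samples=10, radius=0.05, seed=0):
    """explain_fn: callable mapping a 1-D sample to a 1-D attribution vector."""
    rng = np.random.default_rng(seed)
    base_attr = explain_fn(x)
    worst = 0.0
    for _ in range(n_samples):
        noise = rng.uniform(-radius, radius, size=x.shape)
        perturbed_attr = explain_fn(x + noise)
        worst = max(worst, np.linalg.norm(perturbed_attr - base_attr))
    return worst   # lower is better: the explanation barely moves under small input noise
```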
Sugun Sahdev:
Thank you. Next question:
How do we quantitatively assess and choose between SHAP and LIME, given their trade-offs?
Pratinav Seth:
Good question. Both SHAP and LIME are surrogate-based, post-hoc explanation methods—often used with machine learning models. If you're working with classical models like decision trees or logistic regression, you're usually limited to such methods.
Here’s a breakdown:
- SHAP is more consistent and theoretically grounded, and works well when you want precise, localized explanations using a smaller feature set.
- LIME can be helpful for broader, more global approximation, though it tends to be more unstable.
That said, both can be easily manipulated and are not ideal for high-stakes or deep learning use cases. Regulators are increasingly skeptical of them for deployment in sensitive domains. For those scenarios, we recommend using more robust approaches like DL-Backtrace.
Sugun Sahdev:
Great insight. Next question:
Can XAI Evals be used with other explanation methods like SAEs (Self-Explaining AI models)?
Pratinav Seth:
Currently, XAI Evals integrates explanation generation directly into the framework for ease of use. This means most workflows assume explanations are generated within the library itself.
That said, if you can generate attributions externally (e.g., from SAEs) in a compatible format, future versions of the library may allow importing those explanations for evaluation. This would require some architectural changes, but it's definitely on our radar as a future enhancement.
Sugun Sahdev:
Thanks, Pratinav. Here’s another one:
Some metrics like faithfulness and comprehensiveness can be data sensitive. How does XAI Evals handle variability across datasets?
Pratinav Seth:
That’s a great observation. Yes, metrics like faithfulness and comprehensiveness are indeed highly dependent on the underlying data. Since these metrics rely on how the model responds to inputs, any distribution shift—like out-of-distribution (OOD) samples—can significantly impact the results.
If the model is presented with unfamiliar data, its predictions and corresponding explanations may become unstable or unreliable. In such cases, explanation quality can degrade, which will naturally affect metric scores.
With XAI Evals, we do provide visibility into this variability. For example, if a particular batch of data causes extreme deviations in metric values, the system flags it. We also have some corrective handling mechanisms in place, such as input checks and range normalization. However, it’s essential for users to interpret these cases carefully, especially when data shifts are expected. Understanding how your data interacts with the explanation method is key to trusting the outputs.
Sugun Sahdev:
Thanks, Pratinav. Here’s another one.
How well does XAI Evals generalize across different modalities—such as tabular, image, and text data?
Pratinav Seth:
The core idea behind XAI Evals is modality-agnostic evaluation. While implementation details vary, the foundational metrics—like faithfulness, sensitivity, and comprehensiveness—are applicable across tabular, image, and soon, text models.
The framework adapts the computation of each metric based on the input format. For example:
- In tabular data, perturbations might involve removing feature values.
- In image data, we modify or occlude pixels or regions.
- For text (planned for release soon), we’ll apply token masking and saliency techniques.
So while the underlying logic remains consistent, the execution adapts to modality-specific requirements. This ensures the evaluation remains robust and meaningful across different data types.
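The sketch below illustrates this in miniature: the same "remove and re-score" primitive, specialized per modality. The helper names and fill strategies are illustrative, not the library's actual perturbation routines.

```python
# Modality-specific perturbation helpers (illustrative only).
import numpy as np

def perturb_tabular(x, feature_idx, fill_value=0.0):
    """Tabular: overwrite selected feature values with a neutral fill (e.g. mean or zero)."""
    x = x.copy()
    x[feature_idx] = fill_value
    return x

def perturb_image(img, top, left, size, fill_value=0.0):
    """Image: occlude a square region of pixels (channels-first layout: C, H, W)."""
    img = img.copy()
    img[:, top:top + size, left:left + size] = fill_value
    return img

def perturb_text(tokens, token_idx, mask_token="[MASK]"):
    """Text (planned support): replace selected tokens with a mask token."""
    drop = set(token_idx)
    return [mask_token if i in drop else t for i, t in enumerate(tokens)]
```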
Sugun Sahdev:
Let’s take another question.
How does XAI Evals support quantitative analysis in the image domain?
Pratinav Seth:
In the image domain, we apply metrics like faithfulness, which can be measured using techniques such as Most Pertinent Positive Region Testing (MPRT). Here’s a simplified view:
- We start by identifying the most important pixels or regions that contribute to a model’s prediction—based on the explanation method used.
- We then systematically remove or occlude those pixels to see how much the model’s confidence drops.
- This change is plotted to create a curve that reflects how prediction confidence shifts as more informative pixels are removed.
- We compute the area under the curve (AUC) to derive a quantitative faithfulness score.
This process gives us a numerical metric that reflects how aligned the explanation is with the model's actual decision logic. Similar strategies are used to evaluate other metrics like comprehensiveness and sufficiency for image data.
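A simplified numpy sketch of that deletion-curve procedure is shown below; `predict` and `attribution` are assumed inputs, and the exact ranking and occlusion strategy in the library may differ.

```python
# Deletion-curve AUC sketch: occlude the most-attributed pixels and integrate the confidence curve.
import numpy as np

def deletion_auc(predict, image, attribution, target, steps=20, fill_value=0.0):
    """image: (C, H, W); attribution: (H, W) saliency map; predict: batch -> class probabilities."""
    order = np.argsort(attribution.ravel())[::-1]            # pixels ranked most-to-least important
    per_step = max(1, len(order) // steps)
    img = image.copy()
    confidences = [predict(img[None])[0, target]]
    for k in range(steps):
        idx = order[k * per_step:(k + 1) * per_step]
        rows, cols = np.unravel_index(idx, attribution.shape)
        img[:, rows, cols] = fill_value                       # occlude the next most important pixels
        confidences.append(predict(img[None])[0, target])
    xs = np.linspace(0.0, 1.0, len(confidences))
    return np.trapz(confidences, xs)                          # low AUC => confidence collapses quickly
```

A faithful saliency map makes the model's confidence collapse quickly as its top pixels are removed, which shows up as a low area under the deletion curve.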
Sugun Sahdev:
Interesting! And now this question:
Is there a preference for certain metrics depending on the use case? For example, should healthcare prioritize faithfulness, while finance focuses more on robustness?
Pratinav Seth:
Exactly—and that’s an important point. The choice of metrics should be driven by the use case and the stakeholders involved.
For example:
- In healthcare, faithfulness is critical. Explanations may be used to validate diagnoses or support decisions made by medical professionals. In such scenarios, the explanation must closely reflect the model's reasoning.
- In finance, robustness becomes more important. You want your explanations to remain consistent even if there’s slight variation in inputs, especially in tasks like fraud detection or credit scoring.
So yes, different industries and applications will prioritize different qualities in explainability. That’s why XAI Evals provides a suite of metrics—so teams can evaluate what's most relevant to their domain.
We also provide guidance in our documentation about which metrics are most suitable for particular industries and how to interpret them effectively.
Sugun Sahdev:
What are the future plans for the XAI Evals library?
Pratinav Seth:
Looking ahead, there are a few major areas we’re actively working on.
First, NLP support is a top priority. With the growing number of applications built on NLP models—including many that wrap basic models with custom logic—it’s becoming increasingly important to not only generate explanations for these models but also quantify those explanations in a meaningful way. This is especially relevant as regulations, such as the EU AI Act, begin to demand greater transparency in these systems.
Second, we’re exploring how to evaluate explanations for black-box models, where it’s difficult to access internal mechanics. While this is an ongoing challenge across the industry, we believe there are meaningful ways to quantify explainability even in such scenarios—and that’s something we aim to support.
Third, we’re planning to expand to Graph Neural Networks (GNNs). There’s growing interest in GNNs across domains like fraud detection and drug discovery, and while some academic work has been done, practical tools for evaluating their explanations are still limited. We hope to bridge that gap.
And finally, we’re looking into multimodal applications, starting with text as a primary modality. Supporting multiple input types and aligning evaluation across them is a direction we’re excited to pursue in the coming months.
Sugun Sahdev:
Thank you! Here’s another interesting one.
How should teams interpret a high sufficiency score but low monotonicity score in practice? What does this indicate about the explanation quality?
Pratinav Seth:
That’s a great technical question. Let’s break it down:
- A high sufficiency score means that the top few features identified by the explanation are sufficient to replicate the model’s prediction. In other words, the model is highly dependent on a small subset of features—which can be useful for simplifying models or identifying critical decision factors.
- A low monotonicity score, however, indicates that as you add more features (ranked by importance), the model’s prediction doesn’t consistently improve in a smooth or expected way. This could mean the ranking of feature importance isn’t very stable or well-aligned with model behavior.
In practical terms, this suggests that while the explanation does a good job of isolating a few impactful features, the overall ordering of features may not reflect true influence on the model’s decision.
So, the interpretation depends on the use case:
- In model optimization, high sufficiency could guide you toward feature selection.
- In auditing or compliance, low monotonicity might raise concerns about reliability or interpretability.
Understanding both together gives a more nuanced view of explanation quality.
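To make the two scores concrete, here is a small sketch with simplified definitions (not the library's exact formulas); `predict` is assumed to return the target-class probability for a single sample, and `ranked_idx` lists features from most to least important.

```python
# Simplified sufficiency and monotonicity scores (illustrative definitions only).
import numpy as np

def sufficiency(predict, x, ranked_idx, k, fill_value=0.0):
    """Keep only the top-k features: how close does the prediction stay to the original?"""
    kept = np.full_like(x, fill_value)
    kept[ranked_idx[:k]] = x[ranked_idx[:k]]
    return predict(x) - predict(kept)        # small gap => the top-k features are sufficient

def monotonicity(predict, x, ranked_idx, fill_value=0.0):
    """Add features back in importance order: does the prediction improve step by step?"""
    current = np.full_like(x, fill_value)
    scores = [predict(current)]
    for i in ranked_idx:
        current[i] = x[i]
        scores.append(predict(current))
    gains = np.diff(scores)
    return float(np.mean(gains >= 0))        # 1.0 => every added feature helped (perfect monotonicity)
```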
Sugun Sahdev:
That was very helpful. I think we received some excellent questions today!
That brings us to the end of today’s webinar. Thank you all for joining us—we hope the session provided valuable insights into how XAI Evals can enhance transparency, reliability, and auditability in modern AI systems.
We’ll be sharing the recording with all attendees shortly. In the meantime, if you’d like to learn more, please:
- Connect with us or,
- Explore the XAI Evals GitHub repo
We’re always eager to collaborate with the community, and we look forward to seeing you at future sessions.
Thank you again—and thank you, Pratinav—for leading such an insightful discussion.