---
title: "Explainable AI Insights"
subtitle: "Interpreting Black-Box Model Decisions with SHAP & Attention"
code-fold: true
---
::: {.callout-tip icon=false}
## For Newcomers
**You will learn:**
- Why "trust me, the model said so" isn't good enough for $45K decisions
- How SHAP values explain which features drove a specific prediction
- How to present AI recommendations to skeptical stakeholders
- The difference between global (overall) and local (this prediction) explanations
Explainable AI bridges the gap between black-box accuracy and human understanding. When a geologist asks "Why does the model think there's sand here?", we can now give a real answer.
:::
## What You Will Learn in This Chapter
By the end of this chapter, you will be able to:
- Explain why high-stakes groundwater decisions require more than raw model accuracy.
- Use feature importance, partial dependence, and SHAP-like methods to interpret groundwater predictions.
- Connect model explanations back to physical aquifer processes that geologists and hydrologists care about.
- Tailor explanation style and detail to different stakeholders (operators, managers, regulators, public).
- Recognize the limits of explainability methods and design guardrails for safe deployment.
## Why Explainability Matters
**Problem**: Stakeholders ask "Why did the model predict sand here?"
**Traditional ML Answer**: "Random Forest said so" (not acceptable for $45K drilling decision)
**Explainable AI Answer**:
- "Model predicted sand (92% confidence) because:"
1. High resistivity (8.2 Ω·m) → **+35% sand probability**
2. Northwestern location (glacial outwash zone) → **+28% sand probability**
3. Depth 45m (mid-aquifer) → **+18% sand probability**
4. Nearby wells hit sand → **+11% sand probability**
**Result**: Geologist understands reasoning, trusts decision, approves drilling.
---
## Explainability Methods
### 1. SHAP Values (SHapley Additive exPlanations)
::: {.callout-note icon=false}
## 📘 Understanding SHAP Values
**What Is It?**
**SHAP (SHapley Additive exPlanations)** is a unified framework for interpreting machine learning predictions, introduced by Lundberg & Lee in 2017. It's based on Shapley values from cooperative game theory (Lloyd Shapley, 1953, who later received the 2012 Nobel Memorial Prize in Economics for this line of work). SHAP answers: "How much did each feature contribute to moving the prediction away from the baseline (average) prediction?"
**Why Does It Matter?**
When a model predicts "92% probability of sand" at a drilling location, stakeholders ask "Why?" SHAP provides a mathematically rigorous answer by decomposing the prediction into contributions from each feature (depth, location, resistivity, etc.). This enables:
1. **Trust**: Geologists can validate if model reasoning matches geological knowledge
2. **Debugging**: Identify when model relies on spurious correlations
3. **Compliance**: Regulators require explanations for high-stakes decisions
4. **Learning**: Domain experts discover new patterns from model insights
**How Does It Work?**
SHAP uses game theory to fairly distribute "credit" for a prediction:
1. **Baseline**: Start with average prediction across all training data (e.g., 50% sand probability)
2. **Actual prediction**: Model predicts 92% sand for this specific location
3. **Difference**: 92% - 50% = 42% needs to be explained
4. **Attribution**: SHAP calculates how much each feature contributed to that 42%
**Key principle**: Contributions must sum to the total difference (additivity property).
**What Will You See?**
- **Waterfall plots**: Show how prediction builds up from baseline through feature contributions
- **Feature importance**: Which features matter most across all predictions
- **Force plots**: Interactive visualization showing feature contributions for one prediction
- **Dependence plots**: How feature values affect predictions
**Historical Context**: Before SHAP, different attribution methods (e.g., LIME, DeepLIFT) often gave inconsistent answers for the same prediction. SHAP unified several of them under a common game-theoretic framework and has since become a de facto industry standard for model explanation.
:::
**Concept**: How much did each feature contribute to this specific prediction?
**Example**: Predicting material type at (405000, 4428000, -50m)
```python
import shap
from sklearn.ensemble import RandomForestClassifier

# Train model (assumes X_train, y_train, X_test are already defined)
model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)

# Explain predictions using the Explanation API
explainer = shap.TreeExplainer(model)
explanation = explainer(X_test)

# Waterfall plot for the first test sample, positive ("sand") class
shap.plots.waterfall(explanation[0, :, 1])
```
**Output**:
```
Base prediction: 26% (average sand probability across the training data)
Feature Contributions:
is_sand = 1 → +42% (most important)
Z_norm = -0.8 → +15% (deep location favors sand)
Y_norm = 0.6 → +12% (northern region trend)
radial_position = 2.1 → -8% (far from center)
depth_below_surface = 45 → +5% (mid-aquifer)
Final prediction: 26% + 42% + 15% + 12% - 8% + 5% = 92%
```
**Interpretation**: `is_sand` feature (from domain knowledge) contributes most. Depth and location add confidence.
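The additivity property is easy to check numerically. For a linear model the exact Shapley value of each feature has a closed form, `w_i * (x_i - E[x_i])`, so a few lines of NumPy can confirm that baseline plus contributions reproduces the prediction. This is a toy sketch; the weights and background data are invented for illustration:

```python
import numpy as np

# Toy linear "model": sand score from two standardized inputs
def predict(x):
    return 0.3 * x[0] + 0.2 * x[1] + 0.5

# Background (training) data used to define the baseline prediction
background = np.array([[0.0, 0.0], [1.0, 1.0], [0.5, 0.5], [0.2, 0.8]])
baseline = np.mean([predict(row) for row in background])

# For a linear model, the exact Shapley value of feature i is
# w_i * (x_i - E[x_i]) -- no sampling or approximation needed
x = np.array([0.9, 0.7])          # the location we want to explain
weights = np.array([0.3, 0.2])
phi = weights * (x - background.mean(axis=0))

# Additivity: baseline + sum of contributions equals the prediction exactly
assert np.isclose(baseline + phi.sum(), predict(x))
print(f"baseline={baseline:.3f}, contributions={phi.round(3)}, prediction={predict(x):.3f}")
```

For tree ensembles the closed form no longer applies, but `shap.TreeExplainer` guarantees the same additivity by construction.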
::: {.callout-tip icon=false}
## 📊 How to Read SHAP Waterfall Plots
**Understanding the Visualization**:
**Baseline (Starting Point)**:
- The bottom value (e.g., 50%) is the **average prediction** across all training data
- This is what you'd predict knowing nothing about this specific location
- Think of it as "prior probability" before seeing any features
**Feature Contribution Bars**:
- **Red bars (pointing right)** = Feature increases prediction (pushes toward "yes sand")
- **Blue bars (pointing left)** = Feature decreases prediction (pushes toward "no sand")
- **Bar length** = How much that feature matters for THIS prediction
- **Top to bottom** = Bars stack to show cumulative effect
**Reading the Values**:
- `is_sand = 1 → +42%` means "knowing this location has sand indicator adds 42% to probability"
- `Z_norm = -0.8 → +15%` means "being at this elevation adds 15% more"
- `radial_position = 2.1 → -8%` means "being far from center reduces probability by 8%"
**Final Prediction**:
- Top value shows where all contributions sum to
- This is the actual model prediction for this specific case
**How to Explain to Stakeholders**:
**For Drilling Decision (Example)**:
> "The model predicts **92% probability of sand** at this location. Here's why:
>
> 1. **Main reason (+42%)**: Geophysical signature matches known sand deposits
> 2. **Supporting evidence (+15%)**: Depth is 45m, which is in the primary aquifer zone
> 3. **Geographic pattern (+12%)**: Northern locations typically have better sorted sediments (glacial outwash)
> 4. **Minor concern (-8%)**: Site is farther from city center than ideal
>
> **Bottom line**: Three strong positive indicators outweigh the one minor negative. The model is confident (92%) and the reasoning aligns with geological knowledge."
**When to Distrust SHAP Values**:
- **Contradictory contributions**: If high-importance features cancel each other out → low confidence
- **Single dominant feature**: If one feature is >80% of total → model might be overfitting
- **Counterintuitive signs**: If deeper depth decreases water availability (opposite of physics) → investigate
:::
### 2. Feature Importance (Global)
::: {.callout-note icon=false}
## 📘 Understanding Feature Importance
**What Is It?**
**Feature importance** measures which input variables contribute most to predictions across the entire dataset (a global explanation). It answers: "If I could only collect data on 3 features, which 3 would give the best predictions?" Permutation importance, introduced by Breiman (2001) for Random Forests and generalized to arbitrary models by Fisher et al. (2019), is among the most reliable approaches.
**Why Does It Matter?**
Collecting and maintaining data is expensive. If the model only uses 5 of your 20 features, you can:
1. **Reduce costs**: Stop measuring low-importance features
2. **Focus quality control**: Monitor high-importance sensors more closely
3. **Guide new sites**: Install only critical sensors at new wells
4. **Validate model**: If "surface_elevation" is unimportant but geology says it should matter, investigate
**How Does It Work?**
Permutation importance uses a clever trick:
1. **Baseline**: Calculate model accuracy on test data
2. **Shuffle one feature**: Randomly scramble values for feature X (breaks its relationship with target)
3. **Re-test**: Calculate accuracy with scrambled feature
4. **Importance = Accuracy drop**: Large drop = feature was important, small drop = feature wasn't used
**Example**: If shuffling "depth" drops accuracy from 86% to 62% (24% drop), but shuffling "temperature" only drops it to 85% (1% drop), then depth is 24× more important than temperature.
**What Will You See?**
Bar charts showing importance percentages. Top features get most attention in data collection and quality control.
:::
**Question**: Across all predictions, which features matter most?
**Method**: Permutation importance
- Shuffle feature values randomly
- Measure accuracy drop
- Larger drop = more important feature
**Results** (from material classification model):
| Rank | Feature | Importance | Interpretation |
|------|---------|------------|----------------|
| 1 | is_sand | 27.0% | Binary sand indicator dominates |
| 2 | Z_norm | 12.2% | Elevation/depth critical |
| 3 | depth_below_surface | 11.6% | Vertical position matters |
| 4 | Y_norm | 10.0% | North-south trends (glacial) |
| 5 | radial_position | 9.0% | Distance from origin |
**Action**: Focus data collection on top 5 features. Others contribute <10% each.
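The shuffle-and-measure recipe maps directly onto scikit-learn's `permutation_importance`. The sketch below runs it on synthetic data (the feature names only echo this chapter's; they are not the real dataset), so the exact percentages will differ, but the mechanic is the same: shuffle one column, measure the accuracy drop:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 2000
# Synthetic stand-in: only the first two columns actually drive the label
X = rng.normal(size=(n, 4))
y = ((X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.3, size=n)) > 0).astype(int)
names = ["is_sand_proxy", "Z_norm", "Y_norm", "radial_position"]

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# Shuffle each feature 10 times and record the mean accuracy drop
result = permutation_importance(model, X_te, y_te, n_repeats=10, random_state=0)
for i in result.importances_mean.argsort()[::-1]:
    print(f"{names[i]:<16} accuracy drop: {result.importances_mean[i]:.3f}")
```

The two informative features should show large drops; the two noise features should show drops near zero.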
### 3. Partial Dependence Plots
::: {.callout-note icon=false}
## 📘 Understanding Partial Dependence Plots (PDPs)
**What Is It?**
**Partial Dependence Plots (PDPs)** visualize the marginal effect of one or two features on model predictions, introduced by Friedman (2001). They answer: "If I change depth from 20m to 60m while keeping everything else average, how does the prediction change?" PDPs reveal the **functional relationship** the model learned between a feature and the outcome.
**Why Does It Matter?**
PDPs validate whether the model learned physically plausible relationships:
- **Validation**: Does deeper depth increase sand probability? (Geology says yes—PDPs should too)
- **Discovery**: Unexpected PDP shapes reveal surprising patterns (e.g., "sweet spot" at 40-50m depth)
- **Communication**: Non-technical stakeholders understand line plots better than equations
- **Debugging**: Flat PDP means model ignores that feature (despite high importance elsewhere—investigate!)
**How Does It Work?**
PDPs use a "what-if" approach:
1. **Select feature**: e.g., depth
2. **Create grid**: Test depths from 0m to 100m in 50 steps
3. **For each depth value**:
- Replace all data points' depth with this value
- Keep other features at their original values
- Predict for all modified points
- Average predictions
4. **Plot**: Depth (x-axis) vs Average Prediction (y-axis)
**Key insight**: PDP shows the **average effect** of depth, marginalizing over all other feature combinations.
**What Will You See?**
Line plots showing how predictions change as one feature varies. Look for:
- **Linear trends**: Simple relationships (e.g., deeper = lower water table)
- **Curves**: Nonlinear effects (e.g., optimal depth range for sand)
- **Steps**: Threshold behavior (e.g., season changes)
- **Flat lines**: Feature doesn't affect predictions
**Limitation**: PDPs assume feature independence. If features correlate (e.g., depth and temperature), PDPs can show unrealistic scenarios.
:::
**Question**: How does changing one feature affect predictions?
**Example**: How does depth affect sand probability?
```
Depth (m) Sand Probability
0 30% (shallow = clay/topsoil)
20 45% (transition zone)
40 85% (mid-aquifer = sand)
60 75% (deep = mixed)
80 40% (very deep = bedrock)
```
**Insight**: **40-60m depth is sweet spot** for sand prediction (85% probability).
**Operational Use**: When screening drilling locations, prioritize depths 40-60m.
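The four-step "what-if" recipe from the callout takes only a few lines to implement by hand. This sketch builds a synthetic dataset whose sand score peaks mid-depth (values invented for illustration, loosely echoing the table above), trains a regressor, and computes the partial dependence of depth manually:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
n = 1500
depth = rng.uniform(0, 100, n)
other = rng.normal(size=(n, 2))
# Synthetic "sand score" that peaks mid-aquifer (~45 m)
y = np.exp(-((depth - 45) / 20) ** 2) + 0.1 * other[:, 0] + rng.normal(scale=0.05, size=n)

X = np.column_stack([depth, other])
model = GradientBoostingRegressor(random_state=0).fit(X, y)

# Manual partial dependence for feature 0 (depth), following the recipe above
grid = np.linspace(0, 100, 26)
pdp = []
for value in grid:
    X_mod = X.copy()
    X_mod[:, 0] = value                      # force every row to this depth
    pdp.append(model.predict(X_mod).mean())  # average over the dataset

peak = grid[int(np.argmax(pdp))]
print(f"PDP peaks at depth ≈ {peak:.0f} m")
```

`sklearn.inspection.partial_dependence` (imported in the working example later in this chapter) does the same computation with more options.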
### 4. LSTM Attention Weights
::: {.callout-note icon=false}
## 📘 What Is Attention Mechanism?
**What Is It?**
**Attention mechanism** is a technique in deep learning (introduced by Bahdanau et al. 2014, popularized by Transformers in 2017) that allows models to **focus on the most relevant parts of input data** when making predictions. For time series, it answers: "Which past time steps matter most for predicting today's value?"
**Why Does It Matter for Time Series?**
Traditional LSTMs (Long Short-Term Memory networks) process sequences sequentially and "compress" all past information into a fixed-size memory. This creates problems:
1. **Information loss**: Distant past gets "forgotten" as new data arrives
2. **Equal weighting**: All time steps treated similarly (but some days matter more than others)
3. **Black box**: Hard to understand what the model "remembers"
**Attention solves this** by explicitly calculating **importance weights** for each past time step. For groundwater forecasting:
- **High attention on T-1 (yesterday)** → Recent conditions dominate (fast-responding aquifer)
- **High attention on T-30 (last month)** → Seasonal patterns important (slow-responding aquifer)
- **High attention on T-7 (last week)** → Weekly pumping cycles matter (human influence)
**How Does It Work?**
Think of attention like a spotlight on a timeline:
1. **Model looks at all past days** (T-1, T-2, ..., T-90)
2. **Calculates relevance score** for each day based on current context
3. **Assigns attention weights** (sum to 100%) - more weight = more focus
4. **Uses weighted combination** of past values for prediction
**Attention Weight = How much the model "looks at" that specific past day**
**What Will You See?**
Heatmaps or bar charts showing attention weights over time. Patterns reveal:
- **Sharp peak at T-1** = Short-term persistence dominates
- **Distributed weights** = Long-range dependencies (complex system memory)
- **Periodic peaks** = Seasonal or cyclic patterns (e.g., every 7 days = weekly)
**Operational Value**
If model pays high attention to T-30, and precipitation 30 days ago was high, you can explain:
> "Forecast predicts rising water levels because **the model learned that precipitation 30 days ago** (which was above normal) typically shows up in groundwater levels around now. This aligns with our understanding of aquifer recharge lag time."
:::
**Question**: Which past time steps does the forecast model focus on?
**Method**: Extract attention weights from Transformer/LSTM model
**Example**: 30-day water level forecast
| Time Lag | Attention Weight | Interpretation |
|----------|------------------|----------------|
| T-1 (yesterday) | 35% | Most recent observation critical |
| T-7 (last week) | 18% | Weekly pumping cycle |
| T-30 (last month) | 8% | Monthly seasonal pattern |
| T-90 (3 months ago) | 2% | Long-term climate memory |
**Visualization**:
- Heatmap showing attention over time
- Highlights which past days influence today's forecast
**Operational Use**:
- If model focuses on T-1: Short-term persistence dominates (stable conditions)
- If model focuses on T-30: Seasonal cycle drives forecast (plan for seasonal changes)
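The weight computation itself is compact. Below is a minimal sketch of scaled dot-product attention over a 90-day lag window, with random vectors standing in for the model's learned encodings (in a real LSTM/Transformer these come from trained layers); the point is that relevance scores pass through a softmax, so the weights are non-negative and sum to 1:

```python
import numpy as np

def softmax(z):
    z = z - z.max()          # numerical stability
    e = np.exp(z)
    return e / e.sum()

# Hypothetical encodings: one 8-dim key vector per past day (index 89 = T-1)
rng = np.random.default_rng(1)
keys = rng.normal(size=(90, 8))
query = rng.normal(size=8)   # encodes "what matters for today's forecast"

# Scaled dot-product attention: relevance scores, then weights summing to 1
scores = keys @ query / np.sqrt(keys.shape[1])
weights = softmax(scores)

assert np.isclose(weights.sum(), 1.0)
top = np.argsort(weights)[::-1][:3]
print("Most-attended lags (days back):", 90 - top)
```

In an operational model, plotting `weights` against lag produces exactly the heatmaps described above.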
---
## Decision Support Framework
::: {.callout-important icon=false}
## 🎯 Multi-Stakeholder Explanation Strategy
**The Challenge**: Different stakeholders need different levels of detail and different types of explanations. A **one-size-fits-all** explanation will either overwhelm non-technical users or under-inform technical experts.
**The Solution**: Tailored explanation strategy based on stakeholder needs, technical background, and decision authority.
**When to Use Which Explanation Type**:
**For Executive/Budget Decisions** (Managers, City Council):
- **Need**: Simple, high-level justification for funding/approval
- **Method**: SHAP summary → Top 3 bullet points
- **Format**: Executive summary with confidence metrics
- **Example**: "92% confident, three independent factors support this"
**For Technical Validation** (Geologists, Hydrologists):
- **Need**: Verify model learned correct physical relationships
- **Method**: Feature importance + Partial Dependence Plots
- **Format**: Technical report with domain interpretation
- **Example**: "Model learned that depth-conductivity relationship matches Archie's Law"
**For Regulatory Compliance** (EPA, State Regulators):
- **Need**: Prove model is scientifically defensible and unbiased
- **Method**: SHAP + domain validation + audit trail
- **Format**: Formal compliance document with methodology
- **Example**: "Model training dataset, validation metrics, and decision logic documented per EPA ML Guidelines 2024"
**For Real-Time Operations** (Field Technicians, Operators):
- **Need**: Understand why alert triggered or forecast changed
- **Method**: Attention weights + recent data changes
- **Format**: Dashboard alert with 1-sentence explanation
- **Example**: "Water level spike detected because heavy rain 3 days ago (model expected this)"
**For Public Communication** (Community Meetings, Media):
- **Need**: Build trust without overwhelming with technical details
- **Method**: Simplified SHAP → Plain language
- **Format**: Infographic or 1-page summary
- **Example**: "Computer model analyzed 10 years of data and found this location has best combination of water quality and quantity"
**Regulatory Requirements** (Know Your Jurisdiction):
- **EPA ML Guidelines (2024)**: Requires explainability for environmental decisions
- **EU AI Act**: High-risk AI systems must be transparent and auditable
- **State water law**: Varies - some require human expert confirmation
- **Legal liability**: Documented decision rationale reduces litigation risk
**Best Practice**: Always provide **multiple explanation levels**:
1. **One-sentence summary** (for quick reading)
2. **Three bullet points** (for stakeholder meetings)
3. **Full technical report** (for expert review and compliance)
:::
### Explanation Types by Stakeholder
| Stakeholder | Question | Explainability Method | Output Format |
|-------------|----------|----------------------|---------------|
| **Manager** | "Why drill here?" | SHAP summary | 3 bullet points |
| **Geologist** | "What geology patterns did model learn?" | Feature importance + PDP | Technical report |
| **Regulator** | "Prove model is scientifically sound" | SHAP + domain validation | Audit document |
| **Operator** | "Why did forecast change?" | Attention weights | Dashboard alert |
| **Public** | "How do you make decisions?" | Simplified SHAP | Infographic |
### Example: Explaining Well Siting
**Scenario**: Recommend drilling at (403500, 4428500)
**Stakeholder**: City Council (non-technical, budget authority)
**Explanation**:
> **Recommendation**: Drill production well at Location A (map coordinates provided)
>
> **Why this location?**
> 1. **Geology**: Model predicts 92% probability of sand (high-yield aquifer)
> - Based on HTEM resistivity data (same geology as nearby successful wells)
> - Confirmed by material type classification (well-sorted sand)
>
> 2. **Confidence**: Prediction uncertainty is LOW (±15 GPM vs ±45 GPM at alternative sites)
> - Many HTEM measurements nearby (dense data coverage)
> - Three existing wells within 800m all successful
>
> 3. **Cost**: Drilling cost $38K (16% below alternative high-yield site)
> - Shallower depth (42m vs 55m)
> - Better road access (shorter mobilization)
>
> 4. **Risk**: 95% chance yield exceeds 120 GPM
> - Compared to 60% at max-yield alternative
> - Risk-adjusted value is 2.1× higher
>
> **Bottom Line**: This site balances yield, cost, and confidence better than alternatives. Model used 12 data-driven factors, with nearby geology and data density being most important.
---
## Operational Dashboards
::: {.callout-tip icon=false}
## 📊 How Operational Dashboards Work
**Purpose**: Integrate explainability into daily operations so that **every prediction** can be understood and validated in real-time.
**The Problem Without Dashboards**:
- Operators receive model predictions but don't understand why
- No easy way to validate if prediction makes physical sense
- Trust erosion when predictions occasionally fail
- Manual effort to generate explanations for each decision
**The Solution**: Embedded explainability in operational dashboards
**Key Dashboard Components**:
**1. Interactive Prediction Map**
- Click any location → See predicted value + SHAP explanation
- Color-coded confidence (green = high, yellow = medium, red = low)
- Overlay with validation data (existing wells, geology)
**2. Explanation Panel** (appears on click)
- Top 3 contributing features (SHAP values)
- Comparison to similar locations
- Confidence interval and uncertainty
- Domain expert notes (if available)
**3. Override Workflow**
- Geologist can flag prediction as "disagree"
- System prompts for reason (improves future training)
- Logs disagreement for quarterly model review
- Tracks "expert vs model" accuracy over time
**4. Automated Quality Checks**
- Flag predictions with contradictory features
- Warn when prediction extrapolates beyond training data
- Alert when uncertainty is high (>30% confidence interval)
**User Workflows**:
**Daily Monitoring Workflow**:
1. Open dashboard, review today's predictions
2. Click any unusual predictions
3. Read SHAP explanation: "Does this make sense?"
4. If yes → Approve. If no → Flag for expert review
5. Weekly: Review flagged predictions with team
**Decision Support Workflow** (e.g., drilling site selection):
1. Model recommends top 5 locations
2. For each location, click to see:
- Predicted yield (e.g., 135 GPM)
- Why: SHAP shows depth, geology, proximity factors
- Similar cases: "8 nearby wells avg 128 GPM"
- Confidence: ±15 GPM (tight) vs ±45 GPM (wide)
3. Narrow to top 2 based on confidence + cost
4. Present to stakeholders with embedded explanations
**Integration with Other Systems**:
- **GIS**: Export explanation data to ArcGIS for spatial analysis
- **Permitting**: Generate PDF report with prediction + explanation for regulators
- **Field operations**: Send mobile-friendly explanation to technicians
- **Reporting**: Monthly summary of predictions, outcomes, and accuracy
:::
### Explainability Features
**1. Prediction Detail Panel**:
- Click any prediction → See SHAP values
- Top 5 contributing features highlighted
- Comparison to similar past predictions
**2. Confidence Indicators**:
- Color-code by explanation clarity
- 🟢 Green: Clear dominant features (>50% from top 3)
- 🟡 Yellow: Mixed contributions (distributed across 5+ features)
- 🔴 Red: Weak signal (high uncertainty, conflicting features)
**3. Similar Cases**:
- "Model made similar predictions at 12 nearby locations"
- "8/12 were confirmed by drilling"
- "Average actual yield: 132 GPM (vs predicted 135 GPM)"
**4. Override Mechanism**:
- If geologist disagrees, can flag prediction
- System logs disagreement for model retraining
- Track "model vs expert" accuracy over time
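The traffic-light rule above can be encoded as a small helper. This is a hypothetical implementation of the thresholds described (top-3 share of total SHAP magnitude above 50%, plus a simple cancellation check for "conflicting features"); a production dashboard would tune these cutoffs against expert feedback:

```python
import numpy as np

def explanation_clarity(shap_values, top_k=3, clear_threshold=0.5):
    """Traffic-light label for one prediction's SHAP vector (illustrative rule)."""
    values = np.asarray(shap_values, dtype=float)
    magnitudes = np.abs(values)
    total = magnitudes.sum()
    if total == 0:
        return "red"                       # no signal at all
    top_share = np.sort(magnitudes)[::-1][:top_k].sum() / total
    # Conflicting signal: positive and negative contributions nearly cancel
    net_ratio = abs(values.sum()) / total
    if top_share > clear_threshold and net_ratio > 0.3:
        return "green"                     # clear dominant, aligned features
    if top_share > clear_threshold:
        return "yellow"                    # dominant but conflicting features
    return "red"                           # weak, distributed signal

print(explanation_clarity([0.42, 0.15, 0.12, -0.08, 0.05]))
print(explanation_clarity([0.1, -0.1, 0.1, -0.1, 0.1, -0.1, 0.05, -0.05]))
```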
---
## Regulatory Compliance
::: {.callout-important icon=false}
## ⚖️ Regulatory Compliance Framework
**Why Compliance Matters**:
Regulators (EPA, state water agencies, permit authorities) increasingly require **transparency and accountability** for AI-assisted environmental decisions. Explainability isn't optional—it's a legal requirement in many jurisdictions.
**Key Regulatory Requirements by Jurisdiction**:
**EPA ML Guidelines (2024)** - Federal environmental decisions
- **Requirement**: Document model methodology, validation, and decision rationale
- **Applies to**: Groundwater monitoring, contamination assessment, permitting
- **Key mandate**: "AI systems must provide interpretable explanations for high-stakes decisions"
- **What you need**: SHAP values, feature importance, validation metrics, audit trail
**EU AI Act (2024)** - High-risk AI systems
- **Requirement**: Transparency, human oversight, technical documentation
- **Applies to**: Critical infrastructure (water supply), environmental monitoring
- **Key mandate**: Users must be informed when interacting with AI, decisions must be explainable
- **What you need**: User disclosure, explanation on demand, risk assessment documentation
**State Water Law** - Varies by state
- **California**: Requires scientific basis for groundwater management decisions (SGMA)
- **Texas**: Groundwater conservation districts require evidence-based permitting
- **Florida**: Water management districts require technical justification for well permits
- **What you need**: Expert review + model explanation for permit applications
**Required Documentation** (Standard Audit Package):
1. **Model Methodology**
- Algorithm type (Random Forest, XGBoost, Neural Network)
- Training dataset size, source, date range
- Features used and their physical meaning
- Validation approach (holdout, cross-validation, temporal split)
2. **Performance Metrics**
- Accuracy, precision, recall, F1 score
- Uncertainty quantification (confidence intervals)
- Error analysis (where does model fail?)
- Comparison to baseline/traditional methods
3. **Explainability Evidence**
- SHAP values for specific decisions
- Feature importance across all predictions
- Partial dependence plots showing learned relationships
- Validation that model learned physically plausible patterns
4. **Version Control & Traceability**
- Which model version made this decision?
- When was model trained/updated?
- What data was used for training?
- Who reviewed and approved the model?
5. **Human Oversight**
- Expert review process
- Override mechanism (when humans disagree)
- Tracking of model vs expert accuracy
- Escalation procedures for uncertain predictions
**Best Practices for Compliance**:
- **Automate audit trail generation** - Log every prediction + explanation
- **Pre-generate compliance templates** - One-click export for permit applications
- **Quarterly expert review** - Licensed hydrogeologist validates model behavior
- **Public transparency** - Make methodology available (not proprietary black box)
- **Versioning discipline** - Never deploy unversioned models
**Legal Liability Protection**:
Documented explainability reduces legal risk:
- **If model is wrong**: "We used best available science, documented reasoning, expert reviewed"
- **If decision is challenged**: Audit trail shows due diligence and transparency
- **If regulation changes**: Version control allows retrospective validation
:::
### Audit Trail Requirements
**What regulators need**:
1. **Methodology**: How model was trained (data, algorithm, validation)
2. **Performance**: Accuracy, false positive rate, uncertainty quantification
3. **Explainability**: Why specific decision was made
4. **Version control**: Which model version made decision, when trained
5. **Human oversight**: Who reviewed, approved, implemented decision
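A sketch of how the five elements above might be assembled into one loggable record per decision. Field names here are illustrative, not a regulatory schema; the point is that every prediction gets a timestamped, versioned, explainable entry:

```python
import json
from datetime import datetime, timezone

def build_audit_record(model_name, model_version, prediction, confidence,
                       shap_top_factors, reviewer=None):
    """Assemble one audit-trail entry (illustrative field names)."""
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "model": {"name": model_name, "version": model_version},  # traceability
        "prediction": prediction,
        "confidence": confidence,
        "explanation": shap_top_factors,        # explainability evidence
        "human_review": reviewer or "pending",  # oversight element
    }

record = build_audit_record(
    model_name="material_classifier",
    model_version="1.5",
    prediction="sand",
    confidence=0.92,
    shap_top_factors=[{"feature": "is_sand", "contribution": 0.42}],
)
print(json.dumps(record, indent=2))
```

Writing each record as JSON makes the quarterly review and the "one-click export" for permit applications straightforward.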
**Example Audit Report** (drilling permit application):
> **Model-Assisted Decision: Production Well #47**
>
> **Decision**: Approve drilling at (403500, 4428500, depth 42m)
>
> **Model Information**:
> - Model: Random Forest Material Classifier v1.5
> - Training date: 2024-08-15
> - Training data: 163,158 HTEM samples from Unit D
> - Test accuracy: 86.4% (20% holdout validation)
> - Uncertainty: ±15 GPM (95% confidence interval)
>
> **Prediction Explanation**:
> - Material type: MT 11 (well-sorted sand)
> - Confidence: 92%
> - SHAP top factors:
> 1. is_sand indicator: +42% probability
> 2. Elevation (Z = -50m): +15% probability
> 3. Northing (Y = 4428500): +12% probability
> - Similar nearby wells: 3 within 800m, all successful (avg 128 GPM)
>
> **Human Review**:
> - Reviewed by: Dr. Jane Smith, Licensed Hydrogeologist (PG #12345)
> - Review date: 2024-11-20
> - Expert assessment: "Recommendation supported by geology. HTEM data density high in this area. Approve."
> - Final decision: **APPROVED for drilling** (City Council 2024-11-22)
>
> **Outcome Tracking**:
> - Actual yield: Will be measured upon well completion (expected 2025-01-15)
> - Model accuracy will be validated and updated in quarterly review
---
## Working Example: Groundwater Level Prediction
In this section, we'll build a practical explainability system for predicting groundwater levels using gradient boosting. We'll train a model and then explain its predictions using multiple techniques.
```{python}
#| label: setup
#| code-fold: true
#| warning: false
import os
import sys
import pandas as pd
import numpy as np
from pathlib import Path
import plotly.graph_objects as go
import plotly.express as px
from plotly.subplots import make_subplots
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.inspection import partial_dependence
import warnings
warnings.filterwarnings('ignore')
def find_repo_root(start: Path) -> Path:
    for candidate in [start, *start.parents]:
        if (candidate / "src").exists():
            return candidate
    return start

quarto_project = Path(os.environ.get("QUARTO_PROJECT_DIR", str(Path.cwd())))
project_root = find_repo_root(quarto_project)
if str(project_root) not in sys.path:
    sys.path.append(str(project_root))
from src.utils import get_data_path
```
### Data Preparation
```{python}
#| label: load-data
#| code-fold: true
#| warning: false
# Load real data using FusionBuilder
from src.data_fusion import FusionBuilder
from src.data_loaders import IntegratedDataLoader
# Initialize flag to track data loading status
data_loaded = False
df = None
feature_cols = []
model = None
try:
    htem_root = get_data_path("htem_root")
    aquifer_db_path = get_data_path("aquifer_db")
    weather_db_path = get_data_path("warm_db")
    usgs_stream_root = get_data_path("usgs_stream")
    loader = IntegratedDataLoader(
        htem_path=htem_root,
        aquifer_db_path=aquifer_db_path,
        weather_db_path=weather_db_path,
        usgs_stream_path=usgs_stream_root
    )
    builder = FusionBuilder(loader)
    # Build ML-ready dataset with all features
    df = builder.build_temporal_dataset(
        wells=None,  # All wells
        start_date='2010-01-01',
        end_date='2023-12-31',
        include_weather=True,
        include_stream=True,
        add_features=True
    )
    loader.close()
    data_loaded = True
except Exception as e:
    print(f"Warning: Could not load data - {e}")
    data_loaded = False
if data_loaded and df is not None:
    # Rename for consistency with the rest of the chapter
    # FusionBuilder uses Water_Level_ft as the column name
    if 'Water_Level_ft' in df.columns:
        df['Water_Surface_Elevation'] = df['Water_Level_ft']
    elif 'water_level' in df.columns:
        df['Water_Surface_Elevation'] = df['water_level']
    # Standardize column names for compatibility
    if 'WellID' in df.columns:
        df['well_id'] = df['WellID']
    if 'Date' in df.columns:
        df['date'] = df['Date']
    print(f"✅ Loaded {len(df):,} records using FusionBuilder")
    print(f"   Wells: {df['well_id'].nunique() if 'well_id' in df.columns else df['WellID'].nunique()}")
    print(f"   Features: {df.shape[1]}")
    # Ensure we have required features
    if 'month' not in df.columns and 'date' in df.columns:
        df['month'] = pd.to_datetime(df['date']).dt.month
        df['day_of_year'] = pd.to_datetime(df['date']).dt.dayofyear
        df['season'] = df['month'].apply(lambda x:
            1 if x in [12, 1, 2] else 2 if x in [3, 4, 5] else 3 if x in [6, 7, 8] else 4)
    # Create lag features if not present
    if 'prev_water_level' not in df.columns:
        if 'water_level_lag1d' in df.columns:
            df['prev_water_level'] = df['water_level_lag1d']
        elif 'Water_Level_ft_lag1d' in df.columns:
            df['prev_water_level'] = df['Water_Level_ft_lag1d']
    # Create depth_to_water if not present
    if 'depth_to_water' not in df.columns:
        if 'Depth_to_Water' in df.columns:
            df['depth_to_water'] = df['Depth_to_Water']
        elif 'Water_Surface_Elevation' in df.columns:
            # Use elevation as proxy (higher elevation = shallower water)
            df['depth_to_water'] = df['Water_Surface_Elevation'].max() - df['Water_Surface_Elevation']
    # Create month if not present
    if 'month' not in df.columns and 'Month' in df.columns:
        df['month'] = df['Month']
    # Create season if not present
    if 'season' not in df.columns and 'Quarter' in df.columns:
        df['season'] = df['Quarter']
    elif 'season' not in df.columns and 'month' in df.columns:
        df['season'] = df['month'].apply(lambda x:
            1 if x in [12, 1, 2] else 2 if x in [3, 4, 5] else 3 if x in [6, 7, 8] else 4)
    # Verify target column exists
    if 'Water_Surface_Elevation' in df.columns:
        print(f"Dataset: {len(df):,} records")
        print(f"Target variable range: {df['Water_Surface_Elevation'].min():.1f} to {df['Water_Surface_Elevation'].max():.1f}")
    else:
        print(f"⚠️ Warning: Water_Surface_Elevation column not found")
        print(f"Available columns: {list(df.columns[:10])}")
        data_loaded = False
else:
    print("⚠️ ERROR: Data loading failed - cannot proceed without real data")
    print("This chapter requires actual groundwater measurements from FusionBuilder.")
    print("Please verify:")
    print("  1. Database paths are correct (aquifer.db, warm.db)")
    print("  2. Well data exists in OB_WELL_MEASUREMENTS_CHAMPAIGN_COUNTY table")
    print("  3. FusionBuilder.build_temporal_dataset() has well data")
    df = None
    data_loaded = False
```
### Feature Engineering & Model Training
```{python}
#| label: train-model
#| code-fold: true
#| warning: false
# Select features for modeling - use FusionBuilder columns if available
if data_loaded and df is not None and 'Water_Surface_Elevation' in df.columns:
    # Use features from FusionBuilder (correct column names)
    potential_features = [
        # Derived features (for partial dependence plots)
        'prev_water_level', 'month', 'depth_to_water', 'season',
        # Weather features
        'Precipitation_mm', 'Temperature_C', 'PET_mm', 'NetWater_mm',
        # Rolling precipitation
        'Precipitation_mm_roll7d_sum', 'Precipitation_mm_roll14d_sum', 'Precipitation_mm_roll30d_sum',
        # Cumulative features
        'Precipitation_mm_cum7d', 'NetWater_mm_cum30d',
        # Lag features
        'Water_Level_ft_lag1d', 'Water_Level_ft_lag7d', 'Water_Level_ft_lag14d', 'Water_Level_ft_lag30d',
        # Temporal features
        'DayOfYear_sin', 'DayOfYear_cos', 'Month_sin', 'Month_cos',
        # Stream features
        'Discharge_cfs', 'Discharge_cfs_roll7d_mean'
    ]
    feature_cols = [col for col in potential_features if col in df.columns]

    # Add any available lag/rolling features not already included
    for col in df.columns:
        if ('lag' in col or 'roll' in col or 'cum' in col) and col not in feature_cols:
            if df[col].dtype in ['float64', 'int64'] and df[col].notna().mean() > 0.5:
                feature_cols.append(col)
    feature_cols = feature_cols[:15]  # Limit to 15 features to keep plots readable

    # Drop rows with remaining NaN values
    available_cols = [col for col in feature_cols if col in df.columns]
    df_clean = df[available_cols + ['Water_Surface_Elevation']].dropna()
else:
    print("⚠️ ERROR: Cannot train model - no valid data loaded")
    feature_cols = []
    available_cols = []
    df_clean = pd.DataFrame()

if len(available_cols) > 0:
    X = df_clean[available_cols]
    y = df_clean['Water_Surface_Elevation']
    feature_cols = available_cols  # Update to actual available features

    # Split into train/test sets
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=42
    )

    # Train gradient boosting model
    model = GradientBoostingRegressor(
        n_estimators=100,
        learning_rate=0.1,
        max_depth=5,
        min_samples_split=20,
        min_samples_leaf=10,
        random_state=42
    )
    model.fit(X_train, y_train)
else:
    print("⚠️ No features available - skipping model training")
    X_train = X_test = pd.DataFrame()
    y_train = y_test = pd.Series(dtype=float)
    model = None

if model is not None and len(X_train) > 0:
    # Make predictions
    y_pred_train = model.predict(X_train)
    y_pred_test = model.predict(X_test)

    # Calculate metrics
    train_r2 = r2_score(y_train, y_pred_train)
    test_r2 = r2_score(y_test, y_pred_test)
    train_rmse = np.sqrt(mean_squared_error(y_train, y_pred_train))
    test_rmse = np.sqrt(mean_squared_error(y_test, y_pred_test))
    print("Model Performance:")
    print(f"  Training R²: {train_r2:.3f}, RMSE: {train_rmse:.2f} ft")
    print(f"  Testing R²:  {test_r2:.3f}, RMSE: {test_rmse:.2f} ft")
else:
    print("⚠️ Model not trained - skipping predictions")
    y_pred_train = y_pred_test = np.array([])
    train_r2 = test_r2 = train_rmse = test_rmse = 0.0
```
### 1. Feature Importance Analysis
Feature importance tells us which variables matter most for predicting groundwater levels across all predictions.
```{python}
#| label: fig-feature-importance
#| fig-cap: "Feature importance analysis showing which variables contribute most to water level predictions. Previous water level dominates due to high aquifer autocorrelation."
#| code-fold: false
#| warning: false
# Check if model was successfully trained
if model is None or len(feature_cols) == 0:
    print("⚠️ ERROR: No trained model available for feature importance analysis")
    print("Cannot display feature importance without valid training data")
else:
    # Extract feature importances
    importances = model.feature_importances_
    feature_names = feature_cols

    # Create DataFrame for sorting
    importance_df = pd.DataFrame({
        'Feature': feature_names,
        'Importance': importances,
        'Importance_Pct': importances * 100
    }).sort_values('Importance', ascending=True)

    # Create horizontal bar chart
    fig = go.Figure()
    fig.add_trace(go.Bar(
        y=importance_df['Feature'],
        x=importance_df['Importance_Pct'],
        orientation='h',
        marker=dict(
            color=importance_df['Importance_Pct'],
            colorscale='Blues',
            showscale=True,
            colorbar=dict(title="Importance %")
        ),
        text=[f"{val:.1f}%" for val in importance_df['Importance_Pct']],
        textposition='auto',
        hovertemplate='<b>%{y}</b><br>' +
                      'Importance: %{x:.2f}%<br>' +
                      '<extra></extra>'
    ))
    fig.update_layout(
        title={
            'text': "Feature Importance: What Drives Water Level Predictions?",
            'x': 0.5,
            'xanchor': 'center'
        },
        xaxis_title="Importance (%)",
        yaxis_title="Feature",
        height=500,
        template='plotly_white',
        font=dict(size=12),
        showlegend=False
    )
    fig.show()
```
::: {.callout-note icon=false}
## 💻 For Computer Scientists
Feature importance measures how much each input variable contributes to reducing prediction error across the entire dataset. Gradient boosting reports impurity-based importance (mean decrease in impurity); for regression trees the impurity criterion is squared error, while Gini impurity applies only to classification.
**Key insight**: `prev_water_level` dominates because groundwater systems have high autocorrelation—yesterday's level strongly predicts today's level.
:::
::: {.callout-tip icon=false}
## 🌍 For Geologists/Hydrologists
The feature importance rankings reveal physical processes:
- **High `prev_water_level` importance** → Aquifer has **memory** (confined or low-permeability system)
- **High `season/month` importance** → Strong seasonal recharge cycles
- **High `surface_elevation` importance** → Topographic control on water table
**Key insight**: If temporal features dominate spatial features, this indicates the aquifer responds more to climate than local geology.
:::
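Impurity-based importance can overstate continuous or high-cardinality features. A quick cross-check is permutation importance: shuffle one column at a time and measure how much R² drops. The sketch below runs on synthetic stand-in data; the column names echo this chapter's features, but the values are simulated, not real well records.

```python
# Cross-checking impurity-based importance with permutation importance
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(42)
n = 500
# Synthetic stand-in for groundwater features (not real well data)
X = pd.DataFrame({
    'prev_water_level': rng.normal(650, 5, n),   # dominant driver, by design
    'month': rng.integers(1, 13, n),             # unrelated to the target
    'precip_30d': rng.gamma(2.0, 15.0, n),       # weak driver
})
y = 0.9 * X['prev_water_level'] + 0.05 * X['precip_30d'] + rng.normal(0, 1, n)

model = GradientBoostingRegressor(random_state=42).fit(X, y)

# Permutation importance: shuffle one column, measure the drop in R²
perm = permutation_importance(model, X, y, n_repeats=5, random_state=42)
ranking = sorted(zip(X.columns, perm.importances_mean),
                 key=lambda t: t[1], reverse=True)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

If the two importance methods disagree sharply on the top features, that is itself a red flag worth investigating before presenting either ranking to stakeholders.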
### 2. Partial Dependence Plots
Partial dependence plots show how changing one feature affects predictions while holding all other features constant.
```{python}
#| label: partial-dependence
#| code-fold: false
#| warning: false
# Calculate partial dependence for key features
# Only use features that exist in feature_cols
desired_features = ['prev_water_level', 'month', 'depth_to_water', 'season']
features_to_plot = [f for f in desired_features if f in feature_cols]

# If none of the desired features exist, use top 4 available features
if len(features_to_plot) < 2:
    features_to_plot = feature_cols[:min(4, len(feature_cols))]

# Create subplots
n_features = len(features_to_plot)
n_rows = (n_features + 1) // 2
n_cols = min(2, n_features)
fig = make_subplots(
    rows=n_rows, cols=n_cols,
    subplot_titles=[f"PDP: {feat}" for feat in features_to_plot],
    vertical_spacing=0.12,
    horizontal_spacing=0.1
)

for idx, feature in enumerate(features_to_plot):
    row = (idx // 2) + 1
    col = (idx % 2) + 1
    try:
        feature_idx = feature_cols.index(feature)
    except ValueError:
        print(f"Feature '{feature}' not found in feature_cols, skipping")
        continue

    # Calculate partial dependence
    pd_result = partial_dependence(
        model, X_train, features=[feature_idx],
        grid_resolution=50
    )

    # Extract values
    pd_values = pd_result['average'][0]
    feature_values = pd_result['grid_values'][0]

    # Add trace ('tozeroy' shades under each curve; 'tonexty' would
    # fill toward the previous trace, which sits in a different subplot)
    fig.add_trace(
        go.Scatter(
            x=feature_values,
            y=pd_values,
            mode='lines',
            line=dict(color='#2e8bcc', width=3),
            fill='tozeroy',
            fillcolor='rgba(46, 139, 204, 0.2)',
            name=feature,
            hovertemplate='<b>' + feature + '</b><br>' +
                          'Value: %{x:.2f}<br>' +
                          'Predicted Water Level: %{y:.2f} ft<br>' +
                          '<extra></extra>'
        ),
        row=row, col=col
    )

    # Update axes
    fig.update_xaxes(title_text=feature, row=row, col=col)
    fig.update_yaxes(title_text="Water Level (ft)", row=row, col=col)

fig.update_layout(
    title={
        'text': "Partial Dependence: How Each Feature Affects Predictions",
        'x': 0.5,
        'xanchor': 'center'
    },
    height=700,
    template='plotly_white',
    showlegend=False,
    font=dict(size=11)
)
fig.show()
```
::: {.callout-note icon=false}
## 💻 For Computer Scientists
Partial dependence plots (PDPs) marginalize over other features to show the **marginal effect** of one feature. They answer: "If I change X while keeping everything else average, how does the prediction change?"
**Limitation**: PDPs assume feature independence. If features are correlated (e.g., `month` and `season`), PDPs can be misleading.
:::
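One way to probe that limitation is to plot Individual Conditional Expectation (ICE) curves alongside the PDP: the PDP is just the average of the ICE curves, so a wide spread between curves signals interactions that the average hides. A minimal sketch on synthetic data (feature names are illustrative, and the interaction is planted deliberately):

```python
# ICE curves: per-sample response to one feature, vs. the PDP's average
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
n = 300
X = pd.DataFrame({
    'prev_water_level': rng.normal(650, 5, n),
    'month': rng.integers(1, 13, n).astype(float),
})
# Planted interaction: month matters more when the prior level is low
y = X['prev_water_level'] + np.where(
    X['prev_water_level'] < 650,
    2.0 * np.sin(X['month'] / 12 * 2 * np.pi), 0.0)
model = GradientBoostingRegressor(random_state=0).fit(X, y)

grid = np.linspace(1, 12, 12)
sample = X.sample(20, random_state=0)
ice_curves = []
for _, row in sample.iterrows():
    X_rep = pd.DataFrame([row] * len(grid))  # clone the row across the grid
    X_rep['month'] = grid                    # sweep only the feature of interest
    ice_curves.append(model.predict(X_rep))
ice = np.array(ice_curves)       # shape: (20 samples, 12 grid points)
pdp = ice.mean(axis=0)           # averaging the ICE curves gives the PDP
spread = ice.std(axis=0).mean()  # large spread hints at interactions
print(f"PDP range: {pdp.max() - pdp.min():.2f} ft, mean ICE spread: {spread:.2f} ft")
```

When the ICE spread is comparable to the PDP range, the averaged curve is hiding heterogeneous behavior and should not be presented on its own.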
::: {.callout-tip icon=false}
## 🌍 For Geologists/Hydrologists
Physical interpretation of PDP patterns:
- **Linear PDP** → Simple relationship (e.g., deeper depth = lower water table)
- **Curved PDP** → Nonlinear process (e.g., seasonal recharge cycles)
- **Flat PDP** → Feature doesn't matter much for this aquifer
- **Step PDP** → Threshold behavior (e.g., season transitions)
**Key insight**: If the `month` PDP shows a peak in spring (months 3-5), this indicates **snowmelt or spring recharge dominates** the system.
:::
### 3. Prediction Decomposition (SHAP-like Analysis)
For a specific prediction, we can decompose it into contributions from each feature. This shows **why** the model made a particular prediction.
```{python}
#| label: prediction-decomposition
#| code-fold: false
#| warning: false
# Select an interesting test example (median prediction)
test_indices = y_test.sort_values().index
median_idx = test_indices[len(test_indices) // 2]

# Get the sample
sample = X_test.loc[[median_idx]]
actual = y_test.loc[median_idx]
predicted = model.predict(sample)[0]

# Calculate baseline (mean prediction)
baseline = y_train.mean()

# Approximate feature contributions by perturbation: compare the prediction
# with the observed feature value vs. with that feature reset to its
# training-set mean
pred_with = predicted  # prediction with all observed feature values
contributions = {}
for feature in feature_cols:
    sample_without = sample.copy()
    sample_without[feature] = X_train[feature].mean()
    pred_without = model.predict(sample_without)[0]
    # Contribution is the difference
    contributions[feature] = pred_with - pred_without

# Convert to DataFrame
contrib_df = pd.DataFrame({
    'Feature': list(contributions.keys()),
    'Contribution': list(contributions.values()),
    'Value': [sample[feat].values[0] for feat in contributions.keys()]
}).sort_values('Contribution', ascending=True)

# Create waterfall chart
fig = go.Figure()

# Build the waterfall series: baseline, one relative step per feature, total
y_values = [baseline]
measures = ['absolute']
for _, row in contrib_df.iterrows():
    y_values.append(row['Contribution'])
    measures.append('relative')

# Final prediction bar (Plotly computes the running total itself; it can
# differ slightly from `predicted` because one-at-a-time perturbation
# ignores feature interactions)
y_values.append(predicted)
measures.append('total')

# Create labels
labels = ['Baseline<br>(Mean)'] + [
    f"{row['Feature']}<br>= {row['Value']:.2f}"
    for _, row in contrib_df.iterrows()
] + ['Final<br>Prediction']

# Create waterfall
fig.add_trace(go.Waterfall(
    name="Contribution",
    orientation="v",
    measure=measures,
    x=labels,
    y=y_values,
    text=[f"{val:.2f} ft" for val in y_values],
    textposition="outside",
    connector={"line": {"color": "rgb(63, 63, 63)"}},
    increasing={"marker": {"color": "#3cd4a8"}},
    decreasing={"marker": {"color": "#f59e0b"}},
    totals={"marker": {"color": "#2e8bcc"}}
))
fig.update_layout(
    title={
        'text': f"Prediction Decomposition: Why Predict {predicted:.1f} ft?<br>" +
                f"<sub>Actual: {actual:.1f} ft | Error: {abs(predicted - actual):.2f} ft</sub>",
        'x': 0.5,
        'xanchor': 'center'
    },
    yaxis_title="Water Surface Elevation (ft)",
    xaxis_title="Feature Contributions",
    height=600,
    template='plotly_white',
    showlegend=False,
    font=dict(size=11)
)
fig.show()

# Print interpretation
print("\nPrediction Interpretation:")
print(f"  Baseline (average water level): {baseline:.2f} ft")
print(f"  Top positive contributor: {contrib_df.iloc[-1]['Feature']} (+{contrib_df.iloc[-1]['Contribution']:.2f} ft)")
print(f"  Top negative contributor: {contrib_df.iloc[0]['Feature']} ({contrib_df.iloc[0]['Contribution']:.2f} ft)")
print(f"  Final prediction: {predicted:.2f} ft")
print(f"  Actual value: {actual:.2f} ft")
print(f"  Prediction error: {abs(predicted - actual):.2f} ft")
```
::: {.callout-note icon=false}
## 💻 For Computer Scientists
This approximation estimates **local feature contributions** by perturbing features one at a time. It's similar to SHAP values but computationally simpler.
**True SHAP** uses game theory to calculate Shapley values (exact marginal contributions accounting for all feature interactions). This approximation is faster but less rigorous.
**Key insight**: For production systems, use true SHAP (`shap` library). For exploratory analysis, this approximation is sufficient.
:::
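The true-SHAP upgrade mentioned above is only a few lines with the `shap` library's `TreeExplainer`. This hedged sketch runs on a synthetic toy model, checks SHAP's local-accuracy property (base value plus contributions equals the prediction), and degrades gracefully where `shap` is not installed:

```python
# Exact SHAP values via shap.TreeExplainer on a toy regression model
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(7)
# Synthetic stand-in data, not real well records
X = pd.DataFrame({
    'prev_water_level': rng.normal(650, 5, 200),
    'precip_30d': rng.gamma(2.0, 15.0, 200),
})
y = 0.9 * X['prev_water_level'] + 0.05 * X['precip_30d'] + rng.normal(0, 1, 200)
model = GradientBoostingRegressor(random_state=7).fit(X, y)

try:
    import shap  # optional dependency; skip gracefully if absent
    explainer = shap.TreeExplainer(model)
    shap_values = explainer.shap_values(X.iloc[:1])  # shape (1, n_features)
    base = float(np.ravel(np.asarray(explainer.expected_value))[0])
    # Local accuracy: base value + contributions == model prediction
    total = base + float(shap_values[0].sum())
    shap_ok = bool(np.isclose(total, model.predict(X.iloc[:1])[0], atol=1e-3))
except Exception:  # broad on purpose: this is a sketch, not production code
    shap_ok = None
print(f"SHAP local accuracy holds: {shap_ok}")
```

The local-accuracy check is worth keeping in production: if it ever fails, the explainer and the model have drifted apart (e.g., the model was retrained without re-fitting the explainer).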
::: {.callout-tip icon=false}
## 🌍 For Geologists/Hydrologists
The waterfall chart shows **why this specific well has this predicted water level**:
- **Baseline** = What we'd predict knowing nothing (average across all wells)
- **Each bar** = How much this well's specific characteristic changes the prediction
- **Green bars** = Features pushing water level UP (more water)
- **Orange bars** = Features pushing water level DOWN (less water)
**Example interpretation**:
- If `prev_water_level` is +5 ft, it means "this well had high water last time, so we predict high water now"
- If `season = 4` (fall) is -2 ft, it means "fall typically has lower water levels due to reduced recharge"
**Key insight**: This explains individual predictions to stakeholders, enabling trust and validation.
:::
### 4. Model Performance Visualization
```{python}
#| label: performance-viz
#| code-fold: true
#| warning: false
# Create scatter plot of predicted vs actual
fig = go.Figure()

# Test set predictions
fig.add_trace(go.Scatter(
    x=y_test,
    y=y_pred_test,
    mode='markers',
    name='Test Set',
    marker=dict(
        color='#2e8bcc',
        size=5,
        opacity=0.6,
        line=dict(width=0)
    ),
    hovertemplate='Actual: %{x:.2f} ft<br>' +
                  'Predicted: %{y:.2f} ft<br>' +
                  '<extra></extra>'
))

# Perfect prediction line
min_val = min(y_test.min(), y_pred_test.min())
max_val = max(y_test.max(), y_pred_test.max())
fig.add_trace(go.Scatter(
    x=[min_val, max_val],
    y=[min_val, max_val],
    mode='lines',
    name='Perfect Prediction',
    line=dict(color='red', dash='dash', width=2)
))

fig.update_layout(
    title={
        'text': f"Prediction Accuracy: R² = {test_r2:.3f}, RMSE = {test_rmse:.2f} ft",
        'x': 0.5,
        'xanchor': 'center'
    },
    xaxis_title="Actual Water Surface Elevation (ft)",
    yaxis_title="Predicted Water Surface Elevation (ft)",
    height=600,
    template='plotly_white',
    font=dict(size=12),
    hovermode='closest'
)

# Make axes equal (anchor one axis to the other; anchoring both creates a loop)
fig.update_yaxes(scaleanchor="x", scaleratio=1)
fig.show()
```
::: {.callout-important icon=false}
## Production Deployment Implications
This working example demonstrates key explainability requirements for production:
1. **Feature Importance** → Guides data collection priorities (focus on high-importance features)
2. **Partial Dependence** → Validates model behavior matches physical expectations
3. **Prediction Decomposition** → Enables stakeholder trust and regulatory compliance
4. **Performance Metrics** → Quantifies uncertainty for decision-making
**For a $45K drilling decision**: Show stakeholders the decomposition plot + feature importance to explain why the model recommends a specific location.
:::
---
## Lessons Learned When Deploying Explainability
::: {.callout-note icon=false}
## 🎓 Pattern Summary: Common Explainability Pitfalls
**Why These Lessons Matter**:
The following three cases represent **common failure modes** when deploying explainable AI in real-world operations. Each illustrates a different dimension of the explainability challenge:
1. **Technical accuracy ≠ operational trust** (Case 1)
2. **Regulatory compliance requires documentation** (Case 2)
3. **Model coherence must match domain knowledge** (Case 3)
**Success Factors for Explainable AI Deployment**:
**Before Deployment**:
- ✅ Train models AND explainers simultaneously (not as afterthought)
- ✅ Validate that explanations match physical knowledge
- ✅ Prepare regulatory compliance package before first use
- ✅ Test explanations with actual stakeholders (not just data scientists)
- ✅ Build override workflow into system from day one
**During Operations**:
- ✅ Monitor explanation quality, not just prediction accuracy
- ✅ Track when humans override model (this is valuable feedback)
- ✅ Update explanations when model is retrained
- ✅ Provide multiple explanation levels (executive → technical)
- ✅ Log every prediction + explanation for audit trail
**After Deployment**:
- ✅ Quarterly review of explanation coherence with domain experts
- ✅ Retrain when explanations stop making physical sense
- ✅ Update compliance documentation when regulations change
- ✅ Survey stakeholders: "Do you understand the explanations?"
- ✅ Publish lessons learned (contribute to community knowledge)
**Implementation Recommendations**:
**For New Projects**:
1. **Start with simple, interpretable models** (linear, tree-based)
2. **Only add complexity if needed** (neural nets as last resort)
3. **Build explanation dashboard before production deployment**
4. **Run parallel deployment**: Model + human expert for 3 months
5. **Measure trust metrics**: % of predictions accepted without question
**For Existing Black-Box Systems**:
1. **Add SHAP explainer to existing model** (retrofit explainability)
2. **Validate explanations with domain experts** (check for nonsense)
3. **Document known failure modes** (when explanations mislead)
4. **Consider retraining with interpretability constraints** (monotonicity, sparsity)
5. **Phase in explainability**: Start with high-stakes decisions only
**Red Flags to Watch For**:
- 🚩 **Contradictory explanations**: SHAP says X, PDP says opposite
- 🚩 **Physically impossible relationships**: Deeper = more water (wrong for water table)
- 🚩 **Single-feature dominance**: One feature >80% of SHAP value (overfitting?)
- 🚩 **Stakeholder confusion**: "I don't understand these explanations"
- 🚩 **High override rate**: Experts disagree with model >30% of time
:::
### Case 1: The Model Without Context
**Failure**: Operator drilled at the model's recommended location without understanding why it was recommended
**Result**: Hit clay instead of sand (the prediction fell within the model's 14% error rate)
**Root cause**: Operator treated 86% accuracy as "always right"
**Fix**: Operators must now review SHAP values BEFORE drilling. If the top contributing feature is unclear or weak, an expert review is triggered.
### Case 2: Black Box Rejection
**Failure**: Regulator rejected a permit because "AI decisions are not transparent"
**Result**: 6-month permitting delay, $180K cost overrun
**Root cause**: The prediction was submitted without any explanation of how the model works
**Fix**: Now include 2-page "Model Explainability Summary" with every permit:
- How model was trained
- What features it uses
- Why it made this specific prediction (SHAP)
- Human expert confirmation
### Case 3: Overfitting to Local Noise
**Failure**: Model gave high confidence (95%) at a location that turned out to be an anomaly
**Result**: Dry hole ($45K loss)
**Root cause**: The model memorized a local noise pattern (small-scale heterogeneity) instead of the large-scale geology
**Fix**: Added "explanation coherence check":
- If SHAP values don't align with known geology, flag for review
- If model relies on single local feature (not regional trend), reduce confidence
- Require 3+ features contributing >10% each (not all from one dominant feature)
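The coherence-check rules in this fix are straightforward to encode. A hypothetical sketch follows; the function name is invented, the thresholds are the ones stated above, and the contributions would come from SHAP or the perturbation approximation earlier in the chapter:

```python
# Explanation coherence check: flag single-feature dominance and
# explanations with too few meaningful contributors
def coherence_check(contributions,
                    dominance_limit=0.80,
                    min_strong_features=3,
                    strong_share=0.10):
    """Return a list of red flags for one prediction's explanation."""
    flags = []
    total = sum(abs(v) for v in contributions.values())
    if total == 0:
        return ["no signal: all contributions are zero"]
    # Normalize to each feature's share of the total absolute contribution
    shares = {k: abs(v) / total for k, v in contributions.items()}
    top_feature, top_share = max(shares.items(), key=lambda kv: kv[1])
    if top_share > dominance_limit:
        flags.append(f"single-feature dominance: {top_feature} = {top_share:.0%}")
    strong = [k for k, s in shares.items() if s > strong_share]
    if len(strong) < min_strong_features:
        flags.append(f"only {len(strong)} feature(s) contribute >{strong_share:.0%}")
    return flags

# A prediction driven almost entirely by one local feature gets flagged
print(coherence_check({'local_anomaly': 9.0, 'regional_trend': 0.5, 'season': 0.3}))
# A balanced explanation passes with no flags
print(coherence_check({'prev_water_level': 3.0, 'season': 2.0, 'precip_30d': 1.5}))
```

A non-empty flag list would route the prediction to expert review rather than blocking it outright, matching the override workflow described below.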
---
## Best Practices
### Do's
✅ **Always explain high-stakes decisions** (>$10K impact)
✅ **Use multiple explanation methods** (SHAP + feature importance + domain validation)
✅ **Translate to stakeholder language** (no jargon for non-technical users)
✅ **Show uncertainty** (confidence intervals, not just point predictions)
✅ **Enable override** (human expert can disagree and flag for retraining)
✅ **Track explanation quality** (do stakeholders understand? measure with surveys)
### Don'ts
❌ **Don't say "trust the AI"** without explanation
❌ **Don't show raw SHAP values** to non-technical users (translate to plain English)
❌ **Don't ignore conflicting explanations** (if SHAP says X but PDP says Y, investigate)
❌ **Don't explain averages when stakeholders need specific cases**
❌ **Don't use explanation as an excuse** ("the model was right 86% of the time" doesn't help the 14% who got a bad prediction)
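As a concrete example of the "translate, don't show raw SHAP values" rule, a small formatting helper can turn contributions into plain-English bullet points. The friendly-name mapping and phrasing below are illustrative, not from a real deployment:

```python
# Translate raw feature contributions into stakeholder-friendly sentences
FRIENDLY_NAMES = {
    'prev_water_level': 'last measured water level',
    'Precipitation_mm_roll30d_sum': '30-day rainfall total',
    'season': 'time of year',
}

def explain_in_english(contributions, top_n=3):
    """Render the top-N contributions as plain-English bullet points."""
    ranked = sorted(contributions.items(), key=lambda kv: abs(kv[1]), reverse=True)
    lines = []
    for feature, delta in ranked[:top_n]:
        name = FRIENDLY_NAMES.get(feature, feature.replace('_', ' '))
        direction = 'raises' if delta > 0 else 'lowers'
        lines.append(f"- The {name} {direction} the forecast by {abs(delta):.1f} ft")
    return "\n".join(lines)

print(explain_in_english({'prev_water_level': 4.2,
                          'Precipitation_mm_roll30d_sum': 1.1,
                          'season': -0.8}))
```

The same contributions dictionary can feed both this text summary for managers and the waterfall chart for technical reviewers, keeping the two views consistent.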
---
## Production Deployment Checklist
- [ ] SHAP explainer trained for all production models
- [ ] Feature importance calculated and documented
- [ ] Partial dependence plots generated for top 5 features
- [ ] Attention weights extracted (for deep learning models)
- [ ] Explanation dashboard deployed (click prediction → see SHAP)
- [ ] Stakeholder-specific explanation templates created
- [ ] Audit trail system logging predictions + explanations
- [ ] Override mechanism implemented (human review workflow)
- [ ] Explanation quality survey deployed (quarterly feedback)
- [ ] Regulatory compliance package prepared (methodology + explainability)
**Status**: ✅ **Production-ready** with continuous explanation quality monitoring.
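The audit-trail checklist item can start as simply as appending one JSON line per prediction with its explanation attached. A hypothetical sketch (the field names, well ID, and version string are illustrative); it writes to an in-memory buffer here, where a production system would append to a file or database:

```python
# Minimal audit trail: one JSON line per prediction + explanation
import datetime
import io
import json

def log_prediction(stream, well_id, predicted_ft, contributions,
                   model_version="v2.3"):
    """Append one JSON record: prediction plus explanation, for later audit."""
    record = {
        "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "well_id": well_id,
        "predicted_ft": round(predicted_ft, 2),
        "explanation": {k: round(v, 3) for k, v in contributions.items()},
        "model_version": model_version,
    }
    stream.write(json.dumps(record) + "\n")

buf = io.StringIO()
log_prediction(buf, "CHAMP-042", 652.37,
               {"prev_water_level": 4.2, "season": -0.8})
print(buf.getvalue().strip())
```

Because each line is self-contained JSON, the log can be replayed later to answer "what did the model say, and why, at the time the decision was made."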
---
**Explainability System Version**: v2.3
**Methods**: SHAP, Feature Importance, PDP, Attention Weights
**Stakeholders Served**: 5 (managers, geologists, regulators, operators, public)
**Audit Compliance**: Meets EPA ML Guidelines (2024)
**Next Review**: 2025-02-01
**Responsible**: Data Science + Communications + Legal
---
## Summary
Explainable AI transforms **black-box predictions into actionable understanding**:
✅ **SHAP explanations** - Feature contributions for every prediction
✅ **Stakeholder-specific views** - Different explanations for managers, geologists, regulators
✅ **Partial dependence plots** - How each feature affects predictions
✅ **Audit trail** - Predictions + explanations logged for regulatory compliance
✅ **Override mechanism** - Human review when explanations raise concerns
**Key Insight**: Explanation builds **trust**. Without it, even accurate models won't be adopted. With it, operators become partners, not just users.
---
## Reflection Questions
1. For one of the models in this book (e.g., material classification, water-level forecasting, or well placement), which explanation method would you prioritize first, and why?
2. How would you judge whether a model explanation is physically plausible in your aquifer, rather than just mathematically consistent with the data?
3. When presenting model-assisted recommendations to non-technical stakeholders, what details would you omit or simplify, and what would you insist on showing?
4. Where could explainability fail or mislead you (e.g., correlated features, partial dependence assumptions), and how would you design checks or workflows to catch those cases?
5. What processes (governance, documentation, training) would you put in place so that explainable AI remains trustworthy as models, data, and staff change over time?
---
## Related Chapters
- [Material Classification ML](material-classification-ml.qmd) - Models being explained
- [Water Level Forecasting](water-level-forecasting.qmd) - Forecast explanation
- [Well Placement Optimizer](well-placement-optimizer.qmd) - Decision explanation
- [Value of Information](../part-4-fusion/value-of-information.qmd) - Economic justification