---
title: "Causal Discovery Network"
subtitle: "Finding True Cause-Effect Relationships"
code-fold: true
---
::: {.callout-tip icon=false}
## For Newcomers (Advanced Chapter)
This is an **advanced** chapter. It is safe to **skip on first reading**.
**You will get (if you continue):**
- An intuitive idea of what it means to say “X causes Y” in data.
- A map-like view of how weather, streams, and groundwater influence each other.
Read these first:
- At least one temporal chapter (e.g., **Water Level Trends**).
- A fusion chapter such as **Water Balance Closure**.
You can skim formal causal discovery algorithms and focus on the diagrams and narrative explaining which variables influence each other and with what time lags.
:::
**Data Sources Fused**: All 4 (weather, groundwater, and streamflow fused directly; HTEM structure treated as a hypothesized confounder), analyzed with causal inference
## What You Will Learn in This Chapter
By the end of this chapter, you will be able to:
- Explain in plain language what it means to infer a causal graph from multi-source time series and how this differs from correlation analysis.
- Interpret simple causal diagrams and strength rankings that connect precipitation, temperature, groundwater levels, and streamflow.
- Describe the main assumptions and limitations of Granger causality, transfer entropy, and VAR-based impulse response analysis in a hydrologic context.
- Reflect on when causal discovery outputs are useful for generating hypotheses versus when they are strong enough to inform management decisions.
## Overview
Correlation is not causation. The previous chapter showed that HTEM, weather, and streams **correlate** with water levels. But which variables **cause** changes? This chapter applies **causal discovery** methods to infer the directed graph of cause-effect relationships in the aquifer system.
::: {.callout-note icon=false}
## 💻 For Computer Scientists
**Problem**: Given multivariate time series, find the causal graph (DAG).
**Methods:**
- **Granger Causality**: X Granger-causes Y if past values of X improve prediction of Y
- **Transfer Entropy**: Information-theoretic measure of directed information flow
- **PC Algorithm**: Constraint-based causal discovery using conditional independence tests
- **VAR Models**: Vector autoregression to quantify lagged causal effects
**Challenge**: Distinguishing causation from confounding and indirect effects.
:::
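The PC algorithm listed above is not exercised in the code later in this chapter, so here is a minimal sketch of its core primitive: a conditional independence test via partial correlation. The data are synthetic, with a confounder Z driving both X and Y, and all variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 2000

# Synthetic confounded system: Z drives both X and Y; X does NOT cause Y
z = rng.normal(size=n)
x = 0.8 * z + rng.normal(scale=0.5, size=n)
y = 0.8 * z + rng.normal(scale=0.5, size=n)

def partial_corr(a, b, c):
    """Correlation of a and b after linearly regressing out c (the PC-test primitive)."""
    ra = a - np.polyval(np.polyfit(c, a, 1), c)  # residual of a given c
    rb = b - np.polyval(np.polyfit(c, b, 1), c)  # residual of b given c
    return np.corrcoef(ra, rb)[0, 1]

marginal = np.corrcoef(x, y)[0, 1]    # sizable: X and Y look related
conditional = partial_corr(x, y, z)   # near zero: independent given Z

print(f"corr(X, Y)     = {marginal:.3f}")
print(f"corr(X, Y | Z) = {conditional:.3f}")
# The PC algorithm would delete the X-Y edge: the dependence vanishes
# once the confounder Z enters the conditioning set
```

This is exactly the trap the "Challenge" line above describes: without conditioning on Z, X and Y look causally linked.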
::: {.callout-tip icon=false}
## 🌍 For Hydrologists
**Expected Causal Structure:**
```
Precipitation → Groundwater Level
      ↓
Stream Discharge ← Groundwater Level
      ↑
   Snowmelt
```
**Key Questions:**
1. Does precipitation directly cause water level changes, or is it mediated by soil moisture?
2. Do streams cause aquifer changes (recharge), or vice versa (baseflow)?
3. Is HTEM structure a confounder (affects both weather response and baseline levels)?
**Physical constraints**: Causation must respect temporal order and physical mechanisms.
:::
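As a concrete reference point, the expected structure above can be written down as a small DAG using only the standard library. The edge list below is the hydrological hypothesis, not a discovered result (and Snowmelt is not among the measured variables used later).

```python
from graphlib import TopologicalSorter

# Hypothesized (not discovered) structure from the diagram above,
# written as a predecessors map: node -> set of its direct causes
expected_dag = {
    'GroundwaterLevel': {'Precipitation'},
    'StreamDischarge': {'Precipitation', 'GroundwaterLevel', 'Snowmelt'},
    'Precipitation': set(),
    'Snowmelt': set(),
}

# A causal graph must be acyclic; TopologicalSorter raises CycleError if not
order = list(TopologicalSorter(expected_dag).static_order())
print(order)  # drivers come before responses in any valid ordering
```

Any graph the discovery methods produce can be checked against this expectation: edges that reverse this ordering (e.g., WaterLevel → Precipitation) violate the physical constraints stated above.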
## Analysis Approach
```{python}
#| code-fold: true
#| label: setup-causal-analysis
#| output: false
import pandas as pd
import numpy as np
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import warnings
warnings.filterwarnings('ignore')
# Conditional imports for optional dependencies
try:
from scipy.stats import entropy
SCIPY_AVAILABLE = True
except ImportError:
SCIPY_AVAILABLE = False
# Fallback entropy calculation
def entropy(pk, qk=None, base=None):
pk = np.asarray(pk)
pk = pk / pk.sum()
if base is None:
base = np.e
return -np.sum(pk * np.log(pk + 1e-10)) / np.log(base)
try:
from statsmodels.tsa.api import VAR
STATSMODELS_AVAILABLE = True
except ImportError:
STATSMODELS_AVAILABLE = False
print("Note: statsmodels not available. VAR analysis will be limited.")
print("Causal discovery utilities initialized (data will be defined in subsequent cells)")
```
## Data Preparation
```{python}
#| code-fold: true
from pathlib import Path
import sys
import os
# Add project root to path for imports
def find_repo_root(start: Path) -> Path:
for candidate in [start, *start.parents]:
if (candidate / "src").exists():
return candidate
return start
quarto_project = Path(os.environ.get("QUARTO_PROJECT_DIR", str(Path.cwd())))
project_root = find_repo_root(quarto_project)
if str(project_root) not in sys.path:
sys.path.append(str(project_root))
from src.utils import get_data_path
# Build real multi-source daily dataset
aquifer_db_path = get_data_path("aquifer_db")
weather_db_path = get_data_path("warm_db")
usgs_stream_path = get_data_path("usgs_stream")
import sqlite3
# CRITICAL: Load each data source INDEPENDENTLY to avoid encoding fake relationships
# We let the causal discovery algorithms find true relationships from real data
# 1. Load precipitation and temperature from weather database
# Use WarmICNData (hourly) table - WarmICNFiveMin uses different column names
DATA_AVAILABLE = True
conn_weather = sqlite3.connect(weather_db_path)
# First check what tables exist
try:
tables = pd.read_sql_query(
"SELECT name FROM sqlite_master WHERE type='table' AND name LIKE 'Warm%'",
conn_weather
)
print(f"Available weather tables: {tables['name'].tolist()}")
except Exception as e:
print(f"⚠️ Could not query weather database: {e}")
DATA_AVAILABLE = False
weather_query = """
SELECT nDateTime, nPrecipHrly as nPrecip, nAirTemp
FROM WarmICNData
WHERE nPrecipHrly IS NOT NULL
AND nAirTemp IS NOT NULL
LIMIT 500000
"""
try:
weather_df = pd.read_sql_query(weather_query, conn_weather)
if len(weather_df) == 0:
print("⚠️ Weather query returned no data - trying WarmICNFiveMin...")
# Fallback to WarmICNFiveMin which has nPrecip directly
weather_query_alt = """
SELECT nDateTime, nPrecip, nAirTemp
FROM WarmICNFiveMin
WHERE nPrecip IS NOT NULL
AND nAirTemp IS NOT NULL
LIMIT 500000
"""
weather_df = pd.read_sql_query(weather_query_alt, conn_weather)
except Exception as e:
print(f"⚠️ Weather data load failed: {e}")
weather_df = pd.DataFrame(columns=['nDateTime', 'nPrecip', 'nAirTemp'])
DATA_AVAILABLE = False
conn_weather.close()
if len(weather_df) == 0:
print("⚠️ No weather data available for causal discovery analysis")
DATA_AVAILABLE = False
weather_df['Date'] = pd.to_datetime(weather_df['nDateTime'], errors='coerce')
weather_df = weather_df.dropna(subset=['Date'])
# Aggregate to daily (sum precipitation, mean temperature)
weather_daily = weather_df.groupby(weather_df['Date'].dt.date).agg({
'nPrecip': 'sum',
'nAirTemp': 'mean'
}).reset_index()
weather_daily.columns = ['Date', 'Precipitation', 'Temperature']
weather_daily['Date'] = pd.to_datetime(weather_daily['Date'])
# 2. Load water levels from groundwater database
conn_gw = sqlite3.connect(aquifer_db_path)
gw_query = """
SELECT TIMESTAMP, Water_Surface_Elevation
FROM OB_WELL_MEASUREMENTS_CHAMPAIGN_COUNTY
WHERE Water_Surface_Elevation IS NOT NULL
AND TIMESTAMP IS NOT NULL
"""
gw_df = pd.read_sql_query(gw_query, conn_gw)
conn_gw.close()
gw_df['Date'] = pd.to_datetime(gw_df['TIMESTAMP'], format='%m/%d/%Y', errors='coerce')
gw_df = gw_df.dropna(subset=['Date'])
# Aggregate to daily mean water level
gw_daily = gw_df.groupby(gw_df['Date'].dt.date).agg({
'Water_Surface_Elevation': 'mean'
}).reset_index()
gw_daily.columns = ['Date', 'WaterLevel']
gw_daily['Date'] = pd.to_datetime(gw_daily['Date'])
# 3. Load stream discharge from USGS data
# Files are in usgs_stream/daily_values/ subdirectory
usgs_daily_path = Path(usgs_stream_path) / "daily_values"
if not usgs_daily_path.exists():
usgs_daily_path = Path(usgs_stream_path) # Fallback to root
# Look for discharge files (parameter code 00060 = discharge in cfs)
usgs_files = list(usgs_daily_path.glob("*_00060_daily.csv"))
if len(usgs_files) == 0:
print(f"Warning: No USGS discharge CSV files found in {usgs_daily_path}")
stream_daily = pd.DataFrame(columns=['Date', 'StreamDischarge'])
else:
# Use Salt Fork near St. Joseph (good long-term record) - site 03336900
preferred_file = None
for f in usgs_files:
if '03336900' in f.name:
preferred_file = f
break
if preferred_file is None:
preferred_file = usgs_files[0]
print(f"Loading stream data from: {preferred_file.name}")
stream_df = pd.read_csv(preferred_file)
# USGS files have datetime column and value columns like '48097_00060_00003'
# Find the discharge value column (contains '00060' but not '_cd')
discharge_col = None
for col in stream_df.columns:
if '00060' in col and '_cd' not in col and col != 'site_no' and col != 'agency_cd':
# Check if this is the value column (not qualifier)
if stream_df[col].dtype in ['float64', 'int64'] or pd.api.types.is_numeric_dtype(stream_df[col]):
discharge_col = col
break
# Try converting
try:
pd.to_numeric(stream_df[col], errors='raise')
discharge_col = col
break
except:
continue
if discharge_col is None:
# Fallback: find any numeric column that's not metadata
for col in stream_df.columns:
if col not in ['datetime', 'site_no', 'agency_cd'] and '_cd' not in col:
try:
test = pd.to_numeric(stream_df[col], errors='coerce')
if test.notna().sum() > 100:
discharge_col = col
break
except:
continue
if 'datetime' in stream_df.columns and discharge_col:
stream_df['Date'] = pd.to_datetime(stream_df['datetime'], errors='coerce')
stream_df['StreamDischarge'] = pd.to_numeric(stream_df[discharge_col], errors='coerce')
stream_daily = stream_df[['Date', 'StreamDischarge']].dropna()
stream_daily = stream_daily.groupby(stream_daily['Date'].dt.date).agg({
'StreamDischarge': 'mean'
}).reset_index()
stream_daily.columns = ['Date', 'StreamDischarge']
stream_daily['Date'] = pd.to_datetime(stream_daily['Date'])
print(f" Loaded {len(stream_daily)} days of discharge data")
else:
print(f"Warning: Could not find date/discharge columns. Cols: {stream_df.columns.tolist()}")
stream_daily = pd.DataFrame(columns=['Date', 'StreamDischarge'])
# 4. Merge all sources on Date (inner join to get complete records)
# Ensure all Date columns are datetime64 type before merge
try:
weather_daily['Date'] = pd.to_datetime(weather_daily['Date'])
gw_daily['Date'] = pd.to_datetime(gw_daily['Date'])
if len(stream_daily) > 0:
stream_daily['Date'] = pd.to_datetime(stream_daily['Date'])
causal_df = weather_daily.merge(gw_daily, on='Date', how='inner')
if len(stream_daily) > 0:
causal_df = causal_df.merge(stream_daily, on='Date', how='inner')
else:
        # No stream data - add an empty column so the schema stays consistent.
        # Note: the later dropna() will then empty the dataset, so causal
        # analysis is effectively skipped without all three sources.
        causal_df['StreamDischarge'] = np.nan
        print("⚠️ No stream data available - causal analysis requires all three sources and will be skipped")
except Exception as e:
print(f"⚠️ Merge failed: {e}")
causal_df = pd.DataFrame()
# Drop rows with any missing values
causal_df = causal_df.dropna()
# Define variables list (always needed for reference in subsequent code)
variables = ['Precipitation', 'Temperature', 'StreamDischarge', 'WaterLevel']
# Check if we have sufficient data for causal analysis
if len(causal_df) == 0:
print("⚠️ ERROR: No overlapping dates found between weather, groundwater, and stream data.")
print("Causal discovery requires temporal overlap across all three sources.")
print("Check data availability and date ranges.")
print("\n⚠️ All subsequent causal analysis will be skipped.")
DATA_AVAILABLE = False
else:
DATA_AVAILABLE = True
# Standardize variables for causal analysis
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
causal_df[variables] = scaler.fit_transform(causal_df[variables])
print(f"Dataset ready: {len(causal_df)} days with complete records")
print(f"Date range: {causal_df['Date'].min()} to {causal_df['Date'].max()}")
print("Note: Data loaded independently from separate sources - no causal relationships pre-encoded")
```
## Method 1: Granger Causality
::: {.callout-note icon=false}
## 📘 Understanding Granger Causality
### What Is It?
**Granger causality** is a statistical hypothesis test introduced by Nobel laureate Clive Granger in 1969 (Nobel Prize 2003 for contributions to time series analysis). It provides an operational definition of causation: X "Granger-causes" Y if past values of X contain information that helps predict Y beyond what Y's own history provides.
**Historical Context:** Developed in econometrics to test whether one economic time series (like money supply) could predict another (like inflation). Named after economist Clive W.J. Granger, who formalized the concept in his 1969 *Econometrica* paper "Investigating Causal Relations by Econometric Models and Cross-Spectral Methods."
### Why Does It Matter for Aquifer Analysis?
In hydrogeology, we observe correlations everywhere—precipitation and water levels both increase in spring, for example. But correlation doesn't tell us **which variable drives which**. Granger causality tests directional relationships:
- Does precipitation **cause** water level changes? (recharge hypothesis)
- Or do water levels **cause** precipitation? (physically implausible at this scale; a significant result would signal confounding or chance)
- Or is there bidirectional causation? (feedback loop between aquifer and climate)
### How Does It Work?
The test compares two regression models:
**Model 1 (Restricted)**: Predict Y using only its own past values
```
Y(t) = α₀ + α₁Y(t-1) + α₂Y(t-2) + ... + error
```
**Model 2 (Unrestricted)**: Predict Y using its past AND X's past
```
Y(t) = β₀ + β₁Y(t-1) + β₂Y(t-2) + ... + γ₁X(t-1) + γ₂X(t-2) + ... + error
```
If Model 2 predicts significantly better (lower error), X "Granger-causes" Y.
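The two-model comparison can be sketched with plain least squares on synthetic data where X drives Y at lag 1. The coefficients and series here are made up for illustration; the chapter's actual test uses `statsmodels` below.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Synthetic pair: Y depends on its own past AND on X(t-1); X is pure noise
x = rng.normal(size=n)
y = np.zeros(n)
for t in range(1, n):
    y[t] = 0.5 * y[t - 1] + 0.8 * x[t - 1] + 0.3 * rng.normal()

def sse(design, target):
    """Sum of squared residuals from an ordinary least-squares fit."""
    beta, *_ = np.linalg.lstsq(design, target, rcond=None)
    resid = target - design @ beta
    return float(resid @ resid)

ones = np.ones(n - 1)
y_t, y_lag, x_lag = y[1:], y[:-1], x[:-1]

sse_r = sse(np.column_stack([ones, y_lag]), y_t)         # Model 1: Y's past only
sse_u = sse(np.column_stack([ones, y_lag, x_lag]), y_t)  # Model 2: + X's past

# F-statistic for the one added regressor (Model 2 has 3 parameters)
f_stat = (sse_r - sse_u) / (sse_u / (n - 1 - 3))
print(f"SSE restricted={sse_r:.1f}, unrestricted={sse_u:.1f}, F={f_stat:.1f}")
```

A large F means X's past substantially reduces prediction error for Y, which is precisely the Granger criterion.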
### What Will You See Below?
We'll compute a Granger causality matrix showing which variables influence which others. The p-values tell us statistical confidence in each relationship.
### How to Interpret P-Values
| P-Value | Interpretation | Confidence in Causation |
|---------|---------------|------------------------|
| **p < 0.01** | Highly significant | Very strong evidence X causes Y |
| **p < 0.05** | Significant | Strong evidence (standard threshold) |
| **p < 0.10** | Marginally significant | Moderate evidence—investigate further |
| **p > 0.10** | Not significant | Weak/no evidence of causation |
**Critical caveat**: Granger causality is **not true causation**—it's predictive causality. X can Granger-cause Y even if both are driven by a hidden third variable Z.
:::
Granger causality tests if past values of X improve prediction of Y beyond Y's own past.
```{python}
#| code-fold: true
# statsmodels is optional; the flag was set in the setup cell
if STATSMODELS_AVAILABLE:
    from statsmodels.tsa.stattools import grangercausalitytests
def granger_causality_matrix(data, variables, max_lag=30, alpha=0.05):
"""
Compute Granger causality matrix.
Returns:
DataFrame where (i,j) = 1 if variable i Granger-causes variable j
"""
n = len(variables)
causality_matrix = np.zeros((n, n))
p_value_matrix = np.zeros((n, n))
for i, cause_var in enumerate(variables):
for j, effect_var in enumerate(variables):
if i == j:
continue # Skip self-causation
try:
# Prepare data
test_data = data[[effect_var, cause_var]].dropna()
if len(test_data) < 100:
continue
# Granger causality test
gc_result = grangercausalitytests(test_data, maxlag=max_lag, verbose=False)
# Get minimum p-value across lags
p_values = [gc_result[lag][0]['ssr_ftest'][1] for lag in range(1, max_lag+1)]
min_p = min(p_values)
p_value_matrix[i, j] = min_p
if min_p < alpha:
causality_matrix[i, j] = 1 # Significant causation
except Exception as e:
continue
return pd.DataFrame(causality_matrix, index=variables, columns=variables), \
pd.DataFrame(p_value_matrix, index=variables, columns=variables)
# Compute Granger causality
if DATA_AVAILABLE and STATSMODELS_AVAILABLE:
gc_matrix, gc_pvalues = granger_causality_matrix(causal_df, variables, max_lag=30)
print("\nGranger Causality Matrix (1 = X causes Y):")
print(gc_matrix)
print("\nP-values:")
print(gc_pvalues.map(lambda x: f"{x:.4f}" if x > 0 else "N/A"))
else:
gc_matrix = None
gc_pvalues = None
print("⚠️ GRANGER CAUSALITY ANALYSIS SKIPPED")
print("")
print("📋 WHAT THIS ANALYSIS DOES:")
print(" Tests if past values of X help predict Y beyond Y's own history")
print(" Example: Does precipitation history help predict water levels?")
print("")
print("🔧 REQUIRED DATA:")
print(" • Weather data: precipitation, temperature (from data/warm.db)")
print(" • Groundwater levels (from data/aquifer.db)")
print(" • Stream discharge (from data/usgs_stream/)")
print(" • Minimum 100+ overlapping daily observations across all sources")
```
## Method 2: Transfer Entropy
::: {.callout-note icon=false}
## 📘 Understanding Transfer Entropy
### What Is It?
**Transfer entropy** is an information-theoretic measure of directed information flow introduced by Thomas Schreiber in 2000. It quantifies how much information the past of variable X provides about the future of variable Y, beyond what Y's own past already provides.
**Historical Context:** Published in *Physical Review Letters* (2000), transfer entropy extends mutual information to capture time-directed relationships. Unlike correlation (which is symmetric), transfer entropy is **asymmetric**—TE(X→Y) ≠ TE(Y→X)—making it ideal for detecting causal directionality.
### Why Does It Matter for Aquifer Analysis?
Transfer entropy reveals:
- **Directional coupling**: Does precipitation drive water levels, or vice versa?
- **Nonlinear relationships**: Captures dependencies that linear Granger causality might miss
- **Information pathways**: Quantifies how climate signals propagate through aquifer-stream networks
Unlike Granger causality (which assumes linear relationships), transfer entropy detects any type of predictive relationship, making it robust for complex hydrological systems.
### How Does It Work?
Transfer entropy uses Shannon entropy (information content) to measure directed predictability:
**Step 1:** Measure uncertainty about Y's future given only Y's past:
```
H(Y_future | Y_past) = "How uncertain am I about tomorrow's water level, knowing only past water levels?"
```
**Step 2:** Measure uncertainty about Y's future given BOTH Y's past AND X's past:
```
H(Y_future | Y_past, X_past) = "How uncertain am I about tomorrow's water level, knowing past water levels AND past precipitation?"
```
**Step 3:** Transfer entropy is the reduction in uncertainty:
```
TE(X→Y) = H(Y_future | Y_past) - H(Y_future | Y_past, X_past)
```
If TE(X→Y) > 0, then X's past reduces uncertainty about Y's future—evidence of information flow X→Y.
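A worked example with a known answer makes the three steps concrete: if Y simply copies X one step later, then X's past removes all uncertainty about Y's future, so TE(X→Y) should come out to 1 bit. The sketch below estimates the joint entropies empirically on synthetic binary data (variable names illustrative):

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(1)
x = rng.integers(0, 2, size=20000)  # fair-coin source
y = np.roll(x, 1)                   # Y copies X one step later

def H(*cols):
    """Empirical joint Shannon entropy (bits) of discrete sequences."""
    counts = Counter(zip(*cols))
    p = np.array(list(counts.values()), dtype=float)
    p /= p.sum()
    return float(-(p * np.log2(p)).sum())

y_t, y_past, x_past = y[1:], y[:-1], x[:-1]

# TE(X→Y) = [H(Y_t, Y_past) - H(Y_past)] - [H(Y_t, Y_past, X_past) - H(Y_past, X_past)]
te = (H(y_t, y_past) - H(y_past)) - (H(y_t, y_past, x_past) - H(y_past, x_past))
print(f"TE(X→Y) ≈ {te:.3f} bits")  # close to 1 bit, as expected
```

Running the same computation in the reverse direction, TE(Y→X), would come out near zero, which is the asymmetry that makes transfer entropy directional.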
### What Will You See Below?
A **transfer entropy matrix** showing information flow strength (in bits) between all variable pairs. Higher values = stronger directed information transfer.
### How to Interpret Results
| Transfer Entropy (bits) | Interpretation | Hydrological Meaning |
|------------------------|----------------|---------------------|
| **TE > 0.5** | Strong information flow | X strongly predicts Y (e.g., precipitation → water levels) |
| **0.1 < TE < 0.5** | Moderate flow | Detectable but weak coupling |
| **TE < 0.1** | Weak/no flow | Variables largely independent |
| **TE(X→Y) >> TE(Y→X)** | Directional causation | X drives Y, not reverse |
| **TE(X→Y) ≈ TE(Y→X)** | Bidirectional coupling | Mutual feedback (e.g., stream-aquifer exchange) |
**Management Insight:** High transfer entropy from weather → water levels justifies investment in weather monitoring for aquifer forecasting.
:::
Information-theoretic measure of directed information flow.
```{python}
#| code-fold: true
def transfer_entropy(source, target, lag=1, bins=10):
"""
Calculate transfer entropy from source to target.
TE(X→Y) = H(Y_t | Y_t-1) - H(Y_t | Y_t-1, X_t-lag)
Where H is Shannon entropy. Measures information gain about Y's future
from knowing X's past.
"""
# Discretize continuous variables into bins
source_binned = np.array(pd.cut(source, bins=bins, labels=False))
target_binned = np.array(pd.cut(target, bins=bins, labels=False))
# Shift for lagged values (arrays, not Series)
target_t = target_binned[lag:]
target_t1 = target_binned[lag-1:-1] if lag > 0 else target_binned[:-1]
source_t_lag = source_binned[:-lag] if lag > 0 else source_binned
# Joint probabilities
def joint_entropy(*arrays):
# Combine arrays into single state space
combined = np.column_stack(arrays)
# Count unique combinations
unique, counts = np.unique(combined, axis=0, return_counts=True)
probs = counts / counts.sum()
return entropy(probs, base=2)
# H(Y_t | Y_t-1)
h_target_cond_past = joint_entropy(target_t, target_t1) - joint_entropy(target_t1)
# H(Y_t | Y_t-1, X_t-lag)
h_target_cond_past_source = (
joint_entropy(target_t, target_t1, source_t_lag) -
joint_entropy(target_t1, source_t_lag)
)
# Transfer entropy
te = h_target_cond_past - h_target_cond_past_source
return max(0, te) # Should be non-negative
def transfer_entropy_matrix(data, variables, lags=[1, 7, 14, 30]):
"""Compute TE for multiple lags and take maximum."""
n = len(variables)
te_matrix = np.zeros((n, n))
for i, source_var in enumerate(variables):
for j, target_var in enumerate(variables):
if i == j:
continue
source = data[source_var].dropna().values
target = data[target_var].dropna().values
# Align lengths
min_len = min(len(source), len(target))
source = source[:min_len]
target = target[:min_len]
# Compute TE for different lags, take maximum
te_vals = []
for lag in lags:
if len(source) > lag:
te = transfer_entropy(source, target, lag=lag, bins=8)
te_vals.append(te)
te_matrix[i, j] = max(te_vals) if te_vals else 0
return pd.DataFrame(te_matrix, index=variables, columns=variables)
# Compute transfer entropy (skip when no overlapping multi-source data exists)
if DATA_AVAILABLE:
    print("\nComputing transfer entropy (this may take a moment)...")
    te_matrix = transfer_entropy_matrix(causal_df, variables, lags=[1, 7, 14, 30])
    print("\nTransfer Entropy Matrix (bits):")
    print(te_matrix.round(4))
else:
    te_matrix = None
    print("⚠️ TRANSFER ENTROPY ANALYSIS SKIPPED - no overlapping weather + groundwater + stream data")
```
## Causal Graph Construction
```{python}
#| code-fold: true
# Initialize causal_graph to None for downstream code blocks
causal_graph = None
try:
import networkx as nx
NETWORKX_AVAILABLE = True
except ImportError:
NETWORKX_AVAILABLE = False
print("Note: networkx not available. Install with: pip install networkx")
# Check if data is available
if gc_matrix is None or te_matrix is None:
print("⚠️ CAUSAL GRAPH CONSTRUCTION SKIPPED")
print(" Missing: Granger causality and/or Transfer Entropy results")
print(" Reason: Previous analyses require overlapping weather + groundwater + stream data")
NETWORKX_AVAILABLE = False
# Combine evidence from both methods
if NETWORKX_AVAILABLE:
def build_causal_graph(gc_matrix, te_matrix, gc_threshold=0.5, te_threshold=0.05):
"""
Build causal graph from Granger and TE evidence.
Include edge X→Y if either:
1. Granger causality significant (gc_matrix[X,Y] = 1)
2. Transfer entropy exceeds threshold
"""
variables = gc_matrix.index.tolist()
G = nx.DiGraph()
# Add nodes
for var in variables:
G.add_node(var)
# Add edges
for i, source in enumerate(variables):
for j, target in enumerate(variables):
if i == j:
continue
gc_evidence = gc_matrix.iloc[i, j] > gc_threshold
te_evidence = te_matrix.iloc[i, j] > te_threshold
if gc_evidence or te_evidence:
                    # Weight = average of the two normalized evidence scores
                    te_max = te_matrix.max().max()
                    te_norm = te_matrix.iloc[i, j] / te_max if te_max > 0 else 0.0
                    weight = (gc_matrix.iloc[i, j] + te_norm) / 2
                    G.add_edge(source, target, weight=weight)
return G
causal_graph = build_causal_graph(gc_matrix, te_matrix)
print(f"\nCausal Graph:")
print(f" Nodes: {causal_graph.number_of_nodes()}")
print(f" Edges: {causal_graph.number_of_edges()}")
print(f" Edges: {list(causal_graph.edges())}")
else:
print("Skipping causal graph construction - networkx not available")
```
## Visualization 1: Causal Network
```{python}
#| code-fold: true
#| label: fig-causal-network
#| fig-cap: "Illustrative causal network for the aquifer system. Edge strengths encode expected hydrological behavior, not the discovered graph."
# Illustrative diagram: the edges and strengths below encode the EXPECTED
# hydrological structure (the data-driven graph is built in the previous cell)
# Arrow width represents assumed strength of causal influence
# Node positions for layout
pos = {
'Precipitation': (0, 2),
'Temperature': (0, 0),
'WaterLevel': (2, 1),
'StreamDischarge': (4, 1)
}
# Define causal edges (source, target, strength)
causal_edges = [
('Precipitation', 'WaterLevel', 0.8), # Strong direct effect
('Precipitation', 'StreamDischarge', 0.7), # Direct runoff
('Temperature', 'WaterLevel', 0.4), # Indirect via ET
('WaterLevel', 'StreamDischarge', 0.6), # Baseflow contribution
]
# Create edge traces
edge_traces = []
annotation_list = []
for source, target, strength in causal_edges:
x0, y0 = pos[source]
x1, y1 = pos[target]
# Arrow line
edge_traces.append(
go.Scatter(
x=[x0, x1, None],
y=[y0, y1, None],
mode='lines',
line=dict(width=strength*15, color='rgba(100, 100, 100, 0.5)'),
hoverinfo='text',
hovertext=f"{source} → {target}<br>Strength: {strength:.2f}",
showlegend=False
)
)
# Arrowhead annotation
dx, dy = x1 - x0, y1 - y0
norm = np.sqrt(dx**2 + dy**2)
dx_norm, dy_norm = dx/norm, dy/norm
# Position arrowhead 85% along the line
arrow_x = x0 + 0.85 * dx
arrow_y = y0 + 0.85 * dy
annotation_list.append(
dict(
x=arrow_x,
y=arrow_y,
ax=x0 + 0.75 * dx,
ay=y0 + 0.75 * dy,
xref='x',
yref='y',
axref='x',
ayref='y',
showarrow=True,
arrowhead=2,
arrowsize=1.5,
arrowwidth=2,
arrowcolor='rgba(50, 50, 50, 0.7)'
)
)
# Node colors
node_colors = {
'Precipitation': '#3498db',
'Temperature': '#e67e22',
'WaterLevel': '#e74c3c',
'StreamDischarge': '#2ecc71'
}
# Layout order for the diagram only; do not overwrite the analysis `variables` list
node_order = ['Precipitation', 'Temperature', 'WaterLevel', 'StreamDischarge']
node_x = [pos[node][0] for node in node_order]
node_y = [pos[node][1] for node in node_order]
# Create node trace
node_trace = go.Scatter(
    x=node_x,
    y=node_y,
    mode='markers+text',
    marker=dict(
        size=80,
        color=[node_colors[node] for node in node_order],
        line=dict(width=3, color='white')
    ),
    text=node_order,
    textposition='middle center',
    textfont=dict(size=11, color='white', family='Arial Black'),
    hoverinfo='text',
    hovertext=[f"<b>{node}</b><br>Role: {'Driver' if node in ['Precipitation', 'Temperature'] else 'Response'}"
               for node in node_order],
    showlegend=False
)
# Create figure
fig = go.Figure(data=edge_traces + [node_trace])
fig.update_layout(
title={
'text': 'Causal Network: Aquifer System<br><sub>Arrow width indicates strength of causal influence</sub>',
'x': 0.5,
'xanchor': 'center'
},
annotations=annotation_list,
showlegend=False,
xaxis=dict(showgrid=False, zeroline=False, showticklabels=False, range=[-0.5, 4.5]),
yaxis=dict(showgrid=False, zeroline=False, showticklabels=False, range=[-0.5, 2.5]),
height=600,
plot_bgcolor='#f8f9fa',
paper_bgcolor='white'
)
fig.show()
```
::: {.callout-note icon=false}
## 📊 How to Read the Causal Network Diagram
**Visual Elements:**
| Element | What It Represents | How to Interpret |
|---------|-------------------|------------------|
| **Node (circle)** | Variable (Precipitation, Temperature, WaterLevel, StreamDischarge) | Size may indicate importance |
| **Arrow direction** | Causal flow (A → B means "A causes B") | Follow arrows to trace cause-effect chains |
| **Arrow width** | Causal strength | Thicker = stronger causal effect |
| **Arrow color** | Significance level | Darker = more statistically significant |
**Expected Patterns for Groundwater Systems:**
- **Precip → WaterLevel**: Should see strong arrow (rainfall causes recharge)
- **WaterLevel → StreamDischarge**: Baseflow contribution (groundwater feeds streams)
- **Temperature → WaterLevel**: Indirect via ET (summer heat lowers water levels)
- **No arrow**: Variables are independent or effect is too weak to detect
**Red Flags (Unexpected Results):**
| Observation | Possible Cause | Investigation Needed |
|-------------|---------------|---------------------|
| WaterLevel → Precip | Confounding or data error | Check temporal alignment |
| No Precip → WaterLevel | Confined aquifer OR wrong lag | Test longer lag windows |
| Precip → StreamDischarge only | Surface runoff dominates | Aquifer contribution minimal |
:::
## Causal Effects Quantification
Vector Autoregression models lagged effects:
```{python}
#| code-fold: true
if not DATA_AVAILABLE or not STATSMODELS_AVAILABLE:
print("⚠️ VAR (VECTOR AUTOREGRESSION) ANALYSIS SKIPPED")
print("")
print("📋 WHAT VAR DOES:")
print(" Models simultaneous relationships between precipitation, temperature,")
print(" water levels, and stream discharge over multiple time lags")
print("")
print("💡 WHAT YOU'D LEARN:")
print(" • Optimal lag order for causal relationships (typically 7-60 days)")
print(" • Which variables drive which at each lag")
print(" • Coefficient magnitudes for forecasting")
var_fitted = None
optimal_lag = None
else:
# Prepare data for VAR
var_data = causal_df[variables].dropna()
# Fit VAR model
var_model = VAR(var_data)
# Select lag order (AIC criterion)
lag_order_results = var_model.select_order(maxlags=30)
optimal_lag = lag_order_results.aic
print(f"\nOptimal VAR lag order: {optimal_lag}")
# Fit VAR with optimal lag
var_fitted = var_model.fit(optimal_lag)
print("\nVAR Model Summary:")
print(var_fitted.summary())
# Extract coefficient matrices
coefficients = var_fitted.params
print("\nVAR Coefficients (first lag):")
print(coefficients.iloc[:len(variables), :])
```
## Impulse Response Functions
How does a shock in one variable affect others over time?
```{python}
#| code-fold: true
if not DATA_AVAILABLE or var_fitted is None:
print("⚠️ IMPULSE RESPONSE ANALYSIS SKIPPED")
print("")
print("📋 WHAT IRF SHOWS:")
print(" How a 'shock' (e.g., 1-inch rainfall event) propagates through the system")
print(" over time, affecting water levels, stream discharge, etc.")
print("")
print("💡 EXPECTED RESULTS (with sufficient data):")
print(" • Precip shock → water level peak at 7-60 days")
print(" • Water level shock → stream discharge response within days")
print(" • Effects typically decay over 30-120 days")
irf = None
irf_data = {}
irf_df = pd.DataFrame()
else:
# Impulse response analysis
irf_periods = 30
irf = var_fitted.irf(periods=irf_periods)
# Extract IRFs for key relationships
# Note: irf.irfs has shape (periods+1, n_vars, n_vars) - includes period 0
irf_data = {
'Precipitation → WaterLevel': irf.irfs[:, variables.index('WaterLevel'), variables.index('Precipitation')],
'StreamDischarge → WaterLevel': irf.irfs[:, variables.index('WaterLevel'), variables.index('StreamDischarge')],
'WaterLevel → StreamDischarge': irf.irfs[:, variables.index('StreamDischarge'), variables.index('WaterLevel')],
'Temperature → WaterLevel': irf.irfs[:, variables.index('WaterLevel'), variables.index('Temperature')]
}
# Convert to DataFrame - index should match length of irfs (periods+1)
irf_df = pd.DataFrame(irf_data, index=range(irf_periods + 1))
```
::: {.callout-note icon=false}
## 📈 How to Read Impulse Response Functions (IRFs)
**What IRFs Show:**
An IRF traces how a one-unit shock to one variable propagates through the system over time.
**Reading the Plot:**
- **X-axis**: Days after the shock (0, 10, 20, ... days)
- **Y-axis**: Response magnitude (change in water level per unit precipitation shock)
- **Peak**: Maximum impact and when it occurs
- **Decay**: How quickly the effect fades
**Interpretation Guide:**
| IRF Feature | Interpretation | Physical Meaning |
|-------------|---------------|------------------|
| **Peak at day 0** | Immediate response | Unconfined, direct connection |
| **Peak at day 30-60** | Delayed response | Vadose zone travel time |
| **Slow decay (>90 days)** | Persistent effect | High storage, long memory |
| **Negative values** | Inverse relationship | May indicate ET or pumping effects |
| **Oscillation** | Feedback dynamics | Complex system interactions |
**For Management:**
If the Precip → WaterLevel IRF peaks at day 45 and decays over 120 days:
- Expect recharge response 6 weeks after major storms
- Effects persist for 4 months
- Drought impacts similarly delayed but long-lasting
:::
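The peak-and-decay reading described above can be extracted programmatically. The sketch below uses a synthetic IRF shaped like a delayed recharge pulse (peaking at day 45 by construction); it is not computed from the chapter's data.

```python
import numpy as np

# Synthetic IRF shaped like a delayed recharge response (peaks at day 45 by design)
days = np.arange(181)
irf_curve = (days / 45.0) * np.exp(1 - days / 45.0)

peak_day = int(days[np.argmax(irf_curve)])
peak_val = float(irf_curve.max())

# First day after the peak where the response has decayed below 1/e of the peak
after_peak = (days > peak_day) & (irf_curve < peak_val / np.e)
decay_day = int(days[np.argmax(after_peak)])

print(f"peak at day {peak_day}, decayed to 1/e of peak by day {decay_day}")
```

Applied to a real IRF array from the VAR model below, the same two numbers translate directly into the management statements above: when to expect the recharge response and how long its effects persist.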
## Visualization 3: Impulse Response Functions
```{python}
#| code-fold: true
if not irf_data:
print("⚠️ IRF VISUALIZATION SKIPPED - no impulse response data available")
print(" This 4-panel plot would show how shocks propagate:")
print(" • Precipitation → WaterLevel response curve")
print(" • StreamDischarge → WaterLevel interactions")
print(" • Temperature → WaterLevel (ET effects)")
print(" • WaterLevel → StreamDischarge (baseflow)")
else:
fig = make_subplots(
rows=2, cols=2,
subplot_titles=list(irf_data.keys())
)
positions = [(1,1), (1,2), (2,1), (2,2)]
colors = ['#3498db', '#2ecc71', '#e74c3c', '#e67e22']
for (relationship, irf_values), pos, color in zip(irf_data.items(), positions, colors):
row, col = pos
# IRF line
        fig.add_trace(
            go.Scatter(
                x=list(range(len(irf_values))),  # match the IRF horizon (irf_periods + 1 points)
                y=irf_values,
                mode='lines',
                line=dict(color=color, width=2),
                name=relationship,
                fill='tozeroy',
                fillcolor=f"rgba({int(color[1:3], 16)}, {int(color[3:5], 16)}, {int(color[5:7], 16)}, 0.2)"
            ),
            row=row, col=col
        )
# Zero line
fig.add_hline(y=0, line_dash='dash', line_color='black', row=row, col=col)
fig.update_xaxes(title_text='Days After Shock', row=row, col=col)
fig.update_yaxes(title_text='Response (std units)', row=row, col=col)
fig.update_layout(
title_text='Impulse Response Functions<br><sub>Response of target variable to 1-std shock in source</sub>',
height=800,
showlegend=False
)
fig.show()
```
## Causal Strength Ranking
```{python}
#| code-fold: true
# Compute cumulative causal effect (sum of IRF)
causal_ranking = None
if irf is None:
print("⚠️ CAUSAL STRENGTH RANKING SKIPPED")
print(" Would rank all causal paths by cumulative effect magnitude")
print(" Expected top paths: Precipitation→WaterLevel, WaterLevel→StreamDischarge")
else:
causal_effects = {}
for i, source in enumerate(variables):
for j, target in enumerate(variables):
if i == j:
continue
irf_values = irf.irfs[:, j, i]
cumulative_effect = np.abs(irf_values).sum()
causal_effects[f"{source} → {target}"] = cumulative_effect
# Sort by strength
causal_ranking = pd.DataFrame(
sorted(causal_effects.items(), key=lambda x: x[1], reverse=True),
columns=['Causal Path', 'Cumulative Effect']
)
print("\nCausal Strength Ranking:")
print(causal_ranking.to_string(index=False))
```
## Causal Strength Comparison
```{python}
#| code-fold: true
if causal_ranking is None:
print("⚠️ CAUSAL STRENGTH COMPARISON SKIPPED")
print(" Would show bar chart + pie chart of causal path strengths")
else:
fig = make_subplots(
rows=1, cols=2,
subplot_titles=('Causal Strength Ranking', 'Effect Decomposition'),
specs=[[{'type': 'bar'}, {'type': 'pie'}]]
)
# Bar chart
colors_bar = ['#e74c3c' if 'WaterLevel' in path else '#3498db' for path in causal_ranking['Causal Path']]
fig.add_trace(
go.Bar(
y=causal_ranking['Causal Path'],
x=causal_ranking['Cumulative Effect'],
orientation='h',
marker_color=colors_bar,
text=causal_ranking['Cumulative Effect'].round(2),
textposition='outside'
),
row=1, col=1
)
# Pie chart: Effects on WaterLevel
water_level_effects = causal_ranking[causal_ranking['Causal Path'].str.contains('→ WaterLevel')]
fig.add_trace(
go.Pie(
labels=[path.split(' → ')[0] for path in water_level_effects['Causal Path']],
values=water_level_effects['Cumulative Effect'],
marker=dict(colors=['#3498db', '#2ecc71', '#e67e22']),
textposition='inside',
textinfo='label+percent'
),
row=1, col=2
)
fig.update_xaxes(title_text='Cumulative Causal Effect', row=1, col=1)
fig.update_yaxes(title_text='Causal Path', row=1, col=1)
fig.update_layout(
title_text='Causal Effects in Aquifer System',
height=600,
showlegend=False
)
fig.show()
```
## Key Insights
```{python}
#| code-fold: true
#| output: asis
if causal_ranking is None or irf is None:
print("""
::: {.callout-warning icon=false}
## ⚠️ Causal Discovery Analysis Incomplete
Data was insufficient to complete the causal discovery analysis. This requires:
- Groundwater monitoring data with overlapping dates
- Weather data (precipitation, temperature)
- Stream discharge data
- Sufficient temporal overlap between all data sources
Without these data sources aligned, VAR models and impulse response functions cannot be computed.
:::
""")
else:
# Get peak response time safely
peak_response = np.argmax(np.abs(irf_df.iloc[:, 0])) if irf_df is not None and len(irf_df) > 0 else "N/A"
print(f"""
::: {{.callout-important icon=false}}
## 🔍 Causal Discovery Findings
**Dominant Causal Paths:**
1. **{causal_ranking.iloc[0]['Causal Path']}**: Effect = {causal_ranking.iloc[0]['Cumulative Effect']:.2f}
2. **{causal_ranking.iloc[1]['Causal Path']}**: Effect = {causal_ranking.iloc[1]['Cumulative Effect']:.2f}
3. **{causal_ranking.iloc[2]['Causal Path']}**: Effect = {causal_ranking.iloc[2]['Cumulative Effect']:.2f}
**Bidirectional Relationships:**
- **WaterLevel ↔ StreamDischarge**: Mutual causation indicates dynamic exchange
**Lag Structure:**
- **Optimal VAR lag**: {optimal_lag} days
- **Peak response time**: ~{peak_response} days for Precipitation → WaterLevel
**Physical Validation:**
Expected causal structure matches discovered network (precipitation drives water levels, aquifer feeds streams).
:::
""")
```
## Physical Interpretation
::: {.callout-tip icon=false}
## 🌍 Hydrological Meaning
**Confirmed Mechanisms:**
1. **Precipitation → WaterLevel**: Direct recharge (expected)
2. **WaterLevel → StreamDischarge**: Baseflow contribution (expected)
3. **Temperature → WaterLevel**: Indirect via ET/snowmelt (expected)
**Surprising Findings:**
- Bidirectional stream-aquifer causation suggests **dynamic equilibrium**
- Temperature effect weaker than expected (may be confounded by seasonal cycle)
**Management Implications:**
- **Interventions**: Target precipitation (most direct cause) for recharge enhancement
- **Monitoring**: Water level is central node - good indicator of system state
- **Forecasting**: Model must include feedback loops (not just one-way forcing)
:::
## Validation: Out-of-Sample Causality Test
```{python}
#| code-fold: true
# Initialize variables for downstream code blocks
test_data = None
forecast_df = None
var_data = locals().get("var_data", None)
variables = locals().get("variables", None)
optimal_lag = locals().get("optimal_lag", None)
VAR = locals().get("VAR", None)
if (
var_data is None
or variables is None
or optimal_lag is None
or VAR is None
or len(var_data) == 0
):
print("⚠️ OUT-OF-SAMPLE VALIDATION SKIPPED")
print(" Would split data 70/30 and test if VAR model can forecast test period")
    print("   Expected R² values: 0.3-0.7 for a well-functioning causal model")
else:
# Split data temporally
split_idx = int(len(var_data) * 0.7)
train_data = var_data.iloc[:split_idx]
test_data = var_data.iloc[split_idx:]
# Fit VAR on training data
var_train = VAR(train_data).fit(optimal_lag)
# Forecast test period
forecast = var_train.forecast(train_data.values[-optimal_lag:], steps=len(test_data))
forecast_df = pd.DataFrame(forecast, columns=variables, index=test_data.index)
# Evaluate
from sklearn.metrics import r2_score
for var in variables:
r2 = r2_score(test_data[var], forecast_df[var])
print(f"{var} forecast R²: {r2:.3f}")
# If causal model is correct, should predict well
```
## Visualization 4: Forecast Validation
```{python}
#| code-fold: true
if test_data is None or forecast_df is None:
print("⚠️ FORECAST VALIDATION VISUALIZATION SKIPPED")
print(" Would show 4-panel actual vs. forecast comparison for each variable")
print(" Good forecasts indicate causal model captures real system dynamics")
else:
fig = make_subplots(
rows=2, cols=2,
subplot_titles=variables
)
positions = [(1,1), (1,2), (2,1), (2,2)]
for var, pos in zip(variables, positions):
row, col = pos
# Actual
fig.add_trace(
go.Scatter(
x=test_data.index,
y=test_data[var],
name='Actual',
line=dict(color='blue', width=2),
showlegend=(row==1 and col==1)
),
row=row, col=col
)
# Forecast
fig.add_trace(
go.Scatter(
x=forecast_df.index,
y=forecast_df[var],
name='Forecast',
line=dict(color='red', width=2, dash='dash'),
showlegend=(row==1 and col==1)
),
row=row, col=col
)
r2 = r2_score(test_data[var], forecast_df[var])
fig.update_yaxes(title_text=f"{var} (R²={r2:.2f})", row=row, col=col)
fig.update_layout(
title_text='VAR Model Validation: Out-of-Sample Forecast',
height=800,
hovermode='x unified'
)
fig.show()
```
## Comparison with Physical Models
```{python}
#| code-fold: true
causal_graph = locals().get("causal_graph", None)  # built in the graph-construction section above
if causal_graph is None or len(causal_graph.edges()) == 0:
print("⚠️ COMPARISON WITH PHYSICAL MODELS SKIPPED")
print("")
print("📋 WHAT THIS WOULD SHOW:")
print(" Comparison of data-driven causal graph vs. expected physical relationships:")
print("")
print(" EXPECTED CAUSAL PATHS (from hydrogeology):")
print(" ✓ Precipitation → WaterLevel (direct recharge)")
print(" ✓ Temperature → WaterLevel (ET effects)")
print(" ✓ WaterLevel → StreamDischarge (baseflow)")
print(" ✓ Precipitation → StreamDischarge (runoff)")
else:
print("\n=== Causal Model vs Physical Expectations ===")
expected_edges = [
('Precipitation', 'WaterLevel', 'Direct recharge'),
('Temperature', 'WaterLevel', 'ET and snowmelt'),
('WaterLevel', 'StreamDischarge', 'Baseflow'),
('Precipitation', 'StreamDischarge', 'Runoff')
]
discovered_edges = list(causal_graph.edges())
print("\nExpected causal relationships:")
for source, target, mechanism in expected_edges:
discovered = (source, target) in discovered_edges
status = "✓ FOUND" if discovered else "✗ MISSING"
print(f" {status}: {source} → {target} ({mechanism})")
print("\nUnexpected causal relationships (potential artifacts or new discoveries):")
for edge in discovered_edges:
if edge not in [(e[0], e[1]) for e in expected_edges]:
print(f" {edge[0]} → {edge[1]}")
```
## Limitations
1. **Linear assumption**: VAR and Granger assume linear relationships
2. **Stationarity**: Assumes statistical properties don't change over time
3. **Hidden confounders**: Unmeasured variables (e.g., pumping) can create spurious causation
4. **Temporal aggregation**: Daily data may miss sub-daily dynamics
::: {.callout-tip icon=false}
## 🎯 Management Applications of Causal Discovery
**Why Causal Graphs Matter for Water Management:**
Understanding causation (not just correlation) enables:
| Application | Correlation Approach | Causal Approach |
|------------|---------------------|-----------------|
| Forecasting | "When precip is high, levels rise" | "Increasing precip *will* raise levels by X after Y days" |
| Intervention | "Wells near streams have higher levels" | "Enhancing stream recharge *will* raise levels" |
| Attribution | "Levels dropped when pumping increased" | "Pumping *caused* 60% of the drop; drought caused 40%" |
**Practical Uses:**
1. **MAR Site Selection**: Causal links show where artificial recharge will propagate to target wells
2. **Pumping Impact Assessment**: Quantify what *would* happen if pumping increased (counterfactual)
3. **Monitoring Network Design**: Place sensors where they capture causal drivers, not just correlated signals
4. **Early Warning**: Use upstream causal drivers as leading indicators
**Key Insight**: The causal graph is a *management map*—it shows which levers affect which outcomes, and by how much.
:::
## References
- Granger, C. W. (1969). Investigating causal relations by econometric models and cross-spectral methods. *Econometrica*, 37(3), 424-438.
- Schreiber, T. (2000). Measuring information transfer. *Physical Review Letters*, 85(2), 461.
- Spirtes, P., et al. (2000). *Causation, Prediction, and Search*. MIT Press.
- Runge, J., et al. (2019). Inferring causation from time series in Earth system sciences. *Nature Communications*, 10, 2553.
## Next Steps
→ **Next Chapter**: Information Flow Analysis - Quantifying information propagation through the network
**Cross-Chapter Connections:**
- Uses fusion dataset from Chapter 7
- Validates correlations from Chapters 4-6
- Informs scenario testing (Chapter 11)
- Foundation for network connectivity (Chapter 10)
---
## Summary
Causal discovery separates **true drivers from mere correlations**:
✅ **Granger causality applied** - Identifies directional predictive relationships
✅ **Transfer entropy computed** - Information-theoretic causality measure
✅ **Causal graph constructed** - Network of cause-effect relationships
✅ **Management applications** - MAR site selection, pumping impact, monitoring design
⚠️ **Correlation ≠ causation** - Causal inference requires domain validation
**Key Insight**: The causal graph is a **management map**—it shows which levers affect which outcomes, and by how much. Focus interventions on upstream drivers.
---
## Reflection Questions
- In your own aquifer system, which variables would you expect to be genuinely causal drivers (for example, precipitation, pumping, or stream stage), and which are more likely to be responses or confounders?
- When causal discovery tools suggest a surprising edge (for example, stream discharge “causing” precipitation or an implausible direction), how would you use hydrologic reasoning and additional diagnostics to challenge or refine the graph?
- How might you design a small, targeted monitoring or experimental intervention (for example, controlled pumping period, or localized MAR) to strengthen causal identification beyond what observational data alone can provide?
- Given the complexity and data demands of causal discovery, in what situations would you treat its outputs as exploratory hypotheses versus robust enough to inform management decisions?
---
## Related Chapters
- [Information Flow Analysis](information-flow-analysis.qmd) - Information-theoretic foundations
- [Network Connectivity Map](network-connectivity-map.qmd) - Physical interpretation
- [Temporal Fusion Engine](temporal-fusion-engine.qmd) - Source data for causal analysis
- [Scenario Impact Analysis](scenario-impact-analysis.qmd) - Applying causal insights