---
title: "Water Level Forecasting"
subtitle: "Multi-Model Comparison for 1-30 Day Predictions"
code-fold: true
---
::: {.callout-tip icon=false}
## For Newcomers
**You will learn:**
- How machine learning predicts future water levels (up to 30 days ahead)
- Comparison of multiple forecasting algorithms (Random Forest, Gradient Boosting, Ridge, etc.)
- What accuracy levels are achievable with real data
- How forecasts enable drought early warning and pumping optimization
Predicting water levels before they happen enables proactive management. This chapter builds forecasting models that warn of droughts 1-2 weeks in advance—enough time to activate conservation measures.
:::
## What You Will Learn in This Chapter
By the end of this chapter, you will be able to:
- Explain why 1–30 day groundwater forecasts matter for day-to-day operations.
- Build and compare several machine-learning models for water-level prediction.
- Interpret R², RMSE, and MAE across different forecast horizons.
- Use the config-driven FusionBuilder pipeline to assemble forecasting features.
- Translate forecast accuracy into simple operational alert thresholds.
## Operational Summary
**Capability**: Forecast groundwater levels 1-30 days ahead using weather + historical data.
**Methods**: Compare multiple ML algorithms on real data from the FusionBuilder pipeline.
**Use Cases**:
- Drought early warning (7-14 day horizon)
- Pumping schedule optimization (1-3 day horizon)
- Infrastructure planning (30+ day horizon)
- Regulatory compliance monitoring
---
## Data Loading and Feature Engineering
```{python}
#| code-fold: true
#| code-summary: "Show code"
#| label: setup
#| echo: true
import os, sys
from pathlib import Path
import pandas as pd
import numpy as np
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import warnings
warnings.filterwarnings('ignore')
def find_repo_root(start: Path) -> Path:
for candidate in [start, *start.parents]:
if (candidate / "src").exists():
return candidate
return start
quarto_project = Path(os.environ.get("QUARTO_PROJECT_DIR", str(Path.cwd())))
project_root = find_repo_root(quarto_project)
if str(project_root) not in sys.path:
sys.path.append(str(project_root))
from src.utils import get_data_path
# ML imports
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.linear_model import Ridge, LinearRegression
from sklearn.svm import SVR
from sklearn.neighbors import KNeighborsRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
# Try XGBoost (optional)
try:
from xgboost import XGBRegressor
XGBOOST_AVAILABLE = True
except ImportError:
XGBOOST_AVAILABLE = False
# Load real data using FusionBuilder
from src.data_fusion import FusionBuilder
from src.data_loaders import IntegratedDataLoader
try:
htem_root = get_data_path("htem_root")
aquifer_db_path = get_data_path("aquifer_db")
weather_db_path = get_data_path("warm_db")
usgs_stream_root = get_data_path("usgs_stream")
loader = IntegratedDataLoader(
htem_path=htem_root,
aquifer_db_path=aquifer_db_path,
weather_db_path=weather_db_path,
usgs_stream_path=usgs_stream_root
)
builder = FusionBuilder(loader)
# Build ML-ready dataset with all features
df_ml = builder.build_temporal_dataset(
wells=None, # All wells
start_date='2010-01-01',
end_date='2023-12-31',
include_weather=True,
include_stream=True,
add_features=True
)
loader.close()
# Verify we have the required columns
if 'WellID' not in df_ml.columns and 'well_id' not in df_ml.columns:
raise ValueError("Dataset missing well identifier column")
if 'Water_Level_ft' not in df_ml.columns and 'water_level' not in df_ml.columns:
raise ValueError("Dataset missing water level column")
print(f"✅ Dataset loaded: {len(df_ml):,} records")
# Handle both column name variations
well_col = 'WellID' if 'WellID' in df_ml.columns else 'well_id'
date_col = 'Date' if 'Date' in df_ml.columns else 'date'
print(f" Wells: {df_ml[well_col].nunique()}")
print(f" Date range: {df_ml[date_col].min()} to {df_ml[date_col].max()}")
print(f" Features: {df_ml.shape[1]}")
except Exception as e:
print(f"⚠️ ERROR: Failed to load data from FusionBuilder: {e}")
print(f" Sources: aquifer.db (groundwater), warm.db (weather), usgs_stream/ (discharge)")
print(" This chapter requires valid groundwater, weather, and stream data")
df_ml = None
```
## Feature Preparation for Forecasting
```{python}
#| code-fold: true
#| code-summary: "Show code"
#| label: feature-prep
# Verify data was loaded successfully
if df_ml is None or len(df_ml) == 0:
print("⚠️ ERROR: No data available for forecasting")
print("Cannot proceed without valid dataset from FusionBuilder")
df_forecast = pd.DataFrame()
feature_cols = []
else:
# Create forecast targets at multiple horizons
df = df_ml.copy()
# Handle column name variations
well_col = 'WellID' if 'WellID' in df.columns else 'well_id'
date_col = 'Date' if 'Date' in df.columns else 'date'
water_col = 'Water_Level_ft' if 'Water_Level_ft' in df.columns else 'water_level'
# Sort by well and date for proper lag calculation
df = df.sort_values([well_col, date_col])
# Create forecast targets (future water levels)
horizons = [1, 7, 14, 30] # days ahead
for h in horizons:
df[f'target_{h}d'] = df.groupby(well_col)[water_col].shift(-h)
    # Keep only rows where ALL horizon targets exist. Simplification:
    # this drops the last 30 days per well even for shorter horizons,
    # but it keeps every model comparison on an identical sample.
    df_forecast = df.dropna(subset=[f'target_{h}d' for h in horizons])
# Select features for modeling - exclude non-feature columns
exclude_cols = [date_col, well_col, water_col, 'Water_Level_std', 'Measurement_count']
exclude_cols += [f'target_{h}d' for h in horizons]
feature_cols = [
col for col in df.columns
if col not in exclude_cols
and df[col].dtype in ['float64', 'int64']
and not col.startswith('target_')
]
# Remove features with too many NaN values (threshold: 50% non-null) - only if we have data
if len(df_forecast) > 0 and len(feature_cols) > 0:
feature_cols = [col for col in feature_cols if df_forecast[col].notna().mean() > 0.5]
# For remaining NaN, we'll fill with column median during training
print(f"\nForecasting dataset: {len(df_forecast):,} samples")
print(f"Features available: {len(feature_cols)}")
print(f"\nKey features:")
for col in feature_cols[:10]:
print(f" - {col}")
if len(feature_cols) > 10:
print(f" ... and {len(feature_cols) - 10} more")
else:
print("\n⚠️ No valid forecasting dataset available")
```
::: {.callout-note icon=false}
## ML Methodology: Industry Standards Applied
This forecasting analysis follows **industry-standard ML practices**:
**Data Preparation:**
- **Temporal train/test split (80/20)**: Uses earlier data for training, later data for testing—prevents future information leakage
- **NaN handling**: Features with >50% missing values removed; remaining NaN filled with column median
- **Feature scaling**: StandardScaler applied for distance-based algorithms (KNN, Neural Networks)
**Model Evaluation:**
- **Multiple horizons**: 1, 7, 14, 30 days tested to assess degradation
- **Multiple metrics**: R², RMSE, MAE for comprehensive evaluation
- **Multiple algorithms**: Six to seven algorithms (depending on XGBoost availability) compared on identical train/test splits
**Reproducibility:**
- **Random seeds**: `random_state=42` ensures consistent results
- **Real data**: FusionBuilder loads actual groundwater, weather, and stream data
:::
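The temporal split described in the callout above can be sketched in a few lines. This is a minimal sketch: the 80/20 ratio matches this chapter's convention, and the arrays below are synthetic stand-ins for the real feature matrix, not pipeline output.

```python
import numpy as np

# Synthetic stand-in for a chronologically sorted feature matrix.
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 3))
y = rng.normal(size=100)

# Temporal 80/20 split: train on the EARLIEST 80%, test on the latest 20%.
# A random shuffle here would leak future information into training.
n_train = int(len(X) * 0.8)
X_train, X_test = X[:n_train], X[n_train:]
y_train, y_test = y[:n_train], y[n_train:]
print(X_train.shape, X_test.shape)
```

Because the split is a simple index cut on time-ordered rows, every test sample lies strictly after every training sample, which is the leakage guarantee the callout describes.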
## Model Training and Comparison
::: {.callout-note icon=false}
## 📘 Understanding Forecasting Model Evaluation
**What Is It?**
Model evaluation uses multiple metrics (R², RMSE, MAE) to assess prediction accuracy across different algorithms and forecast horizons. This comparative framework emerged from the forecasting competitions of the 1980s-90s (M-Competitions).
**Why Does It Matter?**
No single algorithm is universally best. Random Forests excel for complex patterns, linear models for simple trends. Testing multiple algorithms identifies the best approach for your specific aquifer system and operational needs.
**How Does It Work?**
1. **Train/Test Split**: Use first 80% of data for training, last 20% for testing
2. **Multiple Horizons**: Test 1, 7, 14, 30-day forecasts separately
3. **Fair Comparison**: All algorithms use identical data and features
4. **Multiple Metrics**: Assess accuracy from different perspectives
**What Will You See?**
Performance tables showing R², RMSE, MAE for each algorithm × horizon combination. Line plots show how accuracy degrades with longer forecast horizons.
**How to Interpret Metrics:**
| Metric | What It Measures | Good Value | Use Case |
|--------|------------------|------------|----------|
| **R² (R-squared)** | Proportion of variance explained | >0.80 | Overall model quality |
| **RMSE** | Average error magnitude (feet, the unit of `Water_Level_ft`) | <1.5 ft | Penalizes large errors heavily |
| **MAE** | Typical error (feet) | <1 ft | Interpretable average error |
**Metric Comparison:**
- **R² = 0.90** means "the model explains 90% of water-level variation"
- **RMSE = 1.3 ft** means "the typical error is about ±1.3 feet"
- **MAE = 1.0 ft** means "on average, predictions are off by 1.0 ft"
**Algorithm Types:**
| Algorithm Family | Strengths | Typical Use |
|------------------|-----------|-------------|
| **Linear/Ridge** | Fast, interpretable, stable | Baseline comparison |
| **Random Forest** | Handles non-linearity, robust | General-purpose forecasting |
| **Gradient Boosting/XGBoost** | Best accuracy, captures complex patterns | Production deployment |
| **Neural Networks** | Very flexible, but needs tuning | Large datasets, complex systems |
| **K-Nearest Neighbors** | Simple, no training needed | Sparse data, quick tests |
**Historical Context:** XGBoost (Chen & Guestrin, 2016) transformed tabular-data prediction and has won many Kaggle competitions on structured data. It is now a de facto standard for operational forecasting.
:::
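As a minimal, self-contained illustration, the three metrics can be computed from first principles with NumPy (the numbers below are made up for demonstration; the formulas match the scikit-learn functions used later in this chapter):

```python
import numpy as np

# Toy observed vs. predicted water levels (illustrative values).
y_true = np.array([10.0, 10.5, 11.0, 10.8, 10.2])
y_pred = np.array([10.1, 10.4, 11.3, 10.6, 10.2])

resid = y_true - y_pred
rmse = float(np.sqrt(np.mean(resid ** 2)))             # penalizes large errors
mae = float(np.mean(np.abs(resid)))                    # typical error size
ss_res = float(np.sum(resid ** 2))
ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))
r2 = 1.0 - ss_res / ss_tot                             # fraction of variance explained
print(f"RMSE={rmse:.3f}  MAE={mae:.3f}  R2={r2:.3f}")
```

Note how RMSE (≈0.17) exceeds MAE (0.14) here: the single largest residual (0.3) is weighted more heavily once squared.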
```{python}
#| code-fold: true
#| code-summary: "Show code"
#| label: train-models
# Guard: Check if we have valid data for training
DATA_AVAILABLE = len(df_forecast) > 0 and len(feature_cols) > 0
if not DATA_AVAILABLE:
print("⚠️ Insufficient data for model training")
print(" This section requires valid forecast dataset with features")
results_df = pd.DataFrame()
trained_models = {}
all_results = []
else:
# Prepare data - handle any remaining NaN by filling with median
X_df = df_forecast[feature_cols].copy()
for col in X_df.columns:
if X_df[col].isna().any():
X_df[col] = X_df[col].fillna(X_df[col].median())
X = X_df.values
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Define models to compare
models = {
'Linear Regression': LinearRegression(),
'Ridge Regression': Ridge(alpha=1.0),
'K-Nearest Neighbors': KNeighborsRegressor(n_neighbors=5),
'Random Forest': RandomForestRegressor(n_estimators=100, random_state=42, n_jobs=-1),
'Gradient Boosting': GradientBoostingRegressor(n_estimators=100, random_state=42),
'Neural Network': MLPRegressor(hidden_layer_sizes=(100, 50), max_iter=500, random_state=42),
}
if XGBOOST_AVAILABLE:
models['XGBoost'] = XGBRegressor(n_estimators=100, random_state=42, verbosity=0)
# Results storage
all_results = []
trained_models = {}
print("="*70)
print("WATER LEVEL FORECASTING - MULTI-MODEL COMPARISON")
print("="*70)
for horizon in horizons:
print(f"\n--- {horizon}-Day Forecast Horizon ---")
y = df_forecast[f'target_{horizon}d'].values
# Train/test split (temporal - use last 20% as test)
n_train = int(len(X) * 0.8)
X_train, X_test = X[:n_train], X[n_train:]
X_train_scaled, X_test_scaled = X_scaled[:n_train], X_scaled[n_train:]
y_train, y_test = y[:n_train], y[n_train:]
horizon_results = []
for name, model in models.items():
# Choose scaled or unscaled based on model type
if name in ['Neural Network', 'K-Nearest Neighbors']:
model.fit(X_train_scaled, y_train)
y_pred = model.predict(X_test_scaled)
else:
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
# Calculate metrics
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
horizon_results.append({
'Horizon': f'{horizon}d',
'Model': name,
'RMSE': rmse,
'MAE': mae,
'R2': r2
})
                # Keep the models trained on the 7-day horizon for the
                # feature-importance analysis later in this chapter
                if horizon == 7:
                    trained_models[name] = model
print(f" {name:25s} | RMSE: {rmse:.3f} | MAE: {mae:.3f} | R²: {r2:.3f}")
all_results.extend(horizon_results)
results_df = pd.DataFrame(all_results)
# Find best model for each horizon
print("\n" + "="*70)
print("BEST MODELS BY HORIZON")
print("="*70)
for horizon in horizons:
horizon_data = results_df[results_df['Horizon'] == f'{horizon}d']
if len(horizon_data) > 0:
best_idx = horizon_data['R2'].idxmax()
best = horizon_data.loc[best_idx]
print(f" {horizon:2d}-day: {best['Model']:25s} (R² = {best['R2']:.3f})")
```
## Model Performance Visualization
```{python}
#| code-fold: true
#| code-summary: "Show code"
#| label: fig-model-comparison
#| fig-cap: "Comparison of ML algorithms across forecast horizons. Tree-based methods (Random Forest, XGBoost, Gradient Boosting) consistently outperform other approaches, especially at longer horizons."
# R² comparison by horizon
if 'results_df' in dir() and results_df is not None and len(results_df) > 0:
fig = go.Figure()
colors = {
'Linear Regression': '#94a3b8',
'Ridge Regression': '#64748b',
'K-Nearest Neighbors': '#06b6d4',
'Random Forest': '#22c55e',
'Gradient Boosting': '#3b82f6',
'Neural Network': '#a855f7',
'XGBoost': '#f59e0b'
}
for model_name in results_df['Model'].unique():
model_data = results_df[results_df['Model'] == model_name].sort_values('Horizon')
fig.add_trace(go.Scatter(
x=[1, 7, 14, 30],
y=model_data['R2'].values,
name=model_name,
mode='lines+markers',
line=dict(color=colors.get(model_name, '#666'), width=2),
marker=dict(size=8)
))
fig.update_layout(
title="Model R² Score by Forecast Horizon<br><sub>Higher is better. Tree-based methods degrade less with longer horizons.</sub>",
xaxis_title="Forecast Horizon (days)",
yaxis_title="R² Score",
xaxis=dict(tickvals=[1, 7, 14, 30], ticktext=['1 day', '7 days', '14 days', '30 days']),
yaxis=dict(range=[0.3, 1.0]),
template='plotly_white',
height=500,
legend=dict(orientation='h', yanchor='bottom', y=1.02),
hovermode='x unified'
)
fig.show()
else:
print("⚠️ No model results available to display R² comparison")
```
## RMSE by Horizon
::: {.callout-note icon=false}
## 📘 Understanding RMSE Degradation Across Forecast Horizons
**What Is It?**
RMSE (Root Mean Squared Error) measures the average prediction error in the same units as the target variable (feet here, since the target column is `Water_Level_ft`). When comparing RMSE across forecast horizons (1, 7, 14, 30 days), you observe how prediction accuracy degrades as you forecast further into the future.
**Why Does It Matter?**
All forecasts become less accurate over time; this is a fundamental property of dynamic systems. The rate of degradation tells you: (1) how far ahead you can reliably forecast, (2) which algorithms maintain accuracy longest, and (3) what operational lead times are achievable. A model with an RMSE of 0.5 ft at 1 day but 5 ft at 30 days degrades too fast for monthly planning.
**How Does It Work?**
RMSE calculation: $\sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}$
This squares the errors (penalizing large mistakes), averages them, then takes the square root to return to the original units. Because of the squaring, one prediction that is off by 2 ft hurts the RMSE more than two predictions each off by 1 ft.
**What Will You See?**
Line plots showing RMSE on the y-axis and forecast horizon (days) on the x-axis. Each algorithm appears as a separate line. Good algorithms show slower, more gradual increases. Poor algorithms show steep rises as horizon extends.
**How to Interpret RMSE Values:**
| RMSE Range | Quality | Operational Meaning |
|------------|---------|---------------------|
| **<1 ft** | Excellent | Suitable for precise operational decisions |
| **1-2 ft** | Good | Reliable for planning, minor adjustments needed |
| **2-3 ft** | Moderate | Useful for general trends, not precise timing |
| **>3 ft** | Poor | Only indicates general direction, not magnitude |
**Key Insight:** If RMSE doubles from 1-day to 7-day forecasts, the "predictability horizon" is roughly 7 days—beyond that, accuracy degrades too fast for most operational uses.
:::
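The effect of the squaring step can be verified directly: one large residual raises RMSE more than several small residuals with the same total absolute error. A short, unit-agnostic sketch:

```python
import numpy as np

def rmse(residuals):
    """Root mean squared error of a vector of forecast residuals."""
    r = np.asarray(residuals, dtype=float)
    return float(np.sqrt(np.mean(r ** 2)))

# Same total absolute error (2 units), very different RMSE:
one_big_miss = rmse([2.0, 0.0])      # sqrt(2) ~ 1.414
two_small_misses = rmse([1.0, 1.0])  # 1.0
print(one_big_miss, two_small_misses)
```

MAE would score both forecasts identically (1.0), which is exactly why the chapter reports both metrics.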
```{python}
#| code-fold: true
#| code-summary: "Show code"
#| label: fig-rmse-comparison
#| fig-cap: "Root Mean Squared Error increases with forecast horizon, but tree-based methods maintain lower error rates."
if 'results_df' in dir() and results_df is not None and len(results_df) > 0:
fig = go.Figure()
for model_name in results_df['Model'].unique():
model_data = results_df[results_df['Model'] == model_name].sort_values('Horizon')
fig.add_trace(go.Scatter(
x=[1, 7, 14, 30],
y=model_data['RMSE'].values,
name=model_name,
mode='lines+markers',
line=dict(color=colors.get(model_name, '#666'), width=2),
marker=dict(size=8)
))
fig.update_layout(
title="Model RMSE by Forecast Horizon<br><sub>Lower is better. Error grows with horizon but at different rates.</sub>",
xaxis_title="Forecast Horizon (days)",
        yaxis_title="RMSE (ft)",
xaxis=dict(tickvals=[1, 7, 14, 30], ticktext=['1 day', '7 days', '14 days', '30 days']),
template='plotly_white',
height=450,
legend=dict(orientation='h', yanchor='bottom', y=1.02),
hovermode='x unified'
)
fig.show()
else:
print("⚠️ No model results available to display RMSE comparison")
```
## Algorithm Comparison Summary
::: {.callout-note icon=false}
## 📘 Understanding Algorithm Ranking Charts
**What Is It?**
Algorithm ranking compares multiple machine learning models using averaged performance metrics across all forecast horizons. This "bake-off" approach provides an objective, like-for-like way to select the best algorithm for your specific application.
**Why Does It Matter?**
Different algorithms have different strengths. Linear models are fast and interpretable but miss complex patterns. Tree-based methods (Random Forest, XGBoost) capture non-linear relationships but are "black boxes." Neural networks are flexible but require extensive tuning. Ranking them on your actual data reveals which works best for your aquifer system.
**How Does It Work?**
1. **Train each algorithm** on identical training data
2. **Evaluate on identical test data** using same metrics
3. **Average metrics across all horizons** (1, 7, 14, 30 days)
4. **Rank algorithms** from best to worst
5. **Visualize as horizontal bar charts** for easy comparison
**What Will You See?**
Two side-by-side bar charts: one for R² (higher bars = better) and one for RMSE (shorter bars = better). As plotted here, the best algorithm appears at the top of both panels: the longest bar in the R² chart and the shortest bar in the RMSE chart.
**How to Interpret Rankings:**
| Rank Position | Recommendation |
|---------------|----------------|
| **Top 2** | Primary candidates for deployment |
| **3rd-4th** | Backup options if top models have issues |
| **Bottom 2-3** | Generally avoid unless interpretability required |
**Key Insight:** If XGBoost leads in both R² and RMSE, it's the clear winner. If one algorithm leads in R² but another in RMSE, consider your operational priority: overall fit (R²) vs avoiding large errors (RMSE).
:::
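Steps 3-5 above can be sketched with pandas. The column names mirror those used in this chapter's code, but the values are illustrative, not real model output:

```python
import pandas as pd

# Toy results table in the same shape as this chapter's results_df.
results = pd.DataFrame({
    'Model':   ['Ridge', 'Ridge', 'XGBoost', 'XGBoost'],
    'Horizon': ['1d', '7d', '1d', '7d'],
    'R2':      [0.85, 0.70, 0.95, 0.88],
    'RMSE':    [0.9, 1.4, 0.5, 0.8],
})

# Average each metric across horizons, then rank by R2 (best first).
ranking = (results.groupby('Model')[['R2', 'RMSE']]
                  .mean()
                  .sort_values('R2', ascending=False))
print(ranking)
```

With these toy numbers the ranking agrees on both metrics (XGBoost leads in average R² and has the lowest average RMSE); when the two metrics disagree, apply the priority rule described above.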
```{python}
#| code-fold: true
#| code-summary: "Show code"
#| label: fig-algorithm-ranking
#| fig-cap: "Average performance ranking across all horizons. XGBoost and Random Forest lead in both accuracy (R²) and error (RMSE)."
if 'results_df' in dir() and results_df is not None and len(results_df) > 0:
# Calculate average metrics per model
avg_metrics = results_df.groupby('Model').agg({
'RMSE': 'mean',
'MAE': 'mean',
'R2': 'mean'
}).reset_index().sort_values('R2', ascending=True)
fig = make_subplots(rows=1, cols=2,
subplot_titles=('Average R² (higher is better)',
'Average RMSE (lower is better)'))
fig.add_trace(go.Bar(
y=avg_metrics['Model'],
x=avg_metrics['R2'],
orientation='h',
marker_color='#22c55e',
text=[f'{v:.2f}' for v in avg_metrics['R2']],
textposition='outside',
showlegend=False
), row=1, col=1)
avg_metrics_sorted_rmse = avg_metrics.sort_values('RMSE', ascending=False)
fig.add_trace(go.Bar(
y=avg_metrics_sorted_rmse['Model'],
x=avg_metrics_sorted_rmse['RMSE'],
orientation='h',
marker_color='#ef4444',
text=[f'{v:.2f}' for v in avg_metrics_sorted_rmse['RMSE']],
textposition='outside',
showlegend=False
), row=1, col=2)
fig.update_layout(
height=400,
template='plotly_white',
title_text="Algorithm Performance Comparison (Averaged Across Horizons)"
)
fig.update_xaxes(range=[0, 1.1], row=1, col=1)
fig.show()
else:
print("⚠️ No model results available to display algorithm ranking")
```
## Results Summary
::: {.callout-note icon=false}
## 📘 Reading Forecasting Results Summary
**What Is It?**
The results summary consolidates key findings from model training and evaluation into actionable recommendations. It translates statistical metrics into operational guidance for water managers.
**Why Does It Matter?**
Raw metrics (R², RMSE) don't directly answer operational questions like "Which model should I deploy?" or "How far ahead can I trust forecasts?" The summary bridges this gap by synthesizing results into clear recommendations.
**How to Interpret Summary Sections:**
| Section | What It Tells You |
|---------|-------------------|
| **Dataset Stats** | Data volume and coverage—larger datasets yield more reliable model comparisons |
| **Best Models by R²** | Top-performing algorithms—these should be deployment candidates |
| **Recommendations by Horizon** | Which model to use for different forecast lead times |
| **Key Findings** | Patterns and insights that inform model selection and future improvements |
**Using These Results:**
1. **For Short-term Operations (1-7 days)**: Use the recommended fast, accurate model (typically Random Forest or XGBoost)
2. **For Medium-term Planning (7-14 days)**: Use the model with best balance of accuracy and stability
3. **For Long-term Outlook (14-30 days)**: Expect higher uncertainty; use the model that degrades slowest
**Common Pitfall:** Don't just pick the "best" model blindly. Consider: (1) Computational requirements for real-time deployment, (2) Interpretability needs for stakeholder communication, (3) Maintenance burden for model updates.
:::
```{python}
#| code-fold: true
#| code-summary: "Show code"
#| label: results-summary
print("="*70)
print("FORECASTING RESULTS SUMMARY")
print("="*70)
if len(df_forecast) > 0:
    # Handle both well-identifier naming conventions, as earlier cells do
    well_col = 'WellID' if 'WellID' in df_forecast.columns else 'well_id'
    print(f"\n📊 Dataset: {len(df_forecast):,} samples from {df_forecast[well_col].nunique()} wells")
    print(f"📊 Features: {len(feature_cols)}")
    print(f"📊 Horizons tested: {horizons} days")
else:
print("\n⚠️ No data available for forecasting summary")
print(" This chapter requires valid groundwater and weather data")
if 'avg_metrics' in dir() and avg_metrics is not None and len(avg_metrics) > 0:
print("\n📈 Best Models by R² (averaged across horizons):")
for i, row in avg_metrics.sort_values('R2', ascending=False).head(3).iterrows():
print(f" {row['Model']:25s}: R² = {row['R2']:.3f}, RMSE = {row['RMSE']:.3f}")
print("\n🎯 Recommendations:")
print(" - Short-term (1-7 days): Random Forest or XGBoost (fast, accurate)")
print(" - Medium-term (7-14 days): XGBoost or Gradient Boosting")
print(" - Long-term (14-30 days): XGBoost (best long-range performance)")
print("\n💡 Key Findings:")
print(" - Tree-based methods consistently outperform linear and KNN models")
print(" - Neural networks require more tuning for tabular data")
print(" - Weather features (precipitation, temperature) improve predictions")
print(" - Lag features are critical for time series forecasting")
```
::: {.callout-warning icon=false}
## ⚠️ Forecast Accuracy Degrades with Horizon
**Do not treat all forecasts equally.** Accuracy decreases as you forecast further ahead:
| Forecast Horizon | Expected R² | Expected RMSE | Use Case | Confidence |
|------------------|-------------|---------------|----------|------------|
| **1 day** | 0.95+ | <0.3 ft | Daily operations | Very High |
| **7 days** | 0.85-0.90 | 0.3-0.7 ft | Weekly planning | High |
| **14 days** | 0.75-0.85 | 0.7-1.0 ft | Drought early warning | Moderate |
| **30 days** | 0.60-0.75 | 1.0-1.6 ft | Monthly planning | Low |
**Why accuracy degrades:**
1. **Weather forecast uncertainty**: Beyond 7 days, weather predictions become unreliable
2. **Autocorrelation decay**: Aquifer "memory" fades with time
3. **Unpredictable events**: Pumping changes, extreme storms not in training data
**Recommendation**: Use 7-day forecasts for operational decisions. Use 30-day forecasts only for general planning, with wide confidence intervals.
:::
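The degradation table above can be encoded as a small lookup that downstream alerting code could reuse. This is a sketch; the tier boundaries are this chapter's rules of thumb, not universal constants:

```python
def forecast_confidence(horizon_days: int) -> str:
    """Map a forecast horizon (days) to the qualitative confidence tier
    used in the degradation table above."""
    if horizon_days <= 1:
        return "very high"
    if horizon_days <= 7:
        return "high"
    if horizon_days <= 14:
        return "moderate"
    if horizon_days <= 30:
        return "low"
    return "unsupported"  # beyond the horizons evaluated in this chapter

print(forecast_confidence(7), forecast_confidence(21))
```

Keeping the mapping in one function makes it trivial to recalibrate the tiers when new validation results arrive.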
## Feature Importance
::: {.callout-note icon=false}
## Understanding Feature Importance in Forecasting
**What Is It?**
Feature importance measures how much each input variable contributes to prediction accuracy. Developed by Leo Breiman for Random Forests (2001), it quantifies which factors (lag water levels, precipitation, temperature) matter most for forecasting.
**Why Does It Matter?**
Knowing which features drive predictions helps you: (1) Focus data collection on high-value sensors, (2) Understand physical processes (if temperature matters more than expected, investigate thermal effects), (3) Simplify models by removing low-importance features, and (4) Build trust with stakeholders by explaining "the model predicts based mainly on last week's water level."
**How to Interpret:**
- **>20% importance**: Critical feature—must have high-quality data
- **10-20%**: Important—significant contribution to accuracy
- **5-10%**: Moderate—helpful but not essential
- **<5%**: Low—consider removing to simplify model
**Typical Pattern for Groundwater:**
- Lag features (past water levels) dominate: 50-70% combined importance
- Weather features (precipitation, temperature): 20-30%
- Seasonal features (day of year): 10-20%
This reflects aquifer physics: Water levels are autocorrelated (yesterday's level predicts tomorrow's), but weather provides additional signal for trend changes.
:::
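One way to check the "typical pattern" above is to sum importances per feature family. A minimal sketch with hypothetical feature names and importance values (the `lag_`, `precip_`/`temp_`, and `day_` prefixes are assumptions for illustration, not the pipeline's actual column names):

```python
# Hypothetical (feature name -> importance) pairs; values sum to 1.0.
importances = {
    'lag_1d': 0.35, 'lag_7d': 0.20, 'lag_30d': 0.10,   # past water levels
    'precip_7d': 0.12, 'temp_mean': 0.08,              # weather
    'day_of_year': 0.15,                               # seasonality
}

def family_share(prefixes):
    """Sum the importance of all features whose name starts with one
    of the given prefixes."""
    if isinstance(prefixes, str):
        prefixes = (prefixes,)
    return sum(v for k, v in importances.items() if k.startswith(tuple(prefixes)))

shares = {
    'lag': family_share('lag_'),
    'weather': family_share(('precip_', 'temp_')),
    'seasonal': family_share('day_'),
}
print(shares)  # lag features dominate, as described above
```

The same aggregation applied to a real model's `feature_importances_` gives a quick sanity check that the learned pattern matches aquifer physics.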
```{python}
#| code-fold: true
#| code-summary: "Show code"
#| label: fig-feature-importance
#| fig-cap: "Feature importance from Random Forest model. Previous water levels and weather features are most predictive."
if 'trained_models' in dir() and 'Random Forest' in trained_models and len(feature_cols) > 0:
rf_model = trained_models['Random Forest']
importances = rf_model.feature_importances_
# Get top 15 features
importance_df = pd.DataFrame({
'Feature': feature_cols,
'Importance': importances
}).nlargest(15, 'Importance').sort_values('Importance', ascending=True)
fig = go.Figure()
fig.add_trace(go.Bar(
y=importance_df['Feature'],
x=importance_df['Importance'],
orientation='h',
marker=dict(color='#3b82f6', line=dict(color='black', width=1)),
text=[f'{v:.1%}' for v in importance_df['Importance']],
textposition='outside'
))
fig.update_layout(
title="Top 15 Feature Importances (Random Forest, 7-day horizon)",
xaxis_title="Importance",
yaxis_title="Feature",
template='plotly_white',
height=450,
xaxis=dict(tickformat='.0%')
)
fig.show()
print("\nFeature Importance Analysis:")
print(" - Lag features (previous water levels) dominate predictions")
print(" - Weather features (precipitation, temperature) provide additional signal")
print(" - Seasonal features (day_of_year) capture annual patterns")
else:
print("⚠️ No trained models available for feature importance analysis")
print(" This section requires successful model training")
```
---
## Key Insights
::: {.callout-important icon=false}
## Model Comparison Findings
**Algorithm Performance** (ranked by average R²):
1. **XGBoost** - Consistently best across all horizons, especially for long-term forecasts
2. **Random Forest** - Excellent performance, slightly faster training
3. **Gradient Boosting** - Strong performer, good balance of speed and accuracy
4. **Neural Network** - Competitive but requires more tuning
5. **Linear/Ridge Regression** - Good baselines, but miss non-linear patterns
6. **K-Nearest Neighbors** - Underperforms on time series data
**Horizon Effects**:
- All models degrade with longer horizons (expected behavior)
- Tree-based methods degrade **slower** than linear methods
- At 30 days, XGBoost maintains ~70% R² while Linear drops to ~50%
**Practical Recommendation**: Use **XGBoost** or **Random Forest** for production deployment due to:
- High accuracy across all horizons
- Fast inference (milliseconds)
- Built-in feature importance
- Robust to outliers
:::
---
## Operational Alerts
### Threshold-Based Alerts
::: {.callout-note icon=false}
## Understanding Threshold-Based Alerting Systems
**What Is It?**
Threshold-based alerting is a real-time monitoring framework that triggers notifications when water levels cross predefined critical values. This operational approach emerged from industrial process control systems in the 1950s-60s and was adapted for environmental monitoring in the 1980s-90s as automated sensors became affordable.
**Why Does It Matter?**
Early warning saves money and prevents disasters. A 7-14 day advance warning that water levels will drop below critical thresholds gives managers time to activate conservation measures, switch to backup sources, or adjust pumping schedules. Without forecasts, managers only react after problems occur—too late to prevent impacts.
**How Does It Work?**
1. **Define Thresholds**: Set critical water level values based on well capacity, environmental regulations, and system requirements
2. **Forecast Future Levels**: Use ML models to predict water levels 1-30 days ahead
3. **Compare to Thresholds**: Check if forecasted levels will cross any threshold during forecast horizon
4. **Trigger Alerts**: Send notifications when threshold crossings are predicted with >70% confidence
5. **Take Action**: Activate response protocols before conditions deteriorate
**What Will You See?**
Time series plots with horizontal threshold lines (red for critical, yellow for warning, green for normal). Forecasts appear as shaded confidence bands. When the forecast band touches a threshold line, an alert triggers. Alert dashboards show which wells need attention and when.
**How to Interpret Alert Levels:**
| Alert Level | Water Level | Probability | Lead Time | Management Action |
|-------------|-------------|-------------|-----------|-------------------|
| 🔴 **CRITICAL LOW** | <10m depth | >80% | 7-14 days | Emergency response: Reduce pumping 50%, activate backup sources |
| 🟡 **WARNING LOW** | 10-15m depth | >70% | 7-14 days | Proactive: Begin conservation messaging, defer non-essential use |
| 🟢 **NORMAL** | 15-25m depth | N/A | N/A | Continue routine monitoring |
| 🟡 **WARNING HIGH** | >25m depth | >70% | 7-14 days | Flooding risk: Inspect wellheads, prepare drainage |
**Key Parameters:**
- **Confidence threshold**: 70% = Alert only if model is >70% confident threshold will be crossed
- **Lead time**: 7-14 days = Sweet spot where forecasts are accurate enough AND actionable
- **Action window**: Time between alert and threshold crossing (must be long enough to intervene)
**Common Pitfall:** Setting thresholds too conservatively (alert fatigue) or too aggressively (missing real threats). Calibrate based on historical data and cost of false alarms vs missed events.
:::
| Condition | Threshold | Alert Level | Action |
|-----------|-----------|-------------|--------|
| **Critical Low** | <10m below surface | 🔴 RED | Immediate intervention |
| **Warning Low** | 10-15m below surface | 🟡 YELLOW | Prepare contingency |
| **Normal** | 15-25m below surface | 🟢 GREEN | Monitor |
| **Warning High** | >25m below surface | 🟡 YELLOW | Flooding risk |
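The alert logic in the tables above can be sketched as a pure function. The threshold depths and confidence gates come from the tables; the crossing-probability input is assumed to be supplied by an ensemble or quantile forecast:

```python
def check_alert(depth_m: float, crossing_prob: float) -> str:
    """Return an alert level for a forecast depth-to-water (m below surface).

    crossing_prob is the estimated probability that the threshold is
    actually crossed within the forecast horizon (e.g. from an ensemble).
    """
    if depth_m < 10 and crossing_prob > 0.80:
        return "CRITICAL LOW"
    if depth_m < 15 and crossing_prob > 0.70:
        return "WARNING LOW"
    if depth_m > 25 and crossing_prob > 0.70:
        return "WARNING HIGH"
    return "NORMAL"

print(check_alert(9.2, 0.85))   # high-confidence critical forecast
print(check_alert(12.0, 0.50))  # below the confidence gate
```

One design choice worth noting: a sub-10 m forecast with only 70-80% confidence is demoted to a warning rather than suppressed entirely, trading a few extra warnings for fewer missed critical events.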
### Lead Time Benefits
::: {.callout-note icon=false}
## 📘 Understanding Forecast Lead Time Value
**What Is It?**
Lead time is the interval between when a forecast is issued and when the forecasted event occurs. Different operational decisions require different lead times—pumping adjustments need 1-3 days, drought preparation needs 7-14 days, infrastructure decisions need 30+ days.
**Why Does It Matter?**
Lead time determines what actions are possible. A 1-day drought warning is too late for a meaningful response; a 14-day warning allows water conservation campaigns, coordination with neighboring utilities, and activation of emergency supplies. The economic value of forecasts increases dramatically with lead time, provided accuracy is maintained.
**How to Match Lead Time to Decisions:**
| Lead Time | Achievable Accuracy | Operational Value |
|-----------|---------------------|-------------------|
| **1-3 days** | High (R² >0.90) | Pumping schedule optimization, demand balancing |
| **7-14 days** | Moderate (R² 0.75-0.85) | Drought warnings, conservation campaigns, staff scheduling |
| **14-30 days** | Lower (R² 0.60-0.75) | Budget planning, infrastructure maintenance, permit compliance |
**Value Calculation Example:**
If a 7-day drought warning allows activating backup supplies before shortfall:
- Cost of emergency water trucking: $50,000/event
- Cost of backup well activation: $5,000/event
- **Forecast value**: $45,000 saved per avoided emergency
**Key Trade-off:** Longer lead times enable more response options but have lower accuracy. The optimal lead time balances accuracy against action requirements.
:::
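The value calculation above generalizes to a back-of-the-envelope expected-savings formula; `expected_forecast_value` and its default costs are illustrative assumptions, not measured figures:

```python
def expected_forecast_value(hits_per_year: float,
                            false_alarms_per_year: float,
                            cost_emergency: float = 50_000,
                            cost_action: float = 5_000) -> float:
    """Expected annual savings from acting on drought warnings.

    Each correctly forecast event avoids the emergency cost but still
    pays the (much cheaper) preventive action; each false alarm pays
    the preventive action with nothing avoided.
    """
    savings_per_hit = cost_emergency - cost_action
    return hits_per_year * savings_per_hit - false_alarms_per_year * cost_action
```

With one correctly forecast event and no false alarms, this reproduces the $45,000 figure above; two hits with three false alarms still nets $75,000, which is why a moderately noisy forecast can remain valuable.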
With accurate 7-14 day forecasts:
- **Drought warning**: 7-14 days advance notice
- **Pumping optimization**: Adjust schedules 1-3 days ahead
- **Emergency response**: Time to activate backup sources
---
## Summary
::: {.callout-tip icon=false}
## Key Takeaways
✅ **ML models can forecast water levels** from 1-30 days ahead with useful accuracy
✅ **Tree-based methods** (Random Forest, XGBoost, Gradient Boosting) outperform linear models at all horizons
✅ **Forecast accuracy degrades** with horizon length: R² drops from ~0.95 (1-day) to ~0.60 (30-day)
✅ **Weather features** (precipitation, temperature) improve forecasts, especially at longer horizons
✅ **Ensemble approaches** combining multiple models reduce prediction uncertainty
:::
**Key Insight**: Operational water level forecasting is feasible with ML models, providing 7-14 days advance warning for drought conditions. Tree-based methods maintain skill at longer horizons where linear models fail, making them suitable for operational deployment in water management.
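As a sketch of the ensemble idea from the takeaways, per-model forecasts can be averaged into a mean prediction with a spread-based uncertainty band (a rough stand-in for a calibrated interval; the model outputs here are hypothetical):

```python
import numpy as np

def ensemble_forecast(predictions: np.ndarray):
    """Combine forecasts from several models (shape: n_models x n_days)
    into a mean forecast plus a +/-2-sigma band from the model spread."""
    mean = predictions.mean(axis=0)
    spread = predictions.std(axis=0)
    return mean, mean - 2 * spread, mean + 2 * spread

# Hypothetical 2-day forecasts from three models
preds = np.array([[10.0, 12.0],
                  [12.0, 14.0],
                  [11.0, 13.0]])
mean, lower, upper = ensemble_forecast(preds)
```

When the models agree, the band collapses toward the mean; disagreement widens it, flagging forecasts that deserve less operational trust.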
---
## Limitations
::: {.callout-warning icon=false}
## Model Limitations
1. **Temporal Autocorrelation**: Water levels are highly autocorrelated, which can inflate apparent accuracy. True forecast skill requires proper time-series cross-validation.
2. **Weather Forecast Uncertainty**: For forecasts >7 days, weather forecasts add uncertainty not captured in the model.
3. **Extreme Events**: Models trained on historical data may not predict unprecedented droughts or floods.
4. **Local Changes**: New pumping wells or land use changes can invalidate models.
5. **Data Gaps**: Missing weather or water level data reduces forecast accuracy.
:::
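Limitation 1 is the most common way to fool yourself with groundwater data. A minimal expanding-window splitter (mirroring the behavior of scikit-learn's `TimeSeriesSplit`, sketched here without the dependency) keeps every validation block strictly after its training data:

```python
import numpy as np

def time_series_splits(n_samples: int, n_splits: int = 5):
    """Yield (train_idx, test_idx) pairs where each fold trains only on
    the past and validates on the next contiguous block, so autocorrelated
    future values can never leak into training (unlike shuffled K-fold)."""
    fold = n_samples // (n_splits + 1)
    for k in range(1, n_splits + 1):
        yield np.arange(0, k * fold), np.arange(k * fold, (k + 1) * fold)
```

Scoring R² on these folds, rather than on a random split, gives an honest estimate of forecast skill at each horizon.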
---
## Reflection Questions
1. How would you explain the practical difference between a high R² and a low RMSE when judging forecast quality for operations staff?
2. If forecasts begin to systematically under-predict low water levels, which parts of the data pipeline or model configuration would you investigate first?
3. How might forecast requirements change between short-term pumping optimization (1–3 days) and longer-term planning (14–30 days)?
4. What additional features (beyond those already used here) could plausibly improve groundwater level forecasts in your own region?
5. How could you combine these forecasts with anomaly detection to build a more robust early-warning system?
---
## Related Chapters
- [Temporal Fusion Engine](../part-4-fusion/temporal-fusion-engine.qmd) - Data pipeline for forecasting
- [Anomaly Early Warning](anomaly-early-warning.qmd) - Combine with forecasts for complete monitoring
- [Weather Response Fusion](../part-4-fusion/weather-response-fusion.qmd) - Climate forcing analysis
- [Explainable AI Insights](explainable-ai-insights.qmd) - SHAP values for forecast interpretation