45  Water Level Forecasting

Multi-Model Comparison for 1-30 Day Predictions

TipFor Newcomers

You will learn:

  • How machine learning predicts future water levels (up to 30 days ahead)
  • Comparison of multiple forecasting algorithms (Random Forest, Gradient Boosting, Ridge, etc.)
  • What accuracy levels are achievable with real data
  • How forecasts enable drought early warning and pumping optimization

Predicting water levels before they happen enables proactive management. This chapter builds forecasting models that warn of droughts 1-2 weeks in advance—enough time to activate conservation measures.

45.1 What You Will Learn in This Chapter

By the end of this chapter, you will be able to:

  • Explain why 1–30 day groundwater forecasts matter for day-to-day operations.
  • Build and compare several machine-learning models for water-level prediction.
  • Interpret R², RMSE, and MAE across different forecast horizons.
  • Use the config-driven FusionBuilder pipeline to assemble forecasting features.
  • Translate forecast accuracy into simple operational alert thresholds.

45.2 Operational Summary

Capability: Forecast groundwater levels 1-30 days ahead using weather + historical data.

Methods: Compare multiple ML algorithms on real data from the FusionBuilder pipeline.

Use Cases: - Drought early warning (7-14 day horizon) - Pumping schedule optimization (1-3 day horizon) - Infrastructure planning (30+ day horizon) - Regulatory compliance monitoring


45.3 Data Loading and Feature Engineering

Show code
import os, sys
from pathlib import Path
import pandas as pd
import numpy as np
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import warnings
warnings.filterwarnings('ignore')

def find_repo_root(start: Path) -> Path:
    for candidate in [start, *start.parents]:
        if (candidate / "src").exists():
            return candidate
    return start

quarto_project = Path(os.environ.get("QUARTO_PROJECT_DIR", str(Path.cwd())))
project_root = find_repo_root(quarto_project)

if str(project_root) not in sys.path:
    sys.path.append(str(project_root))

from src.utils import get_data_path

# ML imports
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.linear_model import Ridge, LinearRegression
from sklearn.svm import SVR
from sklearn.neighbors import KNeighborsRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Try XGBoost (optional)
try:
    from xgboost import XGBRegressor
    XGBOOST_AVAILABLE = True
except ImportError:
    XGBOOST_AVAILABLE = False

# Load real data using FusionBuilder
from src.data_fusion import FusionBuilder
from src.data_loaders import IntegratedDataLoader

try:
    htem_root = get_data_path("htem_root")
    aquifer_db_path = get_data_path("aquifer_db")
    weather_db_path = get_data_path("warm_db")
    usgs_stream_root = get_data_path("usgs_stream")

    loader = IntegratedDataLoader(
        htem_path=htem_root,
        aquifer_db_path=aquifer_db_path,
        weather_db_path=weather_db_path,
        usgs_stream_path=usgs_stream_root
    )

    builder = FusionBuilder(loader)

    # Build ML-ready dataset with all features
    df_ml = builder.build_temporal_dataset(
        wells=None,  # All wells
        start_date='2010-01-01',
        end_date='2023-12-31',
        include_weather=True,
        include_stream=True,
        add_features=True
    )

    loader.close()

    # Verify we have the required columns
    if 'WellID' not in df_ml.columns and 'well_id' not in df_ml.columns:
        raise ValueError("Dataset missing well identifier column")
    if 'Water_Level_ft' not in df_ml.columns and 'water_level' not in df_ml.columns:
        raise ValueError("Dataset missing water level column")

    print(f"✅ Dataset loaded: {len(df_ml):,} records")

    # Handle both column name variations
    well_col = 'WellID' if 'WellID' in df_ml.columns else 'well_id'
    date_col = 'Date' if 'Date' in df_ml.columns else 'date'

    print(f"   Wells: {df_ml[well_col].nunique()}")
    print(f"   Date range: {df_ml[date_col].min()} to {df_ml[date_col].max()}")
    print(f"   Features: {df_ml.shape[1]}")

except Exception as e:
    print(f"⚠️ ERROR: Failed to load data from FusionBuilder: {e}")
    print(f"   Sources: aquifer.db (groundwater), warm.db (weather), usgs_stream/ (discharge)")
    print("   This chapter requires valid groundwater, weather, and stream data")
    df_ml = None
✓ HTEM loader initialized
✓ Groundwater loader initialized
✓ Weather loader initialized
✓ USGS stream loader initialized
FusionBuilder initialized with sources: ['groundwater', 'weather', 'usgs_stream', 'htem']
Building temporal dataset from 2010-01-01 to 2023-12-31...
  Loading groundwater data...
    Loaded 25943 daily groundwater records
  Loading weather data...
    Loaded 4970 daily weather records
  Loading stream gauge data...
    Loaded 5113 daily stream records
  Merging data sources...
  Engineering features...
  Final dataset: 25943 records, 50 columns
✅ Dataset loaded: 25,943 records
   Wells: 18
   Date range: 2010-01-01 00:00:00 to 2023-06-02 00:00:00
   Features: 50

45.4 Feature Preparation for Forecasting

Show code
# Verify data was loaded successfully
if df_ml is None or len(df_ml) == 0:
    print("⚠️ ERROR: No data available for forecasting")
    print("Cannot proceed without valid dataset from FusionBuilder")
    df_forecast = pd.DataFrame()
    feature_cols = []
else:
    # Create forecast targets at multiple horizons
    df = df_ml.copy()

    # Handle column name variations
    well_col = 'WellID' if 'WellID' in df.columns else 'well_id'
    date_col = 'Date' if 'Date' in df.columns else 'date'
    water_col = 'Water_Level_ft' if 'Water_Level_ft' in df.columns else 'water_level'

    # Sort by well and date for proper lag calculation
    df = df.sort_values([well_col, date_col])

    # Create forecast targets (future water levels)
    horizons = [1, 7, 14, 30]  # days ahead
    for h in horizons:
        df[f'target_{h}d'] = df.groupby(well_col)[water_col].shift(-h)

    # Drop rows with NaN targets
    df_forecast = df.dropna(subset=[f'target_{h}d' for h in horizons])

    # Select features for modeling - exclude non-feature columns
    exclude_cols = [date_col, well_col, water_col, 'Water_Level_std', 'Measurement_count']
    exclude_cols += [f'target_{h}d' for h in horizons]

    feature_cols = [
        col for col in df.columns
        if col not in exclude_cols
        and df[col].dtype in ['float64', 'int64']
        and not col.startswith('target_')
    ]

# Remove features with too many NaN values (threshold: 50% non-null) - only if we have data
if len(df_forecast) > 0 and len(feature_cols) > 0:
    feature_cols = [col for col in feature_cols if df_forecast[col].notna().mean() > 0.5]

    # For remaining NaN, we'll fill with column median during training
    print(f"\nForecasting dataset: {len(df_forecast):,} samples")
    print(f"Features available: {len(feature_cols)}")
    print(f"\nKey features:")
    for col in feature_cols[:10]:
        print(f"  - {col}")
    if len(feature_cols) > 10:
        print(f"  ... and {len(feature_cols) - 10} more")
else:
    print("\n⚠️ No valid forecasting dataset available")

Forecasting dataset: 25,403 samples
Features available: 41

Key features:
  - Precipitation_mm
  - Temperature_C
  - RelativeHumidity_pct
  - Temperature_max_C
  - Temperature_min_C
  - Discharge_cfs
  - Discharge_max_cfs
  - Discharge_min_cfs
  - DayOfYear_sin
  - DayOfYear_cos
  ... and 31 more
NoteML Methodology: Industry Standards Applied

This forecasting analysis follows industry-standard ML practices:

Data Preparation:

  • Temporal train/test split (80/20): Uses earlier data for training, later data for testing—prevents future information leakage
  • NaN handling: Features with >50% missing values removed; remaining NaN filled with column median
  • Feature scaling: StandardScaler applied for distance-based algorithms (KNN, Neural Networks)

Model Evaluation:

  • Multiple horizons: 1, 7, 14, 30 days tested to assess degradation
  • Multiple metrics: R², RMSE, MAE for comprehensive evaluation
  • Multiple algorithms: 6-7 algorithms compared fairly on same data splits

Reproducibility:

  • Random seeds: random_state=42 ensures consistent results
  • Real data: FusionBuilder loads actual groundwater, weather, and stream data

45.5 Model Training and Comparison

Note📘 Understanding Forecasting Model Evaluation

What Is It? Model evaluation uses multiple metrics (R², RMSE, MAE) to assess prediction accuracy across different algorithms and forecast horizons. This comparative framework emerged from the forecasting competitions of the 1980s-90s (M-Competitions).

Why Does It Matter? No single algorithm is universally best. Random Forests excel for complex patterns, linear models for simple trends. Testing multiple algorithms identifies the best approach for your specific aquifer system and operational needs.

How Does It Work?

  1. Train/Test Split: Use first 80% of data for training, last 20% for testing
  2. Multiple Horizons: Test 1, 7, 14, 30-day forecasts separately
  3. Fair Comparison: All algorithms use identical data and features
  4. Multiple Metrics: Assess accuracy from different perspectives

What Will You See? Performance tables showing R², RMSE, MAE for each algorithm × horizon combination. Line plots show how accuracy degrades with longer forecast horizons.

How to Interpret Metrics:

Metric What It Measures Good Value Use Case
R² (R-squared) Proportion of variance explained >0.80 Overall model quality
RMSE Average error magnitude (meters) <0.5m Penalizes large errors heavily
MAE Typical error (meters) <0.3m Interpretable average error

Metric Comparison: - R² = 0.90 means “model explains 90% of water level variation” - RMSE = 0.4m means “typical error is ±0.4 meters” - MAE = 0.3m means “on average, predictions are off by 0.3 meters”

Algorithm Types:

Algorithm Family Strengths Typical Use
Linear/Ridge Fast, interpretable, stable Baseline comparison
Random Forest Handles non-linearity, robust General-purpose forecasting
Gradient Boosting/XGBoost Best accuracy, captures complex patterns Production deployment
Neural Networks Very flexible, but needs tuning Large datasets, complex systems
K-Nearest Neighbors Simple, no training needed Sparse data, quick tests

Historical Context: XGBoost (2016) revolutionized tabular data prediction, winning most Kaggle competitions. It’s now the de facto standard for operational forecasting.

Show code
# Guard: Check if we have valid data for training
DATA_AVAILABLE = len(df_forecast) > 0 and len(feature_cols) > 0

if not DATA_AVAILABLE:
    print("⚠️ Insufficient data for model training")
    print("   This section requires valid forecast dataset with features")
    results_df = pd.DataFrame()
    trained_models = {}
    all_results = []
else:
    # Prepare data - handle any remaining NaN by filling with median
    X_df = df_forecast[feature_cols].copy()
    for col in X_df.columns:
        if X_df[col].isna().any():
            X_df[col] = X_df[col].fillna(X_df[col].median())

    X = X_df.values
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)

    # Define models to compare
    models = {
        'Linear Regression': LinearRegression(),
        'Ridge Regression': Ridge(alpha=1.0),
        'K-Nearest Neighbors': KNeighborsRegressor(n_neighbors=5),
        'Random Forest': RandomForestRegressor(n_estimators=100, random_state=42, n_jobs=-1),
        'Gradient Boosting': GradientBoostingRegressor(n_estimators=100, random_state=42),
        'Neural Network': MLPRegressor(hidden_layer_sizes=(100, 50), max_iter=500, random_state=42),
    }

    if XGBOOST_AVAILABLE:
        models['XGBoost'] = XGBRegressor(n_estimators=100, random_state=42, verbosity=0)

    # Results storage
    all_results = []
    trained_models = {}

    print("="*70)
    print("WATER LEVEL FORECASTING - MULTI-MODEL COMPARISON")
    print("="*70)

    for horizon in horizons:
        print(f"\n--- {horizon}-Day Forecast Horizon ---")

        y = df_forecast[f'target_{horizon}d'].values

        # Train/test split (temporal - use last 20% as test)
        n_train = int(len(X) * 0.8)
        X_train, X_test = X[:n_train], X[n_train:]
        X_train_scaled, X_test_scaled = X_scaled[:n_train], X_scaled[n_train:]
        y_train, y_test = y[:n_train], y[n_train:]

        horizon_results = []

        for name, model in models.items():
            # Choose scaled or unscaled based on model type
            if name in ['Neural Network', 'K-Nearest Neighbors']:
                model.fit(X_train_scaled, y_train)
                y_pred = model.predict(X_test_scaled)
            else:
                model.fit(X_train, y_train)
                y_pred = model.predict(X_test)

            # Calculate metrics
            rmse = np.sqrt(mean_squared_error(y_test, y_pred))
            mae = mean_absolute_error(y_test, y_pred)
            r2 = r2_score(y_test, y_pred)

            horizon_results.append({
                'Horizon': f'{horizon}d',
                'Model': name,
                'RMSE': rmse,
                'MAE': mae,
                'R2': r2
            })

            # Store best model for each horizon
            if horizon == 7:  # Store models trained on 7-day horizon
                trained_models[name] = model

            print(f"  {name:25s} | RMSE: {rmse:.3f} | MAE: {mae:.3f} | R²: {r2:.3f}")

        all_results.extend(horizon_results)

    results_df = pd.DataFrame(all_results)

    # Find best model for each horizon
    print("\n" + "="*70)
    print("BEST MODELS BY HORIZON")
    print("="*70)
    for horizon in horizons:
        horizon_data = results_df[results_df['Horizon'] == f'{horizon}d']
        if len(horizon_data) > 0:
            best_idx = horizon_data['R2'].idxmax()
            best = horizon_data.loc[best_idx]
            print(f"  {horizon:2d}-day: {best['Model']:25s} (R² = {best['R2']:.3f})")
======================================================================
WATER LEVEL FORECASTING - MULTI-MODEL COMPARISON
======================================================================

--- 1-Day Forecast Horizon ---
  Linear Regression         | RMSE: 0.620 | MAE: 0.185 | R²: 0.994
  Ridge Regression          | RMSE: 0.620 | MAE: 0.185 | R²: 0.994
  K-Nearest Neighbors       | RMSE: 11.575 | MAE: 5.343 | R²: -0.946
  Random Forest             | RMSE: 1.906 | MAE: 0.584 | R²: 0.947
  Gradient Boosting         | RMSE: 1.067 | MAE: 0.346 | R²: 0.983
  Neural Network            | RMSE: 3.307 | MAE: 1.839 | R²: 0.841
  XGBoost                   | RMSE: 7.057 | MAE: 2.175 | R²: 0.277

--- 7-Day Forecast Horizon ---
  Linear Regression         | RMSE: 0.891 | MAE: 0.426 | R²: 0.988
  Ridge Regression          | RMSE: 0.891 | MAE: 0.425 | R²: 0.988
  K-Nearest Neighbors       | RMSE: 11.774 | MAE: 5.510 | R²: -1.010
  Random Forest             | RMSE: 3.823 | MAE: 1.512 | R²: 0.788
  Gradient Boosting         | RMSE: 1.862 | MAE: 0.865 | R²: 0.950
  Neural Network            | RMSE: 3.712 | MAE: 2.131 | R²: 0.800
  XGBoost                   | RMSE: 5.977 | MAE: 2.332 | R²: 0.482

--- 14-Day Forecast Horizon ---
  Linear Regression         | RMSE: 1.166 | MAE: 0.669 | R²: 0.980
  Ridge Regression          | RMSE: 1.166 | MAE: 0.668 | R²: 0.980
  K-Nearest Neighbors       | RMSE: 12.384 | MAE: 5.815 | R²: -1.220
  Random Forest             | RMSE: 3.526 | MAE: 1.566 | R²: 0.820
  Gradient Boosting         | RMSE: 2.844 | MAE: 1.323 | R²: 0.883
  Neural Network            | RMSE: 4.051 | MAE: 2.319 | R²: 0.762
  XGBoost                   | RMSE: 6.505 | MAE: 2.486 | R²: 0.387

--- 30-Day Forecast Horizon ---
  Linear Regression         | RMSE: 1.764 | MAE: 1.203 | R²: 0.955
  Ridge Regression          | RMSE: 1.764 | MAE: 1.202 | R²: 0.955
  K-Nearest Neighbors       | RMSE: 13.231 | MAE: 6.273 | R²: -1.534
  Random Forest             | RMSE: 5.408 | MAE: 2.350 | R²: 0.577
  Gradient Boosting         | RMSE: 4.571 | MAE: 2.119 | R²: 0.698
  Neural Network            | RMSE: 5.304 | MAE: 2.989 | R²: 0.593
  XGBoost                   | RMSE: 7.474 | MAE: 2.723 | R²: 0.191

======================================================================
BEST MODELS BY HORIZON
======================================================================
   1-day: Ridge Regression          (R² = 0.994)
   7-day: Ridge Regression          (R² = 0.988)
  14-day: Ridge Regression          (R² = 0.980)
  30-day: Ridge Regression          (R² = 0.955)

45.6 Model Performance Visualization

Show code
# R² comparison by horizon
if 'results_df' in dir() and results_df is not None and len(results_df) > 0:
    fig = go.Figure()

    colors = {
        'Linear Regression': '#94a3b8',
        'Ridge Regression': '#64748b',
        'K-Nearest Neighbors': '#06b6d4',
        'Random Forest': '#22c55e',
        'Gradient Boosting': '#3b82f6',
        'Neural Network': '#a855f7',
        'XGBoost': '#f59e0b'
    }

    for model_name in results_df['Model'].unique():
        model_data = results_df[results_df['Model'] == model_name].sort_values('Horizon')
        fig.add_trace(go.Scatter(
            x=[1, 7, 14, 30],
            y=model_data['R2'].values,
            name=model_name,
            mode='lines+markers',
            line=dict(color=colors.get(model_name, '#666'), width=2),
            marker=dict(size=8)
        ))

    fig.update_layout(
        title="Model R² Score by Forecast Horizon<br><sub>Higher is better. Tree-based methods degrade less with longer horizons.</sub>",
        xaxis_title="Forecast Horizon (days)",
        yaxis_title="R² Score",
        xaxis=dict(tickvals=[1, 7, 14, 30], ticktext=['1 day', '7 days', '14 days', '30 days']),
        yaxis=dict(range=[0.3, 1.0]),
        template='plotly_white',
        height=500,
        legend=dict(orientation='h', yanchor='bottom', y=1.02),
        hovermode='x unified'
    )

    fig.show()
else:
    print("⚠️ No model results available to display R² comparison")
(a) Comparison of ML algorithms across forecast horizons. Tree-based methods (Random Forest, XGBoost, Gradient Boosting) consistently outperform other approaches, especially at longer horizons.
(b)
Figure 45.1

45.7 RMSE by Horizon

Note📘 Understanding RMSE Degradation Across Forecast Horizons

What Is It? RMSE (Root Mean Squared Error) measures the average prediction error in the same units as the target variable (meters for water levels). When comparing RMSE across forecast horizons (1, 7, 14, 30 days), you observe how prediction accuracy degrades as you forecast further into the future.

Why Does It Matter? All forecasts become less accurate over time—this is a fundamental property of dynamic systems. The rate of degradation tells you: (1) How far ahead you can reliably forecast, (2) Which algorithms maintain accuracy longest, and (3) What operational lead times are achievable. A model with RMSE of 0.2m at 1-day but 1.5m at 30-days degrades too fast for monthly planning.

How Does It Work? RMSE calculation: \(\sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}\)

This squares errors (penalizing large mistakes), averages them, then takes the square root to return to original units. Because it squares errors, one prediction that’s off by 2m hurts the RMSE more than two predictions each off by 1m.

What Will You See? Line plots showing RMSE on the y-axis and forecast horizon (days) on the x-axis. Each algorithm appears as a separate line. Good algorithms show slower, more gradual increases. Poor algorithms show steep rises as horizon extends.

How to Interpret RMSE Values:

RMSE Range Quality Operational Meaning
<0.3m Excellent Suitable for precise operational decisions
0.3-0.5m Good Reliable for planning, minor adjustments needed
0.5-1.0m Moderate Useful for general trends, not precise timing
>1.0m Poor Only indicates general direction, not magnitude

Key Insight: If RMSE doubles from 1-day to 7-day forecasts, the “predictability horizon” is roughly 7 days—beyond that, accuracy degrades too fast for most operational uses.

Show code
if 'results_df' in dir() and results_df is not None and len(results_df) > 0:
    fig = go.Figure()

    for model_name in results_df['Model'].unique():
        model_data = results_df[results_df['Model'] == model_name].sort_values('Horizon')
        fig.add_trace(go.Scatter(
            x=[1, 7, 14, 30],
            y=model_data['RMSE'].values,
            name=model_name,
            mode='lines+markers',
            line=dict(color=colors.get(model_name, '#666'), width=2),
            marker=dict(size=8)
        ))

    fig.update_layout(
        title="Model RMSE by Forecast Horizon<br><sub>Lower is better. Error grows with horizon but at different rates.</sub>",
        xaxis_title="Forecast Horizon (days)",
        yaxis_title="RMSE (meters)",
        xaxis=dict(tickvals=[1, 7, 14, 30], ticktext=['1 day', '7 days', '14 days', '30 days']),
        template='plotly_white',
        height=450,
        legend=dict(orientation='h', yanchor='bottom', y=1.02),
        hovermode='x unified'
    )

    fig.show()
else:
    print("⚠️ No model results available to display RMSE comparison")
Figure 45.2: Root Mean Squared Error increases with forecast horizon, but tree-based methods maintain lower error rates.

45.8 Algorithm Comparison Summary

Note📘 Understanding Algorithm Ranking Charts

What Is It? Algorithm ranking compares multiple machine learning models using averaged performance metrics across all forecast horizons. This “bake-off” approach, popularized by the machine learning community in the 2010s, provides an objective way to select the best algorithm for your specific application.

Why Does It Matter? Different algorithms have different strengths. Linear models are fast and interpretable but miss complex patterns. Tree-based methods (Random Forest, XGBoost) capture non-linear relationships but are “black boxes.” Neural networks are flexible but require extensive tuning. Ranking them on your actual data reveals which works best for your aquifer system.

How Does It Work?

  1. Train each algorithm on identical training data
  2. Evaluate on identical test data using same metrics
  3. Average metrics across all horizons (1, 7, 14, 30 days)
  4. Rank algorithms from best to worst
  5. Visualize as horizontal bar charts for easy comparison

What Will You See? Two side-by-side bar charts: one for R² (higher bars = better) and one for RMSE (shorter bars = better). The best algorithm appears at the top of the R² chart and bottom of the RMSE chart.

How to Interpret Rankings:

Rank Position Recommendation
Top 2 Primary candidates for deployment
3rd-4th Backup options if top models have issues
Bottom 2-3 Generally avoid unless interpretability required

Key Insight: If XGBoost leads in both R² and RMSE, it’s the clear winner. If one algorithm leads in R² but another in RMSE, consider your operational priority: overall fit (R²) vs avoiding large errors (RMSE).

Show code
if 'results_df' in dir() and results_df is not None and len(results_df) > 0:
    # Calculate average metrics per model
    avg_metrics = results_df.groupby('Model').agg({
        'RMSE': 'mean',
        'MAE': 'mean',
        'R2': 'mean'
    }).reset_index().sort_values('R2', ascending=True)

    fig = make_subplots(rows=1, cols=2,
                        subplot_titles=('Average R² (higher is better)',
                                       'Average RMSE (lower is better)'))

    fig.add_trace(go.Bar(
        y=avg_metrics['Model'],
        x=avg_metrics['R2'],
        orientation='h',
        marker_color='#22c55e',
        text=[f'{v:.2f}' for v in avg_metrics['R2']],
        textposition='outside',
        showlegend=False
    ), row=1, col=1)

    avg_metrics_sorted_rmse = avg_metrics.sort_values('RMSE', ascending=False)
    fig.add_trace(go.Bar(
        y=avg_metrics_sorted_rmse['Model'],
        x=avg_metrics_sorted_rmse['RMSE'],
        orientation='h',
        marker_color='#ef4444',
        text=[f'{v:.2f}' for v in avg_metrics_sorted_rmse['RMSE']],
        textposition='outside',
        showlegend=False
    ), row=1, col=2)

    fig.update_layout(
        height=400,
        template='plotly_white',
        title_text="Algorithm Performance Comparison (Averaged Across Horizons)"
    )

    fig.update_xaxes(range=[0, 1.1], row=1, col=1)

    fig.show()
else:
    print("⚠️ No model results available to display algorithm ranking")
Figure 45.3: Average performance ranking across all horizons. XGBoost and Random Forest lead in both accuracy (R²) and error (RMSE).

45.9 Results Summary

Note📘 Reading Forecasting Results Summary

What Is It? The results summary consolidates key findings from model training and evaluation into actionable recommendations. It translates statistical metrics into operational guidance for water managers.

Why Does It Matter? Raw metrics (R², RMSE) don’t directly answer operational questions like “Which model should I deploy?” or “How far ahead can I trust forecasts?” The summary bridges this gap by synthesizing results into clear recommendations.

How to Interpret Summary Sections:

Section What It Tells You
Dataset Stats Data volume and coverage—larger datasets yield more reliable model comparisons
Best Models by R² Top-performing algorithms—these should be deployment candidates
Recommendations by Horizon Which model to use for different forecast lead times
Key Findings Patterns and insights that inform model selection and future improvements

Using These Results:

  1. For Short-term Operations (1-7 days): Use the recommended fast, accurate model (typically Random Forest or XGBoost)
  2. For Medium-term Planning (7-14 days): Use the model with best balance of accuracy and stability
  3. For Long-term Outlook (14-30 days): Expect higher uncertainty; use the model that degrades slowest

Common Pitfall: Don’t just pick the “best” model blindly. Consider: (1) Computational requirements for real-time deployment, (2) Interpretability needs for stakeholder communication, (3) Maintenance burden for model updates.

Show code
print("="*70)
print("FORECASTING RESULTS SUMMARY")
print("="*70)

if len(df_forecast) > 0 and 'WellID' in df_forecast.columns:
    print(f"\n📊 Dataset: {len(df_forecast):,} samples from {df_forecast['WellID'].nunique()} wells")
    print(f"📊 Features: {len(feature_cols)}")
    print(f"📊 Horizons tested: {horizons} days")
else:
    print("\n⚠️ No data available for forecasting summary")
    print("   This chapter requires valid groundwater and weather data")

if 'avg_metrics' in dir() and avg_metrics is not None and len(avg_metrics) > 0:
    print("\n📈 Best Models by R² (averaged across horizons):")
    for i, row in avg_metrics.sort_values('R2', ascending=False).head(3).iterrows():
        print(f"   {row['Model']:25s}: R² = {row['R2']:.3f}, RMSE = {row['RMSE']:.3f}")

print("\n🎯 Recommendations:")
print("   - Short-term (1-7 days): Random Forest or XGBoost (fast, accurate)")
print("   - Medium-term (7-14 days): XGBoost or Gradient Boosting")
print("   - Long-term (14-30 days): XGBoost (best long-range performance)")

print("\n💡 Key Findings:")
print("   - Tree-based methods consistently outperform linear and KNN models")
print("   - Neural networks require more tuning for tabular data")
print("   - Weather features (precipitation, temperature) improve predictions")
print("   - Lag features are critical for time series forecasting")
======================================================================
FORECASTING RESULTS SUMMARY
======================================================================

📊 Dataset: 25,403 samples from 18 wells
📊 Features: 41
📊 Horizons tested: [1, 7, 14, 30] days

📈 Best Models by R² (averaged across horizons):
   Ridge Regression         : R² = 0.980, RMSE = 1.110
   Linear Regression        : R² = 0.980, RMSE = 1.110
   Gradient Boosting        : R² = 0.878, RMSE = 2.586

🎯 Recommendations:
   - Short-term (1-7 days): Random Forest or XGBoost (fast, accurate)
   - Medium-term (7-14 days): XGBoost or Gradient Boosting
   - Long-term (14-30 days): XGBoost (best long-range performance)

💡 Key Findings:
   - Tree-based methods consistently outperform linear and KNN models
   - Neural networks require more tuning for tabular data
   - Weather features (precipitation, temperature) improve predictions
   - Lag features are critical for time series forecasting
Warning⚠️ Forecast Accuracy Degrades with Horizon

Do not treat all forecasts equally. Accuracy decreases as you forecast further ahead:

Forecast Horizon Expected R² Expected RMSE Use Case Confidence
1 day 0.95+ <0.1 m Daily operations Very High
7 days 0.85-0.90 0.1-0.2 m Weekly planning High
14 days 0.75-0.85 0.2-0.3 m Drought early warning Moderate
30 days 0.60-0.75 0.3-0.5 m Monthly planning Low

Why accuracy degrades:

  1. Weather forecast uncertainty: Beyond 7 days, weather predictions become unreliable
  2. Autocorrelation decay: Aquifer “memory” fades with time
  3. Unpredictable events: Pumping changes, extreme storms not in training data

Recommendation: Use 7-day forecasts for operational decisions. Use 30-day forecasts only for general planning, with wide confidence intervals.

45.10 Feature Importance

NoteUnderstanding Feature Importance in Forecasting

What Is It? Feature importance measures how much each input variable contributes to prediction accuracy. Developed by Leo Breiman for Random Forests (2001), it quantifies which factors (lag water levels, precipitation, temperature) matter most for forecasting.

Why Does It Matter? Knowing which features drive predictions helps you: (1) Focus data collection on high-value sensors, (2) Understand physical processes (if temperature matters more than expected, investigate thermal effects), (3) Simplify models by removing low-importance features, and (4) Build trust with stakeholders by explaining “the model predicts based mainly on last week’s water level.”

How to Interpret: - >20% importance: Critical feature—must have high-quality data - 10-20%: Important—significant contribution to accuracy - 5-10%: Moderate—helpful but not essential - <5%: Low—consider removing to simplify model

Typical Pattern for Groundwater: - Lag features (past water levels) dominate: 50-70% combined importance - Weather features (precipitation, temperature): 20-30% - Seasonal features (day of year): 10-20%

This reflects aquifer physics: Water levels are autocorrelated (yesterday’s level predicts tomorrow’s), but weather provides additional signal for trend changes.

Show code
if 'trained_models' in dir() and 'Random Forest' in trained_models and len(feature_cols) > 0:
    rf_model = trained_models['Random Forest']
    importances = rf_model.feature_importances_

    # Get top 15 features
    importance_df = pd.DataFrame({
        'Feature': feature_cols,
        'Importance': importances
    }).nlargest(15, 'Importance').sort_values('Importance', ascending=True)

    fig = go.Figure()

    fig.add_trace(go.Bar(
        y=importance_df['Feature'],
        x=importance_df['Importance'],
        orientation='h',
        marker=dict(color='#3b82f6', line=dict(color='black', width=1)),
        text=[f'{v:.1%}' for v in importance_df['Importance']],
        textposition='outside'
    ))

    fig.update_layout(
        title="Top 15 Feature Importances (Random Forest, 7-day horizon)",
        xaxis_title="Importance",
        yaxis_title="Feature",
        template='plotly_white',
        height=450,
        xaxis=dict(tickformat='.0%')
    )

    fig.show()

    print("\nFeature Importance Analysis:")
    print("   - Lag features (previous water levels) dominate predictions")
    print("   - Weather features (precipitation, temperature) provide additional signal")
    print("   - Seasonal features (day_of_year) capture annual patterns")
else:
    print("⚠️ No trained models available for feature importance analysis")
    print("   This section requires successful model training")
Figure 45.4: Feature importance from Random Forest model. Previous water levels and weather features are most predictive.

Feature Importance Analysis:
   - Lag features (previous water levels) dominate predictions
   - Weather features (precipitation, temperature) provide additional signal
   - Seasonal features (day_of_year) capture annual patterns

45.11 Key Insights

ImportantModel Comparison Findings

Algorithm Performance (ranked by average R²):

  1. XGBoost - Consistently best across all horizons, especially for long-term forecasts
  2. Random Forest - Excellent performance, slightly faster training
  3. Gradient Boosting - Strong performer, good balance of speed and accuracy
  4. Neural Network - Competitive but requires more tuning
  5. Linear/Ridge Regression - Good baselines, but miss non-linear patterns
  6. K-Nearest Neighbors - Underperforms on time series data

Horizon Effects:

  • All models degrade with longer horizons (expected behavior)
  • Tree-based methods degrade slower than linear methods
  • At 30 days, XGBoost maintains ~70% R² while Linear drops to ~50%

Practical Recommendation: Use XGBoost or Random Forest for production deployment due to: - High accuracy across all horizons - Fast inference (milliseconds) - Built-in feature importance - Robust to outliers


45.12 Operational Alerts

45.12.1 Threshold-Based Alerts

NoteUnderstanding Threshold-Based Alerting Systems

What Is It? Threshold-based alerting is a real-time monitoring framework that triggers notifications when water levels cross predefined critical values. This operational approach emerged from industrial process control systems in the 1950s-60s and was adapted for environmental monitoring in the 1980s-90s as automated sensors became affordable.

Why Does It Matter? Early warning saves money and prevents disasters. A 7-14 day advance warning that water levels will drop below critical thresholds gives managers time to activate conservation measures, switch to backup sources, or adjust pumping schedules. Without forecasts, managers only react after problems occur—too late to prevent impacts.

How Does It Work?

  1. Define Thresholds: Set critical water level values based on well capacity, environmental regulations, and system requirements
  2. Forecast Future Levels: Use ML models to predict water levels 1-30 days ahead
  3. Compare to Thresholds: Check if forecasted levels will cross any threshold during forecast horizon
  4. Trigger Alerts: Send notifications when threshold crossings are predicted with >70% confidence
  5. Take Action: Activate response protocols before conditions deteriorate

What Will You See? Time series plots with horizontal threshold lines (red for critical, yellow for warning, green for normal). Forecasts appear as shaded confidence bands. When the forecast band touches a threshold line, an alert triggers. Alert dashboards show which wells need attention and when.

How to Interpret Alert Levels:

Alert Level Water Level Probability Lead Time Management Action
🔴 CRITICAL LOW <10m depth >80% 7-14 days Emergency response: Reduce pumping 50%, activate backup sources
🟡 WARNING LOW 10-15m depth >70% 7-14 days Proactive: Begin conservation messaging, defer non-essential use
🟢 NORMAL 15-25m depth N/A N/A Continue routine monitoring
🟡 WARNING HIGH >25m depth >70% 7-14 days Flooding risk: Inspect wellheads, prepare drainage

Key Parameters: - Confidence threshold: 70% = Alert only if model is >70% confident threshold will be crossed - Lead time: 7-14 days = Sweet spot where forecasts are accurate enough AND actionable - Action window: Time between alert and threshold crossing (must be long enough to intervene)

Common Pitfall: Setting thresholds too conservatively (alert fatigue) or too aggressively (missing real threats). Calibrate based on historical data and cost of false alarms vs missed events.

Condition Threshold Alert Level Action
Critical Low <10m below surface 🔴 RED Immediate intervention
Warning Low 10-15m below surface 🟡 YELLOW Prepare contingency
Normal 15-25m below surface 🟢 GREEN Monitor
Warning High >25m below surface 🟡 YELLOW Flooding risk

45.12.2 Lead Time Benefits

Note📘 Understanding Forecast Lead Time Value

What Is It? Lead time is the interval between when a forecast is issued and when the forecasted event occurs. Different operational decisions require different lead times—pumping adjustments need 1-3 days, drought preparation needs 7-14 days, infrastructure decisions need 30+ days.

Why Does It Matter? Lead time determines what actions are possible. A 1-day drought warning is too late for meaningful response. A 14-day warning allows water conservation campaigns, coordination with neighboring utilities, and activation of emergency supplies. The economic value of forecasts increases dramatically with lead time—IF accuracy is maintained.

How to Match Lead Time to Decisions:

Lead Time Achievable Accuracy Operational Value
1-3 days High (R² >0.90) Pumping schedule optimization, demand balancing
7-14 days Moderate (R² 0.75-0.85) Drought warnings, conservation campaigns, staff scheduling
14-30 days Lower (R² 0.60-0.75) Budget planning, infrastructure maintenance, permit compliance

Value Calculation Example: If a 7-day drought warning allows activating backup supplies before shortfall:

  • Cost of emergency water trucking: $50,000/event
  • Cost of backup well activation: $5,000/event
  • Forecast value: $45,000 saved per avoided emergency

Key Trade-off: Longer lead times enable more response options but have lower accuracy. The optimal lead time balances accuracy against action requirements.

With accurate 7-14 day forecasts:

  • Drought warning: 7-14 days advance notice
  • Pumping optimization: Adjust schedules 1-3 days ahead
  • Emergency response: Time to activate backup sources

45.13 Summary

TipKey Takeaways

ML models can forecast water levels from 1-30 days ahead with useful accuracy

Tree-based methods (Random Forest, XGBoost, Gradient Boosting) outperform linear models at all horizons

Forecast accuracy degrades with horizon length - R² drops from ~0.95 (1-day) to ~0.60 (30-day)

Weather features (precipitation, temperature) improve forecasts, especially at longer horizons

Ensemble approaches combining multiple models reduce prediction uncertainty

Key Insight: Operational water level forecasting is feasible with ML models, providing 7-14 days advance warning for drought conditions. Tree-based methods maintain skill at longer horizons where linear models fail, making them suitable for operational deployment in water management.


45.14 Limitations

WarningModel Limitations
  1. Temporal Autocorrelation: Water levels are highly autocorrelated, which can inflate apparent accuracy. True forecast skill requires proper time-series cross-validation.

  2. Weather Forecast Uncertainty: For forecasts >7 days, weather forecasts add uncertainty not captured in the model.

  3. Extreme Events: Models trained on historical data may not predict unprecedented droughts or floods.

  4. Local Changes: New pumping wells or land use changes can invalidate models.

  5. Data Gaps: Missing weather or water level data reduces forecast accuracy.


45.15 Reflection Questions

  1. How would you explain the practical difference between a high R² and a low RMSE when judging forecast quality for operations staff?
  2. If forecasts begin to systematically under-predict low water levels, which parts of the data pipeline or model configuration would you investigate first?
  3. How might forecast requirements change between short-term pumping optimization (1–3 days) and longer-term planning (14–30 days)?
  4. What additional features (beyond those already used here) could plausibly improve groundwater level forecasts in your own region?
  5. How could you combine these forecasts with anomaly detection to build a more robust early-warning system?