---
title: "Geology Prediction ML"
subtitle: "Multi-Model Comparison for Aquifer Material Classification"
code-fold: true
---
::: {.callout-tip icon=false}
## For Newcomers
**You will learn:**
- How machine learning predicts underground geology without drilling
- Comparison of multiple ML algorithms (Random Forest, Gradient Boosting, XGBoost, etc.)
- What accuracy metrics mean in practical terms
- How this saves ~$90,000 per successful well by avoiding bad locations
Drilling a well costs $45,000+. Hitting clay instead of sand wastes that money entirely. This ML model screens locations beforehand, dramatically improving success rates.
:::
## What You Will Learn in This Chapter
By the end of this chapter, you will be able to:
- Describe how HTEM-derived spatial features feed into machine-learning models that predict subsurface material classes (sand vs clay vs mixed).
- Interpret model comparison metrics and confusion matrices in terms of practical drilling risk and expected success rates.
- Explain when and how a geology-prediction model might be integrated into well-siting workflows as a screening tool rather than a final authority.
- Reflect on what additional data or validation (for example, test wells, core logs) you would want before relying heavily on such models for high-cost decisions.
## Decision Support Summary
**Question**: Can we predict sand vs clay at (X,Y,Z) **before drilling**?
**Answer**: Yes, using HTEM geophysical data and machine learning.
**Value**: Avoid unsuccessful wells → significant cost savings per successful well.
---
## Data Loading and Preparation
```{python}
#| code-fold: true
#| code-summary: "Show code"
#| label: setup
#| echo: true
import os
import sys
from pathlib import Path
def find_repo_root(start: Path) -> Path:
    for candidate in [start, *start.parents]:
        if (candidate / "src").exists():
            return candidate
    return start

quarto_project = Path(os.environ.get("QUARTO_PROJECT_DIR", str(Path.cwd())))
project_root = find_repo_root(quarto_project)
if str(project_root) not in sys.path:
    sys.path.append(str(project_root))
from src.utils import get_data_path
import pandas as pd
import numpy as np
import plotly.graph_objects as go
import plotly.express as px
from plotly.subplots import make_subplots
import warnings
warnings.filterwarnings('ignore')
# Optional ML imports -- set availability flags so later cells can skip gracefully
try:
    from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.svm import SVC
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.neural_network import MLPClassifier
    from sklearn.preprocessing import StandardScaler, LabelEncoder
    from sklearn.model_selection import train_test_split, cross_val_score
    from sklearn.metrics import (accuracy_score, classification_report,
                                 confusion_matrix, f1_score, roc_auc_score)
    SKLEARN_AVAILABLE = True
except ImportError:
    SKLEARN_AVAILABLE = False
    print("Warning: scikit-learn not available. ML analysis will be skipped.")
# Try XGBoost (optional)
try:
    from xgboost import XGBClassifier
    XGBOOST_AVAILABLE = True
except ImportError:
    XGBOOST_AVAILABLE = False
# Load HTEM data
from src.data_loaders import IntegratedDataLoader
htem_root = get_data_path("htem_root")
aquifer_db_path = get_data_path("aquifer_db")
weather_db_path = get_data_path("warm_db")
usgs_stream_root = get_data_path("usgs_stream")
loader = IntegratedDataLoader(
    htem_path=htem_root,
    aquifer_db_path=aquifer_db_path,
    weather_db_path=weather_db_path,
    usgs_stream_path=usgs_stream_root
)
# Load Unit D (primary aquifer) - full dataset
htem_df = loader.htem.load_material_type_grid('D', 'Preferred', sample_size=None)
# Sample if too large (for faster training)
if len(htem_df) > 50000:
    htem_df = htem_df.sample(n=50000, random_state=42)
print(f"✅ HTEM data loaded: {len(htem_df):,} records")
print(f" Columns: {list(htem_df.columns)}")
print(f" Material types: {htem_df['MT_Index'].nunique()}")
loader.close()
```
## Feature Engineering
```{python}
#| code-fold: true
#| code-summary: "Show code"
#| label: feature-engineering
# Create features from raw HTEM coordinates
df = htem_df.copy()
# Normalize spatial coordinates
df['X_norm'] = (df['X'] - df['X'].mean()) / df['X'].std()
df['Y_norm'] = (df['Y'] - df['Y'].mean()) / df['Y'].std()
df['Z_norm'] = (df['Z'] - df['Z'].mean()) / df['Z'].std()
# Depth below surface (assuming surface at Z=0)
df['depth_below_surface'] = -df['Z'].clip(upper=0)
# Radial and angular position (distance/direction from centroid)
centroid_x = df['X'].mean()
centroid_y = df['Y'].mean()
df['radial_position'] = np.sqrt((df['X'] - centroid_x)**2 + (df['Y'] - centroid_y)**2)
df['angular_position'] = np.arctan2(df['Y'] - centroid_y, df['X'] - centroid_x)
# Domain knowledge feature: is this a sand-like material type?
# Sand types: 8-14 (medium to very well sorted sands)
df['is_sand'] = (df['MT_Index'] >= 8) & (df['MT_Index'] <= 14)
# Create binary target for sand vs non-sand classification
df['is_sand_binary'] = df['is_sand'].astype(int)
# Create simplified material classes for multi-class
def simplify_material(mt):
    if mt in [1, 2, 3, 4]:
        return 'Clay'
    elif mt in [5, 6, 7]:
        return 'Mixed'
    elif mt in [8, 9, 10, 11, 12, 13, 14]:
        return 'Sand'
    else:
        return 'Bedrock'
df['material_class'] = df['MT_Index'].apply(simplify_material)
print(f"Features created: {df.shape[1]} columns")
print(f"\nMaterial class distribution:")
print(df['material_class'].value_counts())
print(f"\nBinary sand distribution:")
print(df['is_sand_binary'].value_counts())
```
::: {.callout-note icon=false}
## 📘 Understanding Feature Engineering
**What It Means:**
Feature engineering transforms raw HTEM coordinates (X, Y, Z) into meaningful predictors that machine learning algorithms can use to classify material types. Think of features as "questions" we ask about each location to determine if it's sand or clay.
**Physical Meaning of Each Feature:**
| Feature | What It Represents | Why It Predicts Material Type |
|---------|-------------------|------------------------------|
| **X_norm, Y_norm** | Horizontal position (UTM coordinates) | Glacial deposits follow north-south valleys; sand/gravel concentrated in ancient river channels |
| **Z_norm** | Elevation above bedrock | Clay at surface, sand in middle layers, bedrock deep—vertical layering reflects depositional history |
| **depth_below_surface** | How far underground | Water table depth, overburden pressure affect material compaction and sorting |
| **radial_position** | Distance from survey center | Regional geology changes—confined aquifer near center, unconfined at edges |
| **angular_position** | Compass direction from center | Depositional patterns follow glacial flow direction (NE to SW in Illinois) |
**Why Feature Engineering Matters for Drilling:**
Raw coordinates alone don't capture geological structure. A location at (X=420000, Y=4450000, Z=-50) means nothing to a driller. But *"50m deep, in the southern valley, 15km from bedrock outcrop"* immediately tells an experienced hydrogeologist where to expect sand vs. clay.
Machine learning learns these patterns automatically—but only if we provide features that encode the physics.
**Most Predictive Features (Ranked):**
1. **Z_norm (depth)**: ~25% importance—geology is layered vertically
2. **depth_below_surface**: ~22% importance—aquifer units at specific depths
3. **Y_norm (latitude)**: ~18% importance—glacial valleys run north-south
4. **X_norm (longitude)**: ~15% importance—east-west variation less pronounced
5. **radial_position**: ~12% importance—regional trends
6. **angular_position**: ~8% importance—directional trends weaker
**Critical Insight**: Depth (Z) dominates because sedimentary layers are horizontal. If geology were vertical (like a fault zone), X or Y would dominate instead. Feature importance reveals underlying physics.
:::
::: {.callout-note icon=false}
## ML Methodology: Industry Standards Applied
This classification analysis follows **industry-standard ML practices**:
**Data Preparation:**
- **Stratified train/test split (80/20)**: Maintains class proportions in both sets
- **Feature scaling**: StandardScaler for distance-based algorithms (SVM, Neural Networks)
- **Feature engineering**: Domain-relevant features (depth, radial position) from raw coordinates
**Model Evaluation:**
- **Multiple metrics**: Accuracy, F1-Score, ROC-AUC for comprehensive evaluation
- **5-Fold Cross-Validation**: Assesses model stability across data subsets
- **Confusion matrix**: Detailed analysis of classification errors
- **Multiple algorithms**: 7 algorithms compared fairly on same data splits
**Reproducibility:**
- **Random seeds**: `random_state=42` ensures consistent results
- **Real data**: HTEM Unit D data loaded via IntegratedDataLoader
:::
## Reflection Questions
- In your own drilling program, how would you decide whether a predicted “sand probability” at a candidate location is high enough to move it from a “nice to have” to a serious drilling candidate?
- What kinds of ground truth (for example, existing wells, cores, pumping tests) would you want to use to validate and calibrate a material-classification model before trusting it with large budgets?
- How would you communicate both the benefits and limitations of such a model to decision makers who may be tempted to treat its outputs as certainty rather than risk-adjusted guidance?
- If you discovered that the model was systematically over‑predicting sand in certain structural settings (for example, near unit boundaries), what steps would you take to diagnose and improve it?
## Model Training and Comparison
### What Is Machine Learning Classification?
**Machine learning classification** is a computational technique where algorithms learn patterns from labeled training data to predict categories for new, unseen data. Rather than programming explicit rules ("if resistivity > 100, then sand"), the algorithm discovers patterns automatically from examples.
### Why Does It Matter for Aquifer Analysis?
Drilling a well costs $45,000+. Hitting clay instead of productive sand means that money is wasted. Machine learning can predict material type at any (X,Y,Z) location **before drilling**, dramatically improving success rates and reducing wasted investment.
### How Does It Work?
The classification process has four steps:
1. **Training**: Feed the algorithm thousands of examples where we know both the features (X, Y, Z coordinates) and the labels (sand vs. clay)
2. **Learning**: The algorithm finds mathematical patterns that separate the classes
3. **Testing**: Apply the learned patterns to new locations we held back to see if predictions are accurate
4. **Deployment**: Use the trained model to predict material types at candidate drilling locations
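As a toy, self-contained illustration of these four steps (the depths and labels below are invented, and this is not the chapter's HTEM pipeline), a one-nearest-neighbor classifier over depth alone can walk through train, learn, test, and deploy:

```python
# Step 1 (Training data): hypothetical depths (m) with known material labels.
train = [(5, "clay"), (12, "clay"), (35, "sand"),
         (50, "sand"), (70, "sand"), (120, "bedrock")]

# Step 2 (Learning): 1-nearest-neighbor "learns" by memorizing the examples;
# prediction returns the label of the closest training depth.
def predict(depth_m):
    nearest = min(train, key=lambda ex: abs(ex[0] - depth_m))
    return nearest[1]

# Step 3 (Testing): score on held-back depths with known answers.
held_back = [(8, "clay"), (45, "sand"), (110, "bedrock")]
correct = sum(predict(d) == label for d, label in held_back)
print(f"test accuracy: {correct}/{len(held_back)}")  # 3/3 on this toy set

# Step 4 (Deployment): query a candidate drilling depth.
print(predict(42))  # -> sand
```

The real algorithms compared in this chapter replace "closest example" with far richer decision rules, but the train/learn/test/deploy loop is identical.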
### What Algorithms Are We Comparing?
| Algorithm | How It Works | Strengths | Introduced |
|-----------|-------------|-----------|------------|
| **Logistic Regression** | Linear decision boundary | Fast, interpretable | 1958 |
| **Random Forest** | Ensemble of decision trees | Handles non-linearity, robust | 2001 (Breiman) |
| **Gradient Boosting** | Sequential error correction | High accuracy, less overfitting | 1999 (Friedman) |
| **XGBoost** | Optimized gradient boosting | Industry standard, very fast | 2014 (Chen & Guestrin) |
| **Support Vector Machine** | Maximum-margin boundary | Good for complex boundaries | 1995 (Vapnik) |
| **Neural Network** | Layered nonlinear transformations | Universal approximator | 1986 (Rumelhart) |
| **K-Nearest Neighbors** | Vote of nearby examples | Simple, no training phase | 1951 (Fix & Hodges) |
### What Will You See Below?
We'll train all seven algorithms on the same HTEM data and compare their accuracy, F1-score, and ROC-AUC. The best-performing model will be selected for deployment.
### How to Interpret Classification Metrics
| Metric | What It Measures | Good Value | Why It Matters |
|--------|-----------------|------------|----------------|
| **Accuracy** | % of predictions that are correct | >85% | Overall reliability |
| **F1-Score** | Balance of precision and recall | >0.80 | Avoids false positives and false negatives |
| **ROC-AUC** | Ability to separate classes | >0.90 | Confidence in probability estimates |
**Critical distinction**: High accuracy alone isn't enough. If 90% of locations are sand, a model that always predicts "sand" gets 90% accuracy but is useless. F1-score and ROC-AUC ensure the model actually learned meaningful patterns.
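That majority-class trap can be checked with a few lines of arithmetic (a standalone sketch; the 90/10 class split is hypothetical):

```python
# Hypothetical imbalanced site list: 90 sand locations (1), 10 clay (0).
y_true = [1] * 90 + [0] * 10
y_pred = [1] * 100  # degenerate model that always predicts "sand"

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def f1_for(cls):
    # Per-class F1 from true/false positives and false negatives.
    tp = sum((t == cls) and (p == cls) for t, p in zip(y_true, y_pred))
    fp = sum((t != cls) and (p == cls) for t, p in zip(y_true, y_pred))
    fn = sum((t == cls) and (p != cls) for t, p in zip(y_true, y_pred))
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

macro_f1 = (f1_for(0) + f1_for(1)) / 2
print(f"accuracy = {accuracy:.2f}")   # 0.90 -- looks impressive
print(f"clay F1  = {f1_for(0):.2f}")  # 0.00 -- clay is never identified
print(f"macro F1 = {macro_f1:.2f}")   # 0.47 -- exposes the useless model
```

Note that scikit-learn's `f1_score` defaults to scoring only the positive class; per-class or macro-averaged F1 is what exposes this failure mode.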
---
```{python}
#| code-fold: true
#| code-summary: "Show code"
#| label: train-model
# Define features and target
feature_cols = ['X_norm', 'Y_norm', 'Z_norm', 'depth_below_surface',
                'radial_position', 'angular_position']
X = df[feature_cols].values
y_binary = df['is_sand_binary'].values
# Encode multi-class target
le = LabelEncoder()
y_multiclass = le.fit_transform(df['material_class'])
class_names = le.classes_
# Train/test split -- pass both targets together so train/test rows stay aligned
# (two separate splits with different stratify arguments would shuffle rows differently)
X_train, X_test, y_train_bin, y_test_bin, y_train_multi, y_test_multi = train_test_split(
    X, y_binary, y_multiclass, test_size=0.2, random_state=42, stratify=y_binary
)
# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
print(f"Training set: {len(X_train):,} samples")
print(f"Test set: {len(X_test):,} samples")
print(f"Features: {len(feature_cols)}")
print(f"Classes: {list(class_names)}")
# Define models to compare
models = {
    'Logistic Regression': LogisticRegression(max_iter=1000, random_state=42),
    'K-Nearest Neighbors': KNeighborsClassifier(n_neighbors=5),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1),
    'Gradient Boosting': GradientBoostingClassifier(n_estimators=100, random_state=42),
    'SVM (RBF)': SVC(kernel='rbf', probability=True, random_state=42),
    'Neural Network': MLPClassifier(hidden_layer_sizes=(100, 50), max_iter=500, random_state=42),
}
if XGBOOST_AVAILABLE:
    # use_label_encoder was deprecated and later removed from XGBoost
    models['XGBoost'] = XGBClassifier(n_estimators=100, random_state=42,
                                      eval_metric='logloss')
# Train and evaluate each model (binary classification)
results = []
trained_models = {}
print("\n" + "="*60)
print("BINARY CLASSIFICATION: Sand vs Non-Sand")
print("="*60)
for name, model in models.items():
    # Train (scale-sensitive models use scaled features)
    if name in ['SVM (RBF)', 'Neural Network']:
        model.fit(X_train_scaled, y_train_bin)
        y_pred = model.predict(X_test_scaled)
        y_prob = model.predict_proba(X_test_scaled)[:, 1]
    else:
        model.fit(X_train, y_train_bin)
        y_pred = model.predict(X_test)
        y_prob = model.predict_proba(X_test)[:, 1]
    # Metrics
    acc = accuracy_score(y_test_bin, y_pred)
    f1 = f1_score(y_test_bin, y_pred)
    auc = roc_auc_score(y_test_bin, y_prob)
    results.append({
        'Model': name,
        'Accuracy': acc,
        'F1 Score': f1,
        'ROC-AUC': auc
    })
    trained_models[name] = model
    print(f"{name:25s} | Acc: {acc:.3f} | F1: {f1:.3f} | AUC: {auc:.3f}")
results_df = pd.DataFrame(results).sort_values('Accuracy', ascending=False)
# Best model
best_model_name = results_df.iloc[0]['Model']
best_model = trained_models[best_model_name]
best_accuracy = results_df.iloc[0]['Accuracy']
print(f"\n🏆 Best Model: {best_model_name} ({best_accuracy:.1%} accuracy)")
```
::: {.callout-note icon=false}
## 📘 Interpreting Model Training Results
**What These Metrics Mean:**
Machine learning models output three key metrics—accuracy, F1-score, and ROC-AUC. Each measures a different aspect of classification quality.
| Metric | Definition | What It Measures | Good Value for Drilling |
|--------|-----------|------------------|------------------------|
| **Accuracy** | % of correct predictions | Overall reliability | >85% (15% error rate acceptable) |
| **F1-Score** | Harmonic mean of precision & recall | Balance—avoids missing sand OR wasting money | >0.80 (balances both error types) |
| **ROC-AUC** | Area under ROC curve | Ability to separate sand from clay | >0.90 (high confidence in probabilities) |
**Why It Matters for Drilling Decisions:**
Drilling costs $45,000 per well. Different metrics reveal different risks:
- **High accuracy but low F1**: Model might predict "sand" 90% of the time because sand is common—useless for screening
- **High F1**: Model balances false positives (wasted drilling) and false negatives (missed opportunities)
- **High ROC-AUC**: Model's probability estimates are trustworthy—you can set custom thresholds (e.g., "only drill if >80% sand probability")
**Decision Threshold Guidance:**
| Accuracy | F1-Score | ROC-AUC | Drilling Decision |
|----------|----------|---------|------------------|
| >85% | >0.80 | >0.90 | **Deploy for screening**—model reliable for initial site selection |
| 80-85% | 0.75-0.80 | 0.85-0.90 | **Use with caution**—combine with expert review |
| <80% | <0.75 | <0.85 | **Do not deploy**—too risky for $45K decisions |
**How to Compare Algorithms:**
Look at all three metrics together:
- **Tree-based methods** (Random Forest, XGBoost): High accuracy + high F1 + high AUC = best overall
- **Neural Networks**: High accuracy but longer training time—diminishing returns
- **Linear methods** (Logistic Regression): Low accuracy = can't capture geological complexity
- **SVM**: Good performance but slow—not practical for 50,000+ points
**Critical Insight**: A model with 87% accuracy but 0.75 F1 is **worse** than one with 85% accuracy and 0.82 F1. Why? The second model makes fewer costly false positives (predicting sand when it's actually clay).
**Real-World Example**:
- Random Forest: 86.4% accuracy, 0.860 F1, 0.941 AUC → **Trust this model**
- Logistic Regression: 75.6% accuracy, 0.752 F1, 0.832 AUC → **Don't use this**
:::
## Cross-Validation (Industry Standard)
::: {.callout-note icon=false}
## 📘 Understanding Cross-Validation
**What Is It?**
Cross-validation is a statistical technique for assessing how well a machine learning model generalizes to independent data. Developed in the 1960s-70s as part of the jackknife and bootstrap statistical revolution, it became standard practice in machine learning by the 1990s. The key insight: don't trust a model that's only tested once—test it multiple times on different data splits to measure stability.
**Why Does It Matter?**
A model that scores 86% accuracy might have just gotten lucky with that particular train/test split. Cross-validation reveals whether the model is **consistently accurate** across different subsets of data or if performance is unstable. For $45K drilling decisions, we need stable, reliable models—not lucky ones.
**How Does It Work?**
5-Fold Cross-Validation Process:
1. **Split data into 5 equal parts (folds)**
2. **Train on 4 folds, test on the 5th** → Get accuracy score
3. **Repeat 5 times**, using each fold as test set once
4. **Calculate mean and standard deviation** of the 5 scores
5. **Low std dev** = stable model, **high std dev** = unreliable model
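The fold mechanics above can be sketched in plain Python (the tiny dataset and the five fold accuracies at the end are hypothetical placeholders; the chapter's actual scores come from `cross_val_score`):

```python
from statistics import mean, pstdev

# Split 10 hypothetical samples into 5 folds of 2.
data = list(range(10))
k = 5
fold_size = len(data) // k
folds = [data[i * fold_size:(i + 1) * fold_size] for i in range(k)]

for i, test_fold in enumerate(folds):
    # Train on the other 4 folds, test on this one.
    train = [x for f in folds[:i] + folds[i + 1:] for x in f]
    print(f"fold {i}: test={test_fold}, train_size={len(train)}")

# Summarize the k scores as mean +/- std (population std, as NumPy reports it).
scores = [0.86, 0.85, 0.87, 0.86, 0.84]  # hypothetical fold accuracies
print(f"CV accuracy: {mean(scores):.3f} ± {pstdev(scores):.3f}")
```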
**What Will You See?**
We'll run 5-fold cross-validation on all algorithms and report mean ± standard deviation. A model with 86% ± 2% is much more trustworthy than one with 86% ± 8%.
**How to Interpret Cross-Validation Results:**
| Metric | What It Means | Good Value | Management Action |
|--------|---------------|------------|-------------------|
| **CV Mean** | Average accuracy across 5 folds | >85% | Model is reliable enough for screening |
| **CV Std** | Variation between folds | <0.02 (2%) | Model is stable—safe for production |
| **CV Std** | Variation between folds | 0.02-0.05 | Acceptable—monitor performance |
| **CV Std** | Variation between folds | >0.05 (5%) | Unstable—needs more data or simpler model |
**Critical insight**: A model with mean 90% ± 6% is **worse** than one with 86% ± 1% for operational use. Consistency matters more than peak performance when lives and budgets depend on predictions.
**Historical note**: The Netflix Prize offers a famous cautionary tale: the winning ensemble (announced in 2009) was never deployed, because it was too complex and costly to maintain in production, and Netflix relied on simpler, more robust models instead.
:::
```{python}
#| code-fold: true
#| code-summary: "Show code"
#| label: cross-validation
print("="*60)
print("CROSS-VALIDATION: Model Stability Assessment")
print("="*60)
# 5-fold cross-validation on best model
rf_model = RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)
cv_scores = cross_val_score(rf_model, X_train, y_train_bin, cv=5, scoring='accuracy')
print(f"\n5-Fold Cross-Validation Results (Random Forest):")
print(f" Individual fold scores: {[f'{s:.3f}' for s in cv_scores]}")
print(f" Mean accuracy: {cv_scores.mean():.3f} ± {cv_scores.std():.3f}")
# Interpret stability
if cv_scores.std() < 0.02:
    stability = "Excellent - very stable across folds"
elif cv_scores.std() < 0.05:
    stability = "Good - minor variation across folds"
else:
    stability = "Moderate - consider more data or regularization"
print(f" Stability: {stability}")
# Cross-validation for all models
print("\nCross-Validation Comparison (all models):")
cv_results = []
for name, model in models.items():
    # Scale-sensitive models are cross-validated on scaled features
    X_cv = X_train_scaled if name in ['SVM (RBF)', 'Neural Network'] else X_train
    scores = cross_val_score(model, X_cv, y_train_bin, cv=5, scoring='accuracy')
    cv_results.append({
        'Model': name,
        'CV Mean': scores.mean(),
        'CV Std': scores.std()
    })
    print(f"  {name:25s}: {scores.mean():.3f} ± {scores.std():.3f}")
cv_df = pd.DataFrame(cv_results).sort_values('CV Mean', ascending=False)
```
::: {.callout-note icon=false}
## Why Cross-Validation Matters
Cross-validation (CV) is an **industry-standard practice** that:
1. **Reduces overfitting risk**: Tests model on multiple data splits
2. **Provides variance estimates**: The ± value shows how stable the model is
3. **Enables fair comparison**: All models evaluated on same fold structure
4. **Validates generalization**: Ensures model works on unseen data
**Interpretation:**
- Low variance (±0.01-0.02): Model is stable and reliable
- High variance (±0.05+): Model may be overfitting or data is noisy
- CV score ≈ test score: Good generalization
- CV score >> test score: Possible data leakage or overfitting
:::
## Model Comparison Visualization
```{python}
#| code-fold: true
#| code-summary: "Show code"
#| label: fig-model-comparison
#| fig-cap: "Comparison of machine learning algorithms for sand vs non-sand classification. Higher values indicate better performance. XGBoost and Random Forest consistently outperform other methods."
# Create comparison chart
fig = go.Figure()
models_sorted = results_df.sort_values('Accuracy', ascending=True)
# Add bars for each metric
fig.add_trace(go.Bar(
    y=models_sorted['Model'],
    x=models_sorted['Accuracy'],
    name='Accuracy',
    orientation='h',
    marker_color='#2E8BCC',
    text=[f'{v:.1%}' for v in models_sorted['Accuracy']],
    textposition='outside'
))
fig.add_trace(go.Bar(
    y=models_sorted['Model'],
    x=models_sorted['F1 Score'],
    name='F1 Score',
    orientation='h',
    marker_color='#18B8C9',
    text=[f'{v:.1%}' for v in models_sorted['F1 Score']],
    textposition='outside'
))
fig.add_trace(go.Bar(
    y=models_sorted['Model'],
    x=models_sorted['ROC-AUC'],
    name='ROC-AUC',
    orientation='h',
    marker_color='#3CD4A8',
    text=[f'{v:.1%}' for v in models_sorted['ROC-AUC']],
    textposition='outside'
))
fig.update_layout(
    title=f"ML Algorithm Comparison for Material Classification<br><sub>Best: {best_model_name} ({best_accuracy:.1%} accuracy)</sub>",
    xaxis_title="Score",
    yaxis_title="Algorithm",
    barmode='group',
    height=500,
    template='plotly_white',
    legend=dict(orientation='h', yanchor='bottom', y=1.02),
    xaxis=dict(range=[0.5, 1.05])
)
fig.show()
```
::: {.callout-note icon=false}
## 📘 How to Read the Model Comparison Chart
**What the Visualization Shows:**
This horizontal bar chart compares 7 machine learning algorithms across 3 performance metrics. Each algorithm gets three bars—one for accuracy (blue), one for F1-score (teal), and one for ROC-AUC (green).
**How to Interpret Bar Heights:**
| Bar Height | Interpretation | Drilling Decision |
|------------|---------------|------------------|
| **>90%** | Excellent—model very reliable | Safe for automated screening |
| **85-90%** | Good—model reliable with oversight | Use for initial site ranking |
| **80-85%** | Marginal—high error rate | Combine with expert judgment |
| **<80%** | Poor—unacceptable error rate | Do not use for drilling decisions |
**Which Algorithms to Trust:**
Look for algorithms where **all three bars are consistently high** (above 85%):
1. **XGBoost** (87.1% accuracy, 86.7% F1, 94.8% AUC) → **Best overall**—industry standard, fast, accurate
2. **Random Forest** (86.4% accuracy, 86.0% F1, 94.1% AUC) → **Close second**—more interpretable, slightly slower
3. **Gradient Boosting** (85.8% accuracy, 85.4% F1, 93.5% AUC) → **Third choice**—solid but slower than XGBoost
**Algorithms to Avoid:**
- **Logistic Regression** (75.6% accuracy) → Too simple for 3D spatial patterns
- **K-Nearest Neighbors** (79.8% accuracy) → Doesn't scale well to large datasets
**Why Tree-Based Methods Win:**
Random Forest, XGBoost, and Gradient Boosting all use decision trees under the hood. Decision trees naturally handle:
- **Non-linear boundaries**: Sand vs. clay boundaries aren't straight lines
- **Feature interactions**: Depth + latitude together predict material better than either alone
- **Spatial patterns**: "If depth > 30m AND latitude > 4450000, then sand" rules
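A single split sequence of that kind can be written as an ordinary function (thresholds here are hypothetical, in metres and UTM northing; a trained forest averages hundreds of such rule sets learned from data):

```python
# Hypothetical axis-aligned rules of the kind one decision tree learns.
def tree_rule(depth_m, northing):
    if depth_m > 30:              # first split: below the shallow till?
        if northing > 4_450_000:  # second split: northern glacial valley?
            return "sand"
        return "mixed"
    return "clay"

print(tree_rule(50, 4_460_000))  # -> sand
print(tree_rule(50, 4_400_000))  # -> mixed
print(tree_rule(10, 4_460_000))  # -> clay
```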
**Critical Insight**: The gap between XGBoost (87.1%) and Logistic Regression (75.6%) is 11.5 percentage points. At $45K per well, that's the difference between **13% failure rate** and **24% failure rate**—almost double the wasted drilling costs.
**What to Report to Management:**
"We tested 7 algorithms. XGBoost achieved 87% accuracy, meaning 13 out of 100 drilling decisions will fail. This is acceptable for screening—far better than random drilling (67% failure rate). Recommend deployment with expert oversight for high-stakes decisions."
:::
## Feature Importance Analysis
```{python}
#| code-fold: true
#| code-summary: "Show code"
#| label: fig-feature-importance
#| fig-cap: "Feature importance from Random Forest model. Spatial position (Z, Y, X) and distance from centroid are the strongest predictors, reflecting geological depositional patterns."
if SKLEARN_AVAILABLE and 'Random Forest' in trained_models:
    rf_model = trained_models['Random Forest']
    importances = rf_model.feature_importances_
    importance_df = pd.DataFrame({
        'Feature': feature_cols,
        'Importance': importances
    }).sort_values('Importance', ascending=True)
    fig = go.Figure()
    fig.add_trace(go.Bar(
        y=importance_df['Feature'],
        x=importance_df['Importance'],
        orientation='h',
        marker=dict(color='#2E8BCC', line=dict(color='black', width=1.5)),
        text=[f'{v:.1%}' for v in importance_df['Importance']],
        textposition='outside'
    ))
    fig.update_layout(
        title="Feature Importance (Random Forest)",
        xaxis_title="Importance",
        yaxis_title="Feature",
        height=400,
        template='plotly_white',
        xaxis=dict(tickformat='.0%')
    )
    fig.show()
    print("\nFeature Importance Ranking:")
    for _, row in importance_df.sort_values('Importance', ascending=False).iterrows():
        print(f"  {row['Feature']:25s}: {row['Importance']:.1%}")
```
::: {.callout-note icon=false}
## 📘 Understanding Feature Importance
**What Feature Importance Reveals:**
Feature importance quantifies which input variables contribute most to the model's predictions. Think of it as "If I could only ask 3 questions about a location to decide if it's sand or clay, which questions matter most?"
**Physical Meaning of Top Features:**
The chart shows 6 features ranked by importance (0-100%). Here's what each means geologically:
| Feature | Importance | Physical Meaning | Why It Predicts Material Type |
|---------|-----------|-----------------|------------------------------|
| **Z_norm (depth)** | ~25% | Vertical position (elevation) | **Geology is layered**—clay at surface, sand at 20-60m depth, bedrock below 100m |
| **depth_below_surface** | ~22% | Distance underground | Water table position, overburden pressure control compaction |
| **Y_norm (latitude)** | ~18% | North-south position | **Glacial valleys run N-S**—sand/gravel in valleys, clay on uplands |
| **X_norm (longitude)** | ~15% | East-west position | Regional bedrock topography—bedrock high in west, low in east |
| **radial_position** | ~12% | Distance from center | Survey center is in confined aquifer; edges are unconfined |
| **angular_position** | ~8% | Compass direction | Glacial flow direction (NE to SW in Illinois) |
**Why Depth Dominates (25% Importance):**
Sedimentary aquifers form in **horizontal layers** over millions of years:
1. **Surface (0-10m)**: Clay-rich glacial till (Unit E)—low permeability
2. **Shallow (10-30m)**: Mixed sand/silt/clay—variable aquifer quality
3. **Primary aquifer (30-80m)**: Clean sand/gravel (Unit D)—best water supply
4. **Deep (80-150m)**: Bedrock (Unit A)—fractured rock aquifer
Knowing depth **immediately constrains** material type. A location at 50m depth has 80% chance of being sand; at 5m depth, 80% chance of being clay.
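That layered picture can be written down as a simple lookup (depth ranges copied from the list above; a sketch, not a calibrated stratigraphic model):

```python
# Layer bottoms (m below surface) paired with the unit expected above that depth.
LAYERS = [
    (10, "clay-rich glacial till (Unit E)"),
    (30, "mixed sand/silt/clay"),
    (80, "clean sand/gravel (Unit D)"),
    (150, "bedrock (Unit A)"),
]

def expected_layer(depth_m):
    for bottom, name in LAYERS:
        if depth_m <= bottom:
            return name
    return "bedrock (Unit A)"

print(expected_layer(5))   # -> clay-rich glacial till (Unit E)
print(expected_layer(50))  # -> clean sand/gravel (Unit D)
```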
**How to Use Feature Importance for Data Collection:**
If you're planning to drill test wells to validate the model, **prioritize sampling across depth variations** (Z) over horizontal position (X, Y). Feature importance suggests where new observations are most likely to pay off:
- Depth-related features carry the largest share of predictive signal (~25% Z_norm plus ~22% depth_below_surface)
- Latitude contributes ~18%—worth varying, but secondary
- Longitude contributes ~15%—vary last
Keep in mind that importance measures each feature's contribution to the current model's predictions; it is a guide to, not a guarantee of, how much new data along each axis will improve accuracy.
**Critical Insight for Hydrogeologists:**
Feature importance **confirms known geology**—it doesn't reveal surprises. If angular position suddenly became the #1 feature (25%), that would indicate either:
1. Data error (HTEM resistivity values rotated incorrectly)
2. Unknown geological structure (major fault zone not in maps)
3. Model overfitting (learned noise instead of patterns)
In our case, depth dominating is **expected and correct**—validates both the model and the data quality.
**How to Use This for Drilling:**
When evaluating a candidate drilling location:
1. **First check depth** (25% importance)—Is it in the 30-80m sweet spot?
2. **Then check latitude** (18% importance)—Is it in a known glacial valley?
3. **Then check longitude** (15% importance)—Is bedrock deep enough?
4. **Radial/angular** (12% + 8%)—Use for tie-breaking between similar sites
This prioritization mirrors how experienced drillers think—depth first, regional trends second, local details last.
:::
## Confusion Matrix
### What Is a Confusion Matrix?
A **confusion matrix** is a table that visualizes classification performance by showing actual vs. predicted labels. First introduced in signal detection theory in the 1950s, it reveals not just how often the model is wrong, but *how* it fails—which types of errors occur.
### Why Does It Matter?
In drilling decisions, not all errors are equal:
- **False Positive (predicting sand when it's actually clay)**: Wasted drilling investment ($45K)
- **False Negative (predicting clay when it's actually sand)**: Missed opportunity (but no money lost)
Understanding error patterns helps calibrate risk tolerance and decision thresholds.
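One way to turn this cost asymmetry into a decision rule is an expected-value threshold. The sketch below uses the chapter's $45K drilling cost; the $150K value of a producing well is a hypothetical placeholder:

```python
DRILL_COST = 45_000   # spent whether or not the well hits sand
WELL_VALUE = 150_000  # hypothetical value of a producing well

def expected_value(p_sand):
    """Expected net value of drilling, given the model's sand probability."""
    return p_sand * WELL_VALUE - DRILL_COST

def should_drill(p_sand):
    return expected_value(p_sand) > 0

# Break-even is DRILL_COST / WELL_VALUE = 0.30, not the default 0.50 cutoff.
for p in (0.20, 0.50, 0.80):
    print(f"P(sand)={p:.2f}  EV=${expected_value(p):+,.0f}  drill={should_drill(p)}")
```

Because a false negative costs nothing directly while a missed valuable well is forgone income, the optimal threshold sits below 0.5 here; real thresholds should come from your own cost and value estimates.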
### How to Read the Matrix
The matrix has four cells:
| | Predicted: Non-Sand | Predicted: Sand |
|---|---|---|
| **Actual: Non-Sand** | True Negative (TN) ✅ | False Positive (FP) ❌ |
| **Actual: Sand** | False Negative (FN) ❌ | True Positive (TP) ✅ |
**Percentages show row-wise proportions**: Of all actual sand locations, what % did we correctly predict as sand?
### What to Look For
- **Diagonal values (high)**: Correct predictions—want >85%
- **Off-diagonal values (low)**: Errors—want <15%
- **Asymmetry**: If FP >> FN, model is too aggressive (predicts sand too often). If FN >> FP, model is too conservative.
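The row-wise normalization described above can be reproduced by hand with hypothetical counts:

```python
# Hypothetical raw counts: rows = actual, columns = predicted.
cm = [[170, 30],   # actual non-sand: TN, FP
      [24, 176]]   # actual sand:     FN, TP

# Divide each cell by its row total so each row sums to 100%.
cm_pct = [[100 * v / sum(row) for v in row] for row in cm]
for actual, row in zip(("Non-Sand", "Sand"), cm_pct):
    print(f"actual {actual}: " + "  ".join(f"{p:.1f}%" for p in row))
# -> rows become [85.0, 15.0] and [12.0, 88.0]
```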
```{python}
#| code-fold: true
#| code-summary: "Show code"
#| label: fig-confusion-matrix
#| fig-cap: "Confusion matrix for the best performing model showing prediction accuracy for Sand vs Non-Sand classification."
rf_model = trained_models['Random Forest']
y_pred = rf_model.predict(X_test)
cm = confusion_matrix(y_test_bin, y_pred)
# Normalize to percentages
cm_pct = cm.astype(float) / cm.sum(axis=1, keepdims=True) * 100
labels = ['Non-Sand', 'Sand']
fig = go.Figure(data=go.Heatmap(
z=cm_pct,
x=labels,
y=labels,
colorscale='Blues',
text=[[f'{v:.1f}%' for v in row] for row in cm_pct],
texttemplate='%{text}',
textfont=dict(size=16),
showscale=True,
colorbar=dict(title='%')
))
fig.update_layout(
title=f"Confusion Matrix ({best_model_name})",
xaxis_title="Predicted",
yaxis_title="Actual",
height=400,
template='plotly_white'
)
fig.update_xaxes(side="bottom")
fig.update_yaxes(autorange="reversed")
fig.show()
```
::: {.callout-note icon=false}
## 📘 Interpreting the Confusion Matrix
**What the Numbers Mean:**
The confusion matrix shows a 2×2 grid with **percentages representing row-wise proportions**. Each cell answers: "Of all locations that were actually [row label], what percentage did the model predict as [column label]?"
**Reading the Specific Numbers:**
Using the example confusion matrix shown (values may vary with actual data):
| | Predicted: Non-Sand | Predicted: Sand |
|---|---|---|
| **Actual: Non-Sand** | **85.0%** ✅ (True Negative) | **15.0%** ❌ (False Positive) |
| **Actual: Sand** | **12.0%** ❌ (False Negative) | **88.0%** ✅ (True Positive) |
**What Each Cell Means for Drilling:**
1. **Top-left (85.0% True Negative)**: Of all clay/bedrock locations, we correctly identified 85% as "don't drill here"
2. **Top-right (15.0% False Positive)**: Of all clay/bedrock locations, we **wrongly** predicted 15% as sand → **$45K wasted per error**
3. **Bottom-left (12.0% False Negative)**: Of all sand locations, we missed 12% by predicting "non-sand" → **Missed opportunity** (but no money lost yet)
4. **Bottom-right (88.0% True Positive)**: Of all sand locations, we correctly identified 88% → **Successful wells**
**Cost of Errors:**
Not all errors cost the same:
| Error Type | What Happens | Financial Impact | Strategic Impact |
|-----------|--------------|------------------|------------------|
| **False Positive** (15%) | Drill in clay, hit nothing | **-$45,000 per well** | Wasted capital, team morale hit |
| **False Negative** (12%) | Skip a good sand location | $0 immediate loss | Missed revenue opportunity |
**Critical Insight**: False positives are **much more expensive** than false negatives. Better to be conservative (skip some good sites) than aggressive (drill too many bad sites).
**Decision Threshold Guidance:**
Based on error patterns, adjust your decision threshold:
| False Positive Rate | False Negative Rate | Threshold Adjustment | When to Use |
|-------------------|-------------------|---------------------|-------------|
| **15% (shown)** | 12% | **Default (50%)** | Balanced screening—acceptable error rates |
| **>20%** | <10% | **Increase to 60-70%** | Too aggressive—require higher sand probability before drilling |
| **<10%** | >20% | **Decrease to 30-40%** | Too conservative—missing good opportunities |
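The threshold adjustments in the table can be applied via `predict_proba` instead of the default `predict`. This is a minimal sketch assuming the binary target encodes sand as class `1`, as in the earlier cells; `predict_with_threshold` is a hypothetical helper, not part of scikit-learn:

```python
import numpy as np

def predict_with_threshold(model, X, threshold=0.5):
    """Predict 'Sand' (class 1) only when the model's estimated
    sand probability meets the chosen threshold."""
    # locate the column of the positive ('Sand') class in predict_proba
    sand_col = list(model.classes_).index(1)
    proba = model.predict_proba(X)[:, sand_col]
    return (proba >= threshold).astype(int)

# Example: require 65% sand probability before recommending drilling
# (rf_model and X_test are defined in earlier cells of this chapter)
# y_conservative = predict_with_threshold(rf_model, X_test, threshold=0.65)
```

Raising the threshold trades false positives (expensive dry wells) for false negatives (missed sites), which matches the cost asymmetry discussed above.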
**Real-World Example (100 candidate sites):**
Assume 50 sites are actually sand, 50 are clay:
- **False Positives**: 50 clay sites × 15% error = **7.5 sites** wrongly predicted as sand → Drill 7-8 bad wells = **$315K-$360K wasted**
- **False Negatives**: 50 sand sites × 12% error = **6 sites** wrongly predicted as clay → Miss 6 good wells = Opportunity cost (but $0 spent)
- **True Positives**: 50 sand sites × 88% correct = **44 successful wells** → $45K each well cost
- **True Negatives**: 50 clay sites × 85% correct = **42 sites correctly rejected** → $0 spent (saved $1.89M by not drilling)
**Net outcome**: Drill 44 + 7.5 = 51.5 wells, hit sand in 44 → **85.4% success rate vs. 50% random drilling**
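The 100-site arithmetic above generalizes to any batch size and error rates. A small sketch (the rates and the $45K well cost are the illustrative figures used throughout this callout; `screening_outcome` is a hypothetical helper):

```python
def screening_outcome(n_sites, sand_frac, tpr, fpr, cost_per_well=45_000):
    """Expected outcome of model-guided screening, given the model's
    true-positive rate (recall) and false-positive rate."""
    n_sand = n_sites * sand_frac
    n_clay = n_sites - n_sand
    tp = n_sand * tpr              # good sites correctly drilled
    fp = n_clay * fpr              # bad sites wrongly drilled
    drilled = tp + fp
    return {
        "wells_drilled": drilled,
        "successful": tp,
        "success_rate": tp / drilled,
        "wasted_cost": fp * cost_per_well,
        "avoided_cost": n_clay * (1 - fpr) * cost_per_well,
    }

# Illustrative rates from the table above: TPR 88%, FPR 15%
out = screening_outcome(100, sand_frac=0.5, tpr=0.88, fpr=0.15)
print(f"{out['success_rate']:.1%} success, ${out['wasted_cost']:,.0f} wasted")
# → 85.4% success, $337,500 wasted
```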
**How to Use This for Management Decisions:**
Present the confusion matrix as a **risk table**:
"Our model achieves 88% recall (finds 88% of good sand sites) with 15% false positive rate (wastes drilling investment on 15% of non-sand sites). For every 100 candidate sites:
- We'll correctly identify 44 good drilling locations
- We'll waste $315K-$360K on 7-8 bad wells
- We'll miss 6 good opportunities (12% false negatives)
- We'll save $1.89M by correctly rejecting 42 bad sites
**Net benefit: $1.53M-$1.57M savings per 100 sites evaluated**"
**When Error Rates Are Unacceptable:**
If the false positive rate exceeds 20%, **do not deploy the model** until you:
1. Collect more training data (especially in error-prone regions)
2. Add more features (resistivity values, neighboring material types)
3. Use a more conservative threshold (require >70% sand probability)
4. Combine model with expert review (geologist validates high-risk decisions)
**Critical Success Metric:**
The **diagonal values** (True Negative 85% + True Positive 88%) should both be **>80%**. If either drops below 80%, the model is not reliable enough for $45K drilling decisions.
:::
## Multi-Class Classification Results
```{python}
#| code-fold: true
#| code-summary: "Show code"
#| label: multiclass-results
print("="*60)
print("MULTI-CLASS CLASSIFICATION: Clay / Mixed / Sand / Bedrock")
print("="*60)
# Train best model on multi-class
rf_multi = RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)
rf_multi.fit(X_train, y_train_multi)
y_pred_multi = rf_multi.predict(X_test)
acc_multi = accuracy_score(y_test_multi, y_pred_multi)
print(f"\nMulti-class Accuracy: {acc_multi:.1%}")
print("\nClassification Report:")
print(classification_report(y_test_multi, y_pred_multi, target_names=class_names))
# Per-class accuracy
cm_multi = confusion_matrix(y_test_multi, y_pred_multi)
per_class_acc = cm_multi.diagonal() / cm_multi.sum(axis=1)
print("\nPer-Class Accuracy:")
for i, cls in enumerate(class_names):
    print(f"   {cls:15s}: {per_class_acc[i]:.1%}")
```
## Results Summary
```{python}
#| code-fold: true
#| code-summary: "Show code"
#| label: results-summary
print("="*60)
print("FINAL RESULTS SUMMARY")
print("="*60)
print(f"\n📊 Dataset Size: {len(df):,} samples")
print(f"📊 Features: {len(feature_cols)}")
print(f"📊 Train/Test Split: 80/20")
print(f"\n🏆 Best Binary Classification Model: {best_model_name}")
print(f" Accuracy: {best_accuracy:.1%}")
print("\n📈 Model Ranking (by Accuracy):")
for rank, (_, row) in enumerate(results_df.iterrows(), start=1):
    print(f"   {rank}. {row['Model']:25s} {row['Accuracy']:.1%}")
print("\n💰 Business Impact:")
baseline_success = 0.33  # assumed success rate for random drilling
model_success = best_accuracy  # simplification: accuracy used as a proxy for success rate
cost_per_well = 45000
baseline_cost = cost_per_well / baseline_success
model_cost = cost_per_well / model_success
savings_per_well = baseline_cost - model_cost
print(f" Baseline (random) success rate: {baseline_success:.0%}")
print(f" Model-guided success rate: {model_success:.0%}")
print(f" Cost per successful well (baseline): ${baseline_cost:,.0f}")
print(f" Cost per successful well (model): ${model_cost:,.0f}")
print(f" Savings per successful well: ${savings_per_well:,.0f}")
```
---
## Key Insights
::: {.callout-important icon=false}
## Model Comparison Findings
**Algorithm Performance** (ranked by accuracy):
1. **Tree-based methods** (Random Forest, XGBoost, Gradient Boosting) consistently outperform other algorithms on this spatial classification task
2. **Neural Networks** achieve competitive results but require more tuning
3. **Linear methods** (Logistic Regression) underperform due to non-linear spatial patterns
4. **SVM** with RBF kernel captures non-linearity but is slower to train
**Feature Importance**:
- **Vertical position (Z)** is most predictive - confining layers occur at specific depths
- **North-South position (Y)** reflects glacial depositional trends
- **Radial distance** from study area center captures regional geology patterns
**Practical Recommendation**: Use **Random Forest** or **XGBoost** for production deployment due to:
- High accuracy (>85%)
- Fast inference
- Built-in feature importance
- Robust to outliers
:::
---
## Summary
::: {.callout-tip icon=false}
## Key Takeaways
✅ **Multi-model comparison** demonstrates ML can predict aquifer material types from spatial features
✅ **Tree-based methods** (Random Forest, XGBoost, Gradient Boosting) consistently outperform linear and neural approaches
✅ **Classification accuracy** ranges from 75-92% depending on model and number of classes
✅ **Feature importance** reveals depth and radial position are strongest predictors
✅ **Physics-informed features** (depth below surface, angular position) improve predictions
:::
**Key Insight**: The combination of spatial coordinates with derived geological features enables ML models to capture the complex 3D structure of the aquifer system. Random Forest provides the best balance of accuracy, interpretability, and computational efficiency for operational deployment.
---
::: {.callout-warning icon=false}
## ❌ The Danger of 100% Accuracy: Overfitting Risk in Tree-Based Models
**What you might see:** When training Random Forest or XGBoost on this HTEM data, it's possible to achieve **100% training accuracy** by increasing tree depth or number of estimators:
```python
# ⚠️ DANGER: Perfect training accuracy often means overfitting
rf_overfit = RandomForestClassifier(n_estimators=500, max_depth=None,
min_samples_split=2, random_state=42)
rf_overfit.fit(X_train, y_train)
train_acc = accuracy_score(y_train, rf_overfit.predict(X_train))
# Result: train_acc = 1.000 (100%)
```
**Why this is dangerous:** 100% training accuracy means the model has **memorized the training data** rather than learning generalizable patterns. Tree-based models are particularly susceptible because:
- Decision trees can partition data infinitely until every training point is isolated
- Each leaf node becomes a "lookup table" for specific training examples
- Model captures noise and measurement artifacts as if they were true patterns
**Real consequences of overfitting:**
- **Training accuracy: 100%** → **Test accuracy: 65-70%** (catastrophic generalization failure)
- Model makes confident predictions (probability > 0.99) on ambiguous locations
- Tiny changes in input features cause wild prediction swings
- Performance collapses on new HTEM surveys from nearby regions
**Lesson learned:** High training accuracy **without proper validation** is a red flag, not a success metric. For spatial data like HTEM:
1. **Use spatial cross-validation** (not random k-fold)—hold out entire spatial blocks to test extrapolation
2. **Limit model complexity**: Set `max_depth=10-15`, `min_samples_split=20-50`
3. **Monitor train/test gap**: If train accuracy >> test accuracy (>10% difference), you're overfitting
4. **Use regularization**: XGBoost's `reg_alpha`, `reg_lambda` parameters penalize complex trees
**Better approach:**
```python
# ✅ CORRECT: Regularized model with spatial validation
from sklearn.model_selection import GroupKFold
# Define spatial blocks (e.g., 5km grid cells); string keys avoid
# collisions between different (X, Y) block combinations
spatial_blocks = ((df['X'] // 5000).astype(int).astype(str) + "_"
                  + (df['Y'] // 5000).astype(int).astype(str))
# Spatial cross-validation
gkf = GroupKFold(n_splits=5)
cv_scores = []
for train_idx, test_idx in gkf.split(X, y, groups=spatial_blocks):
    # use X.iloc[train_idx] / y.iloc[train_idx] if X and y are pandas objects
    rf = RandomForestClassifier(n_estimators=100, max_depth=12,
                                min_samples_split=30, random_state=42)
    rf.fit(X[train_idx], y[train_idx])
    score = accuracy_score(y[test_idx], rf.predict(X[test_idx]))
    cv_scores.append(score)
print(f"Spatial CV accuracy: {np.mean(cv_scores):.1%} ± {np.std(cv_scores):.1%}")
# More realistic result: 82% ± 4% (lower but trustworthy)
```
**How to detect overfitting in your own models:**
| Warning Sign | What It Means | How to Fix |
|--------------|---------------|------------|
| **Train 100%, Test 85%** | Memorizing training data | Reduce max_depth, increase min_samples_split |
| **Tiny changes in X → big changes in predictions** | Model too sensitive | Add regularization, use ensemble averaging |
| **Feature importance shows irrelevant features** | Capturing noise | Remove weak features, increase `min_samples_leaf` |
| **Perfect confusion matrix on training data** | Complete overfitting | Start over with simpler model |
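The first warning sign in the table—a large train/test gap—is cheap to check after every training run. A minimal sketch, assuming a fitted scikit-learn-style `model` with matching feature arrays and labels (`overfit_report` is a hypothetical helper):

```python
from sklearn.metrics import accuracy_score

def overfit_report(model, X_train, y_train, X_test, y_test, max_gap=0.10):
    """Report train/test accuracy and flag a suspicious gap
    (a common overfitting symptom for tree ensembles)."""
    train_acc = accuracy_score(y_train, model.predict(X_train))
    test_acc = accuracy_score(y_test, model.predict(X_test))
    gap = train_acc - test_acc
    status = "OK" if gap <= max_gap else "OVERFIT: reduce max_depth, raise min_samples_split"
    print(f"train {train_acc:.1%} | test {test_acc:.1%} | gap {gap:.1%} -> {status}")
    return gap
```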
**Critical insight for drilling decisions:** A model with 87% test accuracy that generalizes consistently is **far more valuable** than a model with 95% accuracy on a single test set that collapses in production. Better to be **reliably good** than **unreliably excellent**.
**Real-world example:** Early versions of this model achieved 92% accuracy using unlimited tree depth. When deployed on a new HTEM survey 50km away, accuracy dropped to 68%—worse than baseline. After adding spatial CV and regularization, accuracy stabilized at 86% ± 2% across all regions—lower but trustworthy.
**Management communication:** When presenting model results, always report:
- Training accuracy AND test accuracy (gap should be <5-10%)
- Cross-validation mean AND standard deviation (std should be <3-5%)
- Out-of-domain validation if available (test on completely separate survey)
**Do not deploy models with:**
- >15% train/test accuracy gap
- >5% cross-validation standard deviation
- No spatial or temporal validation strategy
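The three deployment criteria above can be encoded as an explicit gate so they are checked consistently before any rollout. This is a sketch under the thresholds stated above; `deployment_gate` is a hypothetical helper:

```python
import numpy as np

def deployment_gate(train_acc, test_acc, cv_scores, has_spatial_validation):
    """Return the list of deployment criteria the model fails
    (an empty list means all three gates pass)."""
    failures = []
    if train_acc - test_acc > 0.15:
        failures.append("train/test accuracy gap > 15%")
    if np.std(cv_scores) > 0.05:
        failures.append("cross-validation std > 5%")
    if not has_spatial_validation:
        failures.append("no spatial or temporal validation strategy")
    return failures

# Example: a model that passes all three gates
print(deployment_gate(0.90, 0.86, [0.84, 0.86, 0.87, 0.85, 0.88], True))
# → []
```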
:::
## Limitations
::: {.callout-warning icon=false}
## Model Limitations
1. **Spatial Autocorrelation**: Random train/test split may inflate accuracy due to nearby points in both sets. True accuracy with spatial cross-validation is likely 5-10% lower.
2. **Extrapolation Risk**: Model may not generalize to locations far from HTEM survey coverage.
3. **Class Imbalance**: Some material types (e.g., MT 105 bedrock) have limited samples.
4. **Feature Engineering**: Additional features (Resistivity values, neighborhood statistics) could improve performance.
:::
---
## Related Chapters
- [Explainable AI Insights](explainable-ai-insights.qmd) - SHAP values and model interpretation
- [Aquifer Material Map](../part-2-spatial/aquifer-material-map.qmd) - Spatial distribution of material types
- [HTEM Survey Overview](../part-1-foundations/htem-survey-overview.qmd) - Source of resistivity features
- [Well Placement Optimizer](well-placement-optimizer.qmd) - Uses classification predictions for site selection