---
title: "Lessons Learned Log"
subtitle: "Failed Experiments & Tribal Knowledge Preservation"
code-fold: true
---
::: {.callout-tip icon=false}
## For Newcomers
**You will learn:**
- What approaches failed and why (knowledge usually omitted from papers)
- How to avoid common mistakes in aquifer data analysis
- "Tribal knowledge" that experienced practitioners know but rarely document
- Why documenting failures is as valuable as documenting success
Most technical documentation only shows what worked. This chapter reveals the 90% that failed—saving you from repeating our mistakes and preserving hard-won knowledge.
:::
## What You Will Learn in This Chapter
By the end of this chapter, you will be able to:
- Describe common failure modes in aquifer data pipelines, modeling, optimization, and deployment.
- Recognize patterns and anti-patterns that tend to waste time or create hidden technical debt.
- Connect specific “failed experiments” to concrete practices that improve robustness and trustworthiness.
- Use a structured lessons-learned log as part of your own project’s governance and onboarding.
- Reflect on how negative knowledge (what doesn’t work) shapes better future decisions.
## Purpose of This Chapter
**Most technical documentation lies**: It shows only what worked, hiding the 90% that failed.
**This chapter tells the truth**: What we tried, what failed, why it failed, what we learned.
**Value**: Save future researchers from repeating failures. Turn mistakes into knowledge.
---
## Philosophy: Celebrate Failure
:::{.callout-note icon=false}
## 💻 For Computer Scientists
**Research ≠ Production**
In research papers, we show:
- Model accuracy: 86.4%
- F1 score: 86.0%
- Publication-ready results
We hide:
- 7 previous model architectures that failed
- 3 months debugging data pipeline
- 23 hyperparameter combinations that made it worse
- The intern's XGBoost model that beat our complex ensemble
**This chapter documents the hidden 90%**. Because the path to 86% went through 12%, 34%, 51%, 72%...
:::
:::{.callout-tip icon=false}
## 🌍 For Hydrogeologists
**Field experience = Tribal knowledge**
Old timer: "Don't drill on the north side of the ridge - it's all clay."

New hire: "Why?"

Old timer: "We tried 4 wells there in 1987. All dry. Lost $180K."
**This knowledge disappears when old timer retires** unless documented.
This chapter is that documentation: organized, searchable, actionable.
:::
---
## Failed Experiments Taxonomy
::: {.callout-note icon=false}
## 📘 Understanding Failure Taxonomies in Data Science
**What Is It?**
A failure taxonomy is a structured classification system that categorizes mistakes, bugs, and unsuccessful experiments into logical groups. This concept comes from software engineering (1990s bug tracking systems) and medicine (adverse event reporting). For data science, taxonomies help teams recognize patterns: "We keep making the same type of error in data pipelines."
**Why Does It Matter?**
Most technical documentation shows only successes (publication bias). Failed experiments are hidden, causing future researchers to repeat the same mistakes. A failure taxonomy preserves "negative knowledge"—what doesn't work—which is as valuable as positive knowledge. It also reveals systemic weaknesses: "80% of our failures are in Category 1 (pipelines), so we need better testing there."
**How Does It Work?**
1. **Define Categories**: Group failures by type (pipeline, architecture, optimization, quality, deployment)
2. **Document Each Failure**: Record what was tried, why it failed, time wasted, lesson learned
3. **Extract Patterns**: Identify common root causes across failures in same category
4. **Create Prevention**: Design tests, checks, or processes to avoid repeating failures
5. **Share Knowledge**: Make taxonomy searchable so future team members can learn
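The five-step workflow above can be sketched as a minimal, searchable log structure. This is only a sketch: the field names and sample entries below are illustrative, not the project's actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class FailureEntry:
    """One documented failure in the lessons-learned log."""
    category: str        # e.g. "pipeline", "architecture", "optimization"
    attempted: str       # what was tried
    failure_mode: str    # why it failed
    weeks_wasted: float  # time cost
    lesson: str          # corrective action / prevention strategy
    tags: list = field(default_factory=list)

def by_category(entries):
    """Group failures by category so systemic weaknesses become visible."""
    groups = {}
    for e in entries:
        groups.setdefault(e.category, []).append(e)
    return groups

# Hypothetical entries mirroring failures described in this chapter
log = [
    FailureEntry("pipeline", "auto-detect HTEM file format",
                 "inconsistent column naming", 6, "explicit schema validation"),
    FailureEntry("pipeline", "locale-dependent timestamp parsing",
                 "silent data corruption", 18, "explicit format strings"),
]
```

Grouping by category is what turns isolated mistakes into patterns ("most of our failures are pipeline failures, so invest in schema tests").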
**What Will You See?**
Failures organized into 5 categories (pipeline, architecture, optimization, quality, deployment), with each entry showing the attempted approach, failure mechanism, time cost, and corrective action.
**How to Interpret Failure Categories:**
| Category | What Goes Wrong | Typical Symptoms | Prevention Strategy | Time Typically Wasted |
|----------|----------------|-----------------|--------------------|--------------------|
| **Pipeline Failures** | Data loading, parsing, format issues | Crashes, silent corruption, wrong dates | Schema validation, explicit formats | 2-8 weeks per bug |
| **Architecture Failures** | Wrong model choice, over-engineering | Low accuracy, slow training, complexity | Start simple, iterate | 4-12 weeks per attempt |
| **Optimization Failures** | Algorithm doesn't converge, wrong method | Slow, unstable, hard to explain | Match algorithm to problem size | 2-6 weeks per attempt |
| **Quality Failures** | Data cleaning removes real information | Missing events, artificial patterns | Flag, don't delete; preserve raw data | 1-4 weeks per issue |
| **Deployment Failures** | Model drift, over-automation, monitoring gaps | Accuracy drops, bad decisions made | Production monitoring, human-in-loop | 1-6 months to discover |
**Failure Cost Analysis:**
- **Total documented failures**: 23 experiments that didn't work
- **Total time invested**: ~180 weeks (3.5 person-years) across 2-year project
- **Percentage of project**: ~50% of effort went into failed attempts
- **Value**: Each documented failure saves 2-8 weeks for future researchers
**Most Dangerous Failure Type: Silent Corruption**
- Example: Timestamp parsing bug (M/D/YYYY vs D/M/YYYY)
- No error messages, system runs fine
- All analysis is wrong but looks plausible
- Discovered after 4 months when results didn't make physical sense
- **Lesson**: Explicit formats, sanity checks, domain expert review
**Red Flags That Predict Failure:**
- "This worked in the paper, so it'll work here" (different data, different problem)
- "More complex is better" (usually false—feature engineering > model complexity)
- "We'll fix data quality later" (never works—GIGO principle)
- "100% automation is the goal" (ignores human expertise and accountability)
- "This will definitely work" (no backup plan when it doesn't)
**Success Pattern:**
Best outcomes came from: Simple baseline → Understand why it fails → Add minimal complexity → Validate → Repeat. Not: Complex solution first → Debug for months → Give up.
:::
### Category 1: Pipeline Failures
#### Failure: Auto-Detection of HTEM File Format
**What we tried**: Automatically detect 2D vs 3D HTEM files based on column names
**Why it failed**: Inconsistent naming across files
- Some files: "Resistivity_ohm_m"
- Other files: "Resist_Class"
- Some files: Both columns present but one all NaN
**Symptoms**: Pipeline crashed on 15% of files, silent data corruption on another 20%
**Time wasted**: 6 weeks debugging production pipeline
**What we learned**: **Never trust file formats**. Always explicit schema validation:
```python
# ❌ Don't do this
if 'Resistivity' in df.columns:
    process_2d_grid(df)

# ✅ Do this
required_cols = ['X', 'Y', 'Z', 'Resistivity_ohm_m']
if not all(col in df.columns for col in required_cols):
    raise SchemaError(f"Missing columns: {set(required_cols) - set(df.columns)}")
```
**Status**: Fixed in v0.3 with explicit schema checks
::: {.callout-warning icon=false}
## Pattern Recognition: Early Warning Signs of File Format Issues
**Warning signs (what to watch for):**
- **Inconsistent column names** across files (e.g., "Resistivity" vs. "Resist" vs. "Res_ohm_m")
- **Silent failures** on subset of files (pipeline works on 80%, crashes on 20%)
- **NaN columns** that should have data (column exists but all values are NaN)
- **Unexpected data types** (expecting float, getting string due to format variations)
**Prevention strategy:**
1. **Schema validation FIRST**: Define expected schema before processing any files
```python
REQUIRED_COLUMNS = ['X', 'Y', 'Z', 'Resistivity_ohm_m']
OPTIONAL_COLUMNS = ['Material_Type', 'Confidence']

def validate_schema(df):
    missing = set(REQUIRED_COLUMNS) - set(df.columns)
    if missing:
        raise SchemaError(f"Missing required columns: {missing}")
```
2. **Test on ALL files**: Don't just test on one "representative" file—file formats vary
3. **Explicit type checking**: Validate data types match expectations
```python
assert df['Resistivity_ohm_m'].dtype == float
assert df['X'].min() > 0 # UTM coordinates should be positive
```
4. **Log schema info**: On first load, log schema for debugging
```python
logger.info(f"Loaded {file}: {df.shape[0]} rows, columns: {list(df.columns)}")
```
**Recovery approach (when you discover format issues):**
1. **Don't panic-fix**: Resist urge to patch one file at a time
2. **Catalog variations**: Survey ALL files, document format variations
3. **Design unified schema**: Create canonical schema that handles all variations
4. **Implement converters**: Write explicit conversion functions for each format variant
5. **Add regression tests**: Test pipeline on problematic files to prevent regression
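The "catalog then convert" recovery steps can be sketched as a registry of explicit converters, one per documented format variant. The variant names (`variant_a`, `variant_b`) and column mapping below are hypothetical illustrations, not the project's actual catalog.

```python
import pandas as pd

# Canonical schema every converter must produce
CANONICAL = ['X', 'Y', 'Z', 'Resistivity_ohm_m']

def convert_variant_a(df):
    # Variant A already matches the canonical schema
    return df[CANONICAL]

def convert_variant_b(df):
    # Variant B (hypothetical) uses a shortened resistivity column name
    return df.rename(columns={'Resist': 'Resistivity_ohm_m'})[CANONICAL]

CONVERTERS = {'variant_a': convert_variant_a, 'variant_b': convert_variant_b}

def load_canonical(df, variant):
    """Convert a known variant to the canonical schema; fail loudly otherwise."""
    if variant not in CONVERTERS:
        raise ValueError(f"Unknown format variant: {variant}")
    out = CONVERTERS[variant](df)
    missing = set(CANONICAL) - set(out.columns)
    if missing:
        raise ValueError(f"Missing required columns: {missing}")
    return out
```

The design choice here is the point: unknown variants raise an error instead of falling back to guessing, which is exactly the auto-detection trap this failure documents.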
**Key lesson**: File format assumptions are the #1 cause of silent data corruption. Always validate explicitly, never trust auto-detection.
:::
---
#### Failure: Timestamp Parsing (The Silent Killer)
**What we tried**: Use `pd.to_datetime(..., errors='coerce')` for robustness
**Why it failed**: Database uses US format (M/D/YYYY), pandas defaults to ISO (YYYY-MM-DD)
- "7/9/2008" parsed as September 7 in some locales, July 9 in others
- **Silent data corruption**: Wrong dates, no error messages
- All temporal analysis was wrong for 4 months before discovery
**How we discovered**: Seasonal decomposition showed peak recharge in August (should be April)
**Time wasted**: 4 months bad analysis, 2 weeks to find bug, 1 week to fix
**What we learned**: **ALWAYS use explicit format**: `format='%m/%d/%Y'`
**Impact**: Created `TIMESTAMP_AUDIT_AND_FIXES.md` and `config/timestamp_formats.yaml`
**Status**: Fixed in v0.5, added regression tests
::: {.callout-danger icon=false}
## Prevention Framework: Timestamp Validation Checks
**Common pitfalls (locale-dependent parsing):**
- **Auto-detection**: `pd.to_datetime()` without `format=` → Guesses based on locale
- **Ambiguous dates**: "7/9/2008" could be July 9 or September 7 depending on locale
- **Silent corruption**: No errors, just wrong dates (most dangerous type of bug)
**Validation checks (implement ALL of these):**
1. **Explicit format specification**:
```python
# ✅ ALWAYS specify format explicitly
df['TIMESTAMP'] = pd.to_datetime(df['TIMESTAMP'], format='%m/%d/%Y', errors='coerce')
# Document format in schema
TIMESTAMP_FORMAT = '%m/%d/%Y' # US format: Month/Day/Year
```
2. **Sanity checks after parsing**:
```python
# Check date range is reasonable
assert df['TIMESTAMP'].min() > pd.Timestamp('1900-01-01')
assert df['TIMESTAMP'].max() < pd.Timestamp('2030-01-01')
# Check no dates in future
assert df['TIMESTAMP'].max() <= pd.Timestamp.now()
```
3. **Domain validation**:
```python
# For seasonal analysis, check peak month makes sense
monthly_mean = df.groupby(df['TIMESTAMP'].dt.month)['water_level'].mean()
peak_month = monthly_mean.idxmax()
assert peak_month in [3, 4, 5], f"Peak recharge should be spring, got month {peak_month}"
```
4. **Cross-reference validation**:
```python
# Compare to known events (e.g., 2012 drought)
drought_period = df[(df['TIMESTAMP'] >= '2012-06-01') & (df['TIMESTAMP'] <= '2012-09-30')]
assert drought_period['water_level'].mean() < df['water_level'].mean(), "2012 drought should show low water levels"
```
**Testing protocol:**
1. **Unit test with known dates**: Test parsing on edge cases (1/1/2000, 12/31/2099)
2. **Integration test with real data**: Parse full dataset, validate results
3. **Regression test**: Lock in correct parsing, prevent future breaks
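Step 1 of the protocol can be sketched as a small unit check on edge-case dates. This is a sketch using pandas; the assertions mirror the explicit-format rule and sanity checks above.

```python
import pandas as pd

TIMESTAMP_FORMAT = '%m/%d/%Y'  # US format, documented explicitly

def parse_timestamps(series):
    """Parse with an explicit format; invalid strings become NaT, never a wrong date."""
    return pd.to_datetime(series, format=TIMESTAMP_FORMAT, errors='coerce')

def test_known_dates():
    parsed = parse_timestamps(pd.Series(['1/1/2000', '12/31/2099', '7/9/2008']))
    # The ambiguous date must resolve to July 9, not September 7
    assert parsed[2].month == 7 and parsed[2].day == 9
    # Edge cases parse without silent coercion to NaT
    assert parsed.notna().all()

def test_invalid_becomes_nat():
    # Day/month swapped beyond valid range must NOT silently parse
    parsed = parse_timestamps(pd.Series(['31/12/2008']))
    assert parsed.isna().all()

test_known_dates()
test_invalid_becomes_nat()
```

Locking these assertions into the test suite is what the regression-test step means in practice: the 4-month silent-corruption bug cannot recur without a test failing first.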
**Key lesson**: Timestamp bugs are silent killers—no errors, just wrong analysis. Always use explicit formats, always validate results, always add sanity checks.
:::
---
### Category 2: Architecture Failures
#### Failure #1: Complex Ensemble Loses to Simple Baseline
**What we tried**: 7-model stacking ensemble (RF + XGBoost + LightGBM + CatBoost + Extra Trees + AdaBoost + Gradient Boost)
**Why it failed**:
- Training time: 8 hours (vs 12 minutes for RF alone)
- Overfitting: Train accuracy 98%, test accuracy 84% (vs RF: 92%/86%)
- Maintenance nightmare: 7 libraries, version conflicts
- Net result: 84% vs 86% (ensemble WORSE than RF alone)
**What we learned**:
- **Simple > Complex** (Occam's Razor applies to ML)
- RF with good feature engineering beats complex ensemble with bad features
- 1 hour feature engineering > 10 hours model tuning
**Status**: Abandoned ensemble, focused on RF + feature engineering → 86.4% accuracy
---
#### Failure #2: Deep Learning for Tabular Data
**What we tried**: Neural network for material classification (tabular HTEM data)
**Architecture**:
```python
Input (8 features) → Dense(128) → Dropout(0.3) → Dense(64) →
Dropout(0.3) → Dense(32) → Dense(15 classes)
```
**Why it failed**:
- Test accuracy: 79% (vs RF: 86%)
- Training time: 45 minutes (vs RF: 2 minutes)
- Hyperparameter hell (learning rate, dropout, batch size, optimizer)
- Needs 10× more data for marginal improvement
**What we learned**: **Deep learning excels at images/sequences, not small tabular data**
- Tabular data: Use RF, XGBoost (decision trees)
- Images: Use CNN
- Sequences: Use LSTM, Transformers
- Don't force deep learning where it doesn't fit
**Status**: Abandoned for material classification, kept for time series forecasting (where it excels)
::: {.callout-tip icon=false}
## Meta-Analysis: Common Patterns in Architecture Failures
**Root causes (why architectures fail):**
1. **Complexity bias**: "More complex = better" assumption (usually false)
2. **Hype-driven development**: "Deep learning is SOTA, let's use it everywhere"
3. **Ignoring data characteristics**: Forcing algorithms meant for images onto tabular data
4. **Premature optimization**: Optimizing model before validating it works
5. **No baseline**: Jumping to complex model without simple baseline to beat
**Common failure patterns:**
| Pattern | Example | Why It Fails | Simple Alternative |
|---------|---------|--------------|-------------------|
| **Overengineering** | 7-model ensemble vs. single RF | Complexity ≠ accuracy, maintenance nightmare | Random Forest with good features |
| **Wrong tool for job** | Deep learning for tabular data | DL needs large datasets, tabular has small | Tree-based models (RF, XGBoost) |
| **Transfer learning mismatch** | ImageNet → HTEM grids | Source domain too different | Train from scratch on domain data |
| **Hyperparameter hell** | 10+ hyperparameters to tune | Search space too large, overfitting | Simpler model with fewer knobs |
**Prevention principles:**
1. **Start simple**: Baseline must be simple, fast, interpretable (e.g., Random Forest)
2. **Beat the baseline**: Only increase complexity if it significantly beats baseline (>5% improvement)
3. **Match algorithm to data**: Tabular → trees, images → CNN, sequences → LSTM
4. **1-hour rule**: If model takes >1 hour to train, simplify first
5. **Explainability requirement**: If you can't explain predictions, model is too complex
**Decision framework (when to use what):**
**Use Random Forest when:**
- Tabular data with <100K rows
- Need interpretability (SHAP works well)
- Want fast training (<10 minutes)
- Baseline or production model
**Use XGBoost/LightGBM when:**
- Tabular data with >100K rows
- Accuracy is critical (competitions)
- Can tolerate longer training (30-60 min)
- Have engineering resources for tuning
**Use Deep Learning when:**
- Images (CNN), sequences (LSTM/Transformer), or very large datasets (>1M rows)
- Have GPU resources
- Accuracy gain >10% over trees
- Latency requirements allow (DL is slower)
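The decision framework above can be collapsed into a small rule-of-thumb function. This is a sketch: the thresholds (100K rows, 1M rows, GPU availability) are the ones quoted in this section, not universal constants.

```python
def suggest_model(data_kind, n_rows, have_gpu=False):
    """Rule-of-thumb model choice mirroring the framework above."""
    if data_kind == 'images':
        return 'CNN'
    if data_kind == 'sequences':
        return 'LSTM/Transformer'
    if data_kind == 'tabular':
        if n_rows > 1_000_000 and have_gpu:
            # Only worthwhile if DL gains >10% over tree-based models
            return 'deep learning (validate >10% gain over trees first)'
        if n_rows > 100_000:
            return 'XGBoost/LightGBM'
        return 'Random Forest'
    raise ValueError(f"Unknown data kind: {data_kind}")
```

For the material-classification problem in this chapter (tabular, well under 100K rows), the function lands on Random Forest, which is exactly the model that won.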
**Key lesson**: Architecture failures share common pattern—complexity without justification. Always start simple, validate it works, then add complexity only when necessary and beneficial.
:::
---
#### Failure #3: Transfer Learning from ImageNet
**What we tried**: Treat HTEM 2D grids as images, fine-tune pre-trained ResNet
**Logic**: "ImageNet learned to detect patterns in images, HTEM grids ARE images of subsurface..."
**Why it failed**:
- ResNet expects RGB (3 channels), HTEM is grayscale (1 channel) → Had to pad with zeros
- ImageNet learned cats/dogs, not geology → Features not transferable
- Accuracy: 62% (worse than random forest)
- Model size: 138 MB (vs RF: 2.4 MB)
**What we learned**: **Transfer learning requires domain similarity**
- ImageNet → Cats/dogs: ✅ Works (similar objects)
- ImageNet → Medical X-rays: ✅ Works (similar modality)
- ImageNet → HTEM grids: ❌ Fails (completely different domain)
**Better approach**: Pre-train on other HTEM surveys (geophysical domain), not ImageNet
**Status**: Abandoned, lesson documented
::: {.callout-note icon=false}
## Broader Lesson: When Transfer Learning Works vs. Fails
**Transfer learning works when:**
1. **Domain similarity**: Source and target domains are similar (e.g., ImageNet cats → wildlife photos)
2. **Task similarity**: Same type of task (e.g., image classification → image classification)
3. **Feature reuse**: Low-level features from source transfer to target (e.g., edges, textures)
4. **Data scarcity**: Target domain has <10K labeled examples (pre-training helps)
**Transfer learning fails when:**
1. **Domain mismatch**: Completely different modalities (e.g., natural images → geophysical grids)
2. **Task mismatch**: Different objectives (e.g., object detection → material classification)
3. **Feature mismatch**: Pre-trained features irrelevant (e.g., cat whiskers ≠ resistivity gradients)
4. **Sufficient data**: Target domain has >100K examples (training from scratch works fine)
**Domain adaptation strategies:**
- **Fine-tuning**: Keep pre-trained weights, train last layers on target data (works for similar domains)
- **Feature extraction**: Use pre-trained model as feature extractor, train classifier on top (works when features transferable)
- **Multi-task learning**: Train on multiple related tasks simultaneously (works when tasks share structure)
- **Domain-specific pre-training**: Pre-train on other HTEM surveys, then fine-tune on Champaign County (best approach for this problem)
**Decision tree (should I use transfer learning?):**
```
Do I have <10K labeled examples?
├─ No → Train from scratch (you have enough data)
└─ Yes → Is there a pre-trained model from a similar domain?
    ├─ No → Train from scratch with data augmentation
    └─ Yes → Is the task similar (classification, detection, segmentation)?
        ├─ No → Train from scratch (task mismatch)
        └─ Yes → Try transfer learning, validate on holdout set
            ├─ Accuracy gain >5% → Keep transfer learning
            └─ Accuracy gain <5% → Train from scratch (transfer not helping)
```
**Key insight**: Transfer learning is not magic—it only works when source and target domains share meaningful structure. Don't force it when domains are fundamentally different.
:::
---
### Category 3: Optimization Failures
#### Failure: Genetic Algorithm for Well Placement
**What we tried**: Use genetic algorithm (GA) for multi-objective well optimization
**Why it seemed good**: "Nature-inspired, handles multiple objectives, proven in literature"
**Why it failed**:
- Slow: 10,000 iterations × 0.5 sec/evaluation = 83 minutes runtime
- Unstable: Different runs gave different "optimal" solutions (random seed sensitivity)
- Hard to tune: Population size, mutation rate, crossover rate, elite count...
- Black box: Can't explain why algorithm chose location A over B
**What worked instead**: Grid search with Pareto filtering
- Evaluate ALL candidate locations (1,000 sites)
- Filter to Pareto frontier (47 sites)
- Rank by weighted score
- Runtime: 2 minutes
- Reproducible: Same input → same output
- Explainable: Can show suitability map
**What we learned**: **Optimization ≠ Always use fancy algorithm**
- Small search space (<10,000 candidates): Brute force grid search
- Medium search space: Pareto filtering + ranking
- Large search space (>1M candidates): Then consider GA, simulated annealing
**Status**: Switched to grid search, 40× faster and more explainable
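The grid-search-plus-Pareto workflow that replaced the GA can be sketched in a few lines. The candidate scores and the 0.7/0.3 weights below are synthetic illustrations; the project's real objectives and weights are not reproduced here.

```python
def pareto_front(candidates):
    """Keep candidates not dominated on (yield, cost): higher yield, lower cost wins."""
    front = []
    for i, (y_i, c_i) in enumerate(candidates):
        dominated = any(
            (y_j >= y_i and c_j <= c_i) and (y_j > y_i or c_j < c_i)
            for j, (y_j, c_j) in enumerate(candidates) if j != i
        )
        if not dominated:
            front.append((y_i, c_i))
    return front

def rank_weighted(front, w_yield=0.7, w_cost=0.3):
    """Rank Pareto-optimal sites by a weighted score (weights illustrative)."""
    return sorted(front, key=lambda s: w_yield * s[0] - w_cost * s[1], reverse=True)

# Evaluate ALL candidates, filter to the frontier, rank: reproducible and explainable
sites = [(100, 50), (90, 30), (80, 60), (95, 30)]
best = rank_weighted(pareto_front(sites))
```

Because every candidate is evaluated deterministically, the same input always yields the same ranking, and each rejected site can be explained as "dominated by site X", neither of which the GA could offer.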
::: {.callout-tip icon=false}
## Algorithm Selection Framework: When to Use What Optimization Method
**Problem size determines algorithm:**
| Search Space Size | Best Approach | Runtime | Pros | Cons |
|------------------|---------------|---------|------|------|
| **<1,000 candidates** | Exhaustive search (grid) | Seconds-minutes | Guaranteed global optimum, reproducible | Only works for small spaces |
| **1K-10K candidates** | Pareto filtering + ranking | Minutes | Fast, explainable, reproducible | Requires good scoring function |
| **10K-100K candidates** | Random search + refinement | 10-30 min | Explores space well, simple | May miss optimal |
| **100K-1M candidates** | Gradient-based (if continuous) | 10-60 min | Fast convergence if smooth | Requires differentiable objectives |
| **>1M candidates** | Genetic algorithm, particle swarm | Hours | Handles discrete + continuous | Slow, hard to tune, non-reproducible |
**When to use Genetic Algorithms:**
✅ **Use GA when:**
- Search space is >1M candidates AND can't be exhaustively searched
- Objective function is non-differentiable (can't use gradient descent)
- Multi-modal landscape (many local optima to escape)
- You have weeks to tune hyperparameters (population size, mutation rate, etc.)
❌ **Don't use GA when:**
- Search space is <10K candidates (use grid search instead—faster and reproducible)
- Need explainability (GA is black box)
- Production deployment requires reproducibility (GA results vary by random seed)
- Limited compute time (<1 hour)
**Better alternatives by problem type:**
**For well placement (discrete locations):**
1. Grid search (evaluate all candidates)
2. Pareto filtering (remove dominated solutions)
3. Multi-criteria ranking (weighted score)
→ Result: 2 minutes, reproducible, explainable
**For continuous optimization (e.g., pumping rates):**
1. Gradient descent (if objective is smooth)
2. Bayesian optimization (if expensive to evaluate)
3. NSGA-II (if truly multi-objective with >100K candidates)
→ Result: 10-30 minutes, converges to optimum
**Selection criteria:**
```
What's your search space size?
├─ <10K → Use grid search (brute force)
├─ 10K-100K → Use random search or Pareto filtering
└─ >100K → Is the objective smooth/continuous?
    ├─ Yes → Use gradient-based methods (Adam, L-BFGS)
    └─ No → Is it multi-modal (many peaks)?
        ├─ Yes → Use GA or particle swarm (last resort)
        └─ No → Use simulated annealing (simpler than GA)
```
**Key insight**: Fancy optimization algorithms (GA, PSO) are NOT better—they're just necessary for extremely large, complex search spaces. For most real-world problems, simple approaches (grid search, Pareto filtering) are faster, more explainable, and more reproducible.
:::
---
### Category 4: Quality Failures
#### Failure: Outlier Removal Deleted Real Events
**What we tried**: Remove outliers >3σ from rolling mean to clean data
**Why it failed**: Deleted actual pumping tests and flood events
- Pumping test: 3m drawdown in 6 hours → Flagged as outlier, deleted
- Flash flood: 2m rise in 2 hours → Flagged as outlier, deleted
- Lost critical information for aquifer characterization
**What we learned**: **Outliers ≠ Errors**
- Some outliers are measurement errors (sensor stuck)
- Some outliers are real events (extreme weather, pumping)
- **Don't delete blindly** - flag for review, preserve raw data
**Better approach**: Multi-method anomaly detection ([Anomaly Early Warning](anomaly-early-warning.qmd))
- If 5/5 methods flag → Likely sensor error
- If 1/5 methods flag → Likely real event
- Always keep raw data, flag in metadata
**Status**: Changed to anomaly flagging (not deletion), preserves all data
::: {.callout-warning icon=false}
## Detection Strategy: Identifying Data Quality Issues Early
**Early warning signs of quality problems:**
1. **Sudden drop in data volume**: 1000 records/day → 100 records/day (sensor failure or data pipeline issue)
2. **Unexpected value distributions**: Water levels all positive → 10% negative values (measurement errors)
3. **Missing variation**: Constant values for extended periods (stuck sensor)
4. **Impossible values**: Water level changes 10m in 1 hour (physically impossible in aquifer)
5. **Seasonal pattern breaks**: Peak recharge in August instead of April (timestamp parsing error)
**Validation checks (implement in data pipeline):**
1. **Range checks**:
```python
# Define physical bounds
VALID_WATER_LEVEL_RANGE = (-10, 50)   # meters below surface
VALID_RESISTIVITY_RANGE = (1, 10000)  # ohm-meters

# Flag out-of-range values
invalid = (df['water_level'] < VALID_WATER_LEVEL_RANGE[0]) | \
          (df['water_level'] > VALID_WATER_LEVEL_RANGE[1])
df.loc[invalid, 'quality_flag'] = 'out_of_range'
```
2. **Rate-of-change checks**:
```python
# Maximum physically possible change (e.g., 1m/day for water levels)
df['daily_change'] = df['water_level'].diff() / df['TIMESTAMP'].diff().dt.days
impossible_change = df['daily_change'].abs() > 1.0 # m/day
df.loc[impossible_change, 'quality_flag'] = 'rate_exceeded'
```
3. **Consistency checks**:
```python
# Check relationships between variables
# Example: a pumping well should show drawdown during pumping
pumping_periods = df[df['pumping'] == True]
rising_during_pumping = pumping_periods['water_level'].diff() > 0
if rising_during_pumping.any():
    logger.warning("Water level rising during pumping - check sensor")
```
4. **Multi-method anomaly ensemble**:
```python
# Use 5 methods, flag if the majority agree
methods = [
    isolation_forest(df),
    local_outlier_factor(df),
    zscore_method(df),
    seasonal_decomposition_outliers(df),
    domain_rules(df),
]
votes = sum(methods)  # count how many methods flagged each point
df['anomaly_confidence'] = votes / len(methods)
df['quality_flag'] = votes >= 3  # flag if 3+ methods agree
```
**Quality gates (don't proceed if these fail):**
- **Completeness**: <5% missing values (otherwise, investigate gap causes)
- **Validity**: <1% out-of-range values (otherwise, check sensor calibration)
- **Consistency**: Temporal ordering correct, no duplicates
- **Plausibility**: Seasonal patterns match expectations (April peak recharge)
**Key lesson**: Don't clean data blindly (e.g., "remove all outliers"). Instead, FLAG suspicious data, preserve raw values, investigate root causes, and let domain experts decide if it's error or real event.
:::
---
#### Failure: Imputation Introduced Artifacts
**What we tried**: Fill missing groundwater data with linear interpolation
**Why it failed**: Created impossible water level changes
- Example: Gap from 2020-03-15 to 2020-06-20 (97 days)
- Linear interpolation: Smooth decline from 15.2m to 14.8m
- Reality when data resumed: Sudden spike to 16.5m (spring recharge)
- Our imputation created fake "gradual decline" that never happened
**What we learned**: **Long gaps can't be interpolated**
- Short gaps (<7 days): Linear interpolation OK
- Medium gaps (7-30 days): Use seasonal average (same day of year from previous years)
- Long gaps (>30 days): **Leave as NaN**, don't fabricate data
**Impact**: Seasonal decomposition failed because imputed data had wrong seasonal pattern
**Status**: Changed to conservative imputation (max 7 days), otherwise NaN
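The conservative policy above (interpolate only short gaps, leave long gaps as NaN) can be sketched with pandas. This sketch covers only the short-gap rule; the 7-day limit comes from this section and should be tuned per aquifer.

```python
import pandas as pd

MAX_GAP_DAYS = 7  # longer gaps are left as NaN rather than fabricating data

def conservative_impute(series):
    """Interpolate only gaps of <= MAX_GAP_DAYS consecutive missing days."""
    s = series.asfreq('D')  # daily index with explicit NaN for missing days
    # Label each run of consecutive NaNs and measure its length
    is_gap = s.isna()
    gap_id = (is_gap != is_gap.shift()).cumsum()
    gap_len = is_gap.groupby(gap_id).transform('sum')
    # Interpolate everything interior, then undo fills in gaps that are too long
    filled = s.interpolate(method='linear', limit_area='inside')
    filled[is_gap & (gap_len > MAX_GAP_DAYS)] = float('nan')
    return filled
```

Restoring NaN for long gaps (instead of using `interpolate(limit=...)`, which would fill the first 7 days of a long gap) keeps the all-or-nothing rule: either a gap is short enough to trust, or it stays missing.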
---
### Category 5: Deployment Failures
#### Failure: Model Drift Detection (Too Late)
**What happened**: Material classification model accuracy dropped from 86% to 71% over 6 months
**Why we didn't notice**: No monitoring in place
**How we discovered**: Driller complained "Model said sand, I hit clay 4 times in a row"
**Root cause**:
- Model trained on data from 2015-2020 (glacial outwash region, north)
- New wells drilled in 2023-2024 (different geology, south)
- Model extrapolating beyond training distribution
**What we learned**: **Production models need monitoring**
```python
# Track predictions vs actuals
for new_well in recent_wells:
    predicted = model.predict(new_well.location)
    actual = new_well.drilled_lithology
    log_prediction(predicted, actual, timestamp=now())

# Alert if accuracy drops
monthly_accuracy = calculate_accuracy(last_30_days)
if monthly_accuracy < 0.80:
    send_alert(f"Model accuracy dropped to {monthly_accuracy:.0%}. Retrain recommended.")
```
**Status**: Added model monitoring dashboard, quarterly retraining schedule
::: {.callout-danger icon=false}
## Monitoring Framework: What to Monitor in Production
**Critical metrics to track:**
1. **Model performance degradation**:
- **Monthly accuracy**: Track predictions vs. actuals each month
- **Alert threshold**: Accuracy drops below 80% (from baseline 86%)
- **Action**: Investigate cause, retrain if needed
2. **Data distribution shift**:
- **Feature distributions**: Track mean, std, min, max for each feature
- **Alert threshold**: >2σ change from training distribution
- **Action**: Check if new wells in different geological region
3. **Prediction confidence**:
- **Uncertainty estimates**: Track prediction confidence over time
- **Alert threshold**: >30% of predictions have low confidence (<70%)
- **Action**: Model may need more training data for new conditions
4. **System health**:
- **Inference latency**: Predictions should complete in <1 second
- **Error rate**: <1% of predictions should fail
- **Data freshness**: Sensor data should be <15 minutes old
**Monitoring implementation:**
```python
from datetime import datetime, timedelta

class ProductionMonitor:
    def __init__(self, model, baseline_accuracy=0.86):
        self.model = model
        self.baseline_accuracy = baseline_accuracy
        self.predictions_log = []

    def log_prediction(self, features, prediction, actual=None, timestamp=None):
        """Log each prediction for monitoring"""
        log_entry = {
            'timestamp': timestamp or datetime.now(),
            'features': features,
            'prediction': prediction,
            'actual': actual,  # add when available (e.g., after drilling)
            'confidence': self.model.predict_proba(features).max(),
        }
        self.predictions_log.append(log_entry)

    def check_performance(self, window_days=30):
        """Check if model performance has degraded"""
        recent = [p for p in self.predictions_log
                  if p['timestamp'] > datetime.now() - timedelta(days=window_days)
                  and p['actual'] is not None]
        if len(recent) < 10:
            return None  # not enough data yet
        accuracy = sum(p['prediction'] == p['actual'] for p in recent) / len(recent)
        if accuracy < 0.80:
            self.send_alert(
                f"Model accuracy dropped to {accuracy:.1%} (baseline: {self.baseline_accuracy:.1%}). "
                f"Based on {len(recent)} predictions in last {window_days} days. "
                f"Consider retraining."
            )
        return accuracy

    def check_distribution_shift(self):
        """Detect if input features have shifted from training distribution"""
        recent_features = [p['features'] for p in self.predictions_log[-100:]]
        # Compare to training distribution (implementation depends on features)
        # Flag if mean/std differ by >2 sigma
```
**Alert thresholds:**
| Metric | Green (OK) | Yellow (Warning) | Red (Alert) | Action |
|--------|-----------|------------------|-------------|--------|
| **Accuracy** | >85% | 80-85% | <80% | Retrain immediately |
| **Confidence** | >80% high-conf | 60-80% | <60% | Investigate low-confidence cases |
| **Latency** | <500ms | 500ms-2s | >2s | Optimize inference |
| **Error rate** | <0.1% | 0.1-1% | >1% | Debug errors |
| **Data freshness** | <5 min | 5-15 min | >15 min | Check sensor/network |
**Recovery procedures:**
**If accuracy drops:**
1. Check recent predictions - are they all from new geographical area?
2. Compare feature distributions - training vs. production
3. Collect new training data from failed predictions
4. Retrain model with combined old + new data
5. Validate on holdout set before redeploying
**If distribution shifts:**
1. Investigate cause - new wells in different geology? Climate change?
2. If temporary (e.g., one unusual well), no action needed
3. If persistent trend, retrain quarterly to adapt
**Key lesson**: Production models WILL degrade over time as the world changes. Monitoring is not optional—it's essential for maintaining performance and catching problems before they become disasters.
:::
---
#### Failure: Over-Automation (The $45K Mistake)
**What we tried**: Fully automated well siting recommendations without human review
**Why it failed**:
- Model recommended location with 89% confidence for sand
- Automated system generated permit, scheduled drilling (no human in loop)
- Drilled, hit clay (the prediction fell within the model's 11% error rate)
- $45K dry hole, political embarrassment
**What we learned**: **High-stakes decisions need human-in-loop**
- Low stakes (<$1K): Automate fully (sensor replacement)
- Medium stakes ($1K-$10K): Automate with review (schedule maintenance)
- High stakes (>$10K): Human approval required (well drilling)
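These tiers can be enforced in code rather than left as convention, assuming the estimated cost of an action is known before execution. A sketch (tier boundaries come from the list above; the function name is ours):

```python
def approval_required(estimated_cost_usd):
    """Return the review level a model recommendation needs before execution.

    Tiers follow the stakes policy: <$1K automate fully, $1K-$10K automate
    with review, >$10K require explicit human approval.
    """
    if estimated_cost_usd < 1_000:
        return "auto"               # e.g., sensor replacement
    if estimated_cost_usd <= 10_000:
        return "auto_with_review"   # e.g., scheduled maintenance
    return "human_approval"         # e.g., well drilling
```

Under this gate, the $45K drilling recommendation would have stopped at `human_approval` instead of flowing straight to a permit.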
**New workflow**:
1. Model recommends top 5 sites
2. Geologist reviews (domain expertise)
3. Stakeholder meeting (budget, politics)
4. Final decision (human)
**Status**: Changed to decision support (not decision automation)
---
## Tribal Knowledge: Hard-Won Insights
::: {.callout-note icon=false}
## 📘 Understanding Tribal Knowledge in Technical Projects
**What Is It?**
Tribal knowledge refers to undocumented expertise that experienced team members know but haven't written down. The term comes from organizational studies (1980s-90s) describing how companies lose critical knowledge when senior employees retire. In data science, tribal knowledge includes "gotchas" discovered through painful experience: "Never use auto-detection for timestamps" or "Spatial cross-validation is needed for geospatial data."
**Why Does It Matter?**
When a senior geologist retires, they take 30 years of field experience with them—knowledge about which sites tend to have clay, which wells are unreliable, which analyses to trust. Without documentation, new hires must rediscover these lessons through trial and error, wasting months and making expensive mistakes. Tribal knowledge documentation prevents knowledge loss and accelerates onboarding.
**How Does It Work?**
1. **Capture Implicit Knowledge**: Interview experienced staff about "rules of thumb" and "things everyone knows"
2. **Document Root Causes**: Explain WHY the rule exists (not just WHAT the rule is)
3. **Provide Examples**: Show concrete cases where following/ignoring the rule succeeded/failed
4. **Organize by Topic**: Group insights by category (spatial analysis, temporal patterns, data quality)
5. **Keep Updated**: Add new insights as team makes discoveries
**What Will You See?**
A collection of counter-intuitive insights with explanations: spatial autocorrelation affects model validation, resistivity values are depth-dependent, seasonal adjustment prevents false alarms, rare types need special handling, and explainability sometimes trumps accuracy.
**How to Interpret Tribal Knowledge Categories:**
| Knowledge Type | Example | Why It's Not Obvious | Cost of Not Knowing | How Discovered |
|----------------|---------|---------------------|-------------------|----------------|
| **Statistical Gotchas** | Spatial autocorrelation inflates accuracy | Standard ML assumes independence | Overestimate accuracy by 4-6% | Compared random vs spatial CV |
| **Domain Physics** | Same resistivity = different lithology at different depths | Need geological context | Misclassify materials | Geologist caught error |
| **Data Quirks** | Seasonal 2m cycle in water levels | Looks like trend, not seasonal | False drought alarms (60% FP rate) | Seasonal decomposition |
| **Rare Event Handling** | MT 14 is 0.3% but highest yield | Standard methods ignore rare classes | Miss best aquifer zones | Class imbalance analysis |
| **Stakeholder Psychology** | Explainability > Accuracy for adoption | Trust matters more than metrics | Model rejected despite accuracy | A/B preference test |
**How Tribal Knowledge Accumulates:**
- **Year 1**: Learn the basics (how to load data, run models)
- **Year 2**: Discover first "gotchas" (timestamp parsing, spatial autocorrelation)
- **Year 3**: Build intuition (when models lie, which metrics matter)
- **Year 5**: Can predict what will fail before trying it
- **Year 10+**: Become "tribal elder" who knows all the edge cases
**Knowledge Transfer Methods:**
- **Documentation**: Write it down (this chapter is an example)
- **Code Reviews**: "Why did you use explicit format here?" → Teachable moment
- **Pair Programming**: Junior watches senior, asks "why?" constantly
- **Post-Mortems**: After failures, document what we learned
- **Onboarding Guides**: "Read this chapter BEFORE trying to improve the system"
**Example: The Resistivity-Depth Interaction**
**What Novice Knows:** "High resistivity = sand, low resistivity = clay"
**What Tribal Knowledge Adds:**
- 50 Ω·m at 20m depth = Sand (Quaternary sediments)
- 50 Ω·m at 80m depth = Fractured bedrock (Carboniferous)
- **Reason**: Different geological formations at different depths
- **Impact**: Without depth feature, model accuracy drops from 86% to 78%
- **How discovered**: Geologist noticed model misclassified deep formations
**Preservation Strategy:**
This chapter IS the preservation strategy—searchable, version-controlled, permanent. Future team members can Ctrl+F for keywords and find the accumulated wisdom of years of work.
**How to extract tribal knowledge from your team:**
1. **Interview senior staff**: "What do you wish you knew when you started?" "What mistakes do newcomers always make?"
2. **Document war stories**: "Tell me about the worst bug you ever debugged" → Write it down
3. **Code review insights**: When senior dev says "Don't do X because Y", add to tribal knowledge
4. **Post-mortem every failure**: After each production issue, document root cause and lesson
5. **Onboarding feedback**: Ask new hires "What surprised you?" → Document the non-obvious things
**Tribal knowledge red flags (signs you're losing knowledge):**
- "Only Bob knows how that works" → Bus factor of 1
- "We tried that 3 years ago and it failed" → But no written record of why
- New hires make same mistakes repeatedly → Knowledge not transferred
- Decisions based on "we've always done it this way" → Lost original reasoning
- Critical systems have no documentation → Institutional memory only
**Key insight**: Tribal knowledge is your competitive advantage—but only if you write it down. This chapter is living proof that documentation preserves decades of hard-won insights.
:::
### Insight 1: Spatial Autocorrelation
**Discovery**: Errors at nearby points are correlated, not independent
**Implication**: Random train/test split overestimates accuracy
- If training point at (X=405000, Y=4428000)
- And test point at (X=405020, Y=4428020) - only 28m away
- Model "cheats" by interpolating from nearby training point
**Better approach**: Spatial cross-validation (leave-one-region-out)
**Impact**: True accuracy is 82% (not 86% from random split)
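One lightweight way to get leave-one-region-out splits, assuming each point carries UTM coordinates, is to tile the survey area into blocks and hold each block out in turn (a stdlib sketch with block size chosen for illustration; with scikit-learn, `GroupKFold` over the same block labels achieves the same thing):

```python
def spatial_blocks(points, block_size_m=5_000):
    """Assign each (x, y) UTM point to a grid-block label of size block_size_m."""
    return [(int(x // block_size_m), int(y // block_size_m)) for x, y in points]

def leave_one_region_out(points):
    """Yield (train_idx, test_idx) pairs with each spatial block held out in turn."""
    labels = spatial_blocks(points)
    for block in sorted(set(labels)):
        test_idx = [i for i, b in enumerate(labels) if b == block]
        train_idx = [i for i, b in enumerate(labels) if b != block]
        yield train_idx, test_idx
```

A point 28 m from a test point now lands in the same held-out block, so the model can no longer "cheat" by interpolating from a nearby training neighbor.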
### Insight 2: Non-Linear Relationships
**Discovery**: Same resistivity can mean different lithologies depending on context
- 50 Ω·m at 20m depth = Sand (Quaternary)
- 50 Ω·m at 80m depth = Fractured bedrock (Carboniferous)
**Implication**: Can't use resistivity alone, need depth + location
**Feature engineering solution**: `resistivity × depth` interaction term
**Impact**: Accuracy improved from 78% → 86%
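The interaction feature itself is a one-liner, but it is worth seeing why it helps: the product separates cases that raw resistivity cannot (a sketch with illustrative feature rows; the column names are ours):

```python
def add_interaction(rows):
    """Append a resistivity*depth interaction term to each feature row (dict)."""
    for row in rows:
        row["res_x_depth"] = row["resistivity_ohmm"] * row["depth_m"]
    return rows

rows = add_interaction([
    {"resistivity_ohmm": 50, "depth_m": 20},  # shallow: sand
    {"resistivity_ohmm": 50, "depth_m": 80},  # deep: fractured bedrock
])
# Identical resistivity, but the interaction term (1000 vs 4000) separates them
```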
### Insight 3: Seasonal Adjustment
**Discovery**: Water levels show 2m seasonal cycle (spring high, fall low)
**Implication**: Forecasting models confuse trend with seasonality
- Model predicts decline in summer (actually seasonal, not drought)
- False alarms for drought when it's just normal summer drawdown
**Solution**: STL decomposition (separate trend, seasonal, residual)
**Impact**: False positive rate dropped 60%
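In production we rely on a full STL fit (e.g., `statsmodels.tsa.seasonal.STL`), but the core idea — remove the recurring monthly component before looking for a trend — can be shown with monthly means alone. A simplified stdlib sketch, not the actual STL algorithm:

```python
from collections import defaultdict

def deseasonalize(levels_by_month):
    """Subtract each calendar month's long-run mean from a water-level series.

    levels_by_month: list of (month_1_to_12, water_level_m) tuples.
    Returns anomalies: a real drought shows up as persistently negative
    values, while normal summer drawdown cancels out to near zero.
    """
    by_month = defaultdict(list)
    for month, level in levels_by_month:
        by_month[month].append(level)
    month_mean = {m: sum(v) / len(v) for m, v in by_month.items()}
    return [level - month_mean[month] for month, level in levels_by_month]
```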
### Insight 4: Rare Type Recognition
**Discovery**: MT 14 (extremely well sorted sand) makes up only 0.3% of the data but yields 200+ GPM
**Implication**: Class imbalance - model ignores MT 14 (too rare)
**Solution**: Oversampling MT 14 in training (SMOTE), or separate binary classifier
**Impact**: MT 14 detection improved from 25% → 73%
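SMOTE itself lives in `imbalanced-learn` (`imblearn.over_sampling.SMOTE`); the simpler fallback we also tried — duplicating rare-class rows until they reach a target share — needs only the stdlib. A sketch (function name and target fraction are illustrative):

```python
import random

def oversample_minority(samples, labels, minority_label,
                        target_fraction=0.05, seed=42):
    """Duplicate minority-class rows until they reach target_fraction of the data.

    A crude stand-in for SMOTE: SMOTE interpolates synthetic points between
    minority-class neighbors, while this resamples existing rows with
    replacement. Enough to stop the model from ignoring MT 14 entirely.
    """
    rng = random.Random(seed)
    minority = [(s, l) for s, l in zip(samples, labels) if l == minority_label]
    out = list(zip(samples, labels))
    # Keep appending minority copies until their share reaches the target
    while sum(1 for _, l in out if l == minority_label) / len(out) < target_fraction:
        out.append(rng.choice(minority))
    new_samples, new_labels = zip(*out)
    return list(new_samples), list(new_labels)
```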
### Insight 5: Explainability Beats Accuracy
**Discovery**: Stakeholders trust an 83% accurate explainable model over an 87% accurate black box
**Experiment**:
- Model A: XGBoost, 87% accurate, no explanations
- Model B: Random Forest, 83% accurate, SHAP explanations
**Result**: Stakeholders chose Model B 4× more often
**Lesson**: **Accuracy is not everything**
- Production needs: Accuracy + Explainability + Reliability + Speed
- Research optimizes: Accuracy only
**Impact**: Switched from XGBoost to Random Forest for production
---
## Anti-Patterns to Avoid
### Anti-Pattern 1: Assume More Data Always Helps
**Naive thinking**: "Model accuracy is 75%. If we collect 10× more data, it'll reach 90%!"
**Reality**: Diminishing returns
- First 1,000 samples: 60% → 75% accuracy (+15%)
- Next 10,000 samples: 75% → 81% accuracy (+6%)
- Next 100,000 samples: 81% → 84% accuracy (+3%)
**Lesson**: After a certain point, better features > more data
### Anti-Pattern 2: Neural Networks Everywhere
**Naive thinking**: "Deep learning is state-of-art. Let's use it everywhere!"
**Reality**: Deep learning is a tool, not a universal solution
- Tabular data (<100K rows): Random Forest wins
- Images: CNN wins
- Sequences: LSTM/Transformer wins
- Don't force deep learning where it doesn't fit
**Lesson**: Match algorithm to problem structure
### Anti-Pattern 3: Total Automation
**Naive thinking**: "Humans are bottleneck. Automate everything!"
**Reality**: Humans provide:
- Domain expertise (model can't learn from 10 samples what geologist knows from 30 years)
- Common sense (model recommends drilling in lake - technically valid, obviously wrong)
- Accountability (when model fails, who is responsible?)
**Lesson**: Design for human-AI collaboration, not replacement
### Anti-Pattern 4: Fix Later
**Naive thinking**: "Data quality is poor, but model is robust to noise. Ship it!"
**Reality**: Garbage in, garbage out
- Poor timestamp parsing → 4 months of wrong analysis
- Outliers not cleaned → Model learns noise as signal
- Missing values imputed → Fake patterns appear
**Lesson**: Fix data quality FIRST, then build models
### Anti-Pattern 5: Research to Production Gap
**Naive thinking**: "Model got 90% accuracy in notebook. Deploy to production!"
**Reality**: Research ≠ Production
- Research: Batch processing, clean data, no uptime requirement
- Production: Real-time, dirty data, 99.9% uptime required
- Research: Accuracy only metric
- Production: Accuracy + Latency + Reliability + Explainability + Cost
**Lesson**: Production needs different engineering than research
::: {.callout-warning icon=false}
## How to Detect Anti-Patterns: Warning Signs
**Anti-Pattern 1: "More Data Always Helps"**
**Warning signs:**
- Team proposes "Let's collect 10× more data" without analyzing current data first
- Accuracy plateaued but response is "we need more data" (not "better features")
- Budget request for massive data collection before validating model works
**How to detect:**
- Plot learning curve (accuracy vs. training set size)
- If curve has plateaued, more data won't help—need better features or different model
- Test: Train on 10%, 30%, 50%, 100% of data. If 50%→100% gives <2% improvement, you've saturated.
**Prevention:**
- Always plot learning curves before requesting more data
- Try feature engineering first (cheaper than data collection)
- Quantify marginal value: "10K more samples = 2% accuracy gain, worth $50K data cost?"
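The "train on 10%, 30%, 50%, 100%" test above can be scripted once and reused in every proposal review. A sketch assuming you have already measured accuracy at a few training-set sizes (the 2% threshold encodes the rule of thumb above; the function name is ours):

```python
def data_saturated(curve, min_gain=0.02):
    """Return True if the learning curve has plateaued.

    curve: list of (train_size, accuracy) pairs. Compares the last two
    measured points after sorting by size; if the final step gained less
    than min_gain, collecting more data is unlikely to help much.
    """
    curve = sorted(curve)
    (_, prev_acc), (_, last_acc) = curve[-2], curve[-1]
    return (last_acc - prev_acc) < min_gain
```

Run against the numbers from Anti-Pattern 1, the 81% → 84% step over 100K extra samples already averages under 2% per doubling, which is the point to switch budget from data collection to feature engineering.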
---
**Anti-Pattern 2: "Neural Networks Everywhere"**
**Warning signs:**
- Deep learning proposed for every problem regardless of data size or structure
- "We use SOTA" without justifying why SOTA is needed
- Complex architectures with no baseline comparison
**How to detect:**
- Check if problem matches DL strengths (images, sequences, >100K samples)
- Ask "What's the baseline?" If answer is "no baseline", that's a red flag
- Look for excessive training time (>1 hour) on small datasets
**Prevention:**
- Require simple baseline FIRST (Random Forest for tabular, logistic regression for classification)
- Only use DL if it beats baseline by >5% AND justifies added complexity
- Match algorithm to data: tabular→trees, images→CNN, sequences→LSTM
---
**Anti-Pattern 3: "Total Automation"**
**Warning signs:**
- "AI will replace human decision-making" language in proposals
- No human-in-loop for high-stakes decisions
- Automation of tasks that require domain expertise or judgment
**How to detect:**
- Ask "What happens when the model is wrong?" If no good answer, that's a problem
- Check decision stakes: <$1K = can automate, >$10K = needs human approval
- Look for "deploy and forget" mentality (no monitoring planned)
**Prevention:**
- Design for decision support, not decision automation
- Require human approval for high-stakes decisions (well drilling, large expenditures)
- Implement monitoring and human override capabilities
---
**Anti-Pattern 4: "We'll Fix Data Quality Later"**
**Warning signs:**
- "Data is messy but model is robust to noise" (dangerous assumption)
- Prototype uses clean subset, production gets full messy dataset
- No data quality metrics tracked
**How to detect:**
- Ask "What % of data is clean?" If answer is vague, that's a problem
- Check for validation checks in data pipeline—if none exist, red flag
- Test on deliberately corrupted data—does model gracefully degrade or catastrophically fail?
**Prevention:**
- Fix data quality FIRST, before any modeling
- Implement schema validation, range checks, sanity tests
- Track data quality metrics continuously (completeness, validity, consistency)
---
**Anti-Pattern 5: "Research Success = Production Ready"**
**Warning signs:**
- "Got 95% accuracy in notebook, let's deploy!" (ignoring production requirements)
- No discussion of latency, reliability, monitoring, maintenance
- Research metrics (accuracy) without production metrics (uptime, latency, cost)
**How to detect:**
- Ask about non-functional requirements: What's acceptable latency? Uptime? Monitoring plan?
- Check if model was validated on production-like data (not just clean research datasets)
- Look for productionization plan—if it's "just deploy the notebook", red flag
**Prevention:**
- Define production requirements upfront (latency <1s, uptime 99.9%, etc.)
- Test on production data before deployment
- Plan for monitoring, retraining, incident response BEFORE deploying
**Key insight**: Anti-patterns are easy to spot if you know what to look for. The warning signs are always there—unrealistic promises, skipped baselines, ignored constraints. Trust your gut: if something feels too good to be true, it probably is.
:::
---
## How to Use This
### For New Team Members
**Read this chapter FIRST** before trying to "improve" the system.
Chances are, your "new idea" was tried and failed 2 years ago. Save yourself 3 months by reading why it failed.
### For Managers
**Reference when evaluating proposals**:
"Let's try genetic algorithm for optimization!"
→ Check Lessons Learned → Already tried, failed, grid search is faster
"Let's use deep learning for material classification!"
→ Check Lessons Learned → Already tried, 79% accuracy vs RF 86%
### For Researchers
**Avoid repeating failures**:
Before starting new investigation:
1. Search this chapter for similar experiments
2. Learn from failures
3. Design experiment to avoid known pitfalls
4. If you try something that fails, ADD IT HERE
### For Future Self
**6 months from now**, you'll forget why you made decision X.
This chapter is your **external memory**: searchable, version-controlled, permanent.
---
## Contributing to This Log
### When to Add Entry
Add entry when:
- Experiment failed (most important!)
- Discovered non-obvious insight
- Debugged production issue
- Stakeholder rejected approach (even if technically sound)
### Entry Template
```markdown
## Failure: [Short Title]
**What we tried**: [1 sentence]
**Why it failed**: [Specific technical reason]
**Time wasted**: [Realistic estimate]
**What we learned**: [Actionable lesson]
**Status**: [What we do now instead]
```
### Version Control
This chapter is a **living document**:
- Stored in Git (`lessons-learned-log.qmd`)
- Every failure appended (never delete history)
- Searchable (Ctrl+F for keywords)
- Referenced in pull requests ("Avoiding failure from Issue #47")
---
## Meta-Lesson: Failure Is Data
**Most valuable lesson**: Failure is not a waste of time; it's **negative knowledge**.
**Negative knowledge**: Knowing what doesn't work is as valuable as knowing what works.
**Example**:
- Positive knowledge: "Random Forest achieves 86% accuracy"
- Negative knowledge: "XGBoost ensemble achieves 84% despite 7× the complexity"
**Decision**: Use Random Forest (only possible with negative knowledge)
**This chapter documents our negative knowledge** so others can benefit.
---
**Last Updated**: 2024-11-26
**Total Failures Documented**: 23
**Time Saved (estimated)**: 450 hours for future team
**Next Review**: Ongoing (add failures as they happen)
---
## Summary
The lessons learned log preserves **negative knowledge**—what didn't work:
✅ **23 documented failures** - From timestamp parsing to model complexity
✅ **450 hours saved** - For future team members who won't repeat mistakes
✅ **Root cause analysis** - Why each failure happened, not just what
✅ **Prevention strategies** - How to avoid similar failures
✅ **Living document** - Updated as new lessons emerge
**Key Insight**: **Failure is data.** Knowing what doesn't work is as valuable as knowing what works. This chapter is the project's institutional memory.
---
## Reflection Questions
1. Which one or two failures in this log most closely resemble risks in your own groundwater or data projects, and what would you change today to avoid repeating them?
2. How could you embed “negative knowledge” into your team’s workflows (e.g., PR templates, design reviews, onboarding) so that lessons aren’t lost when people move on?
3. Looking at the anti-patterns, where do you see your current practices drifting toward “more data,” “more complexity,” or “more automation” without clear justification?
4. What specific checks (tests, monitors, dashboards, stakeholder reviews) would you add to catch failures earlier—before they turn into expensive or political problems?
5. If you were to add one new failure or insight from your own experience to this log, what would its short title be, and what would the core lesson look like?
---
## Related Chapters
- [Data Quality Audit](../part-1-foundations/data-quality-audit.qmd) - Data-related lessons
- [Material Classification ML](material-classification-ml.qmd) - ML-related lessons
- [Synthesis Narrative](synthesis-narrative.qmd) - Project overview
- [HTEM Survey Overview](../part-1-foundations/htem-survey-overview.qmd) - Domain context