9  Data Quality Audit

Tip: For Newcomers

You will learn:

  • Why data quality is the foundation of all reliable analysis
  • How to spot common problems: missing values, outliers, duplicates
  • What “good enough” looks like for different types of analysis
  • How to prioritize data cleaning efforts for maximum impact

Think of data quality like food safety—you can have the best recipe (algorithm) in the world, but if the ingredients (data) are spoiled, the result will be inedible. This chapter teaches you to inspect your ingredients before cooking.

9.1 What You Will Learn in This Chapter

By the end of this chapter, you will be able to:

  • Describe the main quality dimensions (completeness, accuracy, consistency, timeliness, validity) across HTEM, wells, weather, and streams.
  • Summarize the key strengths and weaknesses of each data source, especially spatial vs temporal coverage.
  • Interpret the quality scorecard and understand why high accuracy with low spatial coverage is still problematic.
  • Identify the highest-priority quality improvements (especially monitoring network expansion) for future work.

9.2 Quality Matters

Data quality is critical for reliable aquifer analysis. Poor quality data leads to incorrect conclusions about groundwater sustainability, contamination risk, or recharge dynamics. This chapter provides comprehensive quality assessment across all four data sources.

Warning: Data Quality Foundation

Industry consensus: The majority of ML project failures stem from data quality problems, not algorithm problems (a widely-cited observation in data science practice).

Common issues that break models:

  • Missing values: systematic gaps bias predictions
  • Outliers: unhandled extremes corrupt training
  • Duplicates: models learn noise, not signal
  • Label errors: incorrect ground truth inverts predictions

Best practice: Spend 50% of project time on data quality, not modeling!
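The failure modes above can be screened in a few lines of pandas before any modeling begins; a minimal sketch with a toy table (all names and values are invented for illustration):

```python
import numpy as np
import pandas as pd

# Toy water-level table with the classic problems baked in
df = pd.DataFrame({
    'well_id': ['W1', 'W1', 'W2', 'W2', 'W3'],
    'level_m': [12.1, 12.1, np.nan, 11.8, 99.0],   # a gap and a spike
})

# Missing values: share of empty cells in the measurement column
missing_pct = df['level_m'].isna().mean() * 100

# Outliers: flag values beyond 1.5 x IQR of the column
q1, q3 = df['level_m'].quantile([0.25, 0.75])
fence = 1.5 * (q3 - q1)
outliers = df[(df['level_m'] < q1 - fence) | (df['level_m'] > q3 + fence)]

# Duplicates: identical rows counted after their first occurrence
dup_count = df.duplicated().sum()

print(missing_pct, len(outliers), dup_count)
```

Running the three checks on every incoming table is cheap insurance; each one is a single vectorized pandas expression.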


9.3 Quality Assessment Framework

Note: Understanding Data Quality Dimensions

What Are They? Data quality dimensions are standardized criteria for evaluating datasets, developed by data management professionals in the 1980s-90s. The framework emerged from database management research at MIT and IBM, recognizing that “quality” is multi-dimensional—data can be accurate but incomplete, or timely but inconsistent. The five core dimensions form a comprehensive quality assessment framework.

Historical Context: Richard Wang and Diane Strong (1996) formalized this framework at MIT in “Beyond Accuracy: What Data Quality Means to Data Consumers,” which has become an industry standard for assessing data fitness-for-use.

Why Do They Matter? For aquifer analysis, different quality dimensions have different impacts:

  • Completeness: Missing wells = biased regional assessment
  • Accuracy: Incorrect resistivity = wrong aquifer characterization
  • Consistency: Duplicate records = inflated sample size, statistical errors
  • Timeliness: Outdated data = miss recent drought impacts
  • Validity: Wrong coordinates = wells plotted in wrong locations

Poor quality in any dimension can invalidate analysis results, leading to incorrect management decisions.

Warning: Aquifer-Specific Consequences of Quality Failures

Real-world impacts of poor data quality:

Incomplete Well Network (missing spatial coverage):

  • Risk: Miss drought vulnerability areas; wells only in shallow aquifer zones mean deep confined zones go unmonitored
  • Consequence: During the 2012 drought, unmonitored agricultural areas experienced well failures before the urban monitoring network showed stress signals
  • Management failure: Emergency well permits approved based on incomplete data, leading to over-extraction

Inaccurate Resistivity (HTEM measurement errors):

  • Risk: Misidentify good aquifer zones; clay layers classified as sand due to calibration errors
  • Consequence: Drilling programs target poor aquifer materials, resulting in low-yield wells and wasted investment
  • Financial impact: $50,000-$100,000 per failed well × 10 wells = $500K-$1M loss

Inconsistent Timestamps (formatting or timezone errors):

  • Risk: Cannot correlate precipitation with water-level response
  • Consequence: Precipitation on 9/7/2008 parsed as 7/9/2008 shifts seasonal analysis by 2 months
  • Scientific failure: Recharge estimates off by 30-40%, leading to incorrect sustainable yield calculations

Outdated Weather Data (timeliness failure):

  • Risk: Cannot detect recent climate regime shifts
  • Consequence: Aquifer models calibrated to 1980-2000 precipitation patterns miss 2010-2020 intensification (more extreme events)
  • Policy failure: Water allocation based on outdated recharge rates leads to aquifer depletion

Invalid Coordinates (spatial reference errors):

  • Risk: Wells plotted in wrong locations, breaking spatial analysis
  • Consequence: Kriging interpolates across uncorrelated zones; vulnerability maps show safe areas as at-risk (and vice versa)
  • Legal liability: Contamination plume boundary errors lead to incorrect property restrictions

Key insight: In groundwater management, data quality failures don’t just produce bad numbers—they lead to expensive drilling mistakes, incorrect drought warnings, and flawed sustainability policies.

How Do They Work?

Tip: Concrete Examples: Why Good Quality Needs All Five Dimensions

Before diving into the formal framework, let’s see why one strong dimension doesn’t compensate for weak dimensions:

Example 1: HTEM Survey

  • Completeness: ✅ Excellent (100% spatial coverage, 884 km² mapped)
  • Accuracy: ✅ Excellent (98% of resistivity values within expected range)
  • Consistency: ✅ Excellent (uniform grid, no duplicates)
  • Timeliness: ❌ Weak (single 2021 snapshot, cannot detect aquifer changes)
  • Validity: ✅ Excellent (all coordinates within UTM Zone 16N bounds)

Overall assessment: HTEM is high quality for spatial analysis but cannot support temporal trend analysis. You can map where aquifer materials are, but not how they’re changing.

Example 2: Well Monitoring Network

  • Completeness: ❌ Critical failure (17% of wells operational, only 3 of 18)
  • Accuracy: ✅ Excellent (100% of measurements within valid range, no sensor drift)
  • Consistency: ✅ Excellent (hourly data, no duplicates, no gaps)
  • Timeliness: ✅ Excellent (real-time hourly updates, <1 hour lag)
  • Validity: ✅ Excellent (all timestamps and coordinates correct)

Overall assessment: Wells have perfect measurement quality but fail as a monitoring network due to inadequate spatial coverage. The 3 operational wells give precise data, but you can’t extrapolate 3 points to 884 km².

Example 3: Stream Gauges

  • Completeness: ❌ Failure (22% spatial coverage, only 3 gauges in the 884 km² HTEM area)
  • Accuracy: ✅ Excellent (USGS-grade instruments, 98% within expected range)
  • Consistency: ✅ Excellent (daily records since the 1970s, minimal duplicates)
  • Timeliness: ✅ Excellent (75+ year record captures climate variability)
  • Validity: ✅ Excellent (all discharge values non-negative, coordinates correct)

Overall assessment: Stream data is research-grade quality but has urban monitoring bias. All 3 gauges are in urban watersheds (27.8 mi²), missing agricultural stream-aquifer connections.

The Key Lesson: High scores on accuracy, consistency, and validity cannot compensate for failures in completeness or timeliness. All five dimensions must pass for data to support reliable analysis.

Why this matters:

  • You can have perfect sensors but a broken network (wells)
  • You can have complete coverage but outdated information (HTEM)
  • You can have long records but too few, biased locations (stream gauges)

Bottom line: Data quality is multi-dimensional. One strength doesn’t erase another weakness. You need acceptable scores across all five dimensions for reliable aquifer management.

Warning: ❌ Common Pitfall: Confusing Data Accuracy with Data Adequacy

What researchers often assume: “My measurements are 98% accurate (within valid ranges, no outliers) → My data is high quality → I can perform regional analysis.”

Why this fails: Accuracy without coverage is insufficient. The well network in this study demonstrates the failure mode:

  • Accuracy: 100% (every measurement from the 3 operational wells is valid, no sensor drift)
  • Completeness: 17% (only 3 of 18 wells operational)
  • Network adequacy: FAIL (cannot map a regional water table with 3 points)

Lesson learned: High-quality sensors in a broken network produce high-quality data from too few locations. Before celebrating “98% data accuracy,” ask:

  1. How many measurement locations exist?
  2. Are they spatially distributed adequately for your research question?
  3. What percentage of your advertised network is actually operational?

Real-world example from this study:

  • We initially reported “excellent data quality—zero outliers, continuous hourly monitoring”
  • Then discovered only 3 operational wells out of 18 in the metadata
  • Revised assessment: “Excellent temporal data quality, inadequate spatial coverage”

Better approach: Report quality per dimension with explicit thresholds:

✅ Accuracy: 98% within valid ranges (PASS - threshold 95%)
❌ Completeness: 17% wells operational (FAIL - threshold 70%)
✅ Consistency: 100% continuous (PASS - threshold 90%)
✅ Timeliness: <1 hour lag (PASS - threshold <24 hours)
✅ Validity: 100% format compliance (PASS - threshold 95%)

Overall: HIGH QUALITY DATA, INADEQUATE COVERAGE
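The per-dimension report above can be generated rather than hand-written; a minimal sketch (the numeric timeliness score is an assumption for uniformity, since the audit reports it as lag time rather than a percentage):

```python
# Per-dimension scores and pass thresholds from the well audit above;
# timeliness is expressed as a 0-100 score here for uniformity (an assumption)
scores = {'Accuracy': 98, 'Completeness': 17, 'Consistency': 100,
          'Timeliness': 99, 'Validity': 100}
thresholds = {'Accuracy': 95, 'Completeness': 70, 'Consistency': 90,
              'Timeliness': 95, 'Validity': 95}

def dimension_report(scores, thresholds):
    """One PASS/FAIL verdict per dimension; never averaged into a single score."""
    return [f"{dim}: {score}% "
            f"({'PASS' if score >= thresholds[dim] else 'FAIL'}"
            f" - threshold {thresholds[dim]}%)"
            for dim, score in scores.items()]

for line in dimension_report(scores, thresholds):
    print(line)
```

Keeping the verdicts separate, instead of averaging, makes the limiting dimension impossible to hide.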

Key insight for project planning: A data quality audit that only reports “98% accuracy” is dangerously incomplete. It creates false confidence that leads to designing analyses (e.g., kriging interpolation requiring 10+ wells) that will fail when spatial coverage proves inadequate.

How to avoid this in your own work:

  1. Assess all five quality dimensions separately; don’t average them into a single score
  2. Define analysis-specific thresholds (trend analysis needs different coverage than snapshot mapping)
  3. Report the limiting dimension prominently (e.g., “Spatial coverage is the bottleneck, not accuracy”)
  4. Communicate to non-technical stakeholders: “Perfect sensors in wrong locations = inadequate data”

Each dimension has specific metrics and thresholds:

1. Completeness

  • Metric: % of expected records present, % of fields populated
  • Calculation: Completeness = (Present records / Expected records) × 100%
  • Example: 3 of 18 wells operational = 17% completeness (FAIL)

2. Accuracy

  • Metric: % of values within valid ranges, outlier frequency
  • Calculation: compare values to physical constraints (e.g., resistivity 0-1000 Ω·m)
  • Example: 0.5% of temperature readings below -50°C = accuracy problem

3. Consistency

  • Metric: duplicate rate, conflicting measurements
  • Calculation: Duplicates = (Duplicate records / Total records) × 100%
  • Example: the same well-date-time appearing twice with different values = inconsistency

4. Timeliness

  • Metric: data age, update frequency vs. requirements
  • Calculation: days since last update, measurement frequency
  • Example: hourly sensors with 7-day data gaps = timeliness failure

5. Validity

  • Metric: format compliance, referential integrity
  • Calculation: % of records meeting schema, coordinate bounds
  • Example: timestamps in the wrong format = parsing errors
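The completeness, accuracy, and consistency formulas above translate directly into a few lines of pandas; a minimal sketch (the well counts come from this chapter, the readings are invented for illustration):

```python
import pandas as pd

# Completeness = (present records / expected records) x 100: 3 of 18 wells
wells_expected, wells_present = 18, 3
completeness = wells_present / wells_expected * 100          # ~16.7%

# Accuracy = share of values inside physical bounds (resistivity 0-1000 ohm-m)
resistivity = pd.Series([12.0, 450.0, 999.0, 1200.0])        # one impossible value
accuracy = ((resistivity >= 0) & (resistivity <= 1000)).mean() * 100

# Consistency = duplicate rate over (well, timestamp) keys
readings = pd.DataFrame({'well': ['W1', 'W1', 'W2'],
                         'time': ['2021-06-01', '2021-06-01', '2021-06-01']})
duplicate_rate = readings.duplicated(subset=['well', 'time']).mean() * 100

print(round(completeness, 1), accuracy, round(duplicate_rate, 1))
```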

What Will You See? Quality scorecards showing percentage scores for each dimension across all four data sources. Bar charts comparing scores against threshold lines (typically 70-90% for passing).

How to Interpret Quality Scores

Score Range | Quality Level | Reliability | Management Action
90-100% | Excellent | High confidence | Use for all analyses
70-89% | Good | Moderate confidence | Document limitations
50-69% | Fair | Limited confidence | Use with caution, improvement needed
30-49% | Poor | Low confidence | Major improvements required
<30% | Critical failure | Unreliable | Do not use until fixed

Dimension-specific thresholds:

Dimension | Passing Threshold | Critical Issue
Completeness | >80% records present | <50% = unusable
Accuracy | <5% outliers | >20% outliers = unreliable
Consistency | <1% duplicates | >10% duplicates = serious problem
Timeliness | <90 days old for dynamic systems | >1 year = outdated
Validity | >95% format compliance | <80% = data corruption

The paradox in this dataset: Excellent accuracy (98%) but poor completeness (17% wells operational) = high-quality data from too few locations.


9.3.1 Quality Dimensions

We assess five critical dimensions:

  1. Completeness: Missing values, temporal gaps, spatial coverage
  2. Accuracy: Range violations, outlier detection, cross-validation
  3. Consistency: Duplicate records, conflicting measurements
  4. Timeliness: Data freshness, update frequency
  5. Validity: Format compliance, coordinate bounds, temporal ordering

9.4 Well Network Quality

9.4.1 Data Availability

# Well network statistics
well_quality = pd.DataFrame({
    'Quality Metric': [
        'Wells in Metadata',
        'Wells with Measurements',
        'Data Availability Rate',
        'Total Measurements',
        'Mean Measurement Interval',
        'Wells with Continuous Data',
        'Wells with Gaps >7 days'
    ],
    'Value': [
        '18 wells',
        '3 wells',
        '17% (FAIL)',
        '173,418',
        '~1 hour (automated)',
        '3 (100% of operational)',
        '0 (excellent)'
    ],
    'Status': [
        '⚠️',
        '❌',
        '❌',
        '✅',
        '✅',
        '✅',
        '✅'
    ]
})

well_quality
Quality Metric | Value | Status
Wells in Metadata | 18 wells | ⚠️
Wells with Measurements | 3 wells | ❌
Data Availability Rate | 17% (FAIL) | ❌
Total Measurements | 173,418 | ✅
Mean Measurement Interval | ~1 hour (automated) | ✅
Wells with Continuous Data | 3 (100% of operational) | ✅
Wells with Gaps >7 days | 0 (excellent) | ✅

Critical finding: Only 3 of 18 wells (17%) have ANY measurements. The other 15 exist in metadata only.

Quality paradox: The 3 operational wells have excellent quality (hourly data, zero gaps), but inadequate spatial coverage (3 points cannot represent regional aquifer).


9.5 HTEM Data Quality

9.5.1 Completeness Assessment

# HTEM quality metrics
htem_quality = pd.DataFrame({
    'Quality Metric': [
        'Spatial Coverage',
        'Grid Cells (Unit D)',
        'Missing Values',
        'Resistivity Range Validity',
        'Material Type Classifications',
        'Stratigraphic Units Complete'
    ],
    'Value': [
        '884 km² continuous',
        '~600,000 cells',
        '<1% (excellent)',
        '100% within 0-1000 Ω·m',
        '105 types defined',
        '6 units (A-F)'
    ],
    'Status': [
        '✅',
        '✅',
        '✅',
        '✅',
        '✅',
        '✅'
    ]
})

htem_quality
Quality Metric | Value | Status
Spatial Coverage | 884 km² continuous | ✅
Grid Cells (Unit D) | ~600,000 cells | ✅
Missing Values | <1% (excellent) | ✅
Resistivity Range Validity | 100% within 0-1000 Ω·m | ✅
Material Type Classifications | 105 types defined | ✅
Stratigraphic Units Complete | 6 units (A-F) | ✅

Assessment: HTEM data has excellent quality—comprehensive spatial coverage, minimal missing values, physically plausible resistivity ranges.

Limitation: Single time snapshot (not temporal), indirect measurement (resistivity ≠ permeability)


9.6 Weather Data Quality

# Weather network quality
weather_quality = pd.DataFrame({
    'Quality Metric': [
        'Active Stations',
        'Temporal Coverage',
        'Measurement Frequency',
        'Precipitation Range Validity',
        'Temperature Range Validity',
        'Data Completeness'
    ],
    'Value': [
        '~10 stations (Champaign County)',
        '2012-2025 (13+ years)',
        'Hourly to daily',
        '100% within 0-10 inches/day',
        '100% within -40 to 50°C',
        '>85% (good)'
    ],
    'Status': [
        '✅',
        '✅',
        '✅',
        '✅',
        '✅',
        '✅'
    ]
})

weather_quality
Quality Metric | Value | Status
Active Stations | ~10 stations (Champaign County) | ✅
Temporal Coverage | 2012-2025 (13+ years) | ✅
Measurement Frequency | Hourly to daily | ✅
Precipitation Range Validity | 100% within 0-10 inches/day | ✅
Temperature Range Validity | 100% within -40 to 50°C | ✅
Data Completeness | >85% (good) | ✅

Assessment: WARM weather data has good to excellent quality—research-grade instruments, continuous monitoring, physically plausible ranges.


9.7 Stream Gauge Quality

# Stream gauge network quality
stream_quality = pd.DataFrame({
    'Quality Metric': [
        'Total Gauges',
        'Gauges Inside HTEM',
        'Spatial Coverage of HTEM',
        'Temporal Coverage',
        'Data Completeness',
        'Longest Record'
    ],
    'Value': [
        '9 gauges',
        '3 gauges (urban only)',
        '21.6% (FAIL - need 70%)',
        '1948-2025 (75+ years)',
        '>95% (excellent)',
        '75+ years'
    ],
    'Status': [
        '⚠️',
        '❌',
        '❌',
        '✅',
        '✅',
        '✅'
    ]
})

stream_quality
Quality Metric | Value | Status
Total Gauges | 9 gauges | ⚠️
Gauges Inside HTEM | 3 gauges (urban only) | ❌
Spatial Coverage of HTEM | 21.6% (FAIL - need 70%) | ❌
Temporal Coverage | 1948-2025 (75+ years) | ✅
Data Completeness | >95% (excellent) | ✅
Longest Record | 75+ years | ✅

Assessment: Stream gauge data has excellent temporal quality but inadequate spatial coverage. Urban monitoring bias—all 3 gauges in HTEM are in 27.8 mi² urban watershed.


9.8 Cross-Source Quality

# Summary across all sources
quality_summary = pd.DataFrame({
    'Data Source': [
        'HTEM Survey',
        'Groundwater Wells',
        'Weather Stations',
        'Stream Gauges'
    ],
    'Spatial Coverage': [
        '✅ Excellent (884 km²)',
        '❌ Poor (3 points)',
        '✅ Good (~10 stations)',
        '❌ Poor (22% of area)'
    ],
    'Temporal Coverage': [
        '⚠️ Single snapshot',
        '✅ Excellent (14+ years)',
        '✅ Good (13+ years)',
        '✅ Excellent (75+ years)'
    ],
    'Data Quality': [
        '✅ Excellent',
        '✅ Excellent (where available)',
        '✅ Good to Excellent',
        '✅ Excellent'
    ],
    'Overall Assessment': [
        '✅ PASS',
        '❌ FAIL (availability)',
        '✅ PASS',
        '⚠️ PARTIAL (coverage)'
    ]
})

quality_summary
Data Source | Spatial Coverage | Temporal Coverage | Data Quality | Overall Assessment
HTEM Survey | ✅ Excellent (884 km²) | ⚠️ Single snapshot | ✅ Excellent | ✅ PASS
Groundwater Wells | ❌ Poor (3 points) | ✅ Excellent (14+ years) | ✅ Excellent (where available) | ❌ FAIL (availability)
Weather Stations | ✅ Good (~10 stations) | ✅ Good (13+ years) | ✅ Good to Excellent | ✅ PASS
Stream Gauges | ❌ Poor (22% of area) | ✅ Excellent (75+ years) | ✅ Excellent | ⚠️ PARTIAL (coverage)

9.8.1 Quality Dashboard

Note: How to Read the Quality Scorecard

Before looking at the chart below, understand what you’re seeing:

Visual Elements:

  • Four groups of bars: one for each data source (HTEM, Wells, Weather, Streams)
  • Four bars per source: Spatial Coverage (blue), Temporal Coverage (coral), Data Quality (green), Overall Score (gold)
  • Dashed red line at 70%: the minimum quality threshold for reliable analysis
  • Bar height: higher = better quality (0-100% scale)

What to Look For:

1. Bars Below the Red Line = Quality Failure

  • Any bar below 70% indicates that dimension fails quality standards
  • Wells: Spatial Coverage bar (17%) far below threshold = critical failure
  • Streams: Spatial Coverage bar (22%) below threshold = failure

2. Unequal Bar Heights = Weak Dimensions

  • When bars for the same source have very different heights, it reveals quality imbalance
  • Wells example: Data Quality (100%, tallest) vs Spatial Coverage (17%, shortest) = excellent sensors, broken network
  • HTEM example: Spatial Coverage (100%, tallest) vs Temporal Coverage (20%, shortest) = complete map, but single snapshot

3. Overall Score Hides Individual Weaknesses

  • The gold “Overall Score” bar averages all dimensions and can mask critical failures
  • Wells: Overall 71% (barely passes) but Spatial Coverage 17% (critical failure)
  • Streams: Overall 73% (passes) but misses that urban bias makes coverage unreliable

Specific Interpretation:

HTEM (Group 1):

  • Spatial Coverage: 100% (excellent), complete mapping
  • Temporal Coverage: 20% (failure), a single time point limits change detection
  • Data Quality: 98% (excellent), minimal outliers
  • Overall: 73% (passes), good for spatial analysis, weak for temporal

Wells (Group 2):

  • Spatial Coverage: 17% (critical failure), only 3 of 18 wells operational
  • Temporal Coverage: 95% (excellent), hourly continuous data
  • Data Quality: 100% (excellent), perfect measurements where available
  • Overall: 71% (barely passes), misleading; the severe spatial gap makes the network unreliable

Weather (Group 3):

  • All bars above 80%, the most balanced dataset
  • Spatial Coverage: 80% (good), ~10 stations cover the region
  • Temporal Coverage: 85% (good), 13+ year record
  • Overall: 85% (good), reliable for most analyses

Streams (Group 4):

  • Spatial Coverage: 22% (failure), only 3 gauges in the HTEM area, all urban
  • Temporal Coverage: 100% (excellent), 75+ year records
  • Data Quality: 98% (excellent), USGS-grade instruments
  • Overall: 73% (passes), urban bias not captured in the score

Key Takeaway: Wells and streams show quality paradox—high temporal quality + high data quality cannot compensate for low spatial coverage. You need acceptable scores in all dimensions for reliable analysis.

# Create quality score dashboard
quality_scores = pd.DataFrame({
    'Data Source': ['HTEM', 'Wells', 'Weather', 'Streams'],
    'Spatial Coverage': [100, 17, 80, 22],
    'Temporal Coverage': [20, 95, 85, 100],
    'Data Quality': [98, 100, 90, 98],
    'Overall Score': [73, 71, 85, 73]
})

fig = go.Figure()

metrics = ['Spatial Coverage', 'Temporal Coverage', 'Data Quality', 'Overall Score']
colors = ['steelblue', 'coral', 'mediumseagreen', 'goldenrod']

for metric, color in zip(metrics, colors):
    fig.add_trace(go.Bar(
        name=metric,
        x=quality_scores['Data Source'],
        y=quality_scores[metric],
        marker_color=color,
        hovertemplate='%{x}<br>%{y:.0f}%<extra></extra>'
    ))

fig.add_hline(y=70, line_dash="dash", line_color="red",
              annotation_text="Quality Threshold (70%)",
              annotation_position="right")

fig.update_layout(
    title='Data Quality Scorecard by Source<br><sub>Wells and streams fail spatial coverage threshold</sub>',
    yaxis_title='Quality Score (%)',
    xaxis_title='Data Source',
    barmode='group',
    height=500,
    template='plotly_white'
)

fig.show()
Figure 9.1: Data quality scorecard across all four data sources. The dashed red line indicates the 70% quality threshold. Wells and streams fail spatial coverage despite excellent temporal quality.

9.9 Quality Rules

9.9.1 Domain Quality Checks

Note: Automated Quality Pipeline

Configuration-driven quality checks:

import pandas as pd

class DataQualityPipeline:
    """Configuration-driven quality checks; each raises AssertionError on failure."""

    def check_completeness(self, df, threshold=90):
        # Share of non-null cells across the entire frame, as a percentage
        completeness = 100 * (1 - df.isnull().sum().sum() / df.size)
        assert completeness >= threshold, f"Completeness {completeness:.1f}% < {threshold}%"

    def check_range(self, df, column, min_val, max_val):
        # Count values outside the physically valid window for this column
        out_of_range = ((df[column] < min_val) | (df[column] > max_val)).sum()
        assert out_of_range == 0, f"{out_of_range} values outside range"

    def check_duplicates(self, df, subset=None):
        # Duplicate rows over the given key columns (all columns if subset is None)
        duplicates = df.duplicated(subset=subset).sum()
        assert duplicates == 0, f"{duplicates} duplicate records found"

Quality rules:

  • HTEM resistivity: 0-1000 Ω·m (physical constraint)
  • Well water levels: cannot exceed land surface (unless artesian)
  • Weather precipitation: 0-20 inches/day (extreme events)
  • Stream discharge: must be non-negative
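These domain rules reduce to one vectorized pandas check each; a minimal sketch on toy frames (all column names and values are illustrative, not the project's actual schema):

```python
import pandas as pd

# Toy frames standing in for the real datasets (names illustrative)
htem = pd.DataFrame({'resistivity': [5.0, 320.0, 940.0]})
wells = pd.DataFrame({'water_level_m': [3.2, 7.5], 'land_surface_m': [10.0, 10.0]})
streams = pd.DataFrame({'discharge_cfs': [14.0, 0.0, 82.0]})

# HTEM resistivity must sit inside the physical 0-1000 ohm-m window
assert htem['resistivity'].between(0, 1000).all(), 'resistivity out of range'

# Water-level elevation cannot exceed land surface (artesian wells need a flag)
assert (wells['water_level_m'] <= wells['land_surface_m']).all(), 'level above surface'

# Stream discharge must be non-negative
assert (streams['discharge_cfs'] >= 0).all(), 'negative discharge'

print('all domain checks passed')
```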

What Will You See?

When you run the automated quality pipeline, you’ll see output in JSON or DataFrame format showing:

  • Check name: Which quality rule was tested (e.g., “Completeness Check”, “Range Validation”)
  • Status: PASS/FAIL/WARNING for each check
  • Score: Numeric quality score (0-100%)
  • Details: Count of violations (e.g., “12 out-of-range values detected”)
  • Timestamp: When the check was run
  • Affected records: Specific rows that failed (if any)

The pipeline runs sequentially through all checks and generates a final quality report summarizing pass/fail counts across all dimensions.
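A minimal runner matching that description, turning each check into a PASS/FAIL report row (function names and report columns are assumptions, not the project's actual pipeline):

```python
from datetime import datetime, timezone

import pandas as pd

def run_checks(df, checks):
    """Run (name, callable) checks in order; assertion errors become FAIL rows."""
    rows = []
    for name, check in checks:
        try:
            check(df)
            rows.append({'check': name, 'status': 'PASS', 'details': ''})
        except AssertionError as err:
            rows.append({'check': name, 'status': 'FAIL', 'details': str(err)})
    report = pd.DataFrame(rows)
    report['timestamp'] = datetime.now(timezone.utc).isoformat(timespec='seconds')
    return report

def check_complete(d):
    assert not d.isnull().any().any(), 'null cells present'

def check_range(d):
    assert (d['discharge'].dropna() >= 0).all(), 'negative discharge found'

# Toy frame: one null cell, discharge values all non-negative
df = pd.DataFrame({'discharge': [3.0, None, 7.5]})
report = run_checks(df, [('Completeness Check', check_complete),
                         ('Range Validation', check_range)])
print(report[['check', 'status', 'details']])
```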

How to Interpret Quality Pipeline Results

Use these thresholds to determine if data passes quality checks:

Check Type | Pass Threshold | Warning Threshold | Fail Threshold | Action if Failed
Completeness | ≥90% non-null | 70-89% non-null | <70% non-null | Flag for review, consider imputation or exclusion
Range Validity | 0 violations | 1-5 violations | >5 violations | Investigate measurement errors, sensor calibration
Duplicates | 0 duplicates | 1-10 duplicates | >10 duplicates | De-duplicate, check data ingestion pipeline
Temporal Order | All sequential | 1-2 out-of-order | >2 out-of-order | Fix timestamps, check system clock
Cross-Source | <5% disagreement | 5-15% disagreement | >15% disagreement | Investigate systematic bias, recalibrate sensors

Example interpretation:

  • HTEM resistivity check: 0 violations = PASS (excellent)
  • Well completeness: 17% operational = FAIL (critical; expand network)
  • Weather range check: 3 outliers = WARNING (review but acceptable)
  • Stream duplicates: 0 duplicates = PASS (excellent)

Management workflow:

  1. PASS: data ready for analysis, no action needed
  2. WARNING: document limitations, proceed with caution
  3. FAIL: do not use until fixed, prioritize corrections
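The PASS/WARNING/FAIL bands above can be encoded as a small classifier; a minimal sketch with band edges taken from the thresholds table (the function name and argument shape are assumptions):

```python
def classify_check(check_type, violations=None, pct=None):
    """Map a check result onto PASS/WARNING/FAIL bands (subset of check types)."""
    if check_type == 'completeness':            # judged on % non-null
        if pct >= 90:
            return 'PASS'
        return 'WARNING' if pct >= 70 else 'FAIL'
    if check_type == 'range':                   # judged on violation count
        if violations == 0:
            return 'PASS'
        return 'WARNING' if violations <= 5 else 'FAIL'
    if check_type == 'duplicates':
        if violations == 0:
            return 'PASS'
        return 'WARNING' if violations <= 10 else 'FAIL'
    raise ValueError(f'unknown check type: {check_type}')

print(classify_check('range', violations=0))        # HTEM resistivity
print(classify_check('completeness', pct=17))       # well network
print(classify_check('range', violations=3))        # weather outliers
```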

Note: Understanding Data Leakage in Spatial-Temporal Analysis

What Is It? Data leakage occurs when information from outside the training dataset improperly influences model training, causing artificially inflated performance that doesn’t generalize. The term emerged in machine learning in the 2000s as researchers realized that naive data splitting creates unrealistic validation scenarios. In spatial-temporal datasets like aquifer monitoring, leakage is particularly insidious because correlation structure violates traditional ML assumptions.

Historical Context: Kaufman, Rosset, and Perlich’s KDD 2011 paper “Leakage in Data Mining” exposed how temporal leakage inflates competition leaderboard scores but fails in production. Spatial leakage became recognized later through geospatial ML research showing that testing on nearby points amounts to “cheating.”

Why Does It Matter? Leakage makes models appear to work when they don’t:

  • Temporal leakage: Model “predicts” 2020 water levels using 2021 precipitation (impossible in practice)
  • Spatial leakage: Model learns from nearby wells, then “predicts” the same well cluster
  • Target leakage: Model uses feature that contains the answer (e.g., using well depth to predict aquifer depth)

Result: 90% validation accuracy, 50% deployment accuracy = costly failure

How Does It Work?

Types of leakage in aquifer data:

1. Temporal Leakage

# ❌ WRONG: Random split shuffles time
from sklearn.model_selection import train_test_split
X_train, X_test = train_test_split(data, test_size=0.2)  # Mixes past and future!

# ✅ RIGHT: Time-based split
cutoff_date = '2020-01-01'
X_train = data[data['date'] < cutoff_date]  # Only past
X_test = data[data['date'] >= cutoff_date]  # Only future

Why it matters: In practice, you can’t know future precipitation when predicting water levels!

2. Spatial Leakage

# ❌ WRONG: Random k-fold (nearby points in train and test)
from sklearn.model_selection import KFold
kfold = KFold(n_splits=5)  # Wells 1km apart may be in different folds!

# ✅ RIGHT: Spatial blocking (column names illustrative)
# Assign each well to a 10 km grid cell; entire cells go to train OR test
data['block'] = (data['easting'] // 10_000).astype(int).astype(str) + '_' + \
                (data['northing'] // 10_000).astype(int).astype(str)
test_blocks = data['block'].drop_duplicates().sample(frac=0.2, random_state=0)
train = data[~data['block'].isin(test_blocks)]   # blocks kept apart from test cells
test = data[data['block'].isin(test_blocks)]

Why it matters: Nearby wells (< variogram range) are correlated. Testing on nearby wells isn’t truly independent validation.

3. Target Leakage

# ❌ WRONG: Using future information
features = ['precip_last_30_days', 'precip_next_7_days']  # "next" is cheating!

# ✅ RIGHT: Only use past and present
features = ['precip_last_30_days', 'precip_last_90_days']  # Only historical

What Will You See?

If you have leakage, you’ll see:

  • Unrealistically high validation scores (>95% R² on noisy data)
  • Performance that collapses when deployed
  • Simple models outperforming complex ones (suspicious!)
  • Feature importance that doesn’t make physical sense

How to Prevent Leakage

Leakage Type | Prevention Strategy | Validation Approach
Temporal | Time-series split | Always train on past, test on future
Spatial | Spatial blocking | Test blocks >2× variogram range from training
Target | Feature audit | Remove any feature computed after the target
Preprocessing | Fit on train only | Never use test-set statistics (mean, std, etc.)

Critical for this dataset:

  • Wells <5-10 km apart are spatially correlated (variogram range)
  • Water levels lag precipitation by weeks (temporal correlation)
  • Use spatial-temporal blocking: hold out well clusters AND time periods

Note: 💻 For Computer Scientists

Data Quality for ML Practitioners:

Cross-Validation Strategy:

# DON'T: Random k-fold (violates spatial/temporal structure)
from sklearn.model_selection import KFold  # ❌

# DO: Spatial blocking or time-series split
from sklearn.model_selection import TimeSeriesSplit  # ✅
# Or custom spatial blocks based on well clusters

Missing Data Handling:

  • Not Missing at Random (NMAR): 15 of 18 wells have zero data; this is systematic, not random
  • Imputation risks: don’t impute spatial gaps; you’d be inventing data where none exists
  • Honest reporting: report coverage alongside accuracy metrics

Quality Metrics for Geospatial ML:

  • Standard accuracy hides spatial bias (the model may only work where you have data)
  • Report metrics stratified by region/distance from training points
  • Check for spatial autocorrelation in residuals (Moran’s I)
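The time-series split recommended above can be sketched without any library, as an expanding window that always trains on the past (a simplified analogue of scikit-learn's TimeSeriesSplit):

```python
# Expanding-window time split: each fold trains on the past, tests on the next block
def time_series_folds(n_samples, n_splits):
    """Yield (train_idx, test_idx) pairs; train indices always precede test indices."""
    fold = n_samples // (n_splits + 1)
    for k in range(1, n_splits + 1):
        train = list(range(0, k * fold))
        test = list(range(k * fold, min((k + 1) * fold, n_samples)))
        yield train, test

# 12 monthly observations, 3 folds: the training window grows, never looks ahead
for train, test in time_series_folds(12, 3):
    print(f'train {train[0]}-{train[-1]} | test {test[0]}-{test[-1]}')
```

Because every test index is strictly later than every training index, the "predict 2020 from 2021" failure mode described above cannot occur.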


9.10 Known Issues

9.10.1 Well Data Sparsity

Problem: 15 of 18 wells have zero measurements

Possible root causes:

  • Wells not yet operational (under construction?)
  • Data not finalized (QC review?)
  • Different database (historical archives?)
  • Wells decommissioned (metadata not updated?)

Impact: Regional spatial analysis impossible

Action: Contact Illinois State Water Survey to clarify status

9.10.2 Stream Gauge Bias

Problem: All 3 gauges in HTEM are in urban watershed

Root cause: Urban monitoring priorities (flood hazards, infrastructure)

Impact: Cannot assess agricultural stream-aquifer connectivity

Action: Install gauges in agricultural watersheds

9.10.3 HTEM Temporal Snapshot

Problem: Single time point (2021 survey)

Root cause: Cost constraint (HTEM surveys expensive)

Impact: Cannot detect aquifer changes over time

Mitigation: Integrate with well time series for temporal dimension


9.11 Quality Improvements

9.11.1 Immediate Actions

  1. Update metadata accuracy
    • Tag wells as operational/planned/inactive
    • Document data availability vs. existence
  2. Implement automated quality checks
    • Run quality pipeline on data updates
    • Alert on threshold violations
  3. Document known issues
    • Create issue tracker for data problems
    • Log resolution status

9.11.2 Short-term Actions

  1. Expand monitoring networks
    • Activate 7+ additional wells
    • Install 3-5 stream gauges in agricultural areas
  2. Enhance data validation
    • Cross-validate HTEM with well logs
    • Compare stream base flow with well water levels
  3. Implement version control
    • Track data changes over time
    • Enable rollback for corrupted updates

9.11.3 Long-term Actions

  1. Continuous quality monitoring
    • Real-time quality dashboards
    • Automated anomaly detection
  2. Multi-source validation
    • Bayesian data fusion with uncertainty quantification
    • Flag discrepancies between sources

9.12 Quality Metrics

Note: Understanding Network Redundancy

What Is It? Network redundancy refers to having backup or overlapping measurements so that failure of a single sensor doesn’t create a critical data gap. The concept comes from engineering reliability theory (1950s-60s), where redundant systems ensure continued operation despite component failures. In monitoring networks, redundancy means multiple wells/gauges can measure the same aquifer condition.

Historical Context: The U.S. Geological Survey established redundancy principles for hydrologic monitoring networks in the 1970s, recognizing that single-point failures (well failures, equipment malfunctions) shouldn’t compromise regional assessments.

Why Does It Matter? Redundancy protects against:

  • Equipment failure: Sensor malfunctions, power outages, vandalism
  • Measurement gaps: Maintenance periods, communication failures
  • Data quality issues: Outliers can be identified by comparing to nearby wells
  • Regional representation: Single well may not represent larger area

The risk: If you have only one well in a region and it fails, you have zero data. With three wells, losing one leaves 67% coverage.

How Does It Work?

Redundancy strategies:

  1. Spatial redundancy: Multiple wells within same aquifer zone
    • Provides cross-validation
    • Enables outlier detection
    • Reduces spatial uncertainty
  2. Temporal redundancy: Overlapping measurement periods
    • Enables data gap filling
    • Validates long-term trends
  3. Parameter redundancy: Multiple measurement types
    • HTEM + wells = cross-validation of aquifer structure
    • Streams + wells = validate aquifer-surface water connection

Optimal redundancy:

  • 2-3 wells per spatial unit (e.g., per 10 km × 10 km cell)
  • Spacing less than half the variogram range
  • Balance cost against failure risk
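
The spatial-redundancy rule of thumb (2-3 wells per 10 km × 10 km cell) suggests a simple score: the fraction of occupied grid cells that hold at least two wells. The coordinates below are hypothetical, not the actual network.

```python
import pandas as pd

# Hypothetical well coordinates in km; grid cells are 10 km x 10 km
wells = pd.DataFrame({
    'well_id': ['W1', 'W2', 'W3', 'W4'],
    'x_km': [3.0, 7.0, 14.0, 26.0],
    'y_km': [2.0, 4.0, 5.0, 8.0],
})

# Assign each well to a grid cell via integer division of its coordinates
cell = ((wells['x_km'] // 10).astype(int).astype(str) + '_'
        + (wells['y_km'] // 10).astype(int).astype(str))
counts = wells.groupby(cell)['well_id'].count()

# A cell is "redundant" if it holds at least 2 wells
redundancy_score = (counts >= 2).mean()
print(counts.to_dict(), f'redundancy={redundancy_score:.0%}')
```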

What Will You See?

The quality report includes a “Network Redundancy” score showing how well the monitoring network is protected against single-point failures.

How to Interpret Redundancy Scores

| Redundancy Score | Network Status | Failure Risk | Management Action |
|---|---|---|---|
| >80% | Excellent redundancy | Low risk | Maintain network |
| 60-80% | Good redundancy | Moderate risk | Acceptable for most uses |
| 40-60% | Minimal redundancy | High risk | Add backup wells/gauges |
| 20-40% | Poor redundancy | Critical risk | Single failures = regional data loss |
| <20% | No redundancy | Extreme risk | Network inadequate, urgent expansion needed |

This dataset: 45% redundancy (poor)

  • Only 3 operational wells (no redundancy per location)
  • Stream gauges don’t overlap spatially
  • Loss of 1 well = 33% data loss
  • Action: Activate 5+ additional wells for redundancy

Redundancy vs. Coverage:

  • Coverage: Do we measure all regions?
  • Redundancy: Can we tolerate failures?
  • Both are needed: good coverage with no redundancy = fragile network

# Final quality report
final_report = pd.DataFrame({
    'Dimension': [
        'Overall Completeness',
        'Spatial Coverage Adequacy',
        'Temporal Coverage Adequacy',
        'Data Accuracy',
        'Network Redundancy',
        'Integration Readiness'
    ],
    'Score': [
        '78% (partial)',
        '62% (fail)',
        '88% (good)',
        '96% (excellent)',
        '45% (poor)',
        '75% (good)'
    ],
    'Key Issue': [
        '15 wells missing',
        'Stream gauge gaps',
        'HTEM single snapshot',
        'Minimal outliers',
        'Over-reliance on single wells',
        'Coordinate alignment needed'
    ],
    'Priority': [
        'High',
        'High',
        'Medium',
        'Low',
        'High',
        'Medium'
    ]
})

final_report
| | Dimension | Score | Key Issue | Priority |
|---|---|---|---|---|
| 0 | Overall Completeness | 78% (partial) | 15 wells missing | High |
| 1 | Spatial Coverage Adequacy | 62% (fail) | Stream gauge gaps | High |
| 2 | Temporal Coverage Adequacy | 88% (good) | HTEM single snapshot | Medium |
| 3 | Data Accuracy | 96% (excellent) | Minimal outliers | Low |
| 4 | Network Redundancy | 45% (poor) | Over-reliance on single wells | High |
| 5 | Integration Readiness | 75% (good) | Coordinate alignment needed | Medium |
Important🎯 Quality Assessment Conclusions

9.12.1 Strengths

  • Data accuracy: All sources have excellent accuracy (minimal outliers, valid ranges)
  • Temporal resolution: Wells and streams provide continuous high-frequency data
  • HTEM coverage: Unparalleled spatial mapping of aquifer structure

9.12.2 Critical Weaknesses

  • Well spatial coverage: Only 3 operational wells (17% of metadata) - FAIL
  • Stream spatial coverage: Only 22% of HTEM area covered - FAIL
  • Network redundancy: Over-reliance on single wells/gauges - HIGH RISK

9.12.3 Overall Assessment

Data quality: ✅ PASS (accuracy, consistency, validity all excellent)

Data availability: ❌ FAIL (spatial coverage inadequate for regional analysis)

Bottom line: We have high-quality data from too few locations. Expansion of monitoring networks is the #1 priority for robust regional aquifer analysis.


9.13 Dependencies & Outputs

  • Data sources: All four (HTEM, wells, weather, streams)
  • Quality tools: pandas, numpy for statistical checks
  • Outputs: Quality scorecards, issue logs, validation reports

To run quality pipeline:

from src.data_loaders import IntegratedDataLoader

loader = IntegratedDataLoader()
data = loader.load()  # loader method name may differ; returns the merged DataFrame

# Valid physical range for the measured variable (set per data source)
min_val, max_val = 0, 500

# Check completeness: fraction of non-null cells across the whole table
completeness = 1 - data.isnull().sum().sum() / data.size

# Check ranges: keep only rows inside the valid physical range
valid = data[(data['value'] >= min_val) & (data['value'] <= max_val)]

# Check duplicates: repeated (X, Y, Z) coordinates indicate double entries
duplicates = data.duplicated(subset=['X', 'Y', 'Z']).sum()

9.14 Summary

The data quality audit reveals a paradox: excellent data accuracy but inadequate spatial coverage.

  • Data accuracy excellent - Minimal outliers, valid ranges, consistent formats
  • Temporal resolution good - Continuous high-frequency measurements from wells and streams
  • HTEM coverage unparalleled - Complete spatial mapping of aquifer structure
  • Well coverage fails - Only 3 operational wells (17% of metadata)
  • Stream coverage inadequate - Only 22% of HTEM area covered
  • Network redundancy high risk - Over-reliance on single monitoring points

Key Insight: We have high-quality data from too few locations. Expanding monitoring networks is the #1 priority for robust regional aquifer analysis.


9.15 Reflection Questions

  • Looking at the quality scorecard, which dimension (spatial coverage, temporal coverage, or accuracy) do you think is most limiting for regional aquifer analysis, and why?
  • If you had resources to address only one quality issue in the next year (for example, activating wells vs adding stream gauges), which would you choose, and how would you justify the choice to stakeholders?
  • How would you incorporate these quality assessments into model evaluation (for example, reporting performance separately in well-instrumented vs poorly-instrumented areas)?