3  Data Foundations Overview

Building a Multi-Source Intelligence System

Tip: For Newcomers

You will get:

  • A plain-language tour of the four core data sources (HTEM, wells, weather, streams).
  • A sense of what each source can and cannot tell you about the aquifer.
  • Pointers to beginner-friendly chapters to start with.

You should already know (helpful but not required):

  • A very basic idea of what a map and a time series are.
  • How to skim code without understanding every line.

This part is a safe starting point if you have no water background. You can skim code blocks and focus on the explanations and figures.

3.1 What You Will Learn in This Chapter

By the end of this chapter, you will be able to:

  • Describe the four core data sources (HTEM, wells, weather, streams) and what each contributes to understanding the aquifer.
  • Explain why no single dataset is sufficient on its own, and why combining independent sources reduces uncertainty.
  • Identify which Part 1 chapters to read next based on your background (computer science, hydrogeology, or general curiosity).
  • Know where to look for terminology help, data details, and common questions (Terminology Translation, Data Dictionary, FAQ).

3.2 The Foundation of Knowledge

Note 📘 Why Foundations Matter First

What Is It? The “foundation” in data science means establishing reliable baseline knowledge before attempting complex analyses. This concept mirrors the scientific method formalized by Francis Bacon (1620s): observation before theory, data before models.

Why Does It Matter? Starting with foundations prevents costly mistakes:

  • Prevents wasted effort: Building models on flawed data yields flawed results
  • Establishes trust: Stakeholders need confidence in data quality before trusting predictions
  • Guides priorities: Understanding data gaps reveals where to focus improvement efforts

How Does It Work in This Part? We systematically inventory and validate each data source:

  1. What exists: Catalog all available measurements (HTEM, wells, weather, streams)
  2. What’s reliable: Assess data quality, gaps, and limitations
  3. What’s missing: Identify critical coverage gaps that limit analysis

Management Implication: Time spent on foundations (Part 1) prevents expensive errors in operations (Part 5). A faulty foundation means every subsequent analysis inherits the same flaws.

Understanding an aquifer requires assembling evidence from multiple independent sources—like detectives gathering clues from witnesses, crime scenes, and forensics. No single data source tells the complete story. Each reveals a different facet of the subsurface world:

  • HTEM geophysics: The aquifer’s spatial structure and material properties
  • Monitoring wells: Direct measurements of water levels over time
  • Weather stations: The climate forcing that drives recharge
  • Stream gauges: The aquifer’s visible discharge at the surface
  • Data quality audits: Verification that our evidence is reliable

This part establishes the data foundations for everything that follows. We’ll inventory what data exists, assess its quality, and understand its strengths and limitations.

3.3 Why Multiple Data Sources

Note 💻 For Computer Scientists

Multi-Source Data Fusion = Ensemble Learning for Datasets

Just as ensemble models often beat single models by combining diverse predictions, multi-source data fusion beats single-source analysis by combining complementary measurements.

Think of each dataset as a weak estimator of the aquifer state with its own uncertainty:

  • HTEM: excellent spatial coverage, but only a single time snapshot and indirect physics.
  • Wells: excellent temporal resolution, but sparse spatial network.
  • Streams: continuous data integrating the whole watershed, but only indirectly tied to the aquifer.

When their errors are at least partly independent, combining them reduces overall uncertainty. A familiar way to express this idea (for independent uncertainties) is:

\[ \sigma_{\text{combined}}^{-2} \;=\; \sigma_{\text{HTEM}}^{-2} \;+\; \sigma_{\text{well}}^{-2} \;+\; \sigma_{\text{stream}}^{-2} \]

You do not need this formula in detail—the key idea is that independent information sources add up in precision, so disagreement between sources flags bias, and agreement across sources increases confidence.
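The idea can be checked numerically. Below is a minimal sketch of inverse-variance combination; the three uncertainty values (in meters of head) are made up purely for illustration:

```python
import math

def combine_uncertainties(sigmas):
    """Combine independent 1-sigma uncertainties by inverse-variance weighting."""
    total_precision = sum(1.0 / s ** 2 for s in sigmas)
    return 1.0 / math.sqrt(total_precision)

# Made-up 1-sigma uncertainties for one location, in meters of head
sigma_htem, sigma_well, sigma_stream = 2.0, 0.5, 1.5

combined = combine_uncertainties([sigma_htem, sigma_well, sigma_stream])
print(f"combined sigma = {combined:.3f} m")  # 0.462 m, below the best single source (0.5 m)
```

Even the noisier sources tighten the estimate: the combined uncertainty is always smaller than the smallest individual one, which is the "precision adds up" idea in the formula above.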

Tip 🌍 For Hydrologists

The Subsurface Detective Story:

Imagine trying to map a buried river valley using only surface clues. You might:

  1. Walk the landscape (HTEM) – See subtle depressions, infer buried channels
  2. Dig test pits (Wells) – Confirm sand/clay layers at specific points
  3. Watch springs (Streams) – See where groundwater emerges
  4. Check rainfall records (Weather) – Understand recharge timing

Each method has limitations:

  • Walking reveals patterns but not depths
  • Test pits are accurate but spatially sparse
  • Springs integrate large areas but obscure details
  • Rainfall doesn’t directly tell you infiltration

Together, these sources triangulate the truth. Contradictions reveal errors or new discoveries.

Note: What Will You See?

The table below compares advertised capacity (what exists on paper) with operational reality (what is actually collecting data right now).

You’ll see:

  • Four rows: one for each data source (Wells, HTEM, Weather, Streams)
  • Three columns:
      • “Advertised” = number of sensors that exist in metadata or on maps
      • “Operational” = number actually recording data right now
      • “Coverage %” = operational divided by advertised (how much of the network is working)
  • Color-coded status: red highlights indicate critical failures (<50% coverage)

Look for discrepancies: Large gaps between “Advertised” and “Operational” reveal network infrastructure problems that limit analysis capabilities.

Note: How to Interpret Coverage Values

Use this table to assess whether the monitoring network is adequate for different analysis types:

| Coverage % | Uncertainty Reduction | Analysis Capability | Management Reliability |
|------------|----------------------|---------------------|------------------------|
| 90-100% | High confidence | Full spatial analysis, regional mapping | Comprehensive monitoring, early warning possible |
| 50-89% | Moderate confidence | Regional patterns detectable, some blind spots | Acceptable for planning, limited early warning |
| 20-49% | Low confidence | Limited spatial analysis, point observations only | Major coverage gaps, reactive management only |
| <20% | Very low confidence | Cannot perform spatial analysis | Critical network failure, urgent expansion needed |

Interpreting this dataset:

  • Wells (17%): Cannot map regional water levels - only 3 isolated point measurements
  • HTEM (100%): Can map aquifer structure across entire region
  • Weather (90%): Can characterize regional precipitation patterns
  • Streams (22%): Cannot assess aquifer-stream connectivity regionally

Management implication: because the well and stream networks are too sparse for temporal-spatial integration, analyses over-rely on HTEM, which captures only a single time snapshot.

Warning ⚠️ Critical: Understanding “How Many Wells?”

You’ll see different well counts throughout this book. Here’s what each number means:

| Count | What It Means | Use For |
|-------|---------------|---------|
| 356 wells | Total wells in metadata (OB_LOCATIONS table) | Historical documentation, site planning |
| 18 wells | Wells with ANY water level measurements | Basic analysis, data exploration |
| 3 wells | Wells with substantial continuous records (10+ years, daily data) | Trend analysis, forecasting, ML training |

Why the discrepancy?

  • Many wells were drilled decades ago and are no longer monitored
  • Some wells have only 1-10 measurements (single site visits)
  • Only 3 wells have the continuous, long-term records needed for time series analysis

Bottom line: For spatial mapping, we can use all 18 wells with measurements. For trend detection and forecasting, we can only use 3 wells with continuous records. This limitation shapes what analyses are possible throughout Parts 3-5.

See Well Network Analysis for the complete breakdown.
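The three-tier filtering described above can be sketched in a few lines. The well IDs, dates, counts, and the exact "roughly daily" density threshold below are hypothetical stand-ins, not the real OB_LOCATIONS contents:

```python
from datetime import date

# Hypothetical well records: (well_id, first_date, last_date, n_measurements).
# All IDs, dates, and counts are illustrative stand-ins.
wells = [
    ("W-001", date(2009, 1, 1), date(2022, 12, 31), 120_000),
    ("W-002", date(2015, 6, 1), date(2015, 6, 1), 1),
    ("W-003", date(2010, 3, 1), date(2022, 12, 31), 110_000),
    ("W-004", date(1975, 5, 1), date(1975, 5, 1), 0),
]

def record_years(first, last):
    """Approximate record length in years."""
    return (last - first).days / 365.25

# Tier 2: wells with ANY measurements (the "18 wells" tier)
measured = [w for w in wells if w[3] > 0]

# Tier 3: substantial continuous records: 10+ years at roughly daily density
continuous = [
    w for w in measured
    if record_years(w[1], w[2]) >= 10
    and w[3] >= record_years(w[1], w[2]) * 365
]

print(f"{len(wells)} in metadata, {len(measured)} with data, {len(continuous)} continuous")
```

The three printed counts mirror the 356 / 18 / 3 breakdown in the table above: each tier is a strict subset of the one before it.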

3.4 The Four Core Datasets

Note 📘 Understanding the What/Why/How Framework

What Is It? The What/Why/How framework is a structured approach for evaluating data sources. Developed in information science, it asks three questions: What does this data measure? Why is it valuable for our goals? How do we access and use it?

Why Does It Matter? This framework prevents common pitfalls:

  • Misunderstanding scope: Knowing “what” prevents using data beyond its design purpose
  • Overlooking value: Understanding “why” reveals connections between seemingly unrelated datasets
  • Access barriers: Clarifying “how” identifies practical constraints (format, tools, permissions)

How to Apply This Framework:

For each data source below, notice:

  1. What: Physical measurements (resistivity, water levels, rainfall, discharge)
  2. Why: Each reveals different aquifer characteristics (structure, dynamics, inputs, outputs)
  3. How: Technical details (coverage, resolution, temporal span, limitations)

Management Insight: The four datasets complement each other—HTEM shows structure, wells show dynamics, weather shows inputs, streams show outputs. Missing any one creates blind spots.

3.4.1 1. HTEM Survey: Subsurface Structure

What: Helicopter electromagnetic survey measuring earth resistivity
Coverage: 884 km² continuous spatial coverage
Resolution: 50 m horizontal, variable vertical
Temporal: Single snapshot (2021 survey)

Strengths: Unparalleled spatial coverage reveals aquifer boundaries, thickness, material type
Limitations: Indirect measurement (resistivity ≠ permeability), single time point

Chapters:

  • HTEM Survey Overview – 2D and 3D resistivity grids across 6 stratigraphic units.
  • Subsurface 3D Model – Material classification and aquifer structure.

3.4.2 2. Groundwater Wells: Direct Measurements

What: Automated pressure transducers measuring water levels
Coverage: 18 wells (only 3 operational)
Resolution: Hourly measurements
Temporal: 2009-2022 (longest well: ~14 years)

Strengths: Direct measurement of aquifer response, high temporal resolution
Limitations: Spatially sparse (3 points), severe network gaps

Chapter:

  • Well Network Analysis – Measurement frequency, spatial coverage, temporal patterns.

3.4.3 3. Weather Stations: Climate Forcing

What: WARM network agricultural weather stations
Coverage: ~10 stations in/near Champaign County
Resolution: Hourly climate data
Temporal: 2012-2025 (13+ years)

Strengths: Quantifies precipitation (recharge source) and ET (loss mechanism)
Limitations: Point measurements may miss spatial variability in precipitation

Chapter:

  • Weather Station Data – Precipitation, temperature, water balance analysis.

3.4.4 4. USGS Stream Gauges

What: Continuous stream discharge monitoring
Coverage: 9 gauges (only 3 inside study area)
Resolution: 15-minute discharge records
Temporal: 1948-2025 (up to 75+ years)

Strengths: Longest records, base flow reveals aquifer storage, continuous monitoring
Limitations: Urban monitoring bias, indirect aquifer measurement

Chapter:

  • Stream Gauge Network – Flow duration curves, baseflow analysis, spatial coverage.

3.4.5 5. Data Quality Assessment

What: Systematic evaluation of data completeness, accuracy, consistency
Coverage: All four primary data sources
Purpose: Identify gaps, outliers, errors before analysis

Chapter:

  • Data Quality Audit – Completeness metrics, temporal gaps, validation checks.
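A completeness metric of the kind used in such an audit can be sketched in pure Python; the function and the ten-day window below are illustrative assumptions, not the audit's actual code:

```python
from datetime import date, timedelta

def completeness(observed_dates, start, end):
    """Fraction of expected daily timestamps actually present in a record."""
    expected = (end - start).days + 1
    window = {start + timedelta(days=i) for i in range(expected)}
    return len(window & set(observed_dates)) / expected

# Hypothetical daily record with a two-day gap in a ten-day window
obs = [date(2021, 6, d) for d in range(1, 11) if d not in (4, 5)]
print(f"completeness = {completeness(obs, date(2021, 6, 1), date(2021, 6, 10)):.0%}")
```

Eight of ten expected days are present, so the record scores 80%; the same idea scales from a ten-day toy window to a multi-year hourly record.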

3.5 Critical Quality Issues Discovered

Note: Understanding Data Coverage Metrics

What Is Coverage?

Data coverage refers to how much of the study area we can actually monitor with operational sensors. Think of it like cell phone towers—having 100 towers on a map doesn’t help if only 3 are actually transmitting signals.

Why Does Coverage Matter?

Low coverage creates “blind spots” where we cannot:

  • Detect localized problems (contamination, declining water levels)
  • Validate model predictions
  • Understand spatial variability

How to Interpret the Coverage Table:

The table below compares advertised capacity (what exists on paper) vs. operational reality (what’s actually collecting data).

| Coverage % | Quality Rating | Analysis Impact | Management Implication |
|------------|----------------|-----------------|------------------------|
| 90-100% | Excellent | Full spatial analysis possible | Comprehensive monitoring |
| 50-89% | Good | Regional patterns detectable | Some blind spots exist |
| 20-49% | Fair | Limited spatial analysis | Major coverage gaps |
| <20% | Poor | Point observations only | Critical network failure |

What Will You See?

The table shows four data sources with their operational status. Look for:

  • Discrepancies: Large gaps between “Advertised” and “Operational” indicate network problems
  • Coverage %: Below 50% means spatial analysis is severely limited
  • Critical infrastructure: Which data sources are failing?

Through systematic assessment, we identified severe limitations in monitoring network coverage:

| Data Source | Advertised | Operational | Coverage |
|-------------|------------|-------------|----------|
| Wells | 18 wells | 3 wells | 17% |
| HTEM | 884 km² | 884 km² | 100% |
| Weather | ~10 stations | ~10 stations | ~90% |
| Streams | 9 gauges | 3 inside study area | 22% |
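The well and HTEM figures above can be reproduced with a short helper whose bands follow the interpretation table (HTEM is treated as a single fully operational unit, an assumption of this sketch). Note that the 22% stream figure is evidently not a simple gauge count (3 of 9 would be about 33%), so streams are omitted here:

```python
def coverage_pct(operational, advertised):
    """Operational sensors as a percentage of advertised capacity."""
    return 100.0 * operational / advertised

def quality_rating(pct):
    """Map a coverage percentage onto the quality bands from the table."""
    if pct >= 90:
        return "Excellent"
    if pct >= 50:
        return "Good"
    if pct >= 20:
        return "Fair"
    return "Poor"

# (advertised, operational) counts from the coverage table above
network = {"Wells": (18, 3), "HTEM": (1, 1)}

for name, (advertised, operational) in network.items():
    pct = coverage_pct(operational, advertised)
    print(f"{name}: {pct:.0f}% ({quality_rating(pct)})")
```

Running this prints the Wells row at 17% in the "Poor" band and HTEM at 100% in the "Excellent" band, matching the interpretation that follows.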

Key Finding: Both well and stream networks have severe spatial coverage gaps. Only HTEM provides comprehensive spatial coverage. This limits integrated analysis to the few locations where multiple data sources overlap.

Interpreting This Table:

  • Wells (17% coverage): Critical failure—can only monitor 3 points, making regional water level mapping impossible
  • HTEM (100% coverage): Success—continuous subsurface mapping across entire region
  • Weather (90% coverage): Good—sufficient for regional precipitation patterns
  • Streams (22% coverage): Poor—most gauges outside study area, limits aquifer-stream interaction analysis

Management Implication: The well and stream network gaps force over-reliance on HTEM spatial data, which only captures a single time snapshot. Temporal dynamics can only be studied at 3 isolated points.

3.6 Integration Strategy for Subsequent Parts

Note 📘 Understanding the Integration Framework

What Is It? Data integration combines multiple sources to create unified datasets that enable analyses impossible with single sources alone. The concept emerged in the 1990s with enterprise data warehousing, but the core principle—synthesizing diverse information—dates to early scientific method.

Why Does It Matter? Integration unlocks value beyond individual sources:

  • Cross-validation: HTEM predictions validated by well measurements
  • Gap filling: Weather data explains well water level changes
  • Uncertainty reduction: Independent sources reduce overall uncertainty
  • New insights: Relationships between sources reveal system behavior

How Does Integration Progress?

  1. Part 2 (Spatial): Overlay datasets spatially—where do sources align?
  2. Part 3 (Temporal): Align time series—do patterns correlate as expected?
  3. Part 4 (Fusion): Combine sources mathematically—joint analysis
  4. Part 5 (Operations): Deploy integrated insights—actionable systems

Critical Success Factor: Integration quality depends on foundation quality. Poor spatial coverage (3 wells) limits all downstream integration analyses.
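Step 2 of this progression, temporal alignment, can be sketched in pure Python. The dates and values below are illustrative stand-ins for a daily rainfall series and a daily well water level series, not real measurements:

```python
from math import sqrt

# Hypothetical daily series keyed by ISO date: stand-ins for rainfall (mm)
# and well water level (m). All keys and values are illustrative.
rain = {"2021-06-01": 12.0, "2021-06-02": 0.0, "2021-06-03": 8.0, "2021-06-04": 3.0}
level = {"2021-06-02": 210.1, "2021-06-03": 210.4, "2021-06-04": 210.3}

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Align on the dates both sources share before comparing them
shared = sorted(set(rain) & set(level))
r = pearson([rain[d] for d in shared], [level[d] for d in shared])
print(f"{len(shared)} overlapping days, r = {r:.2f}")
```

Only the overlapping dates enter the comparison, which is exactly why sparse networks hurt: with 3 operational wells, the overlap between sources shrinks to a handful of locations.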

Parts 2-5 and the Reference Library progressively build on these data foundations:

Part 2: Spatial Patterns

  • Where are high-quality aquifer materials located?
  • How adequate is our monitoring network coverage?
  • Which areas are most vulnerable to contamination?

Part 3: Temporal Dynamics

  • How do water levels change over time (trends, seasonality)?
  • What is the lag between precipitation and groundwater response?
  • How does the aquifer respond to extreme events (droughts, floods)?

Part 4: Data Fusion Insights

  • Water balance closure across all four data sources
  • Causal relationships between climate forcing and aquifer response
  • Value of information analysis for monitoring investments

Part 5: Predictive Operations

  • Machine learning for material classification and forecasting
  • Well placement optimization with uncertainty quantification
  • Operational dashboards and early warning systems

Reference Library

  • Terminology translation across disciplines
  • Complete data dictionary for all sources
  • Frequently asked questions

3.7 The Path Ahead

Note 📘 Interpreting Foundation Results

What Did We Learn? Part 1 inventory revealed a paradox: excellent data accuracy but inadequate spatial coverage.

Why Does This Pattern Matter?

  • Strength: High-quality temporal data enables robust trend analysis
  • Weakness: Sparse spatial network prevents regional mapping
  • Trade-off: 3 wells with 14-year records vs. 18 wells with 1-year records—which is better?

How to Interpret the Findings:

| Finding | Implication | Next Steps |
|---------|-------------|------------|
| HTEM 100% coverage | Can map aquifer everywhere | Use as spatial foundation |
| Wells 17% coverage | Cannot validate regionally | Activate dormant wells |
| Weather 90% coverage | Climate forcing well-characterized | Sufficient for analysis |
| Streams 22% coverage | Limited stream-aquifer analysis | Install agricultural gauges |

Management Decision: Accept limited spatial validation in Part 2-3 analyses, prioritize network expansion before Part 5 operational deployment.

Part 1 established what data we have and its quality. We discovered:

  • Excellent HTEM spatial coverage but a single time snapshot
  • Excellent well temporal resolution but severe network gaps (3 wells)
  • Long stream records but urban monitoring bias
  • High-quality weather data spanning the critical recharge period

Next: Part 2 will integrate these disparate sources into unified datasets, enabling the cross-source analyses that reveal relationships invisible within any single dataset alone.


3.8 Dependencies & Reproducibility

All chapters in Part 1 use:

  • Data sources: data/htem/, data/aquifer.db, data/warm.db, data/usgs_stream/
  • Loaders: src.data_loaders.* (IntegratedDataLoader provides unified access)
  • Configuration: config/data_config.yaml (all paths, parameters)
  • Outputs: outputs/phase-1/ (figures, summaries for downstream use)

To reproduce Part 1 analyses:

# Render all chapters
cd aquifer-book
quarto render parts/part-1-foundations/

# Or preview interactively
quarto preview parts/part-1-foundations/

See the main index.qmd chapter and the repository quickstart guide for environment setup (Python, Quarto, dependencies) and for details on configuring config/data_config.yaml so these paths resolve correctly.

3.9 Reflection Questions

  • For each of the four core data sources, what is one thing it can tell you about the aquifer that the others cannot?
  • Where do you see the biggest gaps in the monitoring network, and how might those gaps affect later analyses?
  • After reading this overview, which Part 1 chapter do you want to examine next, given your background (CS, hydro, or newcomer)?