56 Frequently Asked Questions

Quick answers to common questions

57 General Questions

57.1 What is this project?

This is an interdisciplinary knowledge repository for understanding groundwater systems through data fusion. It combines:

  • HTEM geophysical data (subsurface structure)
  • Groundwater monitoring (well measurements)
  • Weather/climate data (precipitation, temperature)
  • USGS stream gauge data (surface water)

We bridge computer science, hydrogeology, statistics, and geophysics to teach how to extract insights from multi-source environmental data.

See: INTERDISCIPLINARY_VISION.md for the complete mission.


57.2 Is this just a decision support tool?

No. While some chapters show how insights could inform decisions (for example, comparing well sites or highlighting anomalies), this project is fundamentally:

  • A knowledge repository preserving techniques and insights.
  • A research platform enabling novel questions about groundwater systems.
  • A living textbook for interdisciplinary data science on environmental data.
  • A bridge between different scientific disciplines working on the same aquifer.

The core objective is to understand and explain how the aquifer behaves using four datasets—not to prescribe specific management actions.


57.3 Who is this for?

Multiple audiences:

  • Computer scientists learning environmental data science
  • Hydrogeologists learning data science and machine learning
  • Statisticians learning applied environmental science
  • Students learning interdisciplinary collaboration
  • Managers understanding data-driven water resource management
  • Researchers exploring new questions in aquifer science

Different backgrounds, different entry points. The playbook is designed for learning across disciplines.


57.4 Do I need to know how to code?

No, but it helps.

  • This playbook does not teach Python from scratch.
  • We include complete code so that:
    • People who code can reproduce and extend the analyses.
    • Non-coders can still see how the analysis was done, at a high level.

If you are not a programmer:

  • You can skip or skim code blocks and focus on the story, figures, and “Key Takeaways”.
  • Use the “For Newcomers” callouts in each chapter to see what you’ll learn and what you can safely skip.


57.5 I have no water background. Is this book for me?

Yes.

The goal is to give you a clear narrative of:

  • What an aquifer is
  • How we observe it
  • What we can learn from these observations


58 Getting Started

58.1 Where to Start?

Three-step approach:

  1. Read Part 1 - Foundations to understand each data source individually
  2. Check Terminology Translation to learn cross-discipline language
  3. Explore parts aligned with your background:
    • CS background → Start with Part 2 (Spatial Patterns)
    • Hydro background → Start with Part 3 (Temporal Dynamics)
    • Stats background → Start with Part 4 (Data Fusion)

Then dive into specific chapters based on your interests.


58.2 Do I need all three disciplines?

Not all three! That’s the point.

  • If you know computer science, we’ll teach you hydrogeology
  • If you know hydrogeology, we’ll teach you data science
  • If you know statistics, we’ll teach you environmental applications
  • If you’re a student, we’ll teach all of the above

The project is designed for learning across disciplines, not requiring expertise in all.


58.3 Can I skip chapters?

Absolutely! This is not a linear textbook.

  • Skip chapters covering topics you already know
  • Focus on chapters filling your knowledge gaps
  • Jump between parts based on your interests
  • Use callout boxes to get discipline-specific perspectives

But: If you’re struggling with a chapter, check its prerequisites. Some advanced chapters assume knowledge from earlier ones.


59 Technical Questions

59.1 How do I install and set up?

Basic installation:

git clone https://github.com/ngcharithperera/aquifer-data.git
cd aquifer-data
pip install -r requirements.txt
pip install -e .

Verify installation:

pytest -q

See: README.md for detailed instructions.


59.2 Accessibility and Interaction

We aim to make figures and interactive elements usable for a wide range of readers.

  • Alt text and captions: Conceptual figures (such as aquifer cross-sections) include descriptive alt text and captions summarizing key messages.
  • Keyboard navigation: Interactive Plotly charts can be reached via normal browser focus (Tab/Shift+Tab) and manipulated using built-in controls (zoom, pan, reset) without a mouse.
  • Text-first explanations: Every important visual has surrounding narrative text that explains what patterns to look for and why they matter, so readers can follow even if they cannot see or interact with the figure.
  • Static exports: Most interactive charts can be exported as static images (PNG) or underlying data (CSV) for offline inspection, screen-reader pipelines, or printing.

If you encounter accessibility issues or have suggestions (for example, with screen readers, color palettes, or keyboard use), please open a GitHub issue so we can improve future iterations.


59.3 Where is the data?

Data is NOT in the git repository (too large: 10+ GB total).

Four data sources:

  1. data/aquifer.db - Groundwater database (114 MB)
  2. data/warm.db - Weather database (6 GB)
  3. data/htem/ - HTEM geophysical data (5 GB)
  4. data/usgs_stream/ - Stream gauge data (5-10 MB)

Contact repository maintainers for data access or check DATA_PROTECTION.md for details.


59.4 Can I use my own data?

Yes! The framework is designed for extensibility.

To add your data:

  1. Create a data loader following existing patterns (src/data_loaders/)
  2. Add paths to config/data_config.yaml
  3. Integrate into IntegratedDataLoader (if it’s a core data type)
  4. Write unit tests for your loader
  5. Update documentation
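For step 1, a minimal loader sketch is shown below. The class name, file layout, and column names are illustrative assumptions, not the project’s actual src/data_loaders/ API; follow the existing loaders for the real pattern.

```python
from pathlib import Path

import pandas as pd


class MyStreamLoader:
    """Hypothetical loader sketch; mirror the real src/data_loaders/ classes."""

    def __init__(self, data_path: str):
        self.data_path = Path(data_path)

    def load(self) -> pd.DataFrame:
        if not self.data_path.exists():
            raise FileNotFoundError(f"Data not found: {self.data_path}")
        df = pd.read_csv(self.data_path)
        # Parse US-format timestamps explicitly (see the TIMESTAMP FAQ entry)
        df["TIMESTAMP"] = pd.to_datetime(df["TIMESTAMP"], format="%m/%d/%Y")
        return df
```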

See: Contributing Guide for details.


59.5 Why does TIMESTAMP parsing matter?

Critical issue: The database uses US format (M/D/YYYY), not ISO.

Problem:

# Ambiguous - pandas must guess the format of "7/9/2008"
df['TIMESTAMP'] = pd.to_datetime(df['TIMESTAMP'])  # ❌ WRONG

# Month-first guess (pandas default): July 9, 2008 ✓
# Day-first guess (dayfirst=True):    September 7, 2008 ✗ WRONG!

Solution:

# Explicit format specification
df['TIMESTAMP'] = pd.to_datetime(df['TIMESTAMP'], format='%m/%d/%Y')  # ✅ CORRECT

Impact if wrong: All temporal analysis will be incorrect (trends, seasonality, forecasts).
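To see the ambiguity concretely, this standalone check parses the same string both ways:

```python
import pandas as pd

raw = pd.Series(["7/9/2008"])
us = pd.to_datetime(raw, format="%m/%d/%Y")  # July 9, 2008 (correct here)
eu = pd.to_datetime(raw, dayfirst=True)      # September 7, 2008
assert us[0] != eu[0]  # same string, two different dates
```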

See: TIMESTAMP_AUDIT_AND_FIXES.md for complete documentation.


59.6 What Python packages are required?

Core dependencies:

  • pandas, numpy - Data manipulation
  • scikit-learn - Machine learning
  • plotly - Interactive visualizations
  • statsmodels - Time series analysis
  • geopandas, rasterio - Spatial data (optional)
  • pytest - Testing

All listed in: requirements.txt

Python version: 3.11+


59.7 How do I run the Quarto book?

Preview (live reload):

cd aquifer-book
quarto preview

Render to HTML:

quarto render

Output: aquifer-book/_book/index.html

Tip: Use quarto preview during development for instant feedback.


60 Content Questions

60.1 What are callout boxes for?

Callout boxes provide discipline-specific perspectives on the same content.

Example:

::: {.callout-note icon=false}
## For Computer Scientists
K-means finds clusters by minimizing within-cluster sum of squares.
But for geological data, clusters should respect spatial continuity.
:::

::: {.callout-tip icon=false}
## For Hydrogeologists
The algorithm groups similar resistivity values. But it doesn't
know that geological units are spatially continuous. You may need
to post-process results to ensure geological realism.
:::

Why? Different audiences need different explanations. Callout boxes let everyone learn from all perspectives.


60.2 Why document failed experiments?

Failed experiments are valuable!

Benefits:

  • Save others time (don’t repeat mistakes)
  • Teach when methods fail and why
  • Make assumptions explicit
  • Show science as it really is (not just successes)

Most tutorials hide failures. We document them proudly.

Example: Documenting “linear regression failed on time series” teaches more about temporal structure than just showing “SARIMAX worked.”


60.3 What does “data fusion” mean?

Data fusion = Combining multiple data sources to generate insights impossible from single sources alone.

Example: Understanding Aquifer Recharge

  • HTEM alone: “Sand layers exist at 15-25m depth”
  • Groundwater alone: “Water levels rise in spring”
  • Weather alone: “5 inches of rain in March-April”
  • Stream gauge alone: “Discharge drops in summer but baseflow sustained”

FUSION reveals: “The aquifer recharges through sandy units during spring precipitation, stores water through summer, and sustains stream baseflow during drought.”

No single data type tells this complete story.


60.4 What are stratigraphic units?

Note (For Newcomers): Think “Layer Cake”

The ground beneath Illinois is like a layer cake built over millions of years. Each layer has different properties and formed at different times.

“Stratigraphic units” = geological layers, labeled A through F (deepest to shallowest).

Six units (A-F) representing different geological layers from deep to shallow:

  • Unit A: Deep bedrock (180-194m depth, ~48.7 Ω·m resistivity)
  • Unit B: Transition zone (108-168m depth)
  • Unit C: Upper bedrock (124-166m depth)
  • Unit D: Primary Aquifer - Mahomet (12-96m depth, ~128.3 Ω·m) - Most important for water
  • Unit E: Clay-rich Quaternary (0-30m depth) - Confining layer above aquifer
  • Unit F: Mixed surface materials (0-20m depth)

Tip: Why Unit D is Special

Unit D (Mahomet Aquifer) is an ancient river valley that got buried by glaciers:

  1. Long ago: A large river carved a valley through bedrock
  2. Ice Age: Glaciers filled the valley with sand and gravel (good for water storage)
  3. More glaciers: Deposited clay on top (Unit E), sealing it in
  4. Today: That buried valley is full of groundwater—our main water supply

It’s like a buried underground reservoir, protected by a clay “lid” (Unit E) from surface contamination.

Think of it: If you drilled 50 meters down in many places, you’d hit this ancient river valley still full of water after thousands of years.

Unit D is the focus for water resource analysis - it’s the buried sand and gravel valley (Mahomet Aquifer) that provides most groundwater.

See: Data Dictionary for complete descriptions.


61 Contributing

61.1 How can I contribute?

Many ways to contribute:

  1. Document failures - Tried something that didn’t work? Tell us!
  2. Add analysis chapters - New methods, new questions, new insights
  3. Improve code - Bug fixes, new features, better algorithms
  4. Add to terminology translation - Terms you found confusing
  5. Contribute data - New regions, additional data types
  6. Improve documentation - Clearer explanations, more examples
  7. Create visualizations - Better ways to show results

See: CONTRIBUTING.md for complete guide.


61.2 Do I need permission to contribute?

No! This is an open project.

Process:

  1. Fork the repository
  2. Make your changes
  3. Submit a pull request
  4. We review and provide feedback
  5. Iterate if needed
  6. Merge!

For major changes: Open an issue first to discuss, avoid wasted effort.


61.3 Can I contribute without writing code?

Non-code contributions are valuable!

You can contribute:

  • Documentation improvements
  • Terminology translations
  • Failed experiments you’ve encountered
  • Questions that should be in this FAQ
  • Suggestions for clearer explanations
  • Domain expertise review (hydrogeology, stats, etc.)

See: CONTRIBUTING.md section “Types of Contributions”


61.4 How does the review process work?

We review for:

  1. Interdisciplinary clarity - Accessible to multiple audiences?
  2. Scientific rigor - Assumptions documented? Limitations acknowledged?
  3. Reproducibility - Can someone else run this?
  4. Code quality - Follows project conventions?

Timeline:

  • Minor fixes: 1-2 days
  • Code contributions: 3-7 days
  • New chapters: 1-2 weeks

We aim for initial feedback within 48 hours.


62 Conceptual Questions

62.1 Multiple Perspectives Why?

Different backgrounds need different explanations.

Example: “Autocorrelation”

  • Statistician: Correlation between X_t and X_{t-k}
  • Hydrogeologist: “Memory in the system” - aquifer responds slowly
  • Computer Scientist: Sequential data points aren’t i.i.d.

All three are correct. Different framings help different people understand.


62.2 How is this different from research papers?

Research papers:

  • Show successes (failures hidden)
  • Methods often in supplements
  • Code rarely available
  • Written for experts in one discipline
  • Static (published once)

This project:

  • Documents failures as well as successes
  • Methods explained in detail with multiple perspectives
  • All code available and executable
  • Written for multiple disciplines
  • Living (continuously updated)

We’re building a different model for scientific communication.


62.3 Why does reproducibility matter?

Reproducible research enables:

  • Verification - Others can check your work
  • Extension - Others can build on your work
  • Learning - Others can understand your methods
  • Trust - Results can be validated

How we ensure reproducibility:

  • All code version controlled
  • Dependencies locked (requirements.txt)
  • Random seeds set
  • Paths configurable (not hard-coded)
  • Data access documented
  • Methods explained step-by-step

63 Data and Methods

63.1 HTEM versus Wells?

HTEM (Helicopter Time-domain ElectroMagnetic):

  • Continuous spatial coverage (every ~100m)
  • Sees subsurface structure (geological layers)
  • Fast data collection (helicopter survey)
  • Indirect measurement (resistivity → material type)
  • One-time snapshot (2008 survey)

Well measurements:

  • Direct measurement (water level, depth)
  • Time series (continuous monitoring)
  • High precision (±0.01 ft)
  • Sparse spatial coverage (356 wells across region)
  • Expensive to collect (drilling costs)

Together: HTEM provides spatial structure, wells provide temporal dynamics. Fusion gives complete picture.


63.2 What machine learning methods do you use?

Note (For Newcomers): What is Machine Learning?

Machine Learning (ML) = Teaching computers to find patterns in data and make predictions.

Instead of writing explicit rules (“if resistivity > 100, then sand”), we show the computer many examples (“here are 1000 measurements where we know it’s sand”) and let it learn the pattern.

Why use it for aquifer data?

  • Relationships are complex (not simple thresholds)
  • We have lots of data (1M+ measurements)
  • Patterns exist but are hard to describe explicitly

Multiple approaches:

  • Classification: Random Forest, XGBoost (material type from resistivity)
    • Translation: “Is this sand, clay, or gravel?” (categories)
    • Example: Given resistivity = 150 Ω·m, predict material type = “well-sorted sand”
  • Regression: Linear, polynomial, ML regressors (property prediction)
    • Translation: “What number will this be?” (continuous values)
    • Example: Given resistivity and depth, predict hydraulic conductivity = 25 m/day
  • Time Series: SARIMAX, Prophet (water level forecasting)
    • Translation: “What happens next?” (future predictions)
    • Example: Given past water levels and rainfall, predict levels 30 days ahead
  • Spatial: Kriging, spatial regression (interpolation)
    • Translation: “What’s between the measurements?” (filling gaps)
    • Example: Given 356 well measurements, estimate water levels everywhere in between
  • Clustering: DBSCAN (spatially-constrained)
    • Translation: “What groups naturally exist?” (finding patterns)
    • Example: Find groups of wells that behave similarly
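As a toy illustration of the classification case, here is a minimal Random Forest sketch on synthetic resistivity values; the numbers are invented for illustration, not taken from the survey:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic training data: low resistivity -> clay, high -> sand
rng = np.random.default_rng(0)
clay = rng.normal(30, 8, 200)    # ohm-m, invented values
sand = rng.normal(150, 25, 200)
X = np.r_[clay, sand].reshape(-1, 1)
y = np.array(["clay"] * 200 + ["sand"] * 200)

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print(clf.predict([[150.0]])[0])  # -> sand
```

A real pipeline would use many features (resistivity at several depths, spatial context) and labeled boreholes rather than a single synthetic variable.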

Important (Key Principle): Physics-Constrained ML

We don’t use black-box ML without domain validation.

Why not? ML can find spurious patterns that violate physics:

  • Predicting water flowing uphill (impossible!)
  • Predicting negative resistivity (nonsense!)
  • Predicting aquifer transmissivity from shoe size (coincidental correlation)

Our approach:

  1. Use ML to find patterns
  2. Validate predictions against physical laws
  3. Interpret results with domain expertise
  4. Constrain models to respect physics (water flows downhill, etc.)

Example: If ML predicts groundwater flowing from low elevation to high elevation, we know the model is wrong, even if it fits the training data perfectly.

Key principle: Methods must respect physical constraints.


63.3 What’s wrong with linear regression here?

Note (For Newcomers): Why Not Just Use Linear Regression?

Simple answer: Time series data (like water levels measured every day) violates a key assumption of linear regression.

The assumption: Each data point is independent (knowing one value tells you nothing about another).

The reality: Water levels are autocorrelated (today’s level depends heavily on yesterday’s level).

The core problem is a violation of the independence assumption.

Problem:

# Water levels today correlated with water levels yesterday
# Violates "rows are independent" assumption
# Results in:
# - Wrong standard errors (too optimistic)
# - Misleading p-values (claim significance when there isn't)
# - Poor predictions (overconfident, doesn't capture dynamics)

Warning: Concrete Example of What Goes Wrong

Scenario: Predict tomorrow’s water level from rainfall

Bad approach (Linear Regression):

# Treats each day as independent
from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(rainfall, water_level)  # WRONG for time series!

What happens:

  • Model says “R² = 0.95, amazing fit!”
  • But predictions are terrible (misses trends, lags)
  • Confidence intervals too narrow (overconfident)
  • Autocorrelation in residuals (warning sign)

Why it fails: Yesterday’s water level is the strongest predictor of today’s level, but linear regression ignores this temporal structure!

Good approach (Time Series Model):

# Explicitly models autocorrelation
from statsmodels.tsa.statespace.sarimax import SARIMAX

model = SARIMAX(water_level, exog=rainfall, order=(1, 0, 0))
# order=(1,0,0) means "use yesterday's value to predict today"

What happens:

  • Model captures both rainfall influence AND persistence
  • Predictions match observed dynamics
  • Realistic confidence intervals
  • Residuals pass autocorrelation tests

Solution: Use time series methods (SARIMAX, VAR) that account for autocorrelation.

Key insight: Aquifers have memory. Water doesn’t instantly respond to rain—it persists over days, weeks, months. Time series methods capture this memory; linear regression doesn’t.

See: Part 3 - Temporal Dynamics for proper time series analysis.


63.4 How do you handle uncertainty?

Multiple approaches:

  1. Measurement uncertainty: Document precision of instruments
  2. Model uncertainty: Confidence intervals, prediction intervals
  3. Parameter uncertainty: Bootstrap, Bayesian credible intervals
  4. Scenario uncertainty: Sensitivity analysis, Monte Carlo

Key principle: Always quantify and report uncertainty. Point estimates without uncertainty are misleading.
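For approach 3, a minimal bootstrap sketch (entirely synthetic data) shows how a confidence interval for a mean is built:

```python
import numpy as np

rng = np.random.default_rng(0)
sample = rng.normal(12.0, 2.0, 120)  # synthetic water-level observations (m)

# Resample with replacement 2000 times; take the 2.5% and 97.5% percentiles
boot_means = np.array([rng.choice(sample, sample.size, replace=True).mean()
                       for _ in range(2000)])
lo, hi = np.percentile(boot_means, [2.5, 97.5])
print(f"mean = {sample.mean():.2f}, 95% CI = ({lo:.2f}, {hi:.2f})")
```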


63.5 Kriging versus ML?

Kriging (geostatistics):

  • Provides uncertainty estimates (kriging variance)
  • Optimal under Gaussian assumptions
  • Accounts for spatial correlation explicitly
  • Assumes stationarity
  • Computationally expensive for large datasets

Machine Learning (Random Forest, XGBoost):

  • No stationarity assumption
  • Handles non-linear relationships
  • Fast for large datasets
  • Incorporates multiple covariates easily
  • Uncertainty estimation more complex

Best practice: Try both, validate against hold-out data, compare results.
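A sketch of that comparison on synthetic data. Gaussian process regression with an RBF kernel (a close relative of simple kriging) stands in for the geostatistical side here; all values are invented for illustration:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.uniform(0, 10, (300, 2))                   # synthetic coordinates
z = np.sin(X[:, 0]) + 0.1 * rng.normal(size=300)   # smooth field + noise

X_tr, X_te, z_tr, z_te = train_test_split(X, z, random_state=0)

gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.0), alpha=0.01)
gp.fit(X_tr, z_tr)
rf = RandomForestRegressor(random_state=0).fit(X_tr, z_tr)

print("GP MAE:", mean_absolute_error(z_te, gp.predict(X_te)))
print("RF MAE:", mean_absolute_error(z_te, rf.predict(X_te)))
```

For real work, the hold-out split itself should be spatial (see the block cross-validation discussion in the troubleshooting section).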


63.6 How do you validate against physics?

Example: Mass balance check

# Recharge - Discharge - Storage Change = 0 (mass conservation)
# (illustrative sketch: each variable below comes from your site's data)
recharge = precipitation * recharge_coefficient
discharge = pumping + baseflow
storage_change = water_level_change * storativity * area

residual = recharge - discharge - storage_change

# Residual should be small (near zero)
if abs(residual / recharge) > 0.1:  # >10% error
    print("WARNING: Mass balance violated - check assumptions")

Other checks:

  • Flow should be from high to low hydraulic head (no uphill flow)
  • Transmissivity should be positive
  • Porosity should be between 0 and 1
  • Predictions should be within physical range
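These range checks are easy to automate. A minimal helper sketch (the function name is ours, not part of the project’s API):

```python
import numpy as np


def validate_predictions(preds: np.ndarray, name: str,
                         lo: float, hi: float) -> bool:
    """Flag predictions outside a physically plausible range."""
    ok = np.all((preds >= lo) & (preds <= hi))
    if not ok:
        print(f"WARNING: {name} predictions outside [{lo}, {hi}]")
    return bool(ok)


# Illustrative checks mirroring the list above
print(validate_predictions(np.array([0.25, 0.31]), "porosity", 0.0, 1.0))
print(validate_predictions(np.array([-5.0]), "transmissivity", 0.0, np.inf))
```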

64 Troubleshooting

64.1 Q: My forecast accuracy suddenly dropped. What should I check?

A: Work through this checklist:

  1. Data quality: Did a sensor fail or get recalibrated? Check for gaps or spikes.
  2. New wells added: If monitoring network changed, model may need retraining.
  3. Extreme weather: Unusual events (drought, flood) outside training data.
  4. Seasonal shift: Performance often drops in seasons with less training data.
  5. Code changes: Did someone modify preprocessing or feature engineering?

Quick fix: Retrain on most recent 2 years of data. If still poor, investigate data.


64.2 Q: The dashboard shows “Insufficient Data” for a chapter. How do I fix it?

A: This means the required databases aren’t available or accessible. Check:

  1. File exists: Is data/aquifer.db present? Is data/warm.db present?
  2. Path correct: Check config/data_config.yaml points to correct locations.
  3. Table exists: Connect to SQLite and verify tables exist (sqlite3 data/aquifer.db ".tables")
  4. Date overlap: Weather and groundwater data must cover same time period.

See Data Dictionary for required table schemas.


64.3 Q: Which tool should I use for my problem?

A: Use this decision matrix:

| Your Goal                            | Use This Tool              | Chapter      |
|--------------------------------------|----------------------------|--------------|
| Avoid drilling dry holes             | Material Classification ML | Part 5 Ch. 1 |
| Predict drought 7-14 days ahead      | Water Level Forecasting    | Part 5 Ch. 2 |
| Detect sensor failures               | Anomaly Detection          | Part 5 Ch. 3 |
| Find best drilling locations         | Well Placement Optimizer   | Part 5 Ch. 4 |
| Design recharge systems              | MAR Site Selection         | Part 5 Ch. 5 |
| Explain predictions to stakeholders  | Explainable AI             | Part 5 Ch. 6 |

64.4 Q: The optimizer recommends expensive sites. How do I adjust?

A: The optimizer uses weighted objectives. To prioritize cost:

  1. Increase cost weight: Change from 0.25 to 0.40 in objective function
  2. Add cost constraint: “Maximum drilling cost < $50K”
  3. View Pareto frontier: Look at lower-cost alternatives with slightly lower yield

See Well Placement Optimizer for parameter tuning.


64.5 How do I troubleshoot code errors?

Common issues:

  1. Data not found: Check data/ directory exists with all 4 sources
  2. Import errors: Run pip install -r requirements.txt
  3. Wrong Python version: Requires Python 3.11+
  4. Path issues: Use get_data_path() from config, not hard-coded paths
  5. TIMESTAMP parsing: Use explicit format %m/%d/%Y

If still stuck: Open an issue with error message and minimal reproducible example.


64.6 Quarto Rendering Issues?

Check:

  1. Quarto installed? quarto --version
  2. In correct directory? cd aquifer-book
  3. Python environment active? which python
  4. All packages installed? pip install -r requirements.txt

Common errors:

  • “File not found”: Check paths in chapters are relative to aquifer-book/
  • “Module not found”: Python environment not activated
  • “Code execution failed”: Set execute: enabled: false in the chapter front matter if the code shouldn’t run

64.7 What do autocorrelation warnings mean?

Your residuals are correlated (bad for standard regression).

What it means:

  • Model hasn’t captured all temporal structure
  • Standard errors are wrong
  • P-values misleading

Solutions:

  1. Use time series methods (SARIMAX instead of regression)
  2. Add lag terms to capture temporal structure
  3. Use Newey-West standard errors (robust to autocorrelation)
  4. Check for omitted variables

See: Part 3 - Temporal Dynamics for time series analysis.


64.8 Why are my predictions unrealistic?

Common causes:

  1. Extrapolation beyond training range - Model doesn’t know what to do
  2. Ignoring spatial autocorrelation - Random splits leak information
  3. Missing physical constraints - Model violates physics
  4. Wrong coordinate system - Spatial relationships broken

Solutions:

  • Use spatial cross-validation (block CV)
  • Add physical constraints to model
  • Verify coordinate transformations
  • Check for extrapolation (plot training data extent)
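A minimal sketch of spatial block cross-validation using scikit-learn’s GroupKFold; the grid size and coordinates are illustrative:

```python
import numpy as np
from sklearn.model_selection import GroupKFold

rng = np.random.default_rng(42)
x = rng.uniform(0, 10, 200)   # easting (km), synthetic
y = rng.uniform(0, 10, 200)   # northing (km), synthetic

# Assign each point to a 2.5 km x 2.5 km block; folds never split a block,
# so test points stay spatially separated from training points
blocks = (x // 2.5).astype(int) * 4 + (y // 2.5).astype(int)
for train_idx, test_idx in GroupKFold(n_splits=4).split(np.c_[x, y],
                                                        groups=blocks):
    assert set(blocks[train_idx]).isdisjoint(blocks[test_idx])
```

Random (row-wise) splits would put near-neighbors in both train and test sets, leaking spatial information and inflating apparent accuracy.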

65 Project Philosophy

65.1 Why emphasize interdisciplinary communication?

Real science is interdisciplinary, but communication gaps cause problems:

  • Computer scientists build models that violate physical laws
  • Hydrogeologists miss insights available from modern ML
  • Statisticians apply methods without understanding domain constraints
  • Results get lost in translation between disciplines

Solution: Build communication bridges. Translate jargon. Show multiple perspectives. Document assumptions.

Goal: Enable true collaboration, not just “throw results over the wall.”


65.2 Why are failed experiments valuable?

Reasons:

  1. Save time: Others don’t repeat your mistakes
  2. Teach assumptions: Show when methods fail reveals what they assume
  3. Honest science: Real research includes failures, not just successes
  4. Build intuition: Understanding failure deepens understanding of success

Example: Documenting “linear regression failed on time series” teaches more about time series structure than just showing “SARIMAX worked.”


65.3 What does “living document” mean?

Traditional documents: Written once, published, frozen.

Living documents: Continuously updated as we learn.

This project:

  • New chapters added as new analyses done
  • Failed experiments updated as new failures discovered
  • Terminology translation grows as new terms encountered
  • Methods updated as better approaches found
  • Community contributions integrated

Version controlled so you can see how it evolves.


66 Still Have Questions?

66.1 Where can I ask questions?

Options:

  1. GitHub Discussions - General questions, ideas, conversations
  2. GitHub Issues - Bugs, feature requests, specific problems
  3. This FAQ - Submit PR to add questions others might have

66.2 How do I suggest an FAQ entry?

Two ways:

  1. Open an issue with label documentation and title “FAQ: [your question]”
  2. Submit a PR adding your question+answer to this file

We’ll review and integrate if it’s generally useful.


66.3 Who maintains this project?

Current maintainers: see CONTRIBUTING.md for contribution guidelines and maintainer information.

Contributions welcome from anyone! This is designed as a community project.

See: CONTRIBUTING.md


Didn’t find your question? Ask in Discussions or open an issue!


Last Updated: November 26, 2025
Version: 2.0 (Consolidated for Playbook)
Maintainers: Community-driven (open to contributors)