56 Frequently Asked Questions
Quick answers to common questions
57 General Questions
57.1 What is this project?
This is an interdisciplinary knowledge repository for understanding groundwater systems through data fusion. It combines:
- HTEM geophysical data (subsurface structure)
- Groundwater monitoring (well measurements)
- Weather/climate data (precipitation, temperature)
- USGS stream gauge data (surface water)
We bridge computer science, hydrogeology, statistics, and geophysics to teach how to extract insights from multi-source environmental data.
See: INTERDISCIPLINARY_VISION.md for the complete mission.
57.2 Is this just a decision support tool?
No. While some chapters show how insights could inform decisions (for example, comparing well sites or highlighting anomalies), this project is fundamentally:
- A knowledge repository preserving techniques and insights.
- A research platform enabling novel questions about groundwater systems.
- A living textbook for interdisciplinary data science on environmental data.
- A bridge between different scientific disciplines working on the same aquifer.
The core objective is to understand and explain how the aquifer behaves using four datasets—not to prescribe specific management actions.
57.3 Who is this for?
Multiple audiences:
- Computer scientists learning environmental data science
- Hydrogeologists learning data science and machine learning
- Statisticians learning applied environmental science
- Students learning interdisciplinary collaboration
- Managers understanding data-driven water resource management
- Researchers exploring new questions in aquifer science
Different backgrounds, different entry points. The playbook is designed for learning across disciplines.
57.4 Do I need to know how to code?
No, but it helps.
- This playbook does not teach Python from scratch.
- We include complete code so that:
  - People who code can reproduce and extend the analyses.
  - Non-coders can still see how the analysis was done, at a high level.

If you are not a programmer:
- You can skip or skim code blocks and focus on the story, figures, and “Key Takeaways”.
- Use the “For Newcomers” callouts in each chapter to see what you’ll learn and what you can safely skip.
57.5 I have no water background. Is this book for me?
Yes.
- Start with: `index.qmd` and Part 1 - Data Foundations.
- Follow: Pathway 0 in Learning Pathways (“No-Water, No-Code On-Ramp”).
- Use: Terminology Translation whenever you see an unfamiliar term.

The goal is to give you a clear narrative of:
- What an aquifer is,
- How we observe it,
- And what we can learn from these observations.
58 Getting Started
58.1 Where should I start?
Three-step approach:
- Read Part 1 - Foundations to understand each data source individually
- Check Terminology Translation to learn cross-discipline language
- Explore parts aligned with your background:
- CS background → Start with Part 2 (Spatial Patterns)
- Hydro background → Start with Part 3 (Temporal Dynamics)
- Stats background → Start with Part 4 (Data Fusion)
Then dive into specific chapters based on your interests.
58.2 Do I need to know all the disciplines?
Not all three! That’s the point.
- If you know computer science, we’ll teach you hydrogeology
- If you know hydrogeology, we’ll teach you data science
- If you know statistics, we’ll teach you environmental applications
- If you’re a student, we’ll teach all of the above
The project is designed for learning across disciplines, not requiring expertise in all.
58.3 Can I skip chapters?
Absolutely! This is not a linear textbook.
- Skip chapters covering topics you already know
- Focus on chapters filling your knowledge gaps
- Jump between parts based on your interests
- Use callout boxes to get discipline-specific perspectives
But: If you’re struggling with a chapter, check its prerequisites. Some advanced chapters assume knowledge from earlier ones.
59 Technical Questions
59.1 How do I install and set up the project?
Basic installation:

```bash
git clone https://github.com/ngcharithperera/aquifer-data.git
cd aquifer-data
pip install -r requirements.txt
pip install -e .
```

Verify installation:

```bash
pytest -q
```

See: README.md for detailed instructions.
59.2 Accessibility and Interaction
We aim to make figures and interactive elements usable for a wide range of readers.
- Alt text and captions: Conceptual figures (such as aquifer cross-sections) include descriptive alt text and captions summarizing key messages.
- Keyboard navigation: Interactive Plotly charts can be reached via normal browser focus (Tab/Shift+Tab) and manipulated using built-in controls (zoom, pan, reset) without a mouse.
- Text-first explanations: Every important visual has surrounding narrative text that explains what patterns to look for and why they matter, so readers can follow even if they cannot see or interact with the figure.
- Static exports: Most interactive charts can be exported as static images (PNG) or underlying data (CSV) for offline inspection, screen-reader pipelines, or printing.
If you encounter accessibility issues or have suggestions (for example, with screen readers, color palettes, or keyboard use), please open a GitHub issue so we can improve future iterations.
59.3 Where is the data?
Data is NOT in the git repository (too large: 10+ GB total).
Four data sources:
- `data/aquifer.db` - Groundwater database (114 MB)
- `data/warm.db` - Weather database (6 GB)
- `data/htem/` - HTEM geophysical data (5 GB)
- `data/usgs_stream/` - Stream gauge data (5-10 MB)
Contact repository maintainers for data access or check DATA_PROTECTION.md for details.
59.4 Can I use my own data?
Yes! The framework is designed for extensibility.
To add your data:
- Create a data loader following existing patterns (`src/data_loaders/`)
- Add paths to `config/data_config.yaml`
- Integrate into `IntegratedDataLoader` (if it’s a core data type)
- Write unit tests for your loader
- Update documentation
See: Contributing Guide for details.
59.5 Why does TIMESTAMP parsing matter?
Critical issue: The database uses US format (M/D/YYYY), not ISO.
Problem:

```python
# Ambiguous - could misinterpret "7/9/2008"
df['TIMESTAMP'] = pd.to_datetime(df['TIMESTAMP'])  # ❌ WRONG
# In US locale: July 9, 2008 ✓
# In EU locale: September 7, 2008 ✗ WRONG!
```

Solution:

```python
# Explicit format specification
df['TIMESTAMP'] = pd.to_datetime(df['TIMESTAMP'], format='%m/%d/%Y')  # ✅ CORRECT
```

Impact if wrong: All temporal analysis will be incorrect (trends, seasonality, forecasts).
See: TIMESTAMP_AUDIT_AND_FIXES.md for complete documentation.
59.6 What Python packages are required?
Core dependencies:
- `pandas`, `numpy` - Data manipulation
- `scikit-learn` - Machine learning
- `plotly` - Interactive visualizations
- `statsmodels` - Time series analysis
- `geopandas`, `rasterio` - Spatial data (optional)
- `pytest` - Testing
All listed in: requirements.txt
Python version: 3.11+
59.7 How do I run the Quarto book?
Preview (live reload):
```bash
cd aquifer-book
quarto preview
```

Render to HTML:

```bash
quarto render
```

Output: `aquifer-book/_book/index.html`

Tip: Use `quarto preview` during development for instant feedback.
60 Content Questions
60.1 What are callout boxes for?
Callout boxes provide discipline-specific perspectives on the same content.
Example:
```markdown
::: {.callout-note icon=false}
## For Computer Scientists
K-means finds clusters by minimizing within-cluster sum of squares.
But for geological data, clusters should respect spatial continuity.
:::

::: {.callout-tip icon=false}
## For Hydrogeologists
The algorithm groups similar resistivity values. But it doesn't
know that geological units are spatially continuous. You may need
to post-process results to ensure geological realism.
:::
```

Why? Different audiences need different explanations. Callout boxes let everyone learn from all perspectives.
60.2 Why document failed experiments?
Failed experiments are valuable!
Benefits:
- Save others time (don’t repeat mistakes)
- Teach when methods fail and why
- Make assumptions explicit
- Show science as it really is (not just successes)
Most tutorials hide failures. We document them proudly.
Example: Documenting “linear regression failed on time series” teaches more about temporal structure than just showing “SARIMAX worked.”
60.3 What does “data fusion” mean?
Data fusion = Combining multiple data sources to generate insights impossible from single sources alone.
Example: Understanding Aquifer Recharge
- HTEM alone: “Sand layers exist at 15-25m depth”
- Groundwater alone: “Water levels rise in spring”
- Weather alone: “5 inches of rain in March-April”
- Stream gauge alone: “Discharge drops in summer but baseflow sustained”
FUSION reveals: “The aquifer recharges through sandy units during spring precipitation, stores water through summer, and sustains stream baseflow during drought.”
No single data type tells this complete story.
60.4 What are stratigraphic units?
Six units (A-F) representing different geological layers from deep to shallow:
- Unit A: Deep bedrock (180-194m depth, ~48.7 Ω·m resistivity)
- Unit B: Transition zone (108-168m depth)
- Unit C: Upper bedrock (124-166m depth)
- Unit D: Primary Aquifer - Mahomet (12-96m depth, ~128.3 Ω·m) - Most important for water
- Unit E: Clay-rich Quaternary (0-30m depth) - Confining layer above aquifer
- Unit F: Mixed surface materials (0-20m depth)
Unit D is the focus for water resource analysis - it’s the buried sand and gravel valley (Mahomet Aquifer) that provides most groundwater.
See: Data Dictionary for complete descriptions.
61 Contributing
61.1 How can I contribute?
Many ways to contribute:
- Document failures - Tried something that didn’t work? Tell us!
- Add analysis chapters - New methods, new questions, new insights
- Improve code - Bug fixes, new features, better algorithms
- Add to terminology translation - Terms you found confusing
- Contribute data - New regions, additional data types
- Improve documentation - Clearer explanations, more examples
- Create visualizations - Better ways to show results
See: CONTRIBUTING.md for complete guide.
61.2 Do I need permission to contribute?
No! This is an open project.
Process:
- Fork the repository
- Make your changes
- Submit a pull request
- We review and provide feedback
- Iterate if needed
- Merge!
For major changes: Open an issue first to discuss and avoid wasted effort.
61.3 Can I contribute without writing code?
Yes! Non-code contributions are valuable.
You can contribute:
- Documentation improvements
- Terminology translations
- Failed experiments you’ve encountered
- Questions that should be in this FAQ
- Suggestions for clearer explanations
- Domain expertise review (hydrogeology, stats, etc.)
See: CONTRIBUTING.md section “Types of Contributions”
61.4 How does the review process work?
We review for:
- Interdisciplinary clarity - Accessible to multiple audiences?
- Scientific rigor - Assumptions documented? Limitations acknowledged?
- Reproducibility - Can someone else run this?
- Code quality - Follows project conventions?
Timeline:
- Minor fixes: 1-2 days
- Code contributions: 3-7 days
- New chapters: 1-2 weeks
We aim for initial feedback within 48 hours.
62 Conceptual Questions
62.1 Why show multiple perspectives?
Different backgrounds need different explanations.
Example: “Autocorrelation”
- Statistician: Correlation between X_t and X_{t-k}
- Hydrogeologist: “Memory in the system” - aquifer responds slowly
- Computer Scientist: Sequential data points aren’t i.i.d.
All three are correct. Different framings help different people understand.
62.2 How is this different from research papers?
Research papers:
- Show successes (failures hidden)
- Methods often in supplements
- Code rarely available
- Written for experts in one discipline
- Static (published once)
This project:
- Documents failures as well as successes
- Methods explained in detail with multiple perspectives
- All code available and executable
- Written for multiple disciplines
- Living (continuously updated)
We’re building a different model for scientific communication.
62.3 Why does reproducibility matter?
Reproducible research enables:
- Verification - Others can check your work
- Extension - Others can build on your work
- Learning - Others can understand your methods
- Trust - Results can be validated
How we ensure reproducibility:
- All code version controlled
- Dependencies locked (`requirements.txt`)
- Random seeds set
- Paths configurable (not hard-coded)
- Data access documented
- Methods explained step-by-step
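The “paths configurable” point deserves a concrete sketch. The project references a `get_data_path()` helper elsewhere; the version below is our illustration only (the environment-variable fallback `AQUIFER_DATA_DIR` is an assumption, and the real helper reads `config/data_config.yaml`):

```python
import os

def get_data_path(filename, base=None):
    """Resolve a data file path from an environment variable or a default
    directory, so analysis code never hard-codes absolute paths.
    Sketch only -- the real helper reads config/data_config.yaml."""
    base = base or os.environ.get("AQUIFER_DATA_DIR", "data")
    return os.path.join(base, filename)

# The same notebook then runs unchanged on any machine; only the
# environment variable (or config file) differs between collaborators.
print(get_data_path("aquifer.db"))
```
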
63 Data and Methods
63.1 How do HTEM and wells compare?
HTEM (Helicopter Time-domain ElectroMagnetic):
- Continuous spatial coverage (every ~100m)
- Sees subsurface structure (geological layers)
- Fast data collection (helicopter survey)
- Indirect measurement (resistivity → material type)
- One-time snapshot (2008 survey)
Well measurements:
- Direct measurement (water level, depth)
- Time series (continuous monitoring)
- High precision (±0.01 ft)
- Sparse spatial coverage (356 wells across region)
- Expensive to collect (drilling costs)
Together: HTEM provides spatial structure, wells provide temporal dynamics. Fusion gives complete picture.
63.2 What machine learning methods are used?
Multiple approaches:
- Classification: Random Forest, XGBoost (material type from resistivity)
  - Translation: “Is this sand, clay, or gravel?” (categories)
  - Example: Given resistivity = 150 Ω·m, predict material type = “well-sorted sand”
- Regression: Linear, polynomial, ML regressors (property prediction)
  - Translation: “What number will this be?” (continuous values)
  - Example: Given resistivity and depth, predict hydraulic conductivity = 25 m/day
- Time Series: SARIMAX, Prophet (water level forecasting)
  - Translation: “What happens next?” (future predictions)
  - Example: Given past water levels and rainfall, predict levels 30 days ahead
- Spatial: Kriging, spatial regression (interpolation)
  - Translation: “What’s between the measurements?” (filling gaps)
  - Example: Given 356 well measurements, estimate water levels everywhere in between
- Clustering: DBSCAN (spatially-constrained)
  - Translation: “What groups naturally exist?” (finding patterns)
  - Example: Find groups of wells that behave similarly
Key principle: Methods must respect physical constraints. We don’t use black-box ML without domain validation.
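To make the classification row concrete, here is a deliberately oversimplified rule-based stand-in. The cutoffs are illustrative guesses, not values from the project’s trained models, which learn boundaries from labeled borehole data:

```python
def classify_material(resistivity_ohm_m):
    """Toy resistivity-to-material rule. The real chapters train Random
    Forest / XGBoost classifiers; these thresholds are illustrative only."""
    if resistivity_ohm_m < 30:
        return "clay"        # low resistivity: conductive, fine-grained
    if resistivity_ohm_m < 100:
        return "silt/till"   # intermediate resistivity
    return "sand/gravel"     # high resistivity: resistive, coarse-grained

print(classify_material(150.0))  # sand/gravel, cf. the 150 Ω·m example above
```

A trained model replaces these hand-picked cutoffs with boundaries learned from data, but the input/output contract (resistivity in, material category out) is the same.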
63.3 Why does plain linear regression fail on this data?
Violation of independence assumption.
Problem:
```python
# Water levels today correlated with water levels yesterday
# Violates "rows are independent" assumption
# Results in:
# - Wrong standard errors (too optimistic)
# - Misleading p-values (claim significance when there isn't)
# - Poor predictions (overconfident, doesn't capture dynamics)
```

Solution: Use time series methods (SARIMAX, VAR) that account for autocorrelation.
Key insight: Aquifers have memory. Water doesn’t instantly respond to rain—it persists over days, weeks, months. Time series methods capture this memory; linear regression doesn’t.
See: Part 3 - Temporal Dynamics for proper time series analysis.
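The “memory” claim is easy to verify numerically. A minimal sketch using a synthetic AR(1) series as a stand-in for daily water levels (no project data involved):

```python
import random

def lag1_autocorr(series):
    """Sample lag-1 autocorrelation of a sequence."""
    n = len(series)
    mean = sum(series) / n
    num = sum((series[i] - mean) * (series[i + 1] - mean) for i in range(n - 1))
    den = sum((v - mean) ** 2 for v in series)
    return num / den

# Simulate an AR(1) process: today's level is mostly yesterday's level
# plus noise -- the "memory" that plain regression ignores.
random.seed(42)
levels = [0.0]
for _ in range(999):
    levels.append(0.9 * levels[-1] + random.gauss(0, 1))

r1 = lag1_autocorr(levels)
print(f"lag-1 autocorrelation: {r1:.2f}")  # strongly positive
```

The same statistic computed on a regression’s residuals is what diagnostics like Durbin-Watson or Ljung-Box formalize: if it is far from zero, the independence assumption is violated.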
63.4 How is uncertainty handled?
Multiple approaches:
- Measurement uncertainty: Document precision of instruments
- Model uncertainty: Confidence intervals, prediction intervals
- Parameter uncertainty: Bootstrap, Bayesian credible intervals
- Scenario uncertainty: Sensitivity analysis, Monte Carlo
Key principle: Always quantify and report uncertainty. Point estimates without uncertainty are misleading.
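As one concrete instance of the “parameter uncertainty” bullet, a percentile bootstrap for a mean takes only a few lines. The recharge numbers below are made up purely for illustration:

```python
import random

random.seed(0)
# Hypothetical annual recharge estimates (inches) at 12 wells -- made-up data.
recharge = [4.2, 5.1, 3.8, 6.0, 4.9, 5.5, 4.4, 3.9, 5.2, 4.7, 5.8, 4.1]

def bootstrap_ci(data, n_boot=5000, alpha=0.05):
    """Percentile bootstrap confidence interval for the mean:
    resample with replacement, collect means, take the middle 1-alpha."""
    means = []
    for _ in range(n_boot):
        resample = [random.choice(data) for _ in data]
        means.append(sum(resample) / len(resample))
    means.sort()
    return means[int(n_boot * alpha / 2)], means[int(n_boot * (1 - alpha / 2))]

lo, hi = bootstrap_ci(recharge)
point = sum(recharge) / len(recharge)
print(f"mean recharge: {point:.2f} in, 95% CI: ({lo:.2f}, {hi:.2f})")
```

Reporting the interval alongside the point estimate is exactly the practice the key principle above asks for.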
63.5 Should I use kriging or machine learning?
Kriging (geostatistics):
- Provides uncertainty estimates (kriging variance)
- Optimal under Gaussian assumptions
- Accounts for spatial correlation explicitly
- Assumes stationarity
- Computationally expensive for large datasets
Machine Learning (Random Forest, XGBoost):
- No stationarity assumption
- Handles non-linear relationships
- Fast for large datasets
- Incorporates multiple covariates easily
- Uncertainty estimation more complex
Best practice: Try both, validate against hold-out data, compare results.
63.6 How are results physically validated?
Example: Mass balance check
```python
# Recharge - Discharge - Storage Change = 0 (mass conservation)
recharge = precipitation * recharge_coefficient
discharge = pumping + baseflow
storage_change = water_level_change * storativity * area

residual = recharge - discharge - storage_change

# Residual should be small (near zero)
if abs(residual / recharge) > 0.1:  # >10% error
    print("WARNING: Mass balance violated - check assumptions")
```

Other checks:
- Flow should be from high to low hydraulic head (no uphill flow)
- Transmissivity should be positive
- Porosity should be between 0 and 1
- Predictions should be within physical range
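The bulleted checks above can be bundled into a small validation helper. This is a sketch under our own naming, not the project’s API:

```python
def physical_sanity_checks(porosity, transmissivity,
                           head_upgradient, head_downgradient):
    """Return a list of violated physical constraints (empty = all pass).
    Sketch only -- names and checks are illustrative, not the project's API."""
    violations = []
    if not 0.0 < porosity < 1.0:
        violations.append("porosity must lie strictly between 0 and 1")
    if transmissivity <= 0.0:
        violations.append("transmissivity must be positive")
    if head_downgradient > head_upgradient:
        violations.append("implied flow from low to high hydraulic head")
    return violations

print(physical_sanity_checks(0.25, 850.0, 210.0, 198.0))   # [] -- all pass
print(len(physical_sanity_checks(1.3, -5.0, 198.0, 210.0)))  # 3 violations
```

Running such a checklist on every model output catches physically impossible predictions before they reach a figure or a forecast.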
64 Troubleshooting
64.1 Q: My forecast accuracy suddenly dropped. What should I check?
A: Work through this checklist:
- Data quality: Did a sensor fail or get recalibrated? Check for gaps or spikes.
- New wells added: If monitoring network changed, model may need retraining.
- Extreme weather: Unusual events (drought, flood) outside training data.
- Seasonal shift: Performance often drops in seasons with less training data.
- Code changes: Did someone modify preprocessing or feature engineering?
Quick fix: Retrain on most recent 2 years of data. If still poor, investigate data.
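The first checklist item (gaps from sensor failures) can be screened mechanically. A minimal sketch on daily timestamps with synthetic dates:

```python
from datetime import date, timedelta

def find_gaps(timestamps, max_gap_days=1):
    """Return (last_before, first_after) pairs where consecutive measurements
    are farther apart than max_gap_days -- a quick sensor-outage screen."""
    return [(a, b) for a, b in zip(timestamps, timestamps[1:])
            if (b - a).days > max_gap_days]

# Daily series with a week-long outage after March 3 (synthetic dates).
ts = [date(2024, 3, 1) + timedelta(days=d) for d in (0, 1, 2, 9, 10)]
print(find_gaps(ts))  # one gap: 2024-03-03 to 2024-03-10
```

If this turns up gaps in the period where accuracy dropped, fix the data before touching the model.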
64.2 Q: The dashboard shows “Insufficient Data” for a chapter. How do I fix it?
A: This means the required databases aren’t available or accessible. Check:
- File exists: Is `data/aquifer.db` present? Is `data/warm.db` present?
- Path correct: Check `config/data_config.yaml` points to correct locations.
- Table exists: Connect to SQLite and verify tables exist (`sqlite3 data/aquifer.db ".tables"`)
- Date overlap: Weather and groundwater data must cover same time period.
See Data Dictionary for required table schemas.
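The table-existence check can also be scripted from Python. A sketch against a throwaway database (in practice, point it at `data/aquifer.db` and `data/warm.db`; the table names below are illustrative):

```python
import os
import sqlite3
import tempfile

def missing_tables(db_path, required):
    """Return the set of required tables absent from a SQLite database."""
    with sqlite3.connect(db_path) as conn:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type='table'").fetchall()
    return set(required) - {name for (name,) in rows}

# Demonstration with a temporary database; real checks would use the
# paths from config/data_config.yaml and the schemas in the Data Dictionary.
path = os.path.join(tempfile.mkdtemp(), "demo.db")
with sqlite3.connect(path) as conn:
    conn.execute("CREATE TABLE water_levels (well_id TEXT, ts TEXT, level REAL)")

print(missing_tables(path, ["water_levels", "weather"]))  # {'weather'}
```

An empty result means every required table is present; anything else names exactly what the dashboard is missing.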
64.3 Q: Which tool should I use for my problem?
A: Use this decision matrix:
| Your Goal | Use This Tool | Chapter |
|---|---|---|
| Avoid drilling dry holes | Material Classification ML | Part 5 Ch.1 |
| Predict drought 7-14 days ahead | Water Level Forecasting | Part 5 Ch.2 |
| Detect sensor failures | Anomaly Detection | Part 5 Ch.3 |
| Find best drilling locations | Well Placement Optimizer | Part 5 Ch.4 |
| Design recharge systems | MAR Site Selection | Part 5 Ch.5 |
| Explain predictions to stakeholders | Explainable AI | Part 5 Ch.6 |
64.4 Q: The optimizer recommends expensive sites. How do I adjust?
A: The optimizer uses weighted objectives. To prioritize cost:
- Increase cost weight: Change from 0.25 to 0.40 in objective function
- Add cost constraint: “Maximum drilling cost < $50K”
- View Pareto frontier: Look at lower-cost alternatives with slightly lower yield
See Well Placement Optimizer for parameter tuning.
64.5 How do I troubleshoot code issues?
Common issues:
- Data not found: Check `data/` directory exists with all 4 sources
- Import errors: Run `pip install -r requirements.txt`
- Wrong Python version: Requires Python 3.11+
- Path issues: Use `get_data_path()` from config, not hard-coded paths
- TIMESTAMP parsing: Use explicit format `%m/%d/%Y`
If still stuck: Open an issue with error message and minimal reproducible example.
64.6 How do I fix Quarto rendering issues?
Check:
- Quarto installed? `quarto --version`
- In correct directory? `cd aquifer-book`
- Python environment active? `which python`
- All packages installed? `pip install -r requirements.txt`
Common errors:
- “File not found”: Check paths in chapters are relative to `aquifer-book/`
- “Module not found”: Python environment not activated
- “Code execution failed”: Set `execute: enabled: false` in chapter front matter if code shouldn’t run
64.7 What do autocorrelation warnings mean?
Your residuals are correlated (bad for standard regression).
What it means:
- Model hasn’t captured all temporal structure
- Standard errors are wrong
- P-values misleading
Solutions:
- Use time series methods (SARIMAX instead of regression)
- Add lag terms to capture temporal structure
- Use Newey-West standard errors (robust to autocorrelation)
- Check for omitted variables
See: Part 3 - Temporal Dynamics for time series analysis.
64.8 Why are my predictions unrealistic?
Common causes:
- Extrapolation beyond training range - Model doesn’t know what to do
- Ignoring spatial autocorrelation - Random splits leak information
- Missing physical constraints - Model violates physics
- Wrong coordinate system - Spatial relationships broken
Solutions:
- Use spatial cross-validation (block CV)
- Add physical constraints to model
- Verify coordinate transformations
- Check for extrapolation (plot training data extent)
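The spatial cross-validation point deserves a concrete picture: instead of random splits, hold out spatially contiguous blocks so nearby (correlated) points never straddle the train/test boundary. A minimal 1-D blocking sketch; real workflows would block in 2-D and could feed these labels to a grouped splitter such as scikit-learn’s `GroupKFold`:

```python
def spatial_block_split(points, n_blocks=4):
    """Assign each (x, y) point a block label by binning x at quantile edges,
    so held-out folds are spatially contiguous instead of random.
    Sketch only: 1-D blocking on x; real use would block in 2-D."""
    xs = sorted(x for x, _ in points)
    edges = [xs[int(i * len(xs) / n_blocks)] for i in range(1, n_blocks)]
    return [sum(x >= e for e in edges) for x, _ in points]

# Four hypothetical well coordinates; two blocks split them east/west.
wells = [(0.1, 0.5), (0.3, 0.9), (0.55, 0.2), (0.9, 0.7)]
print(spatial_block_split(wells, n_blocks=2))  # [0, 0, 1, 1]
```

Validating on a held-out block forces the model to predict at genuinely unseen locations, which is what exposes both information leakage and extrapolation.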
65 Project Philosophy
65.1 Why emphasize interdisciplinary communication?
Real science is interdisciplinary, but communication gaps cause problems:
- Computer scientists build models that violate physical laws
- Hydrogeologists miss insights available from modern ML
- Statisticians apply methods without understanding domain constraints
- Results get lost in translation between disciplines
Solution: Build communication bridges. Translate jargon. Show multiple perspectives. Document assumptions.
Goal: Enable true collaboration, not just “throw results over the wall.”
65.2 Why are failed experiments valuable?
Reasons:
- Save time: Others don’t repeat your mistakes
- Teach assumptions: Show when methods fail reveals what they assume
- Honest science: Real research includes failures, not just successes
- Build intuition: Understanding failure deepens understanding of success
Example: Documenting “linear regression failed on time series” teaches more about time series structure than just showing “SARIMAX worked.”
65.3 What does “living document” mean?
Traditional documents: Written once, published, frozen.
Living documents: Continuously updated as we learn.
This project:
- New chapters added as new analyses done
- Failed experiments updated as new failures discovered
- Terminology translation grows as new terms encountered
- Methods updated as better approaches found
- Community contributions integrated
Version controlled so you can see how it evolves.
66 Still Have Questions?
66.1 Where can I ask questions?
Options:
- GitHub Discussions - General questions, ideas, conversations
- GitHub Issues - Bugs, feature requests, specific problems
- This FAQ - Submit PR to add questions others might have
66.2 How do I suggest an FAQ entry?
Two ways:
- Open an issue with label
documentationand title “FAQ: [your question]” - Submit a PR adding your question+answer to this file
We’ll review and integrate if it’s generally useful.
66.3 Who maintains this project?
Current maintainers: See CONTRIBUTING.md for contribution guidelines and maintainer information
Contributions welcome from anyone! This is designed as a community project.
See: CONTRIBUTING.md
Didn’t find your question? Ask in Discussions or open an issue!
Last Updated: November 26, 2025
Version: 2.0 (Consolidated for Playbook)
Maintainers: Community-driven (open to contributors)