A repository of examples that apply statistical and machine learning algorithms to hydropedology.
Most examples use R.
The data are a combination of USGS stream discharge, landscape, and climate data.
https://help.waterdata.usgs.gov/
https://owi.usgs.gov/R/dataRetrieval.html
https://www.sciencebase.gov/catalog/item/59692a64e4b0d1f9f05fbd39
http://www.prism.oregonstate.edu/
To keep it simple, focus on:
- Precipitation
- Mean temperature
- Dew point temperature? Use this to get at relative humidity?
- How can I automate the download of these data?
- Could I use these data to optimize a phase curve via logistic regression?
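The phase-curve question above can be sketched as a one-predictor logistic regression: the probability of snow as a function of air temperature, with the 50% rain-snow threshold read off the fitted coefficients. This is a minimal, dependency-free Python sketch fitted by plain gradient ascent. The observations are invented for illustration (they are not PRISM or station data), and the function names are hypothetical. In R the same model would typically be fitted with `glm(snow ~ temp, family = binomial)`.

```python
import math

def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def fit_logistic(x, y, learning_rate=0.1, n_iter=20000):
    """Fit P(y=1) = sigmoid(b0 + b1*x) by gradient ascent on the mean log-likelihood."""
    b0, b1 = 0.0, 0.0
    n = len(x)
    for _ in range(n_iter):
        g0 = g1 = 0.0
        for xi, yi in zip(x, y):
            resid = yi - sigmoid(b0 + b1 * xi)  # observed minus predicted
            g0 += resid
            g1 += resid * xi
        b0 += learning_rate * g0 / n
        b1 += learning_rate * g1 / n
    return b0, b1

# Invented observations: air temperature in deg C, 1 = snow, 0 = rain.
temp_c = [-4, -3, -2, -1, 0, 0.5, 1, 1.5, 2, 3, 4, 5]
is_snow = [1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0]

b0, b1 = fit_logistic(temp_c, is_snow)
t50 = -b0 / b1  # temperature at which P(snow) = 0.5

def snow_probability(t):
    """Fitted probability of snow at air temperature t (deg C)."""
    return sigmoid(b0 + b1 * t)

print(f"intercept={b0:.2f}, slope={b1:.2f}, 50% threshold={t50:.2f} deg C")
```

Adding dew point temperature or precipitation as further predictors would only require extending the linear term inside `sigmoid`.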
Focus: ISRIC soils information (https://www.isric.org/)
Data availability: ISRIC Soil Data Hub (https://data.isric.org)
- Is there a relationship between soil attributes and climate (Köppen-Geiger)?
- We know this from the five soil-forming factors, but can we quantify the relationship?
- Can I tell which continent a soil came from?
- What are the most important attributes defining a soil (relative to the data I have)? (PCA or NMDS question.)
- Do different soil attributes influence one another? (SEM question.)
- Are mean annual temperature data from PRISM and actual station data different from each other?
- Is there geographic bias in the errors or significant differences?
- Pairwise t-tests or other comparisons (Mann-Whitney)
- Download the data from CompBio
- Frequentist vs Bayesian methods
- Can we predict the phase of snow using air temperature and other environmental data?
- This is a classification problem that could be addressed with logistic regression and SVM.
- Are there significant trends in annual discharge over time?
- Linear regression
- Map out the slope of significant trends across the US.
- Use leaflet with clickable links to view individual annual hydrographs, each marked with a colored trend line and with abnormal years highlighted using the empirical density function.
- Include both Frequentist and Bayesian forms of the analysis.
- Is there a relationship between annual discharge, temperature, snow, elevation, etc.?
- Multiple linear regression
- What role do different landscape features play in the above relationships?
- Could use the GAGES-II data set for this
- Hierarchical multiple linear regression
- Frequentist and Bayesian
- Are there "natural" groups of discharge sensitivity (represented by the steepness of the slope)?
- Discriminant analysis
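The trend question in the list above (annual discharge regressed on year) can be sketched with ordinary least squares. The snippet below is a minimal, dependency-free Python sketch that returns the slope, its standard error, and the t statistic. The discharge values are invented for illustration, and `ols_trend` is a hypothetical helper name. In R the equivalent would usually be `lm(discharge ~ year)`.

```python
import math

def ols_trend(years, values):
    """Simple linear regression of values on years.

    Returns (slope, intercept, slope_se, t_stat), where t_stat = slope / slope_se
    has n - 2 degrees of freedom under the usual OLS assumptions.
    """
    n = len(years)
    if n < 3:
        raise ValueError("need at least 3 points to estimate a standard error")
    mean_x = sum(years) / n
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in years)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, values))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual sum of squares around the fitted line.
    rss = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(years, values))
    slope_se = math.sqrt(rss / (n - 2) / sxx)
    if slope_se > 0:
        t_stat = slope / slope_se
    else:
        t_stat = math.copysign(math.inf, slope)  # perfect fit
    return slope, intercept, slope_se, t_stat

# Invented annual discharge series (arbitrary units) for one hypothetical gage.
years = list(range(2000, 2010))
discharge = [101, 96, 96, 89, 88, 86, 81, 81, 74, 73]

slope, intercept, slope_se, t_stat = ols_trend(years, discharge)
print(f"slope={slope:.3f} per year, se={slope_se:.3f}, t={t_stat:.1f}")
```

Running this per gage and keeping the slopes whose t statistic clears a chosen significance threshold would give the values to map across the US.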
How should I organize these algorithms? By output data type? (This will help me figure out how to organize the site.)
Data types
- Categorical
- Nominal (Categories with no obvious relationship)
- Ordinal (Categories in which order does matter)
- Numerical
- Interval (Numeric data with equal spacing between values but no true zero, e.g., temperature in °C: -5, 0, 5, 10)
- Ratio (Numeric data with equal spacing and a true zero, e.g., discharge)
Further attributes to consider
- Data output type
- Data input type
- Parameter type
- Single
- Multiple
- Mixed (categorical and numerical)
- Linear regression
- Frequentist
- Bayesian
- Support Vector Machine
- Generalized Linear Model (GLM)
- Logistic regression
- Frequentist
- Bayesian
- Other GLMs
- Frequentist
- Bayesian
- Generalized Additive Model (GAM)
- Dimensional reduction
- PCA
- NMDS
- Classification
- Supervised
- Random forest
- party (R package)
- Naive Bayes
- Is this the same thing as discriminant analysis (which uses Bayes' Theorem)?
- Unsupervised
- K-means clustering
- kNN (K-nearest neighbors)
- Supervised
- Gradient boosting
- Collaborative filtering
- ARIMA
- Neural Nets
- A/B testing
- t-tests
- Mann-Whitney
- Hierarchical modeling
- Focus: van Genuchten model
- Frequentist
- lme4 (R package)
- Bayesian
- Stan
- Deeply nested
- Structural equation modeling
- Frequentist
- Bayesian
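For the two-group comparison entries in the list above (t-tests and Mann-Whitney), the statistics themselves are short enough to sketch directly. The following dependency-free Python sketch computes Welch's t statistic (the unequal-variance form, which is the default in R's `t.test()`) and the Mann-Whitney U statistic with a normal-approximation p-value (R: `wilcox.test()`). The function names and sample values are hypothetical, and the U p-value omits the tie and continuity corrections, so treat it as approximate.

```python
import math

def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    va = sum((x - ma) ** 2 for x in a) / (na - 1)  # sample variances
    vb = sum((x - mb) ** 2 for x in b) / (nb - 1)
    se2 = va / na + vb / nb
    t = (ma - mb) / math.sqrt(se2)
    df = se2 ** 2 / ((va / na) ** 2 / (na - 1) + (vb / nb) ** 2 / (nb - 1))
    return t, df

def mann_whitney_u(a, b):
    """Mann-Whitney U for sample a, with a two-sided normal-approximation p-value.

    Tied values receive midranks. The variance used here omits the tie
    correction and the continuity correction.
    """
    na, nb = len(a), len(b)
    pooled = sorted([(v, 0) for v in a] + [(v, 1) for v in b])
    rank_sum_a = 0.0
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        midrank = (i + 1 + j) / 2.0  # average of ranks i+1 .. j
        rank_sum_a += midrank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j
    u = rank_sum_a - na * (na + 1) / 2.0
    mean_u = na * nb / 2.0
    sd_u = math.sqrt(na * nb * (na + nb + 1) / 12.0)
    z = (u - mean_u) / sd_u
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return u, p

# Invented measurements for two groups.
group_a = [4.1, 5.0, 4.7, 5.3, 4.4]
group_b = [6.2, 5.9, 6.8, 7.1, 6.5]

t, df = welch_t(group_a, group_b)
u, p = mann_whitney_u(group_a, group_b)
print(f"Welch t={t:.2f} (df={df:.1f}); Mann-Whitney U={u:.1f}, approx p={p:.3f}")
```

Both functions assume independent samples; measurements taken at the same sites call for a paired test instead.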
Other topics that don't fit neatly into the space above.
- Leave-one-out cross-validation
- k-fold cross-validation
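The two cross-validation schemes above differ only in the number of folds: leave-one-out is k-fold with k equal to the number of observations. The dependency-free Python sketch below builds the train/test index splits and uses them to score a straight-line fit. The helper names and data are hypothetical, and the folds are contiguous, so the data should be shuffled first if their order is not random.

```python
def k_fold_indices(n, k):
    """Split indices 0..n-1 into k contiguous folds whose sizes differ by at most one.

    Returns a list of (train_idx, test_idx) pairs.
    """
    if not 2 <= k <= n:
        raise ValueError("k must be between 2 and n")
    base, extra = divmod(n, k)
    splits = []
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n))
        splits.append((train_idx, test_idx))
        start += size
    return splits

def cross_validated_mse(x, y, k):
    """Mean squared error of a straight-line fit, evaluated on held-out points only."""
    sq_errors = []
    for train_idx, test_idx in k_fold_indices(len(x), k):
        xs = [x[i] for i in train_idx]
        ys = [y[i] for i in train_idx]
        mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
        sxx = sum((v - mx) ** 2 for v in xs)
        slope = sum((a - mx) * (b - my) for a, b in zip(xs, ys)) / sxx
        intercept = my - slope * mx
        sq_errors.extend((y[i] - (intercept + slope * x[i])) ** 2 for i in test_idx)
    return sum(sq_errors) / len(sq_errors)

# Invented data, roughly y = 2x + 1.
x = list(range(10))
y = [1.2, 2.9, 5.1, 7.0, 8.8, 11.1, 13.0, 14.9, 17.2, 19.0]

mse_5fold = cross_validated_mse(x, y, k=5)
mse_loo = cross_validated_mse(x, y, k=len(x))  # leave-one-out
print(f"5-fold MSE={mse_5fold:.4f}, leave-one-out MSE={mse_loo:.4f}")
```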