diff --git a/01-introduction.Rmd b/01-introduction.Rmd index c8f6719ef..3aac7f296 100644 --- a/01-introduction.Rmd +++ b/01-introduction.Rmd @@ -116,7 +116,7 @@ With a wide range of packages, R also supports advanced geospatial statistics\in \index{R!language} New integrated development environments (IDEs\index{IDE}) such as RStudio\index{RStudio} have made R more user-friendly for many, easing map making with a panel dedicated to interactive visualization. -At its core, R is an object-oriented, [functional programming language](http://adv-r.had.co.nz/Functional-programming.html) [@wickham_advanced_2019], and was specifically designed as an interactive interface to other software [@chambers_extending_2016]. +At its core, R is an object-oriented, [functional programming language](https://adv-r.hadley.nz/fp.html) [@wickham_advanced_2019], and was specifically designed as an interactive interface to other software [@chambers_extending_2016]. The latter also includes many 'bridges' to a treasure trove of GIS\index{GIS} software, 'geolibraries' and functions (see Chapter \@ref(gis)). It is thus ideal for quickly creating 'geo-tools', without needing to master lower level languages (compared to R) such as C\index{C}, FORTRAN\index{FORTRAN} or Java\index{Java} (see Section \@ref(software-for-geocomputation)). \index{R} diff --git a/12-spatial-cv.Rmd b/12-spatial-cv.Rmd index f55f8b210..e06e365da 100644 --- a/12-spatial-cv.Rmd +++ b/12-spatial-cv.Rmd @@ -6,21 +6,21 @@ This chapter assumes proficiency with geographic data analysis\index{geographic A familiarity with generalized linear models (GLM)\index{GLM} and machine learning\index{machine learning} is highly recommended [for example from @zuur_mixed_2009;@james_introduction_2013]. The chapter uses the following packages:^[ -Package **kernlab**, **pROC**, **RSAGA**\index{RSAGA (package)} and **spDataLarge** must also be installed although these do not need to be attached. +Packages **GGally**, **lgr**, **kernlab**, **ml3measures**, **paradox**, **pROC**, **progressr** and **spDataLarge** must also be installed although these do not need to be attached. ] -```{r 11-spatial-cv-1, message=FALSE } -library(sf) -library(raster) +```{r 12-spatial-cv-1, message=FALSE} library(dplyr) +library(mlr3) +library(mlr3learners) +library(mlr3extralearners) +library(mlr3spatiotempcv) +library(mlr3tuning) +library(mlr3viz) +library(sf) +library(terra) ``` -```{r 11-spatial-cv-1-2, eval=FALSE} -library(mlr) -library(parallelMap) -``` - - Required data will be attached in due course. ## Introduction {#intro-cv1} @@ -69,83 +69,64 @@ Spatial CV\index{cross-validation!spatial CV} alleviates this problem and is the ## Case study: Landslide susceptibility {#case-landslide} This case study is based on a dataset of landslide locations in Southern Ecuador, illustrated in Figure \@ref(fig:lsl-map) and described in detail in @muenchow_geomorphic_2012. -A subset of the dataset used in that paper is provided in the **RSAGA**\index{RSAGA (package)} package, which can be loaded as follows: +A subset of the dataset used in that paper is provided in the **spDataLarge**\index{spDataLarge (package)} package, which can be loaded as follows: -```{r 11-spatial-cv-2, eval=FALSE} -data("landslides", package = "RSAGA") +```{r 12-spatial-cv-2} +data("lsl", "study_mask", package = "spDataLarge") +ta = terra::rast(system.file("raster/ta.tif", package = "spDataLarge")) ``` -This should load three objects: a `data.frame` named `landslides`, a `list` named `dem`, and an `sf` object named `study_area`. -`landslides` contains a factor column `lslpts` where `TRUE` corresponds to an observed landslide 'initiation point', with the coordinates stored in columns `x` and `y`.^[ +This should load three objects: a `data.frame` named `lsl`, an `sf` object named `study_mask` and a `SpatRaster` (see Section \@ref(raster-classes)) named `ta` containing terrain attribute rasters. +`lsl` contains a factor column `lslpts` where `TRUE` corresponds to an observed landslide 'initiation point', with the coordinates stored in columns `x` and `y`.^[ The landslide initiation point is located in the scarp of a landslide polygon. See @muenchow_geomorphic_2012 for further details. ] - -There are 175 landslide points and 1360 non-landslide, as shown by `summary(landslides)`. -The 1360 non-landslide points were sampled randomly from the study area, with the restriction that they must fall outside a small buffer around the landslide polygons. - -To make the number of landslide and non-landslide points balanced, let us sample 175 from the 1360 non-landslide points.^[The `landslides` dataset has been used in classes and summer schools. -To show how predictive performance of different algorithms changes with an unbalanced and highly spatially autocorrelated response variable, 1360 non-landslide points were randomly selected, i.e., many more absences than presences. -However, especially a logistic regression\index{regression!logistic} with a log-link, as used in this chapter, expects roughly the same number of presences and absences in the response.] - -```{r 11-spatial-cv-3, eval=FALSE} -# select non-landslide points -non_pts = filter(landslides, lslpts == FALSE) -# select landslide points -lsl_pts = filter(landslides, lslpts == TRUE) -# randomly select 175 non-landslide points -set.seed(11042018) -non_pts_sub = sample_n(non_pts, size = nrow(lsl_pts)) -# create smaller landslide dataset (lsl) -lsl = bind_rows(non_pts_sub, lsl_pts) -``` - -`dem` is a digital elevation model\index{digital elevation model} consisting of two elements: -`dem$header`, a `list` which represents a raster 'header'\index{raster!header} (see Section \@ref(raster-data)), and `dem$data`, a matrix with the altitude of each pixel. -`dem` can be converted into a `raster` object with: - - - -```{r 11-spatial-cv-4, eval=FALSE} -dem = raster( - dem$data, - crs = dem$header$proj4string, - xmn = dem$header$xllcorner, - xmx = dem$header$xllcorner + dem$header$ncols * dem$header$cellsize, - ymn = dem$header$yllcorner, - ymx = dem$header$yllcorner + dem$header$nrows * dem$header$cellsize - ) -``` - +There are 175 landslide and 175 non-landslide points, as shown by `summary(lsl$lslpts)`. +The 175 non-landslide points were sampled randomly from the study area, with the restriction that they must fall outside a small buffer around the landslide polygons. ```{r lsl-map, echo=FALSE, fig.cap="Landslide initiation points (red) and points unaffected by landsliding (blue) in Southern Ecuador.", fig.scap="Landslide initiation points."} # library(tmap) # data("lsl", package = "spDataLarge") -# data("ta", package = "spDataLarge") -# lsl_sf = st_as_sf(lsl, coords = c("x", "y"), crs = 32717) -# hs = hillShade(ta$slope * pi / 180, terrain(ta$elev, opt = "aspect")) -# rect = tmaptools::bb_poly(hs) -# bbx = tmaptools::bb(hs, xlim = c(-0.02, 1), ylim = c(-0.02, 1), relative = TRUE) +# ta = terra::rast(system.file("raster/ta.tif", package = "spDataLarge")) +# lsl_sf = sf::st_as_sf(lsl, coords = c("x", "y"), crs = 32717) +# hs = terra::shade(slope = ta$slope * pi / 180, +# terra::terrain(ta$elev, v = "aspect", unit = "radians")) +# # so far tmaptools does not support terra objects # -# tm_shape(hs, bbox = bbx) + -# tm_grid(col = "black", n.x = 1, n.y = 1, labels.inside.frame = FALSE, -# labels.rot = c(0, 90)) + -# tm_raster(palette = gray(0:100 / 100), n = 100, legend.show = FALSE) + -# tm_shape(ta$elev) + -# tm_raster(alpha = 0.5, palette = terrain.colors(10), legend.show = FALSE) + -# tm_shape(lsl_sf) + -# tm_bubbles("lslpts", size = 0.2, palette = "-RdYlBu", title.col = "Landslide: ") + -# # tm_shape(sam) + -# # tm_bubbles(border.col = "gold", border.lwd = 2, alpha = 0, size = 0.5) + +# rect = tmaptools::bb_poly(raster::raster(hs)) +# bbx = tmaptools::bb(raster::raster(hs), xlim = c(-0.0001, 1), +# ylim = c(-0.0001, 1), relative = TRUE) +# +# map = tm_shape(hs, bbox = bbx) + +# tm_grid(col = "black", n.x = 1, n.y = 1, labels.inside.frame = FALSE, +# labels.rot = c(0, 90)) + +# tm_raster(palette = gray(0:100 / 100), n = 100, legend.show = FALSE) + +# tm_shape(ta$elev) + +# tm_raster(alpha = 0.5, palette = terrain.colors(10), legend.show = FALSE) + +# tm_shape(lsl_sf) + +# tm_bubbles("lslpts", size = 0.2, palette = "-RdYlBu", +# title.col = "Landslide: ") + # qtm(rect, fill = NULL) + -# tm_layout(outer.margins = c(0.04, 0.04, 0.02, 0.02), frame = FALSE) + +# tm_layout(outer.margins = c(0.04, 0.04, 0.02, 0.02), frame = FALSE) + # tm_legend(bg.color = "white") +# tmap::tmap_save(map, filename = "figures/lsl-map-1.png", width = 11, +# height = 11, units = "cm") knitr::include_graphics("figures/lsl-map-1.png") ``` \index{hillshade} +The first three rows of `lsl`, rounded to two significant digits, can be found in Table \@ref(tab:lslsummary). + +```{r lslsummary, echo=FALSE, warning=FALSE} +lsl_table = lsl |> + mutate(across(.cols = -any_of(c("x", "y", "lslpts")), ~signif(., 2))) |> + head(3) +knitr::kable(lsl_table, caption = "Structure of the lsl dataset.", + caption.short = "`lsl` dataset.", booktabs = TRUE) |> + kableExtra::kable_styling(latex_options = "scale_down") +``` + To model landslide susceptibility, we need some predictors. -Terrain attributes are frequently associated with landsliding [@muenchow_geomorphic_2012], and these can be computed from the digital elevation model (`dem`) using R-GIS bridges (see Chapter \@ref(gis)). -We leave it as an exercise to the reader to compute the following terrain attribute rasters and extract the corresponding values to our landslide/non-landslide data frame (see exercises; we also provide the resulting data frame via the **spDataLarge** package, see further below): +Since terrain attributes are frequently associated with landsliding [@muenchow_geomorphic_2012], we have already extracted following terrain attributes from `ta` to `lsl`: - `slope`: slope angle (°). - `cplan`: plan curvature (rad m^−1^) expressing the convergence or divergence of a slope and thus water flow. @@ -153,37 +134,26 @@ We leave it as an exercise to the reader to compute the following terrain attrib - `elev`: elevation (m a.s.l.) as the representation of different altitudinal zones of vegetation and precipitation in the study area. - `log10_carea`: the decadic logarithm of the catchment area (log10 m^2^) representing the amount of water flowing towards a location. -Data containing the landslide points, with the corresponding terrain attributes, is provided in the **spDataLarge** package, along with the terrain attribute raster stack from which the values were extracted. +Data containing the landslide points, with the corresponding terrain attributes, is provided in the **spDataLarge** package, along with the multilayer terrain attribute raster from which the values were extracted. Hence, if you have not computed the predictors yourself, attach the corresponding data before running the code of the remaining chapter: -```{r 11-spatial-cv-5} +```{r 12-spatial-cv-5} # attach landslide points with terrain attributes data("lsl", package = "spDataLarge") # attach terrain attribute raster stack -ta = stack(system.file("raster/ta.tif", package = "spDataLarge")) -``` - -The first three rows of `lsl`, rounded to two significant digits, can be found in Table \@ref(tab:lslsummary). - -```{r lslsummary, echo=FALSE, warning=FALSE} -lsl_table = lsl %>% - mutate_at(vars(-one_of("x", "y", "lslpts")), list(~signif(., 2))) %>% - head(3) -knitr::kable(lsl_table, caption = "Structure of the lsl dataset.", - caption.short = "`lsl` dataset.", booktabs = TRUE) %>% - kableExtra::kable_styling(latex_options="scale_down") +ta = terra::rast(system.file("raster/ta.tif", package = "spDataLarge")) ``` ## Conventional modeling approach in R {#conventional-model} -Before introducing the **mlr**\index{mlr (package)} package, an umbrella-package providing a unified interface to dozens of learning algorithms (Section \@ref(spatial-cv-with-mlr)), it is worth taking a look at the conventional modeling interface in R\index{R}. -This introduction to supervised statistical learning\index{statistical learning} provides the basis for doing spatial CV\index{cross-validation!spatial CV}, and contributes to a better grasp on the **mlr**\index{mlr (package)} approach presented subsequently. +Before introducing the **mlr3**\index{mlr3 (package)} package, an umbrella-package providing a unified interface to dozens of learning algorithms (Section \@ref(spatial-cv-with-mlr3)), it is worth taking a look at the conventional modeling interface in R\index{R}. +This introduction to supervised statistical learning\index{statistical learning} provides the basis for doing spatial CV\index{cross-validation!spatial CV}, and contributes to a better grasp on the **mlr3**\index{mlr3 (package)} approach presented subsequently. Supervised learning involves predicting a response variable as a function of predictors (Section \@ref(intro-cv)). In R\index{R}, modeling functions are usually specified using formulas (see `?formula` and the detailed [Formulas in R Tutorial](https://www.datacamp.com/community/tutorials/r-formula-tutorial) for details of R formulas). The following command specifies and runs a generalized linear model\index{GLM}: -```{r 11-spatial-cv-6} +```{r 12-spatial-cv-6} fit = glm(lslpts ~ slope + cplan + cprof + elev + log10_carea, family = binomial(), data = lsl) @@ -197,7 +167,7 @@ It is worth understanding each of the three input arguments: The results of this model can be printed as follows (`summary(fit)` provides a more detailed account of the results): -```{r 11-spatial-cv-7} +```{r 12-spatial-cv-7} class(fit) fit ``` @@ -207,34 +177,42 @@ It can also be used for prediction. This is done with the generic `predict()` method, which in this case calls the function `predict.glm()`. Setting `type` to `response` returns the predicted probabilities (of landslide occurrence) for each observation in `lsl`, as illustrated below (see `?predict.glm`): -```{r 11-spatial-cv-8} +```{r 12-spatial-cv-8} pred_glm = predict(object = fit, type = "response") head(pred_glm) ``` Spatial predictions can be made by applying the coefficients to the predictor rasters. -This can be done manually or with `raster::predict()`. -In addition to a model object (`fit`), this function also expects a raster stack with the predictors named as in the model's input data frame (Figure \@ref(fig:lsl-susc)). +This can be done manually or with `terra::predict()`. +In addition to a model object (`fit`), this function also expects a `SpatRaster` with the predictors named as in the model's input data frame (Figure \@ref(fig:lsl-susc)). -```{r 11-spatial-cv-9} +```{r 12-spatial-cv-9, eval=FALSE} # making the prediction -pred = raster::predict(ta, model = fit, type = "response") +pred = terra::predict(ta, model = fit, type = "response") ``` ```{r lsl-susc, echo=FALSE, fig.cap="Spatial prediction of landslide susceptibility using a GLM.", fig.scap = "Spatial prediction of landslide susceptibility.", warning=FALSE} -# attach study mask for the natural part of the study area -# data("study_mask", package = "spDataLarge") +# # attach study mask for the natural part of the study area +# data("lsl", "study_mask", package = "spDataLarge") +# ta = terra::rast(system.file("raster/ta.tif", package = "spDataLarge")) +# study_mask = terra::vect(study_mask) # # white raster to only plot the axis ticks, otherwise gridlines would be visible -# tm_shape(hs, bbox = bbx) + +# lsl_sf = sf::st_as_sf(lsl, coords = c("x", "y"), crs = 32717) +# hs = terra::shade(ta$slope * pi / 180, +# terra::terrain(ta$elev, v = "aspect", unit = "radians")) +# rect = tmaptools::bb_poly(raster::raster(hs)) +# bbx = tmaptools::bb(raster::raster(hs), xlim = c(-0.00001, 1), +# ylim = c(-0.00001, 1), relative = TRUE) +# map = tm_shape(hs, bbox = bbx) + # tm_grid(col = "black", n.x = 1, n.y = 1, labels.inside.frame = FALSE, # labels.rot = c(0, 90)) + # tm_raster(palette = "white", legend.show = FALSE) + # # hillshade -# tm_shape(mask(hs, study_mask), bbox = bbx) + +# tm_shape(terra::mask(hs, study_mask), bbox = bbx) + # tm_raster(palette = gray(0:100 / 100), n = 100, # legend.show = FALSE) + # # prediction raster -# tm_shape(mask(pred, study_area)) + +# tm_shape(terra::mask(pred, study_mask)) + # tm_raster(alpha = 0.5, palette = "Reds", n = 6, legend.show = TRUE, # title = "Susceptibility") + # # rectangle and outer margins @@ -242,11 +220,13 @@ pred = raster::predict(ta, model = fit, type = "response") # tm_layout(outer.margins = c(0.04, 0.04, 0.02, 0.02), frame = FALSE, # legend.position = c("left", "bottom"), # legend.title.size = 0.9) +# tmap::tmap_save(map, filename = "figures/lsl-susc-1.png", width = 11, +# height = 11, units = "cm") knitr::include_graphics("figures/lsl-susc-1.png") ``` Here, when making predictions we neglect spatial autocorrelation\index{autocorrelation!spatial} since we assume that on average the predictive accuracy remains the same with or without spatial autocorrelation structures. -However, it is possible to include spatial autocorrelation\index{autocorrelation!spatial} structures into models [@zuur_mixed_2009;@blangiardo_spatial_2015;@zuur_beginners_2017] as well as into predictions [kriging approaches, see, e.g., @goovaerts_geostatistics_1997;@hengl_practical_2007;@bivand_applied_2013]. +However, it is possible to include spatial autocorrelation\index{autocorrelation!spatial} structures into models [@zuur_mixed_2009;@blangiardo_spatial_2015;@zuur_beginners_2017] as well as into predictions [kriging approaches, see, e.g., @goovaerts_geostatistics_1997;@hengl_practical_2007;@bivand_applied_2013]. This is, however, beyond the scope of this book. -0.83 -represents a good fit. +0.82 represents a good fit. However, this is an overoptimistic estimation since we have computed it on the complete dataset. To derive a biased-reduced assessment, we have to use cross-validation\index{cross-validation} and in the case of spatial data should make use of spatial CV\index{cross-validation!spatial CV}. @@ -315,72 +294,90 @@ This is because there are still a few points in the test and training data which knitr::include_graphics("figures/13_partitioning.png") ``` -## Spatial CV with **mlr** -\index{mlr (package)} +## Spatial CV with **mlr3** +\index{mlr3 (package)} There are dozens of packages for statistical learning\index{statistical learning}, as described for example in the [CRAN machine learning task view](https://CRAN.R-project.org/view=MachineLearning). Getting acquainted with each of these packages, including how to undertake cross-validation and hyperparameter\index{hyperparameter} tuning, can be a time-consuming process. Comparing model results from different packages can be even more laborious. -The **mlr** package was developed to address these issues. -It acts as a 'meta-package', providing a unified interface to popular supervised and unsupervised statistical learning techniques including classification, regression\index{regression}, survival analysis and clustering\index{clustering} [@bischl_mlr:_2016]. - - - -The standardized **mlr** interface is based on eight 'building blocks'. +The **mlr3** package and ecosystem was developed to address these issues. +It acts as a 'meta-package', providing a unified interface to popular supervised and unsupervised statistical learning techniques including classification, regression\index{regression}, survival analysis and clustering\index{clustering} [@lang_mlr3_2019]. +The standardized **mlr3** interface is based on eight 'building blocks'. As illustrated in Figure \@ref(fig:building-blocks), these have a clear order. -```{r building-blocks, echo=FALSE, fig.height=4, fig.width=4, fig.cap="Basic building blocks of the mlr package. Source: http://bit.ly/2tcb2b7. (Permission to reuse this figure was kindly granted.)", fig.scap="Basic building blocks of the mlr package."} +```{r building-blocks, echo=FALSE, fig.height=4, fig.width=4, fig.cap="Basic building blocks of the mlr3 package. Source: @becker_mlr3_2021 (Permission to reuse this figure was kindly granted.)", fig.scap="Basic building blocks of the mlr3 package."} knitr::include_graphics("figures/13_ml_abstraction_crop.png") ``` -The **mlr** modeling process consists of three main stages. +The **mlr3** modeling process consists of three main stages. First, a **task** specifies the data (including response and predictor variables) and the model type (such as regression\index{regression} or classification\index{classification}). Second, a **learner** defines the specific learning algorithm that is applied to the created task. Third, the **resampling** approach assesses the predictive performance of the model, i.e., its ability to generalize to new data (see also Section \@ref(intro-cv)). ### Generalized linear model {#glm} -To implement a GLM\index{GLM} in **mlr**\index{mlr (package)}, we must create a **task** containing the landslide data. -Since the response is binary (two-category variable), we create a classification\index{classification} task with `makeClassifTask()` (for regression\index{regression} tasks, use `makeRegrTask()`, see `?makeRegrTask` for other task types). -The first essential argument of these `make*()` functions is `data`. -The `target` argument expects the name of a response variable and `positive` determines which of the two factor levels of the response variable indicate the landslide initiation point (in our case this is `TRUE`). -All other variables of the `lsl` dataset will serve as predictors except for the coordinates (see the result of `getTaskFormula(task)` for the model formula). -For spatial CV, the `coordinates` parameter is used (see Section \@ref(intro-cv) and Figure \@ref(fig:partitioning)) which expects the coordinates as a xy data frame. - -```{r 11-spatial-cv-11, eval=FALSE} -library(mlr) -# coordinates needed for the spatial partitioning -coords = lsl[, c("x", "y")] -# select response and predictors to use in the modeling -data = dplyr::select(lsl, -x, -y) +To implement a GLM\index{GLM} in **mlr3**\index{mlr3 (package)}, we must create a **task** containing the landslide data. +Since the response is binary (two-category variable) and has a spatial dimension, we create a classification\index{classification} task with `mlr3spatiotempcv::TaskClassifST$new()` (for non-spatial tasks, use `TaskClassif$new()` or `TaskRegr$new()` for regression\index{regression} tasks, see `?Task` for other task types).^[The **mlr3** ecosystem makes heavily use of **data.table** and **R6** classes. And though you might use **mlr3** without knowing the specifics of **data.table** or **R6**, it might be rather helpful. To learn more about **data.table**, please refer to https://rdatatable.gitlab.io/data.table/index.html. To learn more about **R6**, we recommend [Chapter 14](https://adv-r.hadley.nz/fp.html) of the Advanced R book [@wickham_advanced_2019].] +The first essential argument of these `Task*$new()` functions is `backend`. +`backend` expects the data to be used for the modeling including the response and predictor variables. +The `target` argument indicates the name of a response variable (in our case this is `lslpts`) and `positive` determines which of the two factor levels of the response variable indicate the landslide initiation point (in our case this is `TRUE`). +All other variables of the `lsl` dataset will serve as predictors. +For spatial CV, we need to provide a few extra arguments (`extra_args`). +The `coordinate_names` argument expects the names of the coordinate columns (see Section \@ref(intro-cv) and Figure \@ref(fig:partitioning)). +Additionally, one should indicate the used CRS (`crs`) and if one wishes to use the coordinates as predictors in the modeling (`coords_as_features`). + +```{r 12-spatial-cv-11, eval=FALSE} # create task -task = makeClassifTask(data = data, target = "lslpts", - positive = "TRUE", coordinates = coords) +task = mlr3spatiotempcv::TaskClassifST$new( + id = "ecuador_lsl", + backend = mlr3::as_data_backend(lsl), + target = "lslpts", + positive = "TRUE", + extra_args = list( + coordinate_names = c("x", "y"), + coords_as_features = FALSE, + crs = 32717) + ) ``` -`makeLearner()` determines the statistical learning\index{statistical learning} method to use. -All classification\index{classification} **learners** start with `classif.` and all regression\index{regression} learners with `regr.` (see `?makeLearners` for details). -`listLearners()` helps to find out about all available learners and from which package **mlr** imports them (Table \@ref(tab:lrns)). -For a specific task, we can run: +Please note that `TaskClassifST$new()` also accepts an `sf`-object as input for the `backend` parameter. +In this case, you might only want to specify the `coords_as_features` argument of the `extra_args` list. +We did not convert `lsl` into an `sf`-object because `TaskClassifST$new()` just converts it back into a non-spatial `data.table` object in the background. +For a short data exploration, the `autoplot()` function of the **mlr3viz** package might come in handy since it plots the response against all predictors and all predictors against all predictors (not shown). -```{r 11-spatial-cv-12, eval=FALSE} -listLearners(task, warn.missing.packages = FALSE) %>% - dplyr::select(class, name, short.name, package) %>% +```{r autoplot, eval=FALSE} +library(mlr3viz) +# plot response against each predictor +mlr3viz::autoplot(task, type = "duo") +# plot all variables against each other +mlr3viz::autoplot(task, type = "pairs") +``` + +Having created a task, we need to choose a **learner** that determines the statistical learning\index{statistical learning} method to use. +All classification\index{classification} **learners** start with `classif.` and all regression\index{regression} learners with `regr.` (see `?Learner` for details). +`mlr3extralearners::list_mlr3learners()` lists all available learners and from which package **mlr3** imports them (Table \@ref(tab:lrns)). +To find out about learners that are able to model a binary response variable, we can run: + +```{r 12-spatial-cv-12, eval=FALSE} +library(mlr3extralearners) +mlr3extralearners::list_mlr3learners( + filter = list(class = "classif", properties = "twoclass"), + select = c("id", "mlr3_package", "required_packages")) |> head() ``` ```{r lrns, echo=FALSE} -# lrns_df = -# listLearners(task, warn.missing.packages = FALSE) %>% -# dplyr::select(Class = class, Name = name, `Short name` = short.name, Package = package) %>% -# head() +lrns_df = mlr3extralearners::list_mlr3learners( + filter = list(class = "classif", properties = "twoclass"), + select = c("id", "mlr3_package", "required_packages")) |> + head() # dput(lrns_df) -lrns_df = structure(list(Class = c("classif.adaboostm1", "classif.binomial", -"classif.featureless", "classif.fnn", "classif.gausspr", "classif.IBk" -), Name = c("ada Boosting M1", "Binomial Regression", "Featureless classifier", -"Fast k-Nearest Neighbour", "Gaussian Processes", "k-Nearest Neighbours" -), `Short name` = c("adaboostm1", "binomial", "featureless", -"fnn", "gausspr", "ibk"), Package = c("RWeka", "stats", "mlr", -"FNN", "kernlab", "RWeka")), row.names = c(NA, 6L), class = "data.frame") +# lrns_df = structure(list(Class = c("classif.adaboostm1", "classif.binomial", +# "classif.featureless", "classif.fnn", "classif.gausspr", "classif.IBk" +# ), Name = c("ada Boosting M1", "Binomial Regression", "Featureless classifier", +# "Fast k-Nearest Neighbour", "Gaussian Processes", "k-Nearest Neighbours" +# ), `Short name` = c("adaboostm1", "binomial", "featureless", +# "fnn", "gausspr", "ibk"), Package = c("RWeka", "stats", "mlr", +# "FNN", "kernlab", "RWeka")), row.names = c(NA, 6L), class = "data.frame") knitr::kable(lrns_df, caption = paste("Sample of available learners for binomial", "tasks in the mlr package."), @@ -388,111 +385,107 @@ knitr::kable(lrns_df, ``` This yields all learners able to model two-class problems (landslide yes or no). -We opt for the binomial classification\index{classification} method used in Section \@ref(conventional-model) and implemented as `classif.binomial` in **mlr**. -Additionally, we must specify the link-function, `logit` in this case, which is also the default of the `binomial()` function. -`predict.type` determines the type of the prediction with `prob` resulting in the predicted probability for landslide occurrence between 0 and 1 (this corresponds to `type = response` in `predict.glm`). +We opt for the binomial classification\index{classification} method used in Section \@ref(conventional-model) and implemented as `classif.log_reg` in **mlr3learners**. +Additionally, we need to specify the `predict.type` which determines the type of the prediction with `prob` resulting in the predicted probability for landslide occurrence between 0 and 1 (this corresponds to `type = response` in `predict.glm`). -```{r 11-spatial-cv-13, eval=FALSE} -lrn = makeLearner(cl = "classif.binomial", - link = "logit", - predict.type = "prob", - fix.factors.prediction = TRUE) +```{r 12-spatial-cv-13, eval=FALSE} +library(mlr3learners) +learner = mlr3::lrn("classif.log_reg", predict_type = "prob") ``` -To find out from which package the specified learner is taken and how to access the corresponding help pages, we can run: +To access the help page of the learner and find out from which package it was taken, we can run: -```{r 11-spatial-cv-14, eval=FALSE} -getLearnerPackages(lrn) -helpLearner(lrn) +```{r 12-spatial-cv-14, eval=FALSE} +learner$help() ``` -The set-up steps for modeling with **mlr**\index{mlr (package)} may seem tedious. -But remember, this single interface provides access to the 150+ learners shown by `listLearners()`; it would be far more tedious to learn the interface for each learner! +The set-up steps for modeling with **mlr3**\index{mlr3 (package)} may seem tedious. +But remember, this single interface provides access to the 130+ learners shown by `mlr3extralearners::list_mlr3learners()`; it would be far more tedious to learn the interface for each learner! Further advantages are simple parallelization of resampling techniques and the ability to tune machine learning hyperparameters\index{hyperparameter} (see Section \@ref(svm)). -Most importantly, (spatial) resampling in **mlr** is straightforward, requiring only two more steps: specifying a resampling method and running it. +Most importantly, (spatial) resampling in **mlr3spatiotempcv** is straightforward, requiring only two more steps: specifying a resampling method and running it. We will use a 100-repeated 5-fold spatial CV\index{cross-validation!spatial CV}: five partitions will be chosen based on the provided coordinates in our `task` and the partitioning will be repeated 100 times:[^13] [^13]: Note that package **sperrorest** initially implemented spatial cross-validation in R [@brenning_spatial_2012]. - In the meantime, its functionality was integrated into the **mlr** package which is the reason why we are using **mlr** [@schratz_performance_nodate].The **caret** package is another umbrella-package [@kuhn_applied_2013] for streamlined modeling in R; however, so far it does not provide spatial CV which is why we refrain from using it for spatial data. + In the meantime, its functionality was integrated into the **mlr3** ecosystem which is the reason why we are using **mlr3** [@schratz_hyperparameter_2019]. The **tidymodels** framework is another umbrella-package for streamlined modeling in R; however, it only recently integrated support for spatial cross validation via **spatialsample** which so far only supports one spatial resampling method. -```{r 11-spatial-cv-18, eval=FALSE} -perf_level = makeResampleDesc(method = "SpRepCV", folds = 5, reps = 100) +```{r 12-spatial-cv-18, eval=FALSE} +perf_level = mlr3::rsmp("repeated_spcv_coords", folds = 5, repeats = 100) ``` -To execute the spatial resampling, we run `resample()` using the specified learner, task, resampling strategy and of course the performance measure, here the AUROC\index{AUROC}. -This takes some time (around 10 seconds on a modern laptop) because it computes the AUROC for 500 models. +To execute the spatial resampling, we run `resample()` using the previously specified task, learner, and resampling strategy. +This takes some time (around 15 seconds on a modern laptop) because it computes the performance measure for 500 models. Setting a seed ensures the reproducibility\index{reproducibility} of the obtained result and will ensure the same spatial partitioning when re-running the code. - - - -```{r 11-spatial-cv-19, eval=FALSE} +```{r 12-spatial-cv-19, eval=FALSE} set.seed(012348) -sp_cv = mlr::resample(learner = lrn, task = task, - resampling = perf_level, - measures = mlr::auc) +# reduce verbosity +lgr::get_logger("mlr3")$set_threshold("warn") +# run spatial cross-validation +sp_cv = mlr3::resample(task = task, + learner = learner, + resampling = perf_level) ``` -```{r 11-spatial-cv-20, eval=FALSE, echo=FALSE} -set.seed(012348) -sp_cv = mlr::resample(learner = lrn, task = task, - resampling = perf_level, - measures = mlr::auc) -``` - - +The output of the preceding code chunk is a bias-reduced assessment of the model's predictive performance, as illustrated in the following code chunk (required input data is saved in the file `12-sp_cv.rds` in the book's GitHub repo). +As performance measure, we again choose the AUROC. -The output of the preceding code chunk is a bias-reduced assessment of the model's predictive performance, as illustrated in the following code chunk (required input data is saved in the file `spatialcv.Rdata` in the book's GitHub repo): - -```{r 11-spatial-cv-21, echo=FALSE} -sp_cv = readRDS("extdata/sp_cv.rds") -conv_cv = readRDS("extdata/conv_cv.rds") +```{r 12-spatial-cv-21, echo=FALSE} +# bm_agg = readRDS("extdata/12-sp_conv_cv.rds") +# sp_cv = bm_agg$resample_result[[1]] +score = readRDS("extdata/12-glm_score_sp_nsp.rds") ``` -```{r 11-spatial-cv-22} +```{r 12-spatial-cv-22} +score[, .(mean_auc = mean(classif.auc)), by = resampling_id] # summary statistics of the 500 models -summary(sp_cv$measures.test$auc) +# sp_cv$score(measures = mlr3::msr("classif.auc"))[, summary(classif.auc)] # mean AUROC of the 500 models -mean(sp_cv$measures.test$auc) +# sp_cv$aggregate(measures = mlr3::msr("classif.auc")) ``` To put these results in perspective, let us compare them with AUROC\index{AUROC} values from a 100-repeated 5-fold non-spatial cross-validation (Figure \@ref(fig:boxplot-cv); the code for the non-spatial cross-validation\index{cross-validation} is not shown here but will be explored in the exercise section). As expected, the spatially cross-validated result yields lower AUROC values on average than the conventional cross-validation approach, underlining the over-optimistic predictive performance due to spatial autocorrelation\index{autocorrelation!spatial} of the latter. ```{r boxplot-cv, echo=FALSE, fig.width=6, fig.height=9, fig.cap="Boxplot showing the difference in AUROC values between spatial and conventional 100-repeated 5-fold cross-validation.", fig.scap="Boxplot showing AUROC values."} -# Visualization of non-spatial overfitting -boxplot(sp_cv$measures.test$auc, - conv_cv$measures.test$auc, - col = c("lightblue2", "mistyrose2"), - names = c("spatial CV", "conventional CV"), - ylab = "AUROC") +library(ggplot2) +# extract the AUROC values and put them into one data.table +# d = purrr::map_dfr(bm_agg$resample_result, ~ .$score(mlr3::msr("classif.auc"))) +d = score +# rename the levels of resampling_id +d[, resampling_id := as.factor(resampling_id) |> + forcats::fct_recode("conventional CV" = "repeated_cv", + "spatial CV" = "repeated_spcv_coords") |> + forcats::fct_rev()] +# create the boxplot +ggplot2::ggplot(data = d, + mapping = ggplot2::aes(x = resampling_id, y = classif.auc)) + + ggplot2::geom_boxplot(fill = c("lightblue2", "mistyrose2")) + + ggplot2::theme_bw() + + ggplot2::labs(y = "AUROC", x = "") ``` ### Spatial tuning of machine-learning hyperparameters {#svm} @@ -521,27 +514,21 @@ This is called hyperparameter tuning. Some SVM implementations such as that provided by **kernlab** allow hyperparameters to be tuned automatically, usually based on random sampling (see upper row of Figure \@ref(fig:partitioning)). This works for non-spatial data but is of less use for spatial data where 'spatial tuning' should be undertaken. -Before defining spatial tuning, we will set up the **mlr**\index{mlr (package)} building blocks, introduced in Section \@ref(glm), for the SVM. +Before defining spatial tuning, we will set up the **mlr3**\index{mlr3 (package)} building blocks, introduced in Section \@ref(glm), for the SVM. The classification\index{classification} task remains the same, hence we can simply reuse the `task` object created in Section \@ref(glm). Learners implementing SVM can be found using `listLearners()` as follows: -```{r 11-spatial-cv-23, eval=FALSE} -lrns = listLearners(task, warn.missing.packages = FALSE) -filter(lrns, grepl("svm", class)) %>% - dplyr::select(class, name, short.name, package) -#> class name short.name package -#> 6 classif.ksvm Support Vector Machines ksvm kernlab -#> 9 classif.lssvm Least Squares Support Vector Machine lssvm kernlab -#> 17 classif.svm Support Vector Machines (libsvm) svm e1071 +```{r 12-spatial-cv-23} +mlr3extralearners::list_mlr3learners() %>% + .[class == "classif" & grepl("svm", id), + .(id, class, mlr3_package, required_packages)] ``` Of the options illustrated above, we will use `ksvm()` from the **kernlab** package [@karatzoglou_kernlab_2004]. To allow for non-linear relationships, we use the popular radial basis function (or Gaussian) kernel which is also the default of `ksvm()`. -```{r 11-spatial-cv-24, eval=FALSE} -lrn_ksvm = makeLearner("classif.ksvm", - predict.type = "prob", - kernel = "rbfdot") +```{r 12-spatial-cv-24, eval=FALSE} +lrn_ksvm = mlr3::lrn("classif.ksvm", predict_type = "prob", kernel = "rbfdot") ``` The next stage is to specify a resampling strategy. @@ -550,9 +537,9 @@ Again we will use a 100-repeated 5-fold spatial CV\index{cross-validation!spatia -```{r 11-spatial-cv-25, eval=FALSE} +```{r 12-spatial-cv-25, eval=FALSE} # performance estimation level -perf_level = makeResampleDesc(method = "SpRepCV", folds = 5, reps = 100) +perf_level = mlr3::rsmp("repeated_spcv_coords", folds = 5, repeats = 100) ``` Note that this is the exact same code as used for the GLM\index{GLM} in Section \@ref(glm); we have simply repeated it here as a reminder. @@ -562,14 +549,14 @@ The next step is new, however: to tune the hyperparameters\index{hyperparameter} Using the same data for the performance assessment and the tuning would potentially lead to overoptimistic results [@cawley_overfitting_2010]. This can be avoided using nested spatial CV\index{cross-validation!spatial CV}. -```{r inner-outer, echo=FALSE, fig.cap="Schematic of hyperparameter tuning and performance estimation levels in CV. (Figure was taken from Schratz et al. (2018). Permission to reuse it was kindly granted.)", fig.scap="Schematic of hyperparameter tuning."} +```{r inner-outer, echo=FALSE, fig.cap="Schematic of hyperparameter tuning and performance estimation levels in CV. (Figure was taken from Schratz et al. (2019). Permission to reuse it was kindly granted.)", fig.scap="Schematic of hyperparameter tuning."} knitr::include_graphics("figures/13_cv.png") ``` This means that we split each fold again into five spatially disjoint subfolds which are used to determine the optimal hyperparameters\index{hyperparameter} (`tune_level` object in the code chunk below; see Figure \@ref(fig:inner-outer) for a visual representation). -To find the optimal hyperparameter combination, we fit 50 models (`ctrl` object in the code chunk below) in each of these subfolds with randomly selected values for the hyperparameters C and Sigma. +To find the optimal hyperparameter combination, we fit 50 models (`terminator` parameter in the code chunk below) in each of these subfolds with randomly selected values for the hyperparameters C and Sigma. The random selection of values C and Sigma is additionally restricted to a predefined tuning space (`ps` object). -The range of the tuning space was chosen with values recommended in the literature [@schratz_performance_nodate]. +The range of the tuning space was chosen with values recommended in the literature [@schratz_hyperparameter_2019]. -```{r 11-spatial-cv-26, eval=FALSE} +```{r 12-spatial-cv-26, eval=FALSE} +library("mlr3tuning") # five spatially disjoint partitions -tune_level = makeResampleDesc("SpCV", iters = 5) +tune_level = mlr3::rsmp("spcv_coords", folds = 5) # use 50 randomly selected hyperparameters -ctrl = makeTuneControlRandom(maxit = 50) +terminator = mlr3tuning::trm("evals", n_evals = 50) +tuner = mlr3tuning::tnr("random_search") # define the outer limits of the randomly selected hyperparameters -ps = makeParamSet( - makeNumericParam("C", lower = -12, upper = 15, trafo = function(x) 2^x), - makeNumericParam("sigma", lower = -15, upper = 6, trafo = function(x) 2^x) - ) +ps = paradox::ps( + C = paradox::p_dbl(lower = -12, upper = 15, trafo = function(x) 2^x), + sigma = paradox::p_dbl(lower = -15, upper = 6, trafo = function(x) 2^x) +) ``` -The next stage is to modify the learner `lrn_ksvm` in accordance with all the characteristics defining the hyperparameter tuning with `makeTuneWrapper()`. +The next stage is to modify the learner `lrn_ksvm` in accordance with all the characteristics defining the hyperparameter tuning with `AutoTuner$new()`. -```{r 11-spatial-cv-27, eval=FALSE} -wrapped_lrn_ksvm = makeTuneWrapper(learner = lrn_ksvm, - resampling = tune_level, - par.set = ps, - control = ctrl, - show.info = TRUE, - measures = mlr::auc) +```{r 12-spatial-cv-27, eval=FALSE} +at_ksvm = mlr3tuning::AutoTuner$new( + learner = lrn_ksvm, + resampling = tune_level, + measure = mlr3::msr("classif.auc"), + search_space = ps, + terminator = terminator, + tuner = tuner +) ``` -The **mlr** is now set-up to fit 250 models to determine optimal hyperparameters for one fold. +The tuning is now set-up to fit 250 models to determine optimal hyperparameters for one fold. Repeating this for each fold, we end up with 1250 (250 \* 5) models for each repetition. Repeated 100 times means fitting a total of 125,000 models to identify optimal hyperparameters (Figure \@ref(fig:partitioning)). These are used in the performance estimation, which requires the fitting of another 500 models (5 folds \* 100 repetitions; see Figure \@ref(fig:partitioning)). @@ -624,100 +615,76 @@ Use the 50 randomly selected hyperparameters\index{hyperparameter} in each of th 1. Repeat steps 2 to 4, 100 times. The process of hyperparameter tuning and performance estimation is computationally intensive. -Model runtime can be reduced with parallelization, which can be done in a number of ways, depending on the operating system. - - -Before starting the parallelization\index{parallelization}, we ensure that the processing continues even if one of the models throws an error by setting `on.learner.error` to `warn`. -This avoids the process stopping just because of one failed model, which is desirable on large model runs. -To inspect the failed models once the processing is completed, we dump them: +To decrease model runtime, **mlr3** offers the possibility to use parallelization\index{parallelization} with the help of the **future** package. +Since we are about to run a nested cross-validation, we can decide if we would like to parallelize the inner or the outer loop (see lower left part of Figure \@ref(fig:inner-outer)). +Since the former will run 125,000 models, whereas the latter only runs 500, it is quite obvious that we should parallelize the inner loop. +To set up the parallelization of the inner loop, we run: -```{r 11-spatial-cv-28, eval=FALSE} -configureMlr(on.learner.error = "warn", on.error.dump = TRUE) +```{r future, eval=FALSE} +library(future) +# execute the outer loop sequentially and parallelize the inner loop +future::plan(list("sequential", "multisession"), + workers = floor(availableCores() / 2)) ``` -To start the parallelization\index{parallelization}, we set the `mode` to `multicore` which will use `mclapply()` in the background on a single machine in the case of a Unix-based operating system.^[ -See `?parallelStart` for further modes and github.com/berndbischl/parallelMap for more on the unified interface to popular parallelization back-ends. -] -Equivalenty, `parallelStartSocket()` enables parallelization under Windows. -`level` defines the level at which to enable parallelization, with `mlr.tuneParams` determining that the hyperparameter tuning level should be parallelized (see lower left part of Figure \@ref(fig:inner-outer), `?parallelGetRegisteredLevels`, and the **mlr** [parallelization tutorial](https://mlr-org.github.io/mlr-tutorial/release/html/parallelization/index.html#parallelization-levels) for details). -We will use half of the available cores (set with the `cpus` parameter), a setting that allows possible other users to work on the same high performance computing cluster in case one is used (which was the case when we ran the code). - -Setting `mc.set.seed` to `TRUE` ensures that the randomly chosen hyperparameters\index{hyperparameter} during the tuning can be reproduced when running the code again. -Unfortunately, `mc.set.seed` is only available under Unix-based systems. - -```{r 11-spatial-cv-29, eval=FALSE} -library(parallelMap) -if (Sys.info()["sysname"] %in% c("Linux", "Darwin")) { -parallelStart(mode = "multicore", - # parallelize the hyperparameter tuning level - level = "mlr.tuneParams", - # just use half of the available cores - cpus = round(parallel::detectCores() / 2), - mc.set.seed = TRUE) -} - -if (Sys.info()["sysname"] == "Windows") { - parallelStartSocket(level = "mlr.tuneParams", - cpus = round(parallel::detectCores() / 2)) -} -``` +Additionally, we instructed **future** to only use half instead of all available cores (default), a setting that allows possible other users to work on the same high performance computing cluster in case one is used. Now we are set up for computing the nested spatial CV. Using a seed allows us to recreate the exact same spatial partitions when re-running the code. -Specifying the `resample()` parameters follows the exact same procedure as presented when using a GLM\index{GLM}, the only difference being the `extract` argument. -This allows the extraction of the hyperparameter\index{hyperparameter} tuning results which is important if we plan follow-up analyses on the tuning. -After the processing, it is good practice to explicitly stop the parallelization\index{parallelization} with `parallelStop()`. -Finally, we save the output object (`result`) to disk in case we would like to use it another R session. -Before running the subsequent code, be aware that it is time-consuming: -the 125,500 models took ~1/2hr on a server using 24 cores (see below). - -```{r 11-spatial-cv-30, eval=FALSE} +Specifying the `resample()` parameters follows the exact same procedure as presented when using a GLM\index{GLM}, the only difference being the `store_models` and `encapsulate` arguments. +The former allows the extraction of the hyperparameter\index{hyperparameter} tuning results which is important if we plan follow-up analyses on the tuning. +The latter ensures that the processing continues even if one of the models throws an error. +This avoids the process stopping just because of one failed model, which is desirable on large model runs. +Once the processing is completed, one can have a look at the failed models. +After the processing, it is good practice to explicitly stop the parallelization\index{parallelization} with `future:::ClusterRegistry("stop")`. +Finally, we save the output object (`result`) to disk in case we would like to use it in another R session. +Before running the subsequent code, be aware that it is time-consuming since it will run the spatial cross-validation with 125,500 models. +Note that runtime depends on many aspects: CPU speed, the selected algorithm, the selected number of cores and the dataset. + +```{r 12-spatial-cv-30, eval=FALSE} set.seed(12345) -result = mlr::resample(learner = wrapped_lrn_ksvm, - task = task, - resampling = perf_level, - extract = getTuneResult, - measures = mlr::auc) +progressr::with_progress(expr = { + result = mlr3::resample(task = task, + learner = at_ksvm, + # outer resampling (performance level) + resampling = perf_level, + store_models = TRUE, + encapsulate = "evaluate") +}) + # stop parallelization -parallelStop() +future:::ClusterRegistry("stop") # save your result, e.g.: # saveRDS(result, "svm_sp_sp_rbf_50it.rds") ``` -In case you do not want to run the code locally, we have saved a subset of the [results](https://github.com/Robinlovelace/geocompr/blob/main/extdata/spatial_cv_result.rds) in the book's GitHub repo. +In case you do not want to run the code locally, we have saved a subset of the [results](https://github.com/Robinlovelace/geocompr/blob/main/extdata/12-spatial_cv_result.rds) in the book's GitHub repo. They can be loaded as follows: -```{r 11-spatial-cv-31} -result = readRDS("extdata/spatial_cv_result.rds") +```{r 12-spatial-cv-31} +result = readRDS("extdata/12-spatial_cv_result.rds") ``` -Note that runtime depends on many aspects: CPU speed, the selected algorithm, the selected number of cores and the dataset. - -```{r 11-spatial-cv-32} -# Exploring the results -# runtime in minutes -round(result$runtime / 60, 2) -``` +Let us have a look at the final AUROC\index{AUROC}: the model's ability to discriminate the two classes. - - -Even more important than the runtime is the final aggregated AUROC\index{AUROC}: the model's ability to discriminate the two classes. - -```{r 11-spatial-cv-33} +```{r 12-spatial-cv-33} # final aggregated AUROC result$aggr # same as mean(result$measures.test$auc) ``` -It appears that the GLM\index{GLM} (aggregated AUROC\index{AUROC} was `r sp_cv$aggr`) is slightly better than the SVM\index{SVM} in this specific case. -However, using more than 50 iterations in the random search would probably yield hyperparameters\index{hyperparameter} that result in models with a better AUROC [@schratz_performance_nodate]. +It appears that the GLM\index{GLM} (aggregated AUROC\index{AUROC} was `r score[resampling_id == "repeated_spcv_coords", round(mean(classif.auc), 2)]`) is slightly better than the SVM\index{SVM} in this specific case. +To guarantee an absolute fair comparison, one should also make sure that the two models use the exact same partitions something we haven't done here. +To do so, **mlr3** offers the functions `benchmark()` and `benchmark_grid()` (see also https://mlr3book.mlr-org.com/perf-eval-cmp.html#benchmarking; @becker_mlr3_2021). +We will explore these functions in more detail in the Exercises. +Please note also that using more than 50 iterations in the random search of the SVM would probably yield hyperparameters\index{hyperparameter} that result in models with a better AUROC [@schratz_hyperparameter_2019]. On the other hand, increasing the number of random search iterations would also increase the total number of models and thus runtime. The estimated optimal hyperparameters for each fold at the performance estimation level can also be viewed. The following command shows the best hyperparameter combination of the first fold of the first iteration (recall this results from the first 5 \* 50 model runs): -```{r 11-spatial-cv-34} +```{r 12-spatial-cv-34} # winning hyperparameters of tuning step, # i.e. the best combination out of 50 * 5 models result$extract[[1]]$x @@ -725,7 +692,7 @@ result$extract[[1]]$x The estimated hyperparameters have been used for the first fold in the first iteration of the performance estimation level which resulted in the following AUROC\index{AUROC} value: -```{r 11-spatial-cv-35} +```{r 12-spatial-cv-35} result$measures.test[1, ] ``` @@ -742,37 +709,21 @@ This chapter used cross-validation\index{cross-validation} to assess predictive As described in Section \@ref(intro-cv), observations with spatial coordinates may not be statistically independent due to spatial autocorrelation\index{autocorrelation!spatial}, violating a fundamental assumption of cross-validation. Spatial CV\index{cross-validation!spatial CV} addresses this issue by reducing bias introduced by spatial autocorrelation\index{autocorrelation!spatial}. -The **mlr**\index{mlr (package)} package facilitates (spatial) resampling\index{resampling} techniques in combination with the most popular statistical learning\index{statistical learning} techniques including linear regression\index{regression!linear}, semi-parametric models such as generalized additive models\index{generalized additive model} and machine learning\index{machine learning} techniques such as random forests\index{random forest}, SVMs\index{SVM}, and boosted regression trees [@bischl_mlr:_2016;@schratz_performance_nodate]. +The **mlr3**\index{mlr3 (package)} package facilitates (spatial) resampling\index{resampling} techniques in combination with the most popular statistical learning\index{statistical learning} techniques including linear regression\index{regression!linear}, semi-parametric models such as generalized additive models\index{generalized additive model} and machine learning\index{machine learning} techniques such as random forests\index{random forest}, SVMs\index{SVM}, and boosted regression trees [@bischl_mlr:_2016;@schratz_hyperparameter_2019]. Machine learning algorithms often require hyperparameter\index{hyperparameter} inputs, the optimal 'tuning' of which can require thousands of model runs which require large computational resources, consuming much time, RAM and/or cores. -**mlr** tackles this issue by enabling parallelization\index{parallelization}. +**mlr3** tackles this issue by enabling parallelization\index{parallelization}. Machine learning overall, and its use to understand spatial data, is a large field and this chapter has provided the basics, but there is more to learn. We recommend the following resources in this direction: -- The **mlr** tutorials on [Machine Learning in R](https://mlr-org.github.io/mlr-tutorial/release/html/) and [Handling of spatial Data](https://mlr-org.github.io/mlr-tutorial/release/html/handling_of_spatial_data/index.html) -- An academic paper on hyperparameter\index{hyperparameter} tuning [@schratz_performance_nodate] +- The **mlr3 book** [@becker_mlr3_2021; https://mlr-org.github.io/mlr-tutorial/release/html/)] and especially the [chapter on the handling of spatio-temporal data](https://mlr3book.mlr-org.com/spatiotemporal.html). +- An academic paper on hyperparameter\index{hyperparameter} tuning [@schratz_hyperparameter_2019] +- An academic paper on how to use **mlr3spatiotempcv** [@schratz_mlr3spatiotempcv_2021] - In case of spatio-temporal data, one should account for spatial\index{autocorrelation!spatial} and temporal\index{autocorrelation!temporal} autocorrelation when doing CV\index{cross-validation} [@meyer_improving_2018] ## Exercises -1. Compute the following terrain attributes from the `dem` datasets loaded with `data("landslides", package = "RSAGA")` with the help of R-GIS bridges (see Chapter \@ref(gis)): - - Slope - - Plan curvature - - Profile curvature - - Catchment area -1. Extract the values from the corresponding output rasters to the `landslides` data frame (`data(landslides, package = "RSAGA"`) by adding new variables called `slope`, `cplan`, `cprof`, `elev` and `log_carea`. Keep all landslide initiation points and 175 randomly selected non-landslide points (see Section \@ref(case-landslide) for details). -1. Use the derived terrain attribute rasters in combination with a GLM to make a spatial prediction map similar to that shown in Figure \@ref(fig:lsl-susc). -Running `data("study_mask", package = "spDataLarge")` attaches a mask of the study area. -1. Compute a 100-repeated 5-fold non-spatial cross-validation and spatial CV based on the GLM learner and compare the AUROC values from both resampling\index{resampling} strategies with the help of boxplots (see Figure \@ref(fig:boxplot-cv)). -Hint: You need to specify a non-spatial task and a non-spatial resampling\index{resampling} strategy. - - -1. Model landslide susceptibility using a quadratic discriminant analysis [QDA, @james_introduction_2013]. -Assess the predictive performance (AUROC) of the QDA. -What is the difference between the spatially cross-validated mean AUROC value of the QDA and the GLM? - -Hint: Before running the spatial cross-validation for both learners, set a seed to make sure that both use the same spatial partitions which in turn guarantees comparability. -1. Run the SVM without tuning the hyperparameters. -Use the `rbfdot` kernel with $\sigma$ = 1 and *C* = 1. -Leaving the hyperparameters unspecified in **kernlab**'s `ksvm()` would otherwise initialize an automatic non-spatial hyperparameter tuning. -For a discussion on the need for (spatial) tuning of hyperparameters, please refer to @schratz_performance_nodate. +```{r, echo=FALSE, results='asis'} +res = knitr::knit_child('_12-ex.Rmd', quiet = TRUE, options = list(include = FALSE, eval = FALSE)) +cat(res, sep = '\n') +``` diff --git a/15-eco.Rmd b/15-eco.Rmd index 09ca4066d..817d65c86 100644 --- a/15-eco.Rmd +++ b/15-eco.Rmd @@ -7,12 +7,19 @@ In it you will also make use of R's\index{R} interfaces to dedicated GIS\index{G The chapter uses the following packages: -```{r 14-eco-1, message=FALSE, eval=FALSE} -library(sf) -library(raster) -# library(RQGIS) -library(mlr) +```{r 15-eco-1, message=FALSE, eval=FALSE} +library(data.table) library(dplyr) +library(mlr3) +library(mlr3spatiotempcv) +library(mlr3tuning) +library(mlr3learners) +library(qgisprocess) +library(paradox) +library(ranger) +library(sf) +library(terra) +library(tree) library(vegan) ``` @@ -21,7 +28,7 @@ library(vegan) In this chapter we will model the floristic gradient of fog oases to reveal distinctive vegetation belts that are clearly controlled by water availability. To do so, we will bring together concepts presented in previous chapters and even extend them (Chapters \@ref(spatial-class) to \@ref(geometric-operations) and Chapters \@ref(gis) and \@ref(spatial-cv)). -Fog oases are one of the most fascinating vegetation formations we have ever encountered. +Fog oases are one of the most fascinating vegetation formations we have ever encountered. These formations, locally termed *lomas*, develop on mountains along the coastal deserts of Peru and Chile.^[Similar vegetation formations develop also in other parts of the world, e.g., in Namibia and along the coasts of Yemen and Oman [@galletti_land_2016].] The deserts' extreme conditions and remoteness provide the habitat for a unique ecosystem, including species endemic to the fog oases. Despite the arid conditions and low levels of precipitation of around 30-50 mm per year on average, fog deposition increases the amount of water available to plants during austal winter. @@ -40,7 +47,7 @@ In this chapter we will demonstrate ecological applications of some of the techn This case study will involve analyzing the composition and the spatial distribution of the vascular plants on the southern slope of Mt. Mongón, a *lomas* mountain near Casma on the central northern coast of Peru (Figure \@ref(fig:study-area-mongon)). ```{r study-area-mongon, echo=FALSE, fig.cap="The Mt. Mongón study area, from Muenchow, Schratz, and Brenning (2017).", out.width="60%", fig.scap="The Mt. Mongón study area."} -knitr::include_graphics("figures/study-area-mongon.png") +knitr::include_graphics("figures/15_study_area_mongon.png") # knitr::include_graphics("https://user-images.githubusercontent.com/1825120/38989956-6eae7c9a-43d0-11e8-8f25-3dd3594f7e74.png") ``` @@ -63,15 +70,19 @@ To guarantee an optimal prediction, it is advisable to tune beforehand the hyper All the data needed for the subsequent analyses is available via the **spDataLarge** package. -```{r 14-eco-2, eval=FALSE} -data("study_area", "random_points", "comm", "dem", "ndvi", package = "spDataLarge") +```{r 15-eco-2, eval=FALSE} +# spatial vector objects +data("study_area", "random_points", "comm", package = "spDataLarge") +# spatial raster objects +dem = rast(system.file("raster/dem.tif", package = "spDataLarge")) +ndvi = rast(system.file("raster/ndvi.tif", package = "spDataLarge")) ``` `study_area` is an `sf`\index{sf} polygon representing the outlines of the study area. `random_points` is an `sf` object, and contains the 100 randomly chosen sites. `comm` is a community matrix of the wide data format [@wickham_tidy_2014] where the rows represent the visited sites in the field and the columns the observed species.^[In statistics, this is also called a contingency table or cross-table.] -```{r 14-eco-3, eval=FALSE} +```{r 15-eco-3, eval=FALSE} # sites 35 to 40 and corresponding occurrences of the first five species in the # community matrix comm[35:40, 1:5] @@ -86,10 +97,10 @@ comm[35:40, 1:5] The values represent species cover per site, and were recorded as the area covered by a species in proportion to the site area in percentage points (%; please note that one site can have >100% due to overlapping cover between individual plants). The rownames of `comm` correspond to the `id` column of `random_points`. -`dem` is the digital elevation model\index{digital elevation model} (DEM) for the study area, and `ndvi` is the Normalized Difference Vegetation Index (NDVI) computed from the red and near-infrared channels of a Landsat scene (see Section \@ref(local-operations) and `?ndvi`). +`dem` is the digital elevation model\index{digital elevation model} (DEM) for the study area, and `ndvi` is the Normalized Difference Vegetation Index (NDVI) computed from the red and near-infrared channels of a Landsat scene (see Section \@ref(local-operations) and `?ndvi.tif`). Visualizing the data helps to get more familiar with it, as shown in Figure \@ref(fig:sa-mongon) where the `dem` is overplotted by the `random_points` and the `study_area`. -```{r 14-eco-4, eval=FALSE, echo=FALSE} +```{r 15-eco-4, eval=FALSE, echo=FALSE} # create hillshade hs = hillShade(terrain(dem), terrain(dem, "aspect")) # plot the data @@ -108,7 +119,10 @@ plot(st_geometry(study_area), add = TRUE) ```{r sa-mongon, echo=FALSE, message=FALSE, fig.cap="Study mask (polygon), location of the sampling sites (black points) and DEM in the background.", fig.scap="Study mask, location of the sampling sites."} # library("latticeExtra") # library("grid") -# hs = hillShade(terrain(dem), terrain(dem, "aspect")) +# library("sp") +# hs = terra::shade(terra::terrain(dem, v = "slope", unit = "radians"), +# terra::terrain(dem, v = "aspect", unit = "radians"), +# 40, 270) # spplot(dem, col.regions = terrain.colors(50), alpha.regions = 0.5, # scales = list(draw = TRUE, # tck = c(1, 0)), @@ -126,7 +140,7 @@ plot(st_geometry(study_area), add = TRUE) # under = TRUE) # grid.text("m asl", x = unit(0.8, "npc"), y = unit(0.75, "npc"), # gp = gpar(cex = 0.8)) -knitr::include_graphics("figures/sa-mongon-1.png") +knitr::include_graphics("figures/15_sa_mongon_sampling.png") ``` The next step is to compute variables which we will not only need for the modeling and predictive mapping (see Section \@ref(predictive-mapping)) but also for aligning the Non-metric multidimensional scaling (NMDS)\index{NMDS} axes with the main gradient in the study area, altitude and humidity, respectively (see Section \@ref(nmds)). @@ -135,69 +149,97 @@ Specifically, we will compute catchment slope and catchment area\index{catchment Curvatures might also represent valuable predictors, in the Exercise section you can find out how they would change the modeling result. To compute catchment area\index{catchment area} and catchment slope, we will make use of the `saga:sagawetnessindex` function.^[Admittedly, it is a bit unsatisfying that the only way of knowing that `sagawetnessindex` computes the desired terrain attributes is to be familiar with SAGA\index{SAGA}.] -`get_usage()` returns all function\index{function} parameters and default values of a specific geoalgorithm\index{geoalgorithm}. +`qgis_show_help()` returns all function\index{function} parameters and default values of a specific geoalgorithm\index{geoalgorithm}. Here, we present only a selection of the complete output. -```{r 14-eco-5, eval=FALSE} -get_usage("saga:sagawetnessindex") -#>ALGORITHM: Saga wetness index -#> DEM -#> ... -#> SLOPE_TYPE -#> ... -#> AREA -#> SLOPE -#> AREA_MOD -#> TWI +```{r 15-eco-5, eval=FALSE} +qgisprocess::qgis_show_help("saga:sagawetnessindex") +#> Saga wetness index (saga:sagawetnessindex) +#> ... +#> ---------------- +#> Arguments +#> ---------------- +#> +#> DEM: Elevation +#> Argument type: raster +#> Acceptable values: +#> - Path to a raster layer #> ... -#>SLOPE_TYPE(Type of Slope) -#> 0 - [0] local slope -#> 1 - [1] catchment slope +#> SLOPE_TYPE: Type of Slope +#> Argument type: enum +#> Available values: +#> - 0: [0] local slope +#> - 1: [1] catchment slope +#> ... +#> AREA: Catchment area +#> Argument type: rasterDestination +#> Acceptable values: +#> - Path for new raster layer +#>... +#> ---------------- +#> Outputs +#> ---------------- +#> +#> AREA: +#> Catchment area +#> SLOPE: +#> Catchment slope #> ... ``` Subsequently, we can specify the needed parameters using R named arguments (see Section \@ref(rqgis)). -Remember that we can use a `RasterLayer` living in R's\index{R} global environment to specify the input raster `DEM` (see Section \@ref(rqgis)). +Remember that we can use a `SpatRaster` living in R's\index{R} global environment to specify the input raster `DEM` (see Section \@ref(rqgis)). Specifying 1 as the `SLOPE_TYPE` makes sure that the algorithm will return the catchment slope. The resulting output rasters\index{raster} should be saved to temporary files with an `.sdat` extension which is a SAGA\index{SAGA} raster format. -Setting `load_output` to `TRUE` ensures that the resulting rasters will be imported into R. -```{r 14-eco-6, eval=FALSE} +```{r 15-eco-6, eval=FALSE} # environmental predictors: catchment slope and catchment area -ep = run_qgis(alg = "saga:sagawetnessindex", - DEM = dem, - SLOPE_TYPE = 1, - SLOPE = tempfile(fileext = ".sdat"), - AREA = tempfile(fileext = ".sdat"), - load_output = TRUE, - show_output_paths = FALSE) -``` - -This returns a list named `ep` consisting of two elements: `AREA` and `SLOPE`. -Let us add two more raster objects to the list, namely `dem` and `ndvi`, and convert it into a raster stack\index{raster!stack} (see Section \@ref(raster-classes)). - -```{r 14-eco-7, eval=FALSE} -ep = stack(c(dem, ndvi, ep)) -names(ep) = c("dem", "ndvi", "carea", "cslope") +ep = qgisprocess::qgis_run_algorithm( + alg = "saga:sagawetnessindex", + DEM = dem, + SLOPE_TYPE = 1, + SLOPE = tempfile(fileext = ".sdat"), + AREA = tempfile(fileext = ".sdat"), + .quiet = TRUE) +``` + +This returns a list named `ep` containing the paths to the computed output rasters. +Let us read in catchment area as well as catchment slope into a multilayer `SpatRaster` object (see Section \@ref(raster-classes)). +Additionally, we will add two more raster objects to it, namely `dem` and `ndvi`. + +```{r 15-eco-7, eval=FALSE} +# read in catchment area and catchment slope +ep = ep[c("AREA", "SLOPE")] |> + unlist() |> + terra::rast() +# assign proper names +names(ep) = c("carea", "cslope") +# make sure all rasters share the same origin +terra::origin(ep) = terra::origin(dem) +# add dem and ndvi to the multilayer SpatRaster object +ep = c(dem, ndvi, ep) ``` Additionally, the catchment area\index{catchment area} values are highly skewed to the right (`hist(ep$carea)`). A log10-transformation makes the distribution more normal. -```{r 14-eco-8, eval=FALSE} +```{r 15-eco-8, eval=FALSE} ep$carea = log10(ep$carea) ``` As a convenience to the reader, we have added `ep` to **spDataLarge**: -```{r 14-eco-9, eval=FALSE} -data("ep", package = "spDataLarge") +```{r 15-eco-9, eval=FALSE} +ep = terra::rast(system.file("raster/ep.tif", package = "spDataLarge")) ``` Finally, we can extract the terrain attributes to our field observations (see also Section \@ref(raster-extraction)). -```{r 14-eco-10, eval=FALSE} -random_points[, names(ep)] = raster::extract(ep, random_points) +```{r 15-eco-10, eval=FALSE} +random_points[, names(ep)] = + # terra::extract adds automatically a for our purposes unnecessary ID column + terra::extract(ep, terra::vect(random_points)) |> + dplyr::select(-ID) ``` ## Reducing dimensionality {#nmds} @@ -230,9 +272,9 @@ Often ordinations\index{ordination} using presence-absence data yield better res Ordination techniques such as NMDS\index{NMDS} require at least one observation per site. Hence, we need to dismiss all sites in which no species were found. -```{r 14-eco-11, eval=FALSE} +```{r 15-eco-11, eval=FALSE} # presence-absence matrix -pa = decostand(comm, "pa") # 100 rows (sites), 69 columns (species) +pa = vegan::decostand(comm, "pa") # 100 rows (sites), 69 columns (species) # keep only sites in which at least one species was found pa = pa[rowSums(pa) != 0, ] # 84 rows, 69 columns ``` @@ -242,9 +284,9 @@ The resulting output matrix serves as input for the NMDS\index{NMDS}. NMDS\index{NMDS} is an iterative procedure trying to make the ordinated space more similar to the input matrix in each step. To make sure that the algorithm converges, we set the number of steps to 500 (`try` parameter). -```{r 14-eco-12, eval=FALSE, message=FALSE} +```{r 15-eco-12, eval=FALSE, message=FALSE} set.seed(25072018) -nmds = metaMDS(comm = pa, k = 4, try = 500) +nmds = vegan::metaMDS(comm = pa, k = 4, try = 500) nmds$stress #> ... #> Run 498 stress 0.08834745 @@ -257,11 +299,11 @@ nmds$stress #> 0.08831395 ``` -```{r 14-eco-13, eval=FALSE, echo=FALSE} +```{r 15-eco-13, eval=FALSE, echo=FALSE} saveRDS(nmds, "extdata/15-nmds.rds") ``` -```{r 14-eco-14, include=FALSE, eval=FALSE} +```{r 15-eco-14, include=FALSE, eval=FALSE} nmds = readRDS("extdata/15-nmds.rds") ``` @@ -273,23 +315,23 @@ Since humidity is highly correlated with elevation, we rotate the NMDS axes\inde Plotting the result reveals that the first axis is, as intended, clearly associated with altitude (Figure \@ref(fig:xy-nmds)). ```{r xy-nmds-code, fig.cap="Plotting the first NMDS axis against altitude.", fig.scap = "First NMDS axis against altitude plot.", fig.asp=1, out.width="60%", eval=FALSE} -elev = dplyr::filter(random_points, id %in% rownames(pa)) %>% +elev = dplyr::filter(random_points, id %in% rownames(pa)) |> dplyr::pull(dem) # rotating NMDS in accordance with altitude (proxy for humidity) -rotnmds = MDSrotate(nmds, elev) +rotnmds = vegan::MDSrotate(nmds, elev) # extracting the first two axes -sc = scores(rotnmds, choices = 1:2) +sc = vegan::scores(rotnmds, choices = 1:2) # plotting the first axis against altitude plot(y = sc[, 1], x = elev, xlab = "elevation in m", ylab = "First NMDS axis", cex.lab = 0.8, cex.axis = 0.8) ``` ```{r xy-nmds, fig.cap="Plotting the first NMDS axis against altitude.", fig.scap = "First NMDS axis against altitude plot.", fig.asp=1, out.width="60%"} -knitr::include_graphics("figures/xy-nmds-1.png") +knitr::include_graphics("figures/15_xy_nmds.png") ``` -```{r 14-eco-15, eval=FALSE, echo=FALSE} -# scores and rotated scores in one figure +```{r 15-eco-15, eval=FALSE, echo=FALSE} +# scores and rotated scores in one figure p1 = xyplot(scores(rotnmds)[, 2] ~ scores(rotnmds)[, 1], pch = 16, col = "lightblue", xlim = c(-3, 2), ylim = c(-2, 2), xlab = list("Dimension 1", cex = 0.8), @@ -326,8 +368,8 @@ plot(scores(nmds, choices = 1:2)) points(scores(rotnmds, choices = 1:2), col = "lightblue", pch = 16) -sc = scores(nmds, choices = 1:2) %>% as.data.frame -sc$id = rownames(sc) %>% as.numeric +sc = scores(nmds, choices = 1:2) |> as.data.frame() +sc$id = rownames(sc) |> as.numeric() rp = inner_join(select(sc, id), st_drop_geometry(random_points)) fit_1 = envfit(nmds, select(rp, dem)) fit_2 = envfit(rotnmds, select(rp, dem)) @@ -344,14 +386,14 @@ To spatially visualize them, we can model the NMDS\index{NMDS} scores with the p ## Modeling the floristic gradient To predict the floristic gradient spatially, we will make use of a random forest\index{random forest} model [@hengl_random_2018]. -Random forest\index{random forest} models are frequently used in environmental and ecological modeling, and often provide the best results in terms of predictive performance [@schratz_performance_nodate]. +Random forest\index{random forest} models are frequently used in environmental and ecological modeling, and often provide the best results in terms of predictive performance [@schratz_hyperparameter_2019]. Here, we shortly introduce decision trees and bagging, since they form the basis of random forests\index{random forest}. We refer the reader to @james_introduction_2013 for a more detailed description of random forests\index{random forest} and related techniques. To introduce decision trees by example, we first construct a response-predictor matrix by joining the rotated NMDS\index{NMDS} scores to the field observations (`random_points`). -We will also use the resulting data frame for the **mlr**\index{mlr (package)} modeling later on. +We will also use the resulting data frame for the **mlr3**\index{mlr3 (package)} modeling later on. -```{r 14-eco-16, eval=FALSE} +```{r 15-eco-16, eval=FALSE} # construct response-predictor matrix # id- and response variable rp = data.frame(id = as.numeric(rownames(sc)), sc = sc[, 1]) @@ -362,19 +404,19 @@ rp = inner_join(random_points, rp, by = "id") Decision trees split the predictor space into a number of regions. To illustrate this, we apply a decision tree to our data using the scores of the first NMDS\index{NMDS} axis as the response (`sc`) and altitude (`dem`) as the only predictor. -```{r 14-eco-17, eval=FALSE} +```{r 15-eco-17, eval=FALSE} library("tree") -tree_mo = tree(sc ~ dem, data = rp) +tree_mo = tree::tree(sc ~ dem, data = rp) plot(tree_mo) text(tree_mo, pretty = 0) ``` -```{r 14-eco-18, echo=FALSE, eval=FALSE} +```{r 15-eco-18, echo=FALSE, eval=FALSE} library("tree") -tree_mo = tree(sc ~ dem, data = rp) +tree_mo = tree::tree(sc ~ dem, data = rp) ``` -```{r 14-eco-19, eval=FALSE, echo=FALSE} +```{r 15-eco-19, eval=FALSE, echo=FALSE} png("figures/15_tree.png", width = 1100, height = 700, units = "px", res = 300) par(mar = rep(1, 4)) plot(tree_mo) @@ -410,7 +452,7 @@ for each split of the tree—in other words, that bagging should be done. @james_introduction_2013 --> -### **mlr** building blocks +### **mlr3** building blocks The code in this section largely follows the steps we have introduced in Section \@ref(svm). The only differences are the following: @@ -418,7 +460,7 @@ The only differences are the following: 1. The response variable is numeric, hence a regression\index{regression} task will replace the classification\index{classification} task of Section \@ref(svm). 1. Instead of the AUROC\index{AUROC} which can only be used for categorical response variables, we will use the root mean squared error (RMSE\index{RMSE}) as performance measure. 1. We use a random forest\index{random forest} model instead of a support vector machine\index{SVM} which naturally goes along with different hyperparameters\index{hyperparameter}. -1. We are leaving the assessment of a bias-reduced performance measure as an exercise to the reader (see Exercises). +1. We are leaving the assessment of a bias-reduced performance measure as an exercise to the reader (see Exercises). Instead we show how to tune hyperparameters\index{hyperparameter} for (spatial) predictions. Remember that 125,500 models were necessary to retrieve bias-reduced performance estimates when using 100-repeated 5-fold spatial cross-validation\index{cross-validation!spatial CV} and a random search of 50 iterations (see Section \@ref(svm)). @@ -432,42 +474,26 @@ This means, the inner hyperparameter\index{hyperparameter} tuning level is no lo Therefore, we tune the hyperparameters\index{hyperparameter} for a good spatial prediction on the complete dataset via a 5-fold spatial CV\index{cross-validation!spatial CV} with one repetition. -The preparation for the modeling using the **mlr**\index{mlr (package)} package includes the construction of a response-predictor matrix containing only variables which should be used in the modeling and the construction of a separate coordinate data frame. +Having already constructed the input variables (`rp`), we are all set for specifying the **mlr3**\index{mlr3 (package)} building blocks (task, learner, and resampling). +For specifying a spatial task, we use again the **mlr3spatiotempcv** [@schratz_mlr3spatiotempcv_2021] package (section \@ref(spatial-cv-with-mlr3)). +Since our response (`sc`) is numeric, we use a regression\index{regression} task. -```{r 14-eco-20, eval=FALSE} -# extract the coordinates into a separate data frame -coords = sf::st_coordinates(rp) %>% - as.data.frame() %>% - rename(x = X, y = Y) -# only keep response and predictors which should be used for the modeling -rp = dplyr::select(rp, -id, -spri) %>% - st_drop_geometry() +```{r 15-eco-20, eval=FALSE} +# create task +task = mlr3spatiotempcv::TaskRegrST$new( + id = "mongon", backend = dplyr::select(rp, -id, -spri), target = "sc") ``` -Having constructed the input variables, we are all set for specifying the **mlr**\index{mlr (package)} building blocks (task, learner, and resampling). -We will use a regression\index{regression} task since the response variable is numeric. -The learner is a random forest\index{random forest} model implementation from the **ranger** package. +Using an `sf` object as the backend automatically provides the geometry information needed for the spatial partitioning later on. +Additionally, we got rid of the columns `id` and `spri` since these variables should not be used as predictors in the modeling. +Next, we go on to contruct the a random forest\index{random forest} learner from the **ranger** package. -```{r 14-eco-21, eval=FALSE} -# create task -task = makeRegrTask(data = rp, target = "sc", coordinates = coords) -# learner -lrn_rf = makeLearner(cl = "regr.ranger", predict.type = "response") +```{r 15-eco-21, eval=FALSE} +lrn_rf = lrn("regr.ranger", predict_type = "response") ``` As opposed to for example support vector machines\index{SVM} (see Section \@ref(svm)), random forests often already show good performances when used with the default values of their hyperparameters (which may be one reason for their popularity). Still, tuning often moderately improves model results, and thus is worth the effort [@probst_hyperparameters_2018]. -Since we deal with geographic data, we will again make use of spatial cross-validation to tune the hyperparameters\index{hyperparameter} (see Sections \@ref(intro-cv) and \@ref(spatial-cv-with-mlr)). -Specifically, we will use a five-fold spatial partitioning with only one repetition (`makeResampleDesc()`). -In each of these spatial partitions, we run 50 models (`makeTuneControlRandom()`) to find the optimal hyperparameter\index{hyperparameter} combination. - -```{r 14-eco-22, eval=FALSE} -# spatial partitioning -perf_level = makeResampleDesc("SpCV", iters = 5) -# specifying random search -ctrl = makeTuneControlRandom(maxit = 50L) -``` - In random forests\index{random forest}, the hyperparameters\index{hyperparameter} `mtry`, `min.node.size` and `sample.fraction` determine the degree of randomness, and should be tuned [@probst_hyperparameters_2018]. `mtry` indicates how many predictor variables should be used in each tree. If all predictors are used, then this corresponds in fact to bagging (see beginning of Section \@ref(modeling-the-floristic-gradient)). @@ -476,55 +502,73 @@ Smaller fractions lead to greater diversity, and thus less correlated trees whic The `min.node.size` parameter indicates the number of observations a terminal node should at least have (see also Figure \@ref(fig:tree)). Naturally, as trees and computing time become larger, the lower the `min.node.size`. -Hyperparameter\index{hyperparameter} combinations will be selected randomly but should fall inside specific tuning limits (`makeParamSet()`). +Hyperparameter\index{hyperparameter} combinations will be selected randomly but should fall inside specific tuning limits (created with `paradox::ps()`). `mtry` should range between 1 and the number of predictors -(4) -`sample.fraction` -should range between 0.2 and 0.9 and `min.node.size` should range between 1 and 10. +(4), `sample.fraction` should range between 0.2 and 0.9 and `min.node.size` should range between 1 and 10. ```{r 14-eco-23, eval=FALSE} # specifying the search space -ps = makeParamSet( - makeIntegerParam("mtry", lower = 1, upper = ncol(rp) - 1), - makeNumericParam("sample.fraction", lower = 0.2, upper = 0.9), - makeIntegerParam("min.node.size", lower = 1, upper = 10) +search_space = paradox::ps( + mtry = paradox::p_int(lower = 1, upper = ncol(task$data()) - 1), + sample.fraction = paradox::p_dbl(lower = 0.2, upper = 0.9), + min.node.size = paradox::p_int(lower = 1, upper = 10) ) ``` -Finally, `tuneParams()` runs the hyperparameter\index{hyperparameter} tuning, and will find the optimal hyperparameter\index{hyperparameter} combination for the specified parameters. +Having defined the search space, we are all set for specifying our tuning via the `AutoTuner()` function. +Since we deal with geographic data, we will again make use of spatial cross-validation to tune the hyperparameters\index{hyperparameter} (see Sections \@ref(intro-cv) and \@ref(spatial-cv-with-mlr)). +Specifically, we will use a five-fold spatial partitioning with only one repetition (`rsmp()`). +In each of these spatial partitions, we run 50 models (`trm()`) while using randomly selected hyperparameter configurations (`tnr`) within predefined limits (`seach_space`) to find the optimal hyperparameter\index{hyperparameter} combination. The performance measure is the root mean squared error (RMSE\index{RMSE}). +```{r 15-eco-22, eval=FALSE} +at = mlr3tuning::AutoTuner$new( + learner = lrn_rf, + # spatial partitioning + resampling = mlr3::rsmp("spcv_coords", folds = 5), + # performance measure + measure = mlr3::msr("regr.rmse"), + # specify 50 iterations + terminator = mlr3tuning::trm("evals", n_evals = 50), + # predefined hyperparameter search space + search_space = search_space, + # specify random search + tuner = mlr3tuning::tnr("random_search") +) +``` + +Calling the `train()`-method of the `AutoTuner`-object finally runs the hyperparameter\index{hyperparameter} tuning, and will find the optimal hyperparameter\index{hyperparameter} combination for the specified parameters. + ```{r 14-eco-24, eval=FALSE} # hyperparamter tuning -set.seed(02082018) -tune = tuneParams(learner = lrn_rf, - task = task, - resampling = perf_level, - par.set = ps, - control = ctrl, - measures = mlr::rmse) +set.seed(0412022) +at$train(task) #>... -#> [Tune-x] 49: mtry=3; sample.fraction=0.533; min.node.size=5 -#> [Tune-y] 49: rmse.test.rmse=0.5636692; time: 0.0 min -#> [Tune-x] 50: mtry=1; sample.fraction=0.68; min.node.size=5 -#> [Tune-y] 50: rmse.test.rmse=0.6314249; time: 0.0 min -#> [Tune] Result: mtry=4; sample.fraction=0.887; min.node.size=10 : -#> rmse.test.rmse=0.5104918 +#> INFO [11:39:31.375] [mlr3] Finished benchmark +#> INFO [11:39:31.427] [bbotk] Result of batch 50: +#> INFO [11:39:31.432] [bbotk] mtry sample.fraction min.node.size regr.rmse warnings errors runtime_learners +#> INFO [11:39:31.432] [bbotk] 2 0.2059293 10 0.5118128 0 0 0.149 +#> INFO [11:39:31.432] [bbotk] uhash +#> INFO [11:39:31.432] [bbotk] 81dfa6f8-0109-410d-bdb5-a1490e7caf8e +#> INFO [11:39:31.448] [bbotk] Finished optimizing after 50 evaluation(s) +#> INFO [11:39:31.451] [bbotk] Result: +#> INFO [11:39:31.455] [bbotk] mtry sample.fraction min.node.size learner_param_vals x_domain regr.rmse +#> INFO [11:39:31.455] [bbotk] 4 0.8999753 7 0.3751501 ``` -```{r 14-eco-25, eval=FALSE, echo=FALSE} -saveRDS(tune, "extdata/15-tune.rds") +```{r 15-eco-25, eval=FALSE, echo=FALSE} +saveRDS(at, "extdata/15-tune.rds") ``` -```{r 14-eco-26, echo=FALSE, eval=FALSE} +```{r 15-eco-26, echo=FALSE, eval=TRUE} tune = readRDS("extdata/15-tune.rds") ``` -An `mtry` of 4, a `sample.fraction` of 0.887, and a `min.node.size` of 10 represent the best hyperparameter\index{hyperparameter} combination. +An `mtry` of 4, a `sample.fraction` of 0.9, and a `min.node.size` of 7 represent the best hyperparameter\index{hyperparameter} combination. An RMSE\index{RMSE} of - -0.51 + +0.38 is relatively good when considering the range of the response variable which is 3.04 @@ -533,85 +577,67 @@ is relatively good when considering the range of the response variable which is ### Predictive mapping The tuned hyperparameters\index{hyperparameter} can now be used for the prediction. -We simply have to modify our learner using the result of the hyperparameter tuning, and run the corresponding model. - -```{r 14-eco-27, eval=FALSE} -# learning using the best hyperparameter combination -lrn_rf = makeLearner(cl = "regr.ranger", - predict.type = "response", - mtry = tune$x$mtry, - sample.fraction = tune$x$sample.fraction, - min.node.size = tune$x$min.node.size) -# doing the same more elegantly using setHyperPars() -# lrn_rf = setHyperPars(makeLearner("regr.ranger", predict.type = "response"), -# par.vals = tune$x) -# train model -model_rf = train(lrn_rf, task) -# to retrieve the ranger output, run: -# mlr::getLearnerModel(model_rf) -# which corresponds to: -# ranger(sc ~ ., data = rp, -# mtry = tune$x$mtry, -# sample.fraction = tune$x$sample.fraction, -# min.node.sie = tune$x$min.node.size) -``` - -The last step is to apply the model to the spatially available predictors, i.e., to the raster stack\index{raster!stack}. -So far, `raster::predict()` does not support the output of **ranger** models, hence, we will have to program the prediction ourselves. -First, we convert `ep` into a prediction data frame which secondly serves as input for the `predict.ranger()` function. -Thirdly, we put the predicted values back into a `RasterLayer`\index{raster} (see Section \@ref(raster-subsetting) and Figure \@ref(fig:rf-pred)). - -```{r 14-eco-28, eval=FALSE} -# convert raster stack into a data frame -new_data = as.data.frame(as.matrix(ep)) -# apply the model to the data frame -pred_rf = predict(model_rf, newdata = new_data) -# put the predicted values into a raster -pred = dem -# replace altitudinal values by rf-prediction values -pred[] = pred_rf$data$response +To do so, we only need to run the `predict` method of our fitted `AutoTuner` object. + +```{r 15-eco-27, eval=FALSE} +# predicting using the best hyperparameter combination +at$predict(task) +``` + +The `predict` method will apply the model to all observations used in the modeling. +Given a multilayer `SpatRaster` containing rasters named as the predictors used in the modeling, `terra::predict()` will also make spatial predictions, i.e., predict to new data. + +```{r 15-eco-28, eval=FALSE} +pred = terra::predict(ep, model = at, fun = predict) ``` ```{r rf-pred, echo=FALSE, fig.cap="Predictive mapping of the floristic gradient clearly revealing distinct vegetation belts.", fig.width = 10, fig.height = 10, fig.scap="Predictive mapping of the floristic gradient."} -# library("latticeExtra") -# library("grid") -# -# # create a color palette -# blue = rgb(0, 0, 146, maxColorValue = 255) -# lightblue = rgb(0, 129, 255, maxColorValue = 255) -# turquoise = rgb(0, 233, 255, maxColorValue = 255) -# green = rgb(142, 255, 11, maxColorValue = 255) -# yellow = rgb(245, 255, 8, maxColorValue = 255) -# orange = rgb(255, 173, 0, maxColorValue = 255) -# lightred = rgb(255, 67, 0, maxColorValue = 255) -# red = rgb(170, 0, 0, maxColorValue = 255) -# pal = colorRampPalette(c(blue, lightblue, turquoise, green, yellow, -# orange, lightred, red)) -# # # restrict the prediction to your study area -# pred = mask(pred, study_area) %>% -# trim +# pred = terra::mask(pred, terra::vect(study_area)) |> +# terra::trim() # # create a hillshade -# hs = hillShade(terrain(dem), terrain(dem, "aspect")) %>% -# mask(., study_area) -# spplot(extend(pred, 2), col.regions = pal(50), alpha.regions = 0.7, +# hs = terra::shade(terra::terrain(ep$dem, v = "slope", unit = "radians"), +# terra::terrain(ep$dem, v = "aspect", unit = "radians"), +# 40, 270) |> +# terra::mask(terra::vect(study_area)) +# +# # png(file = "figures/15_rf_pred.png", width = 20, height = 20, units = "cm", +# # res = 300) +# spplot(terra::extend(pred, 2), col.regions = pal(50), alpha.regions = 0.7, # scales = list(draw = TRUE, -# tck = c(1, 0), +# tck = c(1, 0), # cex = 0.8), # colorkey = list(space = "right", width = 0.5, height = 0.5, # axis.line = list(col = "black")), # sp.layout = list( # # list("sp.points", as(random_points, "Spatial"), pch = 16, # # col = "black", cex = 0.8, first = FALSE), -# list("sp.polygons", as(study_area, "Spatial"), -# col = "black", first = FALSE, lwd = 3) +# list("sp.polygons", as(study_area, "Spatial"), +# col = "black", first = FALSE, lwd = 5) # ) -# ) + -# latticeExtra::as.layer(spplot(hs, col.regions = gray(0:100 / 100)), +# ) + +# latticeExtra::as.layer(spplot(hs, col.regions = gray(0:100 / 100)), # under = TRUE) -# grid.text("NMDS1", x = unit(0.75, "npc"), y = unit(0.75, "npc"), -# gp = gpar(cex = 0.8)) -knitr::include_graphics("figures/rf-pred-1.png") +# grid.text("NMDS1", x = unit(0.92, "npc"), y = unit(0.77, "npc"), +# gp = gpar(cex = 0.9)) +# # dev.off() + +knitr::include_graphics("figures/15_rf_pred.png") +``` + +In case, `terra::predict()` does not support a model algorithm, you can still make the predictions manually. + +```{r 15-eco-29, eval=FALSE} +newdata = as.data.frame(as.matrix(ep)) +colSums(is.na(newdata)) # 0 NAs +# but assuming there were 0s results in a more generic approach +ind = rowSums(is.na(newdata)) == 0 +tmp = at$predict_newdata(newdata = newdata[ind, ], task = task) +newdata[ind, "pred"] = data.table::as.data.table(tmp)[["response"]] +pred_2 = ep$dem +# now fill the raster with the predicted values +pred_2[] = newdata$pred +identical(values(pred), values(pred_2)) # TRUE ``` The predictive mapping clearly reveals distinct vegetation belts (Figure \@ref(fig:rf-pred)). @@ -628,7 +654,7 @@ Interestingly, the spatial prediction clearly reveals that the bromeliad belt is In this chapter we have ordinated\index{ordination} the community matrix of the **lomas** Mt. Mongón with the help of a NMDS\index{NMDS} (Section \@ref(nmds)). The first axis, representing the main floristic gradient in the study area, was modeled as a function of environmental predictors which partly were derived through R-GIS\index{GIS} bridges (Section \@ref(data-and-data-preparation)). -The **mlr**\index{mlr (package)} package provided the building blocks to spatially tune the hyperparameters\index{hyperparameter} `mtry`, `sample.fraction` and `min.node.size` (Section \@ref(mlr-building-blocks)). +The **mlr3**\index{mlr3 (package)} package provided the building blocks to spatially tune the hyperparameters\index{hyperparameter} `mtry`, `sample.fraction` and `min.node.size` (Section \@ref(mlr3-building-blocks)). The tuned hyperparameters\index{hyperparameter} served as input for the final model which in turn was applied to the environmental predictors for a spatial representation of the floristic gradient (Section \@ref(predictive-mapping)). The result demonstrates spatially the astounding biodiversity in the middle of the desert. Since **lomas** mountains are heavily endangered, the prediction map can serve as basis for informed decision-making on delineating protection zones, and making the local population aware of the uniqueness found in their immediate neighborhood. @@ -653,24 +679,7 @@ However, this does not imply that the random forest\index{random forest} model h ## Exercises -1. Run a NMDS\index{NMDS} using the percentage data of the community matrix. -Report the stress value and compare it to the stress value as retrieved from the NMDS using presence-absence data. -What might explain the observed difference? - -1. Compute all the predictor rasters\index{raster} we have used in the chapter (catchment slope, catchment area), and put them into a raster stack. -Add `dem` and `ndvi` to the raster stack\index{raster!stack}. -Next, compute profile and tangential curvature as additional predictor rasters and add them to the raster stack (hint: `grass7:r.slope.aspect`). -Finally, construct a response-predictor matrix. -The scores of the first NMDS\index{NMDS} axis (which were the result when using the presence-absence community matrix) rotated in accordance with elevation represent the response variable, and should be joined to `random_points` (use an inner join). -To complete the response-predictor matrix, extract the values of the environmental predictor raster stack to `random_points`. - -1. Use the response-predictor matrix of the previous exercise to fit a random forest\index{random forest} model. -Find the optimal hyperparameters\index{hyperparameter} and use them for making a prediction map. - - -1. Retrieve the bias-reduced RMSE of a random forest\index{random forest} model using spatial cross-validation\index{cross-validation!spatial CV} including the estimation of optimal hyperparameter\index{hyperparameter} combinations (random search with 50 iterations) in an inner tuning loop (see Section \@ref(svm)). -Parallelize\index{parallelization} the tuning level (see Section \@ref(svm)). -Report the mean RMSE\index{RMSE} and use a boxplot to visualize all retrieved RMSEs. - -1. Retrieve the bias-reduced RMSE\index{RMSE} of a simple linear model\index{regression!linear} using spatial cross-validation\index{cross-validation!spatial CV}. -Compare the result to the result of the random forest model\index{random forest} by making RMSE\index{RMSE} boxplots for each modeling approach. +```{r, echo=FALSE, results='asis'} +res = knitr::knit_child('_15-ex.Rmd', quiet = TRUE, options = list(include = FALSE, eval = FALSE)) +cat(res, sep = '\n') +``` diff --git a/_12-ex.Rmd b/_12-ex.Rmd new file mode 100644 index 000000000..f7c05238a --- /dev/null +++ b/_12-ex.Rmd @@ -0,0 +1,230 @@ +The solutions assume the following packages are attached (other packages will be attached when needed): + +```{r 12-ex-e0, message=FALSE, warning=FALSE} +library(dplyr) +# library(kernlab) +library(mlr3) +library(mlr3learners) +library(mlr3extralearners) +library(mlr3spatiotempcv) +library(mlr3tuning) +library(qgisprocess) +library(raster) +# library(rlang) +library(sf) +library(tmap) +``` + +E1. Compute the following terrain attributes from the `elev` dataset loaded with `terra::rast(system.file("raster/ta.tif", package = "spDataLarge"))$elev` with the help of R-GIS bridges (see this [Chapter](https://geocompr.robinlovelace.net/gis.html#gis)): + - Slope + - Plan curvature + - Profile curvature + - Catchment area + +```{r, eval=FALSE} +# attach data +dem = terra::rast(system.file("raster/ta.tif", package = "spDataLarge"))$elev + +algs = qgisprocess::qgis_algorithms() +dplyr::filter(algs, grepl("curvature", algorithm)) +alg = "saga:slopeaspectcurvature" +qgisprocess::qgis_show_help(alg) +qgisprocess::qgis_arguments(alg) +# terrain attributes (ta) +out_nms = paste0(tempdir(), "/", c("slope", "cplan", "cprof"), + ".sdat") +args = rlang::set_names(out_nms, c("SLOPE", "C_PLAN", "C_PROF")) +out = qgis_run_algorithm(alg, ELEVATION = dem, METHOD = 6, + UNIT_SLOPE = "[1] degree", + !!!args, + .quiet = TRUE + ) +ta = out[names(args)] |> unlist() |> terra::rast() +names(ta) = c("slope", "cplan", "cprof") +# catchment area +dplyr::filter(algs, grepl("[Cc]atchment", algorithm)) +alg = "saga:catchmentarea" +qgis_show_help(alg) +qgis_arguments(alg) +carea = qgis_run_algorithm(alg, + ELEVATION = dem, + METHOD = 4, + FLOW = file.path(tempdir(), "carea.sdat")) +# transform carea +carea = terra::rast(carea$FLOW[1]) +log10_carea = log10(carea) +names(log10_carea) = "log10_carea" +# add log_carea and dem to the terrain attributes +ta = c(ta, dem, log10_carea) +``` + +E2. Extract the values from the corresponding output rasters to the `lsl` data frame (`data("lsl", package = "spDataLarge"`) by adding new variables called `slope`, `cplan`, `cprof`, `elev` and `log_carea` (see this [section](https://geocompr.robinlovelace.net/spatial-cv.html#case-landslide) for details). + +```{r, eval=FALSE} +# attach terrain attribute raster stack (in case you have skipped the previous +# exercise) +data("lsl", package = "spDataLarge") +ta = terra::rast(system.file("raster/ta.tif", package = "spDataLarge")) +lsl = dplyr::select(lsl, x, y, lslpts) +# extract values to points, i.e., create predictors +lsl[, names(ta)] = terra::extract(ta, lsl[, c("x", "y")]) |> + dplyr::select(-ID) +``` + +E3. Use the derived terrain attribute rasters in combination with a GLM to make a spatial prediction map similar to that shown in this [Figure](https://geocompr.robinlovelace.net/spatial-cv.html#fig:lsl-susc). +Running `data("study_mask", package = "spDataLarge")` attaches a mask of the study area. + +```{r, eval=FALSE} +# attach data (in case you have skipped exercises 1) and 2) +# landslide points with terrain attributes and terrain attribute raster stack +data("lsl", "study_mask", package = "spDataLarge") +ta = terra::rast(system.file("raster/ta.tif", package = "spDataLarge")) + +# fit the model +fit = glm(lslpts ~ slope + cplan + cprof + elev + log10_carea, + data = lsl, family = binomial()) + +# make the prediction +pred = terra::predict(object = ta, model = fit, type = "response") + +# make the map +lsl_sf = sf::st_as_sf(lsl, coords = c("x", "y"), crs = 32717) +study_mask = terra::vect(study_mask) +lsl_sf = sf::st_as_sf(lsl, coords = c("x", "y"), crs = 32717) +hs = terra::shade(ta$slope * pi / 180, + terra::terrain(ta$elev, v = "aspect", unit = "radians")) +rect = tmaptools::bb_poly(raster::raster(hs)) +bbx = tmaptools::bb(raster::raster(hs), xlim = c(-0.00001, 1), + ylim = c(-0.00001, 1), relative = TRUE) +# white raster to only plot the axis ticks, otherwise gridlines would be visible +tm_shape(hs, bbox = bbx) + + tm_grid(col = "black", n.x = 1, n.y = 1, labels.inside.frame = FALSE, + labels.rot = c(0, 90)) + + tm_raster(palette = "white", legend.show = FALSE) + + # hillshade + tm_shape(terra::mask(hs, study_mask), bbox = bbx) + + tm_raster(palette = gray(0:100 / 100), n = 100, + legend.show = FALSE) + + # prediction raster + tm_shape(terra::mask(pred, study_mask)) + + tm_raster(alpha = 0.5, palette = "Reds", n = 6, legend.show = TRUE, + title = "Susceptibility") + + # rectangle and outer margins + qtm(rect, fill = NULL) + + tm_layout(outer.margins = c(0.04, 0.04, 0.02, 0.02), frame = FALSE, + legend.position = c("left", "bottom"), + legend.title.size = 0.9) +``` + +E4. Compute a 100-repeated 5-fold non-spatial cross-validation and spatial CV based on the GLM learner and compare the AUROC values from both resampling strategies with the help of boxplots (see this [Figure](https://geocompr.robinlovelace.net/spatial-cv.html#fig:boxplot-cv). +Hint: You need to specify a non-spatial resampling strategy. +Another hint: You might want to Excercises 4 to 6 in one go with the help of `mlr3::benchmark()` and `mlr3::benchmark_grid()` (for more information, please refer to https://mlr3book.mlr-org.com/perf-eval-cmp.html#benchmarking). +When doing so, keep in mind that the computation can take very long, probably several days. +This, of course, depends on your system. +Computation time will be shorter the more RAM and cores you have at your disposal. + +```{r, eval=FALSE} +# attach data (in case you have skipped exercises 1) and 2) +data("lsl", package = "spDataLarge") # landslide points with terrain attributes + +# create task +task = TaskClassifST$new( + id = "lsl_ecuador", + backend = mlr3::as_data_backend(lsl), target = "lslpts", positive = "TRUE", + extra_args = list( + coordinate_names = c("x", "y"), + coords_as_features = FALSE, + crs = 32717) +) + +# construct learners (for all subsequent exercises) +# GLM +lrn_glm = lrn("classif.log_reg", predict_type = "prob") + +# SVM +# construct SVM learner (using ksvm function from the kernlab package) +lrn_ksvm = lrn("classif.ksvm", predict_type = "prob", kernel = "rbfdot", + type = "C-svc") +# specify nested resampling and adjust learner accordingly +# five spatially disjoint partitions +tune_level = rsmp("spcv_coords", folds = 5) +# use 50 randomly selected hyperparameters +terminator = trm("evals", n_evals = 50) +tuner = tnr("random_search") +# define the outer limits of the randomly selected hyperparameters +ps = ps( + C = p_dbl(lower = -12, upper = 15, trafo = function(x) 2^x), + sigma = p_dbl(lower = -15, upper = 6, trafo = function(x) 2^x) +) +at_ksvm = AutoTuner$new( + learner = lrn_ksvm, + resampling = tune_level, + measure = msr("classif.auc"), + search_space = ps, + terminator = terminator, + tuner = tuner +) + +# QDA +lrn_qda = lrn("classif.qda", predict_type = "prob") + +# SVM without tuning hyperparameters +vals = lrn_ksvm$param_set$values +lrn_ksvm_notune = lrn_ksvm$clone() +lrn_ksvm_notune$param_set$values = c(vals, C = 1, sigma = 1) + +# define resampling strategies +# specify the reampling method, i.e. spatial CV with 100 repetitions and 5 folds +# -> in each repetition dataset will be splitted into five folds +# method: repeated_spcv_coords -> spatial partioning +rsmp_sp = rsmp("repeated_spcv_coords", folds = 5, repeats = 100) +# method: repeated_cv -> non-spatial partitioning +rsmp_nsp = rsmp("repeated_cv", folds = 5, repeats = 100) + +# (spatial) cross-validataion +#**************************** +# create your design +grid = benchmark_grid(tasks = task, + learners = list(lrn_glm, at_ksvm, lrn_qda, + lrn_ksvm_notune), + resamplings = list(rsmp_sp, rsmp_nsp)) +# execute the cross-validation +library(future) +# execute the outer loop sequentially and parallelize the inner loop +future::plan(list("sequential", "multisession"), + workers = floor(availableCores() / 2)) +set.seed(021522) +bmr = benchmark(grid, store_backends = FALSE) +# stop parallelization +future:::ClusterRegistry("stop") +# save your result, e.g. to +# saveRDS(bmr, file = "extdata/12-bmr.rds") + +# plot your result +autoplot(bmr, measure = msr("classif.auc")) +``` + +E5. Model landslide susceptibility using a quadratic discriminant analysis (QDA). +Assess the predictive performance of the QDA. +What is the a difference between the spatially cross-validated mean AUROC value of the QDA and the GLM? + +```{r, eval=FALSE} +# attach data (in case you have skipped exercise 4) +bmr = readRDS("extdata/12-bmr.rds") + +# plot your result +autoplot(bmr, measure = msr("classif.auc")) +# QDA has higher AUROC values on average than GLM which indicates moderately +# non-linear boundaries +``` + +E6. Run the SVM without tuning the hyperparameters. +Use the `rbfdot` kernel with $\sigma$ = 1 and *C* = 1. +Leaving the hyperparameters unspecified in **kernlab**'s `ksvm()` would otherwise initialize an automatic non-spatial hyperparameter tuning. + +```{r, eval=FALSE} +# attach data (in case you have skipped exercise 4) +bmr = readRDS("extdata/12-bmr.rds") +# plot your result +autoplot(bmr, measure = msr("classif.auc")) +``` diff --git a/_15-ex.Rmd b/_15-ex.Rmd new file mode 100644 index 000000000..f6ca71363 --- /dev/null +++ b/_15-ex.Rmd @@ -0,0 +1,217 @@ +The solutions assume the following packages are attached (other packages will be attached when needed): + +```{r 15-ex-e0, message=FALSE, warning=FALSE} +library(data.table) +library(dplyr) +library(future) +library(ggplot2) +library(lgr) +library(mlr3) +library(mlr3learners) +library(mlr3spatiotempcv) +library(mlr3tuning) +library(mlr3viz) +library(progressr) +library(qgisprocess) +library(terra) +library(tictoc) +library(sf) +library(vegan) +``` + +E1. Run a NMDS\index{NMDS} using the percentage data of the community matrix. +Report the stress value and compare it to the stress value as retrieved from the NMDS using presence-absence data. +What might explain the observed difference? + +```{r 15-ex-e1, eval=FALSE} +data("comm", package = "spDataLarge") +pa = decostand(comm, "pa") +pa = pa[rowSums(pa) != 0, ] +comm = comm[rowSums(comm) != 0, ] +set.seed(25072018) +nmds_pa = metaMDS(comm = pa, k = 4, try = 500) +nmds_per = metaMDS(comm = comm, k = 4, try = 500) +nmds_pa$stress +nmds_per$stress +``` + +The NMDS using the presence-absence values yields a better result (`nmds_pa$stress`) than the one using percentage data (`nmds_per$stress`). +This might seem surprising at first sight. +On the other hand, the percentage matrix contains both more information and more noise. +Another aspect is how the data was collected. +Imagine a botanist in the field. +It might seem feasible to differentiate between a plant which has a cover of 5% and another species that covers 10%. +However, what about a herbal species that was only detected three times and consequently has a very tiny cover, e.g., 0.0001%. +Maybe another herbal species was detected 6 times, is its cover then 0.0002%? +The point here is that percentage data as specified during a field campaign might reflect a precision that the data does not have. +This again introduces noise which in turn will worsen the ordination result. +Still, it is a valuable information if one species had a higher frequency or coverage in one plot than another compared to just presence-absence data. +One compromise would be to use a categorical scale such as the Londo scale. + +E2. Compute all the predictor rasters\index{raster} we have used in the chapter (catchment slope, catchment area), and put them into a `SpatRaster`-object. +Add `dem` and `ndvi` to it. +Next, compute profile and tangential curvature and add them as additional predictor rasters (hint: `grass7:r.slope.aspect`). +Finally, construct a response-predictor matrix. +The scores of the first NMDS\index{NMDS} axis (which were the result when using the presence-absence community matrix) rotated in accordance with elevation represent the response variable, and should be joined to `random_points` (use an inner join). +To complete the response-predictor matrix, extract the values of the environmental predictor raster object to `random_points`. + +```{r 15-ex-e2, eval=FALSE} +# first compute the terrain attributes we have also used in the chapter +library(dplyr) +library(terra) +library(qgisprocess) +library(vegan) +data("comm", "random_points", package = "spDataLarge") +dem = terra::rast(system.file("raster/dem.tif", package = "spDataLarge")) +ep = qgisprocess::qgis_run_algorithm( + alg = "saga:sagawetnessindex", + DEM = dem, + SLOPE_TYPE = 1, + SLOPE = tempfile(fileext = ".sdat"), + AREA = tempfile(fileext = ".sdat"), + .quiet = TRUE) +# read in catchment area and catchment slope +ep = ep[c("AREA", "SLOPE")] |> + unlist() |> + terra::rast() +# assign proper names +names(ep) = c("carea", "cslope") +# make sure all rasters share the same origin +origin(ep) = origin(dem) +# add dem and ndvi to the multilayer SpatRaster object +ep = c(dem, ndvi, ep) +ep$carea = log10(ep$carea) + +# computing the curvatures +qgis_show_help("grass7:r.slope.aspect") +curvs = qgis_run_algorithm( + "grass7:r.slope.aspect", + elevation = dem, + .quiet = TRUE) +# adding curvatures to ep +curv_nms = c("pcurvature", "tcurvature") +curvs = curvs[curv_nms] |> + unlist() |> + terra::rast() +curvs = terra::app(curvs, as.numeric) +names(curvs) = curv_nms +ep = c(ep, curvs) +random_points[, names(ep)] = + # terra::extract adds an ID column, we don't need + terra::extract(ep, vect(random_points)) |> + select(-ID) +elev = dplyr::filter(random_points, id %in% rownames(pa)) %>% + dplyr::pull(dem) +# rotating NMDS in accordance with altitude (proxy for humidity) +rotnmds = MDSrotate(nmds_pa, elev) +# extracting the first two axes +sc = scores(rotnmds, choices = 1:2) +rp = data.frame(id = as.numeric(rownames(sc)), + sc = sc[, 1]) +# join the predictors (dem, ndvi and terrain attributes) +rp = inner_join(random_points, rp, by = "id") +``` + +E3. Retrieve the bias-reduced RMSE of a random forest\index{random forest} and a linear model using spatial cross-validation\index{cross-validation!spatial CV}. +The random forest modeling should include the estimation of optimal hyperparameter\index{hyperparameter} combinations (random search with 50 iterations) in an inner tuning loop (see Section \@ref(svm)). +Parallelize\index{parallelization} the tuning level (see Section \@ref(svm)). +Report the mean RMSE\index{RMSE} and use a boxplot to visualize all retrieved RMSEs. +Please not that this exercise is best solved using the mlr3 functions `benchmark_grid()` and `benchmark()` (see https://mlr3book.mlr-org.com/perf-eval-cmp.html#benchmarking for more information). + +```{r 15-ex-e3, eval=FALSE} +library(dplyr) +library(future) +library(mlr3) +library(mlr3spatiotempcv) +library(mlr3learners) +library(mlr3viz) +library(paradox) +# define the task +task = mlr3spatiotempcv::TaskRegrST$new( + id = "mongon", + backend = select(rp, -id, -spri), + target = "sc" + ) +# define the learners +mlr3::mlr_learners +# linear model +lrn_lm = mlr3::lrn("regr.lm", predict_type = "response") +# random forest +lrn_rf = mlr3::lrn("regr.ranger", predict_type = "response") +# now define the AutoTuner of the random forest +search_space = paradox::ps( + mtry = paradox::p_int(lower = 1, upper = ncol(task$data()) - 1), + sample.fraction = paradox::p_dbl(lower = 0.2, upper = 0.9), + min.node.size = paradox::p_int(lower = 1, upper = 10) +) +at_rf = mlr3tuning::AutoTuner$new( + learner = lrn_rf, + # spatial partitioning + resampling = mlr3::rsmp("spcv_coords", folds = 5), + # performance measure + measure = mlr3::msr("regr.rmse"), + search_space = search_space, + # random search with 50 iterations + terminator = mlr3tuning::trm("evals", n_evals = 50), + tuner = mlr3tuning::tnr("random_search") +) +# define the resampling strategy +rsmp_sp = mlr3::rsmp("repeated_spcv_coords", folds = 5, repeats = 100) + +# create the benchmark design +design_grid = mlr3::benchmark_grid( + tasks = task, + learners = list(lrn_lm, at_rf), + resamplings = rsmp_sp) +print(design_grid) +# execute the outer loop sequentially and parallelize the inner loop +future::plan(list("sequential", "multisession"), + workers = floor(future::availableCores() / 2)) +set.seed(04132022) +# reduce verbosity +lgr::get_logger("mlr3")$set_threshold("warn") +lgr::get_logger("bbotk")$set_threshold("info") +# BE CAREFUL: Running the benchmark might take quite some time +tictoc::tic() +progressr::with_progress(expr = { + bmr = mlr3::benchmark( + design = design_grid, + # New argument `encapsulate` for `resample()` and `benchmark()` to + # conveniently enable encapsulation and also set the fallback learner to the + # respective featureless learner. This is simply for convenience, configuring + # each learner individually is still possible and allows a more fine-grained + # control + encapsulate = "evaluate", + store_backends = FALSE, + store_models = FALSE) +}) +tictoc::toc() + +# stop parallelization +future:::ClusterRegistry("stop") +# save your result, e.g. to +saveRDS(bmr, file = "extdata/15_bmr.rds") + +# mean RMSE +bmr$aggregate(measures = msr("regr.rmse")) +# or computed manually +purrr::map(agg$resample_result, ~ mean(.$score(msr("regr.rmse"))$regr.rmse)) + +# make a boxplot (when using autoplot, mlr3viz needs to be attached!) +# library(mlr3viz) +autoplot(bmr, measure = msr("regr.rmse")) + +# or doing it "manually" +agg = bmr$aggregate(measures = msr("regr.rmse")) +# extract the AUROC values and put them into one data.table +d = purrr::map_dfr(agg$resample_result, ~ .$score(msr("regr.rmse"))) +# create the boxplots +library(ggplot2) +ggplot(data = d, mapping = aes(x = learner_id, y = regr.rmse)) + + geom_boxplot(fill = c("lightblue2", "mistyrose2")) + + theme_bw() + + labs(y = "RMSE", x = "model") +``` + +In fact, `lm` performs at least as good the random forest model, and thus should be preferred since it is much easier to understand and computationally much less demanding (no need for fitting hyperparameters). +But keep in mind that the used dataset is small in terms of observations and predictors and that the response-predictor relationships are also relatively linear. diff --git a/code/12-cv.R b/code/12-cv.R new file mode 100644 index 000000000..fcc9f37ef --- /dev/null +++ b/code/12-cv.R @@ -0,0 +1,252 @@ +# Filename: 12-cv.R (2022-02-16) +# +# TO DO: Introduce spatial cross-validation with the help of mlr3 +# +# Author(s): Jannes Muenchow +# +#********************************************************** +# contents---- +#********************************************************** +# +# 1 attach packages and data +# 2 modeling +# 3 spatial prediction +# +#********************************************************** +# 1 attach packages and data---- +#********************************************************** + +# attach packages +library(dplyr) +library(ggplot2) +library(mlr3) +library(mlr3extralearners) +library(mlr3learners) +library(mlr3spatiotempcv) +library(mlr3tuning) +library(raster) +library(sf) +library(tmap) + +#********************************************************** +# 2 modeling---- +#********************************************************** + +# attach data +data("lsl", "study_mask", package = "spDataLarge") +ta = terra::rast(system.file("raster/ta.tif", package = "spDataLarge")) + +# 2.1 create a mlr3 task==== +#********************************************************** +# id = name of the task +# target = response variable +# spatial: setting up the scene for spatial cv (hence x and y will not be used +# as predictors but as coordinates for kmeans clustering) +# all variables in data will be used in the model + +task = TaskClassifST$new( + id = "lsl_glm_sp", + backend = mlr3::as_data_backend(lsl), target = "lslpts", positive = "TRUE", + extra_args = list( + coordinate_names = c("x", "y"), + coords_as_features = FALSE, + crs = 32717) +) +# pls note that you can use a task created by TaskClassifST for both the spatial +# and the conventional CV approach + +# 2.2 construct a learner and run the model==== +#********************************************************** + +# glm learner +lrn_glm = lrn("classif.log_reg", predict_type = "prob") +lrn_glm$fallback = lrn("classif.featureless", predict_type = "prob") + +# construct SVM learner (using ksvm function from the kernlab package) +lrn_ksvm = lrn("classif.ksvm", predict_type = "prob", kernel = "rbfdot", + type = "C-svc") +lrn_ksvm$fallback = lrn("classif.featureless", predict_type = "prob") +# rbfdot Radial Basis kernel "Gaussian" +# the list of hyper-parameters (kernel parameters). This is a list which +# contains the parameters to be used with the kernel function. For valid +# parameters for existing kernels are : +# sigma inverse kernel width for the Radial Basis kernel function "rbfdot" and +# the Laplacian kernel "laplacedot". +# C cost of constraints violation (default: 1) this is the ‘C’-constant of the +# regularization term in the Lagrange formulation. + +# specify nested resampling and adjust learner accordingly +# five spatially disjoint partitions +tune_level = rsmp("spcv_coords", folds = 5) +# use 50 randomly selected hyperparameters +terminator = trm("evals", n_evals = 50) +tuner = tnr("random_search") +# define the outer limits of the randomly selected hyperparameters +ps = ps( + C = p_dbl(lower = -12, upper = 15, trafo = function(x) 2^x), + sigma = p_dbl(lower = -15, upper = 6, trafo = function(x) 2^x) +) + +at_ksvm = AutoTuner$new( + learner = lrn_ksvm, + resampling = tune_level, + measure = msr("classif.auc"), + search_space = ps, + terminator = terminator, + tuner = tuner +) +at_ksvm$fallback = lrn("classif.featureless", predict_type = "prob") +# 2.3 cross-validation==== +#********************************************************** + +# specify the reampling method, i.e. spatial CV with 100 repetitions and 5 folds +# -> in each repetition dataset will be splitted into five folds +# method: repeated_spcv_coords -> spatial partioning +# method: repeated_cv -> non-spatial partitioning +rsmp_sp = rsmp("repeated_spcv_coords", folds = 5, repeats = 100) +rsmp_nsp = rsmp("repeated_cv", folds = 5, repeats = 100) + +# create your design +design_grid = benchmark_grid( + tasks = task, + learners = list(lrn_glm, at_ksvm), + resamplings = list(rsmp_sp, rsmp_nsp)) +print(design_grid) +# run the cross-validations (spatial and conventional) +# execute the outer loop sequentially and parallelize the inner loop +future::plan(list("sequential", "multisession"), + # use half of all available cores + # workers = floor(future::availableCores()) / 2 + workers = 8) + +# why do we need the seed? +# the seed is needed when to make sure +# that always the same spatial partitions are used when re-running the code +set.seed(012348) +# reduce verbosity +lgr::get_logger("mlr3")$set_threshold("warn") +lgr::get_logger("bbotk")$set_threshold("info") +tictoc::tic() +progressr::with_progress(expr = { + # New argument `encapsulate` for `resample()` and `benchmark()` to + # conveniently enable encapsulation and also set the fallback learner to the + # respective featureless learner. This is simply for convenience, configuring + # each learner individually is still possible and allows a more fine-grained + # control + bmr = benchmark(design_grid, + encapsulate = "evaluate", + store_backends = FALSE, + store_models = FALSE) +}) +tictoc::toc() +# stop the parallelization plan +future:::ClusterRegistry("stop") + +# plot your result +# library(mlr3viz) +# p1 = autoplot(bmr, measure = msr("classif.auc")) +# p1$labels$y = "AUROC" +# p1$layers[[1]]$aes_params$fill = c("lightblue2", "mistyrose2") +# p1 + +# scale_x_discrete(labels=c("spatial CV", "conventional CV")) + +# ggplot2::theme_bw() + +# instead of using autoplot, it might be easier to create the figure yourself +agg = bmr$aggregate(measures = msr("classif.auc")) +# # extract the AUROC values and put them into one data.table +# d = purrr::map_dfr(agg$resample_result, ~ .$score(msr("classif.auc"))) +# # rename the levels of resampling_id +# d[, resampling_id := as.factor(resampling_id) |> +# forcats::fct_recode("conventional CV" = "repeated_cv", +# "spatial CV" = "repeated_spcv_coords") |> +# forcats::fct_rev()] +# create the boxplot which shows the overfitting in the nsp-case +# ggplot2::ggplot(data = d, mapping = aes(x = resampling_id, y = classif.auc)) + +# geom_boxplot(fill = c("lightblue2", "mistyrose2")) + +# theme_bw() + +# labs(y = "AUROC", x = "") + +# save your result +saveRDS(agg, file = "/home/jannes.muenchow/rpackages/misc/geocompr/extdata/12-sp_conv_cv_test.rds") +# read in if necessary +# agg = readRDS("extdata/12-sp_conv_cv.rds") + +#********************************************************** +# 3 spatial prediction---- +#********************************************************** + +#********************************************************** +# 3.1 make the prediction using the glm==== +#********************************************************** + +# lrn_glm$train(task) +# fit = lrn_glm$model +# # according to lrn_glm$help() the default for predictions was adjusted to FALSE, +# # since we would like to have TRUE predictions, we have to change back +# # fit$coefficients = fit$coefficients * -1 +# +# pred = terra::predict(object = ta, model = fit, fun = predict, +# type = "response") +# +# # make the prediction "manually" +# ta_2 = ta +# newdata = as.data.frame(as.matrix(ta_2)) +# colSums(is.na(newdata)) # ok, there are NAs +# ind = rowSums(is.na(newdata)) == 0 +# tmp = lrn_glm$predict_newdata(newdata = newdata[ind, ], task = task) +# newdata[ind, "pred"] = as.data.table(tmp)[["prob.TRUE"]] +# # check +# all.equal(as.numeric(values(pred)), newdata$pred) +# pred_2 = ta$slope +# values(pred_2) = newdata$pred +# all.equal(pred, pred_2) +# plot(c(pred, pred_2)) +# +# #********************************************************** +# # 3.2 plot the prediction==== +# #********************************************************** +# +# hs = terra::shade(ta$slope * pi / 180, +# terra::terrain(ta$elev, v = "aspect", unit = "radians"), +# 40, 270) +# plot(hs, col = gray(seq(0, 1, length.out = 100)), legend = FALSE) +# plot(pred, col = RColorBrewer::brewer.pal(name = "Reds", 9), add = TRUE, +# +# # or using tmap +# # white raster to only plot the axis ticks, otherwise gridlines would be visible +# study_mask = terra::vect(study_mask) +# lsl_sf = st_as_sf(lsl, coords = c("x", "y"), crs = 32717) +# rect = tmaptools::bb_poly(raster::raster(hs)) +# bbx = tmaptools::bb(raster::raster(hs), xlim = c(-0.02, 1), ylim = c(-0.02, 1), +# relative = TRUE) +# tm_shape(hs, bbox = bbx) + +# tm_grid(col = "black", n.x = 1, n.y = 1, labels.inside.frame = FALSE, +# labels.rot = c(0, 90)) + +# tm_raster(palette = "white", legend.show = FALSE) + +# # hillshade +# tm_shape(terra::mask(hs, study_mask), bbox = bbx) + +# tm_raster(palette = gray(0:100 / 100), n = 100, +# legend.show = FALSE) + +# # prediction raster +# tm_shape(terra::mask(pred, study_mask)) + +# tm_raster(alpha = 0.5, palette = "Reds", n = 6, legend.show = TRUE, +# title = "Susceptibility") + +# # rectangle and outer margins +# qtm(rect, fill = NULL) + +# tm_layout(outer.margins = c(0.04, 0.04, 0.02, 0.02), frame = FALSE, +# legend.position = c("left", "bottom"), +# legend.title.size = 0.9) + +# # only GLM +design_grid = benchmark_grid( + tasks = task, + learners = lrn_glm, + resamplings = list(rsmp_sp, rsmp_nsp)) +bmr = benchmark(design = design_grid, store_backends = FALSE) +# bmr_dt = as.data.table(bmr) +# saveRDS(bmr, "extdata/12-glm_sp_nsp_bmr.rds") +# agg = bmr$aggregate(measures = mlr3::msr("classif.auc")) +# saveRDS(agg, "extdata/12-glm_sp_nsp.rds") +score = bmr$score(measures = msr("classif.auc")) +saveRDS(score[, .(task_id, resampling_id, learner_id, classif.auc)], + "extdata/12-glm_score_sp_nsp.rds") diff --git a/code/12-spatial-cv-jm.R b/code/12-spatial-cv-jm.R deleted file mode 100644 index f0d2b7f9c..000000000 --- a/code/12-spatial-cv-jm.R +++ /dev/null @@ -1,234 +0,0 @@ -# Filename: spatial_cv.R (2018-02-06) -# -# TO DO: Introducing spatial cross-validation with the help of mlr -# -# Author(s): Jannes Muenchow -# -#********************************************************** -# CONTENTS------------------------------------------------- -#********************************************************** -# -# 1. ATTACH PACKAGES AND DATA -# 2. DATA PREPROCESSING -# 3. TERRAIN ATTRIBUTES -# 4. MODELING -# 5. SPATIAL PREDICTION -# -#********************************************************** -# 1 ATTACH PACKAGES AND DATA------------------------------- -#********************************************************** - -# attach packages -library(RSAGA) -library(RQGIS) -library(sf) -library(mlr) -library(raster) -library(dplyr) - -# attach data -data("landslides", package = "RSAGA") -# in sperrorest and mlr there is also landslide data available, however, this -# corresponds to the year 2000 and uses another DEM -# data("ecuador", package = "sperrorest") -# or using the mlr package -# slides = getTaskData(spatial.task) -# coords = spatial.task$coordinates - -#********************************************************** -# 2 DATA PREPROCESSING------------------------------------- -#********************************************************** - -# landslide points -non_pts = dplyr::filter(landslides, lslpts == FALSE) -# select landslide points -lsl_pts = dplyr::filter(landslides, lslpts == TRUE) -# randomly select 175 non-landslide points -set.seed(11042018) -non_pts_sub = sample_n(non_pts, size = nrow(lsl_pts)) -# create smaller landslide dataset (lsl) -lsl = bind_rows(non_pts_sub, lsl_pts) -# digital elevation model -dem = - raster(dem$data, - crs = "+proj=utm +zone=17 +south +datum=WGS84 +units=m +no_defs", - xmn = dem$header$xllcorner, - xmx = dem$header$xllcorner + dem$header$ncols * dem$header$cellsize, - ymn = dem$header$yllcorner, - ymx = dem$header$yllcorner + dem$header$nrows * dem$header$cellsize) -# plot(dem) -# plot(dplyr::filter(lsl, lslpts == TRUE), add = TRUE, pch = 16) - -# # compare with SAGA version -# dem$header -# env = rsaga.env(path = "C:/OSGeo4W64/apps/saga") -# # Write the DEM to a SAGA grid: -# write.sgrd(data = dem, file = "dem", header = dem$header, env = env) -# tmp_2 = raster("dem.sdat") -# # header is the same, but some numeric/decimal place inconsistencies -# all.equal(tmp, tmp_2) -# tmp - tmp_2 - -#********************************************************** -# 3 COMPUTING TERRAIN ATTRIBUTES--------------------------- -#********************************************************** - -# slope, aspect, curvatures -set_env(dev = FALSE) -find_algorithms("curvature") -alg = "saga:slopeaspectcurvature" -get_usage(alg) -# terrain attributes (ta) -out = run_qgis(alg, ELEVATION = dem, METHOD = 6, UNIT_SLOPE = "degree", - UNIT_ASPECT = "degree", - ASPECT = file.path(tempdir(), "aspect.tif"), - SLOPE = file.path(tempdir(), "slope.tif"), - C_PLAN = file.path(tempdir(), "cplan.tif"), - C_PROF = file.path(tempdir(), "cprof.tif"), - load_output = TRUE) -# hillshade (needs radians) -# hs = hillShade(out$SLOPE * pi / 180, out$ASPECT * pi / 180, 40, 270) -# plot(hs, col = gray(seq(0, 1, length.out = 100)), legend = FALSE) -# plot(dem, add = TRUE, alpha = 0.6) -# plot(dplyr::filter(lsl, lslpts == TRUE), add = TRUE, pch = 16) - - -# use brick because then the layers will be in memory and not on disk -ta = brick(out[names(out) != "ASPECT"]) -names(ta) = c("slope", "cplan", "cprof") -# catchment area -find_algorithms("[Cc]atchment") -alg = "saga:flowaccumulationtopdown" -get_usage(alg) -carea = run_qgis(alg, ELEVATION = dem, METHOD = 4, - FLOW = file.path(tempdir(), "carea.tif"), - load_output = TRUE) -# transform carea -log10_carea = log10(carea) -names(log10_carea) = "log10_carea" -names(dem) = "elev" -# add log_carea -ta = addLayer(x = ta, dem, log10_carea) -# extract values to points, i.e., create predictors -lsl[, names(ta)] = raster::extract(ta, lsl[, c("x", "y")]) - -# corresponding data is available in spDataLarge -# data("lsl", package = "spDataLarge") -# data("ta", package = "spDataLarge") - -#********************************************************** -# 4 MODELING----------------------------------------------- -#********************************************************** - -# attach data -# data("lsl", package = "spDataLarge") -# data("ta", package = "spDataLarge") - -coords = lsl[, c("x", "y")] -data = dplyr::select(lsl, -x, -y) - -# 3.1 create a mlr task==================================== -#********************************************************** -# id = name of the task -# target = response variable -# spatial: setting up the scene for spatial cv (hence x and y will not be used -# as predictors but as coordinates for kmeans clustering) -# all variables in data will be used in the model - -task = makeClassifTask(id = "lsl_glm_spatial", data = data, - target = "lslpts", positive = "TRUE", - coordinates = coords) - - -task_nsp = makeClassifTask(id = "lsl_glm_nonspatial", - data = data, - target = "lslpts", positive = "TRUE") - -# 3.2 Construct a learner and run the model================ -#********************************************************** -# cl: class of learner, here binomial model GLM from the stats package -# link: logit link -# predict.type: probability -# fix.factors.prediction: ? - -lrn = makeLearner(cl = "classif.binomial", - link = "logit", - predict.type = "prob", - # I guess we don't need it, right? - fix.factors.prediction = FALSE) - -# training the model (percentage of test and training datasets?) -mod_sp = train(learner = lrn, task = task) -head(predict(mod_sp, task = task)) -# exactly the same as, and needed for model interpretation -fit = glm(lslpts ~ ., data = data, family = "binomial") -head(predict(fit, type = "response")) -mod_nsp = train(learner = lrn, task = task_nsp) -# unpacking the model -m_sp = getLearnerModel(mod_sp) -summary(m_sp) -m_nsp = getLearnerModel(mod_nsp) -summary(m_nsp) - -# coefficients are just the same -coefficients(m_sp) -coefficients(m_nsp) -identical(coefficients(m_sp), coefficients(m_nsp)) - -# 3.3 Spatial cross-validation============================= -#********************************************************** - -# specify the reampling method, i.e. spatial CV with 10 repetitions and 5 folds -# -> in each repetition dataset will be splitted into five folds (kmeans) -# method: RepCV -> non-spatial partitioning -# method: SpRepCV -> spatial partioning -resampling = makeResampleDesc(method = "SpRepCV", folds = 5, reps = 100) -resampling_nsp = makeResampleDesc(method = "RepCV", folds = 5, reps = 100) -# apply the reampling by calling the resample function needed for using the same -# (randomly) selected folds/partitioning in different models spatial vs. -# non-spatial or glm vs. GAM -set.seed(012348) -# why do we need the seed? -# the seed is needed when to make sure -# that always the same spatial partitions are used when reruning the code -sp_cv = mlr::resample(learner = lrn, task = task, - resampling = resampling, - measures = mlr::auc) -conv_cv = mlr::resample(learner = lrn, task = task_nsp, - resampling = resampling_nsp, - measures = mlr::auc) -# Visualization of overfitting in the nsp-case -boxplot(sp_cv$measures.test$auc, - conv_cv$measures.test$auc) - -# save your result -# saveRDS(sp_cv, file = "extdata/sp_cv.rds") -# saveRDS(conv_cv, file = "extdata/conv_cv.rds") - -#********************************************************** -# 4 SPATIAL PREDICTION------------------------------------- -#********************************************************** - -pred = raster::predict(object = ta, model = m_sp, fun = predict, - type = "response") -hs = hillShade(out$SLOPE * pi / 180, out$ASPECT * pi / 180, 40, 270) -plot(hs, col = gray(seq(0, 1, length.out = 100)), legend = FALSE) -# plot(pred, col = RColorBrewer::brewer.pal("YlOrRd", n = 9), add = TRUE, -# alpha = 0.6) -plot(pred, col = RColorBrewer::brewer.pal(name = "Reds", 9), add = TRUE, - alpha = 0.6) -vignette(package = "RSAGA") - -# make the prediciton "manually" -ta_2 = stack(ta) -newdata = as.data.frame(as.matrix(ta_2)) -colSums(is.na(newdata)) # ok, there are NAs -ind = rowSums(is.na(newdata)) == 0 -tmp = mlr::predictLearner(.learner = lrn, mod_sp, newdata[ind, ]) -newdata[ind, "pred"] = tmp[, 2] -# check -identical(values(pred), newdata$pred) -pred_2 = ta$slope -values(pred_2) = newdata$pred -all.equal(pred, pred_2) -plot(stack(pred, pred_2)) diff --git a/code/12-svm.R b/code/12-svm.R deleted file mode 100644 index 1e807c21d..000000000 --- a/code/12-svm.R +++ /dev/null @@ -1,132 +0,0 @@ -# Filename: 13-svm.R (2018-03-22) -# -# TO DO: Spatial cross-validation of SVM -# -# Author(s): Jannes Muenchow, Patrick Schratz -# -#********************************************************** -# CONTENTS------------------------------------------------- -#********************************************************** -# -# 1. ATTACH PACKAGES AND DATA -# 2. MLR BUILDING BLOCKS -# -#********************************************************** -# 1 ATTACH PACKAGES AND DATA------------------------------- -#********************************************************** - -# attach packages -library(raster) -library(mlr) -library(sf) -library(tmap) -library(parallelMap) - -# attach data -data("lsl", package = "spDataLarge") - -#********************************************************** -# 2 MLR BUILDING BLOCKS------------------------------------ -#********************************************************** - -# put coordinates in an additional data frame -coords = lsl[, c("x", "y")] -data = dplyr::select(lsl, -x, -y) - -# 2.1 Create MLR task====================================== -#********************************************************** -task = makeClassifTask(data = data, - target = "lslpts", - positive = "TRUE", - coordinates = coords) -# find out about possible learners for our task -listLearners(task) - -# 2.2 Specify learner====================================== -#********************************************************** -# construct SVM learner (using ksvm function from the kernlab package) -lrn_ksvm = makeLearner("classif.ksvm", - predict.type = "prob", - kernel = "rbfdot") -getLearnerPackages(lrn_ksvm) -helpLearner(lrn_ksvm) -# rbfdot Radial Basis kernel "Gaussian" -# the list of hyper-parameters (kernel parameters). This is a list which -# contains the parameters to be used with the kernel function. For valid -# parameters for existing kernels are : -# sigma inverse kernel width for the Radial Basis kernel function "rbfdot" and -# the Laplacian kernel "laplacedot". -# C cost of constraints violation (default: 1) this is the ‘C’-constant of the -# regularization term in the Lagrange formulation. - -# 2.3 Specify resampling strategy========================== -#********************************************************** -# performance level (outer resampling loop) -perf_level = makeResampleDesc("SpRepCV", folds = 5, reps = 100) - -# tuning level (inner reseampling loop) -tune_level = makeResampleDesc("SpCV", iters = 5) -# hyperparameter tuning in the inner resampling loop -# use 50 randomly selected hyperparameter combinations... -ctrl = makeTuneControlRandom(maxit = 50) -# ... and restrict them to the following tuning space -ps = makeParamSet( - makeNumericParam("C", lower = -12, upper = 15, trafo = function(x) 2^x), - makeNumericParam("sigma", lower = -15, upper = 6, trafo = function(x) 2^x) - ) - -wrapper_ksvm = makeTuneWrapper(learner = lrn_ksvm, - resampling = tune_level, - par.set = ps, - control = ctrl, - show.info = FALSE, - measures = mlr::auc) - -# 2.4 Run the resampling (spatial cv)====================== -#********************************************************** -# before starting the parallelization, make sure that failed models will be -# accessible after the processing (and does not lead to the interruption of the -# cv) -configureMlr(on.learner.error = "warn", on.error.dump = TRUE) -# initialize the parallelization -if (Sys.info()["sysname"] %in% c("Linux, Darwin")) { - parallelStart(mode = "multicore", - # parallelize the hyperparameter tuning level - level = "mlr.tuneParams", - # just use half of the available cores - cpus = round(parallel::detectCores() / 2), - mc.set.seed = TRUE) -} - -if (Sys.info()["sysname"] == "Windows") { - parallelStartSocket(level = "mlr.tuneParams", - cpus = round(parallel::detectCores() / 2)) -} - -# run the resampling -set.seed(12345) -resa_svm_spatial = mlr::resample(learner = wrapper_ksvm, - task = task, - resampling = perf_level, - extract = getTuneResult, - show.info = TRUE, - measures = mlr::auc) - -# Aggregated Result: auc.test.mean=0.7583375 -parallelStop() -# save your result -saveRDS(resa_svm_spatial, "extdata/spatial_cv_result.rds") - -# Exploring the results -# run time in minutes -resa_svm_spatial$runtime / 60 -# final aggregated AUROC -resa_svm_spatial$aggr -# same as -mean(resa_svm_spatial$measures.test$auc) -# used hyperparameters for the outer fold, i.e. the best combination out of 50 * -# 5 models -resa_svm_spatial$extract[[1]] -# and here one can observe that the AUC of the tuning data is usually higher -# than for the model on the outer fold -resa_svm_spatial$measures.test[1, ] diff --git a/code/15-eco.R b/code/15-eco.R deleted file mode 100644 index 6f38f1e1a..000000000 --- a/code/15-eco.R +++ /dev/null @@ -1,129 +0,0 @@ -# Filename: casma.R (2017-09-26) -# -# TO DO: Model ordination scores of Mt. Mongón -# -# Author(s): Jannes Muenchow -# -#********************************************************** -# CONTENTS------------------------------------------------- -#********************************************************** -# -# 1. ATTACH PACKAGES AND DATA -# 2. ORDINATIONS -# 3. MODELING -# 4. SPATIAL PREDICTION -# -#********************************************************** -# 1 ATTACH PACKAGES AND DATA------------------------------- -#********************************************************** - -# attach packages -library(vegan) -library(dplyr) -library(RQGIS) -library(raster) -library(mgcv) -library(sf) - -# attach data -data("random_points", "study_area", "comm", "dem", "ndvi") -# add altitude to random_points -random_points$dem = raster::extract(dem, random_points) - -#********************************************************** -# 2 ORDINATIONS-------------------------------------------- -#********************************************************** - -# presence absence matrix -pa = decostand(comm, "pa") - -# DCA -dca = decorana(pa) -# proportion of variance -dca$evals / sum(dca$evals) -# cumulative proportion -cumsum(dca$evals / sum(dca$evals)) # 44% and 72% - -# NMDS -# nmds = metaMDS(comm, k = 3, try = 500) -nmds = metaMDS(pa, k = 3, try = 500) -# best approach: -# pa, k = 4 -# stress: 8.8 -# explained variance first axis: 61%, first 2 axes: 0.77 -# and plot(scores_1 ~ elev) looks reasonable - -nmds -# Standard deviations of the axes -apply(scores(nmds), 2, sd) - -# goodness of fit expressed as explained variance -cor(vegdist(pa), dist(scores(nmds)[, 1:2]))^2 # 0.83 first two axes -cor(vegdist(pa), dist(scores(nmds)[, 1]))^2 # 0.62 only the first axis - -elev = dplyr::filter(random_points, id %in% rownames(pa)) %>% - dplyr::pull(dem) -plot(scores(nmds), type = "n", xlim = c(-1, 1)) -# text(scores(nmds), labels = rownames(pa)) -text(scores(nmds), labels = elev) - -# Rotates NMDS result so that the first dimension is parallel to an -# environmental variable (works quite good) - -rotnmds = MDSrotate(nmds, elev) # proxy for elevation -cor(vegdist(pa), dist(scores(rotnmds)[, 1]))^2 # 0.59 only the first axis -cor(vegdist(pa), dist(scores(rotnmds)[, 1:2]), method = "spearman") -plot(rotnmds, type = "n", xlim = c(-1, 1)) -text(rotnmds, labels = elev, cex = 0.8) - -# plot(x = elev, y = scores(rotnmds)[, 1]) -# plot(elev, y = scores(rotnmds)[, 2]) -plot(x = elev, y = scores(nmds)[, 1]) -# plot(elev, scores(nmds)[, 2]) -resp = scores(nmds)[, 1] -resp = scores(rotnmds)[, 1] -fit = gam(resp ~ s(elev)) -summary(fit) # deviance explained unrotated: 80%; rotated: 77.9% -plot(resp ~ elev) -# avoid spaghetti plot -I1 = order(elev) -lines(x = elev[I1], y = predict(fit)[I1], col = "red") -# text(x = elev, y = resp, labels = rownames(d)) - -# ISOMAP -source(file.path("D:/uni/science/projects/ecology/asia/Mongolia/", - "TINN-R/functions/bestisomap.r")) -# Isomap -pa = decostand(d, "pa") -bestiso = bestisomap(vegdist(d, "altGower")) -# geo distances -geod = isomapdist(vegdist(d, "bray"), k = 69, ndim = 3) -# Secondly, subject the geodesic distances to a multidimensional scaling, i.e. perform an isomap ordination -iso = cmdscale(geod, k = 3, eig = TRUE) -# cmdscale does not automatically provide species scores, a function of the BiodiversityR package does the trick -library(BiodiversityR) -# species scores -isospec = - as.data.frame(add.spec.scores(iso, d, - method = "pcoa.scores", multi = 0.15, - Rscale = TRUE)$cproj) - - - -# percentage data: first axis 70%, first and second axis 83% -# my bad; this was the result when the first col was the id... -plot(elev, bestiso$Scores[, 1]) -# pa: explained variance first two axes = 70.6 % (k = 32) -resp = bestiso$Scores[, 3] -fit = gam(resp ~ s(elev)) -summary(fit) # deviance explained 91% -plot(resp ~ elev) -# avoid spaghetti plot -I1 = order(elev) -lines(x = elev[I1], y = predict(fit)[I1], col = "red") -gam.check(fit) -# model validation only necessary if we want to do statistical inference -resids = residuals(fit, type = "pearson") -plot(resids ~ elev) # maybe slight indication of heterogeneity, but ok -hist(resids) # ok -plot(resids ~ fitted(fit)) # ok diff --git a/code/15-rf.R b/code/15-rf.R deleted file mode 100644 index 1b01d9174..000000000 --- a/code/15-rf.R +++ /dev/null @@ -1,90 +0,0 @@ -library(sf) -library(raster) -library(mlr) -library(tidyverse) -library(parallelMap) -library(ranger) # needed for random forest regression - -# construct response-prediction matrix -d = data.frame(id = as.numeric(rownames(scores(rotnmds))), - sc = scores(rotnmds)[, 1]) -d = inner_join(random_points, d, by = "id") - -coords = sf::st_coordinates(d) %>% - as.data.frame %>% - rename(x = X, y = Y) -d = dplyr::select(d, sc, dem) %>% - st_drop_geometry() -# create task -task = makeRegrTask(data = d, target = "sc", coordinates = coords) - -# create learner -lrns = listLearners(task, warn.missing.packages = FALSE) %>% - dplyr::select(class, name, short.name, package) - -lrn = makeLearner(cl = "regr.ranger", - predict.type = "response", - fix.factors.prediction = TRUE) -# performance level (outer loop) -perf_level = makeResampleDesc(method = "SpRepCV", folds = 5, reps = 100) - -# tuning level (inner loop) -# five spatially disjoint partitions -tune_level = makeResampleDesc("SpCV", iters = 5) -# use 50 randomly selected hyperparameters -ctrl = makeTuneControlRandom(maxit = 50) -# define the outer limits of the randomly selected hyperparameters -ps = makeParamSet( - # as long as we are only using one predictor, we can only use mtry = 1 - makeIntegerParam("mtry", lower = 1, upper = 1), - makeIntegerParam("num.trees", lower = 10, upper = 10000)) - -wrapped_lrn_rf = makeTuneWrapper(learner = lrn, - resampling = tune_level, - par.set = ps, - control = ctrl, - show.info = TRUE, - measures = mlr::rmse) -configureMlr(on.learner.error = "warn", on.error.dump = TRUE) - -library(parallelMap) -if (Sys.info()["sysname"] %in% c("Linux, Darwin")) { - parallelStart(mode = "multicore", - # parallelize the hyperparameter tuning level - level = "mlr.tuneParams", - # just use half of the available cores - cpus = round(parallel::detectCores() / 2), - mc.set.seed = TRUE) -} - -if (Sys.info()["sysname"] == "Windows") { - parallelStartSocket(level = "mlr.tuneParams", - cpus = round(parallel::detectCores() / 2)) -} -result = mlr::resample(learner = wrapped_lrn_rf, - task = task, - resampling = perf_level, - extract = getTuneResult, - measures = mlr::rmse) - -#********************************************************** -# 2 SPATIAL PREDICTION------------------------------------- -#********************************************************** - -lrn_rf = makeLearner(cl = "regr.ranger", - predict.type = "response", - mtry = 1, num.trees = 3500) -# train model -model_rf = train(lrn_rf, task) -# prediction -new_d = data.frame(dem = values(dem)) -pred_rf <- predict(model_rf, newdata = new_d) -pred_r = dem -pred_r[] = pred_rf$data$response -plot(pred_r) - -# just for comparison -gam_1 = gam(sc ~ s(dem), data = d) -names(dem) = "dem" -pred = raster::predict(object = dem, model = gam_1) -plot(pred) diff --git a/code/15-rf_mlr3.R b/code/15-rf_mlr3.R new file mode 100644 index 000000000..ff931a2ee --- /dev/null +++ b/code/15-rf_mlr3.R @@ -0,0 +1,134 @@ +# Filename: 15-rf_mlr3.R (2022-04-14) + +# TO DO: use spatially cross-validated tuned hyperparameters to make a spatial prediction of the floristic composition of the Mount Mongón + +# Author(s): jannes.muenchow + +#********************************************************** +# CONTENTS---- +#********************************************************** + +# 1 attach packages and data +# 2 preprocessing +# 3 modeling + +#********************************************************** +# 1 attach packages and data---- +#********************************************************** + +# attach packages +library(dplyr) +library(mlr3) +library(mlr3extralearners) +library(mlr3learners) +library(mlr3tuning) +library(mlr3spatiotempcv) +library(qgisprocess) +library(raster) +library(terra) +library(sf) +library(vegan) + +# attach data +data("study_area", "random_points", "comm", package = "spDataLarge") +dem = terra::rast(system.file("raster/dem.tif", package = "spDataLarge")) +ndvi = terra::rast(system.file("raster/ndvi.tif", package = "spDataLarge")) + +#********************************************************** +# 2 preprocessing---- +#********************************************************** + +alg = "saga:sagawetnessindex" +args = qgis_arguments(alg) +qgis_show_help(alg) + +ep = qgis_run_algorithm(alg = "saga:sagawetnessindex", + DEM = dem, + SLOPE_TYPE = 1, + SLOPE = tempfile(fileext = ".sdat"), + AREA = tempfile(fileext = ".sdat"), + .quiet = TRUE) +# read in catchment area and catchment slope +ep = ep[c("AREA", "SLOPE")] |> + unlist() |> + terra::rast() +names(ep) = c("carea", "cslope") +# make sure all rasters share the same origin +origin(ep) = origin(dem) +ep = c(dem, ndvi, ep) +ep$carea = log10(ep$carea) +random_points[, names(ep)] = + # first column is an ID column we don't need + terra::extract(ep, terra::vect(random_points))[, -1] + +# presence-absence matrix +pa = decostand(comm, "pa") # 100 rows (sites), 69 columns (species) +# keep only sites in which at least one species was found +pa = pa[rowSums(pa) != 0, ] # 84 rows, 69 columns +nmds = readRDS("extdata/15-nmds.rds") + +elev = dplyr::filter(random_points, id %in% rownames(pa)) %>% + dplyr::pull(dem) +# rotating NMDS in accordance with altitude (proxy for humidity) +rotnmds = MDSrotate(nmds, elev) +# extracting the first axes +sc = scores(rotnmds, choices = 1:2) +# construct response-predictor matrix +# id- and response variable +rp = data.frame(id = as.numeric(rownames(sc)), sc = sc[, 1]) +# join the predictors (dem, ndvi and terrain attributes) +rp = inner_join(random_points, rp, by = "id") + +#********************************************************** +# 3 modeling---- +#********************************************************** + +# create task +task = TaskRegrST$new(id = "mongon", backend = dplyr::select(rp, -id, -spri), + target = "sc") +rp = dplyr::select(rp, -id, -spri) +rp[, c("x", "y")] = st_coordinates(rp) +rp = st_drop_geometry(rp) +task = TaskRegrST$new(id = "mongon", backend = rp, target = "sc", + extra_args = list(coordinate_names = c("x", "y"))) +lrn_rf = lrn("regr.ranger", predict_type = "response") + +search_space = ps( + mtry = p_int(lower = 1, upper = ncol(task$data()) - 1), + sample.fraction = p_dbl(lower = 0.2, upper = 0.9), + min.node.size = p_int(lower = 1, upper = 10) +) +at = AutoTuner$new( + learner = lrn_rf, + # spatial partitioning + resampling = rsmp("spcv_coords", folds = 5), + # performance measure + measure = msr("regr.rmse"), + search_space = search_space, + # random search with 50 iterations + terminator = trm("evals", n_evals = 50), + tuner = tnr("random_search") +) +at$train(task) +at$tuning_result +at$predict(task) + +# predict to new data +pred_terra = terra::predict(ep, model = at) +names(pred_terra) = "pred" + +# doing it "manually" +newdata = as.data.frame(as.matrix(ep)) +colSums(is.na(newdata)) # 0 NAs +# but assuming there were results in a more generic approach +ind = rowSums(is.na(newdata)) == 0 +tmp = at$predict_newdata(newdata = newdata[ind, ], task = task) +newdata[ind, "pred"] = as.data.table(tmp)[["response"]] +pred_2 = pred_terra +pred_2[] = newdata$pred +# same as +# values(pred_2) = newdata$pred +# all.equal(pred_terra, pred_2) # does not work, don't know why +identical(values(pred_terra), values(pred_2)) # TRUE +plot(pred_terra - pred_2) # just 0s, perfect +plot(c(pred, pred_2)) \ No newline at end of file diff --git a/extdata/.gitignore b/extdata/.gitignore index 08692eecf..3172644ce 100644 --- a/extdata/.gitignore +++ b/extdata/.gitignore @@ -1 +1,2 @@ l_all.rds +12-sp_conv_cv_test.rds diff --git a/extdata/12-sp_conv_cv.rds b/extdata/12-sp_conv_cv.rds new file mode 100644 index 000000000..be6e1234d Binary files /dev/null and b/extdata/12-sp_conv_cv.rds differ diff --git a/extdata/spatial_cv_result.rds b/extdata/12-spatial_cv_result.rds similarity index 100% rename from extdata/spatial_cv_result.rds rename to extdata/12-spatial_cv_result.rds diff --git a/extdata/14-tune.rds b/extdata/14-tune.rds deleted file mode 100644 index 263874054..000000000 Binary files a/extdata/14-tune.rds and /dev/null differ diff --git a/extdata/14-nmds.rds b/extdata/15-nmds.rds similarity index 100% rename from extdata/14-nmds.rds rename to extdata/15-nmds.rds diff --git a/extdata/15-tune.rds b/extdata/15-tune.rds new file mode 100644 index 000000000..6414eac0c Binary files /dev/null and b/extdata/15-tune.rds differ diff --git a/extdata/conv_cv.rds b/extdata/conv_cv.rds deleted file mode 100644 index 74dbf6d8d..000000000 Binary files a/extdata/conv_cv.rds and /dev/null differ diff --git a/extdata/sp_cv.rds b/extdata/sp_cv.rds deleted file mode 100644 index 7bfabffce..000000000 Binary files a/extdata/sp_cv.rds and /dev/null differ diff --git a/figures/15_rf_pred.png b/figures/15_rf_pred.png new file mode 100644 index 000000000..78de8c445 Binary files /dev/null and b/figures/15_rf_pred.png differ diff --git a/figures/sa-mongon-1.png b/figures/15_sa_mongon_sampling.png similarity index 100% rename from figures/sa-mongon-1.png rename to figures/15_sa_mongon_sampling.png diff --git a/figures/study-area-mongon.png b/figures/15_study_area_mongon.png similarity index 100% rename from figures/study-area-mongon.png rename to figures/15_study_area_mongon.png diff --git a/figures/xy-nmds-1.png b/figures/15_xy_nmds.png similarity index 100% rename from figures/xy-nmds-1.png rename to figures/15_xy_nmds.png diff --git a/figures/lsl-map-1.png b/figures/lsl-map-1.png index 2eb3c18c4..855a4629a 100644 Binary files a/figures/lsl-map-1.png and b/figures/lsl-map-1.png differ diff --git a/figures/lsl-susc-1.png b/figures/lsl-susc-1.png index 580c1efce..e92b1c07c 100644 Binary files a/figures/lsl-susc-1.png and b/figures/lsl-susc-1.png differ diff --git a/figures/rf-pred-1.png b/figures/rf-pred-1.png deleted file mode 100644 index fecd2906e..000000000 Binary files a/figures/rf-pred-1.png and /dev/null differ diff --git a/geocompr.bib b/geocompr.bib index 4fde4f472..2059b75bf 100644 --- a/geocompr.bib +++ b/geocompr.bib @@ -2155,4 +2155,28 @@ @book{zuur_mixed_2009 keywords = {\#nosource} } +@article{schratz_hyperparameter_2019, + title = {Hyperparameter Tuning and Performance Assessment of Statistical and Machine-Learning Algorithms Using Spatial Data}, + author = {Schratz, Patrick and Muenchow, Jannes and Iturritxa, Eugenia and Richter, Jakob and Brenning, Alexander}, + year = {2019}, + month = aug, + volume = {406}, + pages = {109--120}, + issn = {0304-3800}, + doi = {10.1016/j.ecolmodel.2019.06.002}, + journal = {Ecological Modelling}, + keywords = {Hyperparameter tuning,Machine-learning,Spatial autocorrelation,Spatial cross-validation,Spatial modeling}, + language = {en} +} + +@article{schratz_mlr3spatiotempcv_2021, + title = {Mlr3spatiotempcv: {{Spatiotemporal}} Resampling Methods for Machine Learning in {{R}}}, + shorttitle = {Mlr3spatiotempcv}, + author = {Schratz, Patrick and Becker, Marc and Lang, Michel and Brenning, Alexander}, + year = {2021}, + journal = {arXiv preprint arXiv:2110.12674} +} + + +