diff --git a/.Rbuildignore b/.Rbuildignore index 86338b3f..251d0f33 100644 --- a/.Rbuildignore +++ b/.Rbuildignore @@ -31,3 +31,4 @@ check_package.R ^docs$ ^pkgdown$ ^README\.Rmd$ +lib/ diff --git a/DESCRIPTION b/DESCRIPTION index f435d273..2f4476c8 100644 --- a/DESCRIPTION +++ b/DESCRIPTION @@ -2,12 +2,12 @@ Package: caretEnsemble Type: Package Title: Ensembles of Caret Models Version: 4.0.0 -Date: 2024-08-12 +Date: 2024-08-13 Authors@R: c(person(c("Zachary", "A."), "Deane-Mayer", role = c("aut", "cre", "cph"), email = "zach.mayer@gmail.com"), person(c("Jared", "E.", "Knowles"), role="ctb", email="jknowles@gmail.com"), - person("Antón", "López", role="ctb", email="zach.mayer+get-correct-email@gmail.com") + person("Antón", "López", role="ctb", email="anton.gomez.lopez@rai.usc.es") ) -URL: https://github.com/zachmayer/caretEnsemble +URL: http://zachmayer.github.io/caretEnsemble/, https://github.com/zachmayer/caretEnsemble BugReports: https://github.com/zachmayer/caretEnsemble/issues Description: Functions for creating ensembles of caret models: caretList() and caretStack(). 
caretList() is a convenience function for fitting multiple diff --git a/Makefile b/Makefile index 331c60d1..f793a882 100644 --- a/Makefile +++ b/Makefile @@ -99,6 +99,7 @@ check: .PHONY: check-win check-win: + rm -rf lib/ Rscript -e "devtools:::check_win()" .PHONY: fix-style @@ -150,7 +151,7 @@ dev-guide: clean: rm -rf *.Rcheck rm -f *.tar.gz - rm -f man/*.Rd + rm -rf man/ rm -f README.md rm -f coverage.rds rm -f cobertura.xml diff --git a/README.md b/README.md index 41b25b15..aa84322f 100644 --- a/README.md +++ b/README.md @@ -46,10 +46,10 @@ print(summary(models)) #> The following models were ensembled: rf, glmnet #> #> Model accuracy: -#> model_name metric value sd -#> -#> 1: rf RMSE 1127.974 83.50596 -#> 2: glmnet RMSE 1138.137 202.80472 +#> model_name metric value sd +#> +#> 1: rf RMSE 1168.108 422.9558 +#> 2: glmnet RMSE 1152.298 138.8310 ``` Then, use caretEnsemble to make a greedy ensemble of these models @@ -67,23 +67,20 @@ print(greedy_stack) #> Summary of sample sizes: 400, 400, 400, 400, 400 #> Resampling results: #> -#> RMSE Rsquared MAE -#> 1003.391 0.9357974 577.2895 +#> RMSE Rsquared MAE +#> 1096.53 0.933482 631.215 #> #> Tuning parameter 'max_iter' was held constant at a value of 100 #> #> Final model: #> Greedy MSE -#> RMSE: 1010.776 +#> RMSE: 1067.635 #> Weights: #> [,1] -#> rf 0.52 -#> glmnet 0.48 -ggplot2::autoplot(greedy_stack, training_data = dat, xvars = c("carat", "table")) +#> rf 0.43 +#> glmnet 0.57 ``` - - You can also use caretStack to make a non-linear ensemble ``` r @@ -101,7 +98,7 @@ print(rf_stack) #> Resampling results: #> #> RMSE Rsquared MAE -#> 894.1208 0.9518165 455.1419 +#> 1005.629 0.9423501 490.6683 #> #> Tuning parameter 'mtry' was held constant at a value of 2 #> @@ -113,12 +110,23 @@ print(rf_stack) #> Number of trees: 500 #> No. 
of variables tried at each split: 2 #> -#> Mean of squared residuals: 800137.3 -#> % Var explained: 94.96 +#> Mean of squared residuals: 1007817 +#> % Var explained: 94.29 +``` + +Use autoplot from ggplot2 to plot ensemble diagnostics: + +``` r +ggplot2::autoplot(greedy_stack, training_data = dat, xvars = c("carat", "table")) +``` + +6 panel plot of an ensemble of models fit to the diamonds dataset. The RF model is the best and has the highest weight. The residual plots look good. RMSE is about `r round(min(greedy_stack$ens_model$results$RMSE))`. + +``` r ggplot2::autoplot(rf_stack, training_data = dat, xvars = c("carat", "table")) ``` - +6 panel plot of an ensemble of models fit to the diamonds dataset. The RF model is the best and has the highest weight. The residual plots look good. RMSE is about `r round(min(rf_stack$ens_model$results$RMSE))`. # Installation diff --git a/README.rmd b/README.rmd index e370e1ea..da08947f 100644 --- a/README.rmd +++ b/README.rmd @@ -54,13 +54,20 @@ Then, use caretEnsemble to make a greedy ensemble of these models ```{r} greedy_stack <- caretEnsemble::caretEnsemble(models) print(greedy_stack) -ggplot2::autoplot(greedy_stack, training_data = dat, xvars = c("carat", "table")) ``` You can also use caretStack to make a non-linear ensemble ```{r} rf_stack <- caretEnsemble::caretStack(models, method = "rf") print(rf_stack) +``` + +Use autoplot from ggplot2 to plot ensemble diagnostics: +```{r greedy-stack-6-plot, fig.alt="6 panel plot of an ensemble of models fit to the diamonds dataset. The RF model is the best and has the highest weight. The residual plots look good. RMSE is about `r round(min(greedy_stack$ens_model$results$RMSE))`."} +ggplot2::autoplot(greedy_stack, training_data = dat, xvars = c("carat", "table")) +``` + +```{r, fig.alt="6 panel plot of an ensemble of models fit to the diamonds dataset. The RF model is the best and has the highest weight. The residual plots look good. 
RMSE is about `r round(min(rf_stack$ens_model$results$RMSE))`."} ggplot2::autoplot(rf_stack, training_data = dat, xvars = c("carat", "table")) ``` diff --git a/_pkgdown.yml b/_pkgdown.yml index d8775525..2d549257 100644 --- a/_pkgdown.yml +++ b/_pkgdown.yml @@ -1,4 +1,14 @@ url: http://zachmayer.github.io/caretEnsemble/ template: bootstrap: 5 + light-switch: true + bslib: + primary: "#0054AD" + border-radius: 0.5rem + btn-border-radius: 0.25rem + danger: "#A6081A" + opengraph: + image: + src: man/figures/README-greedy-stack-6-plot.png + alt: "6 panel plot of an ensemble of models fit to the diamonds dataset. The RF model is the best and has the highest weight. The residual plots look good." diff --git a/inst/WORDLIST b/inst/WORDLIST index bc3f6f44..e64f8da2 100644 --- a/inst/WORDLIST +++ b/inst/WORDLIST @@ -10,6 +10,7 @@ CMD CodeFactor codebase coercible +dat Deane defaultControl dev @@ -51,12 +52,14 @@ prob probs readme resid +rf roxygen rpart 's savePredictions scalability scikit +setosa SDs trainControl travis diff --git a/man/caretEnsemble.Rd b/man/caretEnsemble.Rd index 1ad62756..f497b724 100644 --- a/man/caretEnsemble.Rd +++ b/man/caretEnsemble.Rd @@ -56,6 +56,7 @@ summary(ens) \seealso{ Useful links: \itemize{ + \item \url{http://zachmayer.github.io/caretEnsemble/} \item \url{https://github.com/zachmayer/caretEnsemble} \item Report bugs at \url{https://github.com/zachmayer/caretEnsemble/issues} } @@ -67,7 +68,7 @@ Useful links: Other contributors: \itemize{ \item Jared E. 
Knowles \email{jknowles@gmail.com} [contributor] - \item Antón López \email{zach.mayer+get-correct-email@gmail.com} [contributor] + \item Antón López \email{anton.gomez.lopez@rai.usc.es} [contributor] } } diff --git a/man/figures/README-greedy-stack-6-plot-1.png b/man/figures/README-greedy-stack-6-plot-1.png new file mode 100644 index 00000000..60c024eb Binary files /dev/null and b/man/figures/README-greedy-stack-6-plot-1.png differ diff --git a/man/figures/README-unnamed-chunk-2-1.png b/man/figures/README-unnamed-chunk-2-1.png deleted file mode 100644 index f1e082db..00000000 Binary files a/man/figures/README-unnamed-chunk-2-1.png and /dev/null differ diff --git a/man/figures/README-unnamed-chunk-3-1.png b/man/figures/README-unnamed-chunk-3-1.png deleted file mode 100644 index c5d7f079..00000000 Binary files a/man/figures/README-unnamed-chunk-3-1.png and /dev/null differ diff --git a/man/figures/README-unnamed-chunk-4-1.png b/man/figures/README-unnamed-chunk-4-1.png deleted file mode 100644 index a18a2f8e..00000000 Binary files a/man/figures/README-unnamed-chunk-4-1.png and /dev/null differ diff --git a/man/figures/README-unnamed-chunk-5-1.png b/man/figures/README-unnamed-chunk-5-1.png new file mode 100644 index 00000000..a174daed Binary files /dev/null and b/man/figures/README-unnamed-chunk-5-1.png differ diff --git a/vignettes/Version-4.0-New-Features.Rmd b/vignettes/Version-4.0-New-Features.Rmd index 8a9c6f51..2c86f365 100644 --- a/vignettes/Version-4.0-New-Features.Rmd +++ b/vignettes/Version-4.0-New-Features.Rmd @@ -1,5 +1,5 @@ --- -title: "Version-4.0-New-Features" +title: "Version 4.0 New Features" author: "Zach Deane-Mayer" date: "`r Sys.Date()`" output: rmarkdown::html_vignette @@ -45,7 +45,13 @@ caretStack (and by extension, caretEnsemble) now supports various S3 methods: ```{r} print(ens) print(summary(ens)) +``` + +```{r, fig.alt="A dot and whisker plot of ROC for glmnet, rpart, and an ensemble. 
The ensemble has the highest ROC and is slightly better than the glmnet. The rpart model is bad."} plot(ens) +``` + +```{r, fig.alt="A 4-panel plot for glmnet, rpart, and an ensemble. The ensemble has the highest ROC and is slightly better than the glmnet. The rpart model is bad. The glmnet has the highest weight, and the residuals look biased."} ggplot2::autoplot(ens) ``` @@ -127,14 +133,16 @@ print(transfer_ens) We can also predict on new data: ```{r} preds <- predict(transfer_ens, newdata = head(new_data)) -print(preds) +knitr::kable(preds, format = "markdown") ``` # Permutation Importance Permutation importance is now the default method for variable importance in caretLists and caretStacks: ```{r} importance <- caret::varImp(transfer_ens) -print(importance) +print(round(importance, 2L)) ``` +Note that the ensemble uses rpart to classify the easy class (setosa) and then uses the rf to distinguish between the 2 more difficult classes. + This completes our demonstration of the key new features in caretEnsemble 4.0. These enhancements provide greater flexibility, improved performance, and easier usage for ensemble modeling in R. diff --git a/vignettes/caretEnsemble-intro.Rmd b/vignettes/caretEnsemble-intro.Rmd index 206ec8bc..86f1fd7b 100644 --- a/vignettes/caretEnsemble-intro.Rmd +++ b/vignettes/caretEnsemble-intro.Rmd @@ -49,7 +49,7 @@ print(summary(model_list)) We can use the `predict` function to extract predictions from this object for new data: ```{r} p <- predict(model_list, newdata = head(testing)) -print(p) +knitr::kable(p, format = "markdown") ``` If you desire more control over the model fit, use the `caretModelSpec` to construct a list of model specifications for the `tuneList` argument. This argument can be used to fit several different variants of the same model, and can also be used to pass arguments through `train` down to the component functions (e.g.
`trace=FALSE` for `nnet`): @@ -71,7 +71,7 @@ Finally, you should note that `caretList` does not support custom caret models. ## caretEnsemble `caretList` is the preferred way to construct list of caret models in this package, as it will ensure the resampling indexes are identical across all models. Lets take a closer look at our list of models: -```{r} +```{r, fig.alt="X/Y scatter plot of rpart vs glmnet AUCs on the Sonar dataset. The glmnet model is better for 4 out of 5 resamples."} lattice::xyplot(caret::resamples(model_list)) ``` @@ -100,7 +100,8 @@ The ensemble has an AUC on the training set resamples of `r round(auc[1, 'ensemb Note that the levels for the Sonar Data are "M" and "R", where M is level 1 and R is level 2. "M" stands for "metal cylinder" and "R" stands for rock. M is the positive class, so we exclude class 2L from our predictions. You can set excluded_class_id = 0L ```{r} -predict(greedy_ensemble, newdata = head(testing), excluded_class_id = 0L) +p <- predict(greedy_ensemble, newdata = head(testing), excluded_class_id = 0L) +knitr::kable(p, format = "markdown") ``` We can also use varImp to extract the variable importances from each member of the ensemble, as well as the final ensemble model: