aorsf; accelerated oblique random survival forests #532
Thanks for submitting to rOpenSci, our editors and @ropensci-review-bot will reply soon. Type |
🚀 The following problem was found in your submission template:
👋 |
Just updated to fix this. I'm not sure if reviewers will think that 'gold' is the right statsgrade, but might as well aim high. |
Checks for aorsf (v0.0.0.9000)
git hash: c73bb98c
Package License: MIT + file LICENSE
1. rOpenSci Statistical Standards
type | package | ncalls |
---|---|---|
internal | base | 341 |
internal | aorsf | 152 |
internal | utils | 13 |
internal | stats | 11 |
internal | methods | 1 |
imports | table.glue | 3 |
imports | Rcpp | NA |
imports | data.table | NA |
suggests | glmnet | 1 |
suggests | survival | NA |
suggests | survivalROC | NA |
suggests | ggplot2 | NA |
suggests | testthat | NA |
suggests | knitr | NA |
suggests | rmarkdown | NA |
suggests | covr | NA |
suggests | units | NA |
linking_to | Rcpp | NA |
linking_to | RcppArmadillo | NA |
Click below for tallies of functions used in each package. Locations of each call within this package may be generated locally by running `s <- pkgstats::pkgstats(<path/to/repo>)` and examining the `external_calls` table.
base
attr (40), names (35), for (23), length (21), c (20), paste (16), list (14), paste0 (12), mode (10), seq_along (10), vector (10), which (9), as.matrix (8), rep (7), as.integer (6), drop (6), seq (6), switch (6), order (5), match (4), min (4), setdiff (4), colnames (3), inherits (3), ncol (3), Sys.time (3), all (2), any (2), as.factor (2), cbind (2), data.frame (2), grepl (2), lapply (2), levels (2), matrix (2), nchar (2), nrow (2), rev (2), rle (2), row.names (2), sum (2), suppressWarnings (2), try (2), all.vars (1), as.data.frame (1), class (1), deparse (1), formals (1), grep (1), if (1), ifelse (1), intersect (1), is.na (1), max (1), mean (1), print (1), return (1), rownames (1), sapply (1), t (1), typeof (1), unique (1)
aorsf
paste_collapse (8), fctr_info (5), get_fctr_info (5), get_names_x (5), f_oobag_eval (3), get_n_obs (3), get_n_tree (3), get_names_y (3), get_numeric_bounds (3), last_value (3), orsf_fit (3), ref_code (3), unit_info (3), check_var_types (2), get_f_oobag_eval (2), get_importance (2), get_leaf_min_events (2), get_leaf_min_obs (2), get_max_time (2), get_mtry (2), get_n_events (2), get_n_leaves_mean (2), get_oobag_eval_every (2), get_type_oobag_eval (2), get_unit_info (2), is_empty (2), list_init (2), orsf_control_net (2), orsf_pd_summary (2), orsf_train_ (2), select_cols (2), check_arg_bound (1), check_arg_gt (1), check_arg_gteq (1), check_arg_is (1), check_arg_is_integer (1), check_arg_is_valid (1), check_arg_length (1), check_arg_lt (1), check_arg_lteq (1), check_arg_type (1), check_arg_uni (1), check_control_cph (1), check_control_net (1), check_new_data_fctrs (1), check_new_data_names (1), check_new_data_types (1), check_oobag_fun (1), check_orsf_inputs (1), check_pd_inputs (1), check_predict (1), check_units (1), contains_oobag (1), contains_vi (1), f_beta (1), fctr_check (1), fctr_check_levels (1), fctr_id_check (1), get_cph_do_scale (1), get_cph_eps (1), get_cph_iter_max (1), get_cph_method (1), get_cph_pval_max (1), get_f_beta (1), get_n_retry (1), get_n_split (1), get_net_alpha (1), get_net_df_target (1), get_oobag_pred (1), get_oobag_time (1), get_orsf_type (1), get_split_min_events (1), get_split_min_obs (1), get_tree_seeds (1), get_types_x (1), insert_vals (1), is_aorsf (1), is_error (1), is_trained (1), leaf_kaplan_testthat (1), lrt_multi_testthat (1), newtraph_cph_testthat (1), oobag_c_harrell_testthat (1), orsf (1), orsf_control_cph (1), orsf_oob_vi (1), orsf_pd_ (1), orsf_pd_ice (1), orsf_pred_multi (1), orsf_pred_uni (1), orsf_scale_cph (1), orsf_summarize_uni (1), orsf_time_to_train (1), orsf_train (1), orsf_unscale_cph (1), orsf_vi_ (1), x_node_scale_exported (1)
utils
data (13)
stats
formula (4), dt (2), terms (2), family (1), time (1), weights (1)
table.glue
round_spec (1), round_using_magnitude (1), table_value (1)
glmnet
glmnet (1)
methods
new (1)
3. Statistical Properties
This package features some noteworthy statistical properties which may need to be clarified by a handling editor prior to progressing.
Details of statistical properties (click to open)
The package has:
- code in C++ (48% in 2 files) and R (52% in 21 files)
- 1 author
- 3 vignettes
- 1 internal data file
- 3 imported packages
- 15 exported functions (median 15 lines of code)
- 216 non-exported functions in R (median 3 lines of code)
- 48 C++ functions (median 38 lines of code)
Statistical properties of package structure as distributional percentiles in relation to all current CRAN packages.
The following terminology is used:
- `loc` = "Lines of Code"
- `fn` = "function"
- `exp` / `not_exp` = exported / not exported

All parameters are explained as tooltips in the locally-rendered HTML version of this report generated by the `checks_to_markdown()` function.
The final measure (`fn_call_network_size`) is the total number of calls between functions (in R), or more abstract relationships between code objects in other languages. Values are flagged as "noteworthy" when they lie in the upper or lower 5th percentile.
measure | value | percentile | noteworthy |
---|---|---|---|
files_R | 21 | 82.3 | |
files_src | 2 | 79.1 | |
files_vignettes | 3 | 92.4 | |
files_tests | 18 | 95.7 | |
loc_R | 2139 | 85.4 | |
loc_src | 1982 | 75.6 | |
loc_vignettes | 342 | 67.9 | |
loc_tests | 1532 | 91.7 | |
num_vignettes | 3 | 94.2 | |
data_size_total | 9034 | 70.4 | |
data_size_median | 9034 | 78.5 | |
n_fns_r | 231 | 91.6 | |
n_fns_r_exported | 15 | 58.5 | |
n_fns_r_not_exported | 216 | 94.0 | |
n_fns_src | 48 | 66.0 | |
n_fns_per_file_r | 7 | 77.6 | |
n_fns_per_file_src | 24 | 94.8 | |
num_params_per_fn | 3 | 33.6 | |
loc_per_fn_r | 3 | 1.1 | TRUE |
loc_per_fn_r_exp | 15 | 35.6 | |
loc_per_fn_r_not_exp | 3 | 1.5 | TRUE |
loc_per_fn_src | 38 | 88.9 | |
rel_whitespace_R | 50 | 96.1 | TRUE |
rel_whitespace_src | 58 | 89.6 | |
rel_whitespace_vignettes | 56 | 83.5 | |
rel_whitespace_tests | 45 | 96.8 | TRUE |
doclines_per_fn_exp | 46 | 58.1 | |
doclines_per_fn_not_exp | 0 | 0.0 | TRUE |
fn_call_network_size | 364 | 93.5 | |
3a. Network visualisation
Click to see the interactive network visualisation of calls between objects in package
4. goodpractice and other checks
Details of goodpractice and other checks (click to open)
3a. Continuous Integration Badges
GitHub Workflow Results
name | conclusion | sha | date |
---|---|---|---|
Commands | skipped | 069021 | 2022-04-22 |
pages build and deployment | success | c2efe9 | 2022-04-28 |
pkgcheck | success | c73bb9 | 2022-04-28 |
pkgdown | success | c73bb9 | 2022-04-28 |
R-CMD-check | success | c73bb9 | 2022-04-28 |
test-coverage | success | c73bb9 | 2022-04-28 |
3b. goodpractice results
R CMD check with rcmdcheck
R CMD check generated the following note:
- checking installed package size ... NOTE
installed size is 6.9Mb
sub-directories of 1Mb or more:
libs 6.0Mb
R CMD check generated the following check_fails:
- no_import_package_as_a_whole
- rcmdcheck_reasonable_installed_size
Test coverage with covr
Package coverage: 97.13%
Cyclocomplexity with cyclocomp
The following functions have cyclocomplexity >= 15:
function | cyclocomplexity |
---|---|
orsf | 33 |
check_orsf_inputs | 28 |
orsf_pd_ | 22 |
ref_code | 17 |
check_new_data_names | 15 |
check_predict | 15 |
Static code analyses with lintr
lintr found the following 223 potential issues:
message | number of times |
---|---|
Avoid using sapply, consider vapply instead, that's type safe | 10 |
Lines should not be more than 80 characters. | 182 |
Use <-, not =, for assignment. | 31 |
Package Versions
package | version |
---|---|
pkgstats | 0.0.4.30 |
pkgcheck | 0.0.3.11 |
srr | 0.0.1.149 |
Editor-in-Chief Instructions:
This package is in top shape and may be passed on to a handling editor
@ropensci-review-bot assign @tdhock as editor |
Assigned! @tdhock is now the editor |
@ropensci-review-bot seeking reviewers |
Please add this badge to the README of your package repository: [](https://github.com/ropensci/software-review/issues/532) Furthermore, if your package does not have a NEWS.md file yet, please create one to capture the changes made during the review process. See https://devguide.ropensci.org/releasing.html#news |
Done! Looking forward to the review. |
Hi @tdhock, Thank you for acting as the editor of this submission. I have recently added a paper.md file to the repository in hopes that |
@bcjaeger The |
Thanks, @noamross! That makes sense. |
Hi @tdhock, I see we are still seeking reviewers. Is there anything I can do to help find reviewers for this submission? |
I have asked a few people to review but no one has agreed yet, do you have any idea for potential reviewers to ask? |
Thanks! Assuming folks in the rOpenSci circle have been asked already, I'll offer some names (username) that may not have been asked yet but would be good reviewers:
- Terry Therneau (therneau) would be a good reviewer - a lot of the C code in aorsf is based on his
- Torsten Hothorn (thothorn) would also be a good reviewer - Torsten is the author of the
- Hannah Frick (hfrick), Emil Hvitfeldt (EmilHvitfeldt), Max Kuhn (topepo), Davis Vaughan (DavisVaughan), and Julia Silge
- Raphael Sonabend (RaphaelS1), Andreas Bender (adibender), Michel Lang (mllg), and Patrick Schratz (pat-s) would all be good reviewers - they are developers/contributors to the |
Hi @tdhock, I am thinking about reaching out to the folks I listed in my last comment. Before I try to reach them, I was wondering if you had already contacted them? If you have, I will not bother them again with a review request. |
Hi @bcjaeger actually I have not asked any of those reviewers yet, sorry! I just got back to the office from traveling. Yes that would be helpful if you could ask them to review. (although it is suggested to not use more than one of them, https://devguide.ropensci.org/editorguide.html?q=reviewers#where-to-look-for-reviewers) |
Hi @tdhock, thanks! I will reach out to the folks in my earlier post and let you know if anyone is willing to review |
@ropensci-review-bot add @chjackson to reviewers |
@mnwright, thank you so much for your review! The suggestions you've made to improve
Once I have finished all of the updates, I will post and tag you.
Whoops, sorry I missed that. I will update with
I like this idea. I think making
Agreed - I think using
Whoops - yes, that needs to be reversed. Thank you! 😅
That is a great question - I think I need to include a high-dimensional case in the benchmark above to answer it. I'd also like to experiment with adding an
Agreed, this code needs better documentation. I will go through each function and add comments that include description of the function's purpose and a description of each input for the function, including global variables that the function depends on.
Yes, using OOP would be a much better design and would make generalizations to oblique forests for regression and classification much cleaner. The long term plan for
Agreed! I focused on survival initially because I wasn't aware of any R packages for oblique survival trees (the
Ahh, I understand. Sorry I did not control for this in my initial benchmark. I'd like to update my benchmark code so that
Agreed - I think I can revert to using 500 trees assuming the memory issue is fixed by restricting the number of unique event times.
Good point - if
Thank you! I sincerely appreciate the time you invested in reviewing |
I am happy to share an updated response to the excellent review from @mnwright:
I am committed to re-writing the
The benchmark of
With these updates, it is clear that
If you aggregate over all 9 panels (see second figure), |
Package Review
Documentation
The package includes all the following forms of documentation:
Functionality
Estimated hours spent reviewing: 4
Review Comments
The majority of my review is based on the state of the package as of commit 2251066.
Overall my impression of the package is very positive!
Performance
I have not conducted my own exhaustive experiments and have no additional notes regarding the public benchmark at
The author also kindly contributed to
A very minor nitpicky comment regarding code efficiency:
# Example data
data <- as.data.frame(matrix(rnorm(100), ncol = 10))
data$V4 <- as.character(data$V4)
data$V8 <- as.character(data$V8)
data$V10 <- as.character(data$V10)
# Recreating .names argument
.names <- names(data)
# character detection without a for loop and a growing vector
which_char <- vapply(data, is.character, logical(1), USE.NAMES = FALSE)
chrs <- .names[which(which_char)]
chrs
#> [1] "V4"  "V8"  "V10"
Also note that the output will likely be excessive if a user were to accidentally supply a very large dataset with thousands of
Methods
The inclusion of multiple easy to discover/use variable importance methods is very nice to see!
Partial dependence plots: Due to their known limitations/assumptions, it may be appropriate to
Permutation feature importance: This method has drawbacks as well, see Hooker et al., so the previous comment applies with the exception that I cannot suggest a direct alternative.
I was not familiar with the ANOVA- and negation-based importance measures, and I suspect many users may feel the same way. In that regard I think it would be useful to elaborate on these methods slightly, e.g. in |
Thanks for your review @jemus42! You are right it is not good practice for large data to grow vectors by over-writing the variable in a for loop. vapply is one option, and another option is to use a for loop, but write to a list element in each iteration (constant time operation), instead of over-writing the variable in each iteration (linear time operation), see https://tdhock.github.io/blog/2017/rbind-inside-outside/ for an example involving data frames, where the difference is noticeable even when there are only a few dozen things to combine. |
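The list-element pattern described in the comment above can be sketched in a few lines of base R. This is a generic illustration, not code from aorsf, and the helper names (`grow_vector`, `fill_list`) are made up for this sketch:

```r
# Generic illustration: growing a vector re-allocates and copies it on every
# iteration, while writing into a pre-allocated list element is a
# constant-time operation, with a single combine step at the end.

grow_vector <- function(x) {
  out <- c()
  for (i in seq_along(x)) {
    out <- c(out, x[[i]])  # copies `out` each iteration (linear time)
  }
  out
}

fill_list <- function(x) {
  out <- vector("list", length(x))  # pre-allocate once
  for (i in seq_along(x)) {
    out[[i]] <- x[[i]]  # constant-time write to an existing slot
  }
  unlist(out)  # combine in a single step at the end
}

x <- as.list(1:1000)
identical(grow_vector(x), fill_list(x))
#> [1] TRUE
```

Both produce the same result; the difference is that `fill_list` defers the combining to a single operation, which is the same idea as combining data frames outside the loop in the linked blog post.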
@jemus42, thank you! I will write a more complete response to your feedback and a list of proposed actions in the next few days, but for now I just wanted to say how much I appreciate your review. You found a very nice solution to speed up
n <- 500
p <- 1000
data_check <- list()
characters <- letters[1:5]
for( i in seq(p) ) {
data_check[[i]] <- sample(characters, size = n, replace = TRUE)
names(data_check)[i] <- paste('x', i, sep = '_')
}
data_check <- as.data.frame(data_check)
c_loop <- function(data, .names){
chrs <- c()
for( .name in .names ){
if(is.character(data[[.name]])){
chrs <- c(chrs, .name)
}
}
chrs
}
v_apply <- function(data, .names){
chr_index <- which(
vapply(data[, .names], is.character, logical(1), USE.NAMES = FALSE)
)
.names[chr_index]
}
testthat::expect_equal(
c_loop(data_check, names(data_check)),
v_apply(data_check, names(data_check))
)
microbenchmark::microbenchmark(
c_loop = c_loop(data_check, names(data_check)),
v_apply = v_apply(data_check, names(data_check))
)
#> Unit: microseconds
#> expr min lq mean median uq max neval cld
#> c_loop 8948.500 9234.001 9747.49 9380.5010 9530.401 13457.202 100 b
#> v_apply  328.001  339.701  484.56  346.6005  385.701  5247.201   100  a
Created on 2022-09-16 by the reprex package (v2.0.1) |
@jemus42, here is the more complete response to your comments, with planned action items. If you feel like any of the proposed actions could be improved, or if I didn't address all of your comments in my response, just let me know. Otherwise, I should have updates in
Thank you! I will continue to prioritize clear documentation and tab-friendly names in future updates.
I agree regarding
There probably is something I could do here but I also am not sure how to use
Excellent! Thank you for checking these.
Good point -
I like this idea. Would the test be something like graphing the predictions from
I'm glad to hear the
Thank you! I found this very helpful and have implemented your suggestion.
This is a good point. I will add notes in the documentation about the limitations of PD and will recommend users engage with the
I agree - I will add text to my documentation highlighting this and refer to the paper you linked as well as to the disadvantages of PD section in the iml book
That is a good point. ANOVA importance was developed and evaluated by Menze et al for oblique classification random forests. Negation importance is developed and evaluated in the ArXiv paper you mentioned. According to the simulation study in the ArXiv paper, all three methods (negation, ANOVA, and permutation) appear to be valid. It seems as if negation has an edge in the general comparison based on our simulation study, but that isn't quite enough evidence for me to recommend negation as a default approach. I will add some notes to the documentation about the origin of these three methods and will add pros/cons for each one.
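For readers unfamiliar with the permutation approach discussed above, here is a hedged, generic sketch of the idea: shuffle one predictor at a time and record how much a fitted model's performance degrades. A linear model and R-squared stand in for the survival forest and its concordance metric, and `perm_importance` is a made-up helper for this sketch, not an aorsf function:

```r
# Generic sketch of permutation importance (not aorsf's internal code).
set.seed(42)
n <- 300
x1 <- rnorm(n); x2 <- rnorm(n); noise <- rnorm(n)
y <- 2 * x1 + 0.1 * x2 + rnorm(n)   # x1 matters a lot, x2 a little, noise not at all
dat <- data.frame(y, x1, x2, noise)

fit <- lm(y ~ x1 + x2 + noise, data = dat)
r2 <- function(model, d) cor(predict(model, d), d$y)^2

perm_importance <- function(model, d, outcome = "y") {
  base <- r2(model, d)
  vars <- setdiff(names(d), outcome)
  vapply(vars, function(v) {
    d_perm <- d
    d_perm[[v]] <- sample(d_perm[[v]])  # break the predictor-outcome association
    base - r2(model, d_perm)            # drop in performance = importance
  }, numeric(1))
}

round(perm_importance(fit, dat), 2)
```

Running this shows a large drop for `x1` and a near-zero drop for `noise`, which is the intuition behind the permutation measure (and, as noted above, also behind its known drawbacks when predictors are correlated).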
I am thrilled to hear that. =] Thank you for reviewing closely and providing so many useful ideas to improve |
Regarding the
library(mlr3verse)
#> Loading required package: mlr3
library(mlr3proba)
library(mlr3extralearners)
library(mlr3viz)
library(aorsf)
library(ggplot2)
# Less logging during training
lgr::get_logger("mlr3")$set_threshold("warn")
# Some example tasks without missing values
tasks <- tsks(c("actg", "rats"))
# Learners with default parameters
learners <- lrns(
c("surv.ranger", "surv.rfsrc", "surv.aorsf"),
predict_sets = c("train", "test")
)
# Brier (Graf) score, c-index and training time as measures
measures <- msrs(c("surv.graf", "surv.cindex", "time_train"))
# Benchmark with 10-fold CV
design <- benchmark_grid(
tasks = tasks,
learners = learners,
resamplings = rsmps("cv", folds = 10)
)
# Parallelization (optional)
future::plan("multisession", workers = 5)
benchmark_result <- benchmark(design)
bm_scores <- benchmark_result$score(measures, predict_sets = "test")
bm_scores |>
dplyr::select(task_id, learner_id, surv.graf, surv.cindex, time_train) |>
tidyr::pivot_longer(cols = surv.graf:time_train, names_to = "measure") |>
ggplot(aes(x = learner_id, y = value)) +
facet_wrap(vars(measure), nrow = 1, scales = "free_y") +
geom_violin(draw_quantiles = c(.25, .5, .75)) +
theme_minimal()
bm_summary <- benchmark_result$aggregate(measures)
bm_summary[, c("task_id", "learner_id", "surv.graf", "surv.cindex")]
#> task_id learner_id surv.graf surv.cindex
#> 1: actg surv.ranger 0.05933293 0.7265903
#> 2: actg surv.rfsrc 0.05776890 0.7373370
#> 3: actg surv.aorsf 0.05803357 0.7282324
#> 4: rats surv.ranger 0.07233919 0.7992290
#> 5: rats surv.rfsrc 0.08087658 0.7636673
#> 6:    rats  surv.aorsf 0.08338740   0.7874255
Created on 2022-09-19 with reprex v2.0.2 |
📆 @mnwright you have 2 days left before the due date for your review (2022-09-21). |
📆 @jemus42 you have 2 days left before the due date for your review (2022-09-21). |
@ropensci-review-bot submit review #532 (comment) time 4 |
Logged review for jemus42 (hours: 4) |
@ropensci-review-bot submit review #532 (comment) time 3 |
Logged review for mnwright (hours: 3) |
The
I am happy to share some updates!
# Similar to obliqueRSF?
suppressPackageStartupMessages({
library(aorsf)
library(obliqueRSF)
})
set.seed(50)
fit_aorsf <- orsf(pbc_orsf,
formula = Surv(time, status) ~ . - id,
n_tree = 100)
fit_obliqueRSF <- ORSF(pbc_orsf, ntree = 100, verbose = FALSE)
risk_aorsf <- predict(fit_aorsf, new_data = pbc_orsf, pred_horizon = 3500)
risk_obliqueRSF <- 1-predict(fit_obliqueRSF, newdata = pbc_orsf, times = 3500)
cor(risk_obliqueRSF, risk_aorsf)
#> [,1]
#> [1,] 0.9747177
plot(risk_obliqueRSF, risk_aorsf)
Created on 2022-09-20 by the reprex package (v2.0.1)
@jemus42, do you feel these updates (in addition to the earlier update using |
Yes, I have nothing more to add I think :) Regarding the regression test, I don't think there's a meaningful cutoff one could define, but if for example you were to introduce a bug on the |
Excellent! Totally agree with you on the regression test. Thanks again for the suggestion! |
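A minimal sketch of the regression-test idea agreed on above, using simulated stand-in predictions. In the real test, `risk_reference` and `risk_candidate` would come from `predict()` on fitted obliqueRSF and aorsf models, and the 0.9 threshold is an illustrative assumption, not a value from the package's test suite:

```r
# Stand-in predictions; in the actual test these would be risk estimates from
# obliqueRSF (trusted reference) and aorsf (candidate) on the same data.
set.seed(1)
risk_reference <- runif(200)
risk_candidate <- risk_reference + rnorm(200, sd = 0.02)  # close agreement

# Rather than a hard accuracy cutoff, assert strong correlation with the
# trusted reference implementation.
check_agreement <- function(reference, candidate, threshold = 0.9) {
  cor(reference, candidate) > threshold
}

check_agreement(risk_reference, risk_candidate)
#> [1] TRUE

# A bug that, say, flipped the direction of predicted risk would be caught:
check_agreement(risk_reference, 1 - risk_candidate)
#> [1] FALSE
```

This mirrors the point made above: no single cutoff is "meaningful" in absolute terms, but a correlation floor against a reference implementation reliably flags direction-flipping or scaling bugs.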
Since you have sufficiently addressed the reviewer comments, this package can now be accepted. |
@ropensci-review-bot approve aorsf |
Approved! Thanks @bcjaeger for submitting and @chjackson, @mnwright, @jemus42 for your reviews! 😁 To-dos:
Should you want to acknowledge your reviewers in your package DESCRIPTION, you can do so by making them
Welcome aboard! We'd love to host a post about your package - either a short introduction to it with an example for a technical audience, or a longer post with some narrative about its development or something you learned, and an example of its use for a broader readership. If you are interested, consult the blog guide and tag @ropensci/blog-editors in your reply. They will get in touch about timing and can answer any questions.
We maintain an online book with our best practice and tips; this chapter starts the 3rd section, which gives guidance for after onboarding (with advice on releases, package marketing, GitHub grooming). The guide also features CRAN gotchas. Please tell us what could be improved.
Last but not least, you can volunteer as a reviewer by filling in a short form. |
@ropensci-review-bot finalize transfer of bcjaeger/aorsf |
Can't find repository |
@ropensci-review-bot finalize transfer of aorsf |
Transfer completed. |
Thanks everyone for your help improving |
Date accepted: 2022-09-22
Submitting Author Name: Byron C Jaeger
Due date for @chjackson: 2022-07-29
Submitting Author Github Handle: @bcjaeger
Other Package Authors Github handles: (comma separated, delete if none) @nmpieyeskey, @sawyerWeld
Repository: https://github.com/bcjaeger/aorsf
Version submitted: 0.0.0.9000
Submission type: Stats
Badge grade: gold
Editor: @tdhock
Reviewers: @chjackson, @mnwright, @jemus42
Due date for @mnwright: 2022-09-21
Due date for @jemus42: 2022-09-21
Archive: TBD
Version accepted: TBD
Language: en
Pre-submission Inquiry
General Information
Target audience: people who want to fit and interpret a risk prediction model, i.e., a prediction model for right-censored time-to-event outcomes.
Applications: fit an oblique random survival forest, compute predicted risk at a given time, estimate the importance of individual variables, and compute partial dependence to depict relationships between specific predictors and predicted risk.
Paste your responses to our General Standard G1.1 here, describing whether your software is:
Please include hyperlinked references to all other relevant software. The obliqueRSF package was the original R package for oblique random forests. I wrote it and it is very slow. I wrote `aorsf` because I had multiple ideas about how to make `obliqueRSF` faster, specifically using a partial Newton-Raphson algorithm instead of using penalized regression to derive linear combinations of variables in decision nodes. It would have been possible to rewrite `obliqueRSF`, but it would have been difficult to make the re-write backwards compatible with the version of `obliqueRSF` on CRAN.
(If applicable) Does your package comply with our guidance around Ethics, Data Privacy and Human Subjects Research?
Not applicable
Badging
Gold
`aorsf` complies with over 100 combined standards in the general and ML categories. `aorsf` uses an optimized routine to partially complete Newton-Raphson scoring for the Cox proportional hazards model and also an optimized routine to compute likelihood ratio tests. Both of these routines are heavily used when fitting oblique random survival forests, and both demonstrate the exact same answers as corresponding functions in the `survival` package (see tests in `aorsf`) while running at least twice as fast (thanks to RcppArmadillo).
Technical checks
Confirm each of the following by checking the box.
- `autotest` checks on the package, and ensured no tests fail.
- The `srr_stats_pre_submit()` function confirms this package may be submitted.
- The `pkgcheck()` function confirms this package may be submitted - alternatively, please explain reasons for any checks which your package is unable to pass.

I think `aorsf` is passing `autotest` and `srr_stats_pre_submit()`. I am having some issues running these on R 4.2. Currently, autotest is returning NULL, which I understand to be a good thing, and srr_stats_pre_submit is not able to run (not sure why, but it was fine before I updated to R 4.2).
This package:
Publication options
Code of conduct