Roadmap for new R interface #9810
Comments
@trivialfis From the previous thread, you mentioned you might be able to work on categorical feature support - would you be able to take on the first two tasks here?
@dfsnow You mentioned that you were willing to help in the earlier topic - would you be interested in taking on some of the issues here, particularly around DMatrix topics?
@jameslamb Would you be interested in taking on a task such as removing the handle class from the public interface?
@mayer79 Are you familiar with C++ and R's C interface? Would you be able to help with some of these topics?
@david-cortes: fantastic road map, thank you so much. Unfortunately, you have spotted my biggest weakness! For the C part, we might ask the data.table team. For the C++ part, Dirk Eddelbuettel?
Let me handle the primitive support for data frames first. Categorical data can follow.
This is probably going to help with other interfaces as well. We need to have missing data for each column.
With the amount of custom C++ code in the R package, I think we need to set up CI tests with sanitizers for R (hopefully not Valgrind, which is slow).
Another task which doesn't require modifying any C/C++ functions (only .R files): currently, @mayer79 would you be interested in contributing a fix?
Good idea. I even remember this issue from somewhere.
Yes definitely! But it will be about 1-2 weeks until I'm able to spend any time on it, as I'm focusing right now on trying to get I'm also happy to help with reviews on any PRs here if you want, just
Since the current master branch now supports multi-quantile regression, I guess it's now time to update the example in the docs where it says
... and maybe it'd be worth it to add an equivalent R example, if someone would like to take on this task.
@david-cortes Out of curiosity, do you want to become the CRAN maintainer once the new interface is in place (regardless of whether the two interfaces coexist)? At the moment, I'm maintaining the CRAN package but only doing the chores rather than any actual development. It would be great if a real expert could take over.
We are done with all the must-haves by now. The only remaining thing is categorical encoding support once it gets implemented in the core library, but that should not be a blocker for a new release. I'll hand it over to XGBoost maintainers to proceed with the CRAN release. I'd of course prefer to retain the
Pinging a few people who might have opinions on the redesigned R interface for XGBoost before the next CRAN release. Feel free to comment on the following thread if you'd like to do so:
Summary: the R bindings for XGBoost have been largely rewritten into a more idiomatic and user-friendly interface, which involved lots of breaking changes throughout the package. There is now an x/y interface through function
See docs here to install the latest version:
Plus the documentation page:
And new vignette:
CC @sebffischer @giuseppec @pbiecek @bgreenwell @eddelbuettel @simonpcouch
Today, I have tested the new package version on some of my code (lecture notes and some R packages) that use
The first point can already be adapted now. The other points require the new {xgboost} release and are harder to anticipate. The fitted models have changed as well. Not sure why, maybe due to random seed changes?
Most likely due to two changes:
Ah, that might be possible, I forgot about the initial estimate (I was already using histogram binning). Should we draft a compact guide "How to switch to XGBoost 3" and list the most likely changes?
@mayer79 That sounds like a great idea. When I started looking into mlr3 integration, I worked on it a little bit but had to move on to other things (the checks are only for myself, from when I was looking into it):
@david-cortes Might help fill in the blanks in terms of what to look out for.
ref #9734
ref #9475
This issue is intended as a roadmap tracker for bringing xgboost's R interface up to date, and as a place to discuss and coordinate these tasks.
From the previous tasks, here I've made a list of potential tasks to take on, but I might be missing some things, and I've put the biggest task (new xgboost() function) under a single bullet point while in practice it'll likely involve multiple rounds of PRs. Please feel free to add more tasks to this list.
I've taken the liberty of classifying these issues in terms of whether they'd be blockers for releasing a new xgboost version or not, albeit some people might disagree with my assessments.
DMatrix constructors (matrix, dgCMatrix, dgRMatrix).
data.frame objects, automatically setting factor variables to be of categorical type in the DMatrix. (Support dataframe data format in native XGBoost. #9828)
int (int32_t), double (float64), and potentially int64_t from package bit64.
XGDMatrixNumNonMissing.
XGDMatrixGetDataAsCSR.
DMatrix object from arrow objects (from package "arrow"). Like for data frames, should automatically recognize categorical columns from the categorical arrow type.
QuantileDMatrix objects from R, accepting the same kinds of inputs as DMatrix (data.frame, matrix, dgCMatrix, dgRMatrix, arrow if implemented, maybe float::float32), and also auto-recognizing categorical features for objects that have them (data frames and arrow tables).
DMatrix objects that are currently missing from the R package, such as get_quantile_cut (guess this is just a call to XGDMatrixGetQuantileCut?).
DMatrix parameters that reference data towards xgb.DMatrix() function arguments, such as qid, group, label_lower_bound, label_upper_bound, etc.
DMatrix creation function for R matrices towards the C function that uses array_interface.
predict method for the current booster to use "inplace predict" or other more efficient DMatrix creators when appropriate.
Booster.handle class, as well as the conversion methods from handle to booster and vice-versa, leaving only the booster for now.
xgb.Booster.complete, which wouldn't be needed anymore.
(Low priority) Implement serialization for DMatrix handles through the same ALTREP system as above. This idea was discarded (thread)
xgboost() function, and remove the calls from all the places it gets used (tests, examples, vignettes, etc.).
data.frame and categorical features is added, then create a new xgboost() function from scratch that wouldn't share any code base with the current function named like that, ideally working as a higher-level wrapper over DMatrix + xgb.train but implementing the kind of idiomatic R interface (x/y only, no formula) described in the earlier thread, either with a separate function for the parameters or everything being passed in the main function.
xgb.train (perhaps the class could be named "xgboost").
predict method, again with a different interface than the booster's predict, as described in the first message here.
xgboost() x/y interface gets implemented, then modify other functions to accept these objects - e.g.:
xgboost() function.
xgboost() interface.
DataIter.
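As a rough illustration of the data.frame task in the list above (automatically treating factor columns as categorical), the following base-R sketch shows one way the column handling could look. The helper name encode_data_frame is hypothetical and is not part of xgboost. The type labels "c", "int" and "float" are assumed to match the feature type strings the core library understands, and the zero-based coding of factor levels is likewise an assumption about what a categorical column would need.

```r
# Hypothetical helper, NOT part of the xgboost package: turns a data.frame
# into a numeric matrix plus per-column feature type labels.
encode_data_frame <- function(df) {
  stopifnot(is.data.frame(df), ncol(df) > 0)

  is_cat <- vapply(df, is.factor, logical(1))
  is_int <- vapply(df, is.integer, logical(1))
  is_ok <- is_cat | vapply(df, function(col) is.numeric(col) || is.logical(col),
                           logical(1))
  if (!all(is_ok)) {
    stop("Unsupported column type(s): ", paste(names(df)[!is_ok], collapse = ", "))
  }

  # Factors become zero-based integer codes (assumed convention);
  # NA entries stay NA so that each column keeps its own missing values.
  columns <- lapply(df, function(col) {
    if (is.factor(col)) as.numeric(as.integer(col) - 1L) else as.numeric(col)
  })

  list(
    data = do.call(cbind, columns),
    feature_names = names(df),
    # Assumed labels: "c" = categorical, "int" = integer, "float" = double.
    feature_types = unname(ifelse(is_cat, "c", ifelse(is_int, "int", "float"))),
    # Keep the factor levels so codes can be mapped back at prediction time.
    levels = lapply(df[is_cat], levels)
  )
}

df <- data.frame(
  x = c(1.5, 2.5, NA),
  n = c(1L, 2L, 3L),
  f = factor(c("a", "b", "a"))
)
enc <- encode_data_frame(df)
print(enc$feature_types)
print(enc$data)
```

Keeping the factor levels next to the encoded matrix is a design choice made for this sketch: it would let a later predict call check that new data uses the same levels in the same order. Whether and how the result is handed to a DMatrix constructor is left open here.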