Roadmap for new R interface #9810

Open
22 of 27 tasks
david-cortes opened this issue Nov 27, 2023 · 25 comments

@david-cortes
Contributor

david-cortes commented Nov 27, 2023

ref #9734
ref #9475

This issue is intended as a roadmap tracker for progress in bringing xgboost's R interface up to date, and as a place for discussion and coordination around these tasks.

From the previous tasks, here I've made a list of potential tasks to take on, but I might be missing some things, and I've put the biggest task (new xgboost() function) under a single bullet point while in practice it'll likely involve multiple rounds of PRs. Please feel free to add more tasks to this list.

I've taken the liberty of classifying these issues in terms of whether they'd be blockers for releasing a new xgboost version or not, though some people might disagree with my assessments.

  • (Blocker) Enable categorical features for current DMatrix constructors (matrix, dgCMatrix, dgRMatrix).
  • (Blocker) Add support for creating DMatrices from R data.frame objects, automatically setting factor variables to be of categorical type in the DMatrix. (Support dataframe data format in native XGBoost. #9828)
    • Note: these objects are a list of arrays which aren't necessarily in a single memory chunk, and which can have types int (int32_t), double (float64), and potentially int64_t from package bit64.
    • I guess this and the first point could be done in the same PR since they might be touching similar code sections.
  • (Blocker) Fix plotting and trees-to-table with categorical splits.
  • Add XGDMatrixNumNonMissing.
  • Add XGDMatrixGetDataAsCSR.
  • (Blocker) Enable multi-output input labels and predictions.
  • (Low priority) Add a mechanism to create a DMatrix object from arrow objects (from package "arrow"). Like for data frames, should automatically recognize categorical columns from the categorical arrow type.
    • Note: the idea here is to exploit functions that work directly on arrow format, without converting to base R arrays (which do not support all the arrow types) along the way.
  • Add an interface to create QuantileDMatrix objects from R, accepting the same kinds of inputs as DMatrix (data.frame, matrix, dgCMatrix, dgRMatrix, arrow if implemented, maybe float::float32), and also auto-recognizing categorical features for objects that have them (data frames and arrow tables).
  • (Low priority) Add methods to get additional info from DMatrix objects that are currently missing from the R package, such as get_quantile_cut (guess this is just a call to XGDMatrixGetQuantileCut?).
  • (Blocker) Move more DMatrix parameters that reference data towards xgb.DMatrix() function arguments, such as qid, group, label_lower_bound, label_upper_bound, etc.
    • Potentially a good reference could be the DMatrix python class.
  • Switch the current DMatrix creation function for R matrices towards the C function that uses array_interface.
  • Switch the predict method for the current booster to use "inplace predict" or other more efficient DMatrix creators when appropriate.
  • (Blocker) Remove all the public interface (functions, docs, tests, examples) around the Booster.handle class, as well as the conversion methods from handle to booster and vice-versa, leaving only the booster for now.
  • (Blocker) After the task above is done, switch the handle serialization mechanism to ALTREP and remove xgb.Booster.complete, which wouldn't be needed anymore.
    • This increases the R requirement to >= 4.3, so it requires modifying the CI jobs to update them all to this version of R and drop the older ones.
  • (Low priority) Implement serialization for DMatrix handles through the same ALTREP system as above. This idea was discarded (thread)
  • (Blocker) Remove the current xgboost() function, and remove the calls from all the places it gets used (tests, examples, vignettes, etc.).
  • (Blocker) After support for data.frame and categorical features is added, then create a new xgboost() function from scratch that wouldn't share any code base with the current function named like that, ideally working as a higher-level wrapper over DMatrix + xgb.train but implementing the kind of idiomatic R interface (x/y only, no formula) described in the earlier thread, either with a separate function for the parameters or everything being passed in the main function.
    • It should return objects of a different class than xgb.train (perhaps the class could be named "xgboost").
    • This class should have its own predict method, again with a different interface than the booster's predict, as described in the first message here.
    • If this class needs to keep additional attributes, perhaps they could be kept as part of the JSON that gets serialized, otherwise should have a note about serialization and transferability with other interfaces.
    • This is probably the largest PR in terms of code (especially tests!!), so might need to be split into different batches. For example, support for custom objectives could be left out from the first PR.
  • (Blocker) After the new xgboost() x/y interface gets implemented, then modify other functions to accept these objects - e.g.:
    • Plotting function.
    • Feature importance function.
    • Serialization functions that are aimed at transferring models between interfaces.
    • All of these should keep in mind small details like base-1 indexing for tree numbers and similar.
  • (Blocker) Create examples and vignettes for the new xgboost() function.
  • (Low priority) Perhaps create a higher-level cv function for the new xgboost() interface.
  • Support creation of external memory objects with DataIter.
  • (Blocker) Enable quantile regression with multiple quantiles.
  • Switch the R package build system to CMake instead of autotools.
  • (Low priority) Distributed training, perhaps integration with RSpark.
  • Documentation and unified tests for 1-based indexing.
  • (Blocker) Fix misrendered documentation: [R] Docs for function arguments format second paragraph as code block #10329
  • (Blocker) Update introductory vignette to reflect current XGBoost capabilities [R] Introductory vignette is outdated #10746
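
To illustrate the note on data.frame inputs above, here is a minimal base R sketch showing that a data.frame is a list of separately stored columns, each with its own storage type, and that factors are held as integer codes plus a levels attribute. The column names and values are made up for illustration, and the zero-based recoding at the end is only an assumption about what a DMatrix constructor might want, not the actual implementation.

```r
# Hypothetical example data: one integer, one double, and one factor column.
df <- data.frame(
  age = c(21L, 35L, 48L),
  income = c(1500.5, 3200.0, 2750.25),
  city = factor(c("a", "b", "a"))
)

# A data.frame is a list whose elements are the individual column vectors.
is.list(df)

# Storage type of each column; factors are stored as integer codes.
col_storage <- vapply(df, typeof, character(1))

# Columns that a constructor could flag as categorical automatically.
is_categorical <- vapply(df, is.factor, logical(1))

# Assumption: categories are passed on as zero-based codes,
# while R's factor codes start at 1.
city_codes <- as.integer(df$city) - 1L
```

Because each column is its own vector, a constructor working on this input cannot assume a single contiguous memory block the way it can for a plain matrix.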
@david-cortes
Contributor Author

@trivialfis From the previous thread, you mentioned you might be able to work on categorical feature support - would you be able to take on the first two tasks here?

@dfsnow You mentioned that you were willing to help in the earlier topic - would you be interested in taking on some of the issues here, particularly around DMatrix topics?

@jameslamb Would you be interested in taking on some task such as removing the handle class from the public interface?

@mayer79 Are you familiar with C++ and R's C interface? Would you be able to help with some of these topics?

@mayer79
Contributor

mayer79 commented Nov 27, 2023

@david-cortes: fantastic roadmap, thank you so much. Unfortunately, you have spotted my biggest weakness! For the C part, we might ask the data.table team. For the C++ part, Dirk Eddelbuettel?

@trivialfis
Member

Let me handle the primitive support for data frame first. Categorical data can follow.

@trivialfis
Member

Let me handle the primitive support for data frame first. Categorical data can follow.

This is probably going to help with other interfaces as well. We need to have missing data for each column.

@trivialfis
Member

With the amount of custom C++ code in the R package, I think we need to set up CI tests with sanitizer for R (hopefully not Valgrind, which is slow).

@david-cortes
Contributor Author

Another task which doesn't require modifying any C/C++ functions (only .R files): currently, xgb.cv will error out with objective survival:aft. This is because the function checks that the DMatrix object has a label property, whereas this objective works with label_lower_bound and label_upper_bound instead.
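
A rough sketch of the kind of check that would avoid this error, in base R. The function name and the plain list standing in for the DMatrix fields are hypothetical; this is not the code in xgb.cv, only an illustration of branching on the objective before requiring a label.

```r
# Hypothetical helper: `info` is a plain list standing in for the fields
# stored in a DMatrix. Not part of the xgboost package.
has_training_target <- function(info, objective) {
  if (objective == "survival:aft") {
    # AFT uses interval bounds instead of a single label.
    !is.null(info[["label_lower_bound"]]) &&
      !is.null(info[["label_upper_bound"]])
  } else {
    # `[[` matches names exactly, unlike `$` on lists.
    !is.null(info[["label"]])
  }
}

# Made-up interval-censored targets: the second observation is right-censored.
aft_info <- list(
  label_lower_bound = c(2, 5),
  label_upper_bound = c(4, Inf)
)

ok_for_aft <- has_training_target(aft_info, "survival:aft")
ok_for_regression <- has_training_target(aft_info, "reg:squarederror")
```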

@mayer79 would you be interested in contributing a fix?

@mayer79
Contributor

mayer79 commented Dec 3, 2023

Good idea. I even remember this issue from somewhere.

@jameslamb
Contributor

@jameslamb Would you be interested in taking on some task such as removing the handle class from the public interface?

Yes definitely!

But it will be about 1-2 weeks until I'm able to spend any time on it, as I'm focusing right now on trying to get {lightgbm} 4.x out to CRAN (and keeping {lightgbm} from being archived there 😬 ).

I'm also happy to help with reviews on any PRs here if you want, just @ me.

@david-cortes
Contributor Author

Since the current master branch now supports multi-quantile regression, I guess it's now time to update the example in the docs where it says

The feature is only supported using the Python package

... and maybe it'd be worth it to add an equivalent R example, if someone would like to take on this task.

@trivialfis
Member

@david-cortes Out of curiosity, do you want to become the CRAN maintainer after having the new interface (regardless of whether the two interfaces coexist)? At the moment, I'm maintaining the CRAN package but only doing the chores rather than actual development. It would be great if a real expert could take over.

@david-cortes
Contributor Author

We are done with all the must-haves by now. Only remaining thing is categorical encoding support once it gets implemented in the core library, but that should not be a blocker for a new release.

I'll hand it over to XGBoost maintainers to proceed with the CRAN release. I'd of course prefer to retain the xgboost name, but coordinating with the reverse dependencies is a humungous amount of work and I'm not the one who will have to deal with it, so easier said than done.

@david-cortes
Contributor Author

Pinging a few people who might have opinions on the redesigned R interface for XGBoost before the next CRAN release. Feel free to comment on the following thread if you'd like to do so:
#9734

Summary: the R bindings for XGBoost have been largely rewritten into a more idiomatic and user-friendly interface, which involved lots of breaking changes throughout the package. There is now an x/y interface through function xgboost() geared towards interactive usage, and a low-level interface through xgb.train() that follows the same function from the Python package. Besides those, there were also changes in e.g. the serialization logic (which now uses ALTLIST), among others.

See docs here to install the latest version:
https://xgboost.readthedocs.io/en/latest/build.html#building-r-package-from-source

Plus the documentation page:
https://xgboost--11166.org.readthedocs.build/en/11166/r_docs/R-package/docs/reference/index.html

And new vignette:
https://xgboost--11166.org.readthedocs.build/en/11166/R-package/xgboost_introduction.html

CC @sebffischer @giuseppec @pbiecek @bgreenwell @eddelbuettel @simonpcouch

@mayer79
Contributor

mayer79 commented Jan 19, 2025

Today, I have tested the new package version on some of my code (lecture notes and some R packages) that uses xgb.train() and xgb.cv(), but not xgboost(). The main adaptations needed to make the code run again:

  1. Arguments like objective, num_class or nthread can now be passed only via the params list, not as separate arguments.
  2. watchlist is now called evals.
  3. Callback functions now start with "xgb.": xgb.cb.print.evaluation instead of cb.print.evaluation etc.

The first point can already be adopted now. The other points require the new {xgboost} release and are harder to anticipate.
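
The three changes listed here can be sketched as a small base R helper that rewrites an old-style argument list into the new shape. The helper names are hypothetical and not part of {xgboost}; the set of arguments moved into params is limited to the three named above, so anything else would need to be added by hand.

```r
# Hypothetical helper: move loose booster arguments into `params`
# and rename `watchlist` to `evals`.
migrate_train_args <- function(args,
                               booster_params = c("objective", "num_class", "nthread")) {
  params <- if (is.null(args[["params"]])) list() else args[["params"]]
  loose <- intersect(names(args), booster_params)
  params[loose] <- args[loose]   # point 1: pass via the params list
  args[loose] <- NULL            # drop them as separate arguments
  args[["params"]] <- params
  names(args)[names(args) == "watchlist"] <- "evals"  # point 2
  args
}

# Point 3: callback functions gained an "xgb." prefix.
migrate_callback_name <- function(name) sub("^cb\\.", "xgb.cb.", name)

# Old-style arguments; strings stand in for real DMatrix objects.
old_args <- list(
  data = "dtrain",
  nrounds = 10,
  objective = "binary:logistic",
  nthread = 2,
  watchlist = list(train = "dtrain")
)

new_args <- migrate_train_args(old_args)
new_callback <- migrate_callback_name("cb.print.evaluation")
```

With real DMatrix objects in place of the placeholder strings, the resulting list has the shape that could be handed to xgb.train() through do.call().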

The fitted models have changed as well. Not sure why, maybe due to random seed changes?

@trivialfis
Member

trivialfis commented Jan 21, 2025

The fitted models have changed as well. Not sure why, maybe due to random seed changes?

Most likely due to two changes:

  • A new initial estimation using MLE. In older versions, there was no initial estimation; then we used one Newton step uniformly. Then @david-cortes suggested using MLE for selected objectives that have a closed-form solution.
  • Default tree method was changed from approx to hist in 2.0.
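
As an illustration of what a closed-form MLE for the initial estimate looks like (not necessarily the exact computation inside XGBoost, which this thread does not spell out): for log loss with an intercept-only model, the maximum likelihood intercept is the logit of the mean label. The data below are made up, and the numerical optimizer is only there as a cross-check.

```r
# Made-up binary labels.
y <- c(0, 0, 1, 1, 1)

# Closed-form MLE of the intercept on the logit scale:
# log(0.6 / 0.4), about 0.405.
intercept_closed <- qlogis(mean(y))

# Negative log-likelihood of an intercept-only logistic model.
nll <- function(b) {
  -sum(y * plogis(b, log.p = TRUE) + (1 - y) * plogis(-b, log.p = TRUE))
}

# Cross-check with a one-dimensional numerical minimizer.
intercept_numeric <- optimize(nll, interval = c(-10, 10))$minimum
```

Starting boosting from a different intercept changes the residuals the first tree sees, which is enough to make fitted models differ between versions even with identical seeds.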

@mayer79
Contributor

mayer79 commented Jan 21, 2025

The fitted models have changed as well. Not sure why, maybe due to random seed changes?

Most likely due to two changes:

  • A new initial estimation using MLE. In older versions, there was no initial estimation; then we used one Newton step uniformly. Then @david-cortes suggested using MLE for selected objectives that have a closed-form solution.
  • Default tree method was changed from approx to hist in 2.0.

Ah, that might be possible, I forgot about the initial estimate (I was already using histogram binning).

Should we draft a compact guide "How to switch to XGBoost 3" and list the most likely changes?

@trivialfis
Member

@mayer79 That sounds like a great idea. When I started looking into mlr3 integration, I worked on it a little bit but had to move on to other things (the checks are only for myself from when I was looking into it):

  • QDM
  • device, remove gpu_hist
  • watchlist -> evals
  • factor
  • Remember the factor encoding
  • Inplace predict
  • reshape param of predict
  • Remove uri inputs.
  • quantile regression

@david-cortes might be able to help fill in the blanks in terms of what to look out for.
