Proposal for new R interface (discussion thread) #9734

Open
david-cortes opened this issue Oct 30, 2023 · 72 comments

@david-cortes (Contributor)

ref #9475

I'm opening this issue to propose a high-level overview of how an idiomatic R interface for xgboost should ideally look, along with some thoughts on how it might be implemented (note: I'm not very familiar with xgboost internals, so I have several questions). It would be ideal to hear comments from other people regarding these ideas. I'm not aware of everything that xgboost can do, so perhaps I'm missing something important here.

(Perhaps a GitHub issue is not the most convenient format for this, but I don't know where else to put it.)

Looking at the Python and C interfaces, I see that there's a low-level interface with objects like Booster and associated functions/methods like xgb.train that more or less reflects how the C interface works, and then a higher-level scikit-learn interface that wraps the lower-level functions into a friendlier API behaving more or less like scikit-learn objects, scikit-learn being the most common ML framework for Python.

In the current R interface, there's a division to some degree into lower/higher-level interfaces with xgb.train() and xgboost(), but the separation is weaker: e.g. both return objects of the same class, and the high-level interface is not very high-level, as it doesn't take the same kinds of arguments as base R's stats::glm() or other popular R packages - e.g. it takes neither formulas nor data frames.


In my opinion, a more ideal R interface could consist of a mixture of a low-level interface reflecting the way the underlying C API works (which ideally most reverse dependencies and very latency-sensitive applications should be using), just like the current Python interface, plus a high-level interface that would make it ergonomic to use at the cost of some memory and performance overhead, behaving similarly to core and popular R packages for statistical modeling. Especially important in this higher-level interface would be making use of all the rich metadata that R objects have (e.g. column names, levels of factor/categorical variables, specialized classes like survival::Surv, etc.), for both inputs and outputs.

In the python scikit-learn interface, there are different classes depending on the task (regression / classification / ranking) and the algorithm mode (boosting / random forest). This is rather uncommon in R packages - for example, randomForest() will switch between regression and classification according to the type of the response variable - and I think it could be better to keep everything in the high-level interface under a single function xgboost() and single class returned from that function, at the expense of more arguments and more internal metadata in the returned objects.

In my view, a good separation between the two would involve:
  • Having the low-level interface work only with DMatrix (which could be created from different sources like files, in-memory objects, etc.), and the high-level interface work only with common R objects (e.g. data.frame, matrix, dgCMatrix, dgRMatrix, etc.), but not with DMatrix (IMO this would not only make the codebase and docs simpler, but also, given that the two carry different metadata, make it easier to guess what to expect as outputs from other functions).
  • Having separate classes for the outputs returned from both interfaces, with different associated methods and method signatures, so that e.g. the predict function for the object returned by xgb.train() would mimic the prediction function from the C interface and be mindful of details like not automatically transposing row-major outputs, while the predict function from the object returned from xgboost() would mimic base R and popular R packages and be able to e.g. return named factors for classification objectives.
    • The plot functions perhaps could be shared, with some extra arguments being required for the low-level interface but forbidden for the high-level interface.
  • Having separate associated cv functions for the low-level and the high-level interfaces (e.g. xgboost.cv() or cv.xgboost(), like there is cv.glmnet(); and another xgb.cv() that would work with DMatrix).
  • Keeping extra metadata inside the object returned by the high-level interface, such as column names, types, factor levels, the actual string of the function call (the way base R does), etc.
  • Exporting from the package only what's necessary to support these two modes of operation alongside with their methods - e.g. not exporting internal functions like xgboost::normalize or internal classes/methods like Booster.handle.
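The two-class separation described above could be sketched with plain S3 dispatch. Everything below is a hypothetical illustration of the proposal, not the current API; all names and signatures are assumptions:

```r
# Hypothetical sketch: the low-level trainer returns an "xgb.Booster", the
# high-level one a distinct "xgboost" class, so S3 dispatch can give each
# its own predict() signature and conventions.

xgb.train <- function(params, dmatrix, nrounds) {
  structure(list(params = params, nrounds = nrounds), class = "xgb.Booster")
}

xgboost <- function(x, y, objective = NULL, nrounds = 100L) {
  booster <- xgb.train(list(objective = objective), NULL, nrounds)
  structure(
    list(
      booster       = booster,
      feature_names = colnames(x),                 # kept as metadata
      y_levels      = if (is.factor(y)) levels(y) else NULL,
      call          = match.call()                 # like base R models do
    ),
    class = "xgboost"
  )
}

# C-style predict for the low-level class, base-R-style for the high-level one:
predict.xgb.Booster <- function(object, dmatrix, ...) "raw, C-style output"
predict.xgboost     <- function(object, newdata, type = "response", ...) "R-style output"
```

With such a split, the high-level predict method is free to return e.g. named factors for classification, while the low-level method stays a thin wrapper over the C API.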
A couple extra details **about the high-level interface** that in my opinion would make it nicer to use. I think many people might disagree with some of the points here though so would like to hear opinions on these ideas:
  • It should have an x/y interface, and perhaps also a formula interface (details on the latter to follow).
  • The x/y interface should accept x in the format of common R classes like data.frame, matrix, dgCMatrix, perhaps arrow (although I wouldn't consider it a priority), and perhaps other classes like float32 (from the float package). I don't think it'd be a good idea to support more obscure potential classes like dist that aren't typically used in data pipelines.
    • If I'm not mistaken, CSR is supported for prediction but not for training, so predict should additionally take dgRMatrix and sparseVector like it does currently.
    • I am not into GPU computing, nor into distributed computing, so perhaps I might be missing important classes here.
    • If supporting Spark in R, I guess the best way would be to have a separate package dedicated to it.
  • (Controversial) It should not only accept but also require different classes for y depending on the objective, and by default, if not passed, should select an objective based on the type of y:
    • factor with 2 levels -> binary:logistic.
    • factor with >2 levels -> multi:softmax.
    • survival::Surv -> survival:aft (with normal link), or maybe survival:cox if only right-censored.
    • others -> reg:squarederror.
    • And consequently, require factor types for classification, Surv for survival, and numeric/integer for regression (taking logical (boolean) as 0/1 numeric).
    • If multi-column response/target variables are to be allowed, they should be required to be 2-dimensional (i.e. no vector recycling), as either data.frame or matrix, and keep their column names as metadata if they had.
  • If x is a data.frame, it should automatically recognize factor columns as categorical types, and (controversial) also take character (string) class as categorical, converting them to factor on the fly, just like R's formula interface for GLMs does. Note that not all popular R packages doing the former would do the latter.
  • (Controversial) It should only support categorical columns in the form of an input x being a data.frame with factor/character types. Note: see question below about sparse types and categorical types.
  • If x is either a data.frame or otherwise has column names, then arguments that reference columns in the data should be able to reference them by name.
    • For example, if x has 4 columns [c1, c2, c3, c4], it should allow passing monotone_constraints as either c(-1, 0, 0, 1), or as c("c1" = -1, "c4" = 1), or as list(c1 = -1, c4 = 1); but not as e.g. c("1" = -1, "4" = 1).
    • OTOH if x is a matrix/dgCMatrix without column names, it should accept them as c(-1, 0, 0, 1), or as c("1" = -1, "4" = -1), erroring out on non-integer names, not allowing negative numbers, and not allowing list with length<ncols as the matching would be ambiguous.
    • If there are duplicated column names, I think the most logical scenario here would be to error out with an informative message.
  • Regardless of whether x has column names, column-vector inputs like qid should be taken from what's passed on the arguments, and not guessed from the column names of x like the python scikit-learn interface does.
  • It should have function arguments (for IDE autocomplete) and in-package documentation for all of the accepted training parameters. Given the amount of possible parameters, perhaps the most reasonable way would be to put them into a separate generator function xgb.train_control() or similar (like C50::C5.0Control) that would return a list, or perhaps it should have them all as top-level arguments. I don't have a particular preference here, but:
    • If parameters are moved to a separate train control function, I think at least the following should remain top-level arguments: objective, nthread, verbosity, seed (if not using R's - see next point below), booster; plus arguments that reference columns names or indices: monotone_constraints, interaction_constraints; and data inputs like base_score or qid. nrounds I guess is most logical to put in the list of additional parameters, but given that it's a required piece of metadata for e.g. determining output dimensions, perhaps it should also be top-level.
    • If using a separate train control function, given that R allows inspecting the function call, it might be better to create a list with only the arguments that were explicitly supplied - e.g. if passing only something like xgb.train_control(eta=0.01), it could return just list(eta = 0.01) instead of a full list of parameters. In this case, it might not be strictly required to know what the default values are for every parameter for the purposes of developing this interface, but it would be nice if they were easily findable from the signature anyway.
  • (Controversial) Not a high-priority thing, but would be ideal if one could pass seed and only rely on R's PRNG if seed is not passed.
  • It should have more informative print/show and summary methods that would display info such as the objective, booster type, rounds, dimensionality of the data to which it was fitted, whether it had categorical columns, etc.
    • I think the summary and the print method could both print the exact same information. Not sure if there's anything additional worth putting into the summary method that wouldn't be suitable for print. Perhaps evaluation metric or objective function per round could be shown (head + tail) for summary, but maybe it's not useful.
  • It should stick to common R idioms and conventions in terms of inputs and outputs - for example, print should not only print but also return the model object as invisible, methods like predict should take the same named arguments as others like predict(object, newdata, type, ...), return an output with one row per input in the data, keep row names in the output if there were any in the input, use name (Intercept) instead of BIAS, etc.
    • A tricky convention in this regard is base1 indexing - there's many places where it's used like interaction_constraints, iterationrange for prediction, node indices from predict, etc. Perhaps the high-level interface could do the conversion internally but I'm not sure if I'm missing more potential places where indices about something might be getting used, and probably more such parameters and outputs will be added in the future.
  • Following up on the point above, the predict function should:
    • Accept a type argument with potential values like "response", "class", "raw", "leaf", "contrib", "interaction", "approx.contrib", "approx.interaction" (not sure what to name the last 2 though).
    • Output factor types for class type in classification objectives.
    • Use column names of x for outputs of contrib/interaction, and factor levels of y as column names from e.g. response in multi-class classification and class-specific contrib/interaction (named as level:column, like base R).
  • Xgboost functions should avoid doing in-place modifications to R objects other than internal classes like DMatrix or handles - e.g. calling the plotting function should not leave the input with an additional Importance column after the call is done.
  • The plotting and feature importance functions should not require passing additional metadata, using instead the names that the input x had.
    • The function for plotting feature importances should in turn not require passing the calculated feature importances first, calculating them internally on-the-fly if not passed, given that it will have all the required inputs already.
  • There should be a function to convert objects between the low-level and the high-level R interface, which in the absence of metadata, could be done by autofilling column names with "V1", "V2", etc. and factor levels with "1", "2", "3", etc.
  • (Controversial) The predict function for the x/y interface should have a parameter to determine whether it should subset/re-order columns of newdata according to the column names that were used in training, and in this case, it should also recode factor levels. Note that this subsetting and reordering is always done for base R's predict.glm, but factors are not re-coded there (so e.g. if the levels in factor columns of newdata differ from the training data, predict.glm would instead output something that might not make sense).
  • Custom objectives and custom metrics for the high-level interface should be passed the same kinds of response variables that were used for input - e.g. a factor instead of the binarized numeric vector that xgboost will use internally.
    • Perhaps custom objectives and metrics could be required to be passed as a list with necessary fields instead of as separate arguments, and required to have extra metadata like whether they are regression/classification, a potential inverse link function for predict that would be kept inside the output, etc. Don't have a strong preference here though, and I expect most users of custom objectives would be using the low-level interface anyway.
  • (Very low priority) Perhaps the high-level interface should allow using base R's "families" (as returned by e.g. stats::poisson(link = "log")) and the structure they involve as custom objectives (like glmnet does), so that one could pass a family-compliant object from other packages as objective. I see an issue here though in that these families are meant for Fisher scoring, which means that in the case of non-canonical link functions like binomial(link = "probit"), they wouldn't calculate the true Hessian function, but I guess Fisher's should work just as well with gradient boosting. Not an expert in this topic though.
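The default-objective selection proposed above (choosing an objective from the class of y when none is passed) could be sketched as follows; infer_objective is a hypothetical helper name:

```r
# Sketch of selecting a default objective from the class of `y`:
infer_objective <- function(y) {
  if (inherits(y, "Surv")) {
    # survival::Surv -> AFT by default (Cox could be chosen when the
    # data is only right-censored)
    "survival:aft"
  } else if (is.factor(y)) {
    if (nlevels(y) == 2L) "binary:logistic" else "multi:softmax"
  } else if (is.logical(y) || is.numeric(y)) {
    # logical is taken as 0/1 numeric
    "reg:squarederror"
  } else {
    stop("No default objective for y of class: ",
         paste(class(y), collapse = "/"))
  }
}
```

The same dispatch would also drive the input validation: once the objective is known, the required class of y (factor, Surv, numeric) follows directly.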

The low-level interface in turn should support everything that the C interface offers - e.g. creating DMatrix from libsvm files, specifying categorical columns from a non-data-frame array, etc.


As per the formula interface, this is quite tricky to implement in a good way for xgboost.

In the case of linear models, it's quite handy to create these features on-the-fly to find out good transformations and e.g. be able to call stepwise feature selectors on the result, but:

  • Stepwise selection from base R doesn't work with xgboost.
  • Transformations like log(x1) have no effect on decision trees; transformations like x1:x2 don't have much practical effect since decision trees implicitly create interactions; and x^2 the way R's formula syntax interprets it makes no difference for decision trees compared to the simpler I(x^2) + x, for example.
  • Other outputs from formulas like contrasts are not used either way.
  • Transformations like log(y) aren't any harder to do with an x/y interface than with a formula interface.

Nevertheless, a formula interface can still be handy for calls like y ~ x1 + x2 + x3 when there are many columns, and packages like ranger offer such an interface, so perhaps it might be worth having one, even if it's not the recommended way to use the package.

Some nuances in terms of formulas though:

  • Formulas can determine whether to add an intercept or not, but xgboost AFAIK doesn't allow fitting without an intercept, and has no use for a column named (Intercept) that would have the same value for every row.
  • Base R's formulas will either do one-hot or dummy encoding for factor/character columns according to whether there is an intercept or not, while xgboost now has native support for categorical columns.
  • Using a formula for processing the training data also implies using it for predict - so for example, formulas do not recode levels of factor variables when calling predict, which the x/y interface could potentially do, leading to differences in behaviors between both interfaces.
  • Some packages have their own formula parsers which allow additional interpretations for something like y ~ x | z beyond what base R would do (for example, lme4 would interpret z here as something that controls mixed effects for x, while base R would interpret it as a feature "x or z"), and in some cases xgboost() would also need a different interpretation of formulas (e.g. for the parameter qid, which doesn't fit on either side of the formula).
  • Packages like randomForest don't use the base R formula parser, taking it instead (by copying the code) from a different library, e1071, which is GPLv2-licensed - incompatible with xgboost's Apache license.
  • Referring to columns that are the result of a formula by index in e.g. monotone_constraints could be tricky - e.g. if we remove the auto-added (Intercept) column, should the numbers re-adjust?
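The intercept-dependent encoding mentioned above is easy to see directly with base R's model.matrix:

```r
# With an intercept, base R dummy-codes a factor (dropping the reference
# level); without one, the factor gets one-hot encoding instead:
df <- data.frame(g = factor(c("a", "b", "c")))

colnames(model.matrix(~ g, df))
#> "(Intercept)" "gb" "gc"     (dummy coding; level "a" is the reference)

colnames(model.matrix(~ g - 1, df))
#> "ga" "gb" "gc"              (one-hot coding, no intercept column)
```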

Hence, supporting a formula interface for xgboost() would be tricky:

  • It could be possible to use base R's formula parser, but:
    • In this case it would not be possible to use categorical features (or their interactions) as such, as they will be either dummy- or one-hot encoded.
    • It theoretically could be possible to forcibly add -1 at the end of the formula (which means "don't add an intercept") by converting it to a string and back, in order to get one-hot encoding of factors and avoid adding (Intercept), but I can foresee cases in which this might not work depending on how the formula is inputted.
    • It would not be possible to specify additional columns like qid in these formulas.
  • It could be possible to develop a custom formula parser that would not one-hot-encode categorical features and e.g. interpret | differently, but what should happen in this case if the user requests something like xnum*xcat or f(xcat) (with f being some arbitrary base function) ?

Unless we can find some other package that would better handle formula parsing and that could be reasonable to use as dependency (I'm not aware of any), I think the best way forward here would be to:

  • Use base R's formula parser.
  • Don't support parameters like monotone_constraints or qid in the formula interface.
  • Don't support native categorical columns, relying instead on the one-hot or dummy-encoding from the formula.
  • Try to remove the intercept by trying to create a new formula <curr formula> - 1 and error out if this doesn't succeed.
  • Let the formula handle all the predict post-processing, regardless of how the x/y interface does it.
  • Encourage users and reverse dependencies to avoid the formula interface for serious usage by being clear about all these limitations in the docs and having a warning message in the "description" section (which comes before the signatures in the doc). We're not in control of how reverse dependencies use it, but if any of the larger ones is tempted to use this formula interface anyway, it could be followed up in their GitHub repos.
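The intercept-removal step listed above could be done with stats::update() rather than string manipulation; drop_intercept is a hypothetical helper name for this sketch:

```r
# Rebuild the formula as `<current formula> - 1` and verify the intercept
# is actually gone (update() can fail to remove it in odd cases):
drop_intercept <- function(f) {
  f2 <- stats::update(f, . ~ . - 1)
  if (attr(stats::terms(f2), "intercept") != 0L) {
    stop("Could not remove the intercept from the formula")
  }
  f2
}

drop_intercept(y ~ x1 + x2)  # y ~ x1 + x2 - 1
```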

A couple questions for xgboost developers (@hcho3 @trivialfis ?) I have here:
  • Does xgboost support passing a mixture of dense and sparse data together? pandas supports sparse types, but if I understand it correctly from a brief look at the source code of the python interface, it will cast them to dense before creating the DMatrix object. If it supports mixtures of both, I think it'd again be ideal if the R interface could also have a way to create such a DMatrix for the low-level interface. Not sure if there'd be an idiomatic way to incorporate this in the high-level interface though, unless allowing passing something like a list.
    • Is there a C-level DMatrix equivalent of cbind / np.c_ ? I don't see any but this would make things easier.
  • Can sparse columns be taken as categorical without being densified? If so, I guess specifying categorical columns of a sparse matrix should also be supported in the high-level R interface.
  • In other interfaces, does passing a data frame of different types always involve putting everything into a contiguous array of the same type? I see there's a C function XGDMatrixCreateFromDT, but if I understand it correctly from a look at the pandas processing functions in the scikit-learn interface, if the input involves types like int64, these get cast to a floating-point type. In R, especially when using data.table, it's oftentimes common to have 64-bit integers from the package bit64, which, if cast to float64, would lose integral precision beyond 2^53 (similarly, int32 can lose precision when cast to float32) - I am wondering if there should be a way to support these without loss of precision, and whether it's possible to efficiently create a DMatrix from a structure like an R data.frame, which is a list of arrays that aren't contiguous in memory and might have different dtypes.
  • Should the R interface keep book on the objectives and their types, or should for example the C interface have functionalities to determine if an objective is e.g. binary classification? As it is right now, it'd be quite easy to do this in R, but I can foresee a situation in which in the future someone for example submits a PR adding a new objective like binary:cauchit, adds it to the python XGBClassifier class, but overlooks the R code as it might be unknown to the contributor, and the R interface then won't act properly on receiving this objective.
  • How does the Python interface keep the list of parameters up-to-date with the C interface? Is it done manually by maintainers? Is there some way in which the R interface could automatically keep in sync with the C one every time a new parameter is added? Or will it need to manually keep a record of every possible allowed argument and its default value?
  • How do other interfaces deal with idiosyncrasies like base0 vs. base1 indexing? I see there's also a Julia interface, for example, which, just like R, uses base1 indexing, but from a brief look at the docs it seems to stick with base0 indexing for xgboost regardless.
  • Since the R object from the high-level interface in this case would have more metadata than what's kept in the serialized C Booster, this in theory could lead to issues when updating package versions and loading objects from an earlier version - for example, if a new field is added to the object class returned from xgboost(), an earlier object saved with saveRDS will not have such a field, which might lead to issues if it is assumed to exist. It'd be theoretically possible to add some function to auto-fill newly added fields as they are added to the R interface when e.g. restoring the booster handle, but this could potentially translate into a lot of maintenance work, and be quite hard to test and easy to miss when adding new features.
    • How does the python scikit-learn interface deal with pickle and object attributes that aren't part of the C Booster? How does it deal with converting between Booster and scikit-learn classes?
    • I see the R interface currently has a big notice about objects serialized with non-xgboost functions like saveRDS not maintaining compatibility with future versions - is this compatibility meant to be left to the user to check? From a quick look at the code, I guess it only checks the version of the C Booster, but there could be cases in which the R interface changes independently of the C struct.
    • If one were to load a C Booster into the high-level interface and there's no metadata to take, I guess the most logical thing would be to fill with generic names "V1..N", "1..N", etc. like base R does, but this would not lead to a nice inter-op with the python scikit-learn interface. Does the python interface or other interfaces keep extra metadata that the Booster wouldn't? Is it somehow standardized?
  • How does xgboost determine position in position-aware learning-to-rank? Is it just based on the row order of the input data, or does it look for something else like a specific column name for it?
  • Do I understand it correctly from the python interface that there's no direct interop with arrow C structs when creating a DMatrix? If so, I guess support for arrow in R could be delayed until such a route is implemented. I guess the python interface would also benefit from such a functionality being available at the C level.

Some other questions for R users (@mayer79 ?):
  • Are there any popular data.frame-like objects from packages related to GPU computing (like cuDF for python) or distributed computing (other than spark) that I might perhaps be missing in this post?
  • Is there any popular R type for learning-to-rank tasks that would involve the qid and position information that xgboost uses? I am not aware of any but I'm no expert in this area.
  • Is there perhaps some object class from Bioconductor or from some other domain (like, don't know, geopositioning from e.g. gdal) that should logically map to some specific xgboost objective without undergoing transformation by the user to e.g. factor / numeric / etc. ?
  • Is there some package offering a custom formula parser meant for decision tree models?
@hcho3 (Collaborator)

hcho3 commented Oct 30, 2023

I am the person who added the section a-compatibility-note-for-saveRDS-save to the manual, so let me answer the question regarding serialization.

How does the python scikit-learn interface deal with pickle and object attributes that aren't part of the C Booster?

Does the python interface or other interfaces keep extra metadata that the Booster wouldn't?

The sklearn interface of XGBoost extracts the necessary key-value pairs from the object attributes, serializes them as a JSON string, and stores that JSON string in the attributes field of the C Booster object.

Admittedly this approach is not foolproof. This is why we ask users to save the model using XGBoost's native serializer (load_model() / save_model()). It must be noted that the base scikit-learn package provides no guarantee of backward compatibility for pickled model objects. Hence, Python users are accustomed to using pickle only for short-term storage, and other formats for long-term storage.

It appears that the expectations are somewhat different in the R community. After seeing lots of bug reports arising from calling saveRDS on old model files (#5794), I had to add the big warning about saveRDS in a separate section a-compatibility-note-for-saveRDS-save.

I see the R interface currently has a big notice about objects serialized with non-xgboost functions like saveRDS not maintaining compatibility with future versions - is this compatibility meant to be left to the user to check? From a quick look at the code, I guess it only checks the version of the C Booster, but there could be cases in which the R interface changes independently of the C struct.

Checking the version of the C Booster is sufficient, because we always bump the version of the native package and the R package in lockstep. For example, when the native package had its version bumped from 1.3.0 to 1.4.0, the R package had its version bumped from 1.3.0.1 to 1.4.0.1.

How does it deal with converting between Booster and scikit-learn classes?

You can extract the Booster class by calling get_booster() method of XGBClassifier / XGBRegressor. To convert Booster to XGBClassifier / XGBRegressor, first persist the Booster to a JSON file and then call load_model().

@trivialfis (Member)

trivialfis commented Oct 31, 2023

Let me try to clear some of the questions before making any comments. ;-)

Does xgboost support passing a mixture of dense and sparse data together?

No, and it's unlikely in the near future. XGBoost mostly works with row-major data with
some limited support for column-major (like CSC, cuDF), but partitioning data by column is
a different story. One of the goals of categorical data support is to eliminate the need for dummy encoding.

Is there a C-level DMatrix equivalent of cbind / np.c_ ? I don't see any but this would make things easier.

No. The DMatrix object doesn't provide any means of data manipulation; it's just an
internal storage class for a machine learning library. Any feature that is not directly related to
ML (like feature engineering) is considered out of scope. There's a slice method for the cv function, which we
would really like to remove. If the input data is a numpy array, use numpy to manipulate it.

Can sparse columns be taken as categorical without being densified?

Yes. Internally, XGBoost marks the maximum coded (like ordinal-encoding) categorical value and
uses it as the number of categories in the respective column. The format of the column is
orthogonal. We want to allow some extra information from users like a user-specified number
of categories in the future, but this should not block the current R interface
development. (we will use the feature_type field)

I am wondering if there should be a way to support these without loss of precision

No. Internally, the decision tree is always split using f32. In the long run, we can probably
support f64 after the removal of binary serialization. But so far it hasn't been a
priority since in most cases some normalization for the data can get us very far.

Should the R interface keep book on the objectives and their types, or should for example the C interface have functionalities to determine if an objective is e.g. binary classification?

You can query the objective using XGBoosterSaveConfig, which is exposed in the current R
interface as xgb.config. It involves some JSON parsing, though. You can cache the result in
an R object during model loading and after model training if it's performance-sensitive.
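For illustration, a minimal sketch of pulling the objective name out of that JSON string. A real implementation would use a proper JSON parser such as jsonlite; the regex below is just to keep the example dependency-free, and the exact nesting ({"learner": {"objective": {"name": ...}}}) is an assumption based on current config output:

```r
# Extract the objective name from an XGBoosterSaveConfig-style JSON string:
get_objective <- function(config_json) {
  hit <- regmatches(
    config_json,
    regexpr('"objective"[[:space:]]*:[[:space:]]*\\{[[:space:]]*"name"[[:space:]]*:[[:space:]]*"[^"]+"',
            config_json)
  )
  sub('.*"name"[[:space:]]*:[[:space:]]*"([^"]+)"', "\\1", hit)
}

get_objective('{"learner": {"objective": {"name": "binary:logistic"}}}')
#> "binary:logistic"
```

Caching this result on the R side at load/training time, as suggested above, avoids re-parsing the JSON on every call.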

How does the python interface keep the list of parameters up-to-date with the C interface?

Mostly by hand, unfortunately. "Mostly" since we have the native interface, which accepts
a dictionary of parameters and there's no need to copy and paste the parameters for the
native interface. The scikit-learn interface requires explicitly listed parameters for
some scikit-learn magic to work (like Python inspect), and as a result, you see a list of
parameters in the skl interface.

Or will it need to manually keep a record of every possible allowed argument and its default value?

No need to explicitly set the default value. Default values are set inside libxgboost
(c++), not passing the key into XGBoost means using the default value. The Python interface
uses None as a value to denote a parameter that should not be passed into libxgboost.
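The same convention maps naturally to R, with NULL playing the role of Python's None; drop_null_params is a hypothetical helper for this sketch:

```r
# Keep only the parameters the user actually supplied; anything dropped
# here falls back to the C++ defaults inside libxgboost.
drop_null_params <- function(params) {
  params[!vapply(params, is.null, logical(1))]
}

drop_null_params(list(eta = 0.1, max_depth = NULL, nthread = 4L))
# -> list(eta = 0.1, nthread = 4L)
```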

How do other interfaces deal with idiosyncrasies like base0 vs. base1 indexing?

I don't know, you will have to reach out for questions. @ExpandingMan :-)

How does the python scikit-learn interface deal with pickle and object attributes that aren't part of the C Booster

We insist that one should not use pickle for long-term storage. An update to the CPython
interpreter itself might make the pickled object invalid due to invalid
instructions. However, we do maintain a best-effort policy in C++ to support it:

  • If the pickle or RDS to be loaded has the same version of the current XGBoost, we load it in full.
  • If it's not the same version, we load only the model and discard all the configuration; see the next answer for what these are.

However, what happens in R/Python is beyond our scope. Please take a look at our
document doc/tutorials/saving_model.rst, or just continue with the following answers,
which should provide some more hints.

How does it deal with converting between Booster and scikit-learn classes?

get_booster, as suggested by @hcho3. In addition, the scikit-learn interface doesn't
have much metadata and can be fully functional after loading a Booster object. Unlike
scikit-learn, XGBoost distinguishes model parameters from meta-parameters. The former are
related to model output and shape, like the objective, number of trees, and number of features,
while the latter are related to training and runtime, like the learning rate, number of threads,
etc. load_model makes sure the former is complete and discards the latter after
training. We don't need the learning rate after training, and the number of threads should be
discarded since we want the model to be portable across machines.

I see the R interface currently has a big notice about objects serialized with non-xgboost functions like saveRDS not maintaining compatibility with future versions

Hopefully, the previous two answers clear this up as well.

Does the python interface or other interfaces keep extra metadata that the Booster wouldn't

Yes, the Python scikit-learn interface uses attributes. But it's not important; we can remove it
anytime we want, as it's legacy. As mentioned previously, we care only about the model, not the
hyper-parameters. As long as the loaded model can do inference properly (hence
contains the full model), we are somewhat satisfied.

Is it somehow standardized?

No. It took some time to get where we are, gradually formalizing what should be
saved. Please take a look at our document: doc/tutorials/saving_model.rst.
save_model and load_model are guaranteed to work across bindings; other custom fields
are not.

If you think a new field should work across bindings, please share, we can always make
additions. A recent example is feature_types, which is saved into the JSON model for
categorical data support.

How does xgboost determine position in position-aware learning-to-rank? Is it just based on the row order of the input data, or does it look for something else like a specific column name for it

At the moment, it's the row order; we can add extra fields later. It's a new feature in
the recently announced 2.0, and we are still hoping to collect feedback. Position-aware
LTR is still a hotly researched area and nothing is settled.

Do I understand it correctly from the python interface that there's no direct interop with arrow C structs when creating a DMatrix

I removed it very recently (if something is missing, I'm the one to blame. -.-). I'm
familiar with the internals of Arrow for both CPU and GPU memory due to a different
project. It chunks data for each column non-uniformly and has layered buffers. It's quite
difficult for XGBoost to support such a data format without first performing a concatenation
of all the chunks. It's probably a good design for data management frameworks like
Spark/Dask, but not ideal for a traditional ML algorithm. On the other hand, if we
perform a concatenation, there's no point in using Arrow's C interface; we can just
pass the memory buffer's address. We have one good example of how a contiguous (chunk-free)
Arrow-based table can be passed into XGBoost using cuDF: it dumps out the memory
addresses of each contiguous column as __cuda_array_interface__, and XGBoost can
consume them with zero copy for inplace-predict and QuantileDMatrix.

@trivialfis
Copy link
Member

trivialfis commented Oct 31, 2023

Comments unrelated to questions. I will update this comment when I can put together some more thoughts.

If I'm not mistaken, CSR is supported for prediction but not for training

CSR is well supported for all purposes.

Formulas can determine whether to add an intercept or not, but xgboost AFAIK doesn't allow fitting without an intercept

Depending on the use case, one can fix the intercept to a constant value (like 0) or let xgboost find one based on MLE.
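A minimal sketch of the first option, assuming the standard `base_score` parameter is what fixes the intercept (shown only as a parameter list; training is omitted):

```r
# Fixing the intercept at 0 rather than letting xgboost estimate it;
# this list would be passed as the params argument to training
params <- list(objective = "reg:squarederror", base_score = 0)
print(params$base_score)  # 0
```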

@david-cortes
Copy link
Contributor Author

Thanks for the answers.

So it sounds like a good way to approach serialization and transfers between interfaces would be to keep as many attributes as possible inside the C booster, have a function that default-initializes them, and then fill them with info from the booster, if available, at the moment it gets deserialized. This unfortunately wouldn't work for things like R formulas, so in such cases one would strictly need to use saveRDS (as opposed to xgb.save).

As a preliminary step, I think it would make sense to extend the current optional attributes that are kept inside the C booster to include the categories of both y and of categorical columns of X.

Scikit-learn currently accepts pd.Categorical and uses the categories when passed as y in some estimators. It currently doesn't keep track of categories for categorical columns of X, but it has slowly been moving in the direction of leveraging more and more pandas functionality in its estimators and transformers.

I think it could even be handy to keep the levels of categorical features as part of the metadata when using the scikit-learn interface and passing an X that has pandas categorical types, even if they are not currently used (but perhaps they could later be used as column names in outputs from the scikit-learn interface?).

(I've opened a separate issue for the y categories: #9739)

Note that this might not be as straightforward as it sounds: pandas allows any arbitrary type to be used as categories, not just strings, whereas R only allows strings, so perhaps there could be the further restriction that categories are coerced to strings.

@ExpandingMan
Copy link

I didn't bother to read most of this thread, but I was pinged concerning 1-based indexing, as I am one of the maintainers of the Julia wrapper, Julia being a 1-based indexed language.

Nearly all Julia packages, whether wrappers or not, stick with purely 1-based index semantics. In the case of wrappers like XGBoost.jl, this simply means that the wrapper functions need to translate the indexing. There is a relatively low-level direct wrapper of the libxgboost C library that does not do this, but it is considered private to the wrapper package.

but from a brief look at the docs it looks like it sticks with base0 indexing for xgboost regardless.

I'm not sure what you're referring to, but no, that's not the case. The aforementioned low-level C wrapper should not show up in the docs; libxgboost is small enough that the intention of XGBoost.jl is that nobody should ever have to use it directly.
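The index translation described above could be sketched in R terms as a thin shim between the user-facing 1-based indices and the C API's 0-based ones (the function names here are hypothetical, not part of any actual binding):

```r
# Hypothetical shim: users pass 1-based indices, the C API wants 0-based
to_c_index   <- function(idx) as.integer(idx) - 1L
from_c_index <- function(idx) as.integer(idx) + 1L

print(to_c_index(c(1L, 5L)))    # what would be passed to the C API: 0 4
print(from_c_index(c(0L, 4L)))  # what the user would see back: 1 5
```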

@trivialfis
Copy link
Member

keep as many attributes as possible inside of the C booster

If the object is R-specific, and R does such a good job of keeping saveRDS portable across versions (in contrast to Python's pickle), we can let saveRDS do its magic instead of saving them into the booster, right? The best code is the code that we don't have to maintain and yet works. ;-) Fun aside, I'm fine with either approach though; I will leave it to you to decide.

I'm not sure what you're referring to, but no, that's not the case.

Thank you for clearing that up @ExpandingMan ! That helps a lot.

@david-cortes
Copy link
Contributor Author

@mayer79 would be nice to hear some comments about the proposed interface here, especially around handling of formulas.

@mayer79
Copy link
Contributor

mayer79 commented Nov 5, 2023

@david-cortes : I was a couple of days afk and will give my feedback soon. Thank you so much for starting this thread. We might get some inspiration from https://github.com/MathiasAmbuehl/modeltuner (no possibility to select how to encode categoricals - it just uses OHE), but otherwise it looks neat.

@mayer79
Copy link
Contributor

mayer79 commented Nov 10, 2023

Curiously, I don't know the answers to the questions directed at me. But I think they are less important.

My current mood:

  1. Most of what has been suggested above means: rewrite the xgboost() high-level API and add a corresponding xgboost.cv() function. We could do this in the current package, without the need to maintain a second {xgboost2} package.
  2. Categorical support should only become the default once SHAP values are available for it. The {shap} library has almost as many PyPI downloads as {xgboost}. Actually, I think that almost every new functionality should also keep SHAP working.
  3. We can use a simple formula parser, see code below. No need for interactions, transformations etc. To control the categorical encoding of factors, we can use an argument encode_factors = c("integer", "target", "ohe"). Maybe this needs to be vectorized, so that users could specify encode_factors = c(Species = "integer", ...) to use different encodings per discrete feature.
  4. I actually like {ranger}'s interface: It accepts either formula + data or x + y. (I don't like its formula parser though).
  5. Monotonicity and interaction constraints could be passed as column names only, like: monotone_inc = c("Sepal.Width", ...), monotone_dec = c(...), interaction_constraints = list(c("Sepal.Width"), ...).
  6. In the high-level API, we would need to subset feature columns so that tibbles and data.tables are fine. This is usually not complicated.

Formula parser (dropping all transformations), just extracting column names:

formula_parser <- function(form, data) {
  form <- as.character(form)  # c("~", "<lhs>", "<rhs>")
  y <- form[2L]
  # Rebuild the RHS and let terms.formula() expand ".", then collect names
  all_vars <- all.vars(
    stats::terms.formula(stats::reformulate(form[3L]), data = data[1L, ])
  )
  list(y = y, x = setdiff(all_vars, y))
}

formula_parser(Sepal.Width ~ ., data = iris)
# $y
# [1] "Sepal.Width"
# 
# $x
# [1] "Sepal.Length" "Petal.Length" "Petal.Width"  "Species"

@david-cortes
Copy link
Contributor Author

Regarding point (1), under the proposal here, some of the implications are that:

  • xgboost() would return a different class from xgb.train() (like in the python package, where the scikit-learn class is different from the Booster class), which in turn means objects returned from it would have a different predict method with a different signature - i.e. if a reverse dependency uses xgboost() and calls predict with some argument, it would lead to breakage.
  • Some of the exported package methods would disappear (as they'd become internal), which again would lead to breakage if libraries rely on them - for example, a library might check to which xgboost model class does an object belong and try to convert from one to another (handle -> booster) internally, which would not be necessary anymore under the proposal, but would also lead to breakage in said dependency.
  • Some of the current methods (like plotting) might work differently depending on whether the input came from xgboost() or xgb.train(), which could potentially lead to breakage.

Thus, it might not be necessary to create a new package, but then it'd be necessary to work with reverse dependencies to adapt.


Regarding point (2), I do not think that withholding categorical support for now would be beneficial. At some point, the core xgboost library will have wider support for things that involve categorical variables and will hopefully have it enabled by default. We haven't even started work on the R interface yet and we don't know when it will be ready.

Also, in my view, I would not consider SHAP support as a blocker - there's plenty of use-cases for xgboost that benefit from better handling of categorical features and which do not involve calculating SHAP values; and if needed, an external library could be used for it even if it might not be optimal.


Regarding the proposed parser for formulas: I do not feel that such an interface would bring value to users if it's only used to select variables.

In my opinion, one of the most useful aspects of formulas is that they can act as quick data preparation pipelines, which are helpful when e.g. prototyping models in an interactive session - for example:

  • y ~ I(count_of_A + count_of_B + count_of_C) + I(measure1/measure2) + ...
  • y ~ height + log(weight) + I(mass/height^2)
  • log(y) ~ .
  • I((var1 - var2)/var3) ~ abs(var4) + var5 + ...

In these cases, the extra processing would be lost, and the selection of variables could be achieved just as easily through e.g. dplyr::select or data.table[,.(...)] and passed to an x/y interface instead.
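For instance, the pure-selection use of a formula is matched by ordinary subsetting before calling an x/y interface (base R shown here; dplyr::select would be equivalent):

```r
# Equivalent of "Sepal.Width ~ ." used purely for variable selection
y <- iris$Sepal.Width
x <- iris[, setdiff(names(iris), "Sepal.Width")]
print(names(x))
```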

What's more, if the formula were to be used to one-hot-encode categorical variables, it may be less efficient than what a user could achieve by e.g. using a sparse encoder, when appropriate, instead of base R's dense one.

Also, choosing the way of handling categorical variables like that would then raise the question of what happens when parameter max_cat_to_onehot is supplied, and I imagine that more such parameters might appear in the future (like cat_smooth in lightgbm).

As such, I am thinking that if a formula parser is too complex to implement and there isn't any readily available parser with a compatible license, it might be better to not develop a formula interface, or maybe leave it for a future version.

@trivialfis
Copy link
Member

A side note, we support SHAP values with categorical data for the original SHAP algorithm (including interaction). The issue is the plotting library needs some additional work to read the model properly. We can work on the SHAP package after clearing up some priorities.

@mayer79
Copy link
Contributor

mayer79 commented Nov 12, 2023

@trivialfis Nice! So we indeed need categorical support in R at a higher priority. In R, shapviz already works with categoricals, e.g., for LightGBM. So at least the plotting of SHAP values is ready in R.

@david-cortes
Copy link
Contributor Author

david-cortes commented Nov 14, 2023

I just realized that the ALTREP system in newer R versions can be used (albeit in a rather hacky way) to manage custom serialization of arbitrary objects, provided that the user doesn't start interacting with and modifying those objects through R functions.

Would be ideal if xgboost handles could be made to only trigger serialization when calling a method like saveRDS, instead of always materializing the serialized bytes on every model creation call; and to automatically re-create the C++ object and handle when calling readRDS, without having to call xgb.Booster.complete - like the python package does with pickle's get/set state.

Here I've composed a small example of how to do this with externalptr objects:
https://github.com/david-cortes/altreppedpointer
(so far, haven't found any R package using this sort of mechanism though)

Note that this will require changing the class and the structure of the current handles, so a first step would be to make these classes internal-only as mentioned in the first message.

@trivialfis
Copy link
Member

trivialfis commented Nov 16, 2023

I just realized that the ALTREP system in newer R versions can be used

Thank you for sharing! That's quite interesting! The flexibility is a bit scary though, probably just due to my very limited understanding. I can learn more about the framework. I think if we were to use it, some documentation for future developers would be nice.

Would be ideal if xgboost handles could be made to only trigger a serialization when calling a method

Indeed, that would be extremely helpful in keeping the states consistent.

As for error handling, I think the only concern is that XGB is not great at forward compatibility: loading a new model with an older version of XGB might result in errors (the C API returning -1). OOM is a bit more tricky on lazily-allocating OSs like Linux-based ones, where the exception might not be caught by XGB. But I guess it's a general issue for all packages with C++.

@david-cortes
Copy link
Contributor Author

david-cortes commented Nov 26, 2023

As we haven't received any further comments about these proposals, guess it's reasonable to start with a plan and checklist of things to do based on the current proposal.

As I see it, the following would be a high-level overview of what needs to be done:

  • (High priority) Create an easy-to-use script to create an R package or to otherwise re-arrange files in the kind of folder structure that R uses for packages. Currently, there's a python script to do this which is used in the CI jobs, but it seems to be very problematic to use outside of the docker containers from those jobs - e.g. it won't overwrite files if there are any, doesn't seem to be simply callable from a folder above (where it will generate .pyc files which it will complain about overwriting), etc. Having something like the old makefile script to build an R package would certainly make things easier and faster.
  • Enable parameters and features for the R interface which are currently disabled:
    • (High priority) Parameters for categorical features ("enable_categorical" and "feature_types" - not sure if the latter is just a matter of allowing it in the JSON parser, or whether it additionally needs calls to e.g. XGDMatrixSetStrFeatureInfo or similar).
    • Fitting a model from a DMatrix created from a CSR matrix (currently errors out for fitting but not for predicting).
    • Other potential things that might have been disabled, such as passing a random seed (very low priority).
  • Enable multi-output input labels and predictions.
  • (High priority) Add a mechanism to create a DMatrix from an R data.frame. I guess the most efficient route here would be to call the C function XGDMatrixCreateFromDT, but I get the impression from the python interface that this is not the method that ends up getting used when constructing from a pandas DataFrame, so I want to ask if this is the recommended way for a data structure like R's (basically a list of arrays which are not in one single contiguous memory block).
  • Add a mechanism to create a DMatrix object from arrow objects. Not sure how feasible this is though, and I'd say this is very low priority.
  • Add an interface to create QuantileDMatrix objects from R, accepting the same kinds of inputs as DMatrix (data.frame, matrix, dgCMatrix, dgRMatrix, arrow if implemented, maybe float32). I'd also consider this one very low priority though, and it's probably quite a lot more code than the DMatrix-related tasks.
  • Switch the current DMatrix creation function for R matrices towards the C function that uses array_interface.
    • In this case, the cv function could also be switched to use this same interface but with a pointer offset for matrices instead of using DMatrix slice methods. I got from the previous messages that you guys wanted to remove the DMatrix slice methods, so if this R usage form is the only thing keeping them, perhaps they could also be eliminated afterwards. Actually this is wrong, given that the indexes will not always reflect memory-contiguous slices of a single shuffle.
  • Switch the predict function for the current booster to use inplace predict or other more efficient DMatrix creators (like XGDMatrixCreateFromDT mentioned earlier) when appropriate.
  • Remove all the public interface (functions, docs, tests, examples) around the Booster.handle class, as well as the conversion methods from handle to booster and vice-versa, leaving only the booster for now.
  • After the task above is done, switch the handle serialization mechanism to ALTREP and remove xgb.Booster.complete (which wouldn't be needed anymore).
    • Perhaps a similar serialization method could be added for DMatrix objects too, which I guess is just a matter of calling XGDMatrixSaveBinary, but I want to ask if that's all it takes or whether it needs to e.g. keep the original R data objects (like matrix) and restore pointer addresses to that data or similar, as I see that the python class DMatrix doesn't have __getstate__/__setstate__.
    • Note that this requires R >= 4.3, so the CI jobs will need to be modified and older R versions dropped.
  • Remove the current xgboost() function, and remove the calls from all the places it gets used (tests, examples, vignettes, etc.).
  • (High priority) After support for data.frame and categorical features is added, then create a new xgboost() function from scratch that wouldn't share any code base with the current function named like that, ideally working as a higher-level wrapper over DMatrix + xgb.train but implementing the kind of idiomatic R interface (x/y only, no formula) described in this thread, either with a separate function for the parameters or everything being passed in the main function.
    • It should return objects of a different class than xgb.train (perhaps the class could be named xgboost).
    • This class should have its own predict method, again with a different interface than the booster's predict, as described in the first message here.
    • If this class needs to keep additional attributes, perhaps they could be kept as part of the JSON that gets serialized, otherwise should have a note about serialization and transferability with other interfaces.
    • This is probably the largest PR in terms of code (especially tests!!), so might need to be split into different batches. For example, support for custom objectives could be left out from the first PR.
  • After the new xgboost() x/y interface gets implemented, then modify other functions to accept these objects - e.g.:
    • Plotting function.
    • Feature importance function.
    • Serialization functions that are aimed at transferring models between interfaces.
    • All of these should keep in mind small details like base-1 indexing for tree numbers and similar.
  • Create examples and vignettes for the new xgboost() function.
  • Perhaps create a higher-level cv function for the new xgboost() interface. I'd say this is low priority though.

Other topics from the earlier issue which aren't mentioned in the list above:

  • External memory with DataIter.
  • Quantile regression with multiple quantiles.
  • Use CMake instead of autotools when possible, this way we can make it easier to support GPU with R.
  • Distributed training, perhaps integration with RSpark.
    • This one I'm not sure would make sense to put in the same package as xgboost, and I'd also consider it a lower priority than the others.

Would like to hear some comments about it, particularly around DMatrix-related topics, and would be very helpful if xgboost maintainers could work on the first issue in this list.

After that, guess we can create a roadmap issue to keep track of progress and discuss about plans for working on these issues from different people. I think there's a github functionality for it but I've never used it and don't know how to so perhaps maintainers could help here.

@trivialfis
Copy link
Member

trivialfis commented Nov 26, 2023

Thank you for writing a comprehensive list! I'm really excited for the proposed changes here. Let me try to comment on some of the items considered to be high priority:

Create an easy-to-use script to create an R package or to otherwise re-arrange files

One can run R CMD INSTALL . under the package directory; devtools can load the package from a path as well. The script is not necessary for installing from source. I think it's possible to suppress the Python compilation. Lastly, if we were to rewrite it due to the use of Python, it would be great if we could use R instead of a makefile or shell.

Parameters for categorical features

If memory serves, the call to set types is already implemented in the current code base. We need additional support for recognizing factors automatically. I can double-check later and share how to set the types.

Fitting a model from a DMatrix created from a CSR matrix

I will debug it, do you have a specific example?

Add a mechanism to create a DMatrix from an R data.frame.

I will work on it, but would like to have some feedback from you as well. For example, is there any difference in the underlying arrays between a table and a frame?

After that, guess we can create a roadmap issue to keep track of progress and discuss about plans for working on these issues from different people.

Feel free to pick whichever fits your workflow best. One can use to-do items, multiple issues, or a GitHub project to track progress. Tip: this is a to-do item

  • to do.

I think there's a button in GitHub to split an issue with multiple items into multiple issues. I can help create a classic GitHub project to keep track of issues if needed. Feel free to ping me.

@david-cortes
Copy link
Contributor Author

One can run R CMD install . under the package directory, devtools can load the package from path as well.

Thanks for the hint. Can confirm that R CMD INSTALL . from R-package/ works as expected under the new build system, as do RStudio's analog which uses devtools and the variants for running tests and building docs.

I will debug it, do you have a specific example?

Actually, this seems to have been already fixed when creating a DMatrix first, but this will still error out:

library(xgboost)
library(Matrix)
data("mtcars")
y <- mtcars$mpg
x <- mtcars[, -1]
x.sp <- as(as.matrix(x), "RsparseMatrix")
model <- xgboost(data = x.sp, label = y)  # errors out on the CSR input

Nevertheless, since we want to remove the code behind this function, I guess it's not worth fixing at this point.

I will work on it, but would like to have some feedbacks from you as well. Like is there any difference in the underlying arrays between table and frame?

There is no difference in the arrays: both are just a list (an R generic vector, which in C has the internal type VECSXP and whose elements are retrieved through VECTOR_ELT) containing other SEXP objects whose data pointers are retrievable through e.g. REAL(x), INTEGER(x), etc. Note that there's also a popular package, bit64, that provides int64_t arrays.

Objects from data.table do have some differences though in that they can contain lists of lists and other types, but those are anyway not supported by xgboost; and in other small details which shouldn't matter for DMatrix such as potentially having hashes for some column or not having row names.

In terms of creating a DMatrix from a data.frame, I guess the most reasonable route would be to only accept columns of types numeric + integer + factor + bit64::integer64, potentially leaving character for the high-level interface. Note that float::float32 cannot be put inside a data.frame, unlike bit64.
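The proposed restriction could be sketched as a simple validation step over the columns (the helper and column names are illustrative only, not an actual xgboost API):

```r
# A data.frame is internally a list of column vectors; check each one
supported_column <- function(col) {
  is.double(col) || is.integer(col) || is.factor(col) ||
    inherits(col, "integer64")  # from the bit64 package, when available
}

df <- data.frame(num = c(1.5, 2.5), int = 1:2, cat = factor(c("a", "b")))
ok <- vapply(df, supported_column, logical(1))
print(all(ok))  # TRUE
```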

I will create a separate roadmap issue with checklists to discuss work on these issues and progress.

@david-cortes
Copy link
Contributor Author

A couple things I am wondering:

A. xgboost supports the sorted indices splitting algorithm (tree_method="exact").

Under this algorithm, infinite values should simply be treated as greater or smaller than every other value, and since the raw values aren't used in any other statistic, it should be possible to fit a model with them.

But it seems that xgboost will not accept infinite values in DMatrix construction regardless. Are infinite non-missing values never accepted in input features?


B. I see that there are functions like XGBoosterPredictFromDMatrix for which part of the output is a pointer to an array of values. In the R interface, it doesn't look like the pointer that gets passed and filled inside the call would be freed after.

Does that kind of function allocate arrays inside of the booster pointer? If so, does it mean that predicting on larger inputs makes the model heavier in memory?


C. I see there's some parallelization in the R wrapper for things like converting signed integers to unsigned.

Part of the current operations includes retrieving a pointer to the data in every iteration of the loop, but if that part were removed by pre-retrieving the pointer, would that parallelization actually result in faster operations?

As far as I understand, for a regular amd64 platform, that sort of operation should be so fast that it's basically limited by the speed at which memory is retrieved from RAM, and having multiple threads retrieving from RAM shouldn't make it faster.

@trivialfis
Copy link
Member

But it seems that xgboost will not accept infinite values regardless in the DMatrix construction

I disabled the inf input due to a lack of tests and other algorithms running into issues; to properly support it, we need test cases for all potential uses, including things like SHAP, categorical features, etc. At the moment, it's not a very high priority. A use case would help.

it doesn't look like the pointer that gets passed and filled inside the call would be freed after

It's using thread-local static memory as a return buffer for thread-safe inference. The system will free it.

static thread_local int myvar;

As far as I understand, for a regular amd64 platform, that sort of operation should be so fast

I agree. Feel free to remove them.

but if that part were to be removed by pre-retrieving the pointer

I think we can entirely remove the data copies by using the array interface, which is also supported by SetInfo. As a bonus, it saves memory as well, thanks to avoiding the allocation of a new buffer. Memory usage has been a huge headache, since some users are quite used to deep-learning-like batch training and find it frustrating that gradient boosting needs the entire dataset in memory to be efficient. Most of the type-specific C APIs are legacy now.

@david-cortes
Copy link
Contributor Author

david-cortes commented Nov 30, 2023

Another question: are there any guidelines for using SIMD code in xgboost?

I see, for example, there's a function in the R wrapper that subtracts 1 from elements in an array:

idxvec[i] = INTEGER(idxset)[i] - 1;

I was thinking of adding #pragma omp simd, but was wondering if perhaps xgboost includes some SIMD module with dynamic dispatching per CPU instruction set or similar.

@trivialfis
Copy link
Member

No, we don't have those. Feel free to add it.

I thought about having an aligned allocator, but dropped the work as my attempt at using SSE did not speed up the histogram kernel. Another attempt was to use it to speed up allreduce on the CPU; the difference was quite insignificant in my benchmark, so it was dropped again.

@trivialfis
Copy link
Member

trivialfis commented Dec 1, 2023

HI @david-cortes , are you familiar with the error from CRAN check caused by calling error/Rf_error:

  xgboost_R.cc:70:3: warning: format string is not a string literal (potentially insecure) [-Wformat-security]

I'm not sure how to fix it without removing the use of the error function.

https://cran.r-project.org/web/checks/check_results_xgboost.html

Update
I will try Rf_error("%s", msg) with rhub.

@jameslamb
Copy link
Contributor

Rf_error("%s", msg) worked for us over in LightGBM when faced with the same warning:

microsoft/LightGBM#6216

We have a CI job there that replicates the Debian clang r-devel build from CRAN.

@david-cortes
Copy link
Contributor Author

Yes, that seems to be the usual clang formatting warning, which is solved by using "%s", <str> instead of just <str>. That %s is also the solution being proposed for Rcpp: RcppCore/Rcpp#1287

@david-cortes
Copy link
Contributor Author

@trivialfis In the previous thread, you mentioned this as one of the desired things to implement for the R interface:

quantile regression with multiple quantiles.

Is this supported in the python interface? Looks like it errors out if you try to pass multiple quantiles.

If this is supported in python, could you provide an example of what this means?

@david-cortes
Contributor Author

Yet another question: is there something currently preventing using field feature_type in the R interface?

Something like this seems to work right now, but I don't know if the feature types are actually getting used:

library(xgboost)
data(iris)
y <- iris$Sepal.Length
x <- iris[, -1]
x$Species <- as.numeric(x$Species)
x <- as.matrix(x)
dm <- xgb.DMatrix(data=x, label=y)
setinfo(dm, "feature_type", c("q", "q", "q", "c"))
model <- xgb.train(
    data=dm,
    nrounds=5
)
pred_mat <- predict(model, dm)

@david-cortes
Contributor Author

A couple perhaps stupid questions that I'm wondering after issue #9869 and which I would like to somehow add to the docs:

  • How is base_score used? Is this something that gets added to all the rows in what predict calls the "base_margin"? Or does it undergo some other process? Is there documentation around it?
  • How does xgboost initialize base_score for objectives like binary:logistic? I had the impression that it would take the value that minimizes the objective function when the prediction is constant, like lightgbm does (so e.g. for logistic, this would be log(sum(y==1)) - log(sum(y==0))), but from a quick test with the agaricus data, this doesn't seem to be the case.
    • For large datasets, it seems to set base_score to the mean of label even for objective binary:logistic, while for smaller datasets, it seems to be close to the mean of label, but still off by some amount.
  • Does base_score (or otherwise the intercept if this isn't it) get updated with each tree that gets added, or is it always set to what was estimated initially?
    • From some tests on my end, it does seem to remain constant over iterations.

@trivialfis
Member

Let me add documentation instead of writing it in GH comments; it's always nice to have fresh eyes catch gaps in the documents.

@trivialfis
Member

It seems CMake is bundled with rtools43 for Windows? I found this while wrangling with MSVC:

C:\rtools43\usr\lib\mxe\usr\x86_64-w64-mingw32.static.posix\bin\cmake.exe

@david-cortes
Contributor Author

I'm not sure if RTools itself bundles CMake, but it should be possible to have it as a dependency under 'SystemRequirements' and then CRAN will ensure that it is available when it produces its pre-compiled binaries.

There's a section about CMake in the R extensions manual, but I'm not sure what the right way is to get an R package to use CMake:
https://cran.r-project.org/doc/manuals/R-exts.html#Using-cmake

@trivialfis
Member

trivialfis commented Jan 5, 2024

Thank you for sharing! I appreciate the reference. As long as CRAN is comfortable with the use of CMake, I'm sure there's a way to sort out all the technical details.

@jameslamb
Contributor

jameslamb commented Jan 5, 2024

I'm not sure what the right way is to get an R package to use CMake

The CRAN mirror on GitHub is also a good reference for this.

https://github.com/search?q=cmake+org%3Acran+path%3ADESCRIPTION&type=code

Everything you see on there has been accepted to CRAN and met its requirements. Those might not be the best ways to use CMake in a package, but they at least provide some precedents of approaches that CRAN accepted and which pass the CRAN checks.

@trivialfis
Member

I happened to see this thread today about the use of CMake: https://www.mail-archive.com/[email protected]/msg09395.html

@david-cortes
Contributor Author

@trivialfis You previously mentioned that you wanted to remove support for predict and for DMatrix constructors from svmlight files.

I see there is also support for CSV (which I assume is also meant to be removed) and for a binary DMatrix format.

I am wondering how these plans would play out with the binary format for DMatrices, which I assume is not meant to be removed. They'd still need to accept string/character types as constructors, expand user paths, make sure files exist, etc.

Is the idea to remove function XGDMatrixCreateFromURI? I see the older one XGDMatrixCreateFromFile is marked as deprecated, but the new one doesn't mention anything about being deprecated.

@trivialfis
Member

I don't think I mentioned I want to remove the predict method? Maybe there's a miscommunication?

Yes, the text file reading capability will be deprecated, including CSV.

No, the binary format will not be removed at the moment; we don't have any plan to change it yet.

We haven't been able to deprecate the URI loading yet since the work-in-progress federated learning is using it at the moment and there's no alternative. Since R doesn't have such legacy, I thought it might ease the task if we didn't need to implement it in the DMatrix class.

Initially we will just deprecate the text loading, but the function will continue to be useful for loading the binary format. So, yes, many of the file path routines will continue to be needed. On the other hand, if you think the high-level interface is going to take over and there's no need for a user to interact with a normal (non-external-memory) DMatrix, then feel free to remove the binary format in R. I don't have a strong opinion that it must be present in R.

@david-cortes
Contributor Author

I don't think I mentioned I want to remove the predict method

It currently supports all DMatrix constructor inputs, so it also accepts files as inputs, and has docs around handling of file inputs.

feel free to remove the binary format in R

I'd prefer to keep the functionality for saving DMatrices in binary format if it's not deprecated.

Yes, the text file reading capability will be deprecated, including CSV.

Then I think it should suffice to just change the current reader to the URI reader, and then once the C functions remove or deprecate this functionality, it will be applied to the R interface without needing to change anything.

@trivialfis
Member

It currently supports all DMatrix constructor inputs, so it also accepts files as inputs, and has docs around handling of file inputs.

That's unfortunate. Let's remove it.

it will be applied to the R interface without needing to change anything.

Yes.

I will be on holiday next week for roughly two weeks, would really appreciate it if @hcho3 could continue the discussion around R here. Please let me know if there's anything I can help within the next couple days.

@david-cortes
Contributor Author

That's unfortunate. Let's remove it.

In that case, let's continue this after the PRs for in-place prediction and the URI loader have been merged.

@david-cortes
Contributor Author

@trivialfis In the other issue you mentioned:

Documentation and unified tests for 1-based indexing.

What kind of tests did you have in mind for this topic?

@trivialfis
Member

Any unit test that can show the behavior as described in the docs should work. I take tests as a form of documentation.

@david-cortes
Contributor Author

Any unit test that can show the behavior as described in the docs should work. I take tests as a form of documentation.

Aren't there already tests for most of those? The only things I can think of from the list that are currently missing are tests checking categorical features in text/JSON/table dumps. The rest (e.g. iteration numbers, class encodings, etc.) should already be covered by the current tests.

@david-cortes
Contributor Author

That's unfortunate. Let's remove it.

On a deeper look, I noticed that removing on-the-fly DMatrix creation in predict would lead to more inconsistencies, given that the R predict method doesn't differentiate between DMatrix prediction and in-place prediction like the python Booster class does.

There are cases like the gblinear booster that work only with a DMatrix, and cases like dart that may or may not work with non-DMatrix inputs depending on the rest of the parameters. This can lead to very unexpected outcomes if no conversion is done inside predict - for example, the same predict method may or may not accept a matrix depending on which parameters xgb.train received, which I don't think is desirable.

@david-cortes
Contributor Author

@hcho3 @trivialfis I see in the AFT tutorial:
https://xgboost.readthedocs.io/en/stable/tutorials/aft_survival_analysis.html

It mentions that left-censored data should be passed as having a lower bound of zero.

Does it support data that is censored at exactly zero or less than zero? If so, how should it be passed? From some experiments, it seems to accept -Inf as a lower bound, but predictions never go below zero.

I'm particularly wondering if survival:aft could be used for a dataset like Tobin's which is typically used in R examples (like ?survival::tobin), where the censoring is left-sided and at exactly zero.
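For reference, here is how I understand the tutorial's convention would translate to R for a Tobin-style dataset. This is only a sketch under my reading of the docs (the data and the choice of bounds here are hypothetical, and the censoring-at-zero question above still applies):

```r
library(xgboost)
# Hypothetical left-censored AFT setup: an exactly observed time t gets
# bounds [t, t]; an observation left-censored at t gets [0, t] per the
# tutorial (whether [-Inf, t] is also valid is the open question above).
set.seed(123)
x <- matrix(rnorm(100 * 3), ncol = 3)
t_obs <- abs(rnorm(100)) + 0.1
censored <- runif(100) < 0.3
dm <- xgb.DMatrix(data = x)
setinfo(dm, "label_lower_bound", ifelse(censored, 0, t_obs))
setinfo(dm, "label_upper_bound", t_obs)
model <- xgb.train(
    params = list(objective = "survival:aft"),
    data = dm,
    nrounds = 5
)
```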

@hcho3
Collaborator

hcho3 commented May 27, 2024

XGBoost expects the survival time to be nonnegative, since it's not in the log scale. I think a 0 should be acceptable.

@david-cortes
Contributor Author

david-cortes commented May 28, 2024

@trivialfis Something that's not quite clear to me after reading the docs about XGBoosterPredictFromDMatrix: what is the last dimension in predleaf when using strict_shape?

I see the R docs mention for strict_shape:

For predleaf = TRUE, the dimension is (n_trees_in_forest, num_class, n_iterations, nrow(newdata)).

But this n_trees_in_forest always seems to be 1, and the python docs do not make any mention of it.

What does it refer to? Is it ever not equal to 1?

@trivialfis
Member

That's the num_parallel_tree parameter.
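So, if I understand correctly, a sketch like the following should yield a first dimension larger than one (untested; the dimension order is taken from the R docs quoted above):

```r
library(xgboost)
data(agaricus.train, package = "xgboost")
dm <- xgb.DMatrix(agaricus.train$data, label = agaricus.train$label)
# Random-forest-style boosting: 3 trees per boosting iteration.
model <- xgb.train(
    params = list(objective = "binary:logistic", num_parallel_tree = 3),
    data = dm,
    nrounds = 4
)
p <- predict(model, dm, predleaf = TRUE, strict_shape = TRUE)
# Per the R docs, dim(p) should be
# (n_trees_in_forest, num_class, n_iterations, nrow(newdata)),
# so the first dimension here should equal num_parallel_tree = 3.
dim(p)
```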

@trivialfis
Member

Can we come up with a good dataframe representation for vector-leaf trees? I plan to support it in this release, but I haven't got an efficient representation format yet. It would be great if the format is consistent between Python and R.

@david-cortes
Contributor Author

Can we come up with a good dataframe representation for vector-leaf trees? I plan to support it in this release, but I haven't got an efficient representation format yet. It would be great if the format is consistent between Python and R.

Do you mean the function xgb.model.dt.tree? It's returned as a data.table, which supports list-type columns, so vector values could be represented as a list. Otherwise, there's also the arrow format with ListOf<dtype> if you're willing to add arrow as another optional dependency.
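For instance, a list column in data.table could look like this (the column names here are just hypothetical, mirroring the style of xgb.model.dt.tree):

```r
library(data.table)
# Hypothetical dump for a vector-leaf tree: one row per node, with the
# leaf values for all targets stored as a numeric vector in a list column.
dt <- data.table(
    Tree = c(0L, 0L),
    Node = c(1L, 2L),
    Leaf = list(c(0.1, -0.2, 0.3), c(-0.4, 0.5, 0.0))
)
dt$Leaf[[1]]  # numeric vector with one value per target
```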

@trivialfis
Member

Thank you for sharing. @david-cortes

I know this has been mentioned before, but do let us know if you would like to become a maintainer for XGBoost around the R package. Your work has been really impressive.

@david-cortes
Contributor Author

I see there are many R demos for XGBoost of the kind that can be triggered through demo(), some of which are outdated after the introduction of functionalities like support for categorical columns and data frames.

I am wondering whether anyone actually checks such examples (I'd say demos are a very obscure R functionality nowadays after the introduction of rmarkdown and similar), and whether it makes sense to have them when there are already vignettes (which are much more likely to be seen by users).

I'm thinking it'd be a good idea to either remove all those demos, or turn them into vignettes. Turning them into vignettes would be quite a lot of work, so I'm leaning towards simply deleting them.

@trivialfis @hcho3 What do you guys think?

@mayer79
Contributor

mayer79 commented Aug 24, 2024

Good idea! I never check such demos. I think we could delete them and slightly extend the existing vignettes or examples.

@hcho3
Collaborator

hcho3 commented Aug 24, 2024

Yes, let's remove the demos.

@david-cortes
Contributor Author

I just realized that back when we removed the .Rnw files for being auto-generated, the repository didn't actually include the source file (.Rmd) meant to generate the .Rnw file - either the file was missing, or the .Rnw may have been written manually without an .Rmd.

As a result, the vignette "xgboost: eXtreme Gradient Boosting" is effectively missing from the repository now.

I was also about to suggest removing it, since everything it shows is already covered by the other vignettes and I don't see much point in having two introductory vignettes with a large overlap. Given that it's already deleted, I propose leaving it deleted and keeping the newer vignette "XGBoost presentation" in its place.
