Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[R-package] Factor out custom R interface to lib_lightgbm #3016

Closed
jameslamb opened this issue Apr 23, 2020 · 10 comments · Fixed by #4265
Closed

[R-package] Factor out custom R interface to lib_lightgbm #3016

jameslamb opened this issue Apr 23, 2020 · 10 comments · Fixed by #4265

Comments

@jameslamb
Copy link
Collaborator

Summary

The R package calls a C++ library (lib_lightgbm.so) with the R function .Call(). Doing this requires an interface to map R data types to C data types and back. Right now we do not use the SEXP ("S expression") interface that ships with R's built-in headers.

So far our custom implementation has worked well, but there is a risk that future versions of R will not work with it.

To close this issue, remove R_object_helper.h, including LGBM_SER and LGBM_SE. Replace them with the corresponding symbols from R.h, Rinternals.h, and the other R built-in headers.

It is likely possible to do this in pieces....it does not have to be a single pull request.

Motivation

This may not be strictly necessary to get to CRAN, but it will bring the R package into tighter compatibility with the R build tools, so should improve the chances we'll be able to support new versions of R with no changes.

I suspect, though I don't know for sure, that this would also resolve these warnings we're getting from some compilers:

image

See #3009 (comment) and

@jameslamb jameslamb changed the title [R-package] Factor out custom R interface to c_api [R-package] Factor out custom R interface to lib_lightgbm Apr 23, 2020
@jameslamb
Copy link
Collaborator Author

Closing this in favor of being in #2302 , the place where we keep all feature requests. Anyone is welcome to contribute this! If you want to work on it, leave a comment here.

@david-cortes
Copy link
Contributor

I don't get the context in full:

  • Why does the GH version use lib_lightgbm while the CRAN version uses lightgbm? Why not make both lightgbm?
  • What was the reason why it had replacements for R's headers in the first place? Is there something that will break if it were to switch to R's headers?
  • Why even the .Call are wrapped into some lgb.* function?
  • If it were to use R headers, should it use R's external pointers system in order to call the C++ destructors? Or was it intentional that those destructors were put into R6 finalizers?
  • Is there some particular reason why e.g. the pointers are stored in objects of type double? Is the R package meant to be compilable in non-standard CPU architectures?
  • What about things like the logger? Should it also switch to R's functions?
  • Is the R library meant to be compiled in non-standard ways too? e.g. will you create R builds using msvc or something like that?

@jameslamb jameslamb reopened this Apr 22, 2021
@jameslamb
Copy link
Collaborator Author

Why does the GH version use lib_lightgbm while the CRAN version uses lightgbm? Why not make both lightgbm?

Is the R library meant to be compiled in non-standard ways too? e.g. will you create R builds using msvc or something like that?

This is related to the fact that we also offer CMake-based builds of the package (using a different toolchain from CRAN) for some features that are difficult to get to CRAN or prohibited by CRAN's policies, like a GPU-enabled version of the package and builds with MSVC.

You can see https://github.com/microsoft/LightGBM/blob/master/build_r.R and https://github.com/microsoft/LightGBM/tree/master/R-package#install for reference.


What was the reason why it had replacements for R's headers in the first place? Is there something that will break if it were to switch to R's headers?

When the LightGBM R package was first created, the original authors were concerned that pieces of R.h were GPL-licensed, and so they created this custom re-implementation. We've since come to the understanding that that isn't problematic, but haven't yet done the work to replace LightGBM's custom interface to R.

#629 (comment)

* We previously did not want to use R's headers because of license concerns. This is no longer a concern:
* https://github.com/microsoft/LightGBM/issues/629#issuecomment-474995635
* For now, this wrapper is LightGBM's interface from R to C.
* If R changes the way it defines objects, this file will need to be updated as well.

That historical legal stuff is the only reason.


Why even the .Call are wrapped into some lgb.* function?

Right now .Call() calls are wrapped in an internal function lgb.call() to do error handling on the R side. But I have two PRs open currently to switch them all to bare .Call() calls: #4163, #4155.


If it were to use R headers, should it use R's external pointers system in order to call the C++ destructors? Or was it intentional that those destructors were put into R6 finalizers?

I don't understand this question, sorry. Would appreciate some specific links to code or documentation.


Is there some particular reason why e.g. the pointers are stored in objects of type double? Is the R package meant to be compilable in non-standard CPU architectures?

The handles created to data are doubles with 64-bit R and integers with 32-bit R.

lgb.null.handle <- function() {
if (.Machine$sizeof.pointer == 8L) {
return(NA_real_)
} else {
return(NA_integer_)
}
}

Does that answer your question?


What about things like the logger? Should it also switch to R's functions?

When lib_lightgbm is built as part of the R package, its logger already uses the R I/O functions:

private:
static void Write(LogLevel level, const char *level_str, const char *format,
va_list val) {
if (level <= GetLevel()) { // omit the message with low level
// R code should write back to R's output stream,
// otherwise to stdout
#ifndef LGB_R_BUILD
if (GetLogCallBack() == nullptr) {
printf("[LightGBM] [%s] ", level_str);
vprintf(format, val);
printf("\n");
fflush(stdout);
} else {
const size_t kBufSize = 512;
char buf[kBufSize];
snprintf(buf, kBufSize, "[LightGBM] [%s] ", level_str);
GetLogCallBack()(buf);
vsnprintf(buf, kBufSize, format, val);
GetLogCallBack()(buf);
GetLogCallBack()("\n");
}
#else
Rprintf("[LightGBM] [%s] ", level_str);
Rvprintf(format, val);
Rprintf("\n");
#endif
}
}
.

// R code should write back to R's error stream,
// otherwise to stderr
#ifndef LGB_R_BUILD
fprintf(stderr, "[LightGBM] [Fatal] %s\n", str_buf);
fflush(stderr);
#else
Rf_error("[LightGBM] [Fatal] %s\n", str_buf);
#endif

@david-cortes
Copy link
Contributor

These are the destructors that I'm talking about:

LGBM_SE LGBM_BoosterFree_R(LGBM_SE handle,

LGBM_SE LGBM_DatasetFree_R(LGBM_SE handle,

Which are put under R6 finalizers:

finalize = function() {

finalize = function() {

And BTW using them like this currently can lead to segmentation faults as their validity is not checked (#4208).

Another thing: Rf_error used in the logger is a C function which produces an R error. It does not do the same as a C++ exception as it won't trigger stack unwinding (i.e. will lead to memory leaks in some functions). That'd be better replaced by REprintf (prints an error message) and then throwing the R error from the calling function, being careful to destruct any C++ objects if needed.

@jameslamb
Copy link
Collaborator Author

Oh interesting! It sounds like you're really knowledgeable about this. We'd really appreciate some contributions if you see opportunities to improve the way the package is set up. It doesn't have to be one large PR that completely solves this issue.

For example, the point you made about using REprintf seems like it could be done in a fairly focused PR.

If you don't have the time or interest we'll understand and I will try to do what I can based on your detailed write-ups here.

@david-cortes
Copy link
Contributor

Would you be willing to add a dependency on either Rcpp or cpp11? That'd make things a lot easier.

@jameslamb
Copy link
Collaborator Author

I'm opposed to adding a new dependency unless we really have to, and at this moment I'm not convinced that it's necessary.

{xgboost}'s codebase is very similar to this project's R package, and they haven't needed to do that.

@jameslamb
Copy link
Collaborator Author

@david-cortes are you interested in contributing fixes for this issue in the next few weeks?

If you aren't, I'm going to make it my next priority in this project (using all the excellent insight you've given us in recent comments). Just want to know so you and I aren't working on the same thing at the same time.

@david-cortes
Copy link
Contributor

I don't have any plans to work on these issues right now, but feel free to ask if you need more suggestions.

@jameslamb
Copy link
Collaborator Author

Ok great, thank you so much! I probably will be asking you. We really appreciate it!!

I recognized you from {catboost}'s CRAN issue (catboost/catboost#439), glad to see you here as well 😀

@jameslamb jameslamb self-assigned this Apr 30, 2021
StrikerRUS added a commit that referenced this issue May 12, 2021
…fixes #3016) (#4265)

* started converting handles

* more changes

* sort of working for Dataset

* yay all the tests are passing for Dataset handle changes

* working for other handle types

* remove debugging logging

* remove unnecessary spaces

* fix null logic

* more NULL

* updates

* Apply suggestions from code review

Co-authored-by: Nikita Titov <[email protected]>

* consolidate steps

Co-authored-by: Nikita Titov <[email protected]>
@jameslamb jameslamb mentioned this issue May 20, 2021
21 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
2 participants