R package install with GPU support fails #3765
Thanks very much for the detailed report and great reproducible example! I'll take a look at this later today. |
Thank you so much @jameslamb |
Actually, the old way of installing the R package works:
|
Oh, I was wrong: the old method above does compile, but it does not actually work:
|
@szilard Hey! Let me put in my two cents 🙂. I'm completely out of ideas how But I took a look at your Dockerfile. It seems that you are inheriting from (LightGBM/docker/gpu/dockerfile.gpu, lines 15 to 20 in f997a06)
I can already see the following action in your Dockerfile 👍 (LightGBM/docker/gpu/dockerfile.gpu, lines 59 to 61 in f997a06)
However, performing those actions might not be enough. In that case, you can add paths similar to the ones in this example to the CMake commands (LightGBM/docker/gpu/dockerfile.gpu, line 87 in f997a06)
here (LightGBM/R-package/src/install.libs.R, lines 166 to 168 in f997a06)
and (LightGBM/docker/gpu/dockerfile.gpu, line 88 in f997a06)
here (LightGBM/R-package/src/install.libs.R, line 135 in f997a06)
I suspect that CMake goes mad somehow due to two OpenCL installations. Look at these lines from your log:
"found" and "Could NOT find" at the same time. |
BTW,
Probably this is the reason... I mean, maybe the appropriate env variables are already set in the Docker command line, but R doesn't see them. |
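One way to test the hypothesis that variables set for the container never reach the process doing the build is to print them from a freshly spawned child process. Below is a minimal sketch in Python. The variable names in `VARS_TO_CHECK` are placeholders chosen for illustration, not names confirmed in this thread, and the child here is a Python interpreter rather than R; the same probe could be repeated from an R session with `Sys.getenv()`.

```python
import json
import os
import subprocess
import sys

# Placeholder names for illustration only; substitute the variables your Dockerfile sets.
VARS_TO_CHECK = ["OPENCL_LIBRARIES", "OPENCL_INCLUDE_DIR"]


def visible_in_child(names, env=None):
    """Return {name: value or None} as seen by a freshly spawned child process."""
    probe = (
        "import json, os, sys; "
        "print(json.dumps({n: os.environ.get(n) for n in sys.argv[1:]}))"
    )
    result = subprocess.run(
        [sys.executable, "-c", probe, *names],
        env=env,  # None means the child inherits the parent's environment
        capture_output=True,
        text=True,
        check=True,
    )
    return json.loads(result.stdout)


if __name__ == "__main__":
    for name, value in visible_in_child(VARS_TO_CHECK).items():
        shown = value if value is not None else "<not set in child>"
        print(f"{name}: {shown}")
```

If a variable is present in the shell but reported as not set by the child, the problem is in how the environment is handed down rather than in the build script itself.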
Well, as I corrected myself later, the Yeah, my Dockerfile has a history of additions over the years (and hacks like the However, lightgbm compiles fine outside the R package, so it seems it's only the R package that gets confused about OpenCL. |
Trying to understand what's going on, trying to strip down things as much as possible: If I remove the
Then the R install errors out (obviously) with:
The non-R install works even if stripped down to this (no need for any ENV variables or any of the other compiler flags mentioned above by @StrikerRUS):
Removing any of
If I add back the then the R package fails (this was yesterday's result, just including here to see the error message in context):
but now the non-R install can be stripped of even more flags and it still works:
and notice it has now found OpenCL 2.2 instead of the old 1.2 previously included in Based on this it seems to me something's up with the R package (likely it can't get the (And I think having |
If I make this change in
then it finds OpenCL:
However, now it can't find Boost:
It seems like the R build script cannot find the necessary paths anymore somehow (for the GPU install), not only OpenCL. |
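When comparing the R build against the plain CMake build, it can help to summarize which packages each configure log reports as found or missing. The sketch below assumes the usual `Found <Package>:` and `Could NOT find <Package>` wording that CMake's find modules print. The sample log is made up for illustration and is not the log from this thread.

```python
import re

# Capital "Found" avoids matching the trailing '(found version "...")' note.
FOUND = re.compile(r"\bFound (\w+):")
MISSING = re.compile(r"\bCould NOT find (\w+)")


def summarize_cmake_log(log_text):
    """Return (found, missing) sets of package names mentioned in a CMake log."""
    found = set(FOUND.findall(log_text))
    missing = set(MISSING.findall(log_text))
    return found, missing


# Made-up sample in the style of CMake's find_package messages.
SAMPLE_LOG = """\
-- Found OpenCL: /path/to/libOpenCL.so (found version "2.2")
-- Could NOT find Boost (missing: Boost_INCLUDE_DIR filesystem system)
"""

if __name__ == "__main__":
    found, missing = summarize_cmake_log(SAMPLE_LOG)
    print("found:", sorted(found))
    print("missing:", sorted(missing))
    # A package listed both ways hints at two competing installations.
    print("reported both ways:", sorted(found & missing))
```

Running this over the log of the working command-line build and over the log of the failing R build shows at a glance which dependencies only the R build fails to locate.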
Could you please try adding Boost paths (
On our CI it finds Boost in
Actual path for your Docker you can take from a successful installation from command line, I believe. Though, it's quite strange.
I'm not sure that the newer version from the Ubuntu PPA is better than the preinstalled native version from NVIDIA if you are really using NVIDIA cards for training. |
@jameslamb I believe the R-package needs the same additional command-line options for the GPU version as our Python-package:
|
I can't make it pass Boost by adding |
All this is strange, because the last time I ran the benchmarks (September 2020) it was all working. |
Please try |
Indeed, with this old LightGBM commit 7e11d4a (Aug 30, 2020) and using the old
it works:
|
Well, actually I don't know, it compiles, though now I'm on an instance without GPU, so I'm not sure if it adds GPU support (yesterday it seemed that the old |
Sorry for possible confusion, maybe I did not explain it the best way, what I mean is that
seems to always compile OK, but in the latest LightGBM version it does not actually add GPU support. In fact I think I'm able to see whether GPU support was added even on this non-GPU instance, because with the old commit 7e11d4a (Aug 30, 2020) after compiling with the
while with the latest LightGBM commit I get (compiling with the
So it seems old I'm not sure if it would help us fix the main issue if we find out which commit broke this (it's weird that the old |
With this:
(added boost libdir but not the include) it compiles:
and it will probably run. I get
but on a box without GPU (good sign), I'll have to try it out on an instance with GPU. The code I'm running btw:
|
Nice, given that the error happens on a non-GPU machine! A good sign indeed! But please note that a successfully compiled GPU version and using
|
or use |
I ran it on an instance with GPU (p3 with V100): With this patch:
that is by using this hack in my Dockerfile (and with
it is compiling and running OK. Full Dockerfile: https://github.com/szilard/GBM-perf/blob/f34c37357e82f7dd3d8f30e5625a7f268a3b98a5/gpu/Dockerfile Full R code running: https://github.com/szilard/GBM-perf/blob/f34c37357e82f7dd3d8f30e5625a7f268a3b98a5/gpu/run/3-lightgbm.R I wonder if on other systems it works out of the box or not (without adding the paths with the patch) as it used to run for me as well. |
Those paths are default ones. Very strange that they are not propagated into R... |
Thanks for such nice reproducible examples @szilard ! I can look into this this weekend, and probably expose more options via the |
Sounds great @jameslamb, thank you. |
Thanks to both of you for all the great information, and a nice reproducible example! I've proposed what I think could be a fix, in #3779. It wouldn't "just work", but would at least allow you to pass in these paths as command-line args like you can in the Python package, so no one would need to use |
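The idea of passing the paths explicitly, instead of relying on detection, amounts to translating user-supplied options into `-D` definitions for CMake. The following is only a sketch of that translation, not the code from #3779: the option names are hypothetical, chosen for illustration. The CMake variable names are the hint variables documented for CMake's FindOpenCL and FindBoost modules, and `USE_GPU` is assumed to be the switch that enables LightGBM's GPU build.

```python
import argparse

# Hypothetical option names (left) mapped to CMake hint variables (right).
OPTION_TO_CMAKE_VAR = {
    "opencl_include_dir": "OpenCL_INCLUDE_DIR",
    "opencl_library": "OpenCL_LIBRARY",
    "boost_root": "BOOST_ROOT",
    "boost_include_dir": "BOOST_INCLUDEDIR",
    "boost_librarydir": "BOOST_LIBRARYDIR",
}


def build_parser():
    """Build a parser with one optional path argument per CMake hint."""
    parser = argparse.ArgumentParser(description="Collect GPU build path hints.")
    parser.add_argument("--use-gpu", action="store_true")
    for option in OPTION_TO_CMAKE_VAR:
        parser.add_argument("--" + option.replace("_", "-"), default=None)
    return parser


def cmake_definitions(argv):
    """Translate command-line options into a list of -D arguments for CMake."""
    args = build_parser().parse_args(argv)
    definitions = ["-DUSE_GPU=ON"] if args.use_gpu else []
    for option, cmake_var in OPTION_TO_CMAKE_VAR.items():
        value = getattr(args, option)
        if value is not None:
            definitions.append(f"-D{cmake_var}={value}")
    return definitions


if __name__ == "__main__":
    # Example paths are placeholders.
    print(cmake_definitions(["--use-gpu", "--boost-librarydir=/opt/boost/lib"]))
```

Only options the user actually supplied are forwarded, so CMake's own detection still applies to everything else.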
Thanks @jameslamb for the fix and for merging it into LightGBM master. I changed the Dockerfile in my repo GBM-perf to take advantage of this fix (replaced the |
@szilard I'm afraid you have a typo (duplicated
Quite strange that compilation succeeded even with the typo. |
Thanks @StrikerRUS , I fixed it now. Yeah, strange indeed it was compiling with the |
This used to work:
Now, I get this error:
If I build the Docker image with the last RUN entry commented out:
with
and then run it:
then I can run things manually:
gives the same error.
However, just compiling lightgbm (not the R package) seems fine:
as here:
though I also see
but it compiles anyway:
So there must be something in the R package(?) cc @jameslamb