[BUG] Facing issue with LightGBMClassifier training, it stops after iteration 0 #2014
Hey @PankajMerisha 👋!
How new is this problem? You said you are using a version that is only days old (0.11.2). Also, can you try changing `executionMode="streaming"` to `dataTransferMode="streaming"`? The error shows the code running in "bulk" mode, but your example code shows the mode set to "streaming", so either there's a bug that is causing the wrong mode, or there's a mismatch between the error you are showing and the Python code. Also, please turn off `useBarrierExecutionMode`; we don't recommend that by default.
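The suggested settings, as a minimal sketch. This assumes SynapseML 0.11.x is installed on the cluster; the feature and label column names are placeholders, not taken from the reporter's code:

```python
# Sketch of the suggested configuration. Assumes SynapseML is attached to the
# cluster; "features" and "label" are placeholder column names.
from synapse.ml.lightgbm import LightGBMClassifier

classifier = LightGBMClassifier(
    featuresCol="features",
    labelCol="label",
    dataTransferMode="streaming",   # the newer option, not executionMode
    useBarrierExecutionMode=False,  # leave barrier execution off by default
)
```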
@svotaw: Thanks for the reply. I tried running this on 0.10.2 as well, same issue. I tried the config above for `LightGBMClassifier` too, no luck. Please suggest the right config to try. Per the log, it stops after iteration 0, which means something is failing internally that the logs are not showing. How do I check logs in this case?
With 0.11.2, did you try `dataTransferMode="streaming"` and `useBarrierExecutionMode=False`? If so, can you share some more log snippets from the failure?
@svotaw: I just tried it and got the same issue. Please find the details below.
Logs:
Code:
OK, so now you are using the newer streaming mode at least. But regardless, the problem does not seem to be in data transfer or networking, but in actual training. You get the same error in bulk and streaming mode, after data loading has already completed successfully. I don't have any guesses yet; I have not seen this kind of issue. I imagine it has something to do with the data. Why are you using only 1 partition for a large dataset? You are running in Databricks, right? Do you not have multiple nodes? @imatiach-msft any ideas?
@svotaw: Please find the answers below.
> Why are you using only 1 partition for a large dataset? You are running in Databricks, right? Do you not have multiple nodes?

It could be a data issue as you mentioned, but if I filter the data, training works, which is unexpected because the data-preparation code is not changing.
Got more logs by setting verbosity to debug mode:
"Number of positive: -2124244372" hmmm... a count shouldn't be negative. How many rows are there? I think we only support up to int32.max rows. I'm wondering if there's an overflow problem. You said it works if you reduce/filter the rows? Let me ping Ilya again. I'm not really an ML person (mostly infra), so someone else should look at this.
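The negative count in the log is consistent with a signed 32-bit counter wrapping around. A minimal sketch of the arithmetic; the ~2.17 billion "true" count below is inferred by undoing a single 2**32 wrap of the logged value, not a figure from the report:

```python
import ctypes

INT32_MAX = 2**31 - 1  # 2,147,483,647: the largest value a signed 32-bit counter holds

# Hypothetical true positive count, inferred by undoing one 2**32 wrap:
# -2124244372 + 2**32 == 2170722924.
true_count = 2_170_722_924

# ctypes truncates to 32 bits the way a native int32_t counter would.
wrapped = ctypes.c_int32(true_count).value

print(true_count > INT32_MAX)  # True: the count exceeds the int32 range
print(wrapped)                 # -2124244372, matching the log line
```

With 13 billion total rows, per-class counts well past int32.max are plausible, which is why filtering the data down makes the symptom disappear.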
> How many rows are there? You said it works if you reduce/filter the rows?

Based on this error, it looks like there is an overflow somewhere. This seems to cause the error:
Hi @imatiach-msft: Thanks for the reply.
There isn't an easy way that I know of to get the native LightGBM logs (@imatiach-msft?). I do it sometimes, but I make a local LightGBM build and modify my logging config; that takes a lot of setup and only works locally, and I only test small datasets locally. I would ask this question of the LightGBM folks. I think that if you can reduce the dataset size and it works, that's good evidence of the overflow problem. Any fix in native code would have to be on the LightGBM side. You can file an issue there to build on the other int32_t problems and increase its relevance to them. If you get them to do a fix, let us know and we can look at making a new build. We make our own LightGBM build, so we'd need to update that.
Hi @svotaw, I have made the changes locally, referring to microsoft/LightGBM#5540. Let me know how to build the required jar/package for SynapseML.
Instructions for building the jar are in LightGBM: https://github.com/microsoft/LightGBM/blob/master/docs/Installation-Guide.rst. You have a private branch you have built? Then you will have to make a custom SynapseML build using that jar; I can help once you get there. We aren't going to point SynapseML at a private LightGBM build, so I assume you are just planning to make a custom SynapseML build to test. Make sure you compile the jar for the OS you plan to use. We go through a complicated process to build a jar that covers all 3 OS types, but you can just build one for your own platform.
Hey @svotaw: Yes, it's private. I was able to do the whole process for macOS and upload it to Databricks, where it failed with a message that it isn't built for Linux. Now I am stuck on how to build base LightGBM on Linux. I'm getting the below error:
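For reference, a minimal CMake build of LightGBM on Linux, following the Installation Guide linked earlier. This is a sketch under assumptions: `git`, `cmake`, and a C++ toolchain are installed, and the upstream repository URL is a placeholder for the private fork with the local int32_t changes:

```shell
# Clone with submodules (the URL is a placeholder; substitute your fork).
git clone --recursive https://github.com/microsoft/LightGBM.git
cd LightGBM

# Configure and build the native library (produces lib_lightgbm.so on Linux).
cmake -B build -S .
cmake --build build -j4
```

Running this on a Linux machine (or in a Linux container/VM) produces a Linux shared library, rather than the macOS `.dylib` that Databricks rejected.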
Sorry, I can't help you with building LightGBM on Linux. You'll have to ask them for help. We use the builds they drop in their pipeline to make the jar.
Sure, I will try to reach out to them. Thanks for the help @svotaw!
SynapseML version
0.11.2
System information
Describe the problem
I am facing the below issue every time I try to train a LightGBMClassifier model.
The data size is huge: 13 billion data points, with 30 categorical features.
Code to reproduce issue
Other info / logs
No response