
[BUG] Facing issue with LightGBMClassifier training, it stops after iteration 0 #2014

Closed

PankajMerisha opened this issue Jul 13, 2023 · 19 comments

PankajMerisha commented Jul 13, 2023

SynapseML version

0.11.2

System information

  • Language version (e.g. python 3.8, scala 2.12): scala 2.12
  • Spark Version (e.g. 3.2.3): 3.4.0
  • Spark Platform (e.g. Synapse, Databricks): Databricks

Describe the problem

I am facing the issue below every time I try to train a LightGBMClassifier model.
The dataset is huge: ~13 billion data points with 30 categorical features.

23/07/13 10:58:46 INFO BulkPartitionTask: LightGBM task starting iteration 0
23/07/13 10:58:58 INFO BulkPartitionTask: *** Completed training in partition 1209, task 19553, iteration count: 1, best: None

Code to reproduce issue

final_lgb_param = {'baggingFraction': 0.7707226966638262, 'featureFraction': 0.9465486808655914, 'featuresCol': 'features', 'labelCol': 'label', 'lambdaL1': 0.048814197916815294, 'lambdaL2': 0.4302669912613888, 'learningRate': 0.4204699659791139, 'maxDepth': 4, 'metric': 'auc', 'minSumHessianInLeaf': 0.0016024539483708276, 'numIterations': 150, 'numLeaves': 326, 'objective': 'binary', 'seed': 1, 'weightCol': 'weight'}

lgb = LightGBMClassifier(
    numIterations=final_lgb_param['numIterations'],
    numLeaves=final_lgb_param['numLeaves'],
    maxDepth=final_lgb_param['maxDepth'],
    baggingFraction=final_lgb_param['baggingFraction'],
    featureFraction=final_lgb_param['featureFraction'],
    minSumHessianInLeaf=final_lgb_param['minSumHessianInLeaf'],
    learningRate=final_lgb_param['learningRate'],
    objective=final_lgb_param['objective'],
    labelCol=final_lgb_param['labelCol'],
    featuresCol=final_lgb_param['featuresCol'],
    weightCol=final_lgb_param['weightCol'],
    metric=final_lgb_param['metric'],
    microBatchSize=1000,
    useBarrierExecutionMode=True,
    executionMode="streaming",
    passThroughArgs="force_col_wise=true",
    parallelism="voting_parallel",
    verbosity=-1,
    numThreads=8,
    maxBin=200000
)
lgb_model = lgb.fit(train_data)

Other info / logs

No response

What component(s) does this bug affect?

  • area/cognitive: Cognitive project
  • area/core: Core project
  • area/deep-learning: DeepLearning project
  • area/lightgbm: Lightgbm project
  • area/opencv: Opencv project
  • area/vw: VW project
  • area/website: Website
  • area/build: Project build system
  • area/notebooks: Samples under notebooks folder
  • area/docker: Docker usage
  • area/models: models related issue

What language(s) does this bug affect?

  • language/scala: Scala source code
  • language/python: Pyspark APIs
  • language/r: R APIs
  • language/csharp: .NET APIs
  • language/new: Proposals for new client languages

What integration(s) does this bug affect?

  • integrations/synapse: Azure Synapse integrations
  • integrations/azureml: Azure ML integrations
  • integrations/databricks: Databricks integrations
github-actions (bot) commented:

Hey @PankajMerisha 👋!
Thank you so much for reporting the issue/feature request 🚨.
Someone from SynapseML Team will be looking to triage this issue soon.
We appreciate your patience.


svotaw (Collaborator) commented Jul 13, 2023

How new is this problem? You said you are using a version that is only days old (0.11.2). Also, can you try changing executionMode="streaming" to dataTransferMode="streaming"? The error shows that the code is running in "bulk" mode, but your example code shows you set the mode to "streaming". So either there's a bug that is causing the wrong mode, or there's a mismatch between the error you are showing and the Python code. Also, please turn off useBarrierExecutionMode; we don't recommend that by default.
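
A minimal sketch of the suggested change (assuming the standard SynapseML Python import; the other hyperparameters from the repro code are omitted for brevity):

from synapse.ml.lightgbm import LightGBMClassifier

lgb = LightGBMClassifier(
    objective="binary",
    labelCol="label",
    featuresCol="features",
    weightCol="weight",
    dataTransferMode="streaming",   # replaces executionMode="streaming"
    useBarrierExecutionMode=False,  # barrier mode off, as recommended
)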

PankajMerisha (Author) commented:

@svotaw: Thanks for the reply.

I tried to run this on 0.10.2 as well, same issue.

LightGBMClassifier(
    numIterations=final_lgb_param['numIterations'],
    numLeaves=final_lgb_param['numLeaves'],
    maxDepth=final_lgb_param['maxDepth'],
    baggingFraction=final_lgb_param['baggingFraction'],
    featureFraction=final_lgb_param['featureFraction'],
    minSumHessianInLeaf=final_lgb_param['minSumHessianInLeaf'],
    learningRate=final_lgb_param['learningRate'],
    objective=final_lgb_param['objective'],
    labelCol=final_lgb_param['labelCol'],
    featuresCol=final_lgb_param['featuresCol'],
    weightCol=final_lgb_param['weightCol'],
    metric=final_lgb_param['metric'],
)

I tried the above config as well, with no luck. Please suggest the right config to try.

As per the log, it stops after iteration 0, which means something is failing internally that the logs are not showing. How do I check the logs in this case?

svotaw (Collaborator) commented Jul 13, 2023

With 0.11.2, did you try dataTransferMode="streaming" and useBarrierExecutionMode=False? And if so, can you share some more log snippets from the failure?

PankajMerisha (Author) commented Jul 13, 2023

@svotaw: I just tried it and got the same issue. Please find the details below.

Logs:

23/07/13 18:07:28 INFO StreamingPartitionTask: Part 0: Pushing 100 sparse rows at 3674700
23/07/13 18:07:28 INFO StreamingPartitionTask: Part 0: Pushing 100 sparse rows at 3674800
23/07/13 18:07:28 INFO StreamingPartitionTask: Part 0: Pushing 100 sparse rows at 3674900
23/07/13 18:07:28 INFO StreamingPartitionTask: Part 0: Pushing 100 sparse rows at 3675000
23/07/13 18:07:28 INFO StreamingPartitionTask: Part 0: Pushing 100 sparse rows at 3675100
23/07/13 18:07:28 INFO FileScanRDD: Reading File path: dbfs:/mnt/idspexpstoragelrs/pankaj-merisha/win_rate_model/2023-07-13/win-rate-train-data-feature/part-02761-tid-3460575321846148139-c7d92851-98a0-4d5b-a1cc-a9c109d9a969-180283-1-c000.snappy.parquet, range: 0-134217728, partition values: [empty row], modificationTime: 1689225539000.
23/07/13 18:07:28 INFO StreamingPartitionTask: Part 0: Pushing 50 sparse rows at 3675200
23/07/13 18:07:28 INFO StreamingPartitionTask: LightGBM pushed 3675250 in partition 0
23/07/13 18:07:28 INFO StreamingPartitionTask: Part 0: inserted 3675250 partition 0 rows into shared training dataset at offset 0
23/07/13 18:07:28 INFO StreamingPartitionTask: done with data preparation on partition 0, task 22564
23/07/13 18:07:28 INFO StreamingPartitionTask: Calling NetworkInit on local port 14960 with value 10.139.67.29:14960,10.139.67.9:13778,10.139.64.20:12784,10.139.64.251:15442,10.139.67.60:12688,10.139.65.67:15824,10.139.65.176:16752,10.139.65.135:13553,10.139.64.181:15728,10.139.67.100:13200,10.139.66.249:14928,10.139.67.97:17072,10.139.67.82:12594,10.139.65.146:14577,10.139.64.230:12563,10.139.67.75:14256,10.139.64.191:16241,10.139.66.56:15792,10.139.65.88:16304,10.139.65.158:15536,10.139.65.63:13489,10.139.67.93:14448,10.139.67.86:13170,10.139.67.43:13136,10.139.65.76:14706,10.139.64.102:13426,10.139.64.185:15088,10.139.65.51:14897,10.139.64.34:12464,10.139.65.155:14513,10.139.64.189:16497,10.139.65.80:16816,10.139.64.162:14544,10.139.65.132:14096,10.139.64.109:13712,10.139.67.64:12976,10.139.64.184:14064,10.139.64.178:12434,10.139.64.17:14192,10.139.65.143:17168,10.139.66.205:16465,10.139.67.61:12723,10.139.64.41:14640,10.139.67.98:12656,10.139.64.177:15889,10.139.65.20:15856,10.139.65.47:16147,10.139.65.81:16881,10.139.65.147:14290,10.139.67.72:16689,10.139.65.55:16277,10.139.66.82:14000,10.139.67.101:12496,10.139.65.14:16116,10.139.67.19:13680,10.139.65.17:15056,10.139.66.15:16338,10.139.67.76:15665,10.139.67.13:16722,10.139.65.142:12849,10.139.67.105:14320,10.139.67.12:13040,10.139.65.18:14355,10.139.65.70:15282,10.139.64.249:13808,10.139.67.58:13104,10.139.64.130:17009,10.139.66.128:13232,10.139.65.60:13264,10.139.67.91:15760,10.139.64.183:16048,10.139.67.47:15699,10.139.67.23:14480,10.139.67.54:14993,10.139.65.137:17043,10.139.64.205:16624,10.139.67.36:14866,10.139.66.10:12625,10.139.65.48:16432,10.139.65.5:15410,10.139.67.34:16016,10.139.67.95:12400,10.139.65.148:16561,10.139.64.85:16209,10.139.66.94:15250,10.139.64.176:15314,10.139.67.102:13618,10.139.67.40:15601,10.139.67.51:14802,10.139.67.77:14224,10.139.64.254:14385,10.139.64.187:13456,10.139.67.67:13840,10.139.65.130:15952,10.139.67.62:17105,10.139.65.71:16848,10.139.64.212:15921,10.139.67.22:16657,10.139.64.201:15186,10.139.67.84:14833,10.139.64.210:14032,10.139.67.73:14608,10.139.65.15:15473,10.139.67.66:15152,10.139.64.182:12752,10.139.67.59:15120,10.139.67.3:13328,10.139.67.48:13073,10.139.67.90:16371,10.139.65.0:16784,10.139.67.37:16080,10.139.65.220:13937,10.139.67.46:14672,10.139.66.225:14160,10.139.65.82:12945,10.139.67.6:15635,10.139.64.198:14736,10.139.65.145:15568,10.139.65.4:12530,10.139.64.248:17139,10.139.67.8:16530,10.139.67.96:15504,10.139.64.250:13745,10.139.66.254:16403,10.139.67.17:15219,10.139.67.63:13650,10.139.64.233:13362,10.139.64.114:15026,10.139.64.255:13297,10.139.67.89:12912,10.139.65.157:16912,10.139.67.52:15377,10.139.65.78:15986,10.139.64.207:14128,10.139.64.186:16592,10.139.65.133:14768,10.139.65.156:13520,10.139.67.85:13392,10.139.67.74:13584,10.139.65.16:13872,10.139.66.253:13970,10.139.65.54:13904,10.139.65.149:15345,10.139.65.57:12881,10.139.66.130:16177,10.139.65.144:12816,10.139.66.65:16944,10.139.64.247:16977,10.139.65.65:14418,10.139.65.83:13009
23/07/13 18:07:40 INFO StreamingPartitionTask: NetworkInit succeeded. LightGBM task listening on: 14960
23/07/13 18:07:40 INFO StreamingPartitionTask: Waiting for all data prep to be done, task 22564, partition 0
23/07/13 18:07:40 INFO StreamingPartitionTask: Getting final training Dataset for partition 0.
23/07/13 18:07:41 INFO StreamingPartitionTask: Creating LightGBM Booster for partition 0, task 22564
23/07/13 18:07:41 INFO StreamingPartitionTask: Beginning training on LightGBM Booster for task 22564, partition 0
23/07/13 18:07:41 INFO StreamingPartitionTask: LightGBM task starting iteration 0
23/07/13 18:07:52 INFO StreamingPartitionTask: *** Completed training in partition 0, task 22564, iteration count: 1, best: None
23/07/13 18:07:52 INFO StreamingPartitionTask: Freeing Dataset from partition 0, task 22564
23/07/13 18:07:52 INFO StreamingPartitionTask: Beginning cleanup for partition 0, task 22564
23/07/13 18:07:52 INFO StreamingPartitionTask: Done with cleanup for partition 0, task 22564

Code:

final_lgb_param = {'baggingFraction': 0.7298454222061873, 'featureFraction': 0.8427168784585647, 'featuresCol': 'features', 'labelCol': 'label', 'lambdaL1': 0.5225755896380427, 'lambdaL2': 0.23933324719994797, 'learningRate': 0.4122049134704886, 'maxDepth': 7, 'metric': 'auc', 'minSumHessianInLeaf': 0.0017254325877716355, 'numIterations': 55, 'numLeaves': 495, 'objective': 'binary', 'seed': 1, 'weightCol': 'weight'}

lgb = LightGBMClassifier(
            numIterations=final_lgb_param['numIterations'],
            numLeaves=final_lgb_param['numLeaves'],
            maxDepth=final_lgb_param['maxDepth'],
            baggingFraction=final_lgb_param['baggingFraction'],
            featureFraction=final_lgb_param['featureFraction'],
            minSumHessianInLeaf=final_lgb_param['minSumHessianInLeaf'],
            learningRate=final_lgb_param['learningRate'],
            objective=final_lgb_param['objective'],
            labelCol=final_lgb_param['labelCol'],
            featuresCol=final_lgb_param['featuresCol'],
            weightCol=final_lgb_param['weightCol'],
            metric=final_lgb_param['metric'],  
            dataTransferMode="streaming",
            useBarrierExecutionMode=False      
        )

svotaw (Collaborator) commented Jul 13, 2023

OK, so now you are using the newer streaming mode, at least. But regardless, the problem does not seem to be in data transfer or networking, but in actual training. You get the same error in bulk and streaming mode, after data loading has already completed successfully.

I don't have any guesses yet. I have not seen this kind of issue. I imagine it has something to do with the data.

Why are you using only 1 partition for a large dataset? You are running on Databricks, right? Do you not have multiple nodes?

@imatiach-msft any ideas?

PankajMerisha (Author) commented Jul 14, 2023

@svotaw: Please find the answers below.

Why are you using only 1 partition for a large dataset?
I shared the log from one of the workers; it is creating multiple partitions (a quick check is sketched below).

You are running in Databricks right?
Yes. [Screenshot: Databricks cluster configuration, 2023-07-14]

Do you not have multiple nodes?
Yes, we are using 150 nodes. I tried increasing this to a max of 500 as well; same issue.

It could be a data issue as you mentioned, but if I filter the data, training works, which is unexpected because the data preparation code does not change.
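
A hypothetical way to confirm the partition count from the notebook before fitting (not something run in the thread):

# Partition count of the training DataFrame; with 150 workers this
# should be well above 1.
print(train_data.rdd.getNumPartitions())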

PankajMerisha (Author) commented:

I got more logs by setting verbosity to debug mode:

[LightGBM] [Info] Connected to rank 134
[LightGBM] [Info] Connected to rank 135
[LightGBM] [Info] Connected to rank 136
[LightGBM] [Info] Connected to rank 137
[LightGBM] [Info] Connected to rank 138
[LightGBM] [Info] Connected to rank 139
[LightGBM] [Info] Connected to rank 140
[LightGBM] [Info] Connected to rank 141
[LightGBM] [Info] Connected to rank 142
[LightGBM] [Info] Connected to rank 143
[LightGBM] [Info] Connected to rank 144
[LightGBM] [Info] Connected to rank 145
[LightGBM] [Info] Connected to rank 146
[LightGBM] [Info] Connected to rank 147
[LightGBM] [Info] Connected to rank 148
[LightGBM] [Info] Connected to rank 149
[LightGBM] [Info] Local rank: 0, total number of machines: 150
[LightGBM] [Info] Number of positive: -2124244372, number of negative: 13639477
[LightGBM] [Debug] Dataset::GetMultiBinFromAllFeatures: sparse rate 0.217391
[LightGBM] [Debug] init for col-wise cost 0.000335 seconds, init for row-wise cost 3.113191 seconds
[LightGBM] [Warning] Auto-choosing row-wise multi-threading, the overhead of testing was 2.501258 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Debug] Using Dense Multi-Val Bin
[LightGBM] [Info] Total Bins 83
[LightGBM] [Info] Number of data points in the train set: 98851325, number of used features: 30
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.132464 -> initscore=-1.879347
[LightGBM] [Info] Start training from score -1.879347
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Debug] Trained a tree with leaves = 1 and depth = 1
[LightGBM] [Warning] Stopped training because there are no more leaves that meet the split requirements
[LightGBM] [Info] Finished linking network in 9.505846 seconds

svotaw (Collaborator) commented Jul 17, 2023

"Number of positive: -2124244372"

Hmmm... it seems like a count shouldn't be negative. How many rows are there? I think we only support up to int32.max. I'm wondering if there's an overflow problem. You said it works if you reduce/filter the rows?

Let me ping Ilya again. I'm not really an ML person (mostly infra), so someone else should look at this.

PankajMerisha (Author) commented:

@svotaw

How many rows are there?
~13-15 billion

You said it works if you reduce/filter the rows?
Yes, it works, but not always. Our model trains on the last 4 days of data (a rolling window), so when the data is below ~10 billion rows it sometimes works.
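
A hedged diagnostic sketch for checking this in PySpark, using the label column from the repro code (an assumption; this was not run in the thread):

# Compare per-label row counts against the signed 32-bit limit that the
# native LightGBM counters appear to use.
INT32_MAX = 2**31 - 1  # 2147483647

for row in train_data.groupBy("label").count().collect():
    status = "exceeds int32 max" if row["count"] > INT32_MAX else "ok"
    print(f"label={row['label']}: {row['count']:,} rows ({status})")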

imatiach-msft (Contributor) commented:

Based on this error:
[LightGBM] [Info] Number of positive: -2124244372, number of negative: 13639477

it looks like there is an overflow somewhere. This seems to cause the error:
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
where the model is not able to find any splits; hence, training terminates early. The fix will be to find out where the overflow is occurring and resolve the casting issue.
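
The reported negative count is consistent with a 32-bit two's-complement wraparound; a small illustration (an inference from the log line above, not a confirmed diagnosis):

reported = -2124244372        # "Number of positive" from the debug log
recovered = reported + 2**32  # undo the 32-bit two's-complement wraparound
print(recovered)              # 2170722924: ~2.17B positives, above 2**31 - 1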

PankajMerisha (Author) commented:

Hi @imatiach-msft: Thanks for the reply.
I was going through the base LightGBM library's list of issues, and I can see multiple people have reported issues related to int32_t usage. Is there any way to check whether my issue is one of them?

microsoft/LightGBM#5540
microsoft/LightGBM#5861

svotaw (Collaborator) commented Jul 17, 2023

There isn't an easy way that I know of to get the LightGBM native logs. @imatiach-msft? I do it sometimes, but I do a local LightGBM build and modify my logging config. That takes a lot of setup, though, and only works locally, and I only test small datasets locally.

I would ask this question of the LightGBM folks. I think that if you can reduce the dataset size and it works, that's good evidence of the overflow problem. Any fix in native code would have to be on the LightGBM side. You can file an issue there to build on the other int32_t problems and increase their visibility. If you get them to do a fix, you can let us know and we can look at making a new build. We make our own LightGBM build, so we'd need to update that.

PankajMerisha (Author) commented:

Hi @svotaw,

I have made the changes locally, referring to microsoft/LightGBM#5540. Let me know how to build the required jar/package for SynapseML.

svotaw (Collaborator) commented Jul 20, 2023

Instructions for building the jar are in LightGBM: https://github.com/microsoft/LightGBM/blob/master/docs/Installation-Guide.rst. You have a private branch you have built?

Then you will have to make a custom SynapseML build using that jar. I can help once you get there. We aren't going to point SynapseML at a private LightGBM build.

I assume you are just planning to make a custom SynapseML build to test. Make sure you compile the jar for the OS you plan to use. We go through a complicated process to build a jar that contains all 3 OS types, but you can just make one for your own use.
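
For reference, a sketch of the SWIG/Java build steps from the linked Installation Guide (run on the OS you will deploy to, e.g. Linux for Databricks; assumes CMake, SWIG, and a JDK are installed):

git clone --recursive https://github.com/microsoft/LightGBM
cd LightGBM
mkdir build && cd build
cmake -DUSE_SWIG=ON ..
make -j4
# the SWIG jar and native library are produced under the build tree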

PankajMerisha (Author) commented Jul 20, 2023

Hey @svotaw: Yes, it's private. I was able to do the whole process for Mac and upload it to Databricks, where it failed with a message that it was not built for Linux.

Now I am stuck on how to build base LightGBM on Linux.

I am getting the error below:

root@dsp-serving-sp-54676f88f7-fnkx6:/lgb/LightGBM/build# make -j4
Scanning dependencies of target _lightgbm_swig_swig_compilation
[  1%] Building CXX object CMakeFiles/lightgbm_capi_objs.dir/src/c_api.cpp.o
[  3%] Building CXX object CMakeFiles/lightgbm_objs.dir/src/boosting/boosting.cpp.o
[  5%] Building CXX object CMakeFiles/lightgbm_objs.dir/src/boosting/cuda/cuda_score_updater.cpp.o
[  7%] Swig compile swig/lightgbmlib.i for java
/lgb/LightGBM/src/c_api.cpp:5:10: fatal error: LightGBM/c_api.h: No such file or directory
    5 | #include <LightGBM/c_api.h>
      |          ^~~~~~~~~~~~~~~~~~
compilation terminated.
/lgb/LightGBM/src/boosting/boosting.cpp:5:10: fatal error: LightGBM/boosting.h: No such file or directory
    5 | #include <LightGBM/boosting.h>
      |          ^~~~~~~~~~~~~~~~~~~~~
compilation terminated.
make[2]: *** [CMakeFiles/lightgbm_capi_objs.dir/build.make:76: CMakeFiles/lightgbm_capi_objs.dir/src/c_api.cpp.o] Error 1
make[2]: *** [CMakeFiles/lightgbm_objs.dir/build.make:76: CMakeFiles/lightgbm_objs.dir/src/boosting/boosting.cpp.o] Error 1
make[1]: *** [CMakeFiles/Makefile2:145: CMakeFiles/lightgbm_capi_objs.dir/all] Error 2
make[2]: *** Waiting for unfinished jobs....
make[1]: *** Waiting for unfinished jobs....
make[1]: *** [CMakeFiles/Makefile2:93: CMakeFiles/lightgbm_objs.dir/all] Error 2
/lgb/LightGBM/swig/lightgbmlib.i:29: Error: Unable to find '../include/LightGBM/export.h'
/lgb/LightGBM/swig/lightgbmlib.i:30: Error: Unable to find '../include/LightGBM/c_api.h'
/lgb/LightGBM/swig/ChunkedArray_API_extensions.i:15: Error: Unable to find '../include/LightGBM/utils/chunked_array.hpp'
make[2]: *** [CMakeFiles/_lightgbm_swig_swig_compilation.dir/build.make:76: CMakeFiles/_lightgbm_swig.dir/lightgbmlibJAVA.stamp] Error 1
make[2]: *** Deleting file 'CMakeFiles/_lightgbm_swig.dir/lightgbmlibJAVA.stamp'
make[1]: *** [CMakeFiles/Makefile2:227: CMakeFiles/_lightgbm_swig_swig_compilation.dir/all] Error 2
make: *** [Makefile:136: all] Error 2

svotaw (Collaborator) commented Jul 20, 2023

Sorry, I can't help you with building LightGBM on Linux. You'll have to ask them for help. We use the builds they drop in the pipeline to make the jar.

PankajMerisha (Author) commented:

Sure, I will try to reach out to them. Thanks for the help, @svotaw.
