Quantized training fails when a model is too complex #5982

Closed
nigimitama opened this issue Jul 16, 2023 · 12 comments · Fixed by #6092

@nigimitama

Description

When a model is too complex and use_quantized_grad is True, training fails with this error:

[LightGBM] [Fatal] Check failed: (best_split_info.right_count) > (0) at /__w/1/s/lightgbm-python/src/treelearner/serial_tree_learner.cpp, line 855 .

Traceback (most recent call last):
  File "//main.py", line 16, in <module>
    lgb.train(params, lgb_train, num_boost_round=1000)
  File "/usr/local/lib/python3.10/site-packages/lightgbm/engine.py", line 266, in train
    booster.update(fobj=fobj)
  File "/usr/local/lib/python3.10/site-packages/lightgbm/basic.py", line 3557, in update
    _safe_call(_LIB.LGBM_BoosterUpdateOneIter(
  File "/usr/local/lib/python3.10/site-packages/lightgbm/basic.py", line 237, in _safe_call
    raise LightGBMError(_LIB.LGBM_GetLastError().decode('utf-8'))
lightgbm.basic.LightGBMError: Check failed: (best_split_info.right_count) > (0) at /__w/1/s/lightgbm-python/src/treelearner/serial_tree_learner.cpp, line 855 .

Reproducible example

from sklearn.datasets import make_regression
X_train, y_train = make_regression(n_samples=1000, n_features=10, noise=1, random_state=42)

import lightgbm as lgb
lgb_train = lgb.Dataset(X_train, y_train)

params = {
    'device_type': 'cpu',
    'objective': 'mse',
    'num_leaves': 128,
    'use_quantized_grad': True,
    'verbose': -1,
    'seed': 0
}

lgb.train(params, lgb_train, num_boost_round=1000)

Environment info

  • python: 3.10
  • LightGBM version or commit hash: 4.0.0

Command(s) you used to install LightGBM

pip3 install lightgbm==4.0.0

I used this Dockerfile:

FROM python:3.10
RUN pip3 install \
    lightgbm==4.0.0 \
    scikit-learn==1.3.0
COPY main.py .
CMD python3 main.py

(main.py is the Python script above)

Additional Comments

When the model is not too complex (e.g., num_boost_round=10), training finishes successfully even when use_quantized_grad is True.
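
For illustration, here is a minimal sketch of the two mitigations this observation suggests (fewer boosting rounds, or disabling use_quantized_grad). It reuses the data and parameters from the reproducible example above; this is a workaround sketch, not an official fix:

from sklearn.datasets import make_regression
import lightgbm as lgb

# Same data and base parameters as the reproducible example above.
X_train, y_train = make_regression(n_samples=1000, n_features=10, noise=1, random_state=42)
base_params = {
    'device_type': 'cpu',
    'objective': 'mse',
    'num_leaves': 128,
    'verbose': -1,
    'seed': 0
}

# Mitigation sketch 1: keep quantized gradients but train far fewer rounds.
lgb.train({**base_params, 'use_quantized_grad': True},
          lgb.Dataset(X_train, y_train), num_boost_round=10)

# Mitigation sketch 2: disable quantized gradients for long trainings.
lgb.train({**base_params, 'use_quantized_grad': False},
          lgb.Dataset(X_train, y_train), num_boost_round=1000)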

@jameslamb added the bug label Jul 16, 2023
@NegatedObjectIdentity

Can confirm this bug with use_quantized_grad set to True and 1000 boosting rounds.

@mglowacki100

@jameslamb , @nigimitama
I've checked this issue with the newest version (4.1.0), and the error doesn't occur for the above example:
https://colab.research.google.com/drive/1IYI9evSJa3OHTh0aZegN8Fq8aGIAYOmE?usp=sharing

@shiyu1994
Collaborator

@mglowacki100 Just pushed a fix. Please check #6092.

shiyu1994 added a commit that referenced this issue Sep 12, 2023
* fix leaf splits update after split in quantized training

* fix preparation ordered gradients for quantized training

* remove force_row_wise in distributed test for quantized training

* Update src/treelearner/leaf_splits.hpp

---------

Co-authored-by: James Lamb <[email protected]>
@AlexGrunt

@shiyu1994 @jameslamb
Is this fix available in the latest version of the Python package (4.1.0)?

@AlexGrunt

@mglowacki100
I'm getting the error
LightGBMError: Check failed: (best_split_info.right_count) > (0) at /__w/1/s/lightgbm-python/src/treelearner/serial_tree_learner.cpp, line 855 .
when running your example notebook.

@mglowacki100

@AlexGrunt , yeah, it's a non-deterministic issue. I got lucky and it worked the first time, but if you re-run the cell a couple of times it will randomly fail or succeed.

@AlexGrunt

@mglowacki100 So the only way to try gradient quantization is to wait for the next python-package release?

@mglowacki100

@shiyu1994 sorry for bothering you with a noob question, but is there a way to test it on Google Colab?
I've tried

git clone --recursive https://github.com/microsoft/LightGBM
cd LightGBM
mkdir build
cd build
cmake ..
make -j4

but in the end I get:

CMake Warning:
  Ignoring extra path from command line:

   ".."


CMake Error: The source directory "/" does not appear to contain CMakeLists.txt.
Specify --help for usage, or press the help button on the CMake GUI.
make: *** No targets specified and no makefile found.  Stop.

@shiyu1994
Collaborator

@AlexGrunt We will release more frequently in the future. The next release should come soon, perhaps within a month.

@mglowacki100 It seems that the path .. is ignored, so no CMakeLists.txt is found in this case. There should be a way to point cmake to the correct path.

@mglowacki100

@shiyu1994 Thanks for the tip! I've managed to build and install LightGBM on Google Colab. The problem was that Google Colab doesn't behave like a terminal: a cd has to be chained to the next operation with ; (e.g., cmake) for that operation to actually run in the target directory.
Here are step-by-step instructions:

  1. Build the C++ library:
!git clone --recursive https://github.com/microsoft/LightGBM
!mkdir LightGBM/build
!cd LightGBM/build; cmake ..
!cd LightGBM/build; make -j4
  2. Add Colab's missing virtual env package:
    !apt install python3.10-venv
  3. Build the Python package:
    !cd LightGBM; ./build-python.sh install
  4. Restart the Colab environment.
  5. Checkup:
import lightgbm as lgb
lgb.__version__

should be 4.1.0.99
Maybe these instructions could be added here as additional info/environment: https://lightgbm.readthedocs.io/en/latest/Installation-Guide.html ?

Here is the Google Colab, btw. I've made a crude test, repeating training 100 times with different seeds, and there is no problem:
https://colab.research.google.com/drive/14PhbdEsGn_bP3X6e99sHL58BYIQqqYcg?usp=sharing
@AlexGrunt you can test with this notebook.
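
For reference, a minimal sketch of that crude multi-seed stress test (assuming the source-built 4.1.0.99 package from the steps above is installed):

import lightgbm as lgb
from sklearn.datasets import make_regression

X_train, y_train = make_regression(n_samples=1000, n_features=10, noise=1, random_state=42)

# Re-run the originally failing configuration with 100 different seeds; with
# the fix from #6092 installed, none of these runs should raise the check failure.
for seed in range(100):
    params = {
        'device_type': 'cpu',
        'objective': 'mse',
        'num_leaves': 128,
        'use_quantized_grad': True,
        'verbose': -1,
        'seed': seed,
    }
    lgb.train(params, lgb.Dataset(X_train, y_train), num_boost_round=1000)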

@jameslamb
Collaborator

“Maybe these instructions could be added here as additional info/environment”

I don't feel we should take on documentation in the Installation Guide explaining how shell commands work in Google Colab.

Colab is running IPython notebooks with Jupyter, and that syntax is well covered in the IPython documentation: https://ipython.org/ipython-doc/3/interactive/shell.html

The LightGBM-specific sequence of commands is already covered by the docs at https://lightgbm.readthedocs.io/en/latest/Installation-Guide.html.

@Innixma

Innixma commented Oct 10, 2023

Excited for the fix to be available; I definitely want to try this out for the AutoGluon v1.0 release!

Ten0 pushed a commit to Ten0/LightGBM that referenced this issue Jan 12, 2024
…5994) (microsoft#6092)
