Quantized training fails when a model is too complex #5982

Closed
nigimitama opened this issue Jul 16, 2023 · 12 comments · Fixed by #6092

@nigimitama

Description

When a model is too complex and use_quantized_grad is True, training fails with this error:

[LightGBM] [Fatal] Check failed: (best_split_info.right_count) > (0) at /__w/1/s/lightgbm-python/src/treelearner/serial_tree_learner.cpp, line 855 .

Traceback (most recent call last):
  File "//main.py", line 16, in <module>
    lgb.train(params, lgb_train, num_boost_round=1000)
  File "/usr/local/lib/python3.10/site-packages/lightgbm/engine.py", line 266, in train
    booster.update(fobj=fobj)
  File "/usr/local/lib/python3.10/site-packages/lightgbm/basic.py", line 3557, in update
    _safe_call(_LIB.LGBM_BoosterUpdateOneIter(
  File "/usr/local/lib/python3.10/site-packages/lightgbm/basic.py", line 237, in _safe_call
    raise LightGBMError(_LIB.LGBM_GetLastError().decode('utf-8'))
lightgbm.basic.LightGBMError: Check failed: (best_split_info.right_count) > (0) at /__w/1/s/lightgbm-python/src/treelearner/serial_tree_learner.cpp, line 855 .

Reproducible example

from sklearn.datasets import make_regression
X_train, y_train = make_regression(n_samples=1000, n_features=10, noise=1, random_state=42)

import lightgbm as lgb
lgb_train = lgb.Dataset(X_train, y_train)

params = {
    'device_type': 'cpu',
    'objective': 'mse',
    'num_leaves': 128,
    'use_quantized_grad': True,
    'verbose': -1,
    'seed': 0
}

lgb.train(params, lgb_train, num_boost_round=1000)

Environment info

  • python: 3.10
  • LightGBM version or commit hash: 4.0.0

Command(s) you used to install LightGBM

pip3 install lightgbm==4.0.0

I used this Dockerfile:

FROM python:3.10
RUN pip3 install \
    lightgbm==4.0.0 \
    scikit-learn==1.3.0
COPY main.py .
CMD python3 main.py

(main.py is the Python script above)

Additional Comments

When the model is not too complex (e.g., num_boost_round=10), training finishes successfully even when use_quantized_grad is True.
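
For illustration, here is a minimal sketch of the two mitigations this observation suggests (fewer boosting rounds, or disabling use_quantized_grad). It reuses the data and parameters from the reproducible example above; this is a workaround sketch, not an official fix:

from sklearn.datasets import make_regression
import lightgbm as lgb

# Same data and base parameters as the reproducible example above.
X_train, y_train = make_regression(n_samples=1000, n_features=10, noise=1, random_state=42)
base_params = {
    'device_type': 'cpu',
    'objective': 'mse',
    'num_leaves': 128,
    'verbose': -1,
    'seed': 0
}

# Mitigation sketch 1: keep quantized gradients but train far fewer rounds.
lgb.train({**base_params, 'use_quantized_grad': True},
          lgb.Dataset(X_train, y_train), num_boost_round=10)

# Mitigation sketch 2: disable quantized gradients for long trainings.
lgb.train({**base_params, 'use_quantized_grad': False},
          lgb.Dataset(X_train, y_train), num_boost_round=1000)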

@jameslamb added the bug label Jul 16, 2023
@NegatedObjectIdentity

Can confirm this bug with use_quantized_grad set to True and 1000 boosting rounds.

@mglowacki100

@jameslamb , @nigimitama
I've checked this issue with the newest version (4.1.0), and the error doesn't occur for the above example:
https://colab.research.google.com/drive/1IYI9evSJa3OHTh0aZegN8Fq8aGIAYOmE?usp=sharing

@shiyu1994
Collaborator

@mglowacki100 Just pushed a fix. Please check #6092.

shiyu1994 added a commit that referenced this issue Sep 12, 2023
* fix leaf splits update after split in quantized training

* fix preparation ordered gradients for quantized training

* remove force_row_wise in distributed test for quantized training

* Update src/treelearner/leaf_splits.hpp

---------

Co-authored-by: James Lamb <[email protected]>
@AlexGrunt

@shiyu1994 @jameslamb
Is this fix available in the latest version of the Python package (4.1.0)?

@AlexGrunt

@mglowacki100
I'm getting the error
LightGBMError: Check failed: (best_split_info.right_count) > (0) at /__w/1/s/lightgbm-python/src/treelearner/serial_tree_learner.cpp, line 855 .
when running your example notebook.

@mglowacki100

@AlexGrunt , yeah, it's a non-deterministic issue. I got lucky and it worked the first time, but if you re-run the cell a couple of times it will randomly fail or succeed.

@AlexGrunt

@mglowacki100 So the only way to try gradient quantization is to wait for the next python-package release?

@mglowacki100

@shiyu1994 sorry for bothering you with a noob question, but is there a way to test it on Google Colab?
I've tried

git clone --recursive https://github.com/microsoft/LightGBM
cd LightGBM
mkdir build
cd build
cmake ..
make -j4

but in the end I get:

CMake Warning:
  Ignoring extra path from command line:

   ".."


CMake Error: The source directory "/" does not appear to contain CMakeLists.txt.
Specify --help for usage, or press the help button on the CMake GUI.
make: *** No targets specified and no makefile found.  Stop.

@shiyu1994
Collaborator

@AlexGrunt We will release more frequently in the future. The next release should come soon, perhaps within a month.

@mglowacki100 It seems that the path .. is ignored, so no CMakeLists.txt is found in this case. There should be a way to point cmake to the correct path.

@mglowacki100

@shiyu1994 Thanks for the tip! I've managed to build and install LightGBM on Google Colab. The problem was that Google Colab doesn't behave like a terminal: a cd has to be chained to the next operation with ; (e.g., cmake) for that operation to actually run in the target directory.
Here are step-by-step instructions:

  1. Build the C++ library:
!git clone --recursive https://github.com/microsoft/LightGBM
!mkdir LightGBM/build
!cd LightGBM/build; cmake ..
!cd LightGBM/build; make -j4
  2. Add Colab's missing virtual env package:
    !apt install python3.10-venv
  3. Build the Python package:
    !cd LightGBM; ./build-python.sh install
  4. Restart the Colab environment.
  5. Checkup:
import lightgbm as lgb
lgb.__version__

should be 4.1.0.99
Maybe these instructions could be added here as additional info/environment: https://lightgbm.readthedocs.io/en/latest/Installation-Guide.html ?

Here is the Google Colab, btw. I've made a crude test, repeating training 100 times with different seeds, and there is no problem:
https://colab.research.google.com/drive/14PhbdEsGn_bP3X6e99sHL58BYIQqqYcg?usp=sharing
@AlexGrunt you can test with this notebook.
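
For reference, a minimal sketch of that crude multi-seed stress test (assuming the source-built 4.1.0.99 package from the steps above is installed):

import lightgbm as lgb
from sklearn.datasets import make_regression

X_train, y_train = make_regression(n_samples=1000, n_features=10, noise=1, random_state=42)

# Re-run the originally failing configuration with 100 different seeds; with
# the fix from #6092 installed, none of these runs should raise the check failure.
for seed in range(100):
    params = {
        'device_type': 'cpu',
        'objective': 'mse',
        'num_leaves': 128,
        'use_quantized_grad': True,
        'verbose': -1,
        'seed': seed,
    }
    lgb.train(params, lgb.Dataset(X_train, y_train), num_boost_round=1000)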

@jameslamb
Collaborator

“Maybe these instructions could be added here as additional info/environment”

I don't feel we should take on documentation in the Installation Guide explaining how shell commands work in Google Colab.

Colab is running IPython notebooks with Jupyter, and that syntax is well covered in the IPython documentation: https://ipython.org/ipython-doc/3/interactive/shell.html

The LightGBM-specific sequence of commands is already covered by the docs at https://lightgbm.readthedocs.io/en/latest/Installation-Guide.html.

@Innixma

Innixma commented Oct 10, 2023

Excited for the fix to be available; I definitely want to try this out for the AutoGluon v1.0 release!

Ten0 pushed a commit to Ten0/LightGBM that referenced this issue Jan 12, 2024
…5994) (microsoft#6092)
