bad_alloc even with swap being available #4184

Closed

hlbkin opened this issue Feb 26, 2019 · 10 comments

Comments
@hlbkin

hlbkin commented Feb 26, 2019

Hi Team!

I'm getting bad_alloc during XGBoost train/test iterations in cross-validation.

I have a very specific cross-validation setup in Python, so we iterate over chunks of a numpy array for train and test.

Some of the train arrays only fit in memory by using swap on Linux (say, an additional 30-50 GB of swap).

It seems that XGBoost crashes right after some swap memory starts being used.

Is this an expected result?
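
For context, a minimal sketch of the kind of chunked cross-validation loop being described (synthetic data and parameter values are stand-ins, not the actual code):

import numpy as np
import xgboost as xgb

# Synthetic stand-in for the real data; the actual arrays are far larger.
X = np.random.rand(10_000, 50).astype(np.float32)
y = np.random.rand(10_000).astype(np.float32)
params = {'tree_method': 'gpu_hist', 'objective': 'reg:linear', 'eval_metric': 'rmse'}

n_folds = 5
fold_ids = np.arange(len(X)) % n_folds
for k in range(n_folds):
    train_mask, test_mask = fold_ids != k, fold_ids == k
    # Each DMatrix keeps its own copy of the slice, so peak host memory per
    # fold can be well above the size of the original array and push the
    # process into swap.
    dtrain = xgb.DMatrix(X[train_mask], label=y[train_mask])
    dtest = xgb.DMatrix(X[test_mask], label=y[test_mask])
    bst = xgb.train(params, dtrain, num_boost_round=100, evals=[(dtest, 'test')])
    preds = bst.predict(dtest)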

@hlbkin
Author

hlbkin commented Feb 27, 2019

Version is 0.81, from here: https://s3-us-west-2.amazonaws.com/xgboost-wheels/list.html?prefix=

I just ran some experiments. The code crashed even with only 50% of RAM in use.

stack trace:
terminate called after throwing an instance of 'thrust::system::detail::bad_alloc'
what(): std::bad_alloc: out of memory

@hlbkin
Author

hlbkin commented Feb 27, 2019

Current dataset: 2,000,000 rows x 1500 features.

On 0.72, when the dataset was around 2 million x 1000, it never crashed, even though RAM consumption went up to 90%.

default_params = {
    'n_estimators': 100,
    'eta': hp.quniform('eta', 0.01, 0.25, 0.015),
    'max_depth': hp.quniform('max_depth', 1, 4, 1),
    'min_child_weight': hp.qloguniform('min_child_weight', 1, 6, 1),
    'gamma': hp.loguniform('gamma', -10, 10),
    'alpha': hp.loguniform('alpha', -10, 10),
    'colsample_bytree': hp.quniform('colsample_bytree', 0.3, 1, 0.1),
    'subsample': hp.quniform('subsample', 0.3, 1, 0.1),
    'grow_policy': hp.choice('grow_policy', ['lossguide', 'depthwise']),
    'eval_metric': 'rmse',
    'n_gpus': 1,
    'gpu_id': 0,
    'objective': 'gpu:reg:linear',  # could be 'gpu:reg:linear'
    'weighting_scheme': hp.choice('weighting_scheme', [None, None]),
    'tree_method': 'gpu_hist',  # could be gpu_hist later, or exact - which is slower
    'silent': 0
}
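
(For reference, a hyperopt space like this would typically be consumed roughly as follows; run_cv is a hypothetical stand-in for the chunked CV loop, not the actual code.)

from hyperopt import STATUS_OK, Trials, fmin, tpe

def run_cv(params):
    # Hypothetical stand-in for the chunked CV loop; it would train XGBoost
    # on each fold and return the mean test RMSE.
    return 1.0

def objective(sampled):
    sampled = dict(sampled)
    # hp.quniform / hp.qloguniform return floats, so integer-valued
    # parameters need casting before they are passed to XGBoost.
    sampled['max_depth'] = int(sampled['max_depth'])
    sampled['min_child_weight'] = int(sampled['min_child_weight'])
    return {'loss': run_cv(sampled), 'status': STATUS_OK}

trials = Trials()
best = fmin(fn=objective, space=default_params, algo=tpe.suggest,
            max_evals=50, trials=trials)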

@hlbkin
Author

hlbkin commented Feb 27, 2019

I just tried the old 0.72 multi-GPU version with the same code and the same dataset.

GPU memory:

[08:55:31] Allocated 4937MB on [0] GeForce GTX 1080 Ti, 6031MB remaining.
[08:55:39] Allocated 38MB on [0] GeForce GTX 1080 Ti, 5991MB remaining.

File "/home/yhgtemp/anaconda3/lib/python3.6/site-packages/xgboost/training.py", line 204, in train
xgb_model=xgb_model, callbacks=callbacks)
File "/home/yhgtemp/anaconda3/lib/python3.6/site-packages/xgboost/training.py", line 74, in _train_internal
bst.update(dtrain, i, obj)
File "/home/yhgtemp/anaconda3/lib/python3.6/site-packages/xgboost/core.py", line 894, in update
dtrain.handle))
File "/home/yhgtemp/anaconda3/lib/python3.6/site-packages/xgboost/core.py", line 130, in _check_call
raise XGBoostError(_LIB.XGBGetLastError())
xgboost.core.XGBoostError: b'[08:55:41] /workspace/src/tree/updater_gpu_hist.cu:572: GPU plugin exception: /workspace/src/tree/../common/device_helpers.cuh(560): unspecified launch failure\n\n\nStack trace returned 10 entries:\n[bt] (0) /home/yhgtemp/anaconda3/lib/python3.6/site-packages/xgboost/./lib/libxgboost.so(dmlc::StackTrace()+0x47) [0x7f9aa8ac82f7]\n[bt] (1) /home/yhgtemp/anaconda3/lib/python3.6/site-packages/xgboost/./lib/libxgboost.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x18) [0x7f9aa8ac89a8]\n[bt] (2) /home/yhgtemp/anaconda3/lib/python3.6/site-packages/xgboost/./lib/libxgboost.so(xgboost::tree::GPUHistMaker::Update(xgboost::HostDeviceVector<xgboost::detail::GradientPairInternal >, xgboost::DMatrix, std::vector<xgboost::RegTree*, std::allocatorxgboost::RegTree* > const&)+0x2b6) [0x7f9aa8cf86f6]\n[bt] (3) /home/yhgtemp/anaconda3/lib/python3.6/site-packages/xgboost/./lib/libxgboost.so(xgboost::gbm::GBTree::BoostNewTrees(xgboost::HostDeviceVector<xgboost::detail::GradientPairInternal >, xgboost::DMatrix, int, std::vector<std::unique_ptr<xgboost::RegTree, std::default_deletexgboost::RegTree >, std::allocator<std::unique_ptr<xgboost::RegTree, std::default_deletexgboost::RegTree > > >)+0x5a8) [0x7f9aa8b43ab8]\n[bt] (4) /home/yhgtemp/anaconda3/lib/python3.6/site-packages/xgboost/./lib/libxgboost.so(xgboost::gbm::GBTree::DoBoost(xgboost::DMatrix, xgboost::HostDeviceVector<xgboost::detail::GradientPairInternal >, xgboost::ObjFunction)+0x8c6) [0x7f9aa8b44836]\n[bt] (5) /home/yhgtemp/anaconda3/lib/python3.6/site-packages/xgboost/./lib/libxgboost.so(xgboost::LearnerImpl::UpdateOneIter(int, xgboost::DMatrix*)+0x3b8) [0x7f9aa8b73838]\n[bt] (6) /home/yhgtemp/anaconda3/lib/python3.6/site-packages/xgboost/./lib/libxgboost.so(XGBoosterUpdateOneIter+0x2b) [0x7f9aa8b0789b]\n[bt] (7) /home/yhgtemp/anaconda3/lib/python3.6/lib-dynload/../../libffi.so.6(ffi_call_unix64+0x4c) [0x7f9ac6e81ec0]\n[bt] (8) /home/yhgtemp/anaconda3/lib/python3.6/lib-dynload/../../libffi.so.6(ffi_call+0x22d) [0x7f9ac6e8187d]\n[bt] (9) /home/yhgtemp/anaconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so(_ctypes_callproc+0x2ce) [0x7f9ac7096e2e]\n\n'
terminate called after throwing an instance of 'thrust::system::system_error'
what(): parallel_for failed: unspecified launch failure

@hlbkin
Author

hlbkin commented Feb 27, 2019

Seems like this issue: #2874

The code crashes in the predict part, after 7-8 iterations through the CV folds.

Edit: I moved the predict part to "cpu_predictor" and am still getting the crash on later iterations.

During each CV iteration we delete the model to release memory:

    model_copy = model_object.model.copy()
    model_copy.variables = variables_copy
    model_object.model.__del__()
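
Note that calling __del__() directly only runs the finalizer on that one reference; if any other reference to the Booster or its DMatrix objects survives, the native (and GPU) memory stays allocated. A hedged sketch of a more conventional cleanup between folds:

import gc

import numpy as np
import xgboost as xgb

X = np.random.rand(1_000, 10).astype(np.float32)
y = np.random.rand(1_000).astype(np.float32)

for k in range(3):
    dtrain = xgb.DMatrix(X, label=y)
    bst = xgb.train({'objective': 'reg:linear'}, dtrain, num_boost_round=10)
    # ... evaluate the fold ...
    # Drop every reference and force a collection so the Booster's native
    # memory is actually released before the next fold is built.
    del bst, dtrain
    gc.collect()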

@trivialfis
Member

@hlbkin Maybe you can try external memory on CPU after #4193 is merged. Other than that, there's not much we can do about bad_alloc. We can try optimizing memory usage as we go, but that won't happen overnight.
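
(For anyone landing here later, a hedged sketch of what CPU external memory looked like in the Python API of this era; the file names and cache prefixes are placeholders.)

import xgboost as xgb

# Appending '#<cache_prefix>' to a libsvm-format file path tells XGBoost to
# page the data from disk instead of holding it all in RAM.
dtrain = xgb.DMatrix('train.libsvm#dtrain.cache')
dtest = xgb.DMatrix('test.libsvm#dtest.cache')

params = {'tree_method': 'hist', 'objective': 'reg:linear', 'eval_metric': 'rmse'}
bst = xgb.train(params, dtrain, num_boost_round=100, evals=[(dtest, 'test')])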

@hlbkin
Author

hlbkin commented Feb 28, 2019

@trivialfis Is the GPU failing because of a temporary memory burst?

@trivialfis
Member

trivialfis commented Feb 28, 2019

@hlbkin Possibly. But you can use tools like nvidia-smi to monitor the status of the GPU. And please use the latest version; I have been fixing various bugs in XGBoost. It's not going to be bug-free, but I believe the latest commit is much more robust than 0.7x.
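
(As an illustration, GPU memory can be polled alongside training with something like the following; the one-second interval is arbitrary.)

import subprocess
import time

# Poll GPU memory once per second with nvidia-smi; run this in a separate
# terminal or thread while the training job is going.
while True:
    out = subprocess.check_output(
        ['nvidia-smi', '--query-gpu=memory.used,memory.total',
         '--format=csv,noheader']).decode().strip()
    print(time.strftime('%H:%M:%S'), out)
    time.sleep(1)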

@hlbkin
Author

hlbkin commented Mar 3, 2019

Yes, using 0.81 is much better; I generally do not see this kind of behaviour, and it does not crash with the same dataset.

nvidia-smi does not really show memory bursts, i.e. the GPU is generally using less than 50% of its memory and then suddenly crashes (with 0.72).

I will run some tests with a bigger (but similar) dataset and report back.

@hcho3
Collaborator

hcho3 commented Mar 13, 2019

@hlbkin Are things looking better now?

@hlbkin
Author

hlbkin commented Mar 15, 2019

Yes, everything is fine for now. What I also did was convert the numpy dataset to float32 manually (which is natural for XGBoost, as I understand it).
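
(The cast in question is a one-liner; a sketch with synthetic data:)

import numpy as np
import xgboost as xgb

X = np.random.rand(10_000, 100)            # numpy defaults to float64
X = X.astype(np.float32)                   # halves the host-memory footprint
y = np.random.rand(10_000).astype(np.float32)

# DMatrix stores feature values as 32-bit floats, so passing float64 data
# also forces an extra conversion copy at construction time.
dtrain = xgb.DMatrix(X, label=y)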

@hlbkin hlbkin closed this as completed Mar 15, 2019
@hlbkin hlbkin mentioned this issue May 1, 2019
@lock lock bot locked as resolved and limited conversation to collaborators Jun 13, 2019