bad_alloc even with swap being available #4184

Closed

hlbkin opened this issue Feb 26, 2019 · 10 comments

Comments
@hlbkin

hlbkin commented Feb 26, 2019

Hi Team!

I'm getting bad_alloc during XGBoost train/test iterations in cross-validation.

I have a very specific cross-validation setup in Python, so we iterate over chunks of a numpy array for train and test.

Some of the train arrays only fit in memory by using swap on Linux (say, an additional 30-50 GB of swap).

It seems that XGBoost crashes right after some swap memory starts being used.

Is this an expected result?
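
For context, a minimal sketch of the kind of chunked cross-validation loop being described (synthetic data and parameter values are stand-ins, not the actual code):

import numpy as np
import xgboost as xgb

# Synthetic stand-in for the real data; the actual arrays are far larger.
X = np.random.rand(10_000, 50).astype(np.float32)
y = np.random.rand(10_000).astype(np.float32)
params = {'tree_method': 'gpu_hist', 'objective': 'reg:linear', 'eval_metric': 'rmse'}

n_folds = 5
fold_ids = np.arange(len(X)) % n_folds
for k in range(n_folds):
    train_mask, test_mask = fold_ids != k, fold_ids == k
    # Each DMatrix keeps its own copy of the slice, so peak host memory per
    # fold can be well above the size of the original array and push the
    # process into swap.
    dtrain = xgb.DMatrix(X[train_mask], label=y[train_mask])
    dtest = xgb.DMatrix(X[test_mask], label=y[test_mask])
    bst = xgb.train(params, dtrain, num_boost_round=100, evals=[(dtest, 'test')])
    preds = bst.predict(dtest)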

@hlbkin
Author

hlbkin commented Feb 27, 2019

Version is 0.81, from here: https://s3-us-west-2.amazonaws.com/xgboost-wheels/list.html?prefix=

I just ran some experiments. The code crashed even with only 50% of RAM in use.

stack trace:
terminate called after throwing an instance of 'thrust::system::detail::bad_alloc'
what(): std::bad_alloc: out of memory

@hlbkin
Author

hlbkin commented Feb 27, 2019

Current dataset: 2,000,000 rows x 1500 features.

On 0.72, when the dataset was around 2 million x 1000, it never crashed, even though RAM consumption went up to 90%.

default_params = {
    'n_estimators': 100,
    'eta': hp.quniform('eta', 0.01, 0.25, 0.015),
    'max_depth': hp.quniform('max_depth', 1, 4, 1),
    'min_child_weight': hp.qloguniform('min_child_weight', 1, 6, 1),
    'gamma': hp.loguniform('gamma', -10, 10),
    'alpha': hp.loguniform('alpha', -10, 10),
    'colsample_bytree': hp.quniform('colsample_bytree', 0.3, 1, 0.1),
    'subsample': hp.quniform('subsample', 0.3, 1, 0.1),
    'grow_policy': hp.choice('grow_policy', ['lossguide', 'depthwise']),
    'eval_metric': 'rmse',
    'n_gpus': 1,
    'gpu_id': 0,
    'objective': 'gpu:reg:linear',  # could be 'gpu:reg:linear'
    'weighting_scheme': hp.choice('weighting_scheme', [None, None]),
    'tree_method': 'gpu_hist',  # could be gpu_hist later, or exact - which is slower
    'silent': 0
}
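
(For reference, a hyperopt space like this would typically be consumed roughly as follows; run_cv is a hypothetical stand-in for the chunked CV loop, not the actual code.)

from hyperopt import STATUS_OK, Trials, fmin, tpe

def run_cv(params):
    # Hypothetical stand-in for the chunked CV loop; it would train XGBoost
    # on each fold and return the mean test RMSE.
    return 1.0

def objective(sampled):
    sampled = dict(sampled)
    # hp.quniform / hp.qloguniform return floats, so integer-valued
    # parameters need casting before they are passed to XGBoost.
    sampled['max_depth'] = int(sampled['max_depth'])
    sampled['min_child_weight'] = int(sampled['min_child_weight'])
    return {'loss': run_cv(sampled), 'status': STATUS_OK}

trials = Trials()
best = fmin(fn=objective, space=default_params, algo=tpe.suggest,
            max_evals=50, trials=trials)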

@hlbkin
Author

hlbkin commented Feb 27, 2019

I just tried the old 0.72 multi-GPU version with the same code and the same dataset.

GPU memory:

[08:55:31] Allocated 4937MB on [0] GeForce GTX 1080 Ti, 6031MB remaining.
[08:55:39] Allocated 38MB on [0] GeForce GTX 1080 Ti, 5991MB remaining.

File "/home/yhgtemp/anaconda3/lib/python3.6/site-packages/xgboost/training.py", line 204, in train
xgb_model=xgb_model, callbacks=callbacks)
File "/home/yhgtemp/anaconda3/lib/python3.6/site-packages/xgboost/training.py", line 74, in _train_internal
bst.update(dtrain, i, obj)
File "/home/yhgtemp/anaconda3/lib/python3.6/site-packages/xgboost/core.py", line 894, in update
dtrain.handle))
File "/home/yhgtemp/anaconda3/lib/python3.6/site-packages/xgboost/core.py", line 130, in _check_call
raise XGBoostError(_LIB.XGBGetLastError())
xgboost.core.XGBoostError: b'[08:55:41] /workspace/src/tree/updater_gpu_hist.cu:572: GPU plugin exception: /workspace/src/tree/../common/device_helpers.cuh(560): unspecified launch failure\n\n\nStack trace returned 10 entries:\n[bt] (0) /home/yhgtemp/anaconda3/lib/python3.6/site-packages/xgboost/./lib/libxgboost.so(dmlc::StackTrace()+0x47) [0x7f9aa8ac82f7]\n[bt] (1) /home/yhgtemp/anaconda3/lib/python3.6/site-packages/xgboost/./lib/libxgboost.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x18) [0x7f9aa8ac89a8]\n[bt] (2) /home/yhgtemp/anaconda3/lib/python3.6/site-packages/xgboost/./lib/libxgboost.so(xgboost::tree::GPUHistMaker::Update(xgboost::HostDeviceVector<xgboost::detail::GradientPairInternal >, xgboost::DMatrix, std::vector<xgboost::RegTree*, std::allocatorxgboost::RegTree* > const&)+0x2b6) [0x7f9aa8cf86f6]\n[bt] (3) /home/yhgtemp/anaconda3/lib/python3.6/site-packages/xgboost/./lib/libxgboost.so(xgboost::gbm::GBTree::BoostNewTrees(xgboost::HostDeviceVector<xgboost::detail::GradientPairInternal >, xgboost::DMatrix, int, std::vector<std::unique_ptr<xgboost::RegTree, std::default_deletexgboost::RegTree >, std::allocator<std::unique_ptr<xgboost::RegTree, std::default_deletexgboost::RegTree > > >)+0x5a8) [0x7f9aa8b43ab8]\n[bt] (4) /home/yhgtemp/anaconda3/lib/python3.6/site-packages/xgboost/./lib/libxgboost.so(xgboost::gbm::GBTree::DoBoost(xgboost::DMatrix, xgboost::HostDeviceVector<xgboost::detail::GradientPairInternal >, xgboost::ObjFunction)+0x8c6) [0x7f9aa8b44836]\n[bt] (5) /home/yhgtemp/anaconda3/lib/python3.6/site-packages/xgboost/./lib/libxgboost.so(xgboost::LearnerImpl::UpdateOneIter(int, xgboost::DMatrix*)+0x3b8) [0x7f9aa8b73838]\n[bt] (6) /home/yhgtemp/anaconda3/lib/python3.6/site-packages/xgboost/./lib/libxgboost.so(XGBoosterUpdateOneIter+0x2b) [0x7f9aa8b0789b]\n[bt] (7) /home/yhgtemp/anaconda3/lib/python3.6/lib-dynload/../../libffi.so.6(ffi_call_unix64+0x4c) [0x7f9ac6e81ec0]\n[bt] (8) /home/yhgtemp/anaconda3/lib/python3.6/lib-dynload/../../libffi.so.6(ffi_call+0x22d) [0x7f9ac6e8187d]\n[bt] (9) /home/yhgtemp/anaconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so(_ctypes_callproc+0x2ce) [0x7f9ac7096e2e]\n\n'
terminate called after throwing an instance of 'thrust::system::system_error'
what(): parallel_for failed: unspecified launch failure

@hlbkin
Author

hlbkin commented Feb 27, 2019

Seems like this issue: #2874

The code crashes in the predict part, after 7-8 iterations through the CV folds.

Edit: I moved the predict part to "cpu_predictor" and am still getting the crash on later iterations.

During each CV iteration we delete the model to release memory:

    model_copy = model_object.model.copy()
    model_copy.variables = variables_copy
    model_object.model.__del__()
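
Note that calling __del__() directly only runs the finalizer on that one reference; if any other reference to the Booster or its DMatrix objects survives, the native (and GPU) memory stays allocated. A hedged sketch of a more conventional cleanup between folds:

import gc

import numpy as np
import xgboost as xgb

X = np.random.rand(1_000, 10).astype(np.float32)
y = np.random.rand(1_000).astype(np.float32)

for k in range(3):
    dtrain = xgb.DMatrix(X, label=y)
    bst = xgb.train({'objective': 'reg:linear'}, dtrain, num_boost_round=10)
    # ... evaluate the fold ...
    # Drop every reference and force a collection so the Booster's native
    # memory is actually released before the next fold is built.
    del bst, dtrain
    gc.collect()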

@trivialfis
Member

@hlbkin Maybe you can try external memory on CPU after #4193 is merged. Other than that, there's not much we can do about bad_alloc. We can try optimizing memory usage as we go, but that won't happen overnight.
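
(For anyone landing here later, a hedged sketch of what CPU external memory looked like in the Python API of this era; the file names and cache prefixes are placeholders.)

import xgboost as xgb

# Appending '#<cache_prefix>' to a libsvm-format file path tells XGBoost to
# page the data from disk instead of holding it all in RAM.
dtrain = xgb.DMatrix('train.libsvm#dtrain.cache')
dtest = xgb.DMatrix('test.libsvm#dtest.cache')

params = {'tree_method': 'hist', 'objective': 'reg:linear', 'eval_metric': 'rmse'}
bst = xgb.train(params, dtrain, num_boost_round=100, evals=[(dtest, 'test')])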

@hlbkin
Author

hlbkin commented Feb 28, 2019

@trivialfis Is the GPU failing because of a temporary memory burst?

@trivialfis
Member

trivialfis commented Feb 28, 2019

@hlbkin Possibly. But you can use tools like nvidia-smi to monitor the status of the GPU. And please use the latest version; I have been fixing various bugs in XGBoost. It's not going to be bug-free, but I believe the latest commit is much more robust than 0.7x.
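
(As an illustration, GPU memory can be polled alongside training with something like the following; the one-second interval is arbitrary.)

import subprocess
import time

# Poll GPU memory once per second with nvidia-smi; run this in a separate
# terminal or thread while the training job is going.
while True:
    out = subprocess.check_output(
        ['nvidia-smi', '--query-gpu=memory.used,memory.total',
         '--format=csv,noheader']).decode().strip()
    print(time.strftime('%H:%M:%S'), out)
    time.sleep(1)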

@hlbkin
Author

hlbkin commented Mar 3, 2019

Yes, using 0.81 is much better; I generally do not see this kind of behaviour, and it does not crash with the same dataset.

nvidia-smi does not really show memory bursts, i.e. the GPU is generally using less than 50% of its memory and then suddenly crashes (with 0.72).

I will run some tests with a bigger (but similar) dataset and report back.

@hcho3
Collaborator

hcho3 commented Mar 13, 2019

@hlbkin Are things looking better now?

@hlbkin
Author

hlbkin commented Mar 15, 2019

Yes, everything is fine for now. What I also did was convert the numpy dataset to float32 manually (which is natural for XGBoost, as I understand it).
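
(The cast in question is a one-liner; a sketch with synthetic data:)

import numpy as np
import xgboost as xgb

X = np.random.rand(10_000, 100)            # numpy defaults to float64
X = X.astype(np.float32)                   # halves the host-memory footprint
y = np.random.rand(10_000).astype(np.float32)

# DMatrix stores feature values as 32-bit floats, so passing float64 data
# also forces an extra conversion copy at construction time.
dtrain = xgb.DMatrix(X, label=y)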

@hlbkin hlbkin closed this as completed Mar 15, 2019
@hlbkin hlbkin mentioned this issue May 1, 2019
@lock lock bot locked as resolved and limited conversation to collaborators Jun 13, 2019