celestinoxp opened this issue
Feb 23, 2025
· 3 comments
Labels
bug: Something that is supposed to be working; but isn't
core: Issues that should be addressed in Ray Core
triage: Needs triage (eg: priority, bug/not-bug, and owning component)
AutoGluon uses Ray for parallelism, but with a large dataset that has many columns (e.g. 10,000), Ray surfaces an error from CatBoost. Only CatBoost has this problem; all other algorithms work well.
The Ray team should investigate this, because if I uninstall Ray, AutoGluon does not give the error:
Error:
Fitting model: CatBoost_BAG_L1 ... Training model for up to 9610.46s of the 16674.71s of remaining time.
Memory not enough to fit 8 folds in parallel. Will train 2 folds in parallel instead (Estimated 28.90% memory usage per fold, 57.80%/80.00% total).
Fitting 8 child models (S1F1 - S1F8) | Fitting with ParallelLocalFoldFittingStrategy (2 workers, per: cpus=8, gpus=0, memory=28.90%)
Warning: Exception caused CatBoost_BAG_L1 to fail during training... Skipping this model.
ray::_ray_fit() (pid=1932, ip=127.0.0.1)
File "python\ray\_raylet.pyx", line 1883, in ray._raylet.execute_task
File "C:\Users\celes\anaconda3\Lib\site-packages\autogluon\core\models\ensemble\fold_fitting_strategy.py", line 413, in _ray_fit
fold_model.fit(X=X_fold, y=y_fold, X_val=X_val_fold, y_val=y_val_fold, time_limit=time_limit_fold, **resources, **kwargs_fold)
File "C:\Users\celes\anaconda3\Lib\site-packages\autogluon\core\models\abstract\abstract_model.py", line 925, in fit
out = self._fit(**kwargs)
^^^^^^^^^^^^^^^^^^^
File "C:\Users\celes\anaconda3\Lib\site-packages\autogluon\tabular\models\catboost\catboost_model.py", line 243, in _fit
self.model.fit(X, **fit_final_kwargs)
File "C:\Users\celes\anaconda3\Lib\site-packages\catboost\core.py", line 5245, in fit
self._fit(X, y, cat_features, text_features, embedding_features, None, graph, sample_weight, None, None, None, None, baseline, use_best_model,
File "C:\Users\celes\anaconda3\Lib\site-packages\catboost\core.py", line 2410, in _fit
self._train(
File "C:\Users\celes\anaconda3\Lib\site-packages\catboost\core.py", line 1790, in _train
self._object._train(train_pool, test_pool, params, allow_clear_pool, init_model._object if init_model else None)
File "_catboost.pyx", line 5017, in _catboost._CatBoost._train
File "_catboost.pyx", line 5066, in _catboost._CatBoost._train
_catboost.CatBoostError: bad allocation
Detailed Traceback:
Traceback (most recent call last):
File "C:\Users\celes\anaconda3\Lib\site-packages\autogluon\tabular\trainer\abstract_trainer.py", line 2160, in _train_and_save
model = self._train_single(**model_fit_kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\celes\anaconda3\Lib\site-packages\autogluon\tabular\trainer\abstract_trainer.py", line 2047, in _train_single
model = model.fit(X=X, y=y, X_val=X_val, y_val=y_val, X_test=X_test, y_test=y_test, total_resources=total_resources, **model_fit_kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\celes\anaconda3\Lib\site-packages\autogluon\core\models\abstract\abstract_model.py", line 925, in fit
out = self._fit(**kwargs)
^^^^^^^^^^^^^^^^^^^
File "C:\Users\celes\anaconda3\Lib\site-packages\autogluon\core\models\ensemble\stacker_ensemble_model.py", line 270, in _fit
return super()._fit(X=X, y=y, time_limit=time_limit, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\celes\anaconda3\Lib\site-packages\autogluon\core\models\ensemble\bagged_ensemble_model.py", line 390, in _fit
self._fit_folds(
File "C:\Users\celes\anaconda3\Lib\site-packages\autogluon\core\models\ensemble\bagged_ensemble_model.py", line 847, in _fit_folds
fold_fitting_strategy.after_all_folds_scheduled()
File "C:\Users\celes\anaconda3\Lib\site-packages\autogluon\core\models\ensemble\fold_fitting_strategy.py", line 690, in after_all_folds_scheduled
self._run_parallel(X, y, X_pseudo, y_pseudo, model_base_ref, time_limit_fold, head_node_id)
File "C:\Users\celes\anaconda3\Lib\site-packages\autogluon\core\models\ensemble\fold_fitting_strategy.py", line 631, in _run_parallel
self._process_fold_results(finished, unfinished, fold_ctx)
File "C:\Users\celes\anaconda3\Lib\site-packages\autogluon\core\models\ensemble\fold_fitting_strategy.py", line 587, in _process_fold_results
raise processed_exception
File "C:\Users\celes\anaconda3\Lib\site-packages\autogluon\core\models\ensemble\fold_fitting_strategy.py", line 550, in _process_fold_results
fold_model, pred_proba, time_start_fit, time_end_fit, predict_time, predict_1_time, predict_n_size, fit_num_cpus, fit_num_gpus = self.ray.get(finished)
^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\celes\anaconda3\Lib\site-packages\ray_private\auto_init_hook.py", line 21, in auto_init_wrapper
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "C:\Users\celes\anaconda3\Lib\site-packages\ray_private\client_mode_hook.py", line 103, in wrapper
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\celes\anaconda3\Lib\site-packages\ray_private\worker.py", line 2771, in get
values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\celes\anaconda3\Lib\site-packages\ray_private\worker.py", line 919, in get_objects
raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(CatBoostError): ray::_ray_fit() (pid=1932, ip=127.0.0.1)
File "python\ray\_raylet.pyx", line 1883, in ray._raylet.execute_task
File "C:\Users\celes\anaconda3\Lib\site-packages\autogluon\core\models\ensemble\fold_fitting_strategy.py", line 413, in _ray_fit
fold_model.fit(X=X_fold, y=y_fold, X_val=X_val_fold, y_val=y_val_fold, time_limit=time_limit_fold, **resources, **kwargs_fold)
File "C:\Users\celes\anaconda3\Lib\site-packages\autogluon\core\models\abstract\abstract_model.py", line 925, in fit
out = self._fit(**kwargs)
^^^^^^^^^^^^^^^^^^^
File "C:\Users\celes\anaconda3\Lib\site-packages\autogluon\tabular\models\catboost\catboost_model.py", line 243, in _fit
self.model.fit(X, **fit_final_kwargs)
File "C:\Users\celes\anaconda3\Lib\site-packages\catboost\core.py", line 5245, in fit
self._fit(X, y, cat_features, text_features, embedding_features, None, graph, sample_weight, None, None, None, None, baseline, use_best_model,
File "C:\Users\celes\anaconda3\Lib\site-packages\catboost\core.py", line 2410, in _fit
self._train(
File "C:\Users\celes\anaconda3\Lib\site-packages\catboost\core.py", line 1790, in _train
self._object._train(train_pool, test_pool, params, allow_clear_pool, init_model._object if init_model else None)
File "_catboost.pyx", line 5017, in _catboost._CatBoost._train
File "_catboost.pyx", line 5066, in _catboost._CatBoost._train
_catboost.CatBoostError: bad allocation
Versions / Dependencies
AutoGluon 1.2
Ray 3.0.0-dev (latest build, to have the latest fixes)
Python 3.12.9
Windows
CatBoost 1.2.7 dev (latest build, to have the latest fixes)
@Innixma I think Ray and CatBoost try to use too much memory, so CatBoost can't allocate it and raises an error. I don't know if AutoGluon is the culprit, but it seems to me that Ray and CatBoost are. All other models work without problems. What do you think about this?
Hi @celestinoxp, this is a problem with AutoGluon, not with Ray or CatBoost. It is AutoGluon's job to stop the models prior to them going OOM and to estimate their memory usage prior to fitting.
This estimate is from AutoGluon:
Will train 2 folds in parallel instead (Estimated 28.90% memory usage per fold, 57.80%/80.00% total)
In this case, AutoGluon's estimate was wrong, and the two models ended up taking >100% of memory instead of 57.80%, causing the out of memory exception.
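The arithmetic behind that log line can be sketched roughly as follows. This is only an illustration of how a per-fold estimate of 28.90% and an 80.00% budget lead to 2 parallel folds; it is not AutoGluon's actual estimation code, and the function name `max_parallel_folds` is hypothetical.

```python
import math


def max_parallel_folds(per_fold_fraction: float, budget_fraction: float, num_folds: int) -> int:
    """Largest number of folds whose combined *estimated* memory fits the budget.

    Hypothetical helper for illustration; the real scheduler may also
    consider CPUs, GPUs, and other constraints.
    """
    if per_fold_fraction <= 0:
        return num_folds
    fits = math.floor(budget_fraction / per_fold_fraction)
    # Always allow at least one fold, and never more than the folds requested.
    return max(1, min(num_folds, fits))


per_fold = 0.2890  # "Estimated 28.90% memory usage per fold"
budget = 0.80      # "80.00% total" allowed
folds = max_parallel_folds(per_fold, budget, num_folds=8)
print(folds, f"{folds * per_fold:.2%}")  # 2 57.80%
```

The result is only as good as the per-fold estimate: if each fold actually needs more than half of the machine's memory, two parallel folds will exceed 100% even though the estimate says 57.80%.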
I'd recommend closing this issue and redirecting it to AutoGluon.
Reproduction script
My code uses AutoGluon:
from autogluon.tabular import TabularPredictor

# dados_treino_n2_maior_igual_17 is the training DataFrame (loaded elsewhere)
predictor = TabularPredictor(
    label="n2_maior_igual_17",
    eval_metric="log_loss",
    path="modelos/n2_maior_igual_17/"
).fit(
    dados_treino_n2_maior_igual_17,
    presets="best_quality",
    excluded_model_types=["KNN", "XT", "RF"],
    ds_args={"enable_ray_logging": False},
    ag_args_fit={
        "early_stop": None,
        "colsample_bylevel": 1.0,
    },
    time_limit=8 * 3600,
    refit_full=True,
    calibrate=True
)
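Since the error only appears when folds are trained in parallel through Ray, one thing that might be worth trying is asking AutoGluon to fit the folds one at a time. The sketch below assumes the `fold_fitting_strategy` option of `ag_args_ensemble` accepts `"sequential_local"` in the installed AutoGluon version; the helper `with_sequential_folds` is hypothetical and only builds the keyword arguments.

```python
def with_sequential_folds(fit_kwargs: dict) -> dict:
    """Return a copy of fit_kwargs that requests sequential fold fitting.

    Hypothetical helper; assumes AutoGluon honours
    ag_args_ensemble={"fold_fitting_strategy": "sequential_local"}.
    """
    merged = dict(fit_kwargs)
    # Copy any existing ensemble args so the caller's dict is not mutated.
    ensemble_args = dict(merged.get("ag_args_ensemble") or {})
    ensemble_args["fold_fitting_strategy"] = "sequential_local"
    merged["ag_args_ensemble"] = ensemble_args
    return merged


fit_kwargs = with_sequential_folds({"presets": "best_quality", "time_limit": 8 * 3600})
print(fit_kwargs["ag_args_ensemble"])

# Intended use (not executed here):
# predictor = TabularPredictor(label="n2_maior_igual_17").fit(train_data, **fit_kwargs)
```

Sequential fitting trades training speed for a lower peak memory footprint, so it would also show whether the failure comes from the parallel memory estimate rather than from CatBoost itself.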
Issue Severity
High: It blocks me from completing my task.