[Core] Ray creates an issue with CatBoost #50843

Open
celestinoxp opened this issue Feb 23, 2025 · 3 comments
Labels
bug Something that is supposed to be working; but isn't core Issues that should be addressed in Ray Core triage Needs triage (eg: priority, bug/not-bug, and owning component)

Comments

@celestinoxp

What happened + What you expected to happen

AutoGluon uses Ray for parallelism. With a large dataset that has many columns (e.g., 10,000), Ray surfaces an error raised by CatBoost. Only CatBoost is affected; all other algorithms work well.
The Ray team should investigate this, because if I uninstall Ray, AutoGluon does not raise the error:

Error:
Fitting model: CatBoost_BAG_L1 ... Training model for up to 9610.46s of the 16674.71s of remaining time.
Memory not enough to fit 8 folds in parallel. Will train 2 folds in parallel instead (Estimated 28.90% memory usage per fold, 57.80%/80.00% total).
Fitting 8 child models (S1F1 - S1F8) | Fitting with ParallelLocalFoldFittingStrategy (2 workers, per: cpus=8, gpus=0, memory=28.90%)
Warning: Exception caused CatBoost_BAG_L1 to fail during training... Skipping this model.
ray::_ray_fit() (pid=1932, ip=127.0.0.1)
File "python\ray\_raylet.pyx", line 1883, in ray._raylet.execute_task
File "C:\Users\celes\anaconda3\Lib\site-packages\autogluon\core\models\ensemble\fold_fitting_strategy.py", line 413, in _ray_fit
fold_model.fit(X=X_fold, y=y_fold, X_val=X_val_fold, y_val=y_val_fold, time_limit=time_limit_fold, **resources, **kwargs_fold)
File "C:\Users\celes\anaconda3\Lib\site-packages\autogluon\core\models\abstract\abstract_model.py", line 925, in fit
out = self._fit(**kwargs)
^^^^^^^^^^^^^^^^^^^
File "C:\Users\celes\anaconda3\Lib\site-packages\autogluon\tabular\models\catboost\catboost_model.py", line 243, in _fit
self.model.fit(X, **fit_final_kwargs)
File "C:\Users\celes\anaconda3\Lib\site-packages\catboost\core.py", line 5245, in fit
self._fit(X, y, cat_features, text_features, embedding_features, None, graph, sample_weight, None, None, None, None, baseline, use_best_model,
File "C:\Users\celes\anaconda3\Lib\site-packages\catboost\core.py", line 2410, in _fit
self._train(
File "C:\Users\celes\anaconda3\Lib\site-packages\catboost\core.py", line 1790, in _train
self._object._train(train_pool, test_pool, params, allow_clear_pool, init_model._object if init_model else None)
File "_catboost.pyx", line 5017, in _catboost._CatBoost._train
File "_catboost.pyx", line 5066, in _catboost._CatBoost._train
_catboost.CatBoostError: bad allocation
Detailed Traceback:
Traceback (most recent call last):
File "C:\Users\celes\anaconda3\Lib\site-packages\autogluon\tabular\trainer\abstract_trainer.py", line 2160, in _train_and_save
model = self._train_single(**model_fit_kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\celes\anaconda3\Lib\site-packages\autogluon\tabular\trainer\abstract_trainer.py", line 2047, in _train_single
model = model.fit(X=X, y=y, X_val=X_val, y_val=y_val, X_test=X_test, y_test=y_test, total_resources=total_resources, **model_fit_kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\celes\anaconda3\Lib\site-packages\autogluon\core\models\abstract\abstract_model.py", line 925, in fit
out = self._fit(**kwargs)
^^^^^^^^^^^^^^^^^^^
File "C:\Users\celes\anaconda3\Lib\site-packages\autogluon\core\models\ensemble\stacker_ensemble_model.py", line 270, in _fit
return super()._fit(X=X, y=y, time_limit=time_limit, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\celes\anaconda3\Lib\site-packages\autogluon\core\models\ensemble\bagged_ensemble_model.py", line 390, in _fit
self._fit_folds(
File "C:\Users\celes\anaconda3\Lib\site-packages\autogluon\core\models\ensemble\bagged_ensemble_model.py", line 847, in _fit_folds
fold_fitting_strategy.after_all_folds_scheduled()
File "C:\Users\celes\anaconda3\Lib\site-packages\autogluon\core\models\ensemble\fold_fitting_strategy.py", line 690, in after_all_folds_scheduled
self._run_parallel(X, y, X_pseudo, y_pseudo, model_base_ref, time_limit_fold, head_node_id)
File "C:\Users\celes\anaconda3\Lib\site-packages\autogluon\core\models\ensemble\fold_fitting_strategy.py", line 631, in _run_parallel
self._process_fold_results(finished, unfinished, fold_ctx)
File "C:\Users\celes\anaconda3\Lib\site-packages\autogluon\core\models\ensemble\fold_fitting_strategy.py", line 587, in _process_fold_results
raise processed_exception
File "C:\Users\celes\anaconda3\Lib\site-packages\autogluon\core\models\ensemble\fold_fitting_strategy.py", line 550, in _process_fold_results
fold_model, pred_proba, time_start_fit, time_end_fit, predict_time, predict_1_time, predict_n_size, fit_num_cpus, fit_num_gpus = self.ray.get(finished)
^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\celes\anaconda3\Lib\site-packages\ray\_private\auto_init_hook.py", line 21, in auto_init_wrapper
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "C:\Users\celes\anaconda3\Lib\site-packages\ray\_private\client_mode_hook.py", line 103, in wrapper
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\celes\anaconda3\Lib\site-packages\ray\_private\worker.py", line 2771, in get
values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\celes\anaconda3\Lib\site-packages\ray\_private\worker.py", line 919, in get_objects
raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(CatBoostError): ray::_ray_fit() (pid=1932, ip=127.0.0.1)
File "python\ray\_raylet.pyx", line 1883, in ray._raylet.execute_task
File "C:\Users\celes\anaconda3\Lib\site-packages\autogluon\core\models\ensemble\fold_fitting_strategy.py", line 413, in _ray_fit
fold_model.fit(X=X_fold, y=y_fold, X_val=X_val_fold, y_val=y_val_fold, time_limit=time_limit_fold, **resources, **kwargs_fold)
File "C:\Users\celes\anaconda3\Lib\site-packages\autogluon\core\models\abstract\abstract_model.py", line 925, in fit
out = self._fit(**kwargs)
^^^^^^^^^^^^^^^^^^^
File "C:\Users\celes\anaconda3\Lib\site-packages\autogluon\tabular\models\catboost\catboost_model.py", line 243, in _fit
self.model.fit(X, **fit_final_kwargs)
File "C:\Users\celes\anaconda3\Lib\site-packages\catboost\core.py", line 5245, in fit
self._fit(X, y, cat_features, text_features, embedding_features, None, graph, sample_weight, None, None, None, None, baseline, use_best_model,
File "C:\Users\celes\anaconda3\Lib\site-packages\catboost\core.py", line 2410, in _fit
self._train(
File "C:\Users\celes\anaconda3\Lib\site-packages\catboost\core.py", line 1790, in _train
self._object._train(train_pool, test_pool, params, allow_clear_pool, init_model._object if init_model else None)
File "_catboost.pyx", line 5017, in _catboost._CatBoost._train
File "_catboost.pyx", line 5066, in _catboost._CatBoost._train
_catboost.CatBoostError: bad allocation

Versions / Dependencies

AutoGluon 1.2
Ray 3.0.0-dev (latest build, to pick up the latest fixes)
Python 3.12.9
Windows
CatBoost 1.2.7 dev (latest build, to pick up the latest fixes)

Reproduction script

My code is AutoGluon code:

from autogluon.tabular import TabularPredictor

# dados_treino_n2_maior_igual_17 is the training data (its loading code is not part of this report).
predictor = TabularPredictor(
    label="n2_maior_igual_17",
    eval_metric="log_loss",
    path="modelos/n2_maior_igual_17/",
).fit(
    dados_treino_n2_maior_igual_17,
    presets="best_quality",
    excluded_model_types=["KNN", "XT", "RF"],
    ds_args={"enable_ray_logging": False},
    ag_args_fit={
        "early_stop": None,
        "colsample_bylevel": 1.0,
    },
    time_limit=8 * 3600,  # 8 hours, in seconds
    refit_full=True,
    calibrate=True,
)

Issue Severity

High: It blocks me from completing my task.

@celestinoxp celestinoxp added bug Something that is supposed to be working; but isn't triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Feb 23, 2025
@celestinoxp
Author

@Innixma I think Ray and CatBoost together try to use too much memory, so CatBoost cannot allocate what it needs and raises an error. I don't know whether AutoGluon is the culprit, but it seems to me that Ray and CatBoost are. All other models work without problems. What do you think?


@celestinoxp
Author

celestinoxp commented Feb 23, 2025

This issue should be discussed by the teams and contributors of the packages in question in order to fix the problem:
Autogluon: @Innixma
Ray: @aslonnie @dentiny @kevin85421 @anmscale
Catboost: @andrey-khropov @georgthegreat

@Innixma

Innixma commented Feb 25, 2025

Hi @celestinoxp, this is a problem with AutoGluon, not with Ray or CatBoost. It is AutoGluon's job to stop the models prior to them going OOM and to estimate their memory usage prior to fitting.

This estimate is from AutoGluon:

Will train 2 folds in parallel instead (Estimated 28.90% memory usage per fold, 57.80%/80.00% total)

In this case, AutoGluon's estimate was wrong, and the two models ended up taking >100% of memory instead of 57.80%, causing the out of memory exception.
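The numbers in that log line can be reproduced with simple arithmetic. The following is a minimal sketch, not AutoGluon's actual implementation: the helper name `max_parallel_folds` is hypothetical, and it assumes the worker count is simply the number of per-fold estimates that fit under the memory cap.

```python
import math


def max_parallel_folds(per_fold_fraction: float, cap_fraction: float, num_folds: int) -> int:
    """Return how many folds fit under the memory cap at once.

    Hypothetical helper: assumes every fold needs `per_fold_fraction`
    of total memory and that at least one fold is always attempted.
    """
    if per_fold_fraction <= 0:
        raise ValueError("per_fold_fraction must be positive")
    fit = math.floor(cap_fraction / per_fold_fraction)
    return max(1, min(num_folds, fit))


# Values taken from the log: 28.90% per fold, 80.00% cap, 8 folds.
workers = max_parallel_folds(0.2890, 0.80, 8)
print(workers)                       # 2
print(f"{workers * 0.2890:.2%}")     # 57.80%
```

Under this model, the plan is only safe if the per-fold estimate holds. If each fold actually needs more than half of the available memory, two concurrent folds exceed 100% even though the plan reported 57.80%.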

I'd recommend closing this issue and redirecting it to AutoGluon.
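Until the estimate is corrected, one possible workaround is to train the bagged folds one at a time so that only a single CatBoost fold holds memory at any moment. The sketch below assumes AutoGluon's documented `ag_args_ensemble` option `fold_fitting_strategy` with the value `"sequential_local"`; the wrapper `fit_sequentially` is a hypothetical name, and nobody in this thread has confirmed that this avoids the `bad allocation` error on this dataset.

```python
# Hypothetical workaround sketch; not a fix confirmed in this thread.
fit_kwargs = {
    "presets": "best_quality",
    "excluded_model_types": ["KNN", "XT", "RF"],
    "time_limit": 8 * 3600,  # 8 hours, in seconds
    # Assumption: fit bagged folds sequentially instead of in parallel.
    "ag_args_ensemble": {"fold_fitting_strategy": "sequential_local"},
}


def fit_sequentially(train_data, label: str, path: str):
    """Fit a TabularPredictor with sequential fold fitting (hypothetical wrapper)."""
    # Imported lazily so the configuration above can be inspected without AutoGluon installed.
    from autogluon.tabular import TabularPredictor

    return TabularPredictor(label=label, eval_metric="log_loss", path=path).fit(
        train_data, **fit_kwargs
    )
```

Sequential fitting trades wall-clock time for a lower peak memory footprint, so the time limit may need to be revisited.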

@jcotant1 jcotant1 added the core Issues that should be addressed in Ray Core label Feb 25, 2025
3 participants