What happened + What you expected to happen
Description
The PlacementGroupFactory in Ray Tune works as expected for simple nested child tasks, but fails when using a nested Ray XGBoost trainer. The nested XGBoost trainer does not use the placement group's resources; instead it creates an additional placement group, causing tuner.fit to hang forever on that PENDING placement group.
We kindly request the Ray team's assistance:
- Is the issue described below expected? If not, how should we address it? A rough timeline would be helpful if one is available.
- Could you provide guidance on preventing hangs when placement groups cannot be satisfied in Ray Tune? We would prefer a graceful failure over an indefinite hang (one workaround idea is sketched right after this list).
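To illustrate the kind of graceful failure we have in mind, here is a minimal sketch using Ray's core placement group API outside of Tune (we are not claiming Tune exposes this today): the raw API lets a caller wait for a placement group with a timeout and fail fast instead of blocking forever.

import ray
from ray.util.placement_group import placement_group, remove_placement_group

ray.init(num_cpus=4)

# Try to reserve the resources a nested trainer would need, but give up after
# a bounded wait instead of hanging on a PENDING placement group.
pg = placement_group([{"CPU": 3}], strategy="PACK")
if not pg.wait(timeout_seconds=30):  # returns False if the PG is not ready in time
    remove_placement_group(pg)
    raise RuntimeError("Placement group {'CPU': 3} could not be scheduled within 30s")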
Steps to Reproduce (See Reproduction script section for full scripts)
- Set up a Ray cluster with 4 CPUs
- Use the following code structure:
from ray.train.xgboost import XGBoostTrainer

def train_xgboost(config):
    trainer = XGBoostTrainer(
        scaling_config=ScalingConfig(
            num_workers=1,
            resources_per_worker={"CPU": 2},
        ),
        ...
    )
    result = trainer.fit()

pg = tune.PlacementGroupFactory([
    {"CPU": 1},
    {"CPU": 3},
])

tuner = tune.Tuner(
    tune.with_resources(train_xgboost, resources=pg),
    ...
)
tuner.fit()
Expected Behavior
The nested XGBoost trainer should use a total of 3 CPUs: 1 for the coordinator + num_workers (1) × CPUs per worker (2) = 3. The XGBoost trainer should therefore be able to run inside the already-created placement group bundle with {"CPU": 3}.
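For reference, the arithmetic behind that expectation (the 1-CPU coordinator figure is the one stated above):

# Expected CPU accounting for the nested XGBoostTrainer
num_workers = 1
cpus_per_worker = 2
coordinator_cpus = 1  # coordinating trainer actor
expected_total = coordinator_cpus + num_workers * cpus_per_worker  # 3, matching the {"CPU": 3} bundle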
Actual Behavior
The tuner.fit call hangs forever, and the Ray dashboard shows the following resource usage:
Resource Status
Usage:
1.0/4.0 CPU (1.0 used of 4.0 reserved in placement groups)
0B/28.68GiB memory
35.87KiB/2.00GiB object_store_memory
Demands:
{'CPU': 3.0} * 1 (PACK): 1+ pending placement groups
The placement group table (screenshot omitted) shows that two placement groups were created:
- The first placement group, with bundles {"CPU": 1} and {"CPU": 3}. This PG was created successfully and handed to the Ray Tune trial.
- The second placement group, with a single {"CPU": 3} bundle. This PG could not be fulfilled because the entire cluster has only 4 CPUs, all of which are already reserved by the first placement group; as a result, the second PG is stuck in the PENDING state, causing tuner.fit to hang.
In other words, the XGBoost trainer did not use the 3 CPUs specified in the placement group; instead it created an additional placement group requesting 3 more CPUs. Since the cluster has a total of 4 CPUs and all of them are reserved by the first placement group, the second placement group can never be satisfied and the run hangs forever.
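Spelling out the over-subscription with the numbers from the dashboard output above (just arithmetic, no new measurements):

cluster_cpus = 4
first_pg_cpus = 1 + 3    # trial PG from PlacementGroupFactory([{"CPU": 1}, {"CPU": 3}])
second_pg_cpus = 3       # extra PG created by the nested trainer ({'CPU': 3.0} * 1 in the demands)
assert first_pg_cpus == cluster_cpus                   # the trial PG already reserves every CPU
assert first_pg_cpus + second_pg_cpus > cluster_cpus   # so the second PG can never be placed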
Things I have Tried
- It works if we don't specify an additional bundle in the PG. Changing [{"CPU": 1}, {"CPU": 3}] to [{"CPU": 1}] resolves the issue: the first PG uses 1 CPU, and the nested XGBoost trainer creates a second PG using 3 CPUs, totaling 4 CPUs (matching the Ray cluster's capacity). However, this workaround is inadequate: Ray does not pre-check whether the nested XGBoost trainer's resource request can be satisfied, so without careful tuning, hangs may still occur, especially when max_concurrent_trials exceeds 1.
- We can enforce a timeout at either the trial or the experiment level (sketched after this list); however, this is also inadequate because we don't necessarily know a suitable timeout value to set, given the varying sizes of our workloads.
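A minimal sketch of the timeout approach mentioned above, assuming the train_xgboost trainable and pg factory from the reproduction script; the 600-second value is an arbitrary guess, which is exactly the problem:

from ray import train, tune
from ray.tune.stopper import TimeoutStopper

tuner = tune.Tuner(
    tune.with_resources(train_xgboost, resources=pg),
    tune_config=tune.TuneConfig(
        metric="train-rmse",
        mode="min",
        num_samples=1,
        max_concurrent_trials=1,
        time_budget_s=600,  # experiment-level time budget (arbitrary guess)
    ),
    run_config=train.RunConfig(stop=TimeoutStopper(600)),  # stop trials after 600 seconds
)
tuner.fit()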
Versions / Dependencies
Ray: 2.10.0
Python: 3.11.5
OS: macOS
Reproduction script
from ray import tune
from ray import train
from ray.train.xgboost import XGBoostTrainer
from ray.train import ScalingConfig
import ray

# Set up a Ray cluster with 4 CPUs
ray.shutdown()
ray.init(num_cpus=4)

# Trainable that runs a nested XGBoostTrainer inside each Tune trial
def train_xgboost(config):
    train_dataset = ray.data.from_items([{"x": x, "y": x + 1} for x in range(32)])
    trainer = XGBoostTrainer(
        label_column="y",
        params={"objective": "reg:squarederror"},
        scaling_config=ScalingConfig(
            num_workers=1,
            resources_per_worker={"CPU": 2},
        ),
        datasets={"train": train_dataset},
    )
    result = trainer.fit()
    train.report({"train-rmse": result.metrics["train-rmse"]})

# Placement group requested for each Tune trial
pg = tune.PlacementGroupFactory([
    {"CPU": 1},
    {"CPU": 3},
])

tuner = tune.Tuner(
    tune.with_resources(train_xgboost, resources=pg),
    param_space={},
    tune_config=tune.TuneConfig(
        metric="train-rmse",
        mode="min",
        num_samples=1,
        max_concurrent_trials=1,
    ),
)
tuner.fit()
Issue Severity
High: It blocks me from completing my task.