[RayTune+RayTrain] The PlacementGroupFactory in Ray Tune fails when using the Ray XGBoost trainer #47439

@sfc-gh-shchen

What happened + What you expected to happen

Description
The PlacementGroupFactory in Ray Tune works as expected for simple nested child tasks, but fails when using a nested Ray XGBoost trainer. The nested XGBoost trainer does not use the resources of the existing placement group; instead it creates an additional placement group, causing tuner.fit to hang forever on that PENDING placement group.

We kindly request the Ray team's assistance:

  • Is the issue described here expected behavior? If not, how should we address it? A rough timeline would be helpful if available.
  • Could you provide guidance on preventing hangs when placement groups can't be satisfied in Ray Tune? We'd prefer a graceful failure over an indefinite hang.

Steps to Reproduce (See Reproduction script section for full scripts)

  1. Set up a Ray cluster with 4 CPUs
  2. Use the following code structure:
from ray import tune
from ray.train import ScalingConfig
from ray.train.xgboost import XGBoostTrainer

def train_xgboost(config):
    trainer = XGBoostTrainer(
        scaling_config=ScalingConfig(
            num_workers=1,
            resources_per_worker={"CPU": 2},
        ),
        ...
    )
    result = trainer.fit()

pg = tune.PlacementGroupFactory([
    {"CPU": 1},
    {"CPU": 3},
])

tuner = tune.Tuner(
    tune.with_resources(train_xgboost, resources=pg),
    ...
)
tuner.fit()

Expected Behavior
The nested XGBoost trainer uses a total of 3 CPUs: 1 for the coordinator + num_workers (1) * CPUs per worker (2) = 3. The XGBoost trainer should therefore be able to use the already-created placement group bundle that has {"CPU": 3}.
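
For reference, this is my understanding of how the nested trainer's demand adds up (a minimal sketch; spelling out trainer_resources explicitly is only meant to mirror the documented 1-CPU default for the coordinator):

from ray.train import ScalingConfig

# Sketch of the expected arithmetic; trainer_resources is written out
# explicitly here only to illustrate the default 1-CPU coordinator.
scaling_config = ScalingConfig(
    trainer_resources={"CPU": 1},      # coordinator
    num_workers=1,
    resources_per_worker={"CPU": 2},   # 1 worker * 2 CPUs
)
# Total demand: 1 + 1 * 2 = 3 CPUs, which matches the second {"CPU": 3}
# bundle reserved by the outer PlacementGroupFactory.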

Actual Behavior
The tuner.fit call hangs forever, and the Ray dashboard shows the following resource usage:

Resource Status
Usage:
1.0/4.0 CPU (1.0 used of 4.0 reserved in placement groups)
0B/28.68GiB memory
35.87KiB/2.00GiB object_store_memory
Demands:
{'CPU': 3.0} * 1 (PACK): 1+ pending placement groups

and the placement group table shows the following:
[Screenshot of the placement group table, taken 2024-08-30 at 4:52:45 PM]

As shown in the placement group table, two placement groups were created:

  1. The first placement group, with bundles {"CPU": 1} and {"CPU": 3}; this PG was successfully created and passed down to Ray Tune.
  2. A second placement group with {"CPU": 3}; this PG could not be fulfilled because the entire cluster has 4 CPUs and all of them are already reserved by the first placement group. As a result, the second PG is stuck in the PENDING state, causing tuner.fit to hang.

In other words, the XGBoost trainer did not utilize the 3 CPUs reserved for it in the placement group; instead it created an additional placement group asking for 3 more CPUs. Since the cluster has a total of 4 CPUs, all of which are already reserved by the first placement group, the second placement group can never be satisfied and hangs forever.
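
For completeness, the PENDING state can also be inspected programmatically with ray.util.placement_group_table() (a minimal sketch; exact field names may vary slightly between Ray versions):

import ray

# List every placement group known to the cluster; in our run the second
# PG shows state "PENDING" while the first one shows "CREATED".
for pg_id, info in ray.util.placement_group_table().items():
    print(pg_id, info.get("state"), info.get("bundles"))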

Things I have Tried

  • It works if we don't specify the additional bundle in the PG. Changing [{"CPU": 1}, {"CPU": 3}] to [{"CPU": 1}] resolves the issue: the first PG uses 1 CPU, and the nested XGBoost trainer creates a second PG using 3 CPUs, totaling 4 CPUs (matching the Ray cluster's capacity). However, this workaround is inadequate. Ray doesn't pre-check whether the nested XGBoost trainer's resource request can ever be satisfied, so without careful tuning hangs may still occur, especially when max_concurrent_trials exceeds 1.

  • We can enforce a timeout at either the trial or the experiment level (see the sketch below). However, this is also inadequate because we don't necessarily know a suitable timeout value to set, given the varying sizes of our workloads.
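
For illustration, this is the kind of timeout we mean (a minimal sketch; the 600-second values are placeholders, not values we actually know to be appropriate for our workloads):

from ray import train, tune

tuner = tune.Tuner(
    tune.with_resources(train_xgboost, resources=pg),
    run_config=train.RunConfig(
        # Trial-level stop criterion: end a trial after 600 seconds of training time.
        stop={"time_total_s": 600},
    ),
    tune_config=tune.TuneConfig(
        # Experiment-level budget: stop the whole run after 600 seconds.
        time_budget_s=600,
    ),
)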

Versions / Dependencies

ray: 2.10.0
python: 3.11.5
OS: macOS

Reproduction script

from ray import tune
from ray import train
from ray.train.xgboost import XGBoostTrainer
from ray.train import ScalingConfig
import ray

# Set up a ray cluster with 4 cpus
ray.shutdown()
ray.init(num_cpus=4)


# Trainable: runs a nested XGBoost trainer inside the Tune trial
def train_xgboost(config):
    train_dataset = ray.data.from_items([{"x": x, "y": x + 1} for x in range(32)])
    trainer = XGBoostTrainer(
        label_column="y",
        params={"objective": "reg:squarederror"},
        scaling_config=ScalingConfig(
            num_workers=1,            
            resources_per_worker={"CPU": 2},            
        ),
        datasets={"train": train_dataset},
    )
    result = trainer.fit()
    train.report({"train-rmse": result.metrics["train-rmse"]})

pg = tune.PlacementGroupFactory([
    {"CPU": 1},
    {"CPU": 3}
])

tuner = tune.Tuner(    
    tune.with_resources(train_xgboost, resources=pg),
    param_space={},
    tune_config=tune.TuneConfig(
        metric="train-rmse",
        mode="min",
        num_samples=1,
        max_concurrent_trials=1,        
    ),
)
tuner.fit()

Issue Severity

High: It blocks me from completing my task.

Labels

bug: Something that is supposed to be working; but isn't
triage: Needs triage (eg: priority, bug/not-bug, and owning component)
tune: Tune-related issues
