Closed
Labels
P1 (issue that should be fixed within a few weeks), bug (something that is supposed to be working, but isn't), data (Ray Data-related issues)
Description
What happened + What you expected to happen
The map groups release tests fail with scale-factor 100 if you use an autoscaling cluster.
Here's an example run: https://buildkite.com/ray-project/release/builds/68217#_.
From what I can tell, the failures happen because the aggregation actors OOM. Ray Data only reserves 0.26 CPUs per aggregator, so the scheduler might be packing too many aggregators onto each node. That might point to an issue in `_get_aggregator_num_cpus`:
ray/python/ray/data/_internal/execution/operators/hash_shuffle.py
Lines 1177 to 1183 in 9ffdd76

```python
cap = min(4.0, total_available_cluster_resources.cpu * 0.25 / num_aggregators)
target_num_cpus = min(cap, estimated_aggregator_memory_required / (4 * GiB))
# Round resource to 2d decimal point (for readability)
return round(target_num_cpus, 2)
```
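For context, here is a minimal standalone sketch of that formula (the function name, cluster size, and aggregator count below are hypothetical, purely to illustrate how a fractional reservation like 0.26 CPUs can fall out of the 25%-of-cluster cap):

```python
GiB = 1024 ** 3

def estimate_aggregator_num_cpus(
    cluster_cpus: float,
    num_aggregators: int,
    estimated_aggregator_memory_required: int,
) -> float:
    # Cap the per-aggregator reservation at 4 CPUs, or at a 25% share of the
    # cluster's CPUs split evenly across all aggregators, whichever is smaller.
    cap = min(4.0, cluster_cpus * 0.25 / num_aggregators)
    # Scale the reservation with the estimated memory need (1 CPU per 4 GiB),
    # but never above the cap.
    target_num_cpus = min(cap, estimated_aggregator_memory_required / (4 * GiB))
    # Round to 2 decimal places, as in the original code.
    return round(target_num_cpus, 2)

# Hypothetical example: a 104-CPU cluster with 100 aggregators, each expected
# to need ~8 GiB. The 25%-of-cluster cap dominates, so only 0.26 CPUs are
# reserved per aggregator even though the memory term alone would ask for 2.
print(estimate_aggregator_num_cpus(104, 100, 8 * GiB))  # -> 0.26
```

With such a small CPU reservation, the scheduler can co-locate many aggregators on one node regardless of their actual memory footprint, which is consistent with the OOMs described above.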
Versions / Dependencies
Reproduction script
See the autoscaling map groups release tests.
Issue Severity
None