[Data] map_groups release test fails with scale-factor 100 on autoscaling cluster #58734

@bveeramani

What happened + What you expected to happen

The map_groups release tests fail at scale factor 100 when they run on an autoscaling cluster.

Here's an example run: https://buildkite.com/ray-project/release/builds/68217#_.

From what I can tell, the failures happen because the aggregation actors run out of memory (OOM). Ray Data reserves only 0.26 CPUs per aggregator, so the scheduler may be packing too many aggregators onto each node. That might point to an issue in _get_aggregator_num_cpus:

cap = min(4.0, total_available_cluster_resources.cpu * 0.25 / num_aggregators)
target_num_cpus = min(cap, estimated_aggregator_memory_required / (4 * GiB))
# Round resource to 2d decimal point (for readability)
return round(target_num_cpus, 2)
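
To make the packing concern concrete, here is a minimal standalone sketch of the arithmetic quoted above. It is not Ray's actual function: the name `aggregator_num_cpus`, its flat argument list, the 128-CPU cluster, the 64 aggregators, the 1.04 GiB memory estimate, and the 16-CPU node are all hypothetical values chosen only so the result matches the 0.26 CPUs seen in the failing run.

```python
import math

GiB = 1024**3


def aggregator_num_cpus(total_cluster_cpus, num_aggregators, estimated_memory_bytes):
    """Standalone copy of the quoted arithmetic (hypothetical helper, not Ray's API)."""
    # At most 25% of cluster CPUs are split across aggregators, capped at 4 CPUs each.
    cap = min(4.0, total_cluster_cpus * 0.25 / num_aggregators)
    # One CPU is requested per 4 GiB of estimated aggregator memory.
    target_num_cpus = min(cap, estimated_memory_bytes / (4 * GiB))
    return round(target_num_cpus, 2)


# Hypothetical inputs that reproduce the 0.26 CPUs per aggregator from the report.
cpus_per_aggregator = aggregator_num_cpus(
    total_cluster_cpus=128,
    num_aggregators=64,
    estimated_memory_bytes=1.04 * GiB,
)

# Hypothetical 16-CPU node: how many aggregators fit if CPU is the only constraint,
# and how much memory would they need in total under the same 1.04 GiB estimate?
node_cpus = 16
aggregators_per_node = math.floor(node_cpus / cpus_per_aggregator)
memory_needed_gib = round(aggregators_per_node * 1.04, 1)

print(cpus_per_aggregator, aggregators_per_node, memory_needed_gib)
```

Under these assumed numbers, CPU-only packing would allow 61 aggregators on one node, which together would need roughly 63 GiB. The formula implies one CPU per 4 GiB of estimated memory, so a node with less than 4 GiB of memory per CPU could be oversubscribed if the CPU request is the only scheduling constraint. The result is also sensitive to the cluster CPU count passed in: with the same assumed inputs but only 8 cluster CPUs, the cap term dominates and the sketch returns 0.03. Whether either effect explains the autoscaling failures is not confirmed here.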

Versions / Dependencies

9ffdd76

Reproduction script

See the autoscaling map_groups release tests.

Issue Severity

None

Labels

P1: Issue that should be fixed within a few weeks
bug: Something that is supposed to be working; but isn't
data: Ray Data-related issues
