Closed
Labels
P1 (issue that should be fixed within a few weeks), bug (something that is supposed to be working, but isn't), data (Ray Data-related issues)
Description
What happened + What you expected to happen
The map groups release tests fail with scale-factor 100 if you use an autoscaling cluster.
Here's an example run: https://buildkite.com/ray-project/release/builds/68217#_.
From what I can tell, the failures happen because the aggregation actors OOM. Ray Data only reserves 0.26 CPUs per aggregator, so the scheduler might be packing too many aggregators onto each node. That might point to an issue in `_get_aggregator_num_cpus`:
ray/python/ray/data/_internal/execution/operators/hash_shuffle.py
Lines 1177 to 1183 in 9ffdd76

```python
cap = min(4.0, total_available_cluster_resources.cpu * 0.25 / num_aggregators)
target_num_cpus = min(cap, estimated_aggregator_memory_required / (4 * GiB))
# Round resource to 2d decimal point (for readability)
return round(target_num_cpus, 2)
```
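For context, here is a minimal standalone sketch of that formula (the function name, cluster size, and aggregator count below are hypothetical, purely to illustrate how a fractional reservation like 0.26 CPUs can fall out of the 25%-of-cluster cap):

```python
GiB = 1024 ** 3

def estimate_aggregator_num_cpus(
    cluster_cpus: float,
    num_aggregators: int,
    estimated_aggregator_memory_required: int,
) -> float:
    # Cap the per-aggregator reservation at 4 CPUs, or at a 25% share of the
    # cluster's CPUs split evenly across all aggregators, whichever is smaller.
    cap = min(4.0, cluster_cpus * 0.25 / num_aggregators)
    # Scale the reservation with the estimated memory need (1 CPU per 4 GiB),
    # but never above the cap.
    target_num_cpus = min(cap, estimated_aggregator_memory_required / (4 * GiB))
    # Round to 2 decimal places, as in the original code.
    return round(target_num_cpus, 2)

# Hypothetical example: a 104-CPU cluster with 100 aggregators, each expected
# to need ~8 GiB. The 25%-of-cluster cap dominates, so only 0.26 CPUs are
# reserved per aggregator even though the memory term alone would ask for 2.
print(estimate_aggregator_num_cpus(104, 100, 8 * GiB))  # -> 0.26
```

With such a small CPU reservation, the scheduler can co-locate many aggregators on one node regardless of their actual memory footprint, which is consistent with the OOMs described above.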
Versions / Dependencies
Reproduction script
See the autoscaling map groups release tests.
Issue Severity
None