Restart P2P on OSError: [Errno 28] No space left on device instead of failing if the cluster has grown since we started
#8674
Labels
adaptive: All things relating to adaptive scaling
enhancement: Improve existing functionality or make things work better
shuffle
The idea here is similar to #8673:
Since P2P fixes the set of involved workers during the initialization of a shuffle run, we don't benefit from workers who join the cluster afterward. This is particularly important because P2P can't succeed if the sum of available disk space across all involved workers is smaller than the size of the (serialized) data.
Even if we don't hit the heuristic suggested in #8673, I think we should restart a P2P operation if the disk buffer on an involved worker encounters an `OSError: [Errno 28] No space left on device` and the worker count has grown since we started. We should add a circuit breaker to this, similar to the `suspicious_count`, so that we don't keep restarting when the error is genuinely caused by inhomogeneous partitions (or because the cluster refuses to scale to a sufficient size).
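To make the proposal concrete, here is a minimal sketch of the restart logic in plain Python. It is not the distributed implementation: `run_with_restarts`, `run_shuffle`, `current_workers`, and `max_restarts` are hypothetical names made up for illustration, and the real change would have to live in the P2P scheduler and worker code.

```python
import errno
from typing import Callable, Set


def is_disk_full(exc: BaseException) -> bool:
    """Return True for OSError: [Errno 28] No space left on device."""
    return isinstance(exc, OSError) and exc.errno == errno.ENOSPC


def run_with_restarts(
    run_shuffle: Callable[[Set[str]], object],  # hypothetical: runs one shuffle attempt
    current_workers: Callable[[], Set[str]],    # hypothetical: workers currently in the cluster
    max_restarts: int = 3,                      # hypothetical circuit-breaker limit
):
    # The set of involved workers is fixed when the run is initialized.
    workers = set(current_workers())
    restarts = 0
    while True:
        try:
            return run_shuffle(workers)
        except OSError as exc:
            if not is_disk_full(exc):
                raise
            latest = set(current_workers())
            grown = len(latest) > len(workers)
            # Give up if the cluster has not grown, or if the circuit
            # breaker has tripped (e.g. skewed partitions keep filling
            # a single disk no matter how many workers join).
            if not grown or restarts >= max_restarts:
                raise
            workers = latest
            restarts += 1
```

The sketch only restarts when the worker count is strictly larger than at the start of the failed attempt, and it re-raises the original error once `max_restarts` is exhausted, which is the role the circuit breaker plays in the proposal.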