Restart P2P on `OSError: [Errno 28] No space left on device` instead of failing if the cluster has grown since we started #8674

hendrikmakait · 2024-06-05T14:24:51Z

The idea here is similar to #8673:

Since P2P fixes the set of involved workers during the initialization of a shuffle run, we don't benefit from workers who join the cluster afterward. This is particularly important because P2P can't succeed if the sum of available disk space across all involved workers is smaller than the size of the (serialized) data.

Even if we don't hit the heuristic suggested in #8673, I think we should restart a P2P operation if the disk buffer on an involved worker encounters a OSError: [Errno 28] No space left on device and the worker count has grown since we started. We should add a circuit-breaker to this similar to the suspicious_count to avoid errors that are genuinely caused by inhomogeneous partitions (or because the cluster refuses to scale to a sufficient size).

The text was updated successfully, but these errors were encountered:

fjetter · 2024-06-05T15:49:09Z

I'm mildly concerned about the suspicious count logic and what that would entail (from a complexity perspective) but otherwise +1

hendrikmakait · 2024-06-06T08:00:56Z

I'm mildly concerned about the suspicious count logic and what that would entail (from a complexity perspective) but otherwise +1

The circuit-breaker logic is the thing I'm least concerned about. I'd keep this pretty simple: Add yet-another dictionary to the scheduler plugin that keeps a count which we increase every time we restart due to an Errno28. Don't restart if we exceed some threshold. Done. (Clear out when we clear out all the other stuff.)

hendrikmakait added enhancement Improve existing functionality or make things work better adaptive All things relating to adaptive scaling shuffle labels Jun 5, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Restart P2P on `OSError: [Errno 28] No space left on device` instead of failing if the cluster has grown since we started #8674

Restart P2P on `OSError: [Errno 28] No space left on device` instead of failing if the cluster has grown since we started #8674

hendrikmakait commented Jun 5, 2024 •

edited

Loading

fjetter commented Jun 5, 2024 •

edited

Loading

hendrikmakait commented Jun 6, 2024

Restart P2P on OSError: [Errno 28] No space left on device instead of failing if the cluster has grown since we started #8674

Restart P2P on OSError: [Errno 28] No space left on device instead of failing if the cluster has grown since we started #8674

Comments

hendrikmakait commented Jun 5, 2024 • edited Loading

fjetter commented Jun 5, 2024 • edited Loading

hendrikmakait commented Jun 6, 2024

Restart P2P on `OSError: [Errno 28] No space left on device` instead of failing if the cluster has grown since we started #8674

Restart P2P on `OSError: [Errno 28] No space left on device` instead of failing if the cluster has grown since we started #8674

hendrikmakait commented Jun 5, 2024 •

edited

Loading

fjetter commented Jun 5, 2024 •

edited

Loading