Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Restart P2P on OSError: [Errno 28] No space left on device instead of failing if the cluster has grown since we started #8674

Open
hendrikmakait opened this issue Jun 5, 2024 · 2 comments
Labels
adaptive All things relating to adaptive scaling enhancement Improve existing functionality or make things work better shuffle

Comments

@hendrikmakait
Copy link
Member

hendrikmakait commented Jun 5, 2024

The idea here is similar to #8673:

Since P2P fixes the set of involved workers during the initialization of a shuffle run, we don't benefit from workers who join the cluster afterward. This is particularly important because P2P can't succeed if the sum of available disk space across all involved workers is smaller than the size of the (serialized) data.

Even if we don't hit the heuristic suggested in #8673, I think we should restart a P2P operation if the disk buffer on an involved worker encounters a OSError: [Errno 28] No space left on device and the worker count has grown since we started. We should add a circuit-breaker to this similar to the suspicious_count to avoid errors that are genuinely caused by inhomogeneous partitions (or because the cluster refuses to scale to a sufficient size).

@hendrikmakait hendrikmakait added enhancement Improve existing functionality or make things work better adaptive All things relating to adaptive scaling shuffle labels Jun 5, 2024
@fjetter
Copy link
Member

fjetter commented Jun 5, 2024

I'm mildly concerned about the suspicious count logic and what that would entail (from a complexity perspective) but otherwise +1

@hendrikmakait
Copy link
Member Author

I'm mildly concerned about the suspicious count logic and what that would entail (from a complexity perspective) but otherwise +1

The circuit-breaker logic is the thing I'm least concerned about. I'd keep this pretty simple: Add yet-another dictionary to the scheduler plugin that keeps a count which we increase every time we restart due to an Errno28. Don't restart if we exceed some threshold. Done. (Clear out when we clear out all the other stuff.)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
adaptive All things relating to adaptive scaling enhancement Improve existing functionality or make things work better shuffle
Projects
None yet
Development

No branches or pull requests

2 participants