[SPARK-51966][PYTHON] Replace select.select() with select.poll() when running on POSIX os #53306
base: master
…osix On glibc based Linux systems select() can monitor only file descriptor numbers that are less than FD_SETSIZE (1024). This is an unreasonably low limit for many modern applications.
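For readers following along, a minimal sketch of the kind of swap described here (not the code from this PR; `wait_readable` is a hypothetical helper name). It prefers `select.poll()` where the platform offers it and falls back to `select.select()` otherwise. Note that `poll()` takes its timeout in milliseconds while `select()` takes seconds.

```python
import select
import socket


def wait_readable(sock, timeout_s=1.0):
    """Return True if `sock` becomes readable within `timeout_s` seconds.

    Hypothetical helper for illustration only.
    """
    if hasattr(select, "poll"):
        # poll() is not bound by FD_SETSIZE, so high fd numbers are fine.
        poller = select.poll()
        poller.register(sock, select.POLLIN)
        # poll() expects milliseconds, select() expects seconds.
        events = poller.poll(int(timeout_s * 1000))
        return any(event & select.POLLIN for _, event in events)
    # Platforms without poll() (e.g. Windows) keep using select().
    readable, _, _ = select.select([sock], [], [], timeout_s)
    return sock in readable


# Small demonstration on a connected socket pair.
a, b = socket.socketpair()
try:
    before = wait_readable(a, 0.05)  # nothing has been sent yet
    b.sendall(b"x")
    after = wait_readable(a, 1.0)    # one byte is waiting to be read
finally:
    a.close()
    b.close()
```

This sketch deliberately ignores error events such as `POLLERR` and `POLLNVAL`; a real loop would need to handle them.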
|
This is identical to #50774 (which was never reviewed and closed by the bot), but rebased against current |
|
@HyukjinKwon is that something you could maybe review (in combination with py4j/py4j#560)? We needed to implement this change to allow us to run 1000+ executors without running into |
|
Can we have an environment variable to fallback? |
|
@HyukjinKwon you can now use |
|
@HyukjinKwon @gaogaotiantian - I see you now merged #53388 Can you merge a similar change to |
|
See traceback: |
|
Could you make sure the CI passes? I think it’s a linter issue. Also it has some conflicts now. |
That would be my fault. I did not realize this PR exists when I fix the |
|
Just updated the branch - this should fix conflict and overall is similar to your changes in Removed the fallback - if we have one it should be for both As per your PR I updated |
|
Does this LGTM, @gaogaotiantian ? |
python/pyspark/accumulators.py
Outdated
| if self.rfile in r and func(): | ||
| if poller is not None: | ||
| # Unlike select, poll timeout is in millis. Rule out error events. | ||
| r = [fd for fd, event in poller.poll(1000) if event & select.POLLIN] |
Okay I took some time on this and I think my implementation lacks error checking as well which I will fix later. However, we need to deal with errors here. If the socket has some issues, we will busy loop here forever because poller.poll() does not raise an exception.
We should break on POLLHUP and raise an error on POLLERR and POLLNVAL. We also need to confirm that there's no other event set.
Ok - you are right - so we want to handle these:
- POLLIN and POLLHUP we want to break - when using select() these both return "ready for reading" - if peer hangs up reads will return 0 bytes
- POLLERR and POLLNVAL we want to raise - select() will already raise on error
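A rough sketch of the handling listed above, assuming a POSIX platform where `select.poll()` exists. `classify_events` is a hypothetical name used only for illustration and is not taken from the PR's diff. Error events are checked first, so a descriptor reporting both `POLLIN` and `POLLERR` raises rather than being treated as readable.

```python
import select
import socket


def classify_events(events):
    """Classify the (fd, event) pairs returned by poller.poll().

    Returns "ready" on POLLIN or POLLHUP, "idle" if nothing was reported,
    and raises OSError on POLLERR or POLLNVAL. Hypothetical helper.
    """
    for fd, event in events:
        if event & (select.POLLERR | select.POLLNVAL):
            # select() would have raised here, so do the same.
            raise OSError(f"poll() reported error event {event:#x} on fd {fd}")
        if event & (select.POLLIN | select.POLLHUP):
            # Data is available, or the peer hung up and read() returns 0 bytes.
            return "ready"
    return "idle"


# Demonstration: the peer hanging up is reported as "ready", not as a timeout.
a, b = socket.socketpair()
poller = select.poll()
poller.register(a, select.POLLIN)
b.close()
state = classify_events(poller.poll(1000))
a.close()
```

A caller would break out of its wait loop on "ready", keep looping on "idle", and let the `OSError` propagate instead of busy looping on a broken socket.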
|
@gaogaotiantian let me know what you think re. latest commit - it will check for errors in both I also removed try/catch around: in Considering |
What changes were proposed in this pull request?
On glibc based Linux systems select() can monitor only file descriptor numbers that are less than FD_SETSIZE (1024). This is an unreasonably low limit for many modern applications.
This PR replaces select.select() with select.poll() when running on POSIX os.
Why are the changes needed?
When running via pyspark we frequently observe:
On POSIX systems poll() should be used instead of select().
Does this PR introduce any user-facing change?
No
How was this patch tested?
Existing unit tests + we have been running this change (combined with py4j/py4j#560) on our YARN cluster (Linux) since April 2025.
Was this patch authored or co-authored using generative AI tooling?
No