-
Notifications
You must be signed in to change notification settings - Fork 6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[core] Check if python is exiting on a core worker on non-main Python thread #49547
Conversation
Signed-off-by: dayshah <[email protected]>
Signed-off-by: dayshah <[email protected]>
Signed-off-by: dayshah <[email protected]>
Signed-off-by: dayshah <[email protected]>
This should at least fix the issue for the compiled graphs test segfault, but after trying multiple times now even on master, I can't seem to repro the segfault again, if you can still repro can you try with this @kevin85421 ? |
As I mentioned in #47864 (comment), it has become very hard to reproduce on my side after rebasing with the master branch. If you want to reproduce it, you can try to reproduce with Ray 2.37.0. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall looks good
@@ -2362,6 +2362,10 @@ cdef CRayStatus check_signals() nogil: | |||
# The Python exceptions are not handled if it is raised from cdef, | |||
# so we have to handle it here. | |||
try: | |||
if sys.is_finalizing(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I remember that the segfault occurs in with gil
because it dereferences something that has already been released. Is it possible to avoid acquiring the GIL in that situation? I guess it is impossible to determine the state of the Python interpreter without the GIL. If it is impossible, I am fine with the current solution.
It would be helpful if you could test it with a reproduction both with and without this PR so that we can know whether this solution helps or not because it's hard to write tests for this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ya there's no way to get python interpreter state without the gil, so the other option becomes reworking the signal checking in C++ as i attempted here #49319, but it'll require some more work to handle some edge-case tests where we rely on knowing the Python SystemExit state to exit out of the core worker.
Ideally this should fix, because sys.is_finalizing
works on non-main thread and should allow us to exit out before python starts freeing the resources needed for with gil
. But ya will try more to repro with vs. without this change, and update.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similar logic used here
if sys.is_finalizing(): |
from #47702. Only checking for system exit on ChannelTimeout though instead of each time through.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If it's hard to reproduce, I'm fine with running a RayCG program hundreds of times to check if no segfault related to acquiring the GIL occurs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LG. I'm OK to merge then follow up with more investigations on repro etc.
Signed-off-by: dayshah <[email protected]>
Ran it 50-100 times with no segfault, when trying to get segfault on commits from ~2 months ago, I found that the older commits segfault after I remove the explicit teardowns for different reasons than the one we're trying to fix here as we had some other issues back then as well. |
… thread (#49547) Signed-off-by: dayshah <[email protected]>
… thread (ray-project#49547) Signed-off-by: dayshah <[email protected]> Signed-off-by: Puyuan Yao <[email protected]>
…n Python thread (ray-project#49547)" This reverts commit 41d15db.
… non-main Python thread (ray-project#49547)"" This reverts commit 97a0127.
…n Python thread (ray-project#49547)" This reverts commit 41d15db. Signed-off-by: Edward Oakes <[email protected]>
… non-main Python thread (ray-project#49547)"" This reverts commit 97a0127. Signed-off-by: Edward Oakes <[email protected]>
Why are these changes needed?
Right now the Monitor thread used for compiled graphs calls ray.get and the get call can stay alive up to the point of the python interpreter starting to shutdown. This can cause a segfault when acquiring the GIL in check_signals as mentioned here #47864 (comment). Also based on the documentation of https://docs.python.org/3/c-api/exceptions.html#c.PyErr_CheckSignals which is currently used in check_signals, it doesn't actually work on threads that aren't the main python thread.
sys.is_finalizing tells us if the Python interpreter is shutting down and freeing resources.
sys.is_finalizing works on any thread, so even if it's called from the non-main Python thread and if the Python interpreter is shutting down it will work and allow us to exit out before python starts destroying things we may access through PyErr_CheckSignals() calls.
Related issue number
Closes #48806
Checks
git commit -s
) in this PR.scripts/format.sh
to lint the changes in this PR.method in Tune, I've added it in
doc/source/tune/api/
under thecorresponding
.rst
file.