Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[core] Check if python is exiting on a core worker on non-main Python thread #49547

Merged
merged 6 commits into from
Jan 10, 2025

Conversation

dayshah
Copy link
Contributor

@dayshah dayshah commented Jan 2, 2025

Why are these changes needed?

Right now the Monitor thread used for compiled graphs calls ray.get and the get call can stay alive up to the point of the python interpreter starting to shutdown. This can cause a segfault when acquiring the GIL in check_signals as mentioned here #47864 (comment). Also based on the documentation of https://docs.python.org/3/c-api/exceptions.html#c.PyErr_CheckSignals which is currently used in check_signals, it doesn't actually work on threads that aren't the main python thread.

sys.is_finalizing tells us if the Python interpreter is shutting down and freeing resources.
sys.is_finalizing works on any thread, so even if it's called from the non-main Python thread and if the Python interpreter is shutting down it will work and allow us to exit out before python starts destroying things we may access through PyErr_CheckSignals() calls.

Related issue number

Closes #48806

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@dayshah dayshah changed the title [core] Use Py_IsFinalizing to check for python exiting on core_worker on non-main thread [core] Check if python is exiting on a core worker on non-main Python thread Jan 2, 2025
Signed-off-by: dayshah <[email protected]>
@dayshah dayshah added the go add ONLY when ready to merge, run all tests label Jan 2, 2025
Signed-off-by: dayshah <[email protected]>
@dayshah
Copy link
Contributor Author

dayshah commented Jan 3, 2025

This should at least fix the issue for the compiled graphs test segfault, but after trying multiple times now even on master, I can't seem to repro the segfault again, if you can still repro can you try with this @kevin85421 ?

@dayshah dayshah requested a review from kevin85421 January 3, 2025 20:26
@kevin85421
Copy link
Member

As I mentioned in #47864 (comment), it has become very hard to reproduce on my side after rebasing with the master branch. If you want to reproduce it, you can try to reproduce with Ray 2.37.0.

Copy link
Member

@kevin85421 kevin85421 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall looks good

@@ -2362,6 +2362,10 @@ cdef CRayStatus check_signals() nogil:
# The Python exceptions are not handled if it is raised from cdef,
# so we have to handle it here.
try:
if sys.is_finalizing():
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I remember that the segfault occurs in with gil because it dereferences something that has already been released. Is it possible to avoid acquiring the GIL in that situation? I guess it is impossible to determine the state of the Python interpreter without the GIL. If it is impossible, I am fine with the current solution.

It would be helpful if you could test it with a reproduction both with and without this PR so that we can know whether this solution helps or not because it's hard to write tests for this PR.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ya there's no way to get python interpreter state without the gil, so the other option becomes reworking the signal checking in C++ as i attempted here #49319, but it'll require some more work to handle some edge-case tests where we rely on knowing the Python SystemExit state to exit out of the core worker.

Ideally this should fix, because sys.is_finalizing works on non-main thread and should allow us to exit out before python starts freeing the resources needed for with gil. But ya will try more to repro with vs. without this change, and update.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar logic used here

if sys.is_finalizing():

from #47702. Only checking for system exit on ChannelTimeout though instead of each time through.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it's hard to reproduce, I'm fine with running a RayCG program hundreds of times to check if no segfault related to acquiring the GIL occurs.

Copy link
Contributor

@ruisearch42 ruisearch42 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LG. I'm OK to merge then follow up with more investigations on repro etc.

@dayshah
Copy link
Contributor Author

dayshah commented Jan 10, 2025

Ran it 50-100 times with no segfault, when trying to get segfault on commits from ~2 months ago, I found that the older commits segfault after I remove the explicit teardowns for different reasons than the one we're trying to fix here as we had some other issues back then as well.

@jjyao jjyao merged commit 41d15db into ray-project:master Jan 10, 2025
5 checks passed
@dayshah dayshah deleted the python-finalizing-exit branch January 10, 2025 21:56
srinathk10 pushed a commit that referenced this pull request Feb 2, 2025
anyadontfly pushed a commit to anyadontfly/ray that referenced this pull request Feb 13, 2025
edoakes added a commit to edoakes/ray that referenced this pull request Feb 20, 2025
edoakes added a commit to edoakes/ray that referenced this pull request Feb 20, 2025
edoakes added a commit to edoakes/ray that referenced this pull request Feb 22, 2025
…n Python thread (ray-project#49547)"

This reverts commit 41d15db.

Signed-off-by: Edward Oakes <[email protected]>
edoakes added a commit to edoakes/ray that referenced this pull request Feb 22, 2025
… non-main Python thread (ray-project#49547)""

This reverts commit 97a0127.

Signed-off-by: Edward Oakes <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
go add ONLY when ready to merge, run all tests
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[core][compiled graphs] test_dag_exception_chained does not exit cleanly
4 participants