Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GILPool is a broken abstraction in presence of gevent #3668

Closed
davidhewitt opened this issue Dec 19, 2023 · 10 comments
Closed

GILPool is a broken abstraction in presence of gevent #3668

davidhewitt opened this issue Dec 19, 2023 · 10 comments

Comments

@davidhewitt
Copy link
Member

From investigation in pydantic/pydantic#7211

I am able to produce a small minimal case which shows that in the presence of gevent task switching, the GILPool is not sound.

The fundamentals of the interaction can be summarised as follows:

  1. gevent task A calls into a PyO3 function, creating a GILPool we will call pool A.
  2. The PyO3 function does some work, populating pool A, before running a pure-python callback which allows gevent to switch task
  3. gevent task B calls into the PyO3 function, creating another new pool B. Due to the current GILPool scoping design, this pool is necessarily freed when pool A is freed.
  4. The PyO3 function again does work, filling GilPool B, and then eventually also switches task.
  5. Task A resumes and finishes its PyO3 function call, freeing pool A (which also frees the contents of pool B).
  6. Switch back to task B, which continues PyO3 function execution, reading from the freed pool B and thus leading to UB.

Some quick thoughts:

  • This is pretty painful; I need to have a good think whether there's any mitigation which we can install in the GILPool implementation to avoid this.
  • The good news is that one concrete mitigation is to continue the new Py2 API work and rip out the GILPool completely.

It's unclear to me whether the root cause is that GILPool relies on an invalid assumption regarding nature of Rust scopes in the presence of FFI calls, or that gevent is not sound by assuming that it's always safe to switch task. Given the long existence of gevent, probably this lesson is for us.

@adamreichold
Copy link
Member

This is pretty painful; I need to have a good think whether there's any mitigation which we can install in the GILPool implementation to avoid this.

As a short-term mitigation, I think we could switch the pool to design to not store the objects in TLS, but directly in the pool itself and only put the currently active pool into TLS. This should avoid pool A freeing objects allocated for pool B in the above.

@davidhewitt
Copy link
Member Author

Possibly, but I think with just that change the gevent task switching will still confuse the active pool in the TLS, and instead when task A resumes it will start writing new objects into task B's pool. I think that must still necessarily be unsound if task B completes before task A.

@davidhewitt
Copy link
Member Author

One option is to just panic if gevent is present in sys.modules, but that's pretty disruptive.

@adamreichold
Copy link
Member

Ah, of course, gevent switching tasks does not switch TLS which is basically again what this is about...

One option is to just panic if gevent is present in sys.modules, but that's pretty disruptive.

Is there a hook or something which would allow us to fix up ourselves when gevent decides to switch tasks?

@davidhewitt
Copy link
Member Author

https://www.gevent.org/api/gevent.local.html - might be able to build something off gevent.local?

Alternatively, maybe it's better to go lower-level and interact with greenlet: https://greenlet.readthedocs.io/en/latest/c_api.html#c.PyGreenlet_GetCurrent

Though the API for greenlet is pretty minimal, it's not clear there's enough for us to build on. If we do find a greenlet-only solution, that's probably better than a gevent-only fix?

@davidhewitt
Copy link
Member Author

@jamadden sorry for the direct ping; I wonder if you have any advice on making native TLS interact safely with gevent? At the moment we use Rust TLS but if switching to e.g. Python's TLS API helps, I think making this sound is the highest priority. Thanks in advance 🙏

@jamadden
Copy link

(The tl;dr is: maybe contextvars will work for you.)

Ah, of course, gevent switching tasks does not switch TLS

Right, because its greenlets that are switching, not threads.

Is there a hook or something which would allow us to fix up ourselves when gevent decides to switch tasks?

Not really. The underlying greenlet library can call a function when greenlets are being switched, but that's intended for user-level profiling or debugging; it's not very composable.

I wonder if you have any advice on making native TLS interact safely with gevent? At the moment we use Rust TLS but if switching to e.g. Python's TLS API helps...

Also unfortunately, I don't have much to offer in terms of true native TLS. Greenlets aren't threads, and thus don't do anything with thread-local storage, whether provided by the compiler (.e.g, thread_local storage class), a platform API (e.g., pthread_getspecific) or Python's implementation (PyThread_tss_*, PyThread_get_key_value --- which is implemented using one of the previous options, IIRC). In fact, the greenlet implementation itself relies on thread-local values being thread-local and not greenlet-local.

It sounds like you need greenlet-local values. It is possible to do that with gevent.local, but (a) that's gevent-specific; and (b) there's only a Python API, nothing at the C level (so its relatively slow to access from C, and because it winds up executing arbitrary Python code, its theoretically possible for a greenlet switch to occur in the middle of accessing a gevent.local value. Yikes!)

Perhaps you can use contextvars for what you need? These were originally added to CPython to solve similar problems that async code encountered --- essentially, once you have async functions, you need coroutine-local values, as thread-local values no longer cut it; people ran into almost exactly the situation you describe here with things like the state of Decimal operations as expressed through a Context: A switch from one async function to another could lead to operating in an unexpected Context. The new accepted wisdom seems to be to use contextvars rather than threading.local to avoid this. Because of this, contextvars are natively supported by greenlet to be greenlet-local.

The nice thing about contextvars is that it solves both of the issues from above: (a) it's not gevent or greenlet specific, it will work with or without greenlets; and (b) there is a supported CPython C API. And of course, if you could get into a similar situation with async functions, it would take care of the issue there too.

@adamreichold
Copy link
Member

@davidhewitt

So we would call PyContextVar_New during initialization to setup an e.g. OWNED_OBJECTS variable and stash that pointer in a global variable. Then register_owned will use PyContextVar_Get to retrieve a pyclass storing a PyObjVec and GILPool will use PyContextVar_Set/Get to initialize/drain that pyclass. Sounds doable, doesn't it?

@davidhewitt
Copy link
Member Author

Yes, I think so. It looks like this is not available in abi3 though, so I guess for that we'll have to go via Python calls and just hope that it's good enough.

Either way I guess it will be a notable performance degradation, so probably only useful as a mitigation while we get off the pool.

I can try to implement this tomorrow 👍

Thanks @jamadden!

@davidhewitt
Copy link
Member Author

With the release of the Bound API in 0.21, which allows avoiding the GIL Pool, I think we can call this issue resolved. The 0.22 and 0.23 releases will completely phase out the old API and allow us to completely remove the pool.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants