3.9.4 has a threading issue #415
Comments
+1 |
Thanks for the report. If you have time to produce a test case, that would help. It could modify or fork integration/wsgi.py or integration/thread. What environment is this? |
Hi @ijl, I'm on the same team as @adlaube. First of all, thank you for orjson! We experienced a great speedup from it, and it is nowadays the main JSON implementation that we use within our applications. Before the downgrade to orjson 3.9.2, beside the already mentioned application freezes, we also saw sporadic segfaults which seem to come from orjson; but since other C and Rust extensions are used in our application as well, we can't confirm that 100% yet. In the case of application freezes, we can prove that the application froze because the thread calling orjson never released the GIL again, as it was stuck somewhere inside orjson. |
Can confirm, experienced hanging in python 3.11 and segfaults in python 3.9. Downgrading to 3.9.2 as others mentioned resolved the issue. |
3.9.5 removes the futex. I think going further requires a test exercising what you expect to work. |
Hi @ijl, unfortunately I have not been able to reproduce this issue locally yet. Interestingly, the issue did not occur in our test landscape, which indicates that consistently high load is required. For future releases it might be helpful to include a minimal set of debug symbols and settings. Let me know if there is any further useful information I could retrieve from the dumps for you. [0]
Case 0: hanging thread which holds the GIL
Case 1: segmentation fault |
Sadly for us, the bug is still present. I wrote a minimal test but it didn't trigger the issue... will give it another shot. |
I can reproduce this crash in the Zulip development environment by reloading the page a few dozen times. I'm seeing similar symptoms: various segfaults and 100% CPU freezes. The environment is Python 3.8.10 on Ubuntu 20.04.6 LTS x86-64. The first bad commit seems to be e9b745e; moreover, I can reproduce the crash by cherry-picking only the buffer refactoring changes from that commit. Obviously Zulip is a pretty complicated application and far from an ideal minimal test case, but at least it's open source and you can run it yourself; alternatively, I'm happy to try out any debugging suggestions you might have. |
orjson 3.9.3 introduced a crash (ijl/orjson#415) -- revert to the last version before the bug.
orjson 3.9.3 introduced a crash (ijl/orjson#415) -- revert to the last version before the bug. (cherry picked from commit 2612a3b)
We ended up downgrading to 3.9.2 in Home Assistant since we kept getting reports of crashes. Sadly, I haven't been able to replicate it myself. |
Running debug builds with warnings enabled, I've noticed that the crash is always preceded by a bunch of warnings like this:
These warnings are not caused by orjson, of course; I see these same warnings on other random lines of code during normal execution. But they do suggest that the crash may be triggered when the Python garbage collector happens to run from inside orjson's code. This makes me suspect that orjson might be playing fast and loose with reference counting. Just from looking around at the code (lines 294 to 298 in d1cd27e): the entire concept of […]. I haven't pinned this crash on any particular reference counting bug, but this genre of bug is exactly the kind of thing that's likely to produce rare crashes depending on when Python happens to invoke the garbage collector. |
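The hazard described above can be shown with a minimal stdlib-only sketch (the `Cycle` class and `events` list are hypothetical illustrations, not orjson code): once an object with a destructor sits in a reference cycle, the cyclic garbage collector may reclaim it during any allocation, running arbitrary Python code at that point.

```python
import gc

events = []

class Cycle:
    def __del__(self):
        # When the cyclic GC reclaims this object, arbitrary Python code
        # runs here -- in the middle of whatever allocation triggered GC.
        events.append("destructor ran")

c = Cycle()
c.self = c   # reference cycle: only the cyclic GC can reclaim it
del c        # refcount never reaches zero; the object lingers until GC

gc.collect()  # in real code, any allocation can trigger this implicitly
print(events)  # ['destructor ran']
```

This is why holding borrowed references across allocations is risky: the collector can interleave unrelated Python code at almost any point.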
Those are CPython implementation details. |
3.9.6 was tagged and could be run against your test suite, possibly with […] |
Yeah, 3.9.6 still crashes. Here's an interesting panic: [backtrace truncated; remaining 100 frames omitted] |
Here's a question. What happens if orjson.loads is reentered, e.g. from a destructor, while it is already running?
```python
import orjson

class C:
    def __del__(self):
        orjson.loads('"' + "a" * 10000 + '"')

c = C()
c.c = c
del c

orjson.loads("[" + "[]," * 1000 + "[]]")
```

```console
$ python reentrant.py
Illegal instruction (core dumped)
```
|
orjson.loads may allocate a Python object that triggers a garbage collection that invokes a destructor that calls orjson.loads again. Or the destructor may release the GIL so a different thread can call orjson.loads. To remain safe under such reentrancy, we need to avoid reinitializing the yyjson pool while it might still be in use. The simplest fix is to initialize the yyjson pool only once, like we did before commit e9b745e. Fixes ijl#415. Signed-off-by: Anders Kaseorg <[email protected]>
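The one-time-initialization idea in the commit message can be sketched in Python terms. The actual fix lives in orjson's Rust code; `get_pool`, `_pool_lock`, and the bytearray "pool" below are hypothetical stand-ins meant only to illustrate why initializing once is safe where reinitializing on every call is not.

```python
import threading

_pool = None                   # hypothetical shared scratch buffer
_pool_lock = threading.Lock()
_POOL_SIZE = 64 * 1024         # assumed size, for illustration only

def get_pool():
    """Return the shared pool, initializing it at most once.

    Reinitializing on every call can invalidate a buffer that a
    reentrant or concurrent caller is still using; one-time
    initialization sidesteps that entirely.
    """
    global _pool
    if _pool is None:                # fast path, no lock taken
        with _pool_lock:
            if _pool is None:        # double-checked under the lock
                _pool = bytearray(_POOL_SIZE)
    return _pool

assert get_pool() is get_pool()      # every caller sees the same pool
```

Because the pool object is never replaced after creation, a destructor or another thread that reenters mid-call still sees a valid buffer.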
Bump orjson again now that ijl/orjson#415 is fixed (changelog: ijl/orjson@3.9.2...3.9.7)
This is fixed in 3.9.7. Thank you everyone for the logs, investigation, and fix. |
Thank you ijl and andersk and everyone else! :-) |
After upgrading from 3.9.2 to 3.9.4, my application started freezing. By downgrading, I can confirm the same issue happens with 3.9.3 but not with 3.9.2!
My application uses threads to receive messages from multiple topics. When I receive two messages almost simultaneously (less than 8 ms apart), the application freezes on an orjson.loads call. Freeze means: the thread stuck in orjson.loads consumes 100% of a CPU, and all other threads stop. Nothing happens at all; even prometheus_client stops providing metrics, i.e., its HTTP interface times out.
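The load pattern described above can be approximated with a small script. This is a sketch, not the reporter's actual code: it uses the stdlib `json` module as a stand-in (swap in `orjson.loads` to probe the affected versions), and the payload shape, iteration count, and thread count are arbitrary assumptions.

```python
import json
import threading

# Stand-in message; in the real application these arrive from multiple topics.
payload = json.dumps({"topic": "t", "values": list(range(1000))})

def consume(iterations=2000):
    # Two consumers deserializing near-simultaneously, mimicking
    # messages that arrive less than 8 ms apart.
    for _ in range(iterations):
        json.loads(payload)

threads = [threading.Thread(target=consume) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print("completed without hanging")
```

On an affected orjson version, a script along these lines would be expected to hang or crash rather than print the final line.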
For troubleshooting, I used strace which shows only repeating lines:
full strace: strace.txt