-
Notifications
You must be signed in to change notification settings - Fork 901
Description
Bug Description
Back when we've got a support for the GC protocol, while it is not usual for __traverse__ to do anything more than calling visit.call, it was assumed that running any Python code is not harmful and currently __traverse__ trampoline is no more than an ordinary trampoline with a slightly different signature.
It turns out that not only this assumption is wrong but also any currently generated tp_traverse implementation is prone to crash. CPython has reused PyGC_Head fields to store intermediate refcounts for a long time, so tp_traverse should be never able to read or write any PyGC_Head fields.
In particular the __traverse__ trampoline would first allocate a new GILPool and apply deferred refcount updates at once, possibly deallocating some objects in the pool and in turn somehow touching temporarily corrupted PyGC_Head (e.g. PyObject_GC_Untrack or PyObject_GC_Del). For this reason I believe this is the root cause of #1623 and #3064; both suggests that PyGC_Head is corrupted and threading seems to be relevant here (it is much easier to defer refcount updates in non-interpreter threads).
Reproduction
Too lengthy to open by default
I was able to reliably trigger SIGSEGV on Python 3.8 through 3.11 in x86-64 Linux. Given the following src/lib.rs and typical Cargo.toml (which has been omitted):
use pyo3::{prelude::*, PyTraverseError, PyVisit};
use std::{cell::RefCell, thread};
#[pyclass]
struct Ref {
inner: RefCell<Option<Py<Self>>>,
clear_on_traverse: bool,
}
impl Ref {
fn new(clear_on_traverse: bool) -> Self {
Self {
inner: RefCell::new(None),
clear_on_traverse,
}
}
}
#[pymethods]
impl Ref {
fn __traverse__(&self, visit: PyVisit<'_>) -> Result<(), PyTraverseError> {
if self.clear_on_traverse {
self.inner.take();
} else {
if let Some(inner) = &*self.inner.borrow() {
visit.call(inner)?;
}
}
Ok(())
}
fn __clear__(&mut self) {
self.inner.take();
}
}
#[pyfunction]
fn crash1() {
thread::spawn(move || {
_ = Python::with_gil(|py| -> PyResult<_> {
let a = Py::new(py, Ref::new(false))?;
a.borrow_mut(py).inner.replace(Some(a.clone_ref(py)));
let b = Py::new(py, Ref::new(false)))?;
Ok((a, b))
});
eprintln!("thread finished");
});
Ok(())
}
#[pyfunction]
fn crash2() {
thread::spawn(move || {
_ = Python::with_gil(|py| -> PyResult<_> {
let a1 = Py::new(py, Ref::new(true))?;
let a2 = Py::new(py, Ref::new(false))?;
a1.borrow_mut(py).inner.replace(Some(a2.clone_ref(py)));
a2.borrow_mut(py).inner.replace(Some(a1.clone_ref(py)));
let b = Py::new(py, Ref::new(false)))?;
Ok((a1, b))
});
eprintln!("thread finished");
});
}
#[pymodule]
#[pyo3(name = "pyo3_crash_test")]
fn module_entry(_py: Python<'_>, m: &PyModule) -> PyResult<()> {
m.add_function(wrap_pyfunction!(crash1, m)?)?;
m.add_function(wrap_pyfunction!(crash2, m)?)?;
Ok(())
}Running the following Python code will crash inside gc.collect():
import sys
import gc
import time
import pyo3_crash_test
pyo3_crash_test.crash1() # or crash2
time.sleep(0.1) # not strictly required, but ensures that the thread finishes before GC
print('collecting...', file=sys.stderr)
gc.collect()
assert False, 'crash expected during GC'Here Ref is holding an optional reference to other Ref, possibly itself. Setting clear_on_traverse will clear the reference during the traversal, simulating an otherwise totally safe operation to do.
In the crash1 the refcounts for a and b will be 2 and 1 respectively at the time of GC. GC will overwrite PyGC_Head._gc_prev for both a and b (effectively converting the list to singly-linked), and then __traverse__ will be called for a, creating a new GILPool and triggering ReferencePool::update_counts. As both values are registered for Py_DECREF the refcount of b will be now 0, and this will eventually call PyObject_GC_Del which assumes that the list is still doubly-linked and crashes.
The situation is similar for crash2: a2 gets dropped with the GIL still acquired, so the refcounts for a1, a2 and b are 2, 1 and 1 respectively. Now assume that GILPool::new somehow recognizes __traverse__ and doesn't call update_counts. Even in this case a1 will now remove a reference to a2 during __traverse__, falsely beliving Py_DECREF is safe to call as the GIL is acquired, so the same thing will happen for a2.
Possible Solution
It seems evident to me that __traverse__ should be handled as if no GIL is (and can be) acquired. This implies that Python::with_gil should probably fail and gil_is_acquired should return false inside __traverse__.
I believe it is also possible to have multiple nested __traverse__ calls. The GC process drains the linked list of objects while updating PyGC_Head and it sets gcstate->collecting to 1 during the process anyway, so __traverse__ cannot be called twice for the same object and nested calls should be rare, but there are at least three possible paths (as of the main branch of CPython):
- Creating new objects may trigger
PyObject_GC_Track, which implicitly callstp_traversein debug builds. gc.get_referrersandgc.get_referrentsusetp_traverseto collect objects, and are not affected bygcstate->collecting. They can be independently nested if__traverse__somehow calls those functions.- While directly calling
gc.collect()checks forgcstate->collectingand returns immediately, the automatic GC scheduling may result in a double GC. Specifically there is an interval between the scheduling (_Py_ScheduleGC) and the actual GC (_Py_RunGC) and some opcode may end up callinggc.collect()inbetween. This is extremely rare though, as this interval can be at most a single opcode, and is otherwise safe.
Probably it is sufficient to add the second "traversal" counter for the nested __traverse__ calls in addition to GIL_COUNT. If the traversal counter is non-zero the GIL cannot be acquired and gil_is_acquired should return false regardless of GIL_COUNT. The GIL_COUNT is still updated as usual, but it is effectively ignored until the traversal counter drops to zero, at which point update_counts should be considered (but only when GIL_COUNT was non-zero). But I'm not very much confident about this solution for multiple reasons, including that there is no actual guarantee that we can call any other Python API from tp_traverse and someday CPython may touch PyGC_Head from unexpected functions---I mean, even gcstate->collecting check is not complete so who knows? If possible, a fully static solution to prevent anything but PyVisit would be much more desirable.
Suggestions
Given a full solution needs much testing at least, I looked for possible workarounds without a local PyO3 update. Unfortunately GILPool::new call is mandatory for the trampoline and the only indirect approach would be temporarily untracking objects via PyObject_GC_UnTrack, but that pretty much contradicts why we have tp_traverse in the first place.
So it seems that PyO3 needs a minimal update to ensure that you remain safe if you don't do weird things inside __traverse__. No promises here, but I hope to file a small PR to disable update_counts from the trampoline soon. In my local testing this fixes crash1 but not crash2.
Metadata
Your operating system and version
x86-64 Linux (at least)
Your Python version (python --version)
3.8.10, 3.9.16, 3.10.11, 3.11.13
Your Rust version (rustc --version)
1.69.0
Your PyO3 version
0.18.3
How did you install python? Did you use a virtualenv?
deadsnakes PPA, the bundled version of venv, pip and the most recent version of maturin (0.15.2).