Skip to content

__traverse__ should not alter any refcounts, at least in CPython #3165

@lifthrasiir

Description

@lifthrasiir

Bug Description

Back when we've got a support for the GC protocol, while it is not usual for __traverse__ to do anything more than calling visit.call, it was assumed that running any Python code is not harmful and currently __traverse__ trampoline is no more than an ordinary trampoline with a slightly different signature.

It turns out that not only this assumption is wrong but also any currently generated tp_traverse implementation is prone to crash. CPython has reused PyGC_Head fields to store intermediate refcounts for a long time, so tp_traverse should be never able to read or write any PyGC_Head fields.

In particular the __traverse__ trampoline would first allocate a new GILPool and apply deferred refcount updates at once, possibly deallocating some objects in the pool and in turn somehow touching temporarily corrupted PyGC_Head (e.g. PyObject_GC_Untrack or PyObject_GC_Del). For this reason I believe this is the root cause of #1623 and #3064; both suggests that PyGC_Head is corrupted and threading seems to be relevant here (it is much easier to defer refcount updates in non-interpreter threads).

Reproduction

Too lengthy to open by default

I was able to reliably trigger SIGSEGV on Python 3.8 through 3.11 in x86-64 Linux. Given the following src/lib.rs and typical Cargo.toml (which has been omitted):

use pyo3::{prelude::*, PyTraverseError, PyVisit};
use std::{cell::RefCell, thread};

#[pyclass]
struct Ref {
    inner: RefCell<Option<Py<Self>>>,
    clear_on_traverse: bool,
}

impl Ref {
    fn new(clear_on_traverse: bool) -> Self {
        Self {
            inner: RefCell::new(None),
            clear_on_traverse,
        }
    }
}

#[pymethods]
impl Ref {
    fn __traverse__(&self, visit: PyVisit<'_>) -> Result<(), PyTraverseError> {
        if self.clear_on_traverse {
            self.inner.take();
        } else {
            if let Some(inner) = &*self.inner.borrow() {
                visit.call(inner)?;
            }
        }
        Ok(())
    }

    fn __clear__(&mut self) {
        self.inner.take();
    }
}

#[pyfunction]
fn crash1() {
    thread::spawn(move || {
        _ = Python::with_gil(|py| -> PyResult<_> {
            let a = Py::new(py, Ref::new(false))?;
            a.borrow_mut(py).inner.replace(Some(a.clone_ref(py)));

            let b = Py::new(py, Ref::new(false)))?;
            Ok((a, b))
        });
        eprintln!("thread finished");
    });

    Ok(())
}

#[pyfunction]
fn crash2() {
    thread::spawn(move || {
        _ = Python::with_gil(|py| -> PyResult<_> {
            let a1 = Py::new(py, Ref::new(true))?;
            let a2 = Py::new(py, Ref::new(false))?;
            a1.borrow_mut(py).inner.replace(Some(a2.clone_ref(py)));
            a2.borrow_mut(py).inner.replace(Some(a1.clone_ref(py)));

            let b = Py::new(py, Ref::new(false)))?;
            Ok((a1, b))
        });
        eprintln!("thread finished");
    });
}

#[pymodule]
#[pyo3(name = "pyo3_crash_test")]
fn module_entry(_py: Python<'_>, m: &PyModule) -> PyResult<()> {
    m.add_function(wrap_pyfunction!(crash1, m)?)?;
    m.add_function(wrap_pyfunction!(crash2, m)?)?;
    Ok(())
}

Running the following Python code will crash inside gc.collect():

import sys
import gc
import time
import pyo3_crash_test

pyo3_crash_test.crash1() # or crash2
time.sleep(0.1) # not strictly required, but ensures that the thread finishes before GC
print('collecting...', file=sys.stderr)
gc.collect()
assert False, 'crash expected during GC'

Here Ref is holding an optional reference to other Ref, possibly itself. Setting clear_on_traverse will clear the reference during the traversal, simulating an otherwise totally safe operation to do.

In the crash1 the refcounts for a and b will be 2 and 1 respectively at the time of GC. GC will overwrite PyGC_Head._gc_prev for both a and b (effectively converting the list to singly-linked), and then __traverse__ will be called for a, creating a new GILPool and triggering ReferencePool::update_counts. As both values are registered for Py_DECREF the refcount of b will be now 0, and this will eventually call PyObject_GC_Del which assumes that the list is still doubly-linked and crashes.

The situation is similar for crash2: a2 gets dropped with the GIL still acquired, so the refcounts for a1, a2 and b are 2, 1 and 1 respectively. Now assume that GILPool::new somehow recognizes __traverse__ and doesn't call update_counts. Even in this case a1 will now remove a reference to a2 during __traverse__, falsely beliving Py_DECREF is safe to call as the GIL is acquired, so the same thing will happen for a2.

Possible Solution

It seems evident to me that __traverse__ should be handled as if no GIL is (and can be) acquired. This implies that Python::with_gil should probably fail and gil_is_acquired should return false inside __traverse__.

I believe it is also possible to have multiple nested __traverse__ calls. The GC process drains the linked list of objects while updating PyGC_Head and it sets gcstate->collecting to 1 during the process anyway, so __traverse__ cannot be called twice for the same object and nested calls should be rare, but there are at least three possible paths (as of the main branch of CPython):

  • Creating new objects may trigger PyObject_GC_Track, which implicitly calls tp_traverse in debug builds.
  • gc.get_referrers and gc.get_referrents use tp_traverse to collect objects, and are not affected by gcstate->collecting. They can be independently nested if __traverse__ somehow calls those functions.
  • While directly calling gc.collect() checks for gcstate->collecting and returns immediately, the automatic GC scheduling may result in a double GC. Specifically there is an interval between the scheduling (_Py_ScheduleGC) and the actual GC (_Py_RunGC) and some opcode may end up calling gc.collect() inbetween. This is extremely rare though, as this interval can be at most a single opcode, and is otherwise safe.

Probably it is sufficient to add the second "traversal" counter for the nested __traverse__ calls in addition to GIL_COUNT. If the traversal counter is non-zero the GIL cannot be acquired and gil_is_acquired should return false regardless of GIL_COUNT. The GIL_COUNT is still updated as usual, but it is effectively ignored until the traversal counter drops to zero, at which point update_counts should be considered (but only when GIL_COUNT was non-zero). But I'm not very much confident about this solution for multiple reasons, including that there is no actual guarantee that we can call any other Python API from tp_traverse and someday CPython may touch PyGC_Head from unexpected functions---I mean, even gcstate->collecting check is not complete so who knows? If possible, a fully static solution to prevent anything but PyVisit would be much more desirable.

Suggestions

Given a full solution needs much testing at least, I looked for possible workarounds without a local PyO3 update. Unfortunately GILPool::new call is mandatory for the trampoline and the only indirect approach would be temporarily untracking objects via PyObject_GC_UnTrack, but that pretty much contradicts why we have tp_traverse in the first place.

So it seems that PyO3 needs a minimal update to ensure that you remain safe if you don't do weird things inside __traverse__. No promises here, but I hope to file a small PR to disable update_counts from the trampoline soon. In my local testing this fixes crash1 but not crash2.

Metadata

Your operating system and version

x86-64 Linux (at least)

Your Python version (python --version)

3.8.10, 3.9.16, 3.10.11, 3.11.13

Your Rust version (rustc --version)

1.69.0

Your PyO3 version

0.18.3

How did you install python? Did you use a virtualenv?

deadsnakes PPA, the bundled version of venv, pip and the most recent version of maturin (0.15.2).

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions