Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

traceback.clear_frames does not clear locals when there have been previous access to f_locals #113939

Closed
albertz opened this issue Jan 11, 2024 · 11 comments · Fixed by #113940
Closed
Assignees
Labels
type-bug An unexpected behavior, bug, or error

Comments

@albertz
Copy link
Contributor

albertz commented Jan 11, 2024

Bug report

Bug description:

import traceback
import gc


class Obj:
    def __init__(self, name: str):
        self.name = name

    def __repr__(self):
        return f"Obj({self.name!r})"

    def __del__(self):
        print("del", self)


def deep(i: int):
    a = Obj(f"a, i={i}")
    if i == 2:
        raise Exception(f"exception at i={i}")
    print(a)


def func():
    for i in range(5):
        gc.collect()
        print("** i:", i)
        try:
            deep(i)
        except Exception as exc:
            print("caught", exc)
            print_tb(exc.__traceback__)
            # traceback.clear_frames(prev_exc.__traceback__)
            clear_tb(exc.__traceback__)
            continue  # continue with next i
        print("deep", i, "done")


def print_tb(tb):
    print("Call stack:")
    while tb:
        frame_i = tb.tb_frame.f_locals.get("i")
        print(f"  {tb.tb_frame.f_code.co_name}: i={frame_i}")
        tb = tb.tb_next


def clear_tb(tb):
    print("Clearing stack:")
    while tb:
        print(tb.tb_frame)
        try:
            tb.tb_frame.clear()
        except RuntimeError:
            print("  cannot clear?")
        else:
            print("  cleared")
            # Using this code triggers that the ref actually goes out of scope, otherwise it does not!
            # print("  now:", tb.tb_frame.f_locals)
        tb = tb.tb_next


if __name__ == '__main__':
    func()
    print("exit")

Running this code gives the following output:

** i: 0
Obj('a, i=0')
del Obj('a, i=0')
deep 0 done
** i: 1
Obj('a, i=1')
del Obj('a, i=1')
deep 1 done
** i: 2
caught exception at i=2
Call stack:
  func: i=2
  deep: i=2
Clearing stack:
<frame at 0x7f9ee1cc72a0, file '/u/zeyer/code/playground/py-oom-out-of-scope.py', line 34, code func>
  cannot clear?
<frame at 0x7f9ee1c168c0, file '/u/zeyer/code/playground/py-oom-out-of-scope.py', line 20, code deep>
  cleared
** i: 3
Obj('a, i=3')
del Obj('a, i=3')
deep 3 done
** i: 4
Obj('a, i=4')
del Obj('a, i=4')
deep 4 done
exit
del Obj('a, i=2')

You see that Obj('a, i=2') only is deleted at exit.

This only happens when the print_tb is used before, which will access f_locals of each frame.

traceback.clear_frames should have cleared the locals. But as you see from the output, it does not.

clear_tb is basically a copy of traceback.clear_frames.

The problem goes away if you access tb.tb_frame.f_locals after it was cleared (i.e. tb.tb_frame.clear() was called).

Looking at the C code, this is what tb_frame.clear() will do:
https://github.com/python/cpython/blob/3.12/Objects/frameobject.c#L933-L946

static int
frame_tp_clear(PyFrameObject *f)
{
    Py_CLEAR(f->f_trace);

    /* locals and stack */
    PyObject **locals = _PyFrame_GetLocalsArray(f->f_frame);
    assert(f->f_frame->stacktop >= 0);
    for (int i = 0; i < f->f_frame->stacktop; i++) {
        Py_CLEAR(locals[i]);
    }
    f->f_frame->stacktop = 0;
    return 0;
}

However, if you accessed tb_frame.f_locals before, it will have created a dictionary in frame->f_locals here: https://github.com/python/cpython/blob/5c238225f60c33cf1931b1a8c9a3310192c716ae/Objects/frameobject.c#L1218C18-L1218C33

That frame->f_locals dict will also have references to all the local vars. And that f_locals dict is not cleared in tb_frame.clear().

However, then when you access tb_frame.f_locals again, it will update the existing frame->f_locals dict, and delete all the local vars in it, because they are not available anymore. Here:
https://github.com/python/cpython/blob/3.12/Objects/frameobject.c#L1256C13-L1256C55

I think it's a bug (or at least very unexpected) that tb_frame.clear() does not clear frame->f_locals.

So my suggestion would be to add Py_CLEAR(f->f_frame->f_locals) in frame_tp_clear.


There is then another related issue: When the except block is left, the exception goes out of scope, so then it should free all the locals (even when frame.clear() was not called). However, this is also not the case.

After inspecting this further: Once frame.f_locals was accessed from the current frame where the exception is handled, this frame.f_locals still has a reference to the exception, and thus to the frames, even though the DELETE_FAST for the exception deleted it from the fast locals. See the comments below for more on this.


Note, for PyTorch and others, when you first do extended exception reporting which accesses f_locals in any way, this here fixes two arising problems. Related:

E.g., this came up for us because we have this extended exception reporting, which accesses f_locals:

        # Extend exception message by module call stack.
        module_names_by_id = {}  # id -> name
        for name, mod in model.named_modules():
            if id(mod) not in module_names_by_id:
                module_names_by_id[id(mod)] = name or "(root)"
        exc_ext = []
        for frame in iter_traceback(exc.__traceback__):
            if frame.f_code.co_nlocals == 0:
                continue
            frame_self = frame.f_locals.get("self")
            if isinstance(frame_self, (torch.nn.Module, rf.Module)):
                func = get_func_from_code_object(frame.f_code, frame=frame)
                if func and func.__name__ and func.__name__.startswith("_") and not func.__name__.startswith("__"):
                    continue
                func_name = (func and func.__qualname__) or type(frame_self).__name__
                exc_ext.append(f"({func_name}) {module_names_by_id.get(id(frame_self), '(unknown)')}")
        if not exc_ext:
            exc_ext.append("(No module call frames.)")
        if len(exc.args) == 1 and isinstance(exc.args[0], str) and not always_direct_print:
            exc.args = ("\n".join([exc.args[0], "", "Module call stack:"] + exc_ext),)
        else:
            print("Module call stack:", file=log.v3)
            for msg in exc_ext:
                print(msg, file=log.v3)

The normal traceback.clear_frames here does not help.


CPython versions tested on:

3.11, 3.12, 3.13

Operating systems tested on:

Linux

Linked PRs

@albertz albertz added the type-bug An unexpected behavior, bug, or error label Jan 11, 2024
albertz added a commit to albertz/cpython-1 that referenced this issue Jan 11, 2024
f_locals might still contain references to the local vars.

Fix python#113939.
@albertz
Copy link
Contributor Author

albertz commented Jan 11, 2024

Note, another workaround is to do:

tb.tb_frame.f_locals.clear()

in print_tb, after f_locals was used. Once f_locals is accessed again, it will get repopulated with the locals, though. So this is easy to mess up.

@albertz
Copy link
Contributor Author

albertz commented Jan 11, 2024

I just realize, there is maybe another related bug:

When exc goes out of scope, it should anyway free the traceback and all the locals in deeper frames, or not? So even when not using traceback.clear_frames, it should delete the Obj('a, i=2') right away and not only at exit.

albertz added a commit to albertz/py_better_exchook that referenced this issue Jan 11, 2024
albertz added a commit to rwth-i6/returnn that referenced this issue Jan 11, 2024
@albertz
Copy link
Contributor Author

albertz commented Jan 12, 2024

Note, my workaround is now to use this function:

def traceback_clear_frames(tb):
    """
    Clear traceback frame locals.

    Just like :func:`traceback.clear_frames`, but has an additional fix
    (https://github.com/python/cpython/issues/113939).

    :param types.TracebackType tb:
    """
    while tb:
        try:
            tb.tb_frame.clear()
        except RuntimeError:
            pass
        else:
            # Using this code triggers that the ref actually goes out of scope, otherwise it does not!
            # https://github.com/python/cpython/issues/113939
            tb.tb_frame.f_locals  # noqa
        tb = tb.tb_next

Instead of traceback.clear_frames.

Whenever frame.f_locals is accessed at any point, it is important to call this, otherwise you will get the object reference leak as described.

@albertz
Copy link
Contributor Author

albertz commented Jan 12, 2024

When exc goes out of scope, it should anyway free the traceback and all the locals in deeper frames, or not? So even when not using traceback.clear_frames, it should delete the Obj('a, i=2') right away and not only at exit.

I investigated this a bit.

Normally, the exc variable will get deleted once you exit the exception except block. The bytecode looks like this:

             34 POP_TOP
             36 POP_BLOCK
             38 POP_EXCEPT
             40 LOAD_CONST               0 (None)
             42 STORE_FAST               0 (exc)
             44 DELETE_FAST              0 (exc)
             46 JUMP_FORWARD            10 (to 58)

POP_EXCEPT will reset sys.exc_info(), and then you see that it first does exc = None for some reason, and then additionally del exc.

So, why does exc stays alive even after the except block?

The reason is, once we access frame.f_locals of the current frame, this locals will include exc. But then no other code will even clean up frame.f_locals (only if we would trigger to rebuild this dict, e.g. by accessing frame.f_locals again, or also by calling locals()). It only goes out of scope once we return from func.

Note, it will go out of scope when it returns from func here, once the next GC is done. So it is not only at exit, even though it appears like this from the output. But still, I would argue that it should go out of scope after except is left, not just after func is left.

I'm not sure how to fix this. Some options:

  • The POP_EXCEPT could free f_locals, or update it (if it is used).
  • When compiling bytecode, instead of just DELETE_FAST for the exc, there could be another op to free (or update) f_locals, or maybe do DELETE_NAME in addition, or so.
  • The DELETE_FAST could update f_locals if it is used.

@albertz
Copy link
Contributor Author

albertz commented Jan 12, 2024

The DELETE_FAST could update f_locals if it is used.

Code for that:

diff --git a/Python/bytecodes.c b/Python/bytecodes.c
index f53ddae8df..51333fb1eb 100644
--- a/Python/bytecodes.c
+++ b/Python/bytecodes.c
@@ -1508,6 +1508,18 @@ dummy_func(
             PyObject *v = GETLOCAL(oparg);
             ERROR_IF(v == NULL, unbound_local_error);
             SETLOCAL(oparg, NULL);
+            PyObject *ns = LOCALS();
+            if(ns) {
+                PyObject *name = GETITEM(_PyFrame_GetCode(frame)->co_localsplusnames, oparg);
+                int err = PyObject_DelItem(ns, name);
+                // Can't use ERROR_IF here.
+                if (err != 0) {
+                    _PyEval_FormatExcCheckArg(tstate, PyExc_NameError,
+                                            NAME_ERROR_MSG,
+                                            name);
+                    GOTO_ERROR(error);
+                }
+            }
         }
 
         inst(MAKE_CELL, (--)) {

This fixes the problem. Then exc is freed after it goes out of the except block.

But I'm not sure if you would want this change?

@albertz
Copy link
Contributor Author

albertz commented Jan 12, 2024

A much simpler change for DELETE_FAST:

diff --git a/Python/bytecodes.c b/Python/bytecodes.c
index f53ddae8df..7d6c0b41e3 100644
--- a/Python/bytecodes.c
+++ b/Python/bytecodes.c
@@ -1508,6 +1508,7 @@ dummy_func(
             PyObject *v = GETLOCAL(oparg);
             ERROR_IF(v == NULL, unbound_local_error);
             SETLOCAL(oparg, NULL);
+            Py_CLEAR(LOCALS());
         }
 
         inst(MAKE_CELL, (--)) {

@gvanrossum
Copy link
Member

Maybe @iritkatriel can help with this?

@iritkatriel iritkatriel self-assigned this Jan 14, 2024
@albertz
Copy link
Contributor Author

albertz commented Jan 15, 2024

Note, in the PR (#113940), I also have added two test cases now, for both of the problems described here (frame.clear() + exception object deleted after except block finishes).

@albertz
Copy link
Contributor Author

albertz commented Jan 16, 2024

As it was mentioned in the PR, PEP-667 and PEP-558 is related here, which would probably fix this. Also related is e.g. issue #74929.

@albertz
Copy link
Contributor Author

albertz commented Jan 31, 2024

Thanks for merging.

Now, one part of this issue is fixed, namely frame.clear().

However, the other part still remains open, where locals are potentially not freed even when they get out of scope, e.g. the exception object including its traceback with all frames. Should we open a new separate issue about that? Or reopen this one here? Or not needed, as we already have PEP-667 and PEP-558? #74929 is also similar/related, but describes a more complex scenario.

For reference, this is a test case for this problem:

class LocalsTest(unittest.TestCase):
    """
    Tests for locals.
    """

    def test_locals_cleared_after_exception_handled(self):
        # see gh-113939
        class C:
            pass
        wr = None
        def inner():
            nonlocal wr
            c = C()
            wr = weakref.ref(c)
            1/0
        try:
            inner()
        except ZeroDivisionError as exc:
            support.gc_collect()
            self.assertIsNotNone(wr())
            print(exc.__traceback__.tb_frame.f_locals)
        support.gc_collect()
        self.assertIsNone(wr())

This test still fails now.

@iritkatriel
Copy link
Member

Maybe add a comment to #74929 with the simpler example?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
type-bug An unexpected behavior, bug, or error
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants