
[Feat] Remove reference cycle in error handling for faster GC#327

Merged
junrushao merged 21 commits into apache:main from KEKE046:fix-ref-cycle
Dec 11, 2025

Conversation

@KEKE046 (Contributor) commented Dec 10, 2025

When training neural networks, it is beneficial to free tensors as early as possible once they are no longer used. In CPython, as long as there is no reference cycle, tensors are freed immediately when their reference count drops to zero.

Torch blog: Finding and Removing Reference Cycles

However, in our TVM FFI error handling path, we create reference cycles that keep intermediate tensors (and the entire call chain) alive much longer than necessary. This slows down local GC, increases memory pressure, and hurts training throughput.
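The effect is easy to observe in plain CPython, independent of TVM. The following minimal sketch (with a hypothetical `Tensor` stand-in class, not a real tensor) shows that an object with no cycle is freed the moment its refcount hits zero, while one caught in a cycle survives until a collector pass:

```python
import gc
import weakref

class Tensor:
    """Stand-in for a large object we want freed promptly."""

freed = []

def no_cycle():
    t = Tensor()
    weakref.finalize(t, freed.append, "no_cycle")

def with_cycle():
    t = Tensor()
    weakref.finalize(t, freed.append, "with_cycle")
    t.self_ref = t  # reference cycle: t references itself

gc.disable()  # make the timing deterministic for the asserts below
no_cycle()
assert freed == ["no_cycle"]                # freed the moment refcount hits zero
with_cycle()
assert freed == ["no_cycle"]                # still alive: the cycle defers freeing
gc.collect()                                # manual collection still works
assert freed == ["no_cycle", "with_cycle"]  # only the cycle collector frees it
gc.enable()
```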

Problem

The following code reproduces the problem:

  1. We allocate a tensor in foo.
  2. We call Map.get with a non-existent key, which internally raises a KeyError that Map.get catches (since a default is supplied).
  3. The KeyError creates a reference cycle, so the tensor and the whole call chain can only be reclaimed by the cyclic garbage collector.

import torch
import tvm_ffi
import gc
from torch.utils.viz._cycles import warn_tensor_cycles

m = tvm_ffi.Map({'a': 1})

def foo():
    a = torch.tensor([1], device='cuda')
    # Map.get with a missing key raises a KeyError internally and catches it
    _tmp = m.get('b', 0)

foo()

remove = warn_tensor_cycles()
gc.collect()
remove()

The following figure demonstrates the call chain and reference cycles:

[figure omitted]

The following figure shows the object reference graph from torch warn_tensor_cycles

[figure omitted]

Approach

As shown in the figure, the three variables frame, py_error, and tb form the core of the reference cycle. To break the cycle, we explicitly delete these variables once the traceback has been attached.

# In TracebackManager.append_traceback:
        frame = self._create_frame(filename, lineno, func)
        return types.TracebackType(tb, frame, frame.f_lasti, lineno)

_TRACEBACK_MANAGER = TracebackManager()


def _with_append_backtrace(py_error: BaseException, backtrace: str) -> BaseException:
    """Append the backtrace to the py_error and return it."""
    tb = py_error.__traceback__
    for filename, lineno, func in _parse_backtrace(backtrace):
        tb = _TRACEBACK_MANAGER.append_traceback(tb, filename, lineno, func)
    return py_error.with_traceback(tb)

We can use try ... finally ... to delete these local references as the function exits:

    try:
        tb = py_error.__traceback__
        for filename, lineno, func in _parse_backtrace(backtrace):
            tb = _TRACEBACK_MANAGER.append_traceback(tb, filename, lineno, func)
        return py_error.with_traceback(tb)
    finally:
        del py_error, tb
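To see why the explicit deletion matters, here is a hedged, self-contained sketch (plain CPython with hypothetical names, not the TVM code): an exception saved past its except block forms a cycle through its traceback and the raising frames, keeping every local in those frames alive, while deleting the reference lets refcounting free everything immediately.

```python
import gc
import weakref

collected = []

class Tensor:
    """Stand-in for a large object living in the raising frame."""

def fail_with_tensor(tag):
    a = Tensor()  # reachable through the traceback's frame after the raise
    weakref.finalize(a, collected.append, tag)
    raise KeyError("missing")

def keep_error(tag):
    try:
        fail_with_tensor(tag)
    except KeyError as exc:
        err = exc  # outlives the except block: err -> tb -> frame -> err cycle

def keep_error_then_del(tag):
    try:
        fail_with_tensor(tag)
    except KeyError as exc:
        err = exc
        del err  # break the cycle: refcounting frees the frames right away

gc.disable()  # rule out an accidental automatic collection mid-demo
keep_error("leaky")
assert collected == []            # tensor trapped in the cycle, still alive
keep_error_then_del("clean")
assert collected == ["clean"]     # tensor freed immediately
gc.collect()
assert sorted(collected) == ["clean", "leaky"]
gc.enable()
```

This mirrors the PR's fix: the `del` removes the last strong reference that closes the loop, so no collector pass is needed.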

@gemini-code-assist (Contributor) commented

Summary of Changes

Hello @KEKE046, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses a performance bottleneck in the TVM FFI error handling mechanism. It resolves an issue where reference cycles were inadvertently created, preventing timely garbage collection of critical objects like tensors and call chain elements. By strategically breaking these cycles, the change significantly reduces memory pressure and enhances the efficiency of neural network training by allowing resources to be freed sooner.

Highlights

  • Reference Cycle Removal: Implemented try...finally blocks in TracebackManager.append_traceback and _with_append_backtrace to explicitly delete frame, py_error, and tb variables.
  • Improved Garbage Collection: This change prevents reference cycles in the TVM FFI error handling path, allowing Python's garbage collector to free intermediate tensors and call chain objects more quickly.
  • Performance Enhancement: By enabling faster garbage collection, the pull request aims to reduce memory pressure and improve training throughput, especially when dealing with neural networks and frequent error handling.

@gemini-code-assist bot left a comment

Code Review

This pull request aims to resolve a reference cycle in error handling to improve garbage collection performance, particularly for tensors during neural network training. The strategy of using try...finally to explicitly delete variables that form the cycle is sound. While reviewing the changes in python/tvm_ffi/error.py, I identified a critical issue in the append_traceback method where the new implementation could mask exceptions. I have provided a detailed comment and a code suggestion to address this problem while maintaining the original intent of breaking the reference cycle.

KEKE046 and others added 2 commits December 10, 2025 17:25
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
@KEKE046 (Contributor, Author) commented Dec 10, 2025

No @gemini-code-assist, that suggestion is wrong: it creates an f_locals reference.

[figure omitted]

But I came up with another hack to prevent the circular reference:

  # Magic hack to prevent a circular reference
  def create(tb, frame, lineno):
      return types.TracebackType(tb, frame, frame.f_lasti, lineno)
  return create(tb, self._create_frame(filename, lineno, func), lineno)
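The reason the nested helper works can be shown with a hedged sketch (hypothetical names, not the TVM code): any name bound in a frame stays reachable for as long as that frame object is alive, and tracebacks keep frames alive. By building the TracebackType inside a tiny helper, the intermediate frame object is bound only in the helper's short-lived frame, never as a local of a frame that the resulting traceback keeps alive.

```python
import sys
import weakref

class Temp:
    """Stand-in for an intermediate object, e.g. the synthesized frame."""

def outer():
    tmp = Temp()  # bound by name in this frame
    # Holding the frame object alive (as a traceback would) keeps `tmp` alive.
    return sys._getframe(), weakref.ref(tmp)

frame, ref = outer()
assert ref() is not None   # the frame's locals still reference `tmp`
del frame
assert ref() is None       # dropping the frame releases the local
```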

@LeiWang1999 commented
LGTM!

@junrushao (Member) commented
This is an incredible finding! CC: @tqchen

@tqchen (Member) commented Dec 11, 2025

Thanks @KEKE046. @junrushao, let us prioritize merging this.

@tqchen (Member) left a comment

LGTM, minor comment on readability: it would be great to add more comments on why this happens and how the change prevents it.

@tqchen (Member) commented Dec 11, 2025

/gemini review

@gemini-code-assist bot left a comment

Code Review

This pull request effectively addresses a subtle but important issue of reference cycles in error handling, which can impact memory usage and performance in machine learning workloads. The changes in python/tvm_ffi/error.py use two clever techniques to break these cycles. The use of a nested function in append_traceback is a good pattern to avoid frame-related cycles, and the try...finally block in _with_append_backtrace aims to clear references to break cycles involving exception objects. While the overall approach is excellent, I've identified a potential issue in the finally block that could mask exceptions, and I've also suggested improving a comment for better long-term maintainability. My feedback includes code suggestions for both points.

@tqchen (Member) commented Dec 11, 2025

@KEKE046 it would be good to go through one pass over the Gemini reviews and add comments for context, thank you so much!

            tb = _TRACEBACK_MANAGER.append_traceback(tb, filename, lineno, func)
        return py_error.with_traceback(tb)
    finally:
        # this is a hack to break reference cycle
@tqchen (Member) commented on the diff, Dec 11, 2025
Suggest "we explicitly break the reference cycle here": this is deliberate and not a workaround, so we do not need to say it is a "hack".

@tqchen (Member) commented Dec 11, 2025

Thanks @KEKE046, I think we are close; please take a look at the latest comments.

@tqchen (Member) commented Dec 11, 2025

Another minor question: is it possible to create a regression test for this? GenAI suggests one way to do it: create a weakref with a callback that is triggered on deallocation, then confirm in your example that the torch.Tensor (or anything else) actually gets recycled after an exception.

import gc
import weakref

# List to track if the object was collected
collected_list = []

def callback(wr):
    """Callback triggered when the weakref object is finalized."""
    collected_list.append(True)

def create_local_cycle_and_collect():
    # 1. Create an object (a list is simple)
    obj = [1, 2, 3]

    # 2. Create a weak reference to the object
    # This reference will trigger 'callback' when 'obj' is garbage collected.
    ref = weakref.ref(obj, callback)

    # 3. Form a reference cycle so plain refcounting cannot free obj
    obj.append(obj)
    return ref

ref = create_local_cycle_and_collect()
assert not collected_list          # obj is trapped in the cycle
gc.collect()
assert collected_list == [True]    # freed only by the cycle collector

@KEKE046 (Contributor, Author) commented Dec 11, 2025

Sorry @tqchen, you're so fast that my push may be outdated.

@tqchen (Member) commented Dec 11, 2025

/gemini review

@gemini-code-assist bot left a comment

Code Review

This pull request introduces a clever optimization to break reference cycles in the FFI error handling path, which will improve garbage collection performance and reduce memory pressure, especially during neural network training. The approach of using try...finally to del references and a nested function to manage local variable scope is well-reasoned. My feedback focuses on improving the clarity and accuracy of the comments that explain these subtle mechanisms, to ensure the code remains maintainable.

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
@KEKE046 (Contributor, Author) commented Dec 11, 2025

I'll try to create a test to verify the error

@KEKE046 (Contributor, Author) commented Dec 11, 2025

Some lint errors don't reproduce in my environment, so I'll check.

@junrushao (Member) commented
Incredible work!

@junrushao junrushao merged commit 6ccbdb6 into apache:main Dec 11, 2025
7 checks passed
@tqchen (Member) commented Dec 11, 2025

Thanks @KEKE046 for great work!
