[Feat] Remove reference cycle in error handling for faster GC #327
junrushao merged 21 commits into apache:main
Conversation
Summary of Changes: Hello @KEKE046, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request addresses a performance bottleneck in the TVM FFI error handling mechanism. It resolves an issue where reference cycles were inadvertently created, preventing timely garbage collection of critical objects like tensors and call chain elements. By strategically breaking these cycles, the change significantly reduces memory pressure and enhances the efficiency of neural network training by allowing resources to be freed sooner.
Code Review
This pull request aims to resolve a reference cycle in error handling to improve garbage collection performance, particularly for tensors during neural network training. The strategy of using try...finally to explicitly delete variables that form the cycle is sound. While reviewing the changes in python/tvm_ffi/error.py, I identified a critical issue in the append_traceback method where the new implementation could mask exceptions. I have provided a detailed comment and a code suggestion to address this problem while maintaining the original intent of breaking the reference cycle.
No @gemini-code-assist, it's wrong, it creates a circular reference.

But I came up with another hack to prevent the circular reference:

```python
# Magic hack to prevent a circular reference (inside append_traceback)
def create(tb, frame, lineno):
    return types.TracebackType(tb, frame, frame.f_lasti, lineno)

return create(tb, self._create_frame(filename, lineno, func), lineno)
```
LGTM!

This is an incredible finding! CC: @tqchen

Thanks @KEKE046, @junrushao let us prioritize merging this.
tqchen left a comment:

LGTM, minor comment on readability: it would be great to add more comments on why this happens and how the change prevents it from happening.
/gemini review
Code Review
This pull request effectively addresses a subtle but important issue of reference cycles in error handling, which can impact memory usage and performance in machine learning workloads. The changes in python/tvm_ffi/error.py use two clever techniques to break these cycles. The use of a nested function in append_traceback is a good pattern to avoid frame-related cycles, and the try...finally block in _with_append_backtrace aims to clear references to break cycles involving exception objects. While the overall approach is excellent, I've identified a potential issue in the finally block that could mask exceptions, and I've also suggested improving a comment for better long-term maintainability. My feedback includes code suggestions for both points.
@KEKE046 it would be good to go through one pass over the Gemini reviews and add comments about context, thank you so much!
python/tvm_ffi/error.py (Outdated)

```python
        tb = _TRACEBACK_MANAGER.append_traceback(tb, filename, lineno, func)
        return py_error.with_traceback(tb)
    finally:
        # this is a hack to break reference cycle
```
"we explicitly break the reference cycle here" would be better; this is deliberate and not a workaround, so we do not need to say it is a "hack".
Thanks @KEKE046, I think we are close, please take a look at the latest comments.

Another minor question: is it possible to create a regression test for this? GenAI suggests that maybe one way to do this is to create a weakref with a callback that gets triggered on deletion; then we confirm in your example whether the torch.Tensor (or anything else) actually gets recycled when an exception is raised:

```python
import gc
import weakref

# List to track if the object was collected
collected_list = []

def callback(wr):
    """Callback triggered when the weakref object is finalized."""
    collected_list.append(True)

def create_local_cycle_and_collect():
    # 1. Create an object (a list is simple)
    obj = [1, 2, 3]
    # 2. Create a weak reference to the object.
    #    This reference will trigger 'callback' when 'obj' is garbage collected.
    ref = weakref.ref(obj, callback)
```
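One possible way to finish the sketch, as a guess at the confirmation step (the body of `create_local_cycle_and_collect` would also need to route an exception through the tvm-ffi error path to actually exercise the cycle): disable the cyclic collector so only reference counting can free `obj`, call the function, and assert that the weakref callback has already fired.

```python
# Hypothetical confirmation step, continuing the snippet above.
gc.disable()     # only reference counting may free obj
try:
    create_local_cycle_and_collect()
    # If no reference cycle kept obj alive, refcounting freed it on return
    # and the weakref callback has already appended to collected_list.
    assert collected_list, "obj survived; a reference cycle kept it alive"
finally:
    gc.enable()
```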
Sorry @tqchen, you're so fast that my push may be outdated.

/gemini review
Code Review
This pull request introduces a clever optimization to break reference cycles in the FFI error handling path, which will improve garbage collection performance and reduce memory pressure, especially during neural network training. The approach of using try...finally to del references and a nested function to manage local variable scope is well-reasoned. My feedback focuses on improving the clarity and accuracy of the comments that explain these subtle mechanisms, to ensure the code remains maintainable.
I'll try to create a test to verify the error.

Some lint errors don't occur in my environment, so I'll check them.

Incredible work!

Thanks @KEKE046 for the great work!

When training neural networks, it is beneficial to free tensors as early as possible once they are no longer used. In CPython, as long as there is no reference cycle, tensors are freed immediately when their reference count drops to zero.
Torch blog: Finding and Removing Reference Cycles
However, in our TVM FFI error handling path, we create reference cycles that keep intermediate tensors (and the entire call chain) alive much longer than necessary. This slows down local GC, increases memory pressure, and hurts training throughput.
Problem
The following code reproduces the problem: we call `Map.get` with a non-existent key, which raises a `KeyError` that is captured by `Map`.

The following figure demonstrates the call chain and reference cycles.

The following figure shows the object reference graph from torch `warn_tensor_cycles`.
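To make the cycle concrete, here is a small self-contained sketch in plain Python (no tvm-ffi; `Tensor` is just a stand-in) showing how an exception stored in a local variable of a frame that appears in its own traceback keeps that frame's locals alive until the cyclic garbage collector runs:

```python
import gc
import weakref

class Tensor:
    """Stand-in for a real tensor we want freed promptly."""

freed = []

def handle_error():
    t = Tensor()
    weakref.finalize(t, lambda: freed.append(True))  # fires when t is collected
    try:
        raise KeyError("non-existent key")
    except KeyError as err:
        # Keeping the exception around creates the cycle:
        # saved -> __traceback__ -> this frame -> locals -> saved (and t)
        saved = err

gc.disable()        # only reference counting may free objects
handle_error()
print(bool(freed))  # False: the cycle keeps t alive after the call returns
gc.collect()        # the cyclic collector eventually breaks the cycle
print(bool(freed))  # True
gc.enable()
```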
Approach
As shown in the figure, the three variables `frame`, `py_error`, and `tb` are the core of the reference cycle. To remove the cycle, we can manually delete these three variables in the loop.

tvm-ffi/python/tvm_ffi/error.py, lines 124 to 136 at f7e09d6

We can use `try ... finally ...` to manually delete these variables after the function returns.
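For illustration, the same `try ... finally` trick applied in `_with_append_backtrace` can be shown in isolation, again in plain Python with hypothetical names rather than the actual tvm-ffi code: deleting the frame's reference to the exception before the function returns removes the cycle, so reference counting alone frees the tensor immediately.

```python
import gc
import weakref

class Tensor:
    """Stand-in for a real tensor we want freed promptly."""

freed = []

def handle_error_fixed():
    t = Tensor()
    weakref.finalize(t, lambda: freed.append(True))
    saved = None
    try:
        try:
            raise KeyError("non-existent key")
        except KeyError as err:
            saved = err  # e.g. kept around to augment its traceback
    finally:
        # Explicitly break the reference cycle, mirroring the deletion of
        # py_error, tb, and frame described above: once this frame no longer
        # references the exception, nothing cyclic keeps it alive.
        del saved

gc.disable()        # only reference counting may free objects
handle_error_fixed()
print(bool(freed))  # True: reference counting alone freed the tensor
gc.enable()
```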