Fix reference cycle in hf_raise_for_status delaying object destruction#4084
Closed
yg7445 wants to merge 1 commit intohuggingface:mainfrom
Closed
Fix reference cycle in hf_raise_for_status delaying object destruction#4084yg7445 wants to merge 1 commit intohuggingface:mainfrom
yg7445 wants to merge 1 commit intohuggingface:mainfrom
Conversation
Commit 098091f ("huggingface#3889") changed hf_raise_for_status() from inline raises to storing exceptions in local variables before raising: entry_err = _format(RemoteEntryNotFoundError, message, response) entry_err.repo_type = repo_type raise entry_err from e This creates a CPython reference cycle: entry_err.__cause__ -> e, and e.__traceback__ -> frame -> f_locals['entry_err'] -> entry_err. The cycle prevents the exception from being freed when except blocks exit. When this exception propagates through callers (e.g. transformers' cached_files -> LLM.__init__), the traceback chain holds a reference to `self`, preventing refcount-based cleanup. In vllm, this means `del llm` doesn't trigger the weakref finalizer that sends SIGTERM to the EngineCore subprocess, so GPU memory is never released. Fix by moving repo_type/repo_id/bucket_id assignment into helper functions (_format_with_repo_info, _format_with_bucket_info) so the exception is never stored as a local in hf_raise_for_status. Co-authored-by: Claude <noreply@anthropic.com>
Contributor
|
Hi @yg7445 , thanks a lot for your PR and detailed report. I've being playing with it locally and I can also confirm the fix seems to work. If that's fine with you, I have opened a separate PR to fix the problem in a slightly cleaner way. Instead of Also the test I've added is actually checking the circular reference has been solved. Therefore, is it fine with you if I close this PR in favor of #4092? |
Contributor
|
Closing in favor of #4092 (same fix, applied differently). Will be released later this week. |
Author
|
@Wauplin Sounds good to me, thanks for the improvement and for taking care of it! |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Commit 098091f ("#3889") changed
hf_raise_for_status()from inline raises to storing exceptions in local variables before raising:This creates a CPython reference cycle:
entry_err.__cause__→e(the originalHTTPStatusError)e.__traceback__→ traceback →tb_frame→hf_raise_for_statusframehf_raise_for_statusframe →f_locals['entry_err']→ back to (1)The cycle prevents the exception from being freed by refcounting when
exceptblocks exit. The cyclic GC will eventually collect it, but the delay is long enough to cause real problems. When this exception propagates through callers (e.g.transformers.cached_files→LLM.__init__), the traceback chain holds a reference toselfin the caller's frame, preventing deterministic cleanup.In vLLM, this means
del llmdoesn't immediately trigger theweakref.finalizethat sends SIGTERM to the EngineCore subprocess, so GPU memory isn't released until the cyclic GC eventually runs. Bisected tov1.6.0—v1.5.0works fine. Related: vllm-project/vllm#38384.Fix
Move
repo_type/repo_id/bucket_idassignment into helper functions (_format_with_repo_info,_format_with_bucket_info) so the exception object is never stored as a local variable inhf_raise_for_status's frame. This preserves the functionality added in #3889 while avoiding the reference cycle.Test plan
_format_with_repo_infoand_format_with_bucket_infohf_raise_for_statustests still passtests/v1/shutdown/test_delete.pypasses (8/8) with this fixNote
Low Risk
Low risk: refactors how Hub HTTP exceptions are constructed/raised while preserving error types and attached metadata; main risk is subtle differences in exception object lifetimes or attributes in edge cases.
Overview
Fixes
hf_raise_for_statusto avoid creating reference cycles when enriching raised HTTP exceptions with repo/bucket metadata.This introduces
_format_with_repo_infoand_format_with_bucket_infohelpers that setrepo_type/repo_id/bucket_idon the error without keeping the exception in local variables, and updates the relevant raise paths to use them. Adds focused unit tests covering both helpers.Reviewed by Cursor Bugbot for commit e49bb83. Bugbot is set up for automated code reviews on this repo. Configure here.