
Fix reference cycle in hf_raise_for_status delaying object destruction #4084

Closed
yg7445 wants to merge 1 commit into huggingface:main from yg7445:fix/hf-raise-for-status-refcycle

yg7445 commented Apr 10, 2026

Summary

Commit 098091f ("#3889") changed hf_raise_for_status() from inline raises to storing exceptions in local variables before raising:

# Before (v1.5.0) — no cycle
raise _format(RemoteEntryNotFoundError, message, response) from e

# After (v1.6.0) — creates cycle
entry_err = _format(RemoteEntryNotFoundError, message, response)
entry_err.repo_type = repo_type
entry_err.repo_id = repo_id
raise entry_err from e

This creates a CPython reference cycle:

  1. entry_err.__cause__ → e (the original HTTPStatusError)
  2. e.__traceback__ → traceback → tb_frame → hf_raise_for_status frame
  3. hf_raise_for_status frame → f_locals['entry_err'] → back to (1)

The cycle prevents the exception from being freed by refcounting when except blocks exit. The cyclic GC will eventually collect it, but the delay is long enough to cause real problems. When this exception propagates through callers (e.g. transformers.cached_files → LLM.__init__), the traceback chain holds a reference to self in the caller's frame, preventing deterministic cleanup.
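The cycle can be reproduced outside huggingface_hub. The sketch below is a stand-alone illustration, not the library code: HubError, raise_with_local, raise_inline and survives_refcounting are hypothetical names, and the cyclic GC is disabled so that only refcounting can free the exception.

```python
import gc
import weakref


class HubError(Exception):
    """Hypothetical stand-in for a Hub error type."""


def raise_with_local():
    try:
        raise ValueError("original")
    except ValueError as e:
        err = HubError("wrapped")  # exception stored in a local variable
        err.repo_id = "user/repo"
        raise err from e  # this frame (and its `err` local) ends up in the traceback


def raise_inline():
    try:
        raise ValueError("original")
    except ValueError as e:
        raise HubError("wrapped") from e  # no local refers to the new exception


def survives_refcounting(fn):
    """Return True if the exception raised by fn outlives the except block."""
    gc.collect()
    gc.disable()  # leave only refcounting active
    try:
        try:
            fn()
        except HubError as exc:
            ref = weakref.ref(exc)
        # `exc` has been unbound here; only a cycle can keep the exception alive
        return ref() is not None
    finally:
        gc.enable()
```

Under CPython, survives_refcounting(raise_with_local) should return True and survives_refcounting(raise_inline) should return False.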

In vLLM, this means del llm doesn't immediately trigger the weakref.finalize callback that sends SIGTERM to the EngineCore subprocess, so GPU memory isn't released until the cyclic GC eventually runs. Bisected to v1.6.0; v1.5.0 works fine. Related: vllm-project/vllm#38384.
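The effect on a weakref.finalize callback can be sketched in the same way. Engine, fail_with_local, fail_inline and finalizer_events are hypothetical stand-ins (this is not vLLM or huggingface_hub code). The error is caught inside __init__, yet its traceback still references the __init__ frame and therefore self.

```python
import gc
import weakref


class HubError(Exception):
    """Hypothetical stand-in for a Hub error type."""


def fail_with_local():
    try:
        raise ValueError("404")
    except ValueError as e:
        err = HubError("entry not found")  # local variable pins the exception
        raise err from e


def fail_inline():
    try:
        raise ValueError("404")
    except ValueError as e:
        raise HubError("entry not found") from e


class Engine:
    """Hypothetical object whose cleanup relies on weakref.finalize."""

    def __init__(self, events, probe):
        weakref.finalize(self, events.append, "shutdown")
        try:
            probe()  # e.g. looking up an optional file
        except HubError:
            pass  # handled, but the traceback has recorded this frame


def finalizer_events(probe):
    """Return the finalizer log right after `del` and again after a GC pass."""
    events = []
    gc.collect()
    gc.disable()  # leave only refcounting active
    try:
        engine = Engine(events, probe)
        del engine
        after_del = list(events)
        gc.collect()
        after_gc = list(events)
    finally:
        gc.enable()
    return after_del, after_gc
```

With fail_with_local the finalizer should only run after the explicit gc.collect(); with fail_inline it should already have run when del returns.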

Fix

Move repo_type/repo_id/bucket_id assignment into helper functions (_format_with_repo_info, _format_with_bucket_info) so the exception object is never stored as a local variable in hf_raise_for_status's frame. This preserves the functionality added in #3889 while avoiding the reference cycle.
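A rough sketch of the helper shape follows. The bodies are assumptions for illustration: _format here is a simplified stand-in for the real function, RuntimeError stands in for HTTPStatusError, and raise_for_status_sketch is a hypothetical caller. The point is that the helper returns normally, so its frame (and its err local) is discarded and never appears in a traceback.

```python
class RemoteEntryNotFoundError(Exception):
    """Stand-in for the huggingface_hub error class of the same name."""


def _format(error_type, message, response):
    # Simplified stand-in: the real _format builds a richer message from `response`.
    return error_type(message)


def _format_with_repo_info(error_type, message, response, repo_type, repo_id):
    err = _format(error_type, message, response)
    err.repo_type = repo_type
    err.repo_id = repo_id
    return err  # normal return: this frame is freed and does not pin `err`


def raise_for_status_sketch(response, repo_type, repo_id):
    try:
        raise RuntimeError("404 from server")  # stand-in for HTTPStatusError
    except RuntimeError as e:
        # The new exception is never bound to a local in this frame.
        raise _format_with_repo_info(
            RemoteEntryNotFoundError, "Entry not found", response, repo_type, repo_id
        ) from e


try:
    raise_for_status_sketch(response=None, repo_type="model", repo_id="user/repo")
except RemoteEntryNotFoundError as exc:
    caught = (exc.repo_type, exc.repo_id, type(exc.__cause__).__name__)
```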

Test plan

  • Added unit tests for _format_with_repo_info and _format_with_bucket_info
  • Verified all existing hf_raise_for_status tests still pass
  • Verified vLLM's tests/v1/shutdown/test_delete.py passes (8/8) with this fix

Note

Low Risk
Low risk: refactors how Hub HTTP exceptions are constructed/raised while preserving error types and attached metadata; main risk is subtle differences in exception object lifetimes or attributes in edge cases.

Overview
Fixes hf_raise_for_status to avoid creating reference cycles when enriching raised HTTP exceptions with repo/bucket metadata.

This introduces _format_with_repo_info and _format_with_bucket_info helpers that set repo_type/repo_id/bucket_id on the error without keeping the exception in local variables, and updates the relevant raise paths to use them. Adds focused unit tests covering both helpers.

Reviewed by Cursor Bugbot for commit e49bb83.

Commit 098091f ("huggingface#3889") changed hf_raise_for_status() from inline
raises to storing exceptions in local variables before raising:

    entry_err = _format(RemoteEntryNotFoundError, message, response)
    entry_err.repo_type = repo_type
    raise entry_err from e

This creates a CPython reference cycle: entry_err.__cause__ -> e, and
e.__traceback__ -> frame -> f_locals['entry_err'] -> entry_err. The
cycle prevents the exception from being freed when except blocks exit.

When this exception propagates through callers (e.g. transformers'
cached_files -> LLM.__init__), the traceback chain holds a reference
to `self`, preventing refcount-based cleanup. In vLLM, this means
`del llm` doesn't trigger the weakref finalizer that sends SIGTERM
to the EngineCore subprocess, so GPU memory is not released until
the cyclic GC runs.

Fix by moving repo_type/repo_id/bucket_id assignment into helper
functions (_format_with_repo_info, _format_with_bucket_info) so the
exception is never stored as a local in hf_raise_for_status.

Co-authored-by: Claude <noreply@anthropic.com>
Contributor

Wauplin commented Apr 13, 2026

Hi @yg7445, thanks a lot for your PR and detailed report. I've been playing with it locally and I can confirm the fix seems to work. If that's fine with you, I have opened a separate PR to fix the problem in a slightly cleaner way. Instead of _format_with_repo_info and _format_with_bucket_info, I have added an **attrs parameter to _format to set arbitrary attributes on the exception.
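For readers following along, the **attrs variant might look roughly like the sketch below. The signature and body are assumptions (the actual change lives in #4092 and is not reproduced here); the error class is a stand-in and lookup is a hypothetical caller.

```python
class RemoteEntryNotFoundError(Exception):
    """Stand-in for the huggingface_hub error class of the same name."""


def _format(error_type, message, response, **attrs):
    # Simplified stand-in: the real _format also uses `response` to build the message.
    err = error_type(message)
    for name, value in attrs.items():
        setattr(err, name, value)  # e.g. repo_type, repo_id, bucket_id
    return err


def lookup(response):
    try:
        raise RuntimeError("404 from server")  # stand-in for HTTPStatusError
    except RuntimeError as e:
        # Still an inline raise: no local variable holds the new exception.
        raise _format(
            RemoteEntryNotFoundError,
            "Entry not found",
            response,
            repo_type="model",
            repo_id="user/repo",
        ) from e
```

A single keyword-based entry point avoids adding one helper per combination of attributes.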

Also, the test I've added actually checks that the circular reference is gone.

Therefore, is it fine with you if I close this PR in favor of #4092?

Contributor

Wauplin commented Apr 13, 2026

Closing in favor of #4092 (same fix, applied differently). Will be released later this week.

Wauplin closed this Apr 13, 2026
Author

yg7445 commented Apr 13, 2026

@Wauplin Sounds good to me, thanks for the improvement and for taking care of it!
