
Set HF_HUB_OFFLINE=1 by default, enabled with '--hf_token' flag. #2086

Merged
nv-mollys merged 2 commits into NVIDIA-NeMo:main from sudostock:hf_hub_offline-main on Jan 28, 2026


Conversation


@sudostock sudostock commented Jan 27, 2026

This makes it consistent with the TRANSFORMERS_OFFLINE variable.

Many mbridge recipes require Hugging Face data for config information and/or the tokenizer. This causes problems both internally and at customer sites, since HF connections tend to get rate limited, especially when launching many jobs.

To enable offline mode, we need HF_HUB_OFFLINE=1 and the necessary files in the local cache. We've aligned HF_HUB_OFFLINE to operate like TRANSFORMERS_OFFLINE: default to offline unless '--hf_token' is specified. Not all models require an hf_token for access, but the rate limits are strict enough that enforcing this is worthwhile.
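A minimal sketch of the behavior described above. `PERF_ENV_VARS` matches the dict name from this PR, but `build_env` and its exact shape are illustrative stand-ins, not the actual code in scripts/performance/utils/executors.py:

```python
# Illustrative sketch of the offline-by-default logic (hypothetical helper;
# the real implementation lives in scripts/performance/utils/executors.py).
PERF_ENV_VARS = {
    "HF_HUB_OFFLINE": "1",        # default: never contact the HF Hub
    "TRANSFORMERS_OFFLINE": "1",  # transformers likewise defaults offline
}

def build_env(hf_token=None):
    """Return env vars for a job; an --hf_token value re-enables online access."""
    env = dict(PERF_ENV_VARS)  # per-call copy so the module default stays untouched
    if hf_token is not None:
        env["HF_TOKEN"] = hf_token
        env["TRANSFORMERS_OFFLINE"] = "0"
        env["HF_HUB_OFFLINE"] = "0"
    return env
```

With no token the job stays offline and reads only from the local HF cache; passing a token flips both offline flags back to "0".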

Summary by CodeRabbit

Release Notes

  • Bug Fixes

    • Improved offline mode handling for Hugging Face Hub access in performance tests, ensuring online downloads are properly enabled when authentication credentials are available.
  • Chores

    • Refined environment variable configuration logic for enhanced clarity.


This makes it consistent with the TRANSFORMERS_OFFLINE variable.

Signed-off-by: Alex Filby <afilby@nvidia.com>
copy-pr-bot bot commented Jan 27, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.


coderabbitai bot commented Jan 27, 2026

📝 Walkthrough

Default environment variable HF_HUB_OFFLINE changed from "0" to "1" in PERF_ENV_VARS. When an HF token is provided to slurm_executor, environment variable assignments now use explicit statements instead of dict.update(), with HF_HUB_OFFLINE explicitly set to "0".

Changes

  • Environment configuration in slurm executor — scripts/performance/utils/executors.py: HF_HUB_OFFLINE default changed to "1"; when hf_token is provided, replaced dict.update() with explicit assignments for HF_TOKEN, TRANSFORMERS_OFFLINE, and HF_HUB_OFFLINE.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~8 minutes

🚥 Pre-merge checks | ✅ 3 | ❌ 1

❌ Failed checks (1 warning)
  • Test Results For Major Changes ⚠️ Warning — The PR makes a major behavioral change to Hugging Face model access without documenting test results, verification of offline-mode functionality, or convergence/performance validation. Resolution: document test results for offline mode with all affected recipes, confirm the HF token flag enables online mode, validate no regressions, and resolve the global state mutation issue in PERF_ENV_VARS.

✅ Passed checks (3 passed)
  • Description Check ✅ Passed — Check skipped; CodeRabbit's high-level summary is enabled.
  • Title Check ✅ Passed — The title clearly and specifically describes the main change: setting HF_HUB_OFFLINE=1 by default and enabling online access with the '--hf_token' flag, which aligns with the changeset.
  • Docstring Coverage ✅ Passed — Docstring coverage is 100.00%, above the required 80.00% threshold.



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@scripts/performance/utils/executors.py`, around lines 109-112: the code mutates the global PERF_ENV_VARS in-place when hf_token is provided, causing HF_HUB_OFFLINE/TRANSFORMERS_OFFLINE to persist across calls/threads. Fix by creating a per-call copy (e.g., local_env = PERF_ENV_VARS.copy()) inside the executor before modifying it, and use local_env for subprocess/env injection instead of mutating PERF_ENV_VARS. Update the branch that checks hf_token to set values on local_env, and ensure any code that previously referenced PERF_ENV_VARS in this execution uses local_env.

Comment on lines 109 to 112

     if hf_token is not None:
-        PERF_ENV_VARS.update({"HF_TOKEN": hf_token, "TRANSFORMERS_OFFLINE": "0"})
+        PERF_ENV_VARS["HF_TOKEN"] = hf_token
+        PERF_ENV_VARS["TRANSFORMERS_OFFLINE"] = "0"
+        PERF_ENV_VARS["HF_HUB_OFFLINE"] = "0"

⚠️ Potential issue | 🟠 Major

Avoid global env var leakage across calls.
Mutating PERF_ENV_VARS in-place makes HF_HUB_OFFLINE/TRANSFORMERS_OFFLINE “sticky” across future invocations (and across threads). Use a per-call copy instead.

✅ Proposed fix (use a local copy)
 def slurm_executor(
@@
 ) -> run.SlurmExecutor:
@@
-    if wandb_key is not None:
-        PERF_ENV_VARS["WANDB_API_KEY"] = wandb_key
+    env_vars = PERF_ENV_VARS.copy()
+    if wandb_key is not None:
+        env_vars["WANDB_API_KEY"] = wandb_key
@@
-        PERF_ENV_VARS["NCCL_NET_GDR_LEVEL"] = "PHB"  # For NCCL 2.25
-        PERF_ENV_VARS["NCCL_NET_GDR_C2C"] = "1"  # For NCCL 2.26
+        env_vars["NCCL_NET_GDR_LEVEL"] = "PHB"  # For NCCL 2.25
+        env_vars["NCCL_NET_GDR_C2C"] = "1"  # For NCCL 2.26
@@
-        PERF_ENV_VARS["NEMO_HOME"] = nemo_home
+        env_vars["NEMO_HOME"] = nemo_home
@@
-        PERF_ENV_VARS["HF_TOKEN"] = hf_token
-        PERF_ENV_VARS["TRANSFORMERS_OFFLINE"] = "0"
-        PERF_ENV_VARS["HF_HUB_OFFLINE"] = "0"
+        env_vars["HF_TOKEN"] = hf_token
+        env_vars["TRANSFORMERS_OFFLINE"] = "0"
+        env_vars["HF_HUB_OFFLINE"] = "0"
@@
-    PERF_ENV_VARS.update(custom_env_vars)
+    env_vars.update(custom_env_vars)
@@
-        env_vars=PERF_ENV_VARS,
+        env_vars=env_vars,
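To see why the in-place mutation is "sticky", here is a minimal standalone demonstration of the leak and the per-call-copy fix. The function names are hypothetical; only the PERF_ENV_VARS pattern mirrors the PR:

```python
# Standalone demonstration of the leak flagged above: mutating a module-level
# dict persists across calls, while a per-call copy does not.
PERF_ENV_VARS = {"HF_HUB_OFFLINE": "1", "TRANSFORMERS_OFFLINE": "1"}

def leaky_executor_env(hf_token=None):
    # Bug: writes through to the shared module-level dict.
    if hf_token is not None:
        PERF_ENV_VARS["HF_HUB_OFFLINE"] = "0"
    return PERF_ENV_VARS

def safe_executor_env(hf_token=None):
    # Fix: copy first, so later token-less calls stay offline.
    env = PERF_ENV_VARS.copy()
    if hf_token is not None:
        env["HF_HUB_OFFLINE"] = "0"
    return env
```

After one call to the leaky version with a token, every subsequent token-less call also comes back online; the copying version keeps the module-level default at "1".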


@malay-nagda malay-nagda left a comment


Thanks for this fix!

@malay-nagda malay-nagda requested a review from ko3n1g January 28, 2026 07:36
@nv-mollys nv-mollys enabled auto-merge (squash) January 28, 2026 17:59
@nv-mollys

/ok to test 17b4744

@nv-mollys nv-mollys merged commit 8a937f6 into NVIDIA-NeMo:main Jan 28, 2026
48 checks passed
conver334 pushed a commit to conver334/Megatron-Bridge that referenced this pull request Jan 30, 2026
…DIA-NeMo#2086)

Signed-off-by: Alex Filby <afilby@nvidia.com>
Signed-off-by: conver334 <conver334@gmail.com>


4 participants