Clarify HF offline behavior with --offline flag #2847

Open
sudostock wants to merge 7 commits into main from afilby/offline-mode-flag

@sudostock sudostock commented Mar 17, 2026

Update of #2672 for main branch:

  • add --offline (Slurm launcher) to set HF_HUB_OFFLINE=1
  • keep default split: TRANSFORMERS_OFFLINE=1, HF_HUB_OFFLINE=0
  • keep --hf_token behavior (HF_TOKEN + TRANSFORMERS_OFFLINE=0)
  • make --hf_token and --offline mutually exclusive
  • update Qwen3 pretrain validation to require hf_token or offline and explicitly state offline mode requires pre-downloaded tokenizer cache
  • always pass --container-writable in Slurm srun args (with rationale comment)
  • expand performance README with clear, actionable HF cache/offline guidance
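
The flag and environment-variable rules listed above can be sketched as follows. This is a minimal illustration, not the PR's code: `build_parser` and `resolve_hf_env` are hypothetical names, and the real parser in `scripts/performance/argument_parser.py` has many more options.

```python
import argparse
from typing import Dict, Optional


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical, trimmed-down parser with only the two HF-related flags."""
    parser = argparse.ArgumentParser(prog="setup_experiment")
    # argparse rejects the combination at parse time (exit status 2).
    group = parser.add_mutually_exclusive_group()
    group.add_argument("--hf_token", type=str, default=None)
    group.add_argument("--offline", action="store_true")
    return parser


def resolve_hf_env(hf_token: Optional[str], offline: bool) -> Dict[str, str]:
    """Map the flags to the env vars described in the PR summary."""
    # Default split: Transformers offline, Hub access still allowed.
    env = {"TRANSFORMERS_OFFLINE": "1", "HF_HUB_OFFLINE": "0"}
    if hf_token is not None:
        # A token implies online access for Transformers.
        env["HF_TOKEN"] = hf_token
        env["TRANSFORMERS_OFFLINE"] = "0"
    if offline:
        # Fully offline: the tokenizer must already be in the local cache.
        env["HF_HUB_OFFLINE"] = "1"
    return env
```

Calling `resolve_hf_env(None, True)` under these assumptions yields both variables set to `"1"`, which is why offline mode only works with a pre-downloaded tokenizer cache.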

Removed in #2112; originally added in #2086 and #2084.

Summary by CodeRabbit

  • New Features

    • Added --offline flag for offline HuggingFace Hub access with local caching support.
    • --offline and --hf_token are now mutually exclusive CLI options.
    • Slurm launcher now includes container writable mode by default.
  • Documentation

    • Expanded guidance on offline usage, HuggingFace connectivity, caching behavior, and Slurm launcher configuration.
  • Tests

    • Added comprehensive test coverage for offline mode functionality.

Signed-off-by: Alex Filby <afilby@nvidia.com>
Signed-off-by: Alex Filby <afilby@nvidia.com>
- Remove non-required args from CLI parser test (account, partition)
- Clarify docstring on executor offline test
- Add test pinning container-writable as unconditional and HF_HUB_OFFLINE=0 as the default

Signed-off-by: Alex Filby <afilby@nvidia.com>
…se and improve Qwen3 message

Use argparse mutually_exclusive_group so conflicting flags are rejected
at parse time. Clarify the Qwen3 assertion to mention huggingface-cli
download and HF_HOME. Reformat test file per ruff.

Signed-off-by: Alex Filby <afilby@nvidia.com>
copy-pr-bot bot commented Mar 17, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Signed-off-by: Alex Filby <afilby@nvidia.com>
Signed-off-by: Alex Filby <afilby@nvidia.com>
@sudostock sudostock force-pushed the afilby/offline-mode-flag branch from c0946df to f6c23d0 on March 17, 2026 21:19
@sudostock sudostock marked this pull request as ready for review March 17, 2026 23:37
@sudostock sudostock requested review from a team, erhoo82 and malay-nagda as code owners March 17, 2026 23:37
coderabbitai bot commented Mar 17, 2026

📝 Walkthrough

This pull request introduces offline mode support for HuggingFace Hub connectivity in the performance experiment CLI. An --offline flag is added to argument parsing with mutual exclusion from --hf_token, propagated through setup and executor functions to configure environment variables, and documented with comprehensive guidance and unit tests.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Documentation: `scripts/performance/README.md` | Adds guidance on `--offline` flag mutual exclusion with `--hf_token`, a new section on HuggingFace connectivity and caching behavior, the Slurm launcher writable container flag, and practical offline usage instructions. |
| Argument Parsing & Setup: `scripts/performance/argument_parser.py`, `scripts/performance/setup_experiment.py` | Introduces the `--offline` flag in a mutual exclusion group with `--hf_token`; adds an `offline` parameter to `main()` and propagates it through executor invocations; relaxes the Qwen3 tokenizer requirement to accept `hf_token` or `offline`. |
| Executor Configuration: `scripts/performance/utils/executors.py` | Adds an `offline` parameter to `slurm_executor` with default `False`; sets `HF_HUB_OFFLINE="1"` when `offline` is `True`; adds `--container-writable` to the srun arguments. |
| Testing: `tests/unit_tests/scripts/test_performance_offline_mode.py` | New test module with a mocked `nemo_run` scaffold covering CLI argument parsing (offline flag acceptance, mutual exclusion validation) and Slurm executor behavior (offline/online environment variables and container writable flag configuration). |

Sequence Diagram

sequenceDiagram
    participant User as User / CLI
    participant Parser as Argument Parser
    participant Setup as setup_experiment.main()
    participant Executor as slurm_executor()
    participant Slurm as Slurm Batch System

    User->>Parser: Call with --offline flag
    Parser->>Parser: Parse args, validate --offline<br/>and --hf_token mutual exclusion
    Parser->>Setup: Call main(..., offline=True)
    Setup->>Setup: Validate Qwen3 requirement<br/>(hf_token or offline)
    Setup->>Executor: Invoke slurm_executor(..., offline=True)
    Executor->>Executor: Set HF_HUB_OFFLINE="1"<br/>in PERF_ENV_VARS
    Executor->>Executor: Add --container-writable<br/>to srun_args
    Executor->>Slurm: Submit sbatch with offline<br/>environment & writable container
    Slurm->>Slurm: Execute task without<br/>HF Hub network calls

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~22 minutes

🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly and accurately summarizes the main change: adding an --offline flag to clarify HuggingFace offline behavior. It is concise, specific, and directly related to the primary objective of the pull request. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 90.91%, which is sufficient. The required threshold is 80.00%. |
| Test Results For Major Changes | ✅ Passed | PR adds offline flag and configuration infrastructure changes with 5 comprehensive unit tests covering CLI parsing, mutual exclusion, executor behavior, and default/token modes. |


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
scripts/performance/README.md (1)

244-244: ⚠️ Potential issue | 🟡 Minor

Typographical error: double hyphen prefix.

Line 244 has - - which appears to be a typo.

📝 Proposed fix
-- - `--save_config_filepath`: Path to save the task configuration file.
+- `--save_config_filepath`: Path to save the task configuration file.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/performance/README.md` at line 244, Fix the typographical
double-hyphen bullet at the README entry by removing the extra hyphen so the
line reads a single list bullet with the option name; specifically update the
text containing "`--save_config_filepath`" to remove the leading duplicate
hyphen (change "- - `--save_config_filepath`..." to "-
`--save_config_filepath`...").
scripts/performance/setup_experiment.py (1)

340-355: ⚠️ Potential issue | 🟠 Major

dgxc_executor does not accept an offline parameter, creating inconsistent behavior with slurm_executor.

The offline flag from CLI is silently ignored when a DGXCloud cluster is specified because dgxc_executor (lines 167-182 in scripts/performance/utils/executors.py) does not accept an offline parameter, unlike slurm_executor which conditionally sets HF_HUB_OFFLINE=1 when offline mode is enabled (lines 114-116). This means users running with --offline --dgxc_cluster=... cannot dynamically enable offline mode for DGXCloud.

The dgxc_executor hardcodes TRANSFORMERS_OFFLINE=1 by default but lacks the conditional logic to respect user intent for offline mode. Consider either:

  1. Adding offline parameter support to dgxc_executor with conditional HF_HUB_OFFLINE handling, or
  2. Raising an error when --offline is used with DGXCloud (if offline mode is not supported there)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/performance/setup_experiment.py` around lines 340 - 355, The
dgxc_executor call ignores the CLI --offline flag because dgxc_executor (in
scripts/performance/utils/executors.py) does not accept an offline parameter
while slurm_executor does; to fix, add an offline parameter to dgxc_executor
signature and propagation in its call sites (including
scripts/performance/setup_experiment.py) and implement the same conditional
environment handling as slurm_executor: when offline is True set
HF_HUB_OFFLINE=1 (and only set TRANSFORMERS_OFFLINE/other offline-related env
vars when appropriate) or, if DGXCloud truly cannot support offline mode,
explicitly detect offline in dgxc_executor and raise a clear error indicating
--offline is unsupported for DGXCloud. Ensure you update the dgxc_executor
parameter list, its custom_env_vars construction, and the call site in
setup_experiment.py to pass the offline flag.
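
Both options suggested in this comment can be sketched in one place. The helper below is hypothetical (`dgxc_env_vars` and its `supports_offline` switch are not part of the repository); it only assumes, as the comment states, that the DGXCloud path defaults to `TRANSFORMERS_OFFLINE=1`.

```python
from typing import Dict


def dgxc_env_vars(offline: bool = False, supports_offline: bool = True) -> Dict[str, str]:
    """Hypothetical stand-in for the env handling inside dgxc_executor."""
    if offline and not supports_offline:
        # Option 2: fail loudly instead of silently ignoring --offline.
        raise ValueError("--offline is not supported for DGXCloud")
    # Default stated in the review comment.
    env = {"TRANSFORMERS_OFFLINE": "1"}
    if offline:
        # Option 1: mirror the Slurm executor's conditional handling.
        env["HF_HUB_OFFLINE"] = "1"
    return env
```

Either branch removes the silent no-op: the flag is honored or the user gets an explicit error.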
🧹 Nitpick comments (5)
tests/unit_tests/scripts/test_performance_offline_mode.py (5)

137-152: Add @pytest.mark.unit marker.

📝 Proposed fix
+@pytest.mark.unit
 def test_slurm_executor_default_has_container_writable_and_hub_online(import_performance_module):
     """By default, --container-writable is always set and HF Hub access stays online."""

As per coding guidelines: "Use pytest markers to categorize tests (unit, integration, system)".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/unit_tests/scripts/test_performance_offline_mode.py` around lines 137 -
152, Add the pytest unit marker to the test function
test_slurm_executor_default_has_container_writable_and_hub_online by decorating
it with `@pytest.mark.unit`; ensure pytest is imported in the file if not already,
place the decorator immediately above the function definition and run tests to
confirm the marker is recognized.

73-92: Add @pytest.mark.unit marker to categorize this test.

Per coding guidelines, tests should use pytest markers to categorize them.

📝 Proposed fix
+@pytest.mark.unit
 def test_parse_cli_args_accepts_offline_flag(import_performance_module):
     """The performance CLI should keep exposing the offline switch."""

As per coding guidelines: "Use pytest markers to categorize tests (unit, integration, system)".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/unit_tests/scripts/test_performance_offline_mode.py` around lines 73 -
92, Add the pytest unit marker to the test function
test_parse_cli_args_accepts_offline_flag by decorating it with
`@pytest.mark.unit`; ensure pytest is imported at the top of the file if not
already present so the decorator resolves. Locate the function definition in
tests/unit_tests/scripts/test_performance_offline_mode.py and add the
`@pytest.mark.unit` decorator immediately above def
test_parse_cli_args_accepts_offline_flag(...).

95-115: Add @pytest.mark.unit marker.

📝 Proposed fix
+@pytest.mark.unit
 def test_argparse_rejects_hf_token_with_offline(import_performance_module):
     """argparse should reject --hf_token and --offline together at parse time."""

As per coding guidelines: "Use pytest markers to categorize tests (unit, integration, system)".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/unit_tests/scripts/test_performance_offline_mode.py` around lines 95 -
115, Add the pytest unit marker to the test function
test_argparse_rejects_hf_token_with_offline so it is categorized as a unit test;
locate the function definition for test_argparse_rejects_hf_token_with_offline
(which creates argument_parser and calls parser.parse_args) and add
`@pytest.mark.unit` immediately above the def line to mark it accordingly.

118-134: Add @pytest.mark.unit marker and note test isolation concern.

The test correctly verifies offline mode behavior. However, due to PERF_ENV_VARS being a mutable module-level dict (as noted in executors.py review), this test may be affected by or affect other tests in the same process. The test order independence could be compromised.

📝 Proposed fix for marker
+@pytest.mark.unit
 def test_slurm_executor_sets_offline_env_and_container_writable(import_performance_module):
     """Offline mode should set HF_HUB_OFFLINE and preserve the offline Transformers default."""

As per coding guidelines: "Use pytest markers to categorize tests (unit, integration, system)".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/unit_tests/scripts/test_performance_offline_mode.py` around lines 118 -
134, Add the pytest marker and guard mutable global state: annotate
test_slurm_executor_sets_offline_env_and_container_writable with
`@pytest.mark.unit` and ensure it does not leak PERF_ENV_VARS changes by
snapshotting and restoring the module-level PERF_ENV_VARS around the call to
executors.slurm_executor (or use monkeypatch to set a fresh copy) so other tests
remain isolated; reference the executors.slurm_executor invocation and the
PERF_ENV_VARS mutable dict when implementing the snapshot/restore or
monkeypatch.
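
One way to get the snapshot/restore behavior described here is the standard library's `unittest.mock.patch.dict`, which restores a dict to its prior contents on exit. The sketch below swaps that in for pytest's `monkeypatch`; `PERF_ENV_VARS` and `mutating_executor` are local stand-ins, not the real module contents.

```python
from typing import Dict
from unittest.mock import patch

# Stand-in for the module-level dict in executors.py (contents assumed).
PERF_ENV_VARS: Dict[str, str] = {"TRANSFORMERS_OFFLINE": "1", "HF_HUB_OFFLINE": "0"}


def mutating_executor(offline: bool) -> Dict[str, str]:
    """Stand-in that mutates the shared dict in place, as the review describes."""
    if offline:
        PERF_ENV_VARS["HF_HUB_OFFLINE"] = "1"
    return PERF_ENV_VARS


def run_isolated(offline: bool) -> Dict[str, str]:
    """Call the executor, then roll the shared dict back to its prior state."""
    with patch.dict(PERF_ENV_VARS):
        # Copy the result before the context manager restores the original.
        return dict(mutating_executor(offline))
```

In a pytest module, `monkeypatch.setattr(executors, "PERF_ENV_VARS", dict(executors.PERF_ENV_VARS))` would give each test a fresh copy with the same effect.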

155-171: Add @pytest.mark.unit marker.

📝 Proposed fix
+@pytest.mark.unit
 def test_slurm_executor_hf_token_enables_online_transformers(import_performance_module):
     """Providing an HF token should enable the online Transformers path."""

As per coding guidelines: "Use pytest markers to categorize tests (unit, integration, system)".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/unit_tests/scripts/test_performance_offline_mode.py` around lines 155 -
171, Add the pytest unit marker to the test function by decorating
test_slurm_executor_hf_token_enables_online_transformers with `@pytest.mark.unit`
and ensure pytest is imported in the module if not already; this means adding
"import pytest" at the top (or ensuring it's present) and placing
"@pytest.mark.unit" immediately above the def of
test_slurm_executor_hf_token_enables_online_transformers so the test is
categorized as a unit test.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@scripts/performance/utils/executors.py`:
- Around line 39-48: The module-level dict PERF_ENV_VARS is being mutated
in-place by slurm_executor causing cross-call state leakage; to fix, make a
fresh local copy inside slurm_executor (e.g., env = PERF_ENV_VARS.copy() or
dict(PERF_ENV_VARS)) and perform all modifications (adding WANDB_API_KEY,
HF_TOKEN, HF_HUB_OFFLINE, TRANSFORMERS_OFFLINE overrides, etc.) on that local
copy before using it, leaving the module-level PERF_ENV_VARS immutable for
subsequent calls.
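
A minimal sketch of the local-copy fix, assuming the defaults stated in the PR description. `slurm_env` is a hypothetical reduction of the env handling in `slurm_executor`, not its real signature.

```python
from typing import Dict, Optional

# Stand-in for the module-level defaults (contents assumed from the PR text).
PERF_ENV_VARS: Dict[str, str] = {"TRANSFORMERS_OFFLINE": "1", "HF_HUB_OFFLINE": "0"}


def slurm_env(hf_token: Optional[str] = None, offline: bool = False) -> Dict[str, str]:
    """Build per-call env vars without touching the shared defaults."""
    env = dict(PERF_ENV_VARS)  # fresh copy on every call
    if hf_token is not None:
        env["HF_TOKEN"] = hf_token
        env["TRANSFORMERS_OFFLINE"] = "0"
    if offline:
        env["HF_HUB_OFFLINE"] = "1"
    return env
```

Because every call starts from a copy, an offline run followed by a default run in the same process still sees `HF_HUB_OFFLINE=0`.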


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: ef5ac3be-73a9-41a0-b953-351960a53e46

📥 Commits

Reviewing files that changed from the base of the PR and between f689296 and f6c23d0.

📒 Files selected for processing (5)
  • scripts/performance/README.md
  • scripts/performance/argument_parser.py
  • scripts/performance/setup_experiment.py
  • scripts/performance/utils/executors.py
  • tests/unit_tests/scripts/test_performance_offline_mode.py

malay-nagda previously approved these changes Mar 18, 2026
@yaoyu-33 yaoyu-33 added the labels area:misc (Cross-cutting utilities, logging, helpers, and other changes), feature (New capabilities, enhancements, or enablement work), and ready-to-merge (PR is approved, current, and only waiting for CI to pass before merge) on Mar 19, 2026
@yaoyu-33

/claude review

@yaoyu-33 yaoyu-33 added the label needs-author (Author action is required before review or merge can continue) and removed the label ready-to-merge (PR is approved, current, and only waiting for CI to pass before merge) on Mar 19, 2026
Added a test case to check for regression.

Signed-off-by: Alex Filby <afilby@nvidia.com>
@sudostock sudostock force-pushed the afilby/offline-mode-flag branch from 4ba3718 to ec09583 on March 20, 2026 00:51
@sudostock sudostock added the label needs-review (PR is ready for code review and waiting on a reviewer) and removed the label needs-author (Author action is required before review or merge can continue) on Mar 20, 2026
@sudostock sudostock requested a review from yaoyu-33 March 23, 2026 16:29

Labels

  • area:misc: Cross-cutting utilities, logging, helpers, and other changes
  • feature: New capabilities, enhancements, or enablement work
  • needs-review: PR is ready for code review and waiting on a reviewer
