
download AA-LCR_extracted-text.zip via hf_hub_download#1126

Merged
anowaczynski-nvidia merged 4 commits into main from anowaczynski/aalcr-fix-hf
Dec 18, 2025

Conversation


@anowaczynski-nvidia anowaczynski-nvidia commented Dec 18, 2025

Why?
Directly downloading AA-LCR_extracted-text.zip with wget.download often fails on compute clusters: Hugging Face rate-limits IP addresses that repeatedly download large amounts of data without authentication.

What?
Switch from wget.download to hf_hub_download, which uses HF_TOKEN authentication to avoid those rate limits.
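The switch described above can be sketched as follows. This is a minimal illustration, not the PR's actual code: the helper name `download_aalcr_text_zip` and its error handling are assumptions, while the `repo_id`, `filename`, and `repo_type` values come from the review diff below. `hf_hub_download` picks up the `HF_TOKEN` environment variable (or a cached `huggingface-cli login`) automatically, which is what sidesteps the per-IP rate limits that unauthenticated wget downloads hit.

```python
from pathlib import Path


def download_aalcr_text_zip(
    repo_id: str = "ArtificialAnalysis/AA-LCR",
    filename: str = "extracted_text/AA-LCR_extracted-text.zip",
) -> Path:
    """Fetch the AA-LCR extracted-text archive via the HF Hub cache.

    Authentication is implicit: hf_hub_download reads HF_TOKEN from the
    environment, so no token ever appears in the code.
    """
    # Imported lazily so this sketch loads even where huggingface_hub
    # is not installed; prepare.py would import it at module level.
    from huggingface_hub import hf_hub_download

    zip_path = Path(
        hf_hub_download(repo_id=repo_id, filename=filename, repo_type="dataset")
    )
    if not zip_path.is_file():
        raise FileNotFoundError(f"expected downloaded archive at {zip_path}")
    return zip_path
```

Note that `hf_hub_download` returns a path inside the local Hub cache rather than downloading into the working directory, which is why the rest of the script extracts from that path instead of a hardcoded local file.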

Summary by CodeRabbit

  • Chores
    • Optimized dataset preparation process by updating the retrieval mechanism to use Hugging Face Hub, improving reliability and standardization of data handling.


Signed-off-by: Arkadiusz Nowaczynski <anowaczynski@nvidia.com>

coderabbitai bot commented Dec 18, 2025

📝 Walkthrough

The change replaces wget-based ZIP download with HuggingFace Hub API download, extracting to a temporary directory instead of a hardcoded local folder, and updates all subsequent dataset processing to reference the temporary extraction location.

Changes

Dataset retrieval refactoring: nemo_skills/dataset/aalcr/prepare.py
Replaces wget + local extraction with an hf_hub_download API call for the AA-LCR dataset ZIP. Extracts to a temporary directory with validation. Updates the LCR text folder path and all downstream references (dataset loading, output file path, write operations) to use the temporary location.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

  • Verify HF Hub API integration and download validation logic
  • Confirm temporary directory handling and cleanup semantics
  • Check that all path references correctly point to temp location (potential for missed updates)

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
Docstring Coverage ⚠️ Warning: Docstring coverage is 0.00%, below the required threshold of 80.00%. Run @coderabbitai generate docstrings to improve coverage.
✅ Passed checks (2 passed)
Description Check ✅ Passed: Check skipped because CodeRabbit's high-level summary is enabled.
Title check ✅ Passed: The title accurately captures the main change: replacing the wget download with hf_hub_download for AA-LCR_extracted-text.zip, which is the core objective.



@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (3)
nemo_skills/dataset/aalcr/prepare.py (3)

209-216: Consider adding error handling for download failures.

The hf_hub_download call can fail due to network issues, authentication problems, or file availability. While the assertion on line 216 will catch some issues, wrapping the download in a try-except block with a more informative error message would improve the user experience.

🔎 View suggested enhancement:
-    extracted_text_zip_path = hf_hub_download(
-        repo_id="ArtificialAnalysis/AA-LCR",
-        filename="extracted_text/AA-LCR_extracted-text.zip",
-        repo_type="dataset",
-    )
-
-    extracted_text_zip_path = Path(extracted_text_zip_path)
-    assert extracted_text_zip_path.exists() and extracted_text_zip_path.is_file()
+    try:
+        extracted_text_zip_path = hf_hub_download(
+            repo_id="ArtificialAnalysis/AA-LCR",
+            filename="extracted_text/AA-LCR_extracted-text.zip",
+            repo_type="dataset",
+        )
+    except Exception as e:
+        LOG.error(f"Failed to download dataset from HuggingFace Hub. Ensure HF_TOKEN is set if required: {e}")
+        raise
+
+    extracted_text_zip_path = Path(extracted_text_zip_path)
+    if not (extracted_text_zip_path.exists() and extracted_text_zip_path.is_file()):
+        raise FileNotFoundError(f"Downloaded file not found at {extracted_text_zip_path}")

220-220: Consider adding error handling for ZIP extraction.

The ZIP extraction can fail due to corrupted files, insufficient disk space, or permission issues. Adding a try-except block would provide clearer error messages to users.

🔎 View suggested enhancement:
-        zipfile.ZipFile(extracted_text_zip_path).extractall(tmpdir)
+        try:
+            zipfile.ZipFile(extracted_text_zip_path).extractall(tmpdir)
+        except zipfile.BadZipFile as e:
+            LOG.error(f"Downloaded file is not a valid ZIP archive: {e}")
+            raise
+        except Exception as e:
+            LOG.error(f"Failed to extract ZIP file: {e}")
+            raise

209-213: Document HF_TOKEN authentication requirement.

According to the PR description, hf_hub_download uses HF_TOKEN authentication to avoid rate limits. Consider adding a comment or updating the module docstring to inform users that they may need to set the HF_TOKEN environment variable, especially when running on compute clusters.

🔎 View suggested enhancement:

Add a note to the usage docstring (around line 31-44):

 """
 Usage
+# Note: Set HF_TOKEN environment variable if running on compute clusters
+# to avoid HuggingFace rate limits: export HF_TOKEN=your_token_here
+
 # default. setup is aalcr (all).
 python prepare.py
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 1c0722a and 8fc8b59.

📒 Files selected for processing (1)
  • nemo_skills/dataset/aalcr/prepare.py (2 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: gpu-tests-qwen
  • GitHub Check: pre-commit
  • GitHub Check: unit-tests
🔇 Additional comments (2)
nemo_skills/dataset/aalcr/prepare.py (2)

218-230: LGTM! Proper temporary directory management.

The use of tempfile.TemporaryDirectory() as a context manager ensures automatic cleanup of extracted files. The output file is correctly placed outside the temporary directory for persistence, while the extracted text files remain in the temporary location during processing.
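The pattern the reviewer approves above can be sketched in isolation. This is an illustrative stdlib-only example, not the code from prepare.py: the function name `extract_and_process` and the `*.txt` concatenation step are assumptions standing in for the real dataset processing. The key point it demonstrates is that the persistent output file lives outside the temporary directory, so it survives the automatic cleanup when the context manager exits.

```python
import tempfile
import zipfile
from pathlib import Path


def extract_and_process(zip_path: Path, output_file: Path) -> Path:
    """Extract a ZIP into a throwaway directory and write results elsewhere."""
    with tempfile.TemporaryDirectory() as tmpdir:
        # Extracted files land in tmpdir and are deleted on context exit.
        zipfile.ZipFile(zip_path).extractall(tmpdir)
        texts = sorted(Path(tmpdir).rglob("*.txt"))
        # The output path is outside tmpdir, so it persists after cleanup.
        output_file.write_text("\n".join(p.read_text() for p in texts))
    # At this point tmpdir and the extracted files no longer exist.
    return output_file
```

Because `TemporaryDirectory` guarantees cleanup even if processing raises, no manual deletion of the extracted text folder is needed, which is simpler than the hardcoded-folder approach the PR replaces.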


24-24: The huggingface_hub dependency is already properly declared in requirements/main.txt and is available for use in the project. No action needed.

Likely an incorrect or invalid review comment.

Signed-off-by: Arkadiusz Nowaczynski <anowaczynski@nvidia.com>
@anowaczynski-nvidia anowaczynski-nvidia merged commit c01c0a7 into main Dec 18, 2025
6 checks passed
@anowaczynski-nvidia anowaczynski-nvidia deleted the anowaczynski/aalcr-fix-hf branch December 18, 2025 20:19
wasiahmad pushed a commit that referenced this pull request Dec 19, 2025
Signed-off-by: Arkadiusz Nowaczynski <anowaczynski@nvidia.com>
Co-authored-by: George Armstrong <georgea@nvidia.com>
wasiahmad pushed a commit that referenced this pull request Dec 19, 2025
Signed-off-by: Arkadiusz Nowaczynski <anowaczynski@nvidia.com>
Co-authored-by: George Armstrong <georgea@nvidia.com>

Signed-off-by: wasiahmad <wasiahmad@ucla.edu>
blahblahasdf pushed a commit to blahblahasdf/Skills that referenced this pull request Jan 8, 2026

Signed-off-by: Arkadiusz Nowaczynski <anowaczynski@nvidia.com>
Co-authored-by: George Armstrong <georgea@nvidia.com>
Signed-off-by: dlord <dlord@nvidia.com>
hsiehjackson pushed a commit that referenced this pull request Jan 13, 2026
Signed-off-by: Arkadiusz Nowaczynski <anowaczynski@nvidia.com>
Co-authored-by: George Armstrong <georgea@nvidia.com>
Signed-off-by: Cheng-Ping Hsieh <chsieh@nvidia.com>
wasiahmad pushed a commit that referenced this pull request Feb 4, 2026
Signed-off-by: Arkadiusz Nowaczynski <anowaczynski@nvidia.com>
Co-authored-by: George Armstrong <georgea@nvidia.com>
dgtm777 pushed a commit that referenced this pull request Mar 18, 2026
Signed-off-by: Arkadiusz Nowaczynski <anowaczynski@nvidia.com>
Co-authored-by: George Armstrong <georgea@nvidia.com>
dgtm777 pushed a commit that referenced this pull request Mar 18, 2026
Signed-off-by: Arkadiusz Nowaczynski <anowaczynski@nvidia.com>
Co-authored-by: George Armstrong <georgea@nvidia.com>
Signed-off-by: dgitman <dgitman@nvidia.com>
