
download AA-LCR_extracted-text.zip via hf_hub_download#1126

Merged
anowaczynski-nvidia merged 4 commits into main from anowaczynski/aalcr-fix-hf
Dec 18, 2025

Conversation


@anowaczynski-nvidia anowaczynski-nvidia commented Dec 18, 2025

Why?
Directly downloading AA-LCR_extracted-text.zip with wget.download often fails on compute clusters: Hugging Face rate-limits IP addresses that repeatedly download large amounts of data without authentication.

What?
Switch from wget.download to hf_hub_download, which uses HF_TOKEN authentication to avoid those rate limits.
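The switch described above can be sketched as follows. This is a minimal illustration, not the PR's actual code: the helper name `download_aalcr_text_zip` and its error handling are assumptions, while the `repo_id`, `filename`, and `repo_type` values come from the review diff below. `hf_hub_download` picks up the `HF_TOKEN` environment variable (or a cached `huggingface-cli login`) automatically, which is what sidesteps the per-IP rate limits that unauthenticated wget downloads hit.

```python
from pathlib import Path


def download_aalcr_text_zip(
    repo_id: str = "ArtificialAnalysis/AA-LCR",
    filename: str = "extracted_text/AA-LCR_extracted-text.zip",
) -> Path:
    """Fetch the AA-LCR extracted-text archive via the HF Hub cache.

    Authentication is implicit: hf_hub_download reads HF_TOKEN from the
    environment, so no token ever appears in the code.
    """
    # Imported lazily so this sketch loads even where huggingface_hub
    # is not installed; prepare.py would import it at module level.
    from huggingface_hub import hf_hub_download

    zip_path = Path(
        hf_hub_download(repo_id=repo_id, filename=filename, repo_type="dataset")
    )
    if not zip_path.is_file():
        raise FileNotFoundError(f"expected downloaded archive at {zip_path}")
    return zip_path
```

Note that `hf_hub_download` returns a path inside the local Hub cache rather than downloading into the working directory, which is why the rest of the script extracts from that path instead of a hardcoded local file.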

Summary by CodeRabbit

  • Chores
    • Optimized dataset preparation process by updating the retrieval mechanism to use Hugging Face Hub, improving reliability and standardization of data handling.


Signed-off-by: Arkadiusz Nowaczynski <anowaczynski@nvidia.com>

coderabbitai bot commented Dec 18, 2025

📝 Walkthrough

The change replaces wget-based ZIP download with HuggingFace Hub API download, extracting to a temporary directory instead of a hardcoded local folder, and updates all subsequent dataset processing to reference the temporary extraction location.

Changes

Dataset retrieval refactoring: nemo_skills/dataset/aalcr/prepare.py
Replaces wget + local extraction with an hf_hub_download API call for the AA-LCR dataset ZIP. Extracts to a temporary directory with validation. Updates the LCR text folder path and all downstream references (dataset loading, output file path, write operations) to use the temporary location.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

  • Verify HF Hub API integration and download validation logic
  • Confirm temporary directory handling and cleanup semantics
  • Check that all path references correctly point to temp location (potential for missed updates)

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
Docstring Coverage ⚠️ Warning: Docstring coverage is 0.00%, below the required threshold of 80.00%. Run @coderabbitai generate docstrings to improve coverage.
✅ Passed checks (2 passed)
Description Check ✅ Passed: Check skipped because CodeRabbit's high-level summary is enabled.
Title check ✅ Passed: The title accurately captures the main change: replacing the wget download with hf_hub_download for AA-LCR_extracted-text.zip, which is the core objective.



@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (3)
nemo_skills/dataset/aalcr/prepare.py (3)

209-216: Consider adding error handling for download failures.

The hf_hub_download call can fail due to network issues, authentication problems, or file availability. While the assertion on line 216 will catch some issues, wrapping the download in a try-except block with a more informative error message would improve the user experience.

🔎 View suggested enhancement:
-    extracted_text_zip_path = hf_hub_download(
-        repo_id="ArtificialAnalysis/AA-LCR",
-        filename="extracted_text/AA-LCR_extracted-text.zip",
-        repo_type="dataset",
-    )
-
-    extracted_text_zip_path = Path(extracted_text_zip_path)
-    assert extracted_text_zip_path.exists() and extracted_text_zip_path.is_file()
+    try:
+        extracted_text_zip_path = hf_hub_download(
+            repo_id="ArtificialAnalysis/AA-LCR",
+            filename="extracted_text/AA-LCR_extracted-text.zip",
+            repo_type="dataset",
+        )
+    except Exception as e:
+        LOG.error(f"Failed to download dataset from HuggingFace Hub. Ensure HF_TOKEN is set if required: {e}")
+        raise
+
+    extracted_text_zip_path = Path(extracted_text_zip_path)
+    if not (extracted_text_zip_path.exists() and extracted_text_zip_path.is_file()):
+        raise FileNotFoundError(f"Downloaded file not found at {extracted_text_zip_path}")

220-220: Consider adding error handling for ZIP extraction.

The ZIP extraction can fail due to corrupted files, insufficient disk space, or permission issues. Adding a try-except block would provide clearer error messages to users.

🔎 View suggested enhancement:
-        zipfile.ZipFile(extracted_text_zip_path).extractall(tmpdir)
+        try:
+            zipfile.ZipFile(extracted_text_zip_path).extractall(tmpdir)
+        except zipfile.BadZipFile as e:
+            LOG.error(f"Downloaded file is not a valid ZIP archive: {e}")
+            raise
+        except Exception as e:
+            LOG.error(f"Failed to extract ZIP file: {e}")
+            raise

209-213: Document HF_TOKEN authentication requirement.

According to the PR description, hf_hub_download uses HF_TOKEN authentication to avoid rate limits. Consider adding a comment or updating the module docstring to inform users that they may need to set the HF_TOKEN environment variable, especially when running on compute clusters.

🔎 View suggested enhancement:

Add a note to the usage docstring (around line 31-44):

 """
 Usage
+# Note: Set HF_TOKEN environment variable if running on compute clusters
+# to avoid HuggingFace rate limits: export HF_TOKEN=your_token_here
+
 # default. setup is aalcr (all).
 python prepare.py
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 1c0722a and 8fc8b59.

📒 Files selected for processing (1)
  • nemo_skills/dataset/aalcr/prepare.py (2 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: gpu-tests-qwen
  • GitHub Check: pre-commit
  • GitHub Check: unit-tests
🔇 Additional comments (2)
nemo_skills/dataset/aalcr/prepare.py (2)

218-230: LGTM! Proper temporary directory management.

The use of tempfile.TemporaryDirectory() as a context manager ensures automatic cleanup of extracted files. The output file is correctly placed outside the temporary directory for persistence, while the extracted text files remain in the temporary location during processing.
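The pattern the reviewer approves above can be sketched in isolation. This is an illustrative stdlib-only example, not the code from prepare.py: the function name `extract_and_process` and the `*.txt` concatenation step are assumptions standing in for the real dataset processing. The key point it demonstrates is that the persistent output file lives outside the temporary directory, so it survives the automatic cleanup when the context manager exits.

```python
import tempfile
import zipfile
from pathlib import Path


def extract_and_process(zip_path: Path, output_file: Path) -> Path:
    """Extract a ZIP into a throwaway directory and write results elsewhere."""
    with tempfile.TemporaryDirectory() as tmpdir:
        # Extracted files land in tmpdir and are deleted on context exit.
        zipfile.ZipFile(zip_path).extractall(tmpdir)
        texts = sorted(Path(tmpdir).rglob("*.txt"))
        # The output path is outside tmpdir, so it persists after cleanup.
        output_file.write_text("\n".join(p.read_text() for p in texts))
    # At this point tmpdir and the extracted files no longer exist.
    return output_file
```

Because `TemporaryDirectory` guarantees cleanup even if processing raises, no manual deletion of the extracted text folder is needed, which is simpler than the hardcoded-folder approach the PR replaces.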


24-24: The huggingface_hub dependency is already properly declared in requirements/main.txt and is available for use in the project. No action needed.

Likely an incorrect or invalid review comment.

Signed-off-by: Arkadiusz Nowaczynski <anowaczynski@nvidia.com>
@anowaczynski-nvidia anowaczynski-nvidia merged commit c01c0a7 into main Dec 18, 2025
6 checks passed
@anowaczynski-nvidia anowaczynski-nvidia deleted the anowaczynski/aalcr-fix-hf branch December 18, 2025 20:19
wasiahmad pushed a commit that referenced this pull request Dec 19, 2025
Signed-off-by: Arkadiusz Nowaczynski <anowaczynski@nvidia.com>
Co-authored-by: George Armstrong <georgea@nvidia.com>
wasiahmad pushed a commit that referenced this pull request Dec 19, 2025
Signed-off-by: Arkadiusz Nowaczynski <anowaczynski@nvidia.com>
Co-authored-by: George Armstrong <georgea@nvidia.com>

Signed-off-by: wasiahmad <wasiahmad@ucla.edu>
blahblahasdf pushed a commit to blahblahasdf/Skills that referenced this pull request Jan 8, 2026

Signed-off-by: Arkadiusz Nowaczynski <anowaczynski@nvidia.com>
Co-authored-by: George Armstrong <georgea@nvidia.com>
Signed-off-by: dlord <dlord@nvidia.com>
hsiehjackson pushed a commit that referenced this pull request Jan 13, 2026
Signed-off-by: Arkadiusz Nowaczynski <anowaczynski@nvidia.com>
Co-authored-by: George Armstrong <georgea@nvidia.com>
Signed-off-by: Cheng-Ping Hsieh <chsieh@nvidia.com>
wasiahmad pushed a commit that referenced this pull request Feb 4, 2026
Signed-off-by: Arkadiusz Nowaczynski <anowaczynski@nvidia.com>
Co-authored-by: George Armstrong <georgea@nvidia.com>
dgtm777 pushed a commit that referenced this pull request Mar 18, 2026
Signed-off-by: Arkadiusz Nowaczynski <anowaczynski@nvidia.com>
Co-authored-by: George Armstrong <georgea@nvidia.com>
dgtm777 pushed a commit that referenced this pull request Mar 18, 2026
Signed-off-by: Arkadiusz Nowaczynski <anowaczynski@nvidia.com>
Co-authored-by: George Armstrong <georgea@nvidia.com>
Signed-off-by: dgitman <dgitman@nvidia.com>
