download AA-LCR_extracted-text.zip via hf_hub_download #1126

anowaczynski-nvidia merged 4 commits into main
Conversation
Signed-off-by: Arkadiusz Nowaczynski <anowaczynski@nvidia.com>
📝 Walkthrough

The change replaces the wget-based ZIP download with a HuggingFace Hub API download, extracting to a temporary directory instead of a hardcoded local folder, and updates all subsequent dataset processing to reference the temporary extraction location.
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~10 minutes
Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Actionable comments posted: 0
🧹 Nitpick comments (3)
nemo_skills/dataset/aalcr/prepare.py (3)
209-216: Consider adding error handling for download failures.

The `hf_hub_download` call can fail due to network issues, authentication problems, or file availability. While the assertion on line 216 will catch some issues, wrapping the download in a try-except block with a more informative error message would improve the user experience.

🔎 Suggested enhancement:

```diff
-    extracted_text_zip_path = hf_hub_download(
-        repo_id="ArtificialAnalysis/AA-LCR",
-        filename="extracted_text/AA-LCR_extracted-text.zip",
-        repo_type="dataset",
-    )
-
-    extracted_text_zip_path = Path(extracted_text_zip_path)
-    assert extracted_text_zip_path.exists() and extracted_text_zip_path.is_file()
+    try:
+        extracted_text_zip_path = hf_hub_download(
+            repo_id="ArtificialAnalysis/AA-LCR",
+            filename="extracted_text/AA-LCR_extracted-text.zip",
+            repo_type="dataset",
+        )
+    except Exception as e:
+        LOG.error(f"Failed to download dataset from HuggingFace Hub. Ensure HF_TOKEN is set if required: {e}")
+        raise
+
+    extracted_text_zip_path = Path(extracted_text_zip_path)
+    if not (extracted_text_zip_path.exists() and extracted_text_zip_path.is_file()):
+        raise FileNotFoundError(f"Downloaded file not found at {extracted_text_zip_path}")
```
220-220: Consider adding error handling for ZIP extraction.

The ZIP extraction can fail due to corrupted files, insufficient disk space, or permission issues. Adding a try-except block would provide clearer error messages to users.

🔎 Suggested enhancement:

```diff
-        zipfile.ZipFile(extracted_text_zip_path).extractall(tmpdir)
+        try:
+            zipfile.ZipFile(extracted_text_zip_path).extractall(tmpdir)
+        except zipfile.BadZipFile as e:
+            LOG.error(f"Downloaded file is not a valid ZIP archive: {e}")
+            raise
+        except Exception as e:
+            LOG.error(f"Failed to extract ZIP file: {e}")
+            raise
```
209-213: Document the HF_TOKEN authentication requirement.

According to the PR description, `hf_hub_download` uses `HF_TOKEN` authentication to avoid rate limits. Consider adding a comment or updating the module docstring to inform users that they may need to set the `HF_TOKEN` environment variable, especially when running on compute clusters.

🔎 Suggested enhancement:

Add a note to the usage docstring (around lines 31-44):

```diff
 """
 Usage

+# Note: Set HF_TOKEN environment variable if running on compute clusters
+# to avoid HuggingFace rate limits: export HF_TOKEN=your_token_here
+
 # default. setup is aalcr (all).
 python prepare.py
```
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
nemo_skills/dataset/aalcr/prepare.py (2 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: gpu-tests-qwen
- GitHub Check: pre-commit
- GitHub Check: unit-tests
🔇 Additional comments (2)
nemo_skills/dataset/aalcr/prepare.py (2)
218-230: LGTM! Proper temporary directory management.

The use of `tempfile.TemporaryDirectory()` as a context manager ensures automatic cleanup of extracted files. The output file is correctly placed outside the temporary directory for persistence, while the extracted text files remain in the temporary location during processing.
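The pattern the comment praises can be sketched as follows. This is a minimal, self-contained illustration, not the actual prepare.py code: the function name `extract_and_process`, the `.txt` filter, and the filename-collecting step are hypothetical.

```python
import tempfile
import zipfile
from pathlib import Path


def extract_and_process(zip_path: str, output_file: Path) -> list[str]:
    """Extract a ZIP into a temporary directory, process its contents,
    and persist results outside the temp dir before cleanup."""
    names = []
    with tempfile.TemporaryDirectory() as tmpdir:
        # Extracted files exist only for the duration of this block.
        zipfile.ZipFile(zip_path).extractall(tmpdir)
        for path in sorted(Path(tmpdir).rglob("*.txt")):
            names.append(path.name)
        # The output file lives outside tmpdir, so it survives the
        # automatic cleanup that runs when the context manager exits.
        output_file.write_text("\n".join(names))
    return names
```

Because cleanup happens on context exit even when an exception is raised mid-processing, nothing extracted is ever left behind in a hardcoded local folder.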
24-24: The `huggingface_hub` dependency is already properly declared in `requirements/main.txt` and is available for use in the project. No action needed.

Likely an incorrect or invalid review comment.
Signed-off-by: Arkadiusz Nowaczynski <anowaczynski@nvidia.com>

Signed-off-by: Arkadiusz Nowaczynski <anowaczynski@nvidia.com>
Co-authored-by: George Armstrong <georgea@nvidia.com>

Signed-off-by: Arkadiusz Nowaczynski <anowaczynski@nvidia.com>
Co-authored-by: George Armstrong <georgea@nvidia.com>
Signed-off-by: wasiahmad <wasiahmad@ucla.edu>

Signed-off-by: Arkadiusz Nowaczynski <anowaczynski@nvidia.com>
Co-authored-by: George Armstrong <georgea@nvidia.com>
Signed-off-by: Cheng-Ping Hsieh <chsieh@nvidia.com>
Signed-off-by: Arkadiusz Nowaczynski <anowaczynski@nvidia.com>
Co-authored-by: George Armstrong <georgea@nvidia.com>
Signed-off-by: dgitman <dgitman@nvidia.com>
Why?

Directly downloading AA-LCR_extracted-text.zip using wget.download often fails on compute clusters due to HuggingFace rate limits imposed on IP addresses that frequently download large amounts of data without proper authentication.

What?

Switch from wget.download to hf_hub_download, which leverages HF_TOKEN authentication to avoid rate limits.