build: move data preparation to beginning of gpu tests build #1077

Merged
gwarmstrong merged 6 commits into main from georgea/move-data-prep-to-beginning-of-integration
Dec 5, 2025

@gwarmstrong
Collaborator

@gwarmstrong gwarmstrong commented Dec 5, 2025

Summary by CodeRabbit

  • Tests

    • Refactored dataset discovery mechanism for GPU testing pipeline to streamline dataset preparation.
  • Chores

    • Added fail-fast dataset preparation step to CI workflow to identify issues earlier in the testing pipeline.

Signed-off-by: George Armstrong <georgea@nvidia.com>
Signed-off-by: George Armstrong <georgea@nvidia.com>
@coderabbitai
Contributor

coderabbitai bot commented Dec 5, 2025

📝 Walkthrough

This PR adds a dataset preparation validation step to the GPU tests CI workflow. It introduces a utility function to centrally discover datasets with prepare.py scripts (excluding specified datasets), and refactors the test to use this utility instead of inline discovery logic.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| CI Workflow: .github/workflows/gpu_tests.yml | Adds a new "Prepare all datasets (fail-fast check)" CI step that sets HF_TOKEN, discovers preparable datasets via dataset_utils.py, and runs dataset preparation before the main GPU tests. |
| Dataset Discovery Utility: tests/gpu-tests/dataset_utils.py | Introduces EXCLUDED_DATASETS set and get_preparable_datasets() function to locate datasets under nemo_skills/dataset with prepare.py, excluding specified datasets, and print a space-separated list when executed directly. |
| Test Refactoring: tests/gpu-tests/test_eval.py | Replaces inline dataset discovery logic with a single call to get_preparable_datasets(), simplifying test_prepare_and_eval_all_datasets(). |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

  • Verify that EXCLUDED_DATASETS covers all necessary datasets and matches CI requirements
  • Confirm the dataset discovery logic correctly identifies all prepare.py files and handles edge cases
  • Validate CI workflow integration, HF_TOKEN handling, and dataset list parsing for the prepare_data call
  • Ensure test refactoring maintains the same behavior as the original inline discovery

Possibly related PRs

  • PR #1062: Also modifies tests/gpu-tests/test_eval.py with changes to judge-dataset detection and exclusion logic, potentially conflicting or overlapping with this PR's dataset handling refactor.

Suggested labels

run GPU tests

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%. | You can run @coderabbitai generate docstrings to improve docstring coverage. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately describes the main change: moving data preparation to the beginning of GPU tests workflow in the CI build process. |

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (2)
tests/gpu-tests/dataset_utils.py (1)

41-48: Consider adding path validation for robustness.

The function assumes the datasets directory exists. If the path construction fails or the directory doesn't exist, the function will raise an error that may be unclear to users. Consider adding a check to provide a more informative error message.

Apply this diff to add validation:

 def get_preparable_datasets():
     """Get list of datasets that can be prepared for testing (used by CI workflow)."""
     datasets_dir = Path(__file__).absolute().parents[2] / "nemo_skills" / "dataset"
+    if not datasets_dir.exists():
+        raise FileNotFoundError(f"Datasets directory not found at {datasets_dir}")
     return sorted(
         dataset.name
         for dataset in datasets_dir.iterdir()
         if dataset.is_dir() and (dataset / "prepare.py").exists() and dataset.name not in EXCLUDED_DATASETS
     )
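
The suggested check can be exercised in isolation. The following is a minimal sketch, not the merged code: it assumes a hypothetical variant of `get_preparable_datasets` that takes the datasets directory and the exclusion set as parameters (the function in the diff takes no arguments and derives the path from `__file__`), so that it can run against a temporary directory. The dataset names `alpha`, `beta`, and `gamma` are placeholders.

```python
import tempfile
from pathlib import Path


def get_preparable_datasets(datasets_dir, excluded=frozenset()):
    """Return sorted names of dataset directories that contain prepare.py."""
    datasets_dir = Path(datasets_dir)
    # Fail early with an explicit message if the directory is missing.
    if not datasets_dir.exists():
        raise FileNotFoundError(f"Datasets directory not found at {datasets_dir}")
    return sorted(
        d.name
        for d in datasets_dir.iterdir()
        if d.is_dir() and (d / "prepare.py").exists() and d.name not in excluded
    )


def demo():
    """Build a throwaway dataset tree and run discovery against it."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        for name in ("alpha", "beta", "gamma"):
            (root / name).mkdir()
        (root / "alpha" / "prepare.py").touch()
        (root / "beta" / "prepare.py").touch()
        # gamma has no prepare.py and beta is excluded, so only alpha remains.
        found = get_preparable_datasets(root, excluded={"beta"})
        try:
            get_preparable_datasets(root / "missing")
            missing_raises = False
        except FileNotFoundError:
            missing_raises = True
    return found, missing_raises
```

Parameterizing the directory is only a convenience for this illustration; it is not part of the reviewer's proposed diff.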
.github/workflows/gpu_tests.yml (1)

47-55: Consider adding error checking for the dataset list computation.

If the Python script fails, the DATASETS variable may be empty or contain unexpected output, potentially causing ns prepare_data to silently succeed without preparing any datasets. Adding validation would make the failure more explicit.

Apply this diff to add error checking:

     - name: Prepare all datasets (fail-fast check)
       env:
         HF_TOKEN: ${{ secrets.HF_TOKEN }}
       run: |
         cd ${{ github.run_id }}
         # Get dataset list from the same source as test_prepare_and_eval_all_datasets
         DATASETS=$(python tests/gpu-tests/dataset_utils.py)
+        if [ -z "$DATASETS" ]; then
+          echo "Error: No datasets found or script failed"
+          exit 1
+        fi
         echo "Preparing datasets: $DATASETS"
         ns prepare_data $DATASETS
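
An alternative to the shell-side check would be for the script itself to exit non-zero when it finds nothing. The sketch below assumes a hypothetical `main()` wrapper, which does not appear in this PR; `find_datasets` stands in for `get_preparable_datasets`, and the dataset names are placeholders.

```python
import io
import sys


def main(find_datasets, out=None, err=None):
    """Print datasets space-separated; return a non-zero code if none are found."""
    out = sys.stdout if out is None else out
    err = sys.stderr if err is None else err
    datasets = find_datasets()
    if not datasets:
        # Report on stderr so the captured stdout stays empty.
        print("Error: no preparable datasets found", file=err)
        return 1
    print(" ".join(datasets), file=out)
    return 0


# Usage with placeholder dataset names and an in-memory stream.
buf = io.StringIO()
status = main(lambda: ["dataset_a", "dataset_b"], out=buf)
```

In a script, the entry point could then be `sys.exit(main(get_preparable_datasets))`, leaving the calling step to act on the exit status.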
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5cc1dcc and 17b0325.

📒 Files selected for processing (3)
  • .github/workflows/gpu_tests.yml (1 hunks)
  • tests/gpu-tests/dataset_utils.py (1 hunks)
  • tests/gpu-tests/test_eval.py (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
tests/gpu-tests/test_eval.py (1)
tests/gpu-tests/dataset_utils.py (1)
  • get_preparable_datasets (41-48)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: unit-tests
  • GitHub Check: pre-commit
🔇 Additional comments (4)
tests/gpu-tests/dataset_utils.py (2)

19-38: LGTM: Clear dataset exclusion list.

The EXCLUDED_DATASETS set is well-documented with clear rationale for exclusions. The inline comment for aalcr is helpful for future maintainers.


51-53: LGTM: Clean CI integration pattern.

The main guard provides a simple interface for CI workflows to consume the dataset list. The space-separated output format is appropriate for shell script consumption.

tests/gpu-tests/test_eval.py (2)

21-21: LGTM: Clean centralization of dataset discovery.

The import of get_preparable_datasets properly centralizes the dataset discovery logic, making it reusable across both the CI workflow and test code.


224-224: LGTM: Proper use of centralized dataset utility.

The refactoring successfully replaces inline discovery logic with the centralized utility function. The assertion on Line 226 provides good validation that datasets were found.
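
The pattern described here, calling the shared helper and asserting that the result is non-empty, might look like the sketch below. The helper is stubbed so the block runs on its own; the real test imports `get_preparable_datasets` from `dataset_utils`, and its body is not shown in this conversation. `collect_datasets_for_test` is a hypothetical name.

```python
def get_preparable_datasets():
    # Stub standing in for the helper in tests/gpu-tests/dataset_utils.py.
    return ["dataset_a", "dataset_b"]


def collect_datasets_for_test(discover=get_preparable_datasets):
    """Fetch the dataset list and refuse to continue if it is empty."""
    datasets = discover()
    # Guard against a run that silently prepares and evaluates nothing.
    assert datasets, "No preparable datasets found"
    return datasets
```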

Signed-off-by: George Armstrong <georgea@nvidia.com>
Signed-off-by: George Armstrong <georgea@nvidia.com>
Signed-off-by: George Armstrong <georgea@nvidia.com>
Signed-off-by: George Armstrong <georgea@nvidia.com>
@gwarmstrong gwarmstrong merged commit 86e03bf into main Dec 5, 2025
5 checks passed
@gwarmstrong gwarmstrong deleted the georgea/move-data-prep-to-beginning-of-integration branch December 5, 2025 02:53
Jorjeous pushed a commit that referenced this pull request Dec 11, 2025
Signed-off-by: George Armstrong <georgea@nvidia.com>
Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
wasiahmad pushed a commit that referenced this pull request Dec 12, 2025
Signed-off-by: George Armstrong <georgea@nvidia.com>
Signed-off-by: wasiahmad <wasiahmad@ucla.edu>
wasiahmad pushed a commit that referenced this pull request Feb 4, 2026
Signed-off-by: George Armstrong <georgea@nvidia.com>
dgtm777 pushed a commit that referenced this pull request Mar 18, 2026
Signed-off-by: George Armstrong <georgea@nvidia.com>
dgtm777 pushed a commit that referenced this pull request Mar 18, 2026
Signed-off-by: George Armstrong <georgea@nvidia.com>
Signed-off-by: dgitman <dgitman@nvidia.com>