build: move data preparation to beginning of gpu tests build by gwarmstrong · Pull Request #1077 · NVIDIA-NeMo/Skills

gwarmstrong · 2025-12-05T01:13:46Z

Summary by CodeRabbit

Tests
- Refactored dataset discovery mechanism for GPU testing pipeline to streamline dataset preparation.
Chores
- Added fail-fast dataset preparation step to CI workflow to identify issues earlier in the testing pipeline.

Signed-off-by: George Armstrong <georgea@nvidia.com>

coderabbitai · 2025-12-05T01:17:36Z

Walkthrough

This PR adds a dataset preparation validation step to the GPU tests CI workflow. It introduces a utility function to centrally discover datasets with prepare.py scripts (excluding specified datasets), and refactors the test to use this utility instead of inline discovery logic.

Changes

| Cohort / File(s) | Summary |
|---|---|
| CI Workflow `.github/workflows/gpu_tests.yml` | Adds a new "Prepare all datasets (fail-fast check)" CI step that sets HF_TOKEN, discovers preparable datasets via dataset_utils.py, and runs dataset preparation before the main GPU tests. |
| Dataset Discovery Utility `tests/gpu-tests/dataset_utils.py` | Introduces EXCLUDED_DATASETS set and get_preparable_datasets() function to locate datasets under nemo_skills/dataset with prepare.py, excluding specified datasets, and print a space-separated list when executed directly. |
| Test Refactoring `tests/gpu-tests/test_eval.py` | Replaces inline dataset discovery logic with a single call to get_preparable_datasets(), simplifying test_prepare_and_eval_all_datasets(). |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

- Verify that EXCLUDED_DATASETS covers all necessary datasets and matches CI requirements
- Confirm the dataset discovery logic correctly identifies all prepare.py files and handles edge cases
- Validate CI workflow integration, HF_TOKEN handling, and dataset list parsing for the prepare_data call
- Ensure test refactoring maintains the same behavior as the original inline discovery

Possibly related PRs

PR #1062: Also modifies tests/gpu-tests/test_eval.py with changes to judge-dataset detection and exclusion logic, potentially conflicting or overlapping with this PR's dataset handling refactor.

Suggested labels

run GPU tests

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%. | You can run `@coderabbitai generate docstrings` to improve docstring coverage. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately describes the main change: moving data preparation to the beginning of GPU tests workflow in the CI build process. |

coderabbitai

Actionable comments posted: 0

🧹 Nitpick comments (2)

tests/gpu-tests/dataset_utils.py (1)
41-48: Consider adding path validation for robustness.

The function assumes the datasets directory exists. If the path construction fails or the directory doesn't exist, the function will raise an error that may be unclear to users. Consider adding a check to provide a more informative error message.

Apply this diff to add validation:
 def get_preparable_datasets():
     """Get list of datasets that can be prepared for testing (used by CI workflow)."""
     datasets_dir = Path(__file__).absolute().parents[2] / "nemo_skills" / "dataset"
+    if not datasets_dir.exists():
+        raise FileNotFoundError(f"Datasets directory not found at {datasets_dir}")
     return sorted(
         dataset.name
         for dataset in datasets_dir.iterdir()
         if dataset.is_dir() and (dataset / "prepare.py").exists() and dataset.name not in EXCLUDED_DATASETS
     )
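
The validation suggested above can be tried in isolation. What follows is a minimal sketch, not the repository's code: the function name `find_preparable_datasets` and its two parameters are assumptions introduced here so the logic can run against a temporary directory, whereas the function in the PR derives the path from `__file__` and reads the module-level EXCLUDED_DATASETS. The directory names `alpha`, `beta`, and `no_prepare` are placeholders. The sketch checks `is_dir()` rather than `exists()`, which also rejects a plain file at that path.

```python
import tempfile
from pathlib import Path


def find_preparable_datasets(datasets_dir, excluded=frozenset()):
    """Return sorted names of subdirectories that contain a prepare.py.

    Hypothetical, parameterized variant of get_preparable_datasets();
    the parameters are assumptions made for illustration only.
    """
    datasets_dir = Path(datasets_dir)
    # Fail with an explicit message instead of an opaque iterdir() error.
    if not datasets_dir.is_dir():
        raise FileNotFoundError(f"Datasets directory not found at {datasets_dir}")
    return sorted(
        entry.name
        for entry in datasets_dir.iterdir()
        if entry.is_dir() and (entry / "prepare.py").exists() and entry.name not in excluded
    )


def demo():
    """Build a throwaway directory tree and run the discovery on it."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        for name in ("beta", "aalcr", "alpha"):  # placeholder dataset names
            (root / name).mkdir()
            (root / name / "prepare.py").touch()
        (root / "no_prepare").mkdir()  # skipped: it has no prepare.py
        return find_preparable_datasets(root, excluded={"aalcr"})
```

Calling `demo()` returns only the directories that have a `prepare.py` and are not excluded, in sorted order. Pointing the function at a missing directory raises `FileNotFoundError` with the path in the message.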

.github/workflows/gpu_tests.yml (1)
47-55: Consider adding error checking for the dataset list computation.

If the Python script fails, the DATASETS variable may be empty or contain unexpected output, potentially causing ns prepare_data to silently succeed without preparing any datasets. Adding validation would make the failure more explicit.

Apply this diff to add error checking:
     - name: Prepare all datasets (fail-fast check)
       env:
         HF_TOKEN: ${{ secrets.HF_TOKEN }}
       run: |
         cd ${{ github.run_id }}
         # Get dataset list from the same source as test_prepare_and_eval_all_datasets
         DATASETS=$(python tests/gpu-tests/dataset_utils.py)
+        if [ -z "$DATASETS" ]; then
+          echo "Error: No datasets found or script failed"
+          exit 1
+        fi
         echo "Preparing datasets: $DATASETS"
         ns prepare_data $DATASETS
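
The same guard can be expressed in Python, for readers who prefer to validate the list before invoking the CLI. This is a sketch under stated assumptions: `build_prepare_command` is a hypothetical helper that is not part of the PR, and it only constructs the argument list for `ns prepare_data` without running it.

```python
import shlex


def build_prepare_command(discovery_output):
    """Turn the discovery script's stdout into an argv list for `ns prepare_data`.

    Hypothetical helper. Raises ValueError when the output holds no dataset
    names, mirroring the `[ -z "$DATASETS" ]` check in the suggested diff.
    """
    # shlex.split drops surrounding whitespace and the trailing newline.
    datasets = shlex.split(discovery_output)
    if not datasets:
        raise ValueError("No datasets found or discovery script failed")
    return ["ns", "prepare_data", *datasets]
```

The returned list can be handed to `subprocess.run(cmd, check=True)`, which raises `CalledProcessError` on a non-zero exit status, so a failed preparation stops the step instead of passing silently.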

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5cc1dcc and 17b0325.

📒 Files selected for processing (3)

.github/workflows/gpu_tests.yml (1 hunks)
tests/gpu-tests/dataset_utils.py (1 hunks)
tests/gpu-tests/test_eval.py (2 hunks)

🧰 Additional context used

🧬 Code graph analysis (1)

tests/gpu-tests/test_eval.py (1)

tests/gpu-tests/dataset_utils.py (1)

get_preparable_datasets (41-48)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)

GitHub Check: unit-tests
GitHub Check: pre-commit

🔇 Additional comments (4)

tests/gpu-tests/dataset_utils.py (2)

19-38: LGTM: Clear dataset exclusion list.

The EXCLUDED_DATASETS set is well-documented with clear rationale for exclusions. The inline comment for aalcr is helpful for future maintainers.

51-53: LGTM: Clean CI integration pattern.

The main guard provides a simple interface for CI workflows to consume the dataset list. The space-separated output format is appropriate for shell script consumption.
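
The main guard itself is not shown in this review, so the following is only a sketch of the pattern it describes. The stand-in `get_preparable_datasets()` returns placeholder names, and `main()` and `captured_output()` are hypothetical helpers added to make the output format visible.

```python
import contextlib
import io


def get_preparable_datasets():
    """Stand-in for the PR's function; returns placeholder names."""
    return ["alpha", "beta"]


def main():
    # One line, space-separated, so a shell can pass the unquoted
    # $(python dataset_utils.py) result as one argument per dataset.
    print(" ".join(get_preparable_datasets()))


def captured_output():
    """Run main() and return exactly what it printed."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        main()
    return buffer.getvalue()


if __name__ == "__main__":
    main()
```

This format relies on dataset names containing no whitespace or shell metacharacters, which holds for plain directory names like those above.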

tests/gpu-tests/test_eval.py (2)

21-21: LGTM: Clean centralization of dataset discovery.

The import of get_preparable_datasets properly centralizes the dataset discovery logic, making it reusable across both the CI workflow and test code.

224-224: LGTM: Proper use of centralized dataset utility.

The refactoring successfully replaces inline discovery logic with the centralized utility function. The assertion on Line 226 provides good validation that datasets were found.

gwarmstrong added 2 commits December 4, 2025 17:11

build: move data preparation to beginnning of gpu tests build

2160eea

Signed-off-by: George Armstrong <georgea@nvidia.com>

fix: add dataset utils file

17b0325

Signed-off-by: George Armstrong <georgea@nvidia.com>

gwarmstrong added the run GPU tests label Dec 5, 2025

coderabbitai bot reviewed Dec 5, 2025

prepare data in cluster mode

47ab0b8

Signed-off-by: George Armstrong <georgea@nvidia.com>

gwarmstrong added run GPU tests and removed run GPU tests labels Dec 5, 2025

gwarmstrong commented Dec 5, 2025

.github/workflows/gpu_tests.yml (outdated, resolved)

Apply suggestion from @gwarmstrong

350e587

Signed-off-by: George Armstrong <georgea@nvidia.com>

gwarmstrong added run GPU tests and removed run GPU tests labels Dec 5, 2025

alternate approach

fd2991b

Signed-off-by: George Armstrong <georgea@nvidia.com>

gwarmstrong added run GPU tests and removed run GPU tests labels Dec 5, 2025

MAINT move dataset get logic back into tests file

4278117

Signed-off-by: George Armstrong <georgea@nvidia.com>

gwarmstrong merged commit 86e03bf into main Dec 5, 2025
5 checks passed

gwarmstrong deleted the georgea/move-data-prep-to-beginning-of-integration branch December 5, 2025 02:53

Jorjeous pushed a commit that referenced this pull request Dec 11, 2025

build: move data preparation to beginning of gpu tests build (#1077)

5603f43

Signed-off-by: George Armstrong <georgea@nvidia.com> Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>

wasiahmad pushed a commit that referenced this pull request Dec 12, 2025

build: move data preparation to beginning of gpu tests build (#1077)

066a1a0

Signed-off-by: George Armstrong <georgea@nvidia.com> Signed-off-by: wasiahmad <wasiahmad@ucla.edu>

wasiahmad pushed a commit that referenced this pull request Feb 4, 2026

build: move data preparation to beginning of gpu tests build (#1077)

1545f73

Signed-off-by: George Armstrong <georgea@nvidia.com>

dgtm777 pushed a commit that referenced this pull request Mar 18, 2026

build: move data preparation to beginning of gpu tests build (#1077)

edbdfcb

Signed-off-by: George Armstrong <georgea@nvidia.com>

dgtm777 pushed a commit that referenced this pull request Mar 18, 2026

build: move data preparation to beginning of gpu tests build (#1077)

6afc17f

Signed-off-by: George Armstrong <georgea@nvidia.com> Signed-off-by: dgitman <dgitman@nvidia.com>

gwarmstrong merged 6 commits into main from georgea/move-data-prep-to-beginning-of-integration