build: move data preparation to beginning of gpu tests build (#1077)
gwarmstrong merged 6 commits into main
Signed-off-by: George Armstrong <georgea@nvidia.com>
📝 Walkthrough

This PR adds a dataset preparation validation step to the GPU tests CI workflow. It introduces a utility function that centrally discovers datasets with `prepare.py` scripts (excluding specified datasets) and refactors the test to use this utility instead of the previous inline discovery logic.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
Pre-merge checks
❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Actionable comments posted: 0
🧹 Nitpick comments (2)
tests/gpu-tests/dataset_utils.py (1)
41-48: Consider adding path validation for robustness.
The function assumes the datasets directory exists. If the path construction fails or the directory doesn't exist, the function will raise an error that may be unclear to users. Consider adding a check to provide a more informative error message.
Apply this diff to add validation:
```diff
 def get_preparable_datasets():
     """Get list of datasets that can be prepared for testing (used by CI workflow)."""
     datasets_dir = Path(__file__).absolute().parents[2] / "nemo_skills" / "dataset"
+    if not datasets_dir.exists():
+        raise FileNotFoundError(f"Datasets directory not found at {datasets_dir}")
     return sorted(
         dataset.name
         for dataset in datasets_dir.iterdir()
         if dataset.is_dir() and (dataset / "prepare.py").exists() and dataset.name not in EXCLUDED_DATASETS
     )
```

.github/workflows/gpu_tests.yml (1)
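With the suggested check applied, the helper would look roughly like this. This is a sketch, not the PR's exact code: the directory is passed as a parameter here for illustration (the real function derives it from `__file__`), and the exclusion list is a hypothetical stand-in for the real `EXCLUDED_DATASETS` in `tests/gpu-tests/dataset_utils.py`.

```python
from pathlib import Path

# Hypothetical stand-in for the real exclusion list; "aalcr" is the one
# name the review comments mention.
EXCLUDED_DATASETS = {"aalcr"}


def get_preparable_datasets(datasets_dir: Path) -> list[str]:
    """Return sorted names of dataset directories that ship a prepare.py script."""
    if not datasets_dir.exists():
        raise FileNotFoundError(f"Datasets directory not found at {datasets_dir}")
    return sorted(
        d.name
        for d in datasets_dir.iterdir()
        if d.is_dir() and (d / "prepare.py").exists() and d.name not in EXCLUDED_DATASETS
    )
```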
47-55: Consider adding error checking for the dataset list computation.
If the Python script fails, the `DATASETS` variable may be empty or contain unexpected output, potentially causing `ns prepare_data` to silently succeed without preparing any datasets. Adding validation would make the failure more explicit.
Apply this diff to add error checking:

```diff
       - name: Prepare all datasets (fail-fast check)
         env:
           HF_TOKEN: ${{ secrets.HF_TOKEN }}
         run: |
           cd ${{ github.run_id }}
           # Get dataset list from the same source as test_prepare_and_eval_all_datasets
           DATASETS=$(python tests/gpu-tests/dataset_utils.py)
+          if [ -z "$DATASETS" ]; then
+            echo "Error: No datasets found or script failed"
+            exit 1
+          fi
           echo "Preparing datasets: $DATASETS"
           ns prepare_data $DATASETS
```
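For local debugging, the same empty-output guard can be sketched in Python. This is a hypothetical helper, not part of the PR: it runs a dataset-listing command and raises instead of silently handing an empty list on to the next step.

```python
import subprocess


def list_datasets_or_fail(cmd: list[str]) -> list[str]:
    """Run a dataset-listing command; fail loudly if it errors or prints nothing,
    rather than letting an empty list reach `ns prepare_data`."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    names = result.stdout.split()
    if result.returncode != 0 or not names:
        raise RuntimeError("No datasets found or script failed")
    return names
```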
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- `.github/workflows/gpu_tests.yml` (1 hunk)
- `tests/gpu-tests/dataset_utils.py` (1 hunk)
- `tests/gpu-tests/test_eval.py` (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
tests/gpu-tests/test_eval.py (1)
tests/gpu-tests/dataset_utils.py (1)
- `get_preparable_datasets` (41-48)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: unit-tests
- GitHub Check: pre-commit
🔇 Additional comments (4)
tests/gpu-tests/dataset_utils.py (2)
19-38: LGTM: Clear dataset exclusion list.
The EXCLUDED_DATASETS set is well-documented with clear rationale for exclusions. The inline comment for `aalcr` is helpful for future maintainers.
51-53: LGTM: Clean CI integration pattern.
The `__main__` guard provides a simple interface for CI workflows to consume the dataset list. The space-separated output format is appropriate for shell script consumption.
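The entry-point pattern described here can be sketched as follows; a minimal illustration with placeholder names, not the PR's exact code, which would print the result of `get_preparable_datasets()`.

```python
def format_for_shell(datasets: list[str]) -> str:
    """Join dataset names with spaces so a shell step can splice them
    directly into a command like `ns prepare_data $DATASETS`."""
    return " ".join(datasets)


if __name__ == "__main__":
    # Illustrative names only; the real script prints the discovered datasets.
    print(format_for_shell(["gsm8k", "math"]))
```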
tests/gpu-tests/test_eval.py (2)
21-21: LGTM: Clean centralization of dataset discovery.
The import of `get_preparable_datasets` properly centralizes the dataset discovery logic, making it reusable across both the CI workflow and the test code.
224-224: LGTM: Proper use of centralized dataset utility.
The refactoring successfully replaces the inline discovery logic with the centralized utility function. The assertion on line 226 provides good validation that datasets were found.
Signed-off-by: George Armstrong <georgea@nvidia.com>
Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
Signed-off-by: wasiahmad <wasiahmad@ucla.edu>
Signed-off-by: dgitman <dgitman@nvidia.com>
Summary by CodeRabbit
- Tests
- Chores