Feature: Delete Cluster Command #250
We should add an integ test; it has the added benefit of cleaning up the stacks after the tests have finished running.
987cff8 to b9447d5
nargokul left a comment
Missing integ tests as well
>>> # Delete with custom logger
>>> import logging
>>> logger = logging.getLogger(__name__)
>>> HpClusterStack.delete("my-stack-name", logger=logger)
Is the user waiting on the delete to complete?
Is there a way to check the status of a delete? Does the describe command handle this?
The delete method returns immediately. I followed how the create SDK command works, and that returns immediately as well.
You can check the stack status with check_status, but you can't check the actual status of a delete: the describe command will not show any deleted stacks.
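Because the delete call returns immediately, a caller who needs to block until the stack is gone has to poll CloudFormation. The following is a minimal sketch of that idea and is not part of this PR: `wait_for_stack_deletion` is a hypothetical helper, `cfn_client` is assumed to be a boto3 CloudFormation client, and the poll interval and attempt limit are arbitrary defaults. It relies on `describe_stacks` raising a `ValidationError` ("does not exist") once a stack addressed by name has been deleted.

```python
import time
from typing import Any, Callable


def wait_for_stack_deletion(
    cfn_client: Any,
    stack_name: str,
    poll_seconds: float = 15.0,
    max_attempts: int = 40,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll describe_stacks until the stack is deleted or deletion fails.

    Hypothetical helper: cfn_client is assumed to behave like
    boto3.client("cloudformation"). Returns the final status string.
    """
    for _ in range(max_attempts):
        try:
            response = cfn_client.describe_stacks(StackName=stack_name)
        except Exception as exc:
            # botocore's ClientError carries the service error in exc.response.
            # A stack looked up by name that no longer exists yields a
            # ValidationError, which is treated here as a finished delete.
            error = getattr(exc, "response", {}).get("Error", {})
            if (error.get("Code") == "ValidationError"
                    and "does not exist" in error.get("Message", "")):
                return "DELETE_COMPLETE"
            raise
        status = response["Stacks"][0]["StackStatus"]
        if status in ("DELETE_COMPLETE", "DELETE_FAILED"):
            return status
        sleep(poll_seconds)
    raise TimeoutError(
        f"Stack {stack_name!r} was still deleting after {max_attempts} polls"
    )
```

boto3 also ships a built-in waiter for this, `cfn_client.get_waiter("stack_delete_complete")`, which may be the simpler choice when custom handling of `DELETE_FAILED` is not needed.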
def perform_stack_deletion(stack_name: str, region: str, retain_list: List[str],
                           logger: Optional[logging.Logger] = None) -> None:
    """Perform the actual CloudFormation stack deletion.
    This is a low-level function that directly calls the CloudFormation delete_stack API.
    For most use cases, use delete_stack_with_confirmation() instead.
We can make this a private function to avoid confusion
Thanks for adding the integ tests. Is it possible to keep only 1-2 happy cases and let the other features be covered by unit tests?
In my experience, repeatedly creating and deleting stacks is error-prone, and we would have many sets of tests running at the same time. We can test only the basic cases, as the training/inference/cluster creation integ tests do, and put the other features in unit tests.
…ce launch (#249)
* Release new version for Health Monitoring Agent (1.0.790.0_1.0.266.0) with minor improvements and bug fixes. (#254)
* changelog version update (#256)
Co-authored-by: Mohamed Zeidan <[email protected]>
* Fix README documentation and broken anchor links (#252)
**Description**
- Updated README.md to fix broken internal navigation links, corrected SDK import paths, added proper syntax highlighting to code blocks.
- Fixed training SDK imports, observability utils import path, and cluster management workflow examples.
**Testing Done**
- Verified all anchor links work correctly in table of contents and usage sections
- Cross-referenced SDK imports against actual source code in src/sagemaker/hyperpod/
- Validated CLI commands match implementation in hyp_cli.py
- Confirmed code examples use correct class names and method signatures
* Small bug fix to print debug messages for inference logger (PySDK) (#246)
* Draft of inference logger bug fix
* Draft fix of inference logger for SDK
* Revert adding --debug flag
* Add debug parameter to failing unit tests
* Fix create_from_dict to not have hardcoded debug flag
* Add code-coverage workflow to GitHub workflows (#257)
* Add code coverage workflow
* Update artifact version to v4
* Fixed report upload
* Simplified workflow using tox.ini
* Make sure coverage is on right source files
* Bug fix for 0 percent code coverage error
* Bump version to 3.2.2 (#260)
* Bump version to 3.2.2
**Description** Update package version from 3.2.1 to 3.2.2 in pyproject.toml and setup.py files.
**Testing Done** Version bump only - no functional changes requiring additional testing.
* Changelog update for v3.2.2
**Description** Added details for Health Monitoring Agent updates to changelog
**Testing Done** Production canary failure fixes validated.
* Changelog update for v3.2.2
**Description** Updated the release date to represent the correct date.
**Testing Done** No breaking changes.
* Bump hyperpod-pytorch-job-template to v1.1.2
**Description** Update hyperpod-pytorch-job-template version from 1.1.1 to 1.1.2 and add changelog entry for node-count validation revert.
**Testing Done** Version bump and changelog update - node-count validation revert functionality verified.
* Update readme to include review guidelines (#261)
* Update PR template
* Update template
* Update template format
* Update format
* Fix readme
* Feature: Delete Cluster Command (#250)
* delete cluster stack
* delete cluster stack
* removed unnecessary file
* unit tests
* more modular code
* refactored modular code
* sdk code added and improved modularity
* cleanup
* removed silent failure for sdk
* fixed unit tests
* integ tests
* 2 integ happycase tests
* changed test to use iam role instead of s3 bucket
---------
Co-authored-by: Mohamed Zeidan <[email protected]>
* Code Coverage for Integ Tests (#262)
* Code Coverage for Integ Tests
* Making sure target of coverage is correct
* Removing duplicate implementation
* Release new version for Health Monitoring Agent (1.0.819.0_1.0.267.0) with minor improvements and bug fixes. (#265)
1. New feature NVML API Check to detect hardware failure. Disabled Nvidia SMI query check
2. HMA will be able to detect File system read only error
3. For compatibility with AL2023, Non-NVIDIA devices will use a separate daemonset for deployment.
* Removing duplicate cluster-creating integ test (#266)
* Access entry fix (#267)
* Fix Slurm failures from missing orchestration key (#268)
* slurm-eks-helper-fix
* Small fix to test to reflect new changes
* small fix after resolving merge conflict
---------
Co-authored-by: Xichao Wang <[email protected]>
Co-authored-by: Mohamed Zeidan <[email protected]>
Co-authored-by: Mohamed Zeidan <[email protected]>
Co-authored-by: papriwal <[email protected]>
Co-authored-by: aviruthen <[email protected]>
Co-authored-by: Zhaoqi <[email protected]>
Co-authored-by: jiayelamazon <[email protected]>
…h new documentation (#250)
* add example notebooks for init experience, update README to match with new documentation
* clear output
Design Doc: https://tiny.amazon.com/11p9jl2j9/quipIs0ODele
Implemented hyp delete cluster-stack command.
Implementation
list_stack_resources API to validate resources
--retain-resources allows users to keep potentially locked resources if deletion is failing so they can delete the stack and keep any necessary resources
--region flag required
Output Examples
Successful Deletion
hyp delete cluster-stack my-stack --region us-west-2
Failed Deletion
CloudFormation Retention Limitation
hyp delete cluster-stack my-stack --retain-resources TestS3Bucket --region us-west-2
Resource Validation Warning
hyp delete cluster-stack my-stack --retain-resources NonExistentResource,TestIAMRole --region us-west-2
Termination Protection Error
hyp delete cluster-stack protected-stack --region us-west-2
Stack Not Found
hyp delete cluster-stack non-existent-stack --region us-west-2
Missing Required Region
hyp delete cluster-stack my-stack
Unit Tests
test_successful_deletion_without_retention - Verifies successful stack deletion
test_successful_deletion_with_retention - Tests resource retention functionality
test_user_cancellation - Validates user cancellation handling
test_stack_not_found - Tests non-existent stack handling
test_termination_protection_enabled - Verifies termination protection error handling
test_cloudformation_retention_limitation - Tests CloudFormation limitation guidance
test_partial_deletion_failure - Handles partial deletion scenarios
test_access_denied_error - Tests permission error handling
test_empty_stack_resources - Handles empty stacks
test_resource_categorization - Verifies proper resource grouping
test_retain_resources_parsing - Tests parameter parsing with spaces
test_debug_logging - Validates debug mode functionality
test_command_help - Tests help output
test_required_region_flag - Verifies region requirement
test_generic_error_handling - Tests unexpected error scenarios
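The `--retain-resources` parsing and resource validation described in this PR can be sketched roughly as follows. This is an illustration under stated assumptions, not the PR's actual code: the function names `parse_retain_resources`, `list_logical_ids`, and `split_retain_list` are hypothetical, and `cfn_client` is assumed to be a boto3 CloudFormation client whose `list_stack_resources` responses contain `StackResourceSummaries` and an optional `NextToken`.

```python
from typing import Any, Dict, List, Set, Tuple


def parse_retain_resources(raw: str) -> List[str]:
    """Split a comma-separated --retain-resources value, tolerating spaces."""
    return [part.strip() for part in raw.split(",") if part.strip()]


def list_logical_ids(cfn_client: Any, stack_name: str) -> Set[str]:
    """Collect every logical resource ID in the stack, following NextToken."""
    logical_ids: Set[str] = set()
    kwargs: Dict[str, Any] = {"StackName": stack_name}
    while True:
        page = cfn_client.list_stack_resources(**kwargs)
        for summary in page["StackResourceSummaries"]:
            logical_ids.add(summary["LogicalResourceId"])
        token = page.get("NextToken")
        if not token:
            return logical_ids
        kwargs["NextToken"] = token


def split_retain_list(
    requested: List[str], existing: Set[str]
) -> Tuple[List[str], List[str]]:
    """Separate requested IDs into those present in the stack and those not."""
    valid = [name for name in requested if name in existing]
    invalid = [name for name in requested if name not in existing]
    return valid, invalid
```

For the Resource Validation Warning example above, `NonExistentResource` would land in the invalid list (the case where a warning is printed), and `TestIAMRole` would be kept in the valid list, assuming it exists in the stack.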
PR Approval Steps
For Requester
For Reviewer
For Requester section to double check each item.