Feature: Delete Cluster Command #250

mohamedzeidan2021 · 2025-08-30T00:10:01Z

Design Doc: https://tiny.amazon.com/11p9jl2j9/quipIs0ODele

Implemented hyp delete cluster-stack command.

Implementation

Uses CloudFormation list_stack_resources API to validate resources
Error handles CloudFormation scenarios
Groups EC2 instances, Networking, IAM, Storage, etc
--retain-resources allows users to keep potentially locked resources if deletion is failing so they can delete the stack and keep any necessary resources
--region flag required

Output Examples

Successful Deletion
hyp delete cluster-stack my-stack --region us-west-2

⚠ WARNING: This will delete the following 3 resources:

Networking (1):
 - TestSecurityGroup

IAM (1):
 - TestIAMRole

Storage (1):
 - TestS3Bucket

Continue? [y/N]: y
✓ Stack 'my-stack' deletion initiated successfully

Failed Deletion

✗ Stack deletion failed

Successfully deleted (2):
 ✓ TestS3Bucket
 ✓ TestIAMRole

Failed to delete (1):
 ✗ TestSecurityGroup (DependencyViolation: has dependent resources)

💡 Note: Some resources may have dependencies preventing deletion
   Check the AWS CloudFormation console for detailed dependency information

CloudFormation Retention Limitation
hyp delete cluster-stack my-stack --retain-resources TestS3Bucket --region us-west-2

❌ CloudFormation limitation: --retain-resources only works on failed deletions

💡 Recommended workflow:
1. First try deleting without --retain-resources:
   hyp delete cluster-stack my-stack --region us-west-2

2. If deletion fails, the stack will be in DELETE_FAILED state
3. Then retry with --retain-resources to keep specific resources:
   hyp delete cluster-stack my-stack --retain-resources TestS3Bucket --region us-west-2

Resource Validation Warning
hyp delete cluster-stack my-stack --retain-resources NonExistentResource,TestIAMRole --region us-west-2

⚠️  Warning: The following 1 resources don't exist in the stack:
 - NonExistentResource (not found)

⚠ WARNING: This will delete the following 2 resources:

Networking (1):
 - TestSecurityGroup

Storage (1):
 - TestS3Bucket

The following 1 resources will be RETAINED:
 ✓ TestIAMRole (retained)

Termination Protection Error
hyp delete cluster-stack protected-stack --region us-west-2

❌ Stack deletion blocked: Termination Protection is enabled

To delete this stack, first disable termination protection:
aws cloudformation update-termination-protection --no-enable-termination-protection --stack-name protected-stack --region us-west-2

Then retry the delete command.

Stack Not Found
hyp delete cluster-stack non-existent-stack --region us-west-2

❌ Stack 'non-existent-stack' not found

Missing Required Region
hyp delete cluster-stack my-stack

Usage: hyp delete cluster-stack [OPTIONS] STACK_NAME
Try 'hyp delete cluster-stack --help' for help.

Error: Missing option '--region'.

Unit Tests

test_successful_deletion_without_retention - Verifies successful stack deletion

test_successful_deletion_with_retention - Tests resource retention functionality

test_user_cancellation - Validates user cancellation handling

test_stack_not_found - Tests non-existent stack handling

test_termination_protection_enabled - Verifies termination protection error handling

test_cloudformation_retention_limitation - Tests CloudFormation limitation guidance

test_partial_deletion_failure - Handles partial deletion scenarios

test_access_denied_error - Tests permission error handling

test_empty_stack_resources - Handles empty stacks

test_resource_categorization - Verifies proper resource grouping

test_retain_resources_parsing - Tests parameter parsing with spaces

test_debug_logging - Validates debug mode functionality

test_command_help - Tests help output

test_required_region_flag - Verifies region requirement

test_generic_error_handling - Tests unexpected error scenarios

PR Approval Steps

For Requester

Description
- Check the PR title and description for clarity. It should describe the changes made and the reason behind them.
- Ensure that the PR follows the contribution guidelines, if applicable.
Security requirements
- Ensure that a Pull Request (PR) does not expose passwords and other sensitive information by using git-secrets and upload relevant evidence: https://github.com/awslabs/git-secrets
- Ensure commit has GitHub Commit Signature
Manual review
1. Click on the Files changed tab to see the code changes. Review the changes thoroughly:
  - Code Quality: Check for coding standards, naming conventions, and readability.
  - Functionality: Ensure that the changes meet the requirements and that all necessary code paths are tested.
  - Security: Check for any security issues or vulnerabilities.
  - Documentation: Confirm that any necessary documentation (code comments, README updates, etc.) has been updated.
Check for Merge Conflicts:
- Verify if there are any merge conflicts with the base branch. GitHub will usually highlight this. If there are conflicts, you should resolve them.

For Reviewer

Go through For Requester section to double check each item.
Request Changes or Approve the PR:
1. If the PR is ready to be merged, click Review changes and select Approve.
2. If changes are required, select Request changes and provide feedback. Be constructive and clear in your feedback.
Merging the PR
1. Check the Merge Method:
  1. Decide on the appropriate merge method based on your repository's guidelines (e.g., Squash and merge, Rebase and merge, or Merge).
2. Merge the PR:
  1. Click the Merge pull request button.
  2. Confirm the merge by clicking Confirm merge.

src/sagemaker/hyperpod/cli/commands/cluster_stack.py

papriwal

We should add an integ test, it has an added benefit of cleaning up the stacks after the tests have completed running.

src/sagemaker/hyperpod/cli/commands/cluster_stack.py

nargokul

Missing integ tests as well

nargokul · 2025-09-09T20:13:31Z

src/sagemaker/hyperpod/cluster_management/hp_cluster_stack.py

+              >>> # Delete with custom logger
+              >>> import logging
+              >>> logger = logging.getLogger(__name__)
+              >>> HpClusterStack.delete("my-stack-name", logger=logger)


Is the user waiting on the delete to complete ?

Is there a way to check the status of a delete ? Does the describe command handle this ?

the delete method returns immediately, i followed how the create SDK cmd works and that returns immediately as well

you can check the stack status with check_status, but you cant check the actual status of a delete. the describe cmd to describe the stack will not show any deleted stacks.

nargokul · 2025-09-09T20:18:04Z

src/sagemaker/hyperpod/cli/cluster_stack_utils.py

+def perform_stack_deletion(stack_name: str, region: str, retain_list: List[str], 
+                          logger: Optional[logging.Logger] = None) -> None:
+    """Perform the actual CloudFormation stack deletion.
+    
+    This is a low-level function that directly calls the CloudFormation delete_stack API.
+    For most use cases, use delete_stack_with_confirmation() instead.


We can make this a private function to avoid confusion

zhaoqizqwang · 2025-09-09T22:44:24Z

test/integration_tests/cluster_management/test_cli_cluster_stack_deletion.py

Thanks for adding the integ tests. Is it possible to keep only 1-2 happy cases, and let other features covered by unit tests?
In my experience repetitively creating and deleting stacks are error prone and we would have many sets of tests running at the same time. We can only test basic cases like what training/infernece/cluster creation integ tests did and put other features in unit tests

…ce launch (#249) * Release new version for Health Monitoring Agent (1.0.790.0_1.0.266.0) with minor improvements and bug fixes. (#254) * changelog version update (#256) Co-authored-by: Mohamed Zeidan <[email protected]> * Fix README documentation and broken anchor links (#252) **Description** - Updated README.md to fix broken internal navigation links, corrected SDK import paths, added proper syntax highlighting to code blocks. - Fixed training SDK imports, observability utils import path, and cluster management workflow examples. **Testing Done** - Verified all anchor links work correctly in table of contents and usage sections - Cross-referenced SDK imports against actual source code in src/sagemaker/hyperpod/ - Validated CLI commands match implementation in hyp_cli.py - Confirmed code examples use correct class names and method signatures * Small bug fix to print debug messages for inference logger (PySDK) (#246) * Draft of inference logger bug fix * Draft fix of inference logger for SDK * Revert adding --debug flag * Add debug parameter to failing unit tests * Fix create_from_dict to not have hardcoded debug flag * Add code-coverage workflow to GitHub workflows (#257) * Add code coverage workflow * Update artifact version to v4 * Fixed report upload * Simplified workflow using tox.ini * Make sure coverage is on right source files * Bug fix for 0 percent code coverage error * Bump version to 3.2.2 (#260) * Bump version to 3.2.2 **Description** Update package version from 3.2.1 to 3.2.2 in pyproject.toml and setup.py files. **Testing Done** Version bump only - no functional changes requiring additional testing. * Changelog update for v3.2.2 **Description** Added detaisl for Health Monitoring Agent updates to changelog **Testing Done** Production canary failure fixes validated. * Changelog update for v3.2.2 **Description** Updated the release date to represent the correct date. **Testing Done** No breaking changes. * Bump hyperpod-pytorch-job-template to v1.1.2 **Description** Update hyperpod-pytorch-job-template version from 1.1.1 to 1.1.2 and add changelog entry for node-count validation revert. **Testing Done** Version bump and changelog update - node-count validation revert functionality verified. * Update readme to include review guidelines (#261) * Update PR template * Update template * Update template format * Update format * Fix readme * Feature: Delete Cluster Command (#250) * delete cluster stack * delete cluster stack * removed unnecessary file * unit tests * more modular code * refactored modular code * sdk code added and improved modularity * cleanup * removed silent failure for sdk * fixed unit tests * integ tests * 2 integ happycase tests * changed test to use iam role instead of s3 bucket --------- Co-authored-by: Mohamed Zeidan <[email protected]> * Code Coverage for Integ Tests (#262) * Code Coverage for Integ Tests * Making sure target of coverage is correct * Removing duplicate implementation * Release new version for Health Monitoring Agent (1.0.819.0_1.0.267.0) with minor improvements and bug fixes. (#265) 1. New feature NVML API Check to detect hardware failure. Disabled Nvidia SMI query check 2. HMA will be able to detect File system read only error 3. For compatibility with AL2023, Non-NVIDIA devices will use a separate daemonset for deployment. * Removing duplicate cluster-creating integ test (#266) * Access entry fix (#267) * Fix Slurm failures from missing orchestration key (#268) * slurm-eks-helper-fix * Small fix to test to reflect new changes * small fix after resolving merge conflict --------- Co-authored-by: Xichao Wang <[email protected]> Co-authored-by: Mohamed Zeidan <[email protected]> Co-authored-by: Mohamed Zeidan <[email protected]> Co-authored-by: papriwal <[email protected]> Co-authored-by: aviruthen <[email protected]> Co-authored-by: Zhaoqi <[email protected]> Co-authored-by: jiayelamazon <[email protected]>

…h new documentation (#250) * add example notebooks for init experience, update README to match with new documentation * clear output

…ce launch (#249) * Release new version for Health Monitoring Agent (1.0.790.0_1.0.266.0) with minor improvements and bug fixes. (#254) * changelog version update (#256) Co-authored-by: Mohamed Zeidan <[email protected]> * Fix README documentation and broken anchor links (#252) **Description** - Updated README.md to fix broken internal navigation links, corrected SDK import paths, added proper syntax highlighting to code blocks. - Fixed training SDK imports, observability utils import path, and cluster management workflow examples. **Testing Done** - Verified all anchor links work correctly in table of contents and usage sections - Cross-referenced SDK imports against actual source code in src/sagemaker/hyperpod/ - Validated CLI commands match implementation in hyp_cli.py - Confirmed code examples use correct class names and method signatures * Small bug fix to print debug messages for inference logger (PySDK) (#246) * Draft of inference logger bug fix * Draft fix of inference logger for SDK * Revert adding --debug flag * Add debug parameter to failing unit tests * Fix create_from_dict to not have hardcoded debug flag * Add code-coverage workflow to GitHub workflows (#257) * Add code coverage workflow * Update artifact version to v4 * Fixed report upload * Simplified workflow using tox.ini * Make sure coverage is on right source files * Bug fix for 0 percent code coverage error * Bump version to 3.2.2 (#260) * Bump version to 3.2.2 **Description** Update package version from 3.2.1 to 3.2.2 in pyproject.toml and setup.py files. **Testing Done** Version bump only - no functional changes requiring additional testing. * Changelog update for v3.2.2 **Description** Added detaisl for Health Monitoring Agent updates to changelog **Testing Done** Production canary failure fixes validated. * Changelog update for v3.2.2 **Description** Updated the release date to represent the correct date. **Testing Done** No breaking changes. * Bump hyperpod-pytorch-job-template to v1.1.2 **Description** Update hyperpod-pytorch-job-template version from 1.1.1 to 1.1.2 and add changelog entry for node-count validation revert. **Testing Done** Version bump and changelog update - node-count validation revert functionality verified. * Update readme to include review guidelines (#261) * Update PR template * Update template * Update template format * Update format * Fix readme * Feature: Delete Cluster Command (#250) * delete cluster stack * delete cluster stack * removed unnecessary file * unit tests * more modular code * refactored modular code * sdk code added and improved modularity * cleanup * removed silent failure for sdk * fixed unit tests * integ tests * 2 integ happycase tests * changed test to use iam role instead of s3 bucket --------- Co-authored-by: Mohamed Zeidan <[email protected]> * Code Coverage for Integ Tests (#262) * Code Coverage for Integ Tests * Making sure target of coverage is correct * Removing duplicate implementation * Release new version for Health Monitoring Agent (1.0.819.0_1.0.267.0) with minor improvements and bug fixes. (#265) 1. New feature NVML API Check to detect hardware failure. Disabled Nvidia SMI query check 2. HMA will be able to detect File system read only error 3. For compatibility with AL2023, Non-NVIDIA devices will use a separate daemonset for deployment. * Removing duplicate cluster-creating integ test (#266) * Access entry fix (#267) * Fix Slurm failures from missing orchestration key (#268) * slurm-eks-helper-fix * Small fix to test to reflect new changes * small fix after resolving merge conflict --------- Co-authored-by: Xichao Wang <[email protected]> Co-authored-by: Mohamed Zeidan <[email protected]> Co-authored-by: Mohamed Zeidan <[email protected]> Co-authored-by: papriwal <[email protected]> Co-authored-by: aviruthen <[email protected]> Co-authored-by: Zhaoqi <[email protected]> Co-authored-by: jiayelamazon <[email protected]>

…h new documentation (#250) * add example notebooks for init experience, update README to match with new documentation * clear output

Mohamed Zeidan added 3 commits August 28, 2025 22:50

delete cluster stack

fdaf009

delete cluster stack

ced09aa

removed unnecessary file

bdeaba0

mohamedzeidan2021 requested a review from a team as a code owner August 30, 2025 00:10

mohamedzeidan2021 temporarily deployed to auto-approve August 30, 2025 00:10 — with GitHub Actions Inactive

unit tests

3000a00

mohamedzeidan2021 temporarily deployed to auto-approve August 30, 2025 06:14 — with GitHub Actions Inactive

papriwal reviewed Sep 5, 2025

View reviewed changes

src/sagemaker/hyperpod/cli/commands/cluster_stack.py Outdated Show resolved Hide resolved

papriwal requested changes Sep 5, 2025

View reviewed changes

mufaddal-rohawala reviewed Sep 5, 2025

View reviewed changes

src/sagemaker/hyperpod/cli/commands/cluster_stack.py Show resolved Hide resolved

src/sagemaker/hyperpod/cli/commands/cluster_stack.py Show resolved Hide resolved

more modular code

b9447d5

mohamedzeidan2021 temporarily deployed to auto-approve September 8, 2025 16:28 — with GitHub Actions Inactive

mohamedzeidan2021 temporarily deployed to auto-approve September 8, 2025 22:18 — with GitHub Actions Inactive

mohamedzeidan2021 force-pushed the deleteclustercmd branch from 987cff8 to b9447d5 Compare September 8, 2025 22:21

mohamedzeidan2021 temporarily deployed to auto-approve September 8, 2025 22:21 — with GitHub Actions Inactive

refactored modular code

4c552d7

mohamedzeidan2021 temporarily deployed to auto-approve September 8, 2025 22:50 — with GitHub Actions Inactive

sdk code added and improved modularity

ec69411

mohamedzeidan2021 temporarily deployed to auto-approve September 9, 2025 05:57 — with GitHub Actions Inactive

cleanup

12585f3

mohamedzeidan2021 temporarily deployed to auto-approve September 9, 2025 16:56 — with GitHub Actions Inactive

removed silent failure for sdk

4cdc25b

mohamedzeidan2021 temporarily deployed to auto-approve September 9, 2025 19:28 — with GitHub Actions Inactive

fixed unit tests

034f8a5

mohamedzeidan2021 temporarily deployed to auto-approve September 9, 2025 19:38 — with GitHub Actions Inactive

nargokul reviewed Sep 9, 2025

View reviewed changes

integ tests

3a87ee8

mohamedzeidan2021 temporarily deployed to auto-approve September 9, 2025 21:46 — with GitHub Actions Inactive

zhaoqizqwang reviewed Sep 9, 2025

View reviewed changes

2 integ happycase tests

b48f761

mohamedzeidan2021 temporarily deployed to auto-approve September 9, 2025 23:07 — with GitHub Actions Inactive

changed test to use iam role instead of s3 bucket

578a937

mohamedzeidan2021 temporarily deployed to auto-approve September 10, 2025 19:26 — with GitHub Actions Inactive

Merge remote-tracking branch 'upstream/main' into deleteclustercmd

15a0f96

mohamedzeidan2021 temporarily deployed to auto-approve September 10, 2025 22:33 — with GitHub Actions Inactive

mohamedzeidan2021 temporarily deployed to auto-approve September 11, 2025 13:42 — with GitHub Actions Inactive

zhaoqizqwang approved these changes Sep 11, 2025

View reviewed changes

nargokul approved these changes Sep 11, 2025

View reviewed changes

papriwal approved these changes Sep 11, 2025

View reviewed changes

mohamedzeidan2021 merged commit 883e534 into aws:main Sep 11, 2025
32 of 37 checks passed

rsareddy0329 pushed a commit that referenced this pull request Sep 23, 2025

add example notebooks for init experience, update README to match wit…

983e593

…h new documentation (#250) * add example notebooks for init experience, update README to match with new documentation * clear output

rsareddy0329 pushed a commit that referenced this pull request Sep 24, 2025

add example notebooks for init experience, update README to match wit…

8bc72cf

…h new documentation (#250) * add example notebooks for init experience, update README to match with new documentation * clear output

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Feature: Delete Cluster Command #250

Feature: Delete Cluster Command #250

Uh oh!

mohamedzeidan2021 commented Aug 30, 2025 •

edited

Loading

Uh oh!

Uh oh!

papriwal left a comment •

edited

Loading

Uh oh!

Uh oh!

Uh oh!

nargokul left a comment

Uh oh!

nargokul Sep 9, 2025

Uh oh!

mohamedzeidan2021 Sep 9, 2025

Uh oh!

nargokul Sep 9, 2025

Uh oh!

zhaoqizqwang Sep 9, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

Feature: Delete Cluster Command #250

Feature: Delete Cluster Command #250

Uh oh!

Conversation

mohamedzeidan2021 commented Aug 30, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Implementation

Output Examples

Unit Tests

PR Approval Steps

For Requester

For Reviewer

Uh oh!

Uh oh!

papriwal left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

nargokul left a comment

Choose a reason for hiding this comment

Uh oh!

nargokul Sep 9, 2025

Choose a reason for hiding this comment

Uh oh!

mohamedzeidan2021 Sep 9, 2025

Choose a reason for hiding this comment

Uh oh!

nargokul Sep 9, 2025

Choose a reason for hiding this comment

Uh oh!

zhaoqizqwang Sep 9, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

mohamedzeidan2021 commented Aug 30, 2025 •

edited

Loading

papriwal left a comment •

edited

Loading