mohamedzeidan2021 commented Sep 23, 2025

Detailed reason why we are adding this command: https://tiny.amazon.com/jje9mvzv/quipLs4V

What's changing and why?

Added a new hyp describe cluster command that provides information about HyperPod clusters.

  • This command fills a gap in the CLI functionality.
  • hyp describe cluster-stack existed, but there was no equivalent command to describe the cluster resources directly.
  • This command will be shown to users so they can use it when they delete clusters with the future hyp delete cluster command.
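
A minimal sketch of the kind of lookup such a command could perform, assuming it wraps the SageMaker DescribeCluster API through a boto3-style client. The function name fetch_cluster_details and the injected client parameter are illustrative and are not taken from this PR.

```python
from typing import Any, Dict, Protocol


class SageMakerLike(Protocol):
    """Anything exposing describe_cluster, e.g. boto3.client("sagemaker")."""

    def describe_cluster(self, *, ClusterName: str) -> Dict[str, Any]: ...


def fetch_cluster_details(client: SageMakerLike, cluster_name: str) -> Dict[str, Any]:
    """Return the DescribeCluster response without boto3's ResponseMetadata.

    Hypothetical helper: the real command may structure this differently.
    """
    response = client.describe_cluster(ClusterName=cluster_name)
    # ResponseMetadata is request bookkeeping, not cluster state, so drop it.
    return {k: v for k, v in response.items() if k != "ResponseMetadata"}
```

With boto3 installed, the client could be created as boto3.client("sagemaker", region_name="us-east-2") and passed in. Injecting the client keeps the helper easy to exercise with a fake in unit tests.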

Before/After UX

Before:

Previously, users could only use hyp list-cluster to get information about their clusters.
The hyp list-cluster command outputs:

{
        "Cluster": "ml-cluster-integ-test",
        "Instances": [
            {
                "InstanceType": "ml.c5.2xlarge",
                "TotalNodes": 30,
                "AcceleratorDevicesAvailable": "N/A",
                "NodeHealthStatus=Schedulable": 30,
                "DeepHealthCheckStatus=Passed": "N/A"
            },
            {
                "InstanceType": "ml.g5.8xlarge",
                "TotalNodes": 13,
                "AcceleratorDevicesAvailable": 13,
                "NodeHealthStatus=Schedulable": 13,
                "DeepHealthCheckStatus=Passed": 13
            },
            {
                "InstanceType": "ml.g5.2xlarge",
                "TotalNodes": 1,
                "AcceleratorDevicesAvailable": 1,
                "NodeHealthStatus=Schedulable": 1,
                "DeepHealthCheckStatus=Passed": "N/A"
            }
        ]
    }
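
To illustrate what this output can and cannot answer, here is a small sketch that parses a trimmed copy of the JSON above and totals the node counts. The helper name summarize_capacity is hypothetical. The result is capacity information only; nothing in it describes the cluster's ARN, VPC, or orchestrator configuration.

```python
import json

# Trimmed copy of the list-cluster output shown above.
LIST_CLUSTER_OUTPUT = """
{
  "Cluster": "ml-cluster-integ-test",
  "Instances": [
    {"InstanceType": "ml.c5.2xlarge", "TotalNodes": 30, "NodeHealthStatus=Schedulable": 30},
    {"InstanceType": "ml.g5.8xlarge", "TotalNodes": 13, "NodeHealthStatus=Schedulable": 13},
    {"InstanceType": "ml.g5.2xlarge", "TotalNodes": 1, "NodeHealthStatus=Schedulable": 1}
  ]
}
"""


def summarize_capacity(raw: str) -> dict:
    """Total the per-instance-type node counts for one cluster entry."""
    cluster = json.loads(raw)
    instances = cluster["Instances"]
    return {
        "cluster": cluster["Cluster"],
        "total_nodes": sum(i["TotalNodes"] for i in instances),
        "schedulable_nodes": sum(i["NodeHealthStatus=Schedulable"] for i in instances),
    }


print(summarize_capacity(LIST_CLUSTER_OUTPUT))
# {'cluster': 'ml-cluster-integ-test', 'total_nodes': 44, 'schedulable_nodes': 44}
```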

After:

Now, if users want details on a specific cluster, they can run:

$ hyp describe cluster hyperpod-cluster
📋 Cluster Details for: hyperpod-cluster
Status: InService
 ClusterArn               | arn:aws:sagemaker:us-east-2:123456789012:cluster/hyperpod-cluster
 ClusterName              | hyperpod-cluster
 ClusterStatus            | InService
 CreationTime             | 2025-09-23 14:35:38
 InstanceGroups           | [
                          |   {
                          |     "CurrentCount": 1,
                          |     "TargetCount": 1,
                          |     "InstanceGroupName": "controller-group",
                          |     "InstanceType": "ml.t3.medium",
                          |     "LifeCycleConfig": {
                          |       "SourceS3Uri": "s3://my-hyperpod-bucket",
                          |       "OnCreate": "on_create.sh"
                          |     },
                          |     "ExecutionRole": "arn:aws:iam::123456789012:role/HyperPodExecutionRole",
                          |     "ThreadsPerCore": 1,
                          |     "InstanceStorageConfigs": [
                          |       {
                          |         "EbsVolumeConfig": {
                          |           "VolumeSizeInGB": 500
                          |         }
                          |       }
                          |     ],
                          |     "Status": "InService"
                          |   }
                          | ]
 VpcConfig                | {
                          |   "SecurityGroupIds": ["sg-1234567890abcdef0"],
                          |   "Subnets": ["subnet-1234567890abcdef0"]
                          | }
 Orchestrator             | {
                          |   "Eks": {
                          |     "ClusterArn": "arn:aws:eks:us-east-2:123456789012:cluster/eks-cluster"
                          |   }
                          | }
 NodeRecovery             | Automatic
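
The key/value layout above could be produced along the following lines. This is a sketch under assumptions, not the formatter from this PR: the names format_details and render_value and the column width are made up for illustration, and json.dumps with indent=2 puts each list item on its own line, so nested lists would look slightly different from the sample.

```python
import json
from datetime import datetime
from typing import Any, Dict, List

KEY_WIDTH = 24  # assumed column width, chosen to resemble the sample output


def render_value(value: Any) -> List[str]:
    """Turn one field value into one or more display lines."""
    if isinstance(value, (dict, list)):
        # Nested structures are pretty-printed as JSON, one row per line.
        return json.dumps(value, indent=2).splitlines()
    if isinstance(value, datetime):
        return [value.strftime("%Y-%m-%d %H:%M:%S")]
    return [str(value)]


def format_details(details: Dict[str, Any]) -> str:
    """Render fields as ' Key | value' rows, with blank keys on continuation rows."""
    rows = []
    for key, value in details.items():
        first, *rest = render_value(value)
        rows.append(f" {key:<{KEY_WIDTH}} | {first}")
        rows.extend(f" {'':<{KEY_WIDTH}} | {line}" for line in rest)
    return "\n".join(rows)
```

Padding every key to a fixed width keeps the separator in the same column on every row, which is what makes the multi-line JSON values readable.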

How was this change tested?

Tested the command manually with different scenarios, including valid and invalid cluster names and different AWS Regions.

Are unit tests added?

Yes, added 7 test cases:

  • happy case with successful cluster output
  • region flag testing
  • unknown cluster name
  • access denied mock scenarios
  • generic AWS API errors
  • debug flag functionality
  • empty response handling
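
The scenarios listed above can be sketched with unittest.mock as follows. Everything here is illustrative: describe_or_explain is a hypothetical wrapper, FakeClientError stands in for botocore.exceptions.ClientError so the example has no AWS dependency, and the error codes are assumptions about what the service returns rather than values taken from this PR's tests.

```python
from unittest.mock import MagicMock


class FakeClientError(Exception):
    """Stand-in for botocore.exceptions.ClientError, which carries a
    `response` dict holding the service error code."""

    def __init__(self, code: str, message: str):
        super().__init__(message)
        self.response = {"Error": {"Code": code, "Message": message}}


def describe_or_explain(client, cluster_name: str) -> str:
    """Hypothetical wrapper: return a status line or a friendly error."""
    try:
        response = client.describe_cluster(ClusterName=cluster_name)
    except FakeClientError as err:
        code = err.response["Error"]["Code"]
        if code == "ResourceNotFound":  # assumed code for an unknown cluster
            return f"Cluster '{cluster_name}' not found."
        if code == "AccessDeniedException":  # assumed code for missing permissions
            return "Access denied: check your IAM permissions."
        return f"AWS error ({code}): {err}"
    if not response:
        return "No cluster data returned."
    return f"Status: {response['ClusterStatus']}"


def test_happy_path():
    client = MagicMock()
    client.describe_cluster.return_value = {"ClusterStatus": "InService"}
    assert describe_or_explain(client, "hyperpod-cluster") == "Status: InService"
    client.describe_cluster.assert_called_once_with(ClusterName="hyperpod-cluster")


def test_unknown_cluster():
    client = MagicMock()
    client.describe_cluster.side_effect = FakeClientError("ResourceNotFound", "no such cluster")
    assert describe_or_explain(client, "missing") == "Cluster 'missing' not found."


def test_empty_response():
    client = MagicMock()
    client.describe_cluster.return_value = {}
    assert describe_or_explain(client, "hyperpod-cluster") == "No cluster data returned."
```

Mocking the client rather than the network keeps these tests fast and lets each error path be triggered deterministically.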

Are integration tests added?

No

Reviewer Guidelines

‼️ Merge Requirements: PRs with failing integration tests cannot be merged without justification.

One of the following must be true:

  • All automated PR checks pass
  • Failed tests include local run results/screenshots proving they work
  • Changes are documentation-only

Mohamed Zeidan and others added 16 commits September 3, 2025 13:15
Co-authored-by: Mohamed Zeidan <[email protected]>
**Description**
- Updated README.md to fix broken internal navigation links, corrected SDK import paths, added proper syntax highlighting to code blocks.
- Fixed training SDK imports, observability utils import path, and cluster management workflow examples.

**Testing Done**
- Verified all anchor links work correctly in table of contents and usage sections
- Cross-referenced SDK imports against actual source code in src/sagemaker/hyperpod/
- Validated CLI commands match implementation in hyp_cli.py
- Confirmed code examples use correct class names and method signatures
…ws#246)

* Draft of inference logger bug fix

* Draft fix of inference logger for SDK

* Revert adding --debug flag

* Add debug parameter to failing unit tests

* Fix create_from_dict to not have hardcoded debug flag
* slurm-eks-helper-fix

* Small fix to test to reflect new changes
mohamedzeidan2021 pushed a commit that referenced this pull request Nov 21, 2025
* feat: Implement elastic training cli arguments

* Add elastic training unified config and unit test

* Add graceful shutdown and scaling timeout to cli args
mohamedzeidan2021 pushed a commit that referenced this pull request Nov 21, 2025
This reverts commit 18428ef2b1c0562bf51a9a4b4aa2914eed441259.
jam-jee pushed a commit that referenced this pull request Nov 21, 2025
* feat: Implement elastic training cli arguments (#273)

* feat: Implement elastic training cli arguments

* Add elastic training unified config and unit test

* Add graceful shutdown and scaling timeout to cli args

* Revert "feat: Implement elastic training cli arguments (#273)"

This reverts commit 18428ef2b1c0562bf51a9a4b4aa2914eed441259.

* Add dev_space_constants.py (#255)

Co-authored-by: Brian Xia <[email protected]>

* Add dev_space_access_constants.py (#256)

Co-authored-by: Brian Xia <[email protected]>

* Add space_admin_config_constants.py (#257)

Co-authored-by: Brian Xia <[email protected]>

* Add template package only (#261)

Co-authored-by: Brian Xia <[email protected]>

* Add dev_space.py CLI command (#263)

* Add dev_space.py CLI command

* Add dev space unit tests

---------

Co-authored-by: Brian Xia <[email protected]>

* Add dev_space_utils.py to work with the dev space template model (#262)

* Add dev_space_utils.py

* Add unit tests for dev_space_utils

---------

Co-authored-by: Brian Xia <[email protected]>

* Add dev space CLI (#269)

* Rename dev space to space (#272)

* Update the Space model and constants per latest operator (#275)

* Add space_admin_config.py CLI command (#260)

* Add space_admin_config.py CLI command

* Update the space admin config to space template

---------

Co-authored-by: Brian Xia <[email protected]>

* Implement CRUD operations for Space PySDK (#267)

* Implement CRUD operations for Space PySDK

* Update Space PySDK per new schema

* Update Space PySDK per new schema

* Implement the pySDK for the Space Template (#282)

* Refactor Space CLI using the Space PySDK (#281)

* Implement CRUD operations for Space PySDK

* Update Space PySDK per new schema

* Refactor CLI to use the PySDK

* Add dev_space_access.py CLI command (#259)

* Add dev_space_access.py CLI command

* Add space access creation to pySDK and refactor space access CLI

---------

Co-authored-by: Brian Xia <[email protected]>

* Listing space will filter out the spaces not created by the current user (#285)

* Implement CRUD operations for Space PySDK

* Update Space PySDK per new schema

* Implement CRUD operations for Space PySDK

* Update Space PySDK per new schema

* Update Space PySDK per new schema

* Implement space list pagination and creator filtering

* Refactor space template with PySDK (#286)

* Add additional Space parameters for resources including the fractional GPU (#287)

* Implement validation for mig profiles for Spaces (#291)

* Implement validation for mig profiles when creating/updating spaces

* Update Space parameter model

* Make Space Template namespaced resource

* Parker GA issues (#296)

* Update Space Template CLI to be namespaced

* Space get-logs default to the workspace container

* Remove error handling to bubble up the actual K8s errors

* Listing public Spaces

* Fix typos, elaborated text, add logic to parse idle-shutdown

* Fix the template ref regression (#300)

* Update SageMaker Space documentation (#301)

* Implement Space integration tests (#298)

Inference tests succeeded with parker-cli code - https://quip-amazon.com/fhwhAAMht0Mm/Project-Parker-HyperPod-User-Experience-for-Data-Scientist-persona

Parker-cli integ tests pass (shown below)

The failing inference tests are known to be flaky: https://w.amazon.com/bin/view/AWS/AmazonAI/Platform/Codex/CodexInfra/Runbooks/HyperPodCLI/TroubleshootInferenceTests#HTroubleshooting
A ticket has been created to fix these flaky tests: https://t.corp.amazon.com/V1943878058


Parker-cli integ tests passing

============================= test session starts ==============================
platform linux -- Python 3.11.14, pytest-8.3.2, pluggy-1.6.0 -- /root/.pyenv/versions/3.11.14/bin/python3.11
cachedir: .pytest_cache
rootdir: /codebuild/output/src1458832038/src/github.com/aws/private-sagemaker-hyperpod-cli-staging
configfile: setup.cfg
plugins: hydra-core-1.3.2, order-1.3.0, dependency-0.6.0, cov-5.0.0
collecting ... collected 39 items
test/integration_tests/space/cli/test_cli_space.py::TestSpaceCLI::test_space_create PASSED [  2%]
test/integration_tests/space/cli/test_cli_space.py::TestSpaceCLI::test_space_list_table PASSED [  5%]
test/integration_tests/space/cli/test_cli_space.py::TestSpaceCLI::test_space_list_json PASSED [  7%]
test/integration_tests/space/cli/test_cli_space.py::TestSpaceCLI::test_space_describe_yaml PASSED [ 10%]
test/integration_tests/space/cli/test_cli_space.py::TestSpaceCLI::test_space_describe_json PASSED [ 12%]
test/integration_tests/space/cli/test_cli_space.py::TestSpaceCLI::test_space_stop PASSED [ 15%]
test/integration_tests/space/cli/test_cli_space.py::TestSpaceCLI::test_space_start PASSED [ 17%]
test/integration_tests/space/cli/test_cli_space.py::TestSpaceCLI::test_space_update PASSED [ 20%]
test/integration_tests/space/cli/test_cli_space.py::TestSpaceCLI::test_space_get_logs PASSED [ 23%]
test/integration_tests/space/cli/test_cli_space.py::TestSpaceCLI::test_space_delete PASSED [ 25%]
test/integration_tests/space/cli/test_cli_space.py::TestSpaceCLI::test_space_list_empty_namespace PASSED [ 28%]
test/integration_tests/space/cli/test_cli_space.py::TestSpaceCLI::test_space_describe_nonexistent PASSED [ 30%]
test/integration_tests/space/cli/test_cli_space.py::TestSpaceCLI::test_space_delete_nonexistent PASSED [ 33%]
test/integration_tests/space/cli/test_cli_space_template.py::TestSpaceTemplateCLI::test_space_template_create PASSED [ 35%]
test/integration_tests/space/cli/test_cli_space_template.py::TestSpaceTemplateCLI::test_space_template_list_table PASSED [ 38%]
test/integration_tests/space/cli/test_cli_space_template.py::TestSpaceTemplateCLI::test_space_template_list_json PASSED [ 41%]
test/integration_tests/space/cli/test_cli_space_template.py::TestSpaceTemplateCLI::test_space_template_describe_yaml PASSED [ 43%]
test/integration_tests/space/cli/test_cli_space_template.py::TestSpaceTemplateCLI::test_space_template_describe_json PASSED [ 46%]
test/integration_tests/space/cli/test_cli_space_template.py::TestSpaceTemplateCLI::test_space_template_update PASSED [ 48%]
test/integration_tests/space/cli/test_cli_space_template.py::TestSpaceTemplateCLI::test_space_template_delete PASSED [ 51%]
test/integration_tests/space/cli/test_cli_space_template.py::TestSpaceTemplateCLI::test_space_template_list_empty_namespace PASSED [ 53%]
test/integration_tests/space/cli/test_cli_space_template.py::TestSpaceTemplateCLI::test_space_template_describe_nonexistent PASSED [ 56%]
test/integration_tests/space/cli/test_cli_space_template.py::TestSpaceTemplateCLI::test_space_template_delete_nonexistent PASSED [ 58%]
test/integration_tests/space/sdk/test_sdk_space.py::test_create_space PASSED [ 61%]
test/integration_tests/space/sdk/test_sdk_space.py::test_list_spaces PASSED [ 64%]
test/integration_tests/space/sdk/test_sdk_space.py::test_get_space PASSED [ 66%]
test/integration_tests/space/sdk/test_sdk_space.py::test_wait_until_running PASSED [ 69%]
test/integration_tests/space/sdk/test_sdk_space.py::test_update_space PASSED [ 71%]
test/integration_tests/space/sdk/test_sdk_space.py::test_stop_space PASSED [ 74%]
test/integration_tests/space/sdk/test_sdk_space.py::test_start_space PASSED [ 76%]
test/integration_tests/space/sdk/test_sdk_space.py::test_list_pods PASSED [ 79%]
test/integration_tests/space/sdk/test_sdk_space.py::test_get_logs PASSED [ 82%]
test/integration_tests/space/sdk/test_sdk_space.py::test_create_space_access SKIPPED [ 84%]
test/integration_tests/space/sdk/test_sdk_space.py::test_delete_space PASSED [ 87%]
test/integration_tests/space/sdk/test_sdk_space_template.py::TestHPSpaceTemplate::test_create_template PASSED [ 89%]
test/integration_tests/space/sdk/test_sdk_space_template.py::TestHPSpaceTemplate::test_list_templates PASSED [ 92%]
test/integration_tests/space/sdk/test_sdk_space_template.py::TestHPSpaceTemplate::test_get_template PASSED [ 94%]
test/integration_tests/space/sdk/test_sdk_space_template.py::TestHPSpaceTemplate::test_update_template PASSED [ 97%]
test/integration_tests/space/sdk/test_sdk_space_template.py::TestHPSpaceTemplate::test_delete_template PASSED [100%]
=============================== warnings summary ===============================

* merge conflicts fixed

* Update README for fractional gpu support (#294)

* Update README for fractional gpu support

* update pytorch job example

* add example for accelerator partitions

* merge conflicts from js template and inference

* update changelog

* uncommented install req

* uncommented

* fixed uncomment

---------

Co-authored-by: Sophia <[email protected]>
Co-authored-by: Molly He <[email protected]>
Co-authored-by: Brian Xia <[email protected]>
Co-authored-by: Brian Xia <[email protected]>
Co-authored-by: Mohamed Zeidan <[email protected]>
Co-authored-by: Ophelia Yang <[email protected]>
mollyheamazon added a commit that referenced this pull request Dec 3, 2025
* feat: Implement elastic training cli arguments (#273)

* feat: Implement elastic training cli arguments

* Add elastic training unified config and unit test

* Add graceful shutdown and scaling timeout to cli args

* Revert "feat: Implement elastic training cli arguments (#273)"

This reverts commit 18428ef2b1c0562bf51a9a4b4aa2914eed441259.

* feat: Implement elastic training cli arguments (#295)

* feat: implement elastic training cli args

* Rename args name to match crd for elastic training

* Add unit test for replica discrete values

* Add integ test for elastic training cli

---------

Co-authored-by: Sophia <[email protected]>
Co-authored-by: Molly He <[email protected]>
Co-authored-by: Mohamed Zeidan <[email protected]>
mollyheamazon added a commit that referenced this pull request Dec 3, 2025
* Upgrade Inference Operator Version (#327)

* pyproj version update (#328)

Co-authored-by: Mohamed Zeidan <[email protected]>

* version change (#329)

Co-authored-by: Mohamed Zeidan <[email protected]>

* elastic training to keynote3 (#307)

* feat: Implement elastic training cli arguments (#273)

* feat: Implement elastic training cli arguments

* Add elastic training unified config and unit test

* Add graceful shutdown and scaling timeout to cli args

* Revert "feat: Implement elastic training cli arguments (#273)"

This reverts commit 18428ef2b1c0562bf51a9a4b4aa2914eed441259.

* feat: Implement elastic training cli arguments (#295)

* feat: implement elastic training cli args

* Rename args name to match crd for elastic training

* Add unit test for replica discrete values

* Add integ test for elastic training cli

---------

Co-authored-by: Sophia <[email protected]>
Co-authored-by: Molly He <[email protected]>
Co-authored-by: Mohamed Zeidan <[email protected]>

* version update for v3.5.0

---------

Co-authored-by: Shantanu Tripathi <[email protected]>
Co-authored-by: Mohamed Zeidan <[email protected]>
Co-authored-by: Mohamed Zeidan <[email protected]>
Co-authored-by: Sophia <[email protected]>
4 participants