Collaborator

@rsareddy0329 rsareddy0329 commented Sep 23, 2025

What's changing and why?

This PR includes all the changes required for the user init experience in the HyperPod CLI.

Before/After UX

Before:

  • No init experience is available.

After:
  • Users can initialize a template, which generates the related config files so they can submit a training job or deploy an endpoint with the existing defaults.
    hyp init [OPTIONS] {hyp-jumpstart-endpoint|hyp-custom-endpoint|hyp-pytorch-job|cluster-stack} [DIRECTORY]

  • Users can configure parameters for the endpoint, PyTorch job, or cluster stack they want to create. These updates are reflected in config.yaml on their local machine.
    hyp configure [OPTIONS]

  • Users can syntactically validate their config.yaml file to ensure it is ready for creation.
    hyp validate

  • Users can reset their config.yaml to clear previously used values.
    hyp reset

  • Users can create an endpoint, cluster stack, or PyTorch job based on the template they are using. Endpoint creation can be monitored with hyp list and hyp describe.
    hyp create [OPTIONS]
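
The command sequence above can be sketched as a small Python helper that assembles the argument lists without running anything. The template names and the order of commands come from this description. The helper names (`build_init_command`, `workflow_commands`) and the `my-job` directory are hypothetical and used only for illustration. No `[OPTIONS]` flags are shown because the description does not enumerate them, and where the later commands must be run from is not stated here.

```python
from typing import List, Optional

# Template names listed in the `hyp init` usage line of this PR description.
TEMPLATES = (
    "hyp-jumpstart-endpoint",
    "hyp-custom-endpoint",
    "hyp-pytorch-job",
    "cluster-stack",
)


def build_init_command(template: str, directory: Optional[str] = None) -> List[str]:
    """Return the argv for `hyp init TEMPLATE [DIRECTORY]` (hypothetical helper)."""
    if template not in TEMPLATES:
        raise ValueError(f"unknown template {template!r}; expected one of {TEMPLATES}")
    argv = ["hyp", "init", template]
    if directory is not None:
        argv.append(directory)
    return argv


def workflow_commands(template: str, directory: Optional[str] = None) -> List[List[str]]:
    """Commands in the order the description presents them, with options omitted."""
    return [
        build_init_command(template, directory),
        ["hyp", "configure"],  # accepts [OPTIONS]; none are listed in the description
        ["hyp", "validate"],
        ["hyp", "create"],     # accepts [OPTIONS]; none are listed in the description
    ]


if __name__ == "__main__":
    # Print the commands instead of executing them, so the sketch runs without the CLI.
    for argv in workflow_commands("hyp-pytorch-job", "my-job"):
        print(" ".join(argv))
```

`hyp reset` is left out of the sequence because it is only needed to clear previously used values from config.yaml.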

How was this change tested?

With unit tests and integration tests, and by manually running the commands against a local setup.

Are unit tests added?

Yes, they are included.

Are integration tests added?

Yes, they are included.

Reviewer Guidelines

‼️ Merge Requirements: PRs with failing integration tests cannot be merged without justification.

One of the following must be true:

  • All automated PR checks pass
  • Failed tests include local run results/screenshots proving they work
  • Changes are documentation-only

aviruthen and others added 10 commits September 23, 2025 13:23
* First draft integ tests

* Mini fixes to ensure integ tests work

* Allow integ tests to run from clean directory

* Change torch job creation namespace to default
* decouple template from src code

* update unit tests for init

* remove field validator from SDK pydantic model, fix minor parsing problem with list, update kubernetes_version type from str to float

* Update pyproject.toml for cluster stack template to include json, update read_only to be boolean

* change type handler from class to module functions, change some public function to private, update unit tests

* update create for pytorch job template, remove redundant integ test code for init
* return SDK class in pytorch model.py for v1_0 and v1_1, update pytorch_create function, update unit test

* remove name and namespace from create for inference SDK to match with training SDK, functionality remains the same

* fix unit test, add metadata class usage to example notebook, remove skip test

* fix unit test again

* update integ tests

* update create call
* decouple template from src code

* remove field validator from SDK pydantic model, fix minor parsing problem with list, update kubernetes_version type from str to float

* change type handler from class to module functions, change some public function to private, update unit tests

* cluster-stack template agnostic change

* update unit tests

* update integ test

* resolve circular import for cluster_stack

* resolve rebase merge conflict

* rename to_domain to to_config for cluster_stack

* increase timeout for endpoint integ test from 15min to 20min
* decouple template from src code

* remove field validator from SDK pydantic model, fix minor parsing problem with list, update kubernetes_version type from str to float

* change type handler from class to module functions, change some public function to private, update unit tests

* cluster-stack template agnostic change

* update unit tests

* update integ test

* resolve circular import for cluster_stack

* resolve rebase merge conflict

* rename to_domain to to_config for cluster_stack

* increase timeout for endpoint integ test from 15min to 20min

* move jinja template to schema template

* lazy loading in pytorch-job template to resolve import issue

* tasks_per_node validation added, correct typo for task governance related parameter

* get default namespace applied to inference for init experience, ignore pydantic warning, update logging experience

* update integ test

* fix integ test

* Update default namespace logic, init_constants.py naming change

* update unit test
* add telemetry to init experience, remove duplicate code in init_constants

* add filter for deprecation warning, fix hyp --version

* change default instance group name for instance group settings
…ce launch (#249)

* Release new version for Health Monitoring Agent (1.0.790.0_1.0.266.0) with minor improvements and bug fixes. (#254)

* changelog version update (#256)

Co-authored-by: Mohamed Zeidan <[email protected]>

* Fix README documentation and broken anchor links (#252)

**Description**
- Updated README.md to fix broken internal navigation links, corrected SDK import paths, added proper syntax highlighting to code blocks.
- Fixed training SDK imports, observability utils import path, and cluster management workflow examples.

**Testing Done**
- Verified all anchor links work correctly in table of contents and usage sections
- Cross-referenced SDK imports against actual source code in src/sagemaker/hyperpod/
- Validated CLI commands match implementation in hyp_cli.py
- Confirmed code examples use correct class names and method signatures

* Small bug fix to print debug messages for inference logger (PySDK) (#246)

* Draft of inference logger bug fix

* Draft fix of inference logger for SDK

* Revert adding --debug flag

* Add debug parameter to failing unit tests

* Fix create_from_dict to not have hardcoded debug flag

* Add code-coverage workflow to GitHub workflows (#257)

* Add code coverage workflow

* Update artifact version to v4

* Fixed report upload

* Simplified workflow using tox.ini

* Make sure coverage is on right source files

* Bug fix for 0 percent code coverage error

* Bump version to 3.2.2 (#260)

* Bump version to 3.2.2

**Description**
Update package version from 3.2.1 to 3.2.2 in pyproject.toml and setup.py files.

**Testing Done**
Version bump only - no functional changes requiring additional testing.

* Changelog update for v3.2.2

**Description**
Added details for Health Monitoring Agent updates to the changelog.

**Testing Done**
Production canary failure fixes validated.

* Changelog update for v3.2.2

**Description**
Updated the release date to represent the correct date.

**Testing Done**
No breaking changes.

* Bump hyperpod-pytorch-job-template to v1.1.2

**Description**
Update hyperpod-pytorch-job-template version from 1.1.1 to 1.1.2 and add changelog entry for node-count validation revert.

**Testing Done**
Version bump and changelog update - node-count validation revert functionality verified.

* Update readme to include review guidelines (#261)

* Update PR template

* Update template

* Update template format

* Update format

* Fix readme

* Feature: Delete Cluster Command (#250)

* delete cluster stack

* delete cluster stack

* removed unnecessary file

* unit tests

* more modular code

* refactored modular code

* sdk code added and improved modularity

* cleanup

* removed silent failure for sdk

* fixed unit tests

* integ tests

* 2 integ happycase tests

* changed test to use iam role instead of s3 bucket

---------

Co-authored-by: Mohamed Zeidan <[email protected]>

* Code Coverage for Integ Tests (#262)

* Code Coverage for Integ Tests

* Making sure target of coverage is correct

* Removing duplicate implementation

* Release new version for Health Monitoring Agent (1.0.819.0_1.0.267.0) with minor improvements and bug fixes. (#265)

1. New feature: NVML API check to detect hardware failures. The Nvidia SMI query check is disabled.
2. HMA can now detect file system read-only errors.
3. For compatibility with AL2023, non-NVIDIA devices use a separate daemonset for deployment.

* Removing duplicate cluster-creating integ test (#266)

* Access entry fix (#267)

* Fix Slurm failures from missing orchestration key (#268)

* slurm-eks-helper-fix

* Small fix to test to reflect new changes

* small fix after resolving merge conflict

---------

Co-authored-by: Xichao Wang <[email protected]>
Co-authored-by: Mohamed Zeidan <[email protected]>
Co-authored-by: Mohamed Zeidan <[email protected]>
Co-authored-by: papriwal <[email protected]>
Co-authored-by: aviruthen <[email protected]>
Co-authored-by: Zhaoqi <[email protected]>
Co-authored-by: jiayelamazon <[email protected]>
…h new documentation (#250)

* add example notebooks for init experience, update README to match with new documentation

* clear output
@rsareddy0329 rsareddy0329 requested a review from a team as a code owner September 23, 2025 21:56
Collaborator

@mollyheamazon mollyheamazon left a comment

LGTM. Only need to make sure integ tests pass.

@rsareddy0329 rsareddy0329 merged commit 315f7ec into main Sep 24, 2025
24 of 27 checks passed
jam-jee pushed a commit that referenced this pull request Nov 21, 2025
* feat: Implement elastic training cli arguments (#273)

* feat: Implement elastic training cli arguments

* Add elastic training unified config and unit test

* Add graceful shutdown and scaling timeout to cli args

* Revert "feat: Implement elastic training cli arguments (#273)"

This reverts commit 18428ef2b1c0562bf51a9a4b4aa2914eed441259.

* Add dev_space_constants.py (#255)

Co-authored-by: Brian Xia <[email protected]>

* Add dev_space_access_constants.py (#256)

Co-authored-by: Brian Xia <[email protected]>

* Add space_admin_config_constants.py (#257)

Co-authored-by: Brian Xia <[email protected]>

* Add template package only (#261)

Co-authored-by: Brian Xia <[email protected]>

* Add dev_space.py CLI command (#263)

* Add dev_space.py CLI command

* Add dev space unit tests

---------

Co-authored-by: Brian Xia <[email protected]>

* Add dev_space_utils.py to work with the dev space template model (#262)

* Add dev_space_utils.py

* Add unit tests for dev_space_utils

---------

Co-authored-by: Brian Xia <[email protected]>

* Add dev space CLI (#269)

* Rename dev space to space (#272)

* Update the Space model and constants per latest operator (#275)

* Add space_admin_config.py CLI command (#260)

* Add space_admin_config.py CLI command

* Update the space admin config to space template

---------

Co-authored-by: Brian Xia <[email protected]>

* Implement CRUD operations for Space PySDK (#267)

* Implement CRUD operations for Space PySDK

* Update Space PySDK per new schema

* Update Space PySDK per new schema

* Implement the pySDK for the Space Template (#282)

* Refactor Space CLI using the Space PySDK (#281)

* Implement CRUD operations for Space PySDK

* Update Space PySDK per new schema

* Refactor CLI to use the PySDK

* Add dev_space_access.py CLI command (#259)

* Add dev_space_access.py CLI command

* Add space access creation to pySDK and refactor space access CLI

---------

Co-authored-by: Brian Xia <[email protected]>

* Listing space will filter out the spaces not created by the current user (#285)

* Implement CRUD operations for Space PySDK

* Update Space PySDK per new schema

* Implement CRUD operations for Space PySDK

* Update Space PySDK per new schema

* Update Space PySDK per new schema

* Implement space list pagination and creator filtering

* Refactor space template with PySDK (#286)

* Add additional Space parameters for resources including the fractional GPU (#287)

* Implement validation for mig profiles for Spaces (#291)

* Implement validation for mig profiles when creating/updating spaces

* Update Space parameter model

* Make Space Template namespaced resource

* Parker GA issues (#296)

* Update Space Template CLI to be namespaced

* Space get-logs default to the workspace container

* Remove error handling to bubble up the actual K8s errors

* Listing public Spaces

* Fix typos, elaborated text, add logic to parse idle-shutdown

* Fix the template ref regression (#300)

* Update SageMaker Space documentation (#301)

* Implement Space integration tests (#298)

Inference tests succeeded with parker-cli code - https://quip-amazon.com/fhwhAAMht0Mm/Project-Parker-HyperPod-User-Experience-for-Data-Scientist-persona

Parker-cli integ tests pass (shown below)

The failing inference tests are known to be flaky: https://w.amazon.com/bin/view/AWS/AmazonAI/Platform/Codex/CodexInfra/Runbooks/HyperPodCLI/TroubleshootInferenceTests#HTroubleshooting
A ticket has been created to fix these flaky tests: https://t.corp.amazon.com/V1943878058


Parker-cli integ tests passing

============================= test session starts ==============================
platform linux -- Python 3.11.14, pytest-8.3.2, pluggy-1.6.0 -- /root/.pyenv/versions/3.11.14/bin/python3.11
cachedir: .pytest_cache
rootdir: /codebuild/output/src1458832038/src/github.com/aws/private-sagemaker-hyperpod-cli-staging
configfile: setup.cfg
plugins: hydra-core-1.3.2, order-1.3.0, dependency-0.6.0, cov-5.0.0
collecting ... collected 39 items
test/integration_tests/space/cli/test_cli_space.py::TestSpaceCLI::test_space_create PASSED [  2%]
test/integration_tests/space/cli/test_cli_space.py::TestSpaceCLI::test_space_list_table PASSED [  5%]
test/integration_tests/space/cli/test_cli_space.py::TestSpaceCLI::test_space_list_json PASSED [  7%]
test/integration_tests/space/cli/test_cli_space.py::TestSpaceCLI::test_space_describe_yaml PASSED [ 10%]
test/integration_tests/space/cli/test_cli_space.py::TestSpaceCLI::test_space_describe_json PASSED [ 12%]
test/integration_tests/space/cli/test_cli_space.py::TestSpaceCLI::test_space_stop PASSED [ 15%]
test/integration_tests/space/cli/test_cli_space.py::TestSpaceCLI::test_space_start PASSED [ 17%]
test/integration_tests/space/cli/test_cli_space.py::TestSpaceCLI::test_space_update PASSED [ 20%]
test/integration_tests/space/cli/test_cli_space.py::TestSpaceCLI::test_space_get_logs PASSED [ 23%]
test/integration_tests/space/cli/test_cli_space.py::TestSpaceCLI::test_space_delete PASSED [ 25%]
test/integration_tests/space/cli/test_cli_space.py::TestSpaceCLI::test_space_list_empty_namespace PASSED [ 28%]
test/integration_tests/space/cli/test_cli_space.py::TestSpaceCLI::test_space_describe_nonexistent PASSED [ 30%]
test/integration_tests/space/cli/test_cli_space.py::TestSpaceCLI::test_space_delete_nonexistent PASSED [ 33%]
test/integration_tests/space/cli/test_cli_space_template.py::TestSpaceTemplateCLI::test_space_template_create PASSED [ 35%]
test/integration_tests/space/cli/test_cli_space_template.py::TestSpaceTemplateCLI::test_space_template_list_table PASSED [ 38%]
test/integration_tests/space/cli/test_cli_space_template.py::TestSpaceTemplateCLI::test_space_template_list_json PASSED [ 41%]
test/integration_tests/space/cli/test_cli_space_template.py::TestSpaceTemplateCLI::test_space_template_describe_yaml PASSED [ 43%]
test/integration_tests/space/cli/test_cli_space_template.py::TestSpaceTemplateCLI::test_space_template_describe_json PASSED [ 46%]
test/integration_tests/space/cli/test_cli_space_template.py::TestSpaceTemplateCLI::test_space_template_update PASSED [ 48%]
test/integration_tests/space/cli/test_cli_space_template.py::TestSpaceTemplateCLI::test_space_template_delete PASSED [ 51%]
test/integration_tests/space/cli/test_cli_space_template.py::TestSpaceTemplateCLI::test_space_template_list_empty_namespace PASSED [ 53%]
test/integration_tests/space/cli/test_cli_space_template.py::TestSpaceTemplateCLI::test_space_template_describe_nonexistent PASSED [ 56%]
test/integration_tests/space/cli/test_cli_space_template.py::TestSpaceTemplateCLI::test_space_template_delete_nonexistent PASSED [ 58%]
test/integration_tests/space/sdk/test_sdk_space.py::test_create_space PASSED [ 61%]
test/integration_tests/space/sdk/test_sdk_space.py::test_list_spaces PASSED [ 64%]
test/integration_tests/space/sdk/test_sdk_space.py::test_get_space PASSED [ 66%]
test/integration_tests/space/sdk/test_sdk_space.py::test_wait_until_running PASSED [ 69%]
test/integration_tests/space/sdk/test_sdk_space.py::test_update_space PASSED [ 71%]
test/integration_tests/space/sdk/test_sdk_space.py::test_stop_space PASSED [ 74%]
test/integration_tests/space/sdk/test_sdk_space.py::test_start_space PASSED [ 76%]
test/integration_tests/space/sdk/test_sdk_space.py::test_list_pods PASSED [ 79%]
test/integration_tests/space/sdk/test_sdk_space.py::test_get_logs PASSED [ 82%]
test/integration_tests/space/sdk/test_sdk_space.py::test_create_space_access SKIPPED [ 84%]
test/integration_tests/space/sdk/test_sdk_space.py::test_delete_space PASSED [ 87%]
test/integration_tests/space/sdk/test_sdk_space_template.py::TestHPSpaceTemplate::test_create_template PASSED [ 89%]
test/integration_tests/space/sdk/test_sdk_space_template.py::TestHPSpaceTemplate::test_list_templates PASSED [ 92%]
test/integration_tests/space/sdk/test_sdk_space_template.py::TestHPSpaceTemplate::test_get_template PASSED [ 94%]
test/integration_tests/space/sdk/test_sdk_space_template.py::TestHPSpaceTemplate::test_update_template PASSED [ 97%]
test/integration_tests/space/sdk/test_sdk_space_template.py::TestHPSpaceTemplate::test_delete_template PASSED [100%]
=============================== warnings summary ===============================

* merge conflicts fixed

* Update README for fractional gpu support (#294)

* Update README for fractional gpu support

* update pytorch job example

* add example for accelerator partitions

* merge conflicts from js template and inference

* update changelog

* uncommented install req

* uncommented

* fixed uncomment

---------

Co-authored-by: Sophia <[email protected]>
Co-authored-by: Molly He <[email protected]>
Co-authored-by: Brian Xia <[email protected]>
Co-authored-by: Brian Xia <[email protected]>
Co-authored-by: Mohamed Zeidan <[email protected]>
Co-authored-by: Ophelia Yang <[email protected]>
4 participants