Skip to content

Conversation

@aviruthen
Copy link
Collaborator

What's changing and why?

This is a bug fix relating to the following GitHub issue: #264
Essentially, the "Orchestrator" key is only available for EKS clusters and not Slurm clusters. This leads to KeyErrors whenever a user has a running Slurm Cluster but calls commands like hyp create (which calls list_clusters() in the backend and was failing previously) or hyp set-cluster-context (which calls set_cluster_context() in the background, which gives an odd KeyError if trying to set your cluster context to a slurm cluster).

Before/After UX

Before:
Customers could not use certain HP CLI commands (such as create() or set_cluster_context()) effectively if they had at least one running Slurm cluster in their account.

After:
Customers can use HP CLI normally while simultaneously having running Slurm clusters in their account.

How was this change tested?

Verified that fixes enable endpoint deployment for customers with Slurm cluster in their account. Waiting for unit/integ tests to run in GItHub workflow

Are unit tests added?

Not needed

Are integration tests added?

We may want to add a placeholder Slurm cluster in our beta/prod account to ensure that this isn't an issue for the future. This does not require any changes to existing tests, but adding this could help us prevent issues like this in the future.

Reviewer Guidelines

‼️ Merge Requirements: PRs with failing integration tests cannot be merged without justification.

One of the following must be true:

  • All automated PR checks pass
  • Failed tests include local run results/screenshots proving they work
  • Changes are documentation-only

@aviruthen aviruthen requested a review from a team as a code owner September 16, 2025 20:39
@aviruthen aviruthen merged commit da2df2f into aws:main Sep 17, 2025
7 checks passed
mohamedzeidan2021 pushed a commit to mohamedzeidan2021/sagemaker-hyperpod-cli that referenced this pull request Sep 23, 2025
* slurm-eks-helper-fix

* Small fix to test to reflect new changes
rsareddy0329 pushed a commit that referenced this pull request Sep 23, 2025
…ce launch (#249)

* Release new version for Health Monitoring Agent (1.0.790.0_1.0.266.0) with minor improvements and bug fixes. (#254)

* changelog version update (#256)

Co-authored-by: Mohamed Zeidan <[email protected]>

* Fix README documentation and broken anchor links (#252)

**Description**
- Updated README.md to fix broken internal navigation links, corrected SDK import paths, added proper syntax highlighting to code blocks.
- Fixed training SDK imports, observability utils import path, and cluster management workflow examples.

**Testing Done**
- Verified all anchor links work correctly in table of contents and usage sections
- Cross-referenced SDK imports against actual source code in src/sagemaker/hyperpod/
- Validated CLI commands match implementation in hyp_cli.py
- Confirmed code examples use correct class names and method signatures

* Small bug fix to print debug messages for inference logger (PySDK) (#246)

* Draft of inference logger bug fix

* Draft fix of inference logger for SDK

* Revert adding --debug flag

* Add debug parameter to failing unit tests

* Fix create_from_dict to not have hardcoded debug flag

* Add code-coverage workflow to GitHub workflows (#257)

* Add code coverage workflow

* Update artifact version to v4

* Fixed report upload

* Simplified workflow using tox.ini

* Make sure coverage is on right source files

* Bug fix for 0 percent code coverage error

* Bump version to 3.2.2 (#260)

* Bump version to 3.2.2

**Description**
Update package version from 3.2.1 to 3.2.2 in pyproject.toml and setup.py files.

**Testing Done**
Version bump only - no functional changes requiring additional testing.

* Changelog update for v3.2.2

**Description**
Added detaisl for Health Monitoring Agent updates to changelog

**Testing Done**
Production canary failure fixes validated.

* Changelog update for v3.2.2

**Description**
Updated the release date to represent the correct date.

**Testing Done**
No breaking changes.

* Bump hyperpod-pytorch-job-template to v1.1.2

**Description**
Update hyperpod-pytorch-job-template version from 1.1.1 to 1.1.2 and add changelog entry for node-count validation revert.

**Testing Done**
Version bump and changelog update - node-count validation revert functionality verified.

* Update readme to include review guidelines (#261)

* Update PR template

* Update template

* Update template format

* Update format

* Fix readme

* Feature: Delete Cluster Command (#250)

* delete cluster stack

* delete cluster stack

* removed unnecessary file

* unit tests

* more modular code

* refactored modular code

* sdk code added and improved modularity

* cleanup

* removed silent failure for sdk

* fixed unit tests

* integ tests

* 2 integ happycase tests

* changed test to use iam role instead of s3 bucket

---------

Co-authored-by: Mohamed Zeidan <[email protected]>

* Code Coverage for Integ Tests (#262)

* Code Coverage for Integ Tests

* Making sure target of coverage is correct

* Removing duplicate implementation

* Release new version for Health Monitoring Agent (1.0.819.0_1.0.267.0) with minor improvements and bug fixes. (#265)

1. New feature NVML API Check to detect hardware failure. Disabled Nvidia SMI query check
2. HMA will be able to detect File system read only error
3. For compatibility with AL2023, Non-NVIDIA devices will use a separate daemonset for deployment.

* Removing duplicate cluster-creating integ test (#266)

* Access entry fix (#267)

* Fix Slurm failures from missing orchestration key (#268)

* slurm-eks-helper-fix

* Small fix to test to reflect new changes

* small fix after resolving merge conflict

---------

Co-authored-by: Xichao Wang <[email protected]>
Co-authored-by: Mohamed Zeidan <[email protected]>
Co-authored-by: Mohamed Zeidan <[email protected]>
Co-authored-by: papriwal <[email protected]>
Co-authored-by: aviruthen <[email protected]>
Co-authored-by: Zhaoqi <[email protected]>
Co-authored-by: jiayelamazon <[email protected]>
rsareddy0329 pushed a commit that referenced this pull request Sep 24, 2025
…ce launch (#249)

* Release new version for Health Monitoring Agent (1.0.790.0_1.0.266.0) with minor improvements and bug fixes. (#254)

* changelog version update (#256)

Co-authored-by: Mohamed Zeidan <[email protected]>

* Fix README documentation and broken anchor links (#252)

**Description**
- Updated README.md to fix broken internal navigation links, corrected SDK import paths, added proper syntax highlighting to code blocks.
- Fixed training SDK imports, observability utils import path, and cluster management workflow examples.

**Testing Done**
- Verified all anchor links work correctly in table of contents and usage sections
- Cross-referenced SDK imports against actual source code in src/sagemaker/hyperpod/
- Validated CLI commands match implementation in hyp_cli.py
- Confirmed code examples use correct class names and method signatures

* Small bug fix to print debug messages for inference logger (PySDK) (#246)

* Draft of inference logger bug fix

* Draft fix of inference logger for SDK

* Revert adding --debug flag

* Add debug parameter to failing unit tests

* Fix create_from_dict to not have hardcoded debug flag

* Add code-coverage workflow to GitHub workflows (#257)

* Add code coverage workflow

* Update artifact version to v4

* Fixed report upload

* Simplified workflow using tox.ini

* Make sure coverage is on right source files

* Bug fix for 0 percent code coverage error

* Bump version to 3.2.2 (#260)

* Bump version to 3.2.2

**Description**
Update package version from 3.2.1 to 3.2.2 in pyproject.toml and setup.py files.

**Testing Done**
Version bump only - no functional changes requiring additional testing.

* Changelog update for v3.2.2

**Description**
Added detaisl for Health Monitoring Agent updates to changelog

**Testing Done**
Production canary failure fixes validated.

* Changelog update for v3.2.2

**Description**
Updated the release date to represent the correct date.

**Testing Done**
No breaking changes.

* Bump hyperpod-pytorch-job-template to v1.1.2

**Description**
Update hyperpod-pytorch-job-template version from 1.1.1 to 1.1.2 and add changelog entry for node-count validation revert.

**Testing Done**
Version bump and changelog update - node-count validation revert functionality verified.

* Update readme to include review guidelines (#261)

* Update PR template

* Update template

* Update template format

* Update format

* Fix readme

* Feature: Delete Cluster Command (#250)

* delete cluster stack

* delete cluster stack

* removed unnecessary file

* unit tests

* more modular code

* refactored modular code

* sdk code added and improved modularity

* cleanup

* removed silent failure for sdk

* fixed unit tests

* integ tests

* 2 integ happycase tests

* changed test to use iam role instead of s3 bucket

---------

Co-authored-by: Mohamed Zeidan <[email protected]>

* Code Coverage for Integ Tests (#262)

* Code Coverage for Integ Tests

* Making sure target of coverage is correct

* Removing duplicate implementation

* Release new version for Health Monitoring Agent (1.0.819.0_1.0.267.0) with minor improvements and bug fixes. (#265)

1. New feature NVML API Check to detect hardware failure. Disabled Nvidia SMI query check
2. HMA will be able to detect File system read only error
3. For compatibility with AL2023, Non-NVIDIA devices will use a separate daemonset for deployment.

* Removing duplicate cluster-creating integ test (#266)

* Access entry fix (#267)

* Fix Slurm failures from missing orchestration key (#268)

* slurm-eks-helper-fix

* Small fix to test to reflect new changes

* small fix after resolving merge conflict

---------

Co-authored-by: Xichao Wang <[email protected]>
Co-authored-by: Mohamed Zeidan <[email protected]>
Co-authored-by: Mohamed Zeidan <[email protected]>
Co-authored-by: papriwal <[email protected]>
Co-authored-by: aviruthen <[email protected]>
Co-authored-by: Zhaoqi <[email protected]>
Co-authored-by: jiayelamazon <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants