Skip to content

Conversation

@mohamedzeidan2021
Copy link
Collaborator

What's changing and why?

the pytorch job expects 1 of the following 2 (not both)

  1. node-count
  2. combination of accelerators, vcpu, memory-in-gib must be specified for instance-type ml.p4d.24xlarge

this change reverts the default of node_count back to None, so users can specify on the pytorch job flags whether they want one or the other.

How was this change tested?

tested locally by running commands like

hyp create hyp-pytorch-job \
  --version 1.1 \
  --job-name my-training-job \
  --image pytorch/pytorch:latest \
  --command '[python, train.py]' \ 
  --instance-type ml.g5.8xlarge \
  --tasks-per-node 1 \
  --accelerators 1 \
  --accelerators-limit 1 \
  --vcpu 8 \
  --vcpu-limit 8 \
  --memory 32 \
  --memory-limit 32

Unit test coverage

  • All new/modified code has unit tests
  • Coverage verified for changed code
  • N/A - no testable code changes

Do we need integration tests?

  • Yes - integration tests added
  • No - unit tests sufficient
  • No - infrastructure/config change only
  • Unsure - please advise

Checklist

  • PR title clearly describes the change
  • No sensitive information exposed and security is maintained
  • Ready for review

@mohamedzeidan2021 mohamedzeidan2021 merged commit d30a69b into aws:main Sep 2, 2025
5 of 6 checks passed
papriwal pushed a commit to papriwal/sagemaker-hyperpod-cli that referenced this pull request Sep 4, 2025
Co-authored-by: Mohamed Zeidan <[email protected]>
mohamedzeidan2021 added a commit to mohamedzeidan2021/sagemaker-hyperpod-cli that referenced this pull request Sep 10, 2025
Co-authored-by: Mohamed Zeidan <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants