diff --git a/doc/cli/cluster_management/cli_cluster_management.md b/doc/cli/cluster_management/cli_cluster_management.md index e626d0a5..efe41433 100644 --- a/doc/cli/cluster_management/cli_cluster_management.md +++ b/doc/cli/cluster_management/cli_cluster_management.md @@ -10,9 +10,9 @@ Complete reference for SageMaker HyperPod cluster management parameters and conf * [Initialize Configuration](#hyp-init) * [Create Cluster Stack](#hyp-create) -* [Update Cluster](#hyp-update-hyp-cluster) -* [List Cluster Stacks](#hyp-list-hyp-cluster) -* [Describe Cluster Stack](#hyp-describe-hyp-cluster) +* [Update Cluster](#hyp-update-cluster) +* [List Cluster Stacks](#hyp-list-cluster-stack) +* [Describe Cluster Stack](#hyp-describe-cluster-stack) * [List HyperPod Clusters](#hyp-list-cluster) * [Set Cluster Context](#hyp-set-cluster-context) * [Get Cluster Context](#hyp-get-cluster-context) @@ -36,12 +36,14 @@ hyp init TEMPLATE [DIRECTORY] [OPTIONS] | Parameter | Type | Required | Description | |-----------|------|----------|-------------| -| `TEMPLATE` | CHOICE | Yes | Template type (hyp-cluster, hyp-pytorch-job, hyp-custom-endpoint, hyp-jumpstart-endpoint) | +| `TEMPLATE` | CHOICE | Yes | Template type (cluster-stack, hyp-pytorch-job, hyp-custom-endpoint, hyp-jumpstart-endpoint) | | `DIRECTORY` | PATH | No | Target directory (default: current directory) | | `--version` | TEXT | No | Schema version to use | ```{important} The `resource_name_prefix` parameter in the generated `config.yaml` file serves as the primary identifier for all AWS resources created during deployment. Each deployment must use a unique resource name prefix to avoid conflicts. This prefix is automatically appended with a unique identifier during cluster creation to ensure resource uniqueness. + +**Cluster stack names must be unique within each AWS region.** If you attempt to create a cluster stack with a name that already exists in the same region, the deployment will fail. 
``` ## hyp create @@ -61,14 +63,18 @@ hyp create [OPTIONS] | `--region` | TEXT | No | AWS region where the cluster stack will be created | | `--debug` | FLAG | No | Enable debug logging | -## hyp update hyp-cluster +## hyp update cluster Update an existing HyperPod cluster configuration. +```{important} +**Runtime vs Configuration Commands**: This command modifies an **existing, deployed cluster's** runtime settings (instance groups, node recovery). This is different from `hyp configure`, which only modifies local configuration files before cluster creation. +``` + #### Syntax ```bash -hyp update hyp-cluster [OPTIONS] +hyp update cluster [OPTIONS] ``` #### Parameters @@ -82,14 +88,14 @@ hyp update hyp-cluster [OPTIONS] | `--node-recovery` | TEXT | No | Node recovery setting (Automatic or None) | | `--debug` | FLAG | No | Enable debug logging | -## hyp list hyp-cluster +## hyp list cluster-stack List all HyperPod cluster stacks (CloudFormation stacks). #### Syntax ```bash -hyp list hyp-cluster [OPTIONS] +hyp list cluster-stack [OPTIONS] ``` #### Parameters @@ -100,14 +106,18 @@ hyp list hyp-cluster [OPTIONS] | `--status` | TEXT | No | Filter by stack status. Format: "['CREATE_COMPLETE', 'UPDATE_COMPLETE']" | | `--debug` | FLAG | No | Enable debug logging | -## hyp describe hyp-cluster +## hyp describe cluster-stack Describe a specific HyperPod cluster stack. +```{note} +**Region-Specific Stack Names**: Cluster stack names are unique within each AWS region. When describing a stack, ensure you specify the correct region where the stack was created, or the command will fail to find the stack. +``` + #### Syntax ```bash -hyp describe hyp-cluster STACK-NAME [OPTIONS] +hyp describe cluster-stack STACK-NAME [OPTIONS] ``` #### Parameters @@ -195,6 +205,10 @@ hyp get-monitoring [OPTIONS] Configure cluster parameters interactively or via command line. 
+```{important} +**Pre-Deployment Configuration**: This command modifies local `config.yaml` files **before** cluster creation. For updating **existing, deployed clusters**, use `hyp update cluster` instead. +``` + #### Syntax ```bash @@ -208,13 +222,23 @@ This command dynamically supports all configuration parameters available in the | Parameter | Type | Required | Description | |-----------|------|----------|-------------| | `--resource-name-prefix` | TEXT | No | Prefix for all AWS resources | -| `--stage` | TEXT | No | Deployment stage ("gamma" or "prod") | -| `--vpc-cidr` | TEXT | No | VPC CIDR block | -| `--kubernetes-version` | TEXT | No | Kubernetes version for EKS cluster | +| `--create-hyperpod-cluster-stack` | BOOLEAN | No | Create HyperPod Cluster Stack | +| `--hyperpod-cluster-name` | TEXT | No | Name of SageMaker HyperPod Cluster | +| `--create-eks-cluster-stack` | BOOLEAN | No | Create EKS Cluster Stack | +| `--kubernetes-version` | TEXT | No | Kubernetes version | +| `--eks-cluster-name` | TEXT | No | Name of the EKS cluster | +| `--create-helm-chart-stack` | BOOLEAN | No | Create Helm Chart Stack | +| `--namespace` | TEXT | No | Namespace to deploy HyperPod Helm chart | +| `--node-provisioning-mode` | TEXT | No | Continuous provisioning mode | | `--node-recovery` | TEXT | No | Node recovery setting ("Automatic" or "None") | -| `--env` | JSON | No | Environment variables as JSON object | -| `--args` | JSON | No | Command arguments as JSON array | -| `--command` | JSON | No | Command to run as JSON array | +| `--create-vpc-stack` | BOOLEAN | No | Create VPC Stack | +| `--vpc-id` | TEXT | No | Existing VPC ID | +| `--vpc-cidr` | TEXT | No | VPC CIDR block | +| `--create-security-group-stack` | BOOLEAN | No | Create Security Group Stack | +| `--enable-hp-inference-feature` | BOOLEAN | No | Enable inference operator | +| `--stage` | TEXT | No | Deployment stage ("gamma" or "prod") | +| `--create-fsx-stack` | BOOLEAN | No | Create FSx Stack | +| 
`--storage-capacity` | INTEGER | No | FSx storage capacity in GiB | | `--tags` | JSON | No | Resource tags as JSON object | **Note:** The exact parameters available depend on your current template type and version. Run `hyp configure --help` to see all available options for your specific configuration. @@ -302,18 +326,56 @@ The `config.yaml` file supports the following parameters: | Parameter | Type | Description | Default | |-----------|------|-------------|---------| -| `template` | TEXT | Template name | "hyp-cluster" | -| `namespace` | TEXT | Kubernetes namespace | "kube-system" | -| `stage` | TEXT | Deployment stage | "gamma" | -| `resource_name_prefix` | TEXT | Resource name prefix | "sagemaker-hyperpod-eks" | -| `vpc_cidr` | TEXT | VPC CIDR block | "10.192.0.0/16" | +| `resource_name_prefix` | TEXT | Prefix for all AWS resources (4-digit UUID added during submission) | "hyp-eks-stack" | +| `create_hyperpod_cluster_stack` | BOOLEAN | Create HyperPod Cluster Stack | true | +| `hyperpod_cluster_name` | TEXT | Name of SageMaker HyperPod Cluster | "hyperpod-cluster" | +| `create_eks_cluster_stack` | BOOLEAN | Create EKS Cluster Stack | true | | `kubernetes_version` | TEXT | Kubernetes version | "1.31" | -| `node_recovery` | TEXT | Node recovery setting | "Automatic" | -| `create_vpc_stack` | BOOLEAN | Create new VPC | true | -| `create_eks_cluster_stack` | BOOLEAN | Create new EKS cluster | true | -| `create_hyperpod_cluster_stack` | BOOLEAN | Create HyperPod cluster | true | - -**Note:** The actual available configuration parameters depend on the specific template schema version. Use `hyp init hyp-cluster` to see all available parameters for your version. 
+| `eks_cluster_name` | TEXT | Name of the EKS cluster | "eks-cluster" | +| `create_helm_chart_stack` | BOOLEAN | Create Helm Chart Stack | true | +| `namespace` | TEXT | Namespace to deploy HyperPod Helm chart | "kube-system" | +| `helm_repo_url` | TEXT | URL of Helm repo containing HyperPod Helm chart | "https://github.com/aws/sagemaker-hyperpod-cli.git" | +| `helm_repo_path` | TEXT | Path to HyperPod Helm chart in repo | "helm_chart/HyperPodHelmChart" | +| `helm_operators` | TEXT | Configuration of HyperPod Helm chart | "mlflow.enabled=true,trainingOperators.enabled=true,..." | +| `helm_release` | TEXT | Name for Helm chart release | "dependencies" | +| `node_provisioning_mode` | TEXT | Continuous provisioning mode ("Continuous" or empty) | "Continuous" | +| `node_recovery` | TEXT | Automatic node recovery ("Automatic" or "None") | "Automatic" | +| `instance_group_settings` | ARRAY | List of instance group configurations | [Default controller group] | +| `rig_settings` | ARRAY | Restricted instance group configurations | null | +| `rig_s3_bucket_name` | TEXT | S3 bucket for RIG resources | null | +| `tags` | ARRAY | Custom tags for SageMaker HyperPod cluster | null | +| `create_vpc_stack` | BOOLEAN | Create VPC Stack | true | +| `vpc_id` | TEXT | Existing VPC ID (if not creating new) | null | +| `vpc_cidr` | TEXT | IP range for VPC | "10.192.0.0/16" | +| `availability_zone_ids` | ARRAY | List of AZs to deploy subnets | null | +| `create_security_group_stack` | BOOLEAN | Create Security Group Stack | true | +| `security_group_id` | TEXT | Existing security group ID | null | +| `security_group_ids` | ARRAY | Security groups for HyperPod cluster | null | +| `private_subnet_ids` | ARRAY | Private subnet IDs for HyperPod cluster | null | +| `eks_private_subnet_ids` | ARRAY | Private subnet IDs for EKS cluster | null | +| `nat_gateway_ids` | ARRAY | NAT Gateway IDs for internet routing | null | +| `private_route_table_ids` | ARRAY | Private route table IDs | null | +| 
`create_s3_endpoint_stack` | BOOLEAN | Create S3 Endpoint stack | true | +| `enable_hp_inference_feature` | BOOLEAN | Enable inference operator | false | +| `stage` | TEXT | Deployment stage ("gamma" or "prod") | "prod" | +| `custom_bucket_name` | TEXT | S3 bucket name for templates | "sagemaker-hyperpod-cluster-stack-bucket" | +| `create_life_cycle_script_stack` | BOOLEAN | Create Life Cycle Script Stack | true | +| `create_s3_bucket_stack` | BOOLEAN | Create S3 Bucket Stack | true | +| `s3_bucket_name` | TEXT | S3 bucket for cluster lifecycle scripts | "s3-bucket" | +| `github_raw_url` | TEXT | Raw GitHub URL for lifecycle script | "https://raw.githubusercontent.com/aws-samples/..." | +| `on_create_path` | TEXT | File name of lifecycle script | "sagemaker-hyperpod-eks-bucket" | +| `create_sagemaker_iam_role_stack` | BOOLEAN | Create SageMaker IAM Role Stack | true | +| `sagemaker_iam_role_name` | TEXT | IAM role name for SageMaker cluster creation | "create-cluster-role" | +| `create_fsx_stack` | BOOLEAN | Create FSx Stack | true | +| `fsx_subnet_id` | TEXT | Subnet ID for FSx creation | "" | +| `fsx_availability_zone_id` | TEXT | Availability zone for FSx subnet | "" | +| `per_unit_storage_throughput` | INTEGER | Per unit storage throughput | 250 | +| `data_compression_type` | TEXT | Data compression type ("NONE" or "LZ4") | "NONE" | +| `file_system_type_version` | FLOAT | File system type version | 2.15 | +| `storage_capacity` | INTEGER | Storage capacity in GiB | 1200 | +| `fsx_file_system_id` | TEXT | Existing FSx file system ID | "" | + +**Note:** The actual available configuration parameters depend on the specific template schema version. Use `hyp init cluster-stack` to see all available parameters for your version. 
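As a rough illustration of how the `config.yaml` keys in the table above relate to `hyp configure` options, the sketch below converts snake_case keys into the kebab-case flag form used elsewhere in this reference (for example `resource_name_prefix` and `--resource-name-prefix`). The helper name `to_configure_args` is hypothetical and not part of the CLI. It assumes every key has a matching flag and that values can be passed as plain strings, which may not hold for every template version, so check `hyp configure --help` before relying on it.

```python
import shlex


def to_configure_args(params: dict) -> list:
    """Build an argument list for `hyp configure` from config.yaml-style keys.

    Hypothetical helper: assumes each snake_case key maps to a --kebab-case
    flag, as with resource_name_prefix and --resource-name-prefix.
    """
    args = ["hyp", "configure"]
    for key, value in params.items():
        if value is None:
            continue  # skip unset values rather than emitting an empty flag
        args.append("--" + key.replace("_", "-"))
        args.append(str(value))
    return args


if __name__ == "__main__":
    # Prints a shell-quoted command line that can be reviewed before running it.
    print(shlex.join(to_configure_args({"resource_name_prefix": "my-cluster", "stage": "prod"})))
```

Returning a list rather than a single string keeps the result usable with `subprocess.run` without shell quoting concerns; `shlex.join` is used only for display.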
## Examples @@ -325,7 +387,7 @@ mkdir my-hyperpod-cluster cd my-hyperpod-cluster # Initialize cluster configuration -hyp init hyp-cluster +hyp init cluster-stack # Configure basic parameters hyp configure --resource-name-prefix my-cluster --stage prod @@ -341,7 +403,7 @@ hyp create --region us-west-2 ```bash # Update instance groups -hyp update hyp-cluster \ +hyp update cluster \ --cluster-name my-cluster \ --instance-groups '[{"InstanceCount":2,"InstanceGroupName":"worker-nodes","InstanceType":"ml.m5.large"}]' \ --region us-west-2 @@ -351,10 +413,10 @@ hyp update hyp-cluster \ ```bash # List all cluster stacks -hyp list hyp-cluster --region us-west-2 +hyp list cluster-stack --region us-west-2 # Describe specific cluster stack -hyp describe hyp-cluster my-stack-name --region us-west-2 +hyp describe cluster-stack my-stack-name --region us-west-2 # List HyperPod clusters with capacity info hyp list-cluster --region us-west-2 --output table diff --git a/doc/cli/cluster_management/cli_cluster_management_autogen.rst b/doc/cli/cluster_management/cli_cluster_management_autogen.rst index 63d3aa27..c6dee4e0 100644 --- a/doc/cli/cluster_management/cli_cluster_management_autogen.rst +++ b/doc/cli/cluster_management/cli_cluster_management_autogen.rst @@ -4,13 +4,13 @@ .. ======================================== .. .. .. click:: sagemaker.hyperpod.cli.commands.cluster_stack:create_cluster_stack -.. .. :prog: hyp create hyp-cluster +.. .. :prog: hyp create cluster-stack .. .. click:: sagemaker.hyperpod.cli.commands.cluster_stack:describe_cluster_stack -.. :prog: hyp describe hyp-cluster +.. :prog: hyp describe cluster-stack .. .. click:: sagemaker.hyperpod.cli.commands.cluster_stack:list_cluster_stacks -.. :prog: hyp list hyp-cluster +.. :prog: hyp list cluster-stack .. .. click:: sagemaker.hyperpod.cli.commands.cluster_stack:update_cluster -.. :prog: hyp update hyp-cluster \ No newline at end of file +.. 
:prog: hyp update cluster \ No newline at end of file diff --git a/doc/examples.md b/doc/examples.md index afda4a66..ff5252b0 100644 --- a/doc/examples.md +++ b/doc/examples.md @@ -2,6 +2,29 @@ # Example Notebooks +## Cluster Management Example Notebooks + +For detailed examples of cluster management with HyperPod, see: + +::::{grid} 1 2 2 2 +:gutter: 3 + +:::{grid-item-card} CLI Cluster Management Example +:link: https://github.com/aws/sagemaker-hyperpod-cli/blob/main/examples/cluster_management/cluster_creation_init_experience.ipynb +:class-card: sd-border-primary + +**Cluster Management Examples** Refer to the Cluster Management CLI Example. +::: + +:::{grid-item-card} SDK Cluster Management Example +:link: https://github.com/aws/sagemaker-hyperpod-cli/blob/main/examples/cluster_management/cluster_creation_sdk_experience.ipynb +:class-card: sd-border-primary + +**Cluster Management Examples** Refer to the Cluster Management SDK Example. +::: + +:::: + ## Training Example Notebooks For detailed examples of training with HyperPod, see: @@ -47,4 +70,4 @@ For detailed examples of inference with HyperPod, see: ::: -:::: +:::: \ No newline at end of file diff --git a/doc/getting_started/cluster_management.rst b/doc/getting_started/cluster_management.rst index ad4f3dea..cf873689 100644 --- a/doc/getting_started/cluster_management.rst +++ b/doc/getting_started/cluster_management.rst @@ -15,6 +15,8 @@ Before you begin, ensure you have: .. note:: **Region Configuration**: For commands that accept the ``--region`` option, if no region is explicitly provided, the command will use the default region from your AWS credentials configuration. + **Cluster stack names must be unique within each AWS region.** If you attempt to create a cluster stack with a name that already exists in the same region, the deployment will fail. + Creating Your First Cluster ---------------------------- @@ -37,7 +39,7 @@ It's recommended to start with a new and clean directory for each cluster config .. 
code-block:: bash - hyp init hyp-cluster + hyp init cluster-stack This creates three files: @@ -59,12 +61,12 @@ The config.yaml file contains key parameters like: .. code-block:: yaml - template: hyp-cluster + template: cluster-stack namespace: kube-system stage: gamma resource_name_prefix: sagemaker-hyperpod-eks -**Option 2: Use CLI/SDK commands** +**Option 2: Use CLI/SDK commands (Pre-Deployment)** .. tab-set:: @@ -72,11 +74,17 @@ The config.yaml file contains key parameters like: .. code-block:: bash - hyp configure --resource-name-prefix your-resource-prefix + hyp configure --resource-name-prefix your-resource-prefix + +.. note:: + The ``hyp configure`` command only modifies local configuration files. It does not affect existing deployed clusters. 4. Create the Cluster ~~~~~~~~~~~~~~~~~~~~~ +.. warning:: + **Cluster Stack Name Uniqueness**: Cluster stack names must be unique within each AWS region. Ensure your ``resource_name_prefix`` in ``config.yaml`` generates a unique stack name for the target region to avoid deployment conflicts. + .. tab-set:: .. tab-item:: CLI @@ -102,7 +110,7 @@ Check the status of your cluster: .. code-block:: bash - hyp describe hyp-cluster your-cluster-name --region your-region + hyp describe cluster-stack your-cluster-name --region your-region .. tab-item:: SDK @@ -114,6 +122,9 @@ Check the status of your cluster: response = HpClusterStack.describe("your-cluster-name", region="your-region") print(f"Stack Status: {response['Stacks'][0]['StackStatus']}") print(f"Stack Name: {response['Stacks'][0]['StackName']}") + +.. note:: + **Region-Specific Stack Names**: Cluster stack names are unique within each AWS region. When describing a stack, ensure you specify the correct region where the stack was created, or the command will fail to find the stack. List all clusters: @@ -124,7 +135,7 @@ List all clusters: .. code-block:: bash - hyp list hyp-cluster --region your-region + hyp list cluster-stack --region your-region .. 
tab-item:: SDK @@ -144,13 +155,21 @@ Common Operations Update a Cluster ~~~~~~~~~~~~~~~~~ +.. important:: + **Runtime vs Configuration Commands**: + + - ``hyp update cluster`` modifies **existing, deployed clusters** (runtime settings like instance groups, node recovery) + - ``hyp configure`` modifies local ``config.yaml`` files **before** cluster creation + + Choose the command based on whether your cluster is already deployed. + .. tab-set:: .. tab-item:: CLI .. code-block:: bash - hyp update hyp-cluster \ + hyp update cluster \ --cluster-name your-cluster-name \ --instance-groups "[]" \ --region your-region diff --git a/doc/sdk/cluster_management/hp_cluster_stack.rst b/doc/sdk/cluster_management/hp_cluster_stack.rst index f89de192..354c38d1 100644 --- a/doc/sdk/cluster_management/hp_cluster_stack.rst +++ b/doc/sdk/cluster_management/hp_cluster_stack.rst @@ -2,6 +2,75 @@ Cluster Management ================================ .. automodule:: sagemaker.hyperpod.cluster_management.hp_cluster_stack - :exclude-members: model_config + :exclude-members: model_config, __init__ :no-undoc-members: - :no-show-inheritance: \ No newline at end of file + :no-show-inheritance: + + + +SageMaker Core Cluster Update Method +==================================== + +Cluster management also supports updating cluster properties through the SageMaker Core ``Cluster.update`` method from ``sagemaker_core.main.resources``: + +.. py:method:: Cluster.update(instance_groups=None, restricted_instance_groups=None, node_recovery=None, instance_groups_to_delete=None) + + Update a SageMaker Core Cluster resource. + + **Parameters:** + + .. 
list-table:: + :header-rows: 1 + :widths: 25 20 55 + + * - Parameter + - Type + - Description + * - instance_groups + - List[ClusterInstanceGroupSpecification] + - List of instance group specifications to update + * - restricted_instance_groups + - List[ClusterRestrictedInstanceGroupSpecification] + - List of restricted instance group specifications + * - node_recovery + - str + - Node recovery setting ("Automatic" or "None") + * - instance_groups_to_delete + - List[str] + - List of instance group names to delete + + **Returns:** + + The updated Cluster resource + + **Raises:** + + - ``botocore.exceptions.ClientError``: AWS service related errors + - ``ConflictException``: Conflict when modifying SageMaker entity + - ``ResourceLimitExceeded``: SageMaker resource limit exceeded + - ``ResourceNotFound``: Resource being accessed is not found + + + .. dropdown:: Usage Examples + :open: + + .. code-block:: python + + from sagemaker_core.main.resources import Cluster + from sagemaker_core.main.shapes import ClusterInstanceGroupSpecification + + # Get existing cluster + cluster = Cluster.get(cluster_name="my-cluster") + + # Update cluster with new instance groups and node recovery + cluster.update( + instance_groups=[ + ClusterInstanceGroupSpecification( + InstanceCount=2, + InstanceGroupName="worker-nodes", + InstanceType="ml.m5.large" + ) + ], + node_recovery="Automatic", + instance_groups_to_delete=["old-group-name"] + ) \ No newline at end of file diff --git a/setup.py b/setup.py index 4292d5a0..af4cc6c0 100644 --- a/setup.py +++ b/setup.py @@ -89,9 +89,8 @@ "pydantic>=2.10.6,<3.0.0", "hyperpod-pytorch-job-template>=1.0.0, <2.0.0", "hyperpod-custom-inference-template>=1.0.0, <2.0.0", - "hyperpod-jumpstart-inference-template>=1.0.0, <2.0.0", - # To be enabled after launch - #"hyperpod-cluster-stack-template>=1.0.0, <2.0.0" + "hyperpod-jumpstart-inference-template>=1.0.0, <2.0.0", + "hyperpod-cluster-stack-template>=1.0.0, <2.0.0" ], entry_points={ "console_scripts": [