@KeitaW KeitaW commented Sep 29, 2025

Updates the AWS EFA Kubernetes device plugin to version 0.5.10 or later to ensure proper device identification on p6-b200.48xlarge instances and prevent NCCL failures during distributed training.

Problem

On p6-b200.48xlarge instances, the InfiniBand subsystem exposes 10 devices:

  • 2 ibp* devices (e.g., ibp115s0f0, ibp116s0f0)
  • 8 rdmap* devices (e.g., rdmap79s0, rdmap80s0, etc.)

Critical Issue: The ibp* devices are NVLink controllers for GPU interconnect, not EFA devices. However, EFA device plugin versions prior to 0.5.10 incorrectly identify these NVLink controllers as EFA devices.

Impact of Current Version

When using EFA device plugin < 0.5.10:

  1. Kubernetes incorrectly reports 10 available EFA devices instead of 8
  2. Pods may be scheduled with NVLink controllers allocated as EFA devices
  3. NCCL attempts RDMA communication over NVLink controllers that are not EFA devices
  4. Result: Training jobs fail with NCCL errors and communication timeouts

According to the AWS EFA team: "Everything will eventually break because eventually EKS will schedule them as EFA devices, but they aren't, and then NCCL will get confused."

Solution

Update to EFA device plugin version 0.5.10 or later, which correctly:

  • Identifies only the 8 rdmap* interfaces as EFA devices
  • Excludes the 2 ibp* NVLink controllers from the EFA resource pool
  • Reports the correct count of 8 EFA devices to Kubernetes
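Because `0.5.10` sorts before `0.5.4` in a plain string comparison, a script that gates on the plugin version needs a numeric, field-by-field check. The following is a minimal sketch, assuming the image reference ends in a `vMAJOR.MINOR.PATCH` tag as in the images shown in this PR; the function name `efa_plugin_is_fixed` is hypothetical and not part of any AWS tooling.

```shell
# Hypothetical helper: succeed (exit 0) if the image tag is >= the minimum version.
# $1: image reference ending in a tag such as :v0.5.10
# $2: minimum version, defaults to 0.5.10 (the fixed release named in this PR)
# Exit status: 0 = new enough, 1 = too old, 2 = tag is not MAJOR.MINOR.PATCH
efa_plugin_is_fixed() {
  printf '%s %s\n' "${1##*:}" "${2:-0.5.10}" | awk '
    {
      sub(/^v/, "", $1)                     # tolerate a leading "v"
      if ($1 !~ /^[0-9]+\.[0-9]+\.[0-9]+$/ || $2 !~ /^[0-9]+\.[0-9]+\.[0-9]+$/) exit 2
      split($1, have, "."); split($2, want, ".")
      for (i = 1; i <= 3; i++) {            # compare each field numerically
        if (have[i] + 0 > want[i] + 0) exit 0
        if (have[i] + 0 < want[i] + 0) exit 1
      }
      exit 0                                # versions are equal
    }'
}
```

One possible use is to pass it the image string returned by the `kubectl get daemonset ... -o jsonpath=...` command shown in the Testing section and update only when it exits with status 1.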

Testing

Current State - EFA Device Plugin v0.5.4

# Check current EFA device plugin version
$ kubectl get daemonset aws-efa-k8s-device-plugin -n kube-system -o jsonpath='{.spec.template.spec.containers[0].image}'
602401143452.dkr.ecr.us-east-1.amazonaws.com/eks/aws-efa-k8s-device-plugin:v0.5.4

# Check EFA resources on p6-b200.48xlarge node
$ kubectl describe node ip-10-0-85-35.ec2.internal | grep -A2 -B2 "vpc.amazonaws.com/efa"
  memory:                 2092981788Ki
  pods:                   250
  vpc.amazonaws.com/efa:  10    # INCORRECT - includes 2 NVLink controllers
Allocatable:
  cpu:                    191450m
--
  memory:                 2079885852Ki
  pods:                   250
  vpc.amazonaws.com/efa:  10    # INCORRECT - should be 8

Problem: The plugin v0.5.4 incorrectly identifies NVLink controllers (ibp115s0f0, ibp116s0f0) as EFA devices, reporting 10 instead of 8.
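The count check above can also be scripted rather than read by eye. This is a rough sketch that parses `kubectl describe node` output from stdin; the helper names `efa_counts` and `efa_count_ok` are made up for illustration, and the expected value of 8 applies to p6-b200.48xlarge as described in this PR.

```shell
# Hypothetical helpers; both read `kubectl describe node <node>` output on stdin.

# Print each vpc.amazonaws.com/efa value found (Capacity and Allocatable).
efa_counts() {
  awk '$1 == "vpc.amazonaws.com/efa:" { print $2 }'
}

# Succeed only if at least one EFA line exists and every value equals
# the expected count ($1, default 8).
efa_count_ok() {
  awk -v want="${1:-8}" '
    $1 == "vpc.amazonaws.com/efa:" { seen = 1; if ($2 + 0 != want + 0) bad = 1 }
    END { exit ((seen && !bad) ? 0 : 1) }
  '
}
```

For example, `kubectl describe node ip-10-0-85-35.ec2.internal | efa_count_ok 8` would exit non-zero on the node shown above while it still reports 10.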

Update Process and Results - EFA Device Plugin v0.5.10 (VERIFIED)

# Step 1: Update the EFA device plugin to v0.5.10
$ kubectl set image daemonset/aws-efa-k8s-device-plugin -n kube-system \
    aws-efa-k8s-device-plugin=602401143452.dkr.ecr.us-east-1.amazonaws.com/eks/aws-efa-k8s-device-plugin:v0.5.10
daemonset.apps/aws-efa-k8s-device-plugin image updated

# Step 2: Wait for rollout (note: plugin may crash on non-EFA nodes - this is expected)
$ kubectl rollout status daemonset/aws-efa-k8s-device-plugin -n kube-system

# Step 3: Restart the plugin pod on p6-b200 node to pick up changes
$ kubectl delete pod aws-efa-k8s-device-plugin-<pod-id> -n kube-system

# Step 4: Verify the updated version
$ kubectl get daemonset aws-efa-k8s-device-plugin -n kube-system -o jsonpath='{.spec.template.spec.containers[0].image}'
602401143452.dkr.ecr.us-east-1.amazonaws.com/eks/aws-efa-k8s-device-plugin:v0.5.10

# Step 5: Check EFA resources on same node - NOW CORRECT!
$ kubectl describe node ip-10-0-85-35.ec2.internal | grep -A2 -B2 "vpc.amazonaws.com/efa"
  memory:                 2092981788Ki
  pods:                   250
  vpc.amazonaws.com/efa:  8    # CORRECT - Only 8 EFA devices
Allocatable:
  cpu:                    191450m
--
  memory:                 2079885852Ki
  pods:                   250
  vpc.amazonaws.com/efa:  8    # CORRECT - Excludes NVLink controllers

Device Breakdown on p6-b200.48xlarge

# List all InfiniBand devices visible to the system
$ kubectl exec -it <pod-on-p6-b200> -- ls /sys/class/infiniband/
ibp115s0f0    # ❌ NVLink controller (NOT an EFA device)
ibp116s0f0    # ❌ NVLink controller (NOT an EFA device)
rdmap79s0     # ✅ EFA device
rdmap80s0     # ✅ EFA device
rdmap81s0     # ✅ EFA device
rdmap132s0    # ✅ EFA device
rdmap133s0    # ✅ EFA device
rdmap134s0    # ✅ EFA device
rdmap135s0    # ✅ EFA device
rdmap136s0    # ✅ EFA device

# Total: 10 InfiniBand devices (2 NVLink + 8 EFA)
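The breakdown above can be tallied with a small filter. This sketch assumes the naming convention reported in this PR for p6-b200.48xlarge (EFA devices are named `rdmap*`, everything else is not EFA); the function name `count_efa_devices` is hypothetical, and the rule may not hold on other instance types.

```shell
# Hypothetical helper: read device names, one per line (for example the output
# of `ls /sys/class/infiniband/`), and print "efa=<n> other=<m>".
count_efa_devices() {
  awk '
    $1 ~ /^rdmap/ { efa++; next }   # rdmap* entries are counted as EFA devices
    NF            { other++ }       # any other non-blank entry (e.g. ibp*)
    END           { printf "efa=%d other=%d\n", efa, other }
  '
}
```

Piping the listing above through it, as in `kubectl exec <pod-on-p6-b200> -- ls /sys/class/infiniband/ | count_efa_devices`, should print `efa=8 other=2` on this instance type.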

Verification Steps

  1. Deploy test pod requesting all EFA resources
  2. Verify only rdmap* devices are allocated
  3. Confirm NCCL initializes successfully
  4. Run multi-node training job to validate communication
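Steps 1 and 2 could be approximated with a throwaway pod. The sketch below only prints a manifest; it is not the test used in this PR. It assumes the device plugin exposes allocated devices as `/dev/infiniband/uverbs*` inside the container and that the image supplies `sh` and `cat`. The function name, the pod name `efa-device-check`, and the image argument are placeholders, and any node taints or tolerations are left out.

```shell
# Hypothetical helper: print a Pod manifest that requests EFA devices and
# lists the RDMA device name behind each allocated uverbs node.
# $1: container image (required, user supplied)
# $2: number of EFA devices to request (default 8)
efa_test_pod_manifest() (
  image="${1:?usage: efa_test_pod_manifest IMAGE [EFA_COUNT]}"
  count="${2:-8}"
  cat <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: efa-device-check
spec:
  restartPolicy: Never
  nodeSelector:
    node.kubernetes.io/instance-type: p6-b200.48xlarge
  containers:
    - name: check
      image: ${image}
      command: ["sh", "-c"]
      args:
        - |
          for u in /dev/infiniband/uverbs*; do
            cat "/sys/class/infiniband_verbs/\${u##*/}/ibdev"
          done
      resources:
        limits:
          vpc.amazonaws.com/efa: ${count}
EOF
)
```

A possible invocation is `efa_test_pod_manifest <image> | kubectl apply -f -`, followed by `kubectl logs efa-device-check`; with plugin 0.5.10 or later the log would be expected to contain only `rdmap*` names.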

@KeitaW KeitaW requested a review from a team as a code owner September 29, 2025 22:49
@pintaoz-aws pintaoz-aws merged commit dc2096a into aws:main Sep 30, 2025
6 checks passed
@KeitaW KeitaW deleted the patch-2 branch September 30, 2025 22:08
jam-jee pushed a commit that referenced this pull request Nov 21, 2025
* feat: Implement elastic training cli arguments (#273)

* feat: Implement elastic training cli arguments

* Add elastic training unified config and unit test

* Add graceful shutdown and scaling timeout to cli args

* Revert "feat: Implement elastic training cli arguments (#273)"

This reverts commit 18428ef2b1c0562bf51a9a4b4aa2914eed441259.

* Add dev_space_constants.py (#255)

Co-authored-by: Brian Xia <[email protected]>

* Add dev_space_access_constants.py (#256)

Co-authored-by: Brian Xia <[email protected]>

* Add space_admin_config_constants.py (#257)

Co-authored-by: Brian Xia <[email protected]>

* Add template package only (#261)

Co-authored-by: Brian Xia <[email protected]>

* Add dev_space.py CLI command (#263)

* Add dev_space.py CLI command

* Add dev space unit tests

---------

Co-authored-by: Brian Xia <[email protected]>

* Add dev_space_utils.py to work with the dev space template model (#262)

* Add dev_space_utils.py

* Add unit tests for dev_space_utils

---------

Co-authored-by: Brian Xia <[email protected]>

* Add dev space CLI (#269)

* Rename dev space to space (#272)

* Update the Space model and constants per latest operator (#275)

* Add space_admin_config.py CLI command (#260)

* Add space_admin_config.py CLI command

* Update the space admin config to space template

---------

Co-authored-by: Brian Xia <[email protected]>

* Implement CRUD operations for Space PySDK (#267)

* Implement CRUD operations for Space PySDK

* Update Space PySDK per new schema

* Update Space PySDK per new schema

* Implement the pySDK for the Space Template (#282)

* Refactor Space CLI using the Space PySDK (#281)

* Implement CRUD operations for Space PySDK

* Update Space PySDK per new schema

* Refactor CLI to use the PySDK

* Add dev_space_access.py CLI command (#259)

* Add dev_space_access.py CLI command

* Add space access creation to pySDK and refactor space access CLI

---------

Co-authored-by: Brian Xia <[email protected]>

* Listing space will filter out the spaces not created by the current user (#285)

* Implement CRUD operations for Space PySDK

* Update Space PySDK per new schema

* Implement CRUD operations for Space PySDK

* Update Space PySDK per new schema

* Update Space PySDK per new schema

* Implement space list pagination and creator filtering

* Refactor space template with PySDK (#286)

* Add additional Space parameters for resources including the fractional GPU (#287)

* Implement validation for mig profiles for Spaces (#291)

* Implement validation for mig profiles when creating/updating spaces

* Update Space parameter model

* Make Space Template namespaced resource

* Parker GA issues (#296)

* Update Space Template CLI to be namespaced

* Space get-logs default to the workspace container

* Remove error handling to bubble up the actual K8s errors

* Listing public Spaces

* Fix typos, elaborated text, add logic to parse idle-shutdown

* Fix the template ref regression (#300)

* Update SageMaker Space documentation (#301)

* Implement Space integration tests (#298)

Inference tests succeeded with parker-cli code - https://quip-amazon.com/fhwhAAMht0Mm/Project-Parker-HyperPod-User-Experience-for-Data-Scientist-persona

Parker-cli integ tests pass (shown below)

These inference tests failing are known to be flaky- https://w.amazon.com/bin/view/AWS/AmazonAI/Platform/Codex/CodexInfra/Runbooks/HyperPodCLI/TroubleshootInferenceTests#HTroubleshooting
ticket has been created to fix these flaky tests - https://t.corp.amazon.com/V1943878058


Parker-cli integ tests passing

============================= test session starts ==============================
platform linux -- Python 3.11.14, pytest-8.3.2, pluggy-1.6.0 -- /root/.pyenv/versions/3.11.14/bin/python3.11
cachedir: .pytest_cache
rootdir: /codebuild/output/src1458832038/src/github.com/aws/private-sagemaker-hyperpod-cli-staging
configfile: setup.cfg
plugins: hydra-core-1.3.2, order-1.3.0, dependency-0.6.0, cov-5.0.0
collecting ... collected 39 items
test/integration_tests/space/cli/test_cli_space.py::TestSpaceCLI::test_space_create PASSED [  2%]
test/integration_tests/space/cli/test_cli_space.py::TestSpaceCLI::test_space_list_table PASSED [  5%]
test/integration_tests/space/cli/test_cli_space.py::TestSpaceCLI::test_space_list_json PASSED [  7%]
test/integration_tests/space/cli/test_cli_space.py::TestSpaceCLI::test_space_describe_yaml PASSED [ 10%]
test/integration_tests/space/cli/test_cli_space.py::TestSpaceCLI::test_space_describe_json PASSED [ 12%]
test/integration_tests/space/cli/test_cli_space.py::TestSpaceCLI::test_space_stop PASSED [ 15%]
test/integration_tests/space/cli/test_cli_space.py::TestSpaceCLI::test_space_start PASSED [ 17%]
test/integration_tests/space/cli/test_cli_space.py::TestSpaceCLI::test_space_update PASSED [ 20%]
test/integration_tests/space/cli/test_cli_space.py::TestSpaceCLI::test_space_get_logs PASSED [ 23%]
test/integration_tests/space/cli/test_cli_space.py::TestSpaceCLI::test_space_delete PASSED [ 25%]
test/integration_tests/space/cli/test_cli_space.py::TestSpaceCLI::test_space_list_empty_namespace PASSED [ 28%]
test/integration_tests/space/cli/test_cli_space.py::TestSpaceCLI::test_space_describe_nonexistent PASSED [ 30%]
test/integration_tests/space/cli/test_cli_space.py::TestSpaceCLI::test_space_delete_nonexistent PASSED [ 33%]
test/integration_tests/space/cli/test_cli_space_template.py::TestSpaceTemplateCLI::test_space_template_create PASSED [ 35%]
test/integration_tests/space/cli/test_cli_space_template.py::TestSpaceTemplateCLI::test_space_template_list_table PASSED [ 38%]
test/integration_tests/space/cli/test_cli_space_template.py::TestSpaceTemplateCLI::test_space_template_list_json PASSED [ 41%]
test/integration_tests/space/cli/test_cli_space_template.py::TestSpaceTemplateCLI::test_space_template_describe_yaml PASSED [ 43%]
test/integration_tests/space/cli/test_cli_space_template.py::TestSpaceTemplateCLI::test_space_template_describe_json PASSED [ 46%]
test/integration_tests/space/cli/test_cli_space_template.py::TestSpaceTemplateCLI::test_space_template_update PASSED [ 48%]
test/integration_tests/space/cli/test_cli_space_template.py::TestSpaceTemplateCLI::test_space_template_delete PASSED [ 51%]
test/integration_tests/space/cli/test_cli_space_template.py::TestSpaceTemplateCLI::test_space_template_list_empty_namespace PASSED [ 53%]
test/integration_tests/space/cli/test_cli_space_template.py::TestSpaceTemplateCLI::test_space_template_describe_nonexistent PASSED [ 56%]
test/integration_tests/space/cli/test_cli_space_template.py::TestSpaceTemplateCLI::test_space_template_delete_nonexistent PASSED [ 58%]
test/integration_tests/space/sdk/test_sdk_space.py::test_create_space PASSED [ 61%]
test/integration_tests/space/sdk/test_sdk_space.py::test_list_spaces PASSED [ 64%]
test/integration_tests/space/sdk/test_sdk_space.py::test_get_space PASSED [ 66%]
test/integration_tests/space/sdk/test_sdk_space.py::test_wait_until_running PASSED [ 69%]
test/integration_tests/space/sdk/test_sdk_space.py::test_update_space PASSED [ 71%]
test/integration_tests/space/sdk/test_sdk_space.py::test_stop_space PASSED [ 74%]
test/integration_tests/space/sdk/test_sdk_space.py::test_start_space PASSED [ 76%]
test/integration_tests/space/sdk/test_sdk_space.py::test_list_pods PASSED [ 79%]
test/integration_tests/space/sdk/test_sdk_space.py::test_get_logs PASSED [ 82%]
test/integration_tests/space/sdk/test_sdk_space.py::test_create_space_access SKIPPED [ 84%]
test/integration_tests/space/sdk/test_sdk_space.py::test_delete_space PASSED [ 87%]
test/integration_tests/space/sdk/test_sdk_space_template.py::TestHPSpaceTemplate::test_create_template PASSED [ 89%]
test/integration_tests/space/sdk/test_sdk_space_template.py::TestHPSpaceTemplate::test_list_templates PASSED [ 92%]
test/integration_tests/space/sdk/test_sdk_space_template.py::TestHPSpaceTemplate::test_get_template PASSED [ 94%]
test/integration_tests/space/sdk/test_sdk_space_template.py::TestHPSpaceTemplate::test_update_template PASSED [ 97%]
test/integration_tests/space/sdk/test_sdk_space_template.py::TestHPSpaceTemplate::test_delete_template PASSED [100%]
=============================== warnings summary ===============================

* merge conflicts fixed

* Update README for fractional gpu support (#294)

* Update README for fractional gpu support

* update pytorch job example

* add example for accelerator partitions

* merge conflicts from js template and inference

* update changelog

* uncommented install req

* uncommented

* fixed uncomment

---------

Co-authored-by: Sophia <[email protected]>
Co-authored-by: Molly He <[email protected]>
Co-authored-by: Brian Xia <[email protected]>
Co-authored-by: Brian Xia <[email protected]>
Co-authored-by: Mohamed Zeidan <[email protected]>
Co-authored-by: Ophelia Yang <[email protected]>