-
Notifications
You must be signed in to change notification settings - Fork 689
feat: extract deploymentType as interface #2405
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
WalkthroughIntroduces a MultinodeDeployer interface and concrete Grove/LWS implementations. Refactors SGLang, TRTLLM, and vLLM backends to use the deployer instead of deployment-type constants. Updates Backend interface and graph orchestration to construct a deployer via a factory. Adjusts tests accordingly. Removes an obsolete Grove hostname helper. Changes
Sequence Diagram(s)sequenceDiagram
participant Caller as GenerateBasePodSpec
participant Factory as MultinodeDeployerFactory
participant Backend as Backend (SGLang/TRTLLM/vLLM)
Caller->>Factory: Create deployer from multinodeDeploymentType
Factory-->>Caller: MultinodeDeployer (Grove/LWS)
Caller->>Backend: UpdateContainer(..., serviceName, deployer)
Backend->>deployer: GetLeaderHostname()/GetHostNames()/GetNodeRank()
deployer-->>Backend: Values for flags/hosts
Caller->>Backend: UpdatePodSpec(..., serviceName)
sequenceDiagram
participant SGLang as SGLangBackend.UpdateContainer
participant Deployer as MultinodeDeployer
SGLang->>Deployer: GetLeaderHostname(serviceName)
Deployer-->>SGLang: leaderHost
SGLang->>Deployer: GetNodeRank()
Deployer-->>SGLang: nodeRank
SGLang-->>SGLang: Build multinode flags (dist addr, rank)
SGLang-->>SGLang: Remove probes if worker
Estimated code review effort🎯 4 (Complex) | ⏱️ ~40 minutes Possibly related PRs
Poem
Warning There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure. 🔧 golangci-lint (2.2.2)Error: can't load config: unsupported version of the configuration: "" See https://golangci-lint.run/product/migration-guide for migration instructions Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
SupportNeed help? Create a ticket on our support page for assistance with any issues or questions. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
Documentation and Community
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
🧹 Nitpick comments (8)
deploy/cloud/operator/internal/dynamo/graph_test.go (1)
2545-2568: Prefer constant over magic port 8080 in probesUse the shared service port constant to avoid drift.
- ReadinessProbe: &corev1.Probe{ + ReadinessProbe: &corev1.Probe{ ProbeHandler: corev1.ProbeHandler{ HTTPGet: &corev1.HTTPGetAction{ Path: "/ready", - Port: intstr.FromInt(8080), + Port: intstr.FromInt(commonconsts.DynamoServicePort), }, }, }, - LivenessProbe: &corev1.Probe{ + LivenessProbe: &corev1.Probe{ ProbeHandler: corev1.ProbeHandler{ HTTPGet: &corev1.HTTPGetAction{ Path: "/health", - Port: intstr.FromInt(8080), + Port: intstr.FromInt(commonconsts.DynamoServicePort), }, }, }, - StartupProbe: &corev1.Probe{ + StartupProbe: &corev1.Probe{ ProbeHandler: corev1.ProbeHandler{ HTTPGet: &corev1.HTTPGetAction{ Path: "/startup", - Port: intstr.FromInt(8080), + Port: intstr.FromInt(commonconsts.DynamoServicePort), }, }, },deploy/cloud/operator/internal/dynamo/lws.go (1)
5-7: Remove unnecessary interface embeddingEmbedding an interface in a struct is a no-op. Keep the struct empty and implement the methods.
-type LWSMultinodeDeployer struct { - MultinodeDeployer -} +type LWSMultinodeDeployer struct{}deploy/cloud/operator/internal/dynamo/backend_vllm.go (1)
13-27: Update doc comment to match behavior (probes removed only for workers)Comment says “worker and leader” but code removes only for RoleWorker.
- // Remove probes for multinode worker and leader + // Remove probes for multinode worker only (keep leader probes) if role == RoleWorker {deploy/cloud/operator/internal/dynamo/backend_vllm_test.go (4)
40-48: Make probe-removal assertion meaningful for leader (and align expectation)Tests currently pass with nil probes by default. Provide non-nil probes and set expectation to match code (leader probes are not removed).
- name: "multinode leader prepends ray start --head", + name: "multinode leader prepends ray start --head", numberOfNodes: 3, role: RoleLeader, component: &v1alpha1.DynamoComponentDeploymentOverridesSpec{}, multinodeDeployer: &GroveMultinodeDeployer{}, initialArgs: []string{"python3", "-m", "dynamo.vllm", "--model", "test"}, expectContains: []string{"ray start --head --port=6379 &&", "python3", "-m", "dynamo.vllm", "--model", "test"}, - expectProbesRemoved: true, + initialLivenessProbe: &corev1.Probe{}, + initialReadinessProbe: &corev1.Probe{}, + initialStartupProbe: &corev1.Probe{}, + expectProbesRemoved: false,
50-58: Verify worker probe removal by starting with non-nil probesEnsure the test would fail if UpdateContainer didn’t strip probes for workers.
name: "multinode worker replaces args with ray start --block", numberOfNodes: 3, role: RoleWorker, component: &v1alpha1.DynamoComponentDeploymentOverridesSpec{}, multinodeDeployer: &GroveMultinodeDeployer{}, initialArgs: []string{"python3", "-m", "dynamo.vllm", "--model", "test"}, + initialLivenessProbe: &corev1.Probe{}, + initialReadinessProbe: &corev1.Probe{}, + initialStartupProbe: &corev1.Probe{}, expectedArgs: []string{"ray start --address=${GROVE_PCSG_NAME}-${GROVE_PCSG_INDEX}-test-service-ldr-0.${GROVE_HEADLESS_SERVICE}:6379 --block"}, expectProbesRemoved: true,
60-68: Do the same for LWS worker pathMirror the probe checks for LWS as well.
name: "multinode worker with LWS deployment type", numberOfNodes: 2, role: RoleWorker, component: &v1alpha1.DynamoComponentDeploymentOverridesSpec{}, multinodeDeployer: &LWSMultinodeDeployer{}, initialArgs: []string{"python3", "-m", "dynamo.vllm"}, + initialLivenessProbe: &corev1.Probe{}, + initialReadinessProbe: &corev1.Probe{}, + initialStartupProbe: &corev1.Probe{}, expectedArgs: []string{"ray start --address=${LWS_LEADER_ADDRESS}:6379 --block"}, expectProbesRemoved: true,
140-152: Strengthen unit test for leader path (non-nil probes, no removal)Same rationale for the lower-level helper test.
name: "leader prepends ray start --head", role: RoleLeader, multinodeDeployer: &GroveMultinodeDeployer{}, initialArgs: []string{"python3", "-m", "dynamo.vllm"}, expectContains: []string{"ray start --head --port=6379 &&", "python3", "-m", "dynamo.vllm"},Note: Also consider adding a companion assertion that probes remain intact for the leader in UpdateContainer tests (as suggested above).
deploy/cloud/operator/internal/dynamo/graph.go (1)
664-673: Consider explicit error handling in factoryThe factory returns nil for unsupported deployment types. While the caller (GenerateBasePodSpec) checks for nil, consider returning an error tuple for more explicit error handling:
func MultinodeDeployerFactory(multinodeDeploymentType commonconsts.MultinodeDeploymentType) (MultinodeDeployer, error) { switch multinodeDeploymentType { case commonconsts.MultinodeDeploymentTypeGrove: return &GroveMultinodeDeployer{}, nil case commonconsts.MultinodeDeploymentTypeLWS: return &LWSMultinodeDeployer{}, nil default: return nil, fmt.Errorf("unsupported multinode deployment type: %s", multinodeDeploymentType) } }
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (11)
deploy/cloud/operator/internal/dynamo/backend_common.go(0 hunks)deploy/cloud/operator/internal/dynamo/backend_sglang.go(2 hunks)deploy/cloud/operator/internal/dynamo/backend_sglang_test.go(4 hunks)deploy/cloud/operator/internal/dynamo/backend_trtllm.go(4 hunks)deploy/cloud/operator/internal/dynamo/backend_trtllm_test.go(15 hunks)deploy/cloud/operator/internal/dynamo/backend_vllm.go(2 hunks)deploy/cloud/operator/internal/dynamo/backend_vllm_test.go(4 hunks)deploy/cloud/operator/internal/dynamo/graph.go(3 hunks)deploy/cloud/operator/internal/dynamo/graph_test.go(1 hunks)deploy/cloud/operator/internal/dynamo/grove.go(1 hunks)deploy/cloud/operator/internal/dynamo/lws.go(1 hunks)
💤 Files with no reviewable changes (1)
- deploy/cloud/operator/internal/dynamo/backend_common.go
🧰 Additional context used
🧠 Learnings (1)
📓 Common learnings
Learnt from: julienmancuso
PR: ai-dynamo/dynamo#1474
File: deploy/cloud/operator/internal/controller/dynamocomponent_controller.go:1308-1312
Timestamp: 2025-06-11T21:29:28.650Z
Learning: User julienmancuso expects replies in English; avoid switching languages unless explicitly requested.
🧬 Code Graph Analysis (8)
deploy/cloud/operator/internal/dynamo/grove.go (2)
deploy/cloud/operator/internal/dynamo/backend_sglang.go (1)
MultinodeDeployer(12-16)deploy/cloud/operator/internal/consts/consts.go (2)
GroveRoleSuffixLeader(55-55)GroveRoleSuffixWorker(56-56)
deploy/cloud/operator/internal/dynamo/lws.go (1)
deploy/cloud/operator/internal/dynamo/backend_sglang.go (1)
MultinodeDeployer(12-16)
deploy/cloud/operator/internal/dynamo/backend_vllm.go (3)
deploy/cloud/operator/internal/dynamo/graph.go (2)
Role(592-592)RoleWorker(596-596)deploy/cloud/operator/api/v1alpha1/dynamocomponentdeployment_types.go (1)
DynamoComponentDeploymentOverridesSpec(52-54)deploy/cloud/operator/internal/dynamo/backend_sglang.go (1)
MultinodeDeployer(12-16)
deploy/cloud/operator/internal/dynamo/graph.go (6)
deploy/cloud/operator/api/v1alpha1/dynamocomponentdeployment_types.go (1)
DynamoComponentDeploymentOverridesSpec(52-54)deploy/cloud/operator/internal/dynamo/backend_sglang.go (1)
MultinodeDeployer(12-16)deploy/cloud/operator/internal/consts/consts.go (3)
MultinodeDeploymentType(61-61)MultinodeDeploymentTypeGrove(64-64)MultinodeDeploymentTypeLWS(65-65)deploy/cloud/operator/internal/dynamo/grove.go (1)
GroveMultinodeDeployer(9-11)deploy/cloud/operator/internal/dynamo/lws.go (1)
LWSMultinodeDeployer(5-7)deploy/cloud/operator/api/dynamo/common/common.go (1)
ExtraPodSpec(57-60)
deploy/cloud/operator/internal/dynamo/backend_trtllm.go (3)
deploy/cloud/operator/internal/dynamo/graph.go (2)
Role(592-592)RoleWorker(596-596)deploy/cloud/operator/api/v1alpha1/dynamocomponentdeployment_types.go (1)
DynamoComponentDeploymentOverridesSpec(52-54)deploy/cloud/operator/internal/dynamo/backend_sglang.go (1)
MultinodeDeployer(12-16)
deploy/cloud/operator/internal/dynamo/backend_vllm_test.go (5)
deploy/cloud/operator/internal/dynamo/graph.go (4)
Role(592-592)RoleMain(597-597)RoleLeader(595-595)RoleWorker(596-596)deploy/cloud/operator/api/v1alpha1/dynamocomponentdeployment_types.go (1)
DynamoComponentDeploymentOverridesSpec(52-54)deploy/cloud/operator/internal/dynamo/backend_sglang.go (1)
MultinodeDeployer(12-16)deploy/cloud/operator/internal/dynamo/grove.go (1)
GroveMultinodeDeployer(9-11)deploy/cloud/operator/internal/dynamo/lws.go (1)
LWSMultinodeDeployer(5-7)
deploy/cloud/operator/internal/dynamo/backend_sglang.go (2)
deploy/cloud/operator/internal/dynamo/graph.go (2)
Role(592-592)RoleWorker(596-596)deploy/cloud/operator/api/v1alpha1/dynamocomponentdeployment_types.go (1)
DynamoComponentDeploymentOverridesSpec(52-54)
deploy/cloud/operator/internal/dynamo/backend_sglang_test.go (5)
deploy/cloud/operator/internal/dynamo/graph.go (4)
Role(592-592)RoleMain(597-597)RoleLeader(595-595)RoleWorker(596-596)deploy/cloud/operator/internal/dynamo/backend_sglang.go (1)
MultinodeDeployer(12-16)deploy/cloud/operator/internal/dynamo/grove.go (1)
GroveMultinodeDeployer(9-11)deploy/cloud/operator/internal/dynamo/lws.go (1)
LWSMultinodeDeployer(5-7)deploy/cloud/operator/api/v1alpha1/dynamocomponentdeployment_types.go (1)
DynamoComponentDeploymentOverridesSpec(52-54)
🔇 Additional comments (18)
deploy/cloud/operator/internal/dynamo/backend_vllm.go (1)
34-46: No action needed: shell command is set upstream
TheupdateVLLMMultinodeArgshelper only rewritescontainer.Args—it does not touchcontainer.Command. In all vLLM deployments (both Grove PodGangSet viaGeneratePodSpecForComponentand the CR YAMLs incomponents/backends/vllm/deploy), thecommand: ["/bin/sh","-c"]entry is always provided. This ensures the injected"ray start … && …"string is executed by a shell.Likely an incorrect or invalid review comment.
deploy/cloud/operator/internal/dynamo/backend_sglang_test.go (4)
15-22: LGTM! Clean transition to interface-based patternThe test structure properly reflects the architectural change from constants to the MultinodeDeployer interface, maintaining good test organization.
23-95: Excellent test coverage for multinode deployment scenariosThe test cases comprehensively cover various deployment scenarios including:
- Single vs multinode configurations
- Grove vs LWS deployment types
- Different role types (Leader, Worker, Main)
- Complex command structures with pipes and multiple arguments
The expected outputs correctly reflect the environment variables specific to each deployer type.
103-103: Correct UpdateContainer signature usageThe call properly passes the multinodeDeployer as the last parameter, aligning with the updated Backend interface.
122-201: Well-structured probe removal testsThe tests properly validate that:
- Single node deployments keep all probes
- Multinode workers have probes removed
- Leaders and main roles retain their probes
This aligns with the expected behavior for distributed deployments.
deploy/cloud/operator/internal/dynamo/backend_trtllm.go (4)
18-18: Correct interface adoption in UpdateContainerThe signature properly includes the MultinodeDeployer parameter, maintaining consistency with the Backend interface.
56-56: Clean delegation to MultinodeDeployerThe setupLeaderContainer function correctly passes the deployer to generateWorkerHostnames, enabling polymorphic hostname resolution.
Also applies to: 89-91
62-62: Appropriate simplification of UpdatePodSpecRemoving the unused multinodeDeploymentType parameter reduces complexity while maintaining functionality.
169-171: Excellent simplification through delegationThe generateWorkerHostnames function now cleanly delegates to the MultinodeDeployer interface, removing all deployment-specific branching logic. This is a good example of the Strategy pattern.
deploy/cloud/operator/internal/dynamo/backend_trtllm_test.go (3)
14-153: Comprehensive test coverage with good structureThe TestTRTLLMBackend_UpdateContainer test cases properly cover:
- Single node scenarios
- Multinode leader/worker configurations
- Both Grove and LWS deployment types
- SSH volume mount verification
- Probe configuration validation
The use of helper validation functions (validateVolumeMounts, validateCommand, etc.) improves readability and maintainability.
157-283: Well-designed test helper functionsThe validation helper functions effectively reduce test complexity and improve maintainability by:
- Following single responsibility principle
- Providing clear error messages
- Enabling reuse across multiple test cases
389-477: Thorough validation of hostname generation delegationThe test properly validates that:
- Grove deployer generates hostnames with GROVE_* environment variables
- LWS deployer generates hostnames with LWS_* environment variables
- Correct number of hostnames are generated
- No empty hostnames are produced
deploy/cloud/operator/internal/dynamo/graph.go (3)
633-635: Clean interface evolutionThe Backend interface properly incorporates the MultinodeDeployer parameter, enabling polymorphic multinode deployment strategies.
795-803: Proper integration of MultinodeDeployer patternThe GenerateBasePodSpec function correctly:
- Creates the deployer via factory
- Validates the deployer is not nil
- Passes it to backend methods
- Simplifies UpdatePodSpec call by removing unused parameter
Good error handling for unsupported deployment types.
Also applies to: 811-811
640-645: Correct NoopBackend interface implementationThe NoopBackend properly implements the updated Backend interface while maintaining its intended no-op behavior for non-worker components.
deploy/cloud/operator/internal/dynamo/backend_sglang.go (3)
12-16: Well-designed MultinodeDeployer interfaceThe interface provides a clean abstraction for multinode deployment strategies with:
GetLeaderHostname: Returns leader node hostnameGetHostNames: Returns all node hostnamesGetNodeRank: Returns node rank expressionThis design effectively encapsulates deployment-specific logic.
20-45: Clean implementation using MultinodeDeployerThe UpdateContainer method properly:
- Handles single-node early return
- Removes probes for worker nodes in multinode setups
- Delegates flag generation to the deployer
- Injects flags into Python commands
Good separation of concerns.
52-60: Correct multinode flag generationThe method properly:
- Gets leader hostname from deployer with port 29500
- Retrieves node rank from deployer
- Overrides rank to "0" for leader role
This maintains the expected behavior while leveraging the deployer abstraction.
nvrohanv
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, had some minor questions!
dc88430 to
e041038
Compare
e041038 to
d71ef0a
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
nit: I think maybe instead of a top level numberOfNodes, it might be better to introduce a multinode block on the CR
Something like:
multinode:
nodeCount: 2
This will be useful in case we might have to expose more multinode fields in the future.
Signed-off-by: Hannah Zhang <[email protected]>
Overview:
extract deploymentType as interface
Summary by CodeRabbit