Conversation

ajcasagrande

Summary

This proposal outlines enhancing AIPerf to support distributed deployment on Kubernetes clusters. The enhancement enables AIPerf to generate significantly higher concurrent loads by distributing work across multiple pods, overcoming single-node performance limitations. The solution adopts a true per-service pod architecture where each AIPerf service runs in its own dedicated pod, enabling independent scaling and resource allocation.

AIPerf currently supports only single-node multiprocess deployment. This enhancement proposes implementing the existing KubernetesServiceManager stub to enable distributed deployment while maintaining full compatibility with existing service management patterns, ZMQ communication protocols, and configuration systems.

Comment on lines +164 to +165
# - 200 worker pods (100K / 500 connections per worker)
# - 50 record processor pods

Are these going to be the default numbers?

ajcasagrande (Author):

It will scale based on the given concurrency, but the details are still a little TBD. The eventual goal is to scale based on CPU usage, but the first implementation will be formula-based. It can always be set manually, like it is now, using CLI args or potentially environment variables.
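To make the formula-based option concrete, here is a minimal sketch (not part of the proposal) that reproduces the numbers quoted above, assuming the worker count is derived from the per-worker connection limit (`AIPERF_HTTP_CONNECTION_LIMIT`) and the current 4:1 worker-to-record-processor recommendation:

```python
import math

# Hypothetical sizing helper: the real formula is still TBD per the discussion above.
def plan_replicas(concurrency: int,
                  connections_per_worker: int = 500,      # e.g. AIPERF_HTTP_CONNECTION_LIMIT
                  workers_per_record_processor: int = 4   # current 4:1 recommendation
                  ) -> tuple[int, int]:
    """Derive worker and record processor pod counts from the target concurrency."""
    workers = max(1, math.ceil(concurrency / connections_per_worker))
    record_processors = max(1, math.ceil(workers / workers_per_record_processor))
    return workers, record_processors

# 100K concurrency -> (200, 50), matching the excerpt above.
print(plan_replicas(100_000))
```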

# 4. Runs benchmark for 5 minutes
# 5. Retrieves results to local ./artifacts/ directory
# 6. Cleans up all Kubernetes resources
# 7. Displays metrics summary in terminal

Will the information be stored locally in an output-dir or just printed in terminal?

ajcasagrande (Author):

Yes, it will also be saved to files locally in the artifacts dir. Metrics summaries such as the CSV and JSON exports are already published via ZMQ; for larger output files we plan to implement a file retrieval system.

How does it retrieve the result?

Comment on lines +175 to +176
# Use custom namespace (won't auto-delete)
aiperf profile --kubernetes --kubernetes-namespace my-benchmark ...

What if my-benchmark doesn't yet exist? Will this create it?

ajcasagrande (Author):

Hmm, good question. I think in this case we should create it and then auto-remove it? Thoughts?

That makes sense to me, along with a log notifying the user that this is the case: the namespace they specified doesn't yet exist, so it will be created for this aiperf run and cleaned up afterward.
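A minimal sketch of that create-then-clean-up flow using the kubernetes Python client (the log wording and the namespace handling are illustrative, not the final implementation):

```python
from kubernetes import client, config
from kubernetes.client.rest import ApiException

def ensure_namespace(v1: client.CoreV1Api, name: str) -> bool:
    """Return True if the namespace was created by us (and should be deleted afterwards)."""
    try:
        v1.read_namespace(name)
        return False  # user-supplied namespace already exists; leave it in place
    except ApiException as exc:
        if exc.status != 404:
            raise
    print(f"Namespace '{name}' does not exist; creating it for this aiperf run "
          f"and cleaning it up afterwards.")
    v1.create_namespace(client.V1Namespace(metadata=client.V1ObjectMeta(name=name)))
    return True

if __name__ == "__main__":
    config.load_kube_config()            # see the kubeconfig discussion further down
    v1 = client.CoreV1Api()
    created = ensure_namespace(v1, "my-benchmark")
    # ... run the benchmark and retrieve results ...
    if created:
        v1.delete_namespace("my-benchmark")
```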

--streaming \
-u http://my-llm-service.default.svc.cluster.local:8080 \
-m my-llm-model \
--concurrency 100000 \

Will this support multiple concurrencies? Will it support SLAs/ISL/OSL?

ajcasagrande (Author):

The current plan is to be a 1:1 implementation of the existing aiperf features, just run on Kubernetes instead of multiprocessing. Additional features may come later. Are you referring to sweeps, or any of the work you have been doing? Ideally we would like to find a way to integrate things once we have baseline support going.

One thing that I have in my sweeps is a concurrency loop; it would be cool to have that built into AIPerf! But yeah, that can be a later thing.

I think one question here is whether, as we add those other features, they will "just work" in the distributed setting, or whether each feature will need to be designed for distributed too. The latter would be unfortunate.

ajcasagrande (Author):

@itay Yes, everything should "just work", since even single-node aiperf already uses this exact same architecture, just with Python multiprocessing instead of pods. The only things that would need custom work are features that introduce something special, like an output file that would need to be retrieved (and even those could still utilize the base k8s implementation for results gathering).

Thanks - that's helpful. I think that should be taken as a key design principle for AIPerf, so that we don't get into sticky situations.

@ajcasagrande ajcasagrande self-assigned this Oct 2, 2025
- Scale horizontally based on target concurrency requirements (1 to N replicas)
- Each pod can handle concurrent connections up to the configured `AIPERF_HTTP_CONNECTION_LIMIT`

#### Record Processor (Scalable Pods)

Should this be a sidecar to the worker pods, so that the scaling is the same?

ajcasagrande (Author):

Record processors are designed as separate entities in the single-node instance to prevent resource starvation of the workers, who are doing time-sensitive work; the spike in CPU utilization from tokenizing the results could otherwise cause jitter. Record processors are less time-critical, as they can take their time working through the ZMQ queue. Right now the recommendation is 4 workers to 1 record processor, but that could change based on testing.

The single-node AIPerf already handles this 4:1 approach, and in theory it will work the same way on k8s. If needed, we can also consider using Pod Affinity to co-locate them.
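If co-location ever proves useful, here is a hedged sketch of what a preferred Pod Affinity could look like via the kubernetes Python client; the `app: aiperf-worker` label is an assumption, not something defined in the proposal:

```python
from kubernetes import client

# Prefer (but do not require) scheduling record processors onto nodes that
# already run worker pods; "app: aiperf-worker" is an assumed label.
record_processor_affinity = client.V1Affinity(
    pod_affinity=client.V1PodAffinity(
        preferred_during_scheduling_ignored_during_execution=[
            client.V1WeightedPodAffinityTerm(
                weight=100,
                pod_affinity_term=client.V1PodAffinityTerm(
                    label_selector=client.V1LabelSelector(
                        match_labels={"app": "aiperf-worker"}
                    ),
                    topology_key="kubernetes.io/hostname",
                ),
            )
        ]
    )
)
# This object would be set on the record processor's V1PodSpec(affinity=..., ...).
```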

They can have dedicated capacity even as sidecars, as requests/limits are per container.

### Network Configuration
- System controller pod exposes ZMQ proxy ports via Kubernetes services
- All service pods connect to system controller services using Kubernetes DNS
- Each singleton service pod exposes its own service endpoint for direct communication

What does this mean?

ajcasagrande (Author):

1. System Controller Service (aiperf-system-controller) exposes all ZMQ proxy ports:
   - 5661-5666 (dataset, event bus, raw inference proxies)
2. Timing Manager Service (timing-manager) exposes:
   - 5562 (credit_drop - PUSH socket that it BINDS)
   - 5563 (credit_return - PULL socket that it BINDS)
3. Records Manager Service (records-manager) exposes:
   - 5557 (records - PULL socket that it BINDS)
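To illustrate how those ports map onto ZMQ sockets across pods, a minimal pyzmq sketch; the namespace in the DNS name is illustrative, and the actual socket wiring in AIPerf may differ:

```python
import zmq

ctx = zmq.Context.instance()

# Inside the timing manager pod: BIND the credit sockets on the exposed ports.
credit_drop = ctx.socket(zmq.PUSH)
credit_drop.bind("tcp://*:5562")
credit_return = ctx.socket(zmq.PULL)
credit_return.bind("tcp://*:5563")

# Inside a worker pod: CONNECT through the timing manager's Kubernetes Service
# DNS name (the "my-benchmark" namespace here is illustrative).
credits_in = ctx.socket(zmq.PULL)
credits_in.connect("tcp://timing-manager.my-benchmark.svc.cluster.local:5562")
credits_out = ctx.socket(zmq.PUSH)
credits_out.connect("tcp://timing-manager.my-benchmark.svc.cluster.local:5563")
```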

```bash
# Run 100K concurrent connections against inference service
aiperf profile \
--kubernetes \
```

How does it know which Kubernetes cluster to talk to?

ajcasagrande (Author):

Good question. I think I should move the explanation of that up; it's hidden in the advanced features. By default it will use your local kubeconfig file, or you can specify a custom one:

# Use custom kubeconfig (defaults to ~/.kube/config)
aiperf profile --kubernetes --kubeconfig ~/.kube/prod-cluster ...
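For reference, a minimal sketch of how that flag could resolve to a client with the kubernetes Python library (the helper name and flag mapping are assumptions, not the final implementation):

```python
from pathlib import Path
from kubernetes import client, config

def load_cluster_api(kubeconfig: str | None = None) -> client.CoreV1Api:
    """Resolve the cluster to talk to, mirroring kubectl's behavior.

    With no --kubeconfig, the client falls back to the KUBECONFIG env var
    or ~/.kube/config; otherwise the given file is used.
    """
    path = str(Path(kubeconfig).expanduser()) if kubeconfig else None
    config.load_kube_config(config_file=path)
    return client.CoreV1Api()

# aiperf profile --kubernetes --kubeconfig ~/.kube/prod-cluster ...
api = load_cluster_api("~/.kube/prod-cluster")
```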

## Artifact and Export File Retrieval

AIPerf generates output files including metrics exports (JSON, CSV) and logs that users need to access after benchmark completion. In the Kubernetes deployment, these files are generated by the Records Manager pod and must be retrieved to the user's local filesystem via the Kubernetes Python API.
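A minimal sketch of that retrieval path using the Kubernetes Python API; the pod name, namespace, and remote path are assumptions for illustration, and binary or very large files would need the separate file-retrieval mechanism discussed below:

```python
from pathlib import Path
from kubernetes import client
from kubernetes.stream import stream

def fetch_text_artifact(v1: client.CoreV1Api, pod: str, namespace: str,
                        remote_path: str, local_dir: str = "./artifacts") -> Path:
    """Copy a small text export (JSON/CSV) out of the Records Manager pod via exec."""
    contents = stream(
        v1.connect_get_namespaced_pod_exec,
        pod, namespace,
        command=["cat", remote_path],
        stderr=True, stdin=False, stdout=True, tty=False,
    )
    local_path = Path(local_dir) / Path(remote_path).name
    local_path.parent.mkdir(parents=True, exist_ok=True)
    local_path.write_text(contents)
    return local_path

# After config.load_kube_config():
# fetch_text_artifact(client.CoreV1Api(), "records-manager-0", "my-benchmark",
#                     "/workspace/artifacts/profile_export.json")
```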

Same here re: not using the Kubernetes API to move files around.

ajcasagrande (Author):

Gotcha. I think an HTTP API implementation would be ideal. I can add that in. Thanks!
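As a rough illustration of the HTTP direction (purely a sketch; the port, paths, and serving mechanism are assumptions, not a committed design):

```python
import urllib.request
from functools import partial
from http.server import HTTPServer, SimpleHTTPRequestHandler

# Inside the Records Manager pod: serve the artifacts directory over HTTP.
def serve_artifacts(directory: str = "/workspace/artifacts", port: int = 8000) -> None:
    handler = partial(SimpleHTTPRequestHandler, directory=directory)
    HTTPServer(("0.0.0.0", port), handler).serve_forever()

# On the client side: download an export through the pod's Service or a port-forward.
def download(url: str, local_path: str) -> None:
    urllib.request.urlretrieve(url, local_path)

# download("http://records-manager.my-benchmark.svc.cluster.local:8000/profile_export.json",
#          "./artifacts/profile_export.json")
```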


**Reason Rejected:**

* Per-service pods provide maximum flexibility and isolation

Flexibility is not free.

**Reason Rejected:**

* Per-service pods provide maximum flexibility and isolation
* Resource requirements for singleton services are sufficiently small that co-location benefits are minimal

Your service pods amount to 7 CPUs and 8GB of RAM. Even if you co-locate a lot of pods, you're looking at a minimum of a c7g.2xl just to barely fit.

* Limited real-time control and monitoring during benchmark execution
* Difficult to implement dynamic scaling based on runtime metrics
* Reduced flexibility for interactive benchmark sessions
* May not support complex coordination patterns required by AIPerf

What's an example?

└──────────────────────────────────────────────────────────────────┘
```

# Alternate Solutions

What about using an Operator and a custom resource to define an AIPerfJob?

ajcasagrande (Author):

@itay this does sound like a valid option. I will investigate this. Do you recommend any open-source operators we can leverage for this, or does dynamo-runtime have/use one we can leverage? I found this one: https://github.com/nolar/kopf; not sure if you have experience with it.
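For context on what kopf involves, a minimal sketch of an AIPerfJob handler; the API group, version, and spec fields are hypothetical:

```python
import kopf

# Fires when a user applies an AIPerfJob custom resource.
# The group/version ("aiperf.example.com/v1alpha1") and spec fields are hypothetical.
@kopf.on.create("aiperf.example.com", "v1alpha1", "aiperfjobs")
def create_benchmark(spec, name, namespace, logger, **kwargs):
    concurrency = spec.get("concurrency", 1000)
    model = spec.get("model")
    logger.info(f"Starting AIPerf run {name}: model={model}, concurrency={concurrency}")
    # Here the operator would create the system controller, worker, and record
    # processor pods, then watch them until the run completes.
    return {"phase": "Running"}
```

Such an operator would run via `kopf run operator.py` (or in-cluster), and the user-facing flow would reduce to applying a single custom resource.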

Ignoring the specific technical implementation of the Operator, it might be worth sketching out the entire user flow/journey and seeing whether we think it's worthwhile to pursue.

Comment on lines +218 to +219
### Service Exposure
The system controller pod exposes ZMQ proxy endpoints via Kubernetes ClusterIP service:
ajcasagrande (Author):

Note: I need to add the network service pieces for the records manager's and timing manager's exposed ports.
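A hedged sketch of what the timing manager's Service piece could look like via the kubernetes Python client, using the ports listed earlier; the selector label is an assumption:

```python
from kubernetes import client

def timing_manager_service(namespace: str) -> client.V1Service:
    """ClusterIP Service exposing the timing manager's credit ports (5562/5563)."""
    return client.V1Service(
        metadata=client.V1ObjectMeta(name="timing-manager", namespace=namespace),
        spec=client.V1ServiceSpec(
            type="ClusterIP",
            selector={"app": "aiperf-timing-manager"},  # assumed pod label
            ports=[
                client.V1ServicePort(name="credit-drop", port=5562, target_port=5562),
                client.V1ServicePort(name="credit-return", port=5563, target_port=5563),
            ],
        ),
    )

# client.CoreV1Api().create_namespaced_service(namespace, timing_manager_service(namespace))
```

The records manager's Service would follow the same pattern with port 5557.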
