feat: add AIPerf Kubernetes Deployment Enhancement doc #44
Conversation
```bash
# - 200 worker pods (100K / 500 connections per worker)
# - 50 record processor pods
```
Are these going to be the default numbers?
It will scale based on the given concurrency. However, the exact defaults are a little TBD. The final goal is to have something based on CPU usage, but the first implementation will be formula based. It can always be manually set, as it is now, using CLI args or potentially env vars.
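For illustration, a minimal sketch of the formula-based scaling described above, assuming a 500-connections-per-worker default and the 4:1 worker-to-record-processor ratio mentioned later in this thread (both values are assumptions, since the defaults are TBD):

```python
import math

def compute_replicas(concurrency: int,
                     connections_per_worker: int = 500,
                     workers_per_record_processor: int = 4) -> tuple[int, int]:
    """Derive pod counts from the requested concurrency."""
    workers = math.ceil(concurrency / connections_per_worker)
    record_processors = math.ceil(workers / workers_per_record_processor)
    return workers, record_processors

# 100K concurrency -> the 200 workers / 50 record processors quoted above.
print(compute_replicas(100_000))  # (200, 50)
```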
```bash
# 4. Runs benchmark for 5 minutes
# 5. Retrieves results to local ./artifacts/ directory
# 6. Cleans up all Kubernetes resources
# 7. Displays metrics summary in terminal
```
Will the information be stored locally in an output-dir or just printed in terminal?
Yes, it will be saved to files locally in the artifacts dir as well. Metrics summaries, such as the CSV and JSON, are already published via ZMQ; for larger output files we plan to implement a file retrieval system.
How does it retrieve the result?
```bash
# Use custom namespace (won't auto-delete)
aiperf profile --kubernetes --kubernetes-namespace my-benchmark ...
```
What if `my-benchmark` doesn't yet exist? Will this create it?
Hmm, good question. I think in this case we should create it, and then auto-remove it? Thoughts?
That makes sense to me, and with a log notifying the user that that's the case -- the namespace they specified doesn't yet exist, so you'll create it for this aiperf run but will clean it up after
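For reference, a rough sketch of that create-if-missing flow using the official kubernetes Python client (the function name and the caller-side cleanup contract are hypothetical):

```python
from kubernetes import client, config

def ensure_namespace(name: str) -> bool:
    """Return True if the namespace was created and should be cleaned up after the run."""
    config.load_kube_config()
    api = client.CoreV1Api()
    try:
        api.read_namespace(name)
        return False  # already existed; leave it untouched on teardown
    except client.ApiException as exc:
        if exc.status != 404:
            raise
    print(f"Namespace {name!r} does not exist; creating it for this aiperf run "
          "and cleaning it up afterwards.")
    api.create_namespace(
        client.V1Namespace(metadata=client.V1ObjectMeta(name=name))
    )
    return True
```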
```bash
  --streaming \
  -u http://my-llm-service.default.svc.cluster.local:8080 \
  -m my-llm-model \
  --concurrency 100000 \
```
Will this support multiple concurrencies? Will it support SLAs/isl/osl?
The current plan is to just be a 1:1 implementation of the existing aiperf features, but run on Kubernetes instead of multiprocessing. Additional features may come later. Are you referring to sweeps, or to any of the work you have been doing? Ideally we would like to find a way to integrate things once we have baseline support going.
One thing that I have in my sweeps is a concurrency loop; it would be cool to have that inherently in AIPerf! But yeah, that can be a later thing.
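A hedged sketch of such a loop as a thin wrapper over the existing CLI (the flag set mirrors the examples in this proposal; values are illustrative):

```python
import subprocess

# Sweep a few concurrency levels by re-invoking the CLI per level.
for concurrency in (1_000, 10_000, 100_000):
    subprocess.run(
        [
            "aiperf", "profile",
            "--kubernetes",
            "-u", "http://my-llm-service.default.svc.cluster.local:8080",
            "-m", "my-llm-model",
            "--concurrency", str(concurrency),
        ],
        check=True,  # abort the sweep if one run fails
    )
```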
I think one question here is whether, as we add those other features, they will "just work" in the distributed setting, or whether each feature will need to be designed for distributed too. The latter would be unfortunate.
@itay Yes, everything should "just work": even single-node aiperf already uses this exact same architecture, just with Python multiprocessing instead of pods. The only things that would need custom work are features that introduce something special, like an output file that would need to be retrieved (though those could still utilize the base k8s implementation for results gathering).
Thanks - that's helpful. I think that should be taken as a key design principle for AIPerf, so that we don't get into sticky situations.
- Scale horizontally based on target concurrency requirements (1 to N replicas)
- Each pod can handle concurrent connections up to the configured `AIPERF_HTTP_CONNECTION_LIMIT`

#### Record Processor (Scalable Pods)
Should this be a sidecar to the worker pods, so that the scaling is the same?
Record processors are designed as separate entities in the single-node instance to prevent resource starvation of the workers, who are doing time-sensitive work; the spike in CPU utilization from tokenization of the results could otherwise cause jitter. Record processors are less time-critical, as they can take their time working through the ZMQ queue. Right now the recommendation is 4 workers to 1 record processor, but that could change based on testing.
The single-node AIPerf already handles this 4:1 approach, and in theory it will just work the same way on k8s. If needed, we can also consider using Pod Affinity to co-locate them.
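If co-location turns out to matter, a podAffinity term like this one (kubernetes Python client objects; the `app: aiperf-worker` label is an assumption) would prefer scheduling record processors onto the same nodes as workers:

```python
from kubernetes import client

# Soft affinity: prefer, but don't require, landing next to worker pods.
affinity = client.V1Affinity(
    pod_affinity=client.V1PodAffinity(
        preferred_during_scheduling_ignored_during_execution=[
            client.V1WeightedPodAffinityTerm(
                weight=100,
                pod_affinity_term=client.V1PodAffinityTerm(
                    label_selector=client.V1LabelSelector(
                        match_labels={"app": "aiperf-worker"}
                    ),
                    topology_key="kubernetes.io/hostname",
                ),
            )
        ]
    )
)
```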
They can have dedicated capacity even as sidecars, as requests/limits are per container.
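For reference, per-container requests/limits with the kubernetes Python client look like this (values are illustrative, not recommendations):

```python
from kubernetes import client

# Reserved capacity for a record processor container, sidecar or not.
resources = client.V1ResourceRequirements(
    requests={"cpu": "500m", "memory": "512Mi"},
    limits={"cpu": "1", "memory": "1Gi"},
)
```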
### Network Configuration
- System controller pod exposes ZMQ proxy ports via Kubernetes services
- All service pods connect to system controller services using Kubernetes DNS
- Each singleton service pod exposes its own service endpoint for direct communication
What does this mean?
- System Controller Service (`aiperf-system-controller`) exposes all ZMQ proxy ports:
  - 5661-5666 (dataset, event bus, raw inference proxies)
- Timing Manager Service (`timing-manager`) exposes:
  - 5562 (credit_drop - PUSH socket that it BINDS)
  - 5563 (credit_return - PULL socket that it BINDS)
- Records Manager Service (`records-manager`) exposes:
  - 5557 (records - PULL socket that it BINDS)
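As an example, exposing the records manager's bound PULL port as a ClusterIP service via the kubernetes Python client could look like this (the namespace and selector label are assumptions):

```python
from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()
core.create_namespaced_service(
    namespace="aiperf",
    body=client.V1Service(
        metadata=client.V1ObjectMeta(name="records-manager"),
        spec=client.V1ServiceSpec(
            type="ClusterIP",
            selector={"app": "records-manager"},  # matches the records manager pod
            ports=[client.V1ServicePort(name="records", port=5557, target_port=5557)],
        ),
    ),
)
```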
```bash
# Run 100K concurrent connections against inference service
aiperf profile \
  --kubernetes \
```
How does it know which Kubernetes to talk to?
Good question. I think I should move up the explanation of that; it's hidden in the advanced features. By default it will use your local kubeconfig file, or you can specify a custom one:

```bash
# Use custom kubeconfig (defaults to ~/.kube/config)
aiperf profile --kubernetes --kubeconfig ~/.kube/prod-cluster ...
```
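Internally, that selection could be as simple as this sketch using the kubernetes Python client (the fallback mirrors kubectl's default location):

```python
import os
from kubernetes import config

def load_cluster_config(kubeconfig: str | None = None) -> None:
    # An explicit --kubeconfig value wins; otherwise fall back to ~/.kube/config.
    config.load_kube_config(
        config_file=kubeconfig or os.path.expanduser("~/.kube/config")
    )
```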
## Artifact and Export File Retrieval

AIPerf generates output files including metrics exports (JSON, CSV) and logs that users need to access after benchmark completion. In the Kubernetes deployment, these files are generated by the Records Manager pod and must be retrieved to the user's local filesystem via the Kubernetes Python API.
Same here re: not using the Kubernetes API to move files around.
Gotcha. I think an HTTP API implementation would be ideal. I can add that in. Thanks!
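A rough client-side sketch of that HTTP-based retrieval (the endpoint path, port, and file names are hypothetical; the Records Manager would serve its artifacts directory over HTTP):

```python
import pathlib
import requests

def fetch_artifacts(base_url: str, files: list[str], out_dir: str = "./artifacts") -> None:
    """Download benchmark artifacts from the Records Manager to a local directory."""
    out = pathlib.Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for name in files:
        resp = requests.get(f"{base_url}/artifacts/{name}", timeout=30)
        resp.raise_for_status()
        (out / name).write_bytes(resp.content)

fetch_artifacts(
    "http://records-manager.aiperf.svc.cluster.local:8080",
    ["profile_export.json", "profile_export.csv"],  # hypothetical file names
)
```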
**Reason Rejected:**

* Per-service pods provide maximum flexibility and isolation
Flexibility is not free.
**Reason Rejected:**

* Per-service pods provide maximum flexibility and isolation
* Resource requirements for singleton services are sufficiently small that co-location benefits are minimal
Your service pods amount to 7 CPUs and 8GB of RAM. Even if you co-locate a lot of pods, you're looking at a minimum of a c7g.2xl just to barely fit.
* Limited real-time control and monitoring during benchmark execution
* Difficult to implement dynamic scaling based on runtime metrics
* Reduced flexibility for interactive benchmark sessions
* May not support complex coordination patterns required by AIPerf
What's an example?
```
└──────────────────────────────────────────────────────────────────┘
```

# Alternate Solutions
What about using an Operator and a custom resource to define a AIPerfJob?
@itay this does sound like a valid option; I will investigate it. Do you recommend any open source operators we can leverage for this, or does dynamo-runtime have/use one we can leverage? I found this one: https://github.com/nolar/kopf, though I'm not sure if you have experience with it.
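For a sense of what that could look like, a minimal kopf handler for a hypothetical AIPerfJob custom resource (the group, version, and spec fields are all assumptions):

```python
import kopf

@kopf.on.create("aiperf.nvidia.com", "v1alpha1", "aiperfjobs")
def create_fn(spec, name, namespace, logger, **kwargs):
    concurrency = spec.get("concurrency", 1000)
    logger.info(f"Launching AIPerf benchmark {name} with concurrency={concurrency}")
    # The operator would then create the system controller, worker, and
    # record processor pods, much like KubernetesServiceManager would.
    return {"phase": "Running"}  # stored on the resource's status
```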
Ignoring the specific technical implementation of the Operator, it might be worthwhile sketching out the entire user flow/journey and see whether we think it's worthwhile to pursue.
### Service Exposure
The system controller pod exposes ZMQ proxy endpoints via Kubernetes ClusterIP service:
Note: I need to add the network service pieces for the records manager and timing manager exposed ports.
Summary
This proposal outlines the enhancement of AIPerf to support distributed deployment on Kubernetes clusters. The enhancement enables AIPerf to generate significantly higher concurrent loads by distributing work across multiple pods in a Kubernetes cluster, overcoming single-node performance limitations. The solution adopts a true per-service pod architecture where each AIPerf service runs in its own dedicated pod, enabling independent scaling and resource allocation.
AIPerf currently supports only single-node multiprocess deployment. This enhancement proposes implementing the existing `KubernetesServiceManager` stub to enable distributed deployment while maintaining full compatibility with existing service management patterns, ZMQ communication protocols, and configuration systems.
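As a closing illustration, a rough sketch of what implementing that stub could look like (the interface shown is assumed, not taken from the AIPerf source):

```python
from kubernetes import client, config

class KubernetesServiceManager:
    """Launches AIPerf services as Kubernetes pods instead of local processes."""

    def __init__(self, namespace: str = "aiperf", kubeconfig: str | None = None):
        config.load_kube_config(config_file=kubeconfig)
        self.core = client.CoreV1Api()
        self.namespace = namespace

    def start_service(self, name: str, image: str, replicas: int = 1) -> None:
        # One pod per replica, labeled so services can select them.
        for i in range(replicas):
            pod = client.V1Pod(
                metadata=client.V1ObjectMeta(name=f"{name}-{i}", labels={"app": name}),
                spec=client.V1PodSpec(
                    containers=[client.V1Container(name=name, image=image)]
                ),
            )
            self.core.create_namespaced_pod(self.namespace, pod)
```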