Adding README for benchmark using perf.sh #2444
Conversation
Signed-off-by: Belal Yahouni <[email protected]>
Using old readme that went missing. Signed-off-by: Belal Yahouni <[email protected]>
Signed-off-by: Belal Yahouni <[email protected]>
👋 Hi belalyahouni! Thank you for contributing to ai-dynamo/dynamo.
Walkthrough

Replaces benchmarks/llm/README.md with a full LLM deployment benchmarking guide, adding prerequisites, setup, single- and multi-node disaggregated workflows, vLLM aggregated baseline steps, performance data collection via perf.sh, monitoring readiness, metrics/visualization, troubleshooting, and environment setup guidance.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant User
    participant BenchmarkGuide
    participant Infra as NATS/ETCD
    participant Workers as Prefill/Decode Workers
    participant Perf as perf.sh
    participant Metrics as Grafana
    User->>BenchmarkGuide: Follow setup (images, model, env)
    User->>Infra: Start services (docker compose)
    User->>Workers: Launch containers (single/multi-node)
    BenchmarkGuide-->>User: Provide readiness checks
    User->>Perf: Run measurements with config
    Perf->>Workers: Send load (requests)
    Workers-->>Perf: Return metrics/artifacts
    Perf->>Metrics: Produce results for visualization
```
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~18 minutes
Actionable comments posted: 6
🧹 Nitpick comments (9)
benchmarks/llm/README.md (9)
88-90: Proper noun and phrasing: “Hugging Face cache mount”

Minor wording fix for clarity and correctness.

```diff
-> The huggingface home source mount can be changed by setting `--hf-cache ~/.cache/huggingface`.
+> The Hugging Face cache mount can be changed by setting `--hf-cache ~/.cache/huggingface`.
```

```diff
-> The Hugging Face home source mount can be changed by setting `--hf-cache ~/.cache/huggingface`.
+> The Hugging Face cache mount can be changed by setting `--hf-cache ~/.cache/huggingface`.
```

Also applies to: 191-193
107-109: Tighten wording and fix pluralization in perf.sh notes

Improves clarity and fixes “option” → “options”. Also capitalizes Dynamo consistently.

```diff
-> We should be careful in specifying these options in `perf.sh` script. They should closely reflect the deployment config that is being benchmarked. See `perf.sh --help` to learn more about these option. In the above command, we described that our deployment is using disaggregated serving in dynamo with vLLM backend. We have also accurately described that we have 4 prefill workers with TP=1 and 1 decode worker with TP=4
+> Be careful to specify options in the `perf.sh` script that closely reflect the deployment being benchmarked. See `perf.sh --help` to learn more about these options. In the above command, we described a disaggregated deployment in Dynamo with a vLLM backend and accurately indicated 4 prefill workers (TP=1) and 1 decode worker (TP=4).
```

```diff
-> We should be careful in specifying these options in `perf.sh` script. They should closely reflect the deployment config that is being benchmarked. See `perf.sh --help` to learn more about these option. In the above command, we described that our deployment is using disaggregated serving in dynamo with vLLM backend. We have also accurately described that we have 8 prefill workers with TP=1 and 1 decode worker with TP=8
+> Be careful to specify options in the `perf.sh` script that closely reflect the deployment being benchmarked. See `perf.sh --help` to learn more about these options. In the above command, we described a disaggregated deployment in Dynamo with a vLLM backend and accurately indicated 8 prefill workers (TP=1) and 1 decode worker (TP=8).
```

```diff
-> We should be careful in specifying these options in `perf.sh` script. They should closely reflect the deployment config that is being benchmarked. See `perf.sh --help` to learn more about these option. In the above command, we described that our deployment is using aggregated serving in `vllm serve`. We have also accurately described that we have 2 workers with TP=4(or TP=8 for two nodes).
+> Be careful to specify options in the `perf.sh` script that closely reflect the deployment being benchmarked. See `perf.sh --help` to learn more about these options. In the above command, we described an aggregated deployment using `vllm serve` and accurately indicated 2 workers with TP=4 (or TP=8 for two nodes).
```

Also applies to: 172-174, 262-264
114-116: Grammar: “required for the following instructions”

Fixes missing preposition.

```diff
-> Two 8xH100-80GB nodes are required the following instructions.
+> Two 8xH100-80GB nodes are required for the following instructions.
```

```diff
-> One (or two) 8xH100-80GB nodes are required the following instructions.
+> One (or two) 8xH100-80GB nodes are required for the following instructions.
```

Also applies to: 180-182
136-142: Wording: “Configure NATS and ETCD”

Improve the step title wording.

```diff
- 2. Config NATS and ETCD (node 1)
+ 2. Configure NATS and ETCD (node 1)
```
220-227: Grammar: “Use NGINX as a load balancer” and clarify step

Small grammar fix and clarifies the intent of the step.

```diff
- 3. Use NGINX as load balancer
+ 3. Use NGINX as a load balancer
```
318-321: Hyphenate “speed-up” and tighten the example text

Minor editorial polish.

```diff
-For example, at 45 tokens/s/user, the increase in tokens/s/gpu is `145 - 80 = 65`, from the orange baseline to the
-blue disaggregated line, so the improvement is around 1.44x speed up:
+For example, at 45 tokens/s/user, the increase in tokens/s/GPU is `145 - 80 = 65` from the orange baseline to the
+blue disaggregated line, so the improvement is around a 1.44x speed-up:
```
347-349: Grammar: article usage and phrasing (“an HTTP”, “a GET request”)

Minor grammar fixes.

```diff
-In addition to that, the watcher will create
-a HTTP server on port 7001 by default, which you can use to send GET request for readiness to build external benchmarking workflow.
+In addition, the watcher will create
+an HTTP server on port 7001 by default, which you can use to send a GET request for readiness to build an external benchmarking workflow.
```
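The readiness endpoint described above lends itself to a simple polling loop before kicking off a benchmark. A minimal sketch, assuming the watcher answers plain GET requests on port 7001; the helper name, retry counts, and probe command are hypothetical, not a documented interface:

```shell
# Hypothetical readiness poll: run a probe command until it succeeds or
# the retry budget is exhausted. The probe, retries, and delay are all
# assumptions layered on top of the review's description of the watcher.
poll_ready() {
  # $1: command that exits 0 once the deployment is ready
  # $2: max attempts (default 30), $3: seconds between attempts (default 2)
  local probe="$1" retries="${2:-30}" delay="${3:-2}" i
  for i in $(seq "$retries"); do
    if eval "$probe" >/dev/null 2>&1; then
      echo "ready"
      return 0
    fi
    sleep "$delay"
  done
  echo "timed out"
  return 1
}

# Example probe against the watcher's assumed default endpoint:
# poll_ready 'curl -sf http://localhost:7001' 30 2
```

Gating `perf.sh` on such a loop avoids capturing cold-start noise in the first concurrency sweep.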
37-41: Optional: include HF authentication step for private models

If the model is gated/private, users need to authenticate first.

```diff
 huggingface-cli download neuralmagic/DeepSeek-R1-Distill-Llama-70B-FP8-dynamic
+# If the model is gated/private, first authenticate:
+# huggingface-cli login --token "$HF_TOKEN"
```

Please verify whether the referenced model requires authentication, and if so, add the login step above.
365-389: Add a clickable reference to vllm_multinode_setup.sh

Make the helper script discoverable via a direct link if it exists in-repo.

```diff
-### vLLM
-- `vllm_multinode_setup.sh` is a helper script to configure the node for dynamo deployment for
+### vLLM
+- [`vllm_multinode_setup.sh`](./vllm_multinode_setup.sh) is a helper script to configure the node for dynamo deployment for
```

Please verify the actual path of the script and update the link accordingly.
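The head-vs-worker decision this helper is described as making (join the Ray cluster at `RAY_LEADER_NODE_IP` if set, otherwise start the cluster with the current node as head) can be sketched as below. This is a hypothetical reconstruction: the Ray port 6379 and exact flags are assumptions, not taken from the actual script.

```shell
# Sketch of the branch the setup helper reportedly takes. Returning the
# command as a string keeps the logic testable without a Ray install.
ray_start_cmd() {
  local leader_ip="${1:-}"
  if [ -n "$leader_ip" ]; then
    # A leader was specified: join its existing Ray cluster.
    echo "ray start --address=${leader_ip}:6379"
  else
    # No leader: start the Ray cluster with this node as the head.
    echo "ray start --head"
  fi
}

# Usage sketch:
# eval "$(ray_start_cmd "${RAY_LEADER_NODE_IP:-}")"
```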
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
- benchmarks/llm/README.md (1 hunks)
🔇 Additional comments (1)
benchmarks/llm/README.md (1)
220-230: Validate nginx upstream matches vLLM ports and inter-node addresses

The provided nginx.conf should route to ports 8001 and 8002 (or the multi-node addresses if applicable). A mismatch silently skews results.

You can sanity-check nginx upstream targets in `/workspace/benchmarks/llm/nginx.conf` and ensure:

- Single node: upstream includes 127.0.0.1:8001 and 127.0.0.1:8002
- Two nodes: upstream includes Node0:8001 and Node1:8001 (or the intended ports)
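For reference, a single-node upstream consistent with those ports might look like the fragment below. This is a sketch, not the repo's actual `nginx.conf`; the upstream name, listen port, and server entries are assumptions drawn from the review comment.

```nginx
# Hypothetical nginx.conf fragment: round-robin across two local
# `vllm serve` instances on ports 8001 and 8002 (assumed values).
upstream vllm_backends {
    server 127.0.0.1:8001;
    server 127.0.0.1:8002;
}

server {
    listen 8000;
    location / {
        proxy_pass http://vllm_backends;
    }
}
```

For the two-node case, the 127.0.0.1 entries would instead be the node addresses (e.g. Node0:8001 and Node1:8001), which is exactly the spot users forget to update.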
> [!NOTE]
> We recommend trying out the [LLM Deployment Examples](./README.md) before benchmarking.
🛠️ Refactor suggestion
Fix self-referential link to point to the actual deployment examples doc
The link currently points to this README, which can confuse users. Point directly to the deployment examples doc.
```diff
-> We recommend trying out the [LLM Deployment Examples](./README.md) before benchmarking.
+> We recommend trying out the [LLM Deployment Examples](../../../docs/examples/llm_deployment.md) before benchmarking.
```

Committable suggestion skipped: line range outside the PR's diff.
🤖 Prompt for AI Agents
In benchmarks/llm/README.md around lines 22 to 24 the note links to ./README.md
which points back to the same file; update the link target to the actual
deployment examples document (e.g., the correct relative path to the deployment
examples doc such as ../deployments/README.md or the specific file path in the
repo) so the note directs users to the real LLM Deployment Examples page instead
of this README.
Currently, there is no consistent way of obtaining the configuration of deployment service. Hence, we need to provide this information to the script in form of command line arguments. The benchmarking script `/workspace/benchmarks/llm/perf.sh` uses GenAI-Perf tool to collect the performance numbers at various different request concurrencies. The perf.sh script can be run multiple times to collect numbers for various different deployments. Each script execution will create a new artifacts directory in `artifacts_root` and dump these numbers in it. See [Plotting Pareto Graphs](#plotting-pareto-graphs) to learn how to convert the data from this `artifacts_root` to generate pareto graphs for the performance.

Note: As each `perf.sh` adds a new artifacts directory in the `artifacts_root` always, proper care should be taken that we are starting experiment with clean `artifacts_root` so we include only results from runs that we want to compare.

> [!Tip]
> See [GenAI-Perf tutorial](https://github.com/triton-inference-server/perf_analyzer/blob/main/genai-perf/docs/tutorial.md)
> @ [GitHub](https://github.com/triton-inference-server/perf_analyzer) for additional information about how to run GenAI-Perf
> and how to interpret results.
🛠️ Refactor suggestion
Clarify artifacts handling; remove redundancy; fix grammar
Reduces “various different” redundancy, clarifies “in the form of”, and consistently refers to the artifacts root directory.
```diff
-Currently, there is no consistent way of obtaining the configuration of deployment service. Hence, we need to provide this information to the script in form of command line arguments. The benchmarking script `/workspace/benchmarks/llm/perf.sh` uses GenAI-Perf tool to collect the performance numbers at various different request concurrencies. The perf.sh script can be run multiple times to collect numbers for various different deployments. Each script execution will create a new artifacts directory in `artifacts_root` and dump these numbers in it. See [Plotting Pareto Graphs](#plotting-pareto-graphs) to learn how to convert the data from this `artifacts_root` to generate pareto graphs for the performance.
-
-Note: As each `perf.sh` adds a new artifacts directory in the `artifacts_root` always, proper care should be taken that we are starting experiment with clean `artifacts_root` so we include only results from runs that we want to compare.
+Currently, there is no consistent way to obtain the deployment configuration automatically. Therefore, provide this information to the script in the form of command-line arguments. The benchmarking script `/workspace/benchmarks/llm/perf.sh` uses the GenAI-Perf tool to collect performance numbers at different request concurrencies. You can run `perf.sh` multiple times to collect numbers for different deployments. Each run creates a new artifacts directory under the artifacts root directory and writes the results there. See [Plotting Pareto Graphs](#plotting-pareto-graphs) to learn how to generate Pareto graphs from the artifacts root directory.
+
+Note: Because each `perf.sh` run adds a new artifacts directory under the artifacts root directory, start from a clean artifacts root when you want to include only results from the runs you intend to compare.
```

```diff
-> @ [GitHub](https://github.com/triton-inference-server/perf_analyzer) for additional information about how to run GenAI-Perf
+> on [GitHub](https://github.com/triton-inference-server/perf_analyzer) for additional information about how to run GenAI-Perf
```

Also applies to: 271-271
🤖 Prompt for AI Agents
In benchmarks/llm/README.md around lines 269 to 276 (also touching line 271),
reword the paragraph to remove redundant phrases like “various different” (use
“various” or “different”), change “in form of” to “in the form of”, and
consistently refer to the artifacts root directory as `artifacts_root`; also
tighten grammar for clarity (e.g., “Each perf.sh execution creates a new
artifacts directory in artifacts_root and stores results there. Ensure
artifacts_root is clean before starting an experiment so only intended runs are
compared.”) Make these edits so the paragraph reads clearly and consistently
without changing the meaning.
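The clean-artifacts advice above can be made concrete with a small helper. A hedged sketch: the artifacts path and the `perf.sh` invocation are placeholders, not the script's documented interface.

```shell
# Hypothetical workflow: wipe and recreate the artifacts root so that only
# the runs from this experiment end up being compared.
fresh_artifacts_root() {
  local root="$1"
  rm -rf "$root"      # drop stale results from earlier experiments
  mkdir -p "$root"    # start with an empty artifacts root
}

# Usage sketch (paths and flags are placeholders):
# fresh_artifacts_root ./artifacts_root
# ./perf.sh ...   # each run adds one new directory under artifacts_root
```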
For each concurrency benchmarked, it produces a latency and throughput value pair. The x-axis on the Pareto graph is latency (tokens/s/user), which the latency is lower if the value is higher. The y-axis on the Pareto graph is throughput (tokens/s/gpu). The latency and throughput value pair forms a dot on the Pareto graph. A line (Pareto Frontier) is formed when the dots from different concurrency values are plotted on the graph.
🛠️ Refactor suggestion
Clarify axis definitions; current phrasing is confusing
“Latency (tokens/s/user)” reads as contradictory. Rephrase to use “inter-token rate” and clearly indicate that higher is better.
-For each concurrency benchmarked, it produces a latency and throughput value pair. The x-axis on the Pareto graph is
-latency (tokens/s/user), which the latency is lower if the value is higher. The y-axis on the Pareto graph is throughput
-(tokens/s/gpu). The latency and throughput value pair forms a dot on the Pareto graph. A line (Pareto Frontier) is
-formed when the dots from different concurrency values are plotted on the graph.
+For each concurrency, the benchmark produces a pair of values. The x-axis shows the inter-token rate (tokens/s/user) — higher is better (lower latency). The y-axis shows throughput (tokens/s/GPU). Each pair forms a point on the Pareto graph, and connecting the best points yields the Pareto frontier.📝 Committable suggestion
🧰 Tools
🪛 LanguageTool
[grammar] ~309-~309: There might be a mistake here.
Context: ... pair. The x-axis on the Pareto graph is latency (tokens/s/user), which the laten...
(QB_NEW_EN)
[grammar] ~311-~311: There might be a mistake here.
Context: ...areto graph. A line (Pareto Frontier) is formed when the dots from different conc...
(QB_NEW_EN)
🤖 Prompt for AI Agents
In benchmarks/llm/README.md around lines 309 to 313, the x-axis label "latency
(tokens/s/user)" is confusing and contradictory; change the wording to
"inter-token rate (tokens/s/user) — higher is better (faster per-user token
rate, i.e., lower latency)" and clearly state the y-axis as "throughput
(tokens/s/GPU) — higher is better"; update the paragraph to say that each
concurrency produces an (inter-token rate, throughput) pair plotted as a dot,
and the Pareto frontier is formed by connecting the dots across concurrencies.
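The frontier construction described in the suggested wording can be sketched in a few lines of Python. This is illustrative only; the function name and the sample numbers are not part of the benchmark scripts.

```python
def pareto_frontier(points):
    """Keep the non-dominated (rate_per_user, throughput_per_gpu) pairs.

    Higher is better on both axes; a point is dominated when another
    point is at least as good on both axes and strictly better on one.
    """
    frontier = []
    best_throughput = float("-inf")
    # Sort by per-user rate descending, then keep points whose
    # throughput beats everything seen so far.
    for rate, tput in sorted(points, key=lambda p: -p[0]):
        if tput > best_throughput:
            frontier.append((rate, tput))
            best_throughput = tput
    return frontier

# One (rate, throughput) pair per concurrency level, as in the README text.
samples = [(120.0, 35.0), (80.0, 90.0), (95.0, 60.0), (60.0, 85.0)]
print(pareto_frontier(samples))
# [(120.0, 35.0), (95.0, 60.0), (80.0, 90.0)] — (60.0, 85.0) is dominated
```

Plotting the returned points and connecting them produces the Pareto frontier line described above.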
> - [Dynamo Multinode Deployments](../../../docs/examples/multinode.md)
> - [Dynamo TensorRT LLM Deployments](../../../docs/examples/trtllm.md)
>     - [Aggregated Deployment of Very Large Models](../../../docs/examples/multinode.md#aggregated-deployment)
> - [Dynamo vLLM Deployments](../../../docs/examples/llm_deployment.md)
🛠️ Refactor suggestion
Fix markdownlint issues: bullet style and indentation (MD004, MD007)
Switch dashes to asterisks and correct the nested bullet indentation.
-- [Dynamo Multinode Deployments](../../../docs/examples/multinode.md)
-- [Dynamo TensorRT LLM Deployments](../../../docs/examples/trtllm.md)
-    - [Aggregated Deployment of Very Large Models](../../../docs/examples/multinode.md#aggregated-deployment)
-- [Dynamo vLLM Deployments](../../../docs/examples/llm_deployment.md)
+* [Dynamo Multinode Deployments](../../../docs/examples/multinode.md)
+* [Dynamo TensorRT LLM Deployments](../../../docs/examples/trtllm.md)
+  * [Aggregated Deployment of Very Large Models](../../../docs/examples/multinode.md#aggregated-deployment)
+* [Dynamo vLLM Deployments](../../../docs/examples/llm_deployment.md)
🧰 Tools
🪛 LanguageTool
[grammar] ~334-~334: There might be a mistake here.
Context: ...xamples - Dynamo Multinode Deployments - [Dynamo TensorRT LLM Deployments](../../....
(QB_NEW_EN)
[grammar] ~335-~335: There might be a mistake here.
Context: ...e.md) - Dynamo TensorRT LLM Deployments - [Aggregated Deployment of Very Large Mode...
(QB_NEW_EN)
[grammar] ~336-~336: There might be a mistake here.
Context: ...gregated Deployment of Very Large Models](../../../docs/examples/multinode.md#aggregated-deployment) - [Dynamo vLLM Deployments](../../../docs/e...
(QB_NEW_EN)
🪛 markdownlint-cli2 (0.17.2)
334-334: Unordered list style
Expected: asterisk; Actual: dash
(MD004, ul-style)
335-335: Unordered list style
Expected: asterisk; Actual: dash
(MD004, ul-style)
336-336: Unordered list style
Expected: asterisk; Actual: dash
(MD004, ul-style)
336-336: Unordered list indentation
Expected: 2; Actual: 4
(MD007, ul-indent)
337-337: Unordered list style
Expected: asterisk; Actual: dash
(MD004, ul-style)
🤖 Prompt for AI Agents
In benchmarks/llm/README.md around lines 334 to 338, the bulleted list uses
dashes and incorrect nested indentation triggering markdownlint rules
MD004/MD007; replace the leading dashes with asterisks for all list items and
adjust the nested item indentation so the nested bullet is indented by two or
four spaces (consistent with the file's list style) under its parent item to
produce a properly nested asterisk-style list.
> - `vllm_multinode_setup.sh` is a helper script to configure the node for dynamo deployment for
> vLLM. Depending on whether environment variable `HEAD_NODE_IP` and `RAY_LEADER_NODE_IP` are set
> when the script is invoked, it will:
>   - start nats server and etcd on the current node if `HEAD_NODE_IP` is not set, otherwise
> set the environment variables as expected by dynamo.
>   - run Ray and connect to the Ray cluster started by `RAY_LEADER_NODE_IP`, otherwise start
> the Ray cluster with current node as the head node.
>   - print the command with `HEAD_NODE_IP` and `RAY_LEADER_NODE_IP` set, which can be used in
> another node to setup connectivity with the current node.
🛠️ Refactor suggestion
Fix markdownlint issues: bullet style (MD004)
Use asterisks for unordered lists.
-- `vllm_multinode_setup.sh` is a helper script to configure the node for dynamo deployment for
+- `vllm_multinode_setup.sh` is a helper script to configure the node for dynamo deployment for
vLLM. Depending on whether environment variable `HEAD_NODE_IP` and `RAY_LEADER_NODE_IP` are set
when the script is invoked, it will:
- - start nats server and etcd on the current node if `HEAD_NODE_IP` is not set, otherwise
+ * start nats server and etcd on the current node if `HEAD_NODE_IP` is not set, otherwise
set the environment variables as expected by dynamo.
- - run Ray and connect to the Ray cluster started by `RAY_LEADER_NODE_IP`, otherwise start
+ * run Ray and connect to the Ray cluster started by `RAY_LEADER_NODE_IP`, otherwise start
the Ray cluster with current node as the head node.
- - print the command with `HEAD_NODE_IP` and `RAY_LEADER_NODE_IP` set, which can be used in
+ * print the command with `HEAD_NODE_IP` and `RAY_LEADER_NODE_IP` set, which can be used in
another node to setup connectivity with the current node.
Committable suggestion skipped: line range outside the PR's diff.
🧰 Tools
🪛 LanguageTool
[grammar] ~368-~368: There might be a mistake here.
Context: ...igure the node for dynamo deployment for vLLM. Depending on whether environment v...
(QB_NEW_EN)
[grammar] ~369-~369: There might be a mistake here.
Context: ...ODE_IPandRAY_LEADER_NODE_IP` are set when the script is invoked, it will: -...
(QB_NEW_EN)
[grammar] ~372-~372: There might be a mistake here.
Context: ...ronment variables as expected by dynamo. - run Ray and connect to the Ray cluster s...
(QB_NEW_EN)
[grammar] ~374-~374: There might be a mistake here.
Context: ..._IP`, otherwise start the Ray cluster with current node as the head node. - prin...
(QB_NEW_EN)
[grammar] ~374-~374: There might be a mistake here.
Context: ...ster with current node as the head node. - print the command with HEAD_NODE_IP an...
(QB_NEW_EN)
🪛 markdownlint-cli2 (0.17.2)
368-368: Unordered list style
Expected: asterisk; Actual: dash
(MD004, ul-style)
371-371: Unordered list style
Expected: asterisk; Actual: dash
(MD004, ul-style)
373-373: Unordered list style
Expected: asterisk; Actual: dash
(MD004, ul-style)
375-375: Unordered list style
Expected: asterisk; Actual: dash
(MD004, ul-style)
🤖 Prompt for AI Agents
In benchmarks/llm/README.md around lines 368 to 377, the unordered list items
use hyphens which triggers markdownlint MD004; replace the hyphen bullets with
asterisks so the block consistently uses '*' for each list item and nested lines
(i.e., change "- start nats server..." and "- run Ray..." and "- print the
command..." to use "*") while keeping the same text and indentation.
> - **Framework-only benchmark** The simplest way is to send sample requests and check
> the logs of all workers. Each framework may provide utilities for readiness checks, so please
> refer to the framework's documentation for those details.
> - **Dynamo based benchmark** Once you start the deployment, you can follow
> the instructions in [monitor benchmark startup status](#Monitor-Benchmark-Startup-Status),
> which will periodically poll the workers exposed to specific endpoints
> and return HTTP 200 code when the expected number of workers are met.
🛠️ Refactor suggestion
Fix markdownlint issues and broken anchor (MD004, MD051)
- Use asterisks for unordered lists.
- Fix the link fragment to the lowercase anchor expected by GitHub.
-- **Framework-only benchmark** The simplest way is to send sample requests and check
+- **Framework-only benchmark** The simplest way is to send sample requests and check
the logs of all workers. Each framework may provide utilities for readiness checks, so please
refer to the framework's documentation for those details.
-- **Dynamo based benchmark** Once you start the deployment, you can follow
-the instructions in [monitor benchmark startup status](#Monitor-Benchmark-Startup-Status),
+- **Dynamo based benchmark** Once you start the deployment, you can follow
+the instructions in [Monitor Benchmark Startup Status](#monitor-benchmark-startup-status),
which will periodically poll the workers exposed to specific endpoints
and return HTTP 200 code when the expected number of workers are met.
Committable suggestion skipped: line range outside the PR's diff.
🧰 Tools
🪛 LanguageTool
[grammar] ~431-~431: There might be a mistake here.
Context: ...you start the deployment, you can follow the instructions in [monitor benchmark s...
(QB_NEW_EN)
[grammar] ~432-~432: There might be a mistake here.
Context: ...tus](#Monitor-Benchmark-Startup-Status), which will periodically poll the workers...
(QB_NEW_EN)
[grammar] ~433-~433: There might be a mistake here.
Context: ...he workers exposed to specific endpoints and return HTTP 200 code when the expect...
(QB_NEW_EN)
[grammar] ~434-~434: There might be a mistake here.
Context: ... the expected number of workers are met.
(QB_NEW_EN)
🪛 markdownlint-cli2 (0.17.2)
428-428: Unordered list style
Expected: asterisk; Actual: dash
(MD004, ul-style)
431-431: Unordered list style
Expected: asterisk; Actual: dash
(MD004, ul-style)
432-432: Link fragments should be valid
Expected: #monitor-benchmark-startup-status; Actual: #Monitor-Benchmark-Startup-Status
(MD051, link-fragments)
🤖 Prompt for AI Agents
In benchmarks/llm/README.md around lines 428 to 434, the unordered list items
currently use hyphens and the link fragment uses a mixed-case anchor; change the
list markers to asterisks (*) and update the link fragment to the lowercase
GitHub-compatible anchor (e.g. #monitor-benchmark-startup-status) so the list
conforms to MD004 and the link resolves for MD051.
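The readiness polling the comment describes (retry until an endpoint answers HTTP 200) can be sketched as follows. The helper names and the `/health` URL are assumptions; the actual endpoint and expected-worker logic depend on the deployment.

```python
import time
import urllib.error
import urllib.request

def wait_until_ready(probe, timeout_s=300.0, interval_s=2.0):
    """Poll `probe` until it returns True or the timeout elapses."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if probe():
            return True
        time.sleep(interval_s)
    return False

def http_200(url):
    """True when the endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False

# Hypothetical readiness endpoint; substitute the deployment's real one:
# ready = wait_until_ready(lambda: http_200("http://localhost:8000/health"))
```

Starting measurements only after this returns `True` avoids benchmarking a deployment whose workers are still loading the model.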
Overview:
Added back the README for benchmarking.
Details:
Previously there was a README describing how to benchmark using GenAI Perf. In v0.4.0 it was supposedly moved; however, I was not able to locate it.
Where should the reviewer start?
n/a
Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)
n/a