Add SPEED-Bench (within repo) #1279
Merged
Commits (20):
- d031708 Add SPEED-Bench (talorabr)
- b237015 Fix metric and prepare data script for speed-bench (talorabr)
- 0ac8cb4 Fix data dir argument (talorabr)
- 15df539 Add specdec generation module (talorabr)
- 445213a Update docs/evaluation/speculative-decoding.md (talorabr)
- e326fde Update nemo_skills/dataset/speed-bench/prepare.py (talorabr)
- 9c73e12 Merge branch 'main' into speed-bench (gwarmstrong)
- 3143529 Merge branch 'main' of github.com:NVIDIA-NeMo/Skills into speed-bench (talorabr)
- f42d153 CR fix (talorabr)
- 8b4a5d3 Remove dataset group (talorabr)
- 858317a Raise exception when no specdec_stats (talorabr)
- aeda1da Remove nemo skill dep in prepare data of speed-bench (talorabr)
- dce2ed7 Fix documentation (talorabr)
- a4185aa Added metrics to documentation (talorabr)
- 7715161 Merge branch 'main' into talora/speed-bench (talorabr)
- b2aea0e Merge branch 'main' into talora/speed-bench (gwarmstrong)
- d47dae5 stackselect prompt token count fix (talorabr)
- 78645cc Merge branch 'talora/speed-bench' of github.com:NVIDIA-NeMo/Skills in… (talorabr)
- 55be3c1 Exclude speed-bench from GPU CI tests (exhausts disk space) (gwarmstrong)
- c8c2e69 Merge branch 'main' into talora/speed-bench (gwarmstrong)
docs/evaluation/speculative-decoding.md (new file):
# Speculative Decoding

This section details how to evaluate speculative decoding (SD) benchmarks.
SD has emerged as a leading technique for accelerating LLM inference. By allowing a smaller draft model to propose multiple future tokens that are verified in a single forward pass by a larger target model, SD can significantly increase system throughput.

In all SD benchmarks we want to measure two qualitative metrics for draft accuracy/quality: acceptance length (AL) and acceptance rate (AR).
Another metric in this group is the conditional acceptance rate (or per-position acceptance rate), which measures the acceptance rate at a given position, conditioned on all previous tokens having been accepted.
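As a concrete illustration of these definitions, AL and AR can be computed from raw acceptance counters. This is a minimal sketch under common definitions (each verification step emits the target model's own token plus any accepted draft tokens); the function and counter names are illustrative, not the framework's actual variables:

```python
def specdec_metrics(accepted_tokens, drafted_tokens, num_steps):
    """Compute average SD acceptance metrics from raw counters.

    accepted_tokens: total draft tokens accepted by the target model
    drafted_tokens:  total draft tokens proposed by the draft model
    num_steps:       number of target-model verification steps
    """
    # Acceptance rate (AR): fraction of proposed draft tokens that were accepted.
    acceptance_rate = accepted_tokens / drafted_tokens
    # Acceptance length (AL): average tokens emitted per verification step.
    # Each step emits at least one token (the target model's own sample),
    # plus however many draft tokens were accepted in that step.
    acceptance_length = 1 + accepted_tokens / num_steps
    return acceptance_length, acceptance_rate


al, ar = specdec_metrics(accepted_tokens=178, drafted_tokens=300, num_steps=100)
print(f"AL={al:.2f}, AR={ar:.1%}")  # AL=2.78, AR=59.3%
```

Note that with a fixed draft length `k`, AL and AR are linked (accepted tokens per step is `k * AR` under these definitions), which is why benchmarks typically report both only when the draft length can vary.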
For more advanced evaluation of SD, including throughput and per-category metrics, please use the [specdec_bench evaluation framework](https://github.com/NVIDIA/Model-Optimizer/tree/main/examples/specdec_bench) in NVIDIA Model-Optimizer.

## How we evaluate
!!! note
    The current evaluation supports only SGLang and vLLM servers.

The evaluation is executed by the following process:

1. Read the SD metrics from the server's `/metrics` endpoint.
2. Send the benchmark's prompts to the server.
3. Read the `/metrics` endpoint again and compute the difference from step (1) to obtain the average SD metrics (AL, AR, etc.).
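The before/after counter-diffing flow can be sketched as follows. This is a simplified illustration, not the repository's implementation; the Prometheus counter names (`spec_accept_count`, `spec_draft_count`) are assumptions, and the real names differ between SGLang and vLLM:

```python
import re
import urllib.request


def parse_counters(metrics_text):
    """Parse a Prometheus-style /metrics payload into {name: value}."""
    counters = {}
    for line in metrics_text.splitlines():
        if line.startswith("#"):  # skip HELP/TYPE comment lines
            continue
        m = re.match(r"([A-Za-z_:][\w:]*)(?:\{[^}]*\})?\s+([0-9.eE+-]+)\s*$", line)
        if m:
            counters[m.group(1)] = float(m.group(2))
    return counters


def scrape_counters(server_url):
    """Fetch the server's /metrics endpoint and return its counters."""
    with urllib.request.urlopen(f"{server_url}/metrics") as resp:
        return parse_counters(resp.read().decode())


def average_acceptance_rate(server_url, send_benchmark_prompts):
    """Steps 1-3: sample counters, run the benchmark, diff the counters."""
    before = scrape_counters(server_url)  # step 1
    send_benchmark_prompts()              # step 2: send all benchmark prompts
    after = scrape_counters(server_url)   # step 3: read metrics again
    # The diff isolates this run's totals from whatever ran before it.
    # Counter names below are hypothetical placeholders.
    accepted = after["spec_accept_count"] - before["spec_accept_count"]
    drafted = after["spec_draft_count"] - before["spec_draft_count"]
    return accepted / drafted
```

Diffing cumulative counters is what makes the approach server-agnostic: nothing needs to be reset between runs, as long as no other traffic hits the server during the benchmark.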
!!! note
    For the `local` executor and the SGLang server, we also support a flow that writes a metrics file per request to a local path, from which we then calculate the SD metrics. This gives per-request metrics, which can be relevant in some cases. More information on this feature can be found in the [SGLang server arguments documentation](https://docs.sglang.io/advanced_features/server_arguments.html#requestmetricsexporter-configuration).
## Supported Benchmarks

### SPEED-Bench

- The benchmark is defined in [`nemo_skills/dataset/speed-bench/__init__.py`](https://github.com/NVIDIA-NeMo/Skills/blob/main/nemo_skills/dataset/speed-bench/__init__.py).
- The original benchmark source is the [SPEED-Bench dataset on Hugging Face](https://huggingface.co/datasets/nvidia/SPEED-Bench).
- NOTICE: This dataset is governed by the [NVIDIA Evaluation Dataset License Agreement](https://huggingface.co/datasets/nvidia/SPEED-Bench/blob/main/License.pdf). For each dataset a user elects to use, the user is responsible for checking if the dataset license is fit for the intended purpose. The `prepare_data` script automatically fetches data from all the source datasets.
#### Data preparation

See an example of a data preparation command in the [main evaluation docs](../evaluation/index.md#using-data-on-cluster).

```shell
ns prepare_data speed-bench --data_dir=<output directory for data files> --cluster=<cluster config>
```
Other supported options:

* **config**: select which config to prepare; can be one of the splits in the dataset (e.g., `qualitative`, `throughput_2k`) or `all` to prepare all of the configs.
#### Evaluation command

An example of running Llama 3.3 70B with an external draft model (Llama 3.2 1B) using SGLang and a draft length of 3:

```bash
ns eval \
    --cluster=<cluster config> \
    --data_dir=<must match prepare_data parameter> \
    --output_dir=<any mounted output location> \
    --benchmarks=speed-bench \
    --model=meta-llama/Llama-3.3-70B-Instruct \
    --server_args="--speculative-algorithm STANDALONE --speculative-draft-model-path meta-llama/Llama-3.2-1B-Instruct --speculative-num-steps 3 --speculative-eagle-topk 1 --torch-compile-max-bs 32 --max-running-requests 32 --cuda-graph-max-bs 32 --mem-fraction-static 0.8" \
    --server_nodes=1 \
    --server_gpus=8 \
    --server_type=sglang \
    ++inference.tokens_to_generate=1024
```

Example evaluation metrics:

```text
--------------------------------------------- speed-bench ----------------------------------------------
evaluation_mode | num_entries | avg_tokens | gen_seconds | spec_acceptance_length | spec_acceptance_rate
pass@1 | 880 | 464 | 139 | 2.78 | 69.38
```
|
Comment on lines
+68
to
+72
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Add a language identifier to the metrics output code fence. This block should specify a language (e.g., 💡 Proposed markdownlint fix-```
+```text
--------------------------------------------- speed-bench ----------------------------------------------
evaluation_mode | num_entries | avg_tokens | gen_seconds | spec_acceptance_length | spec_acceptance_rate
pass@1 | 880 | 464 | 139 | 2.78 | 69.38🧰 Tools🪛 markdownlint-cli2 (0.21.0)[warning] 77-77: Fenced code blocks should have a language specified (MD040, fenced-code-language) 🤖 Prompt for AI Agents |
An example of running Llama 3.3 70B with EAGLE3 using vLLM and a draft length of 3:

```bash
ns eval \
    --cluster=<cluster config> \
    --data_dir=<must match prepare_data parameter> \
    --output_dir=<any mounted output location> \
    --benchmarks=speed-bench \
    --model=meta-llama/Llama-3.3-70B-Instruct \
    --server_args="--speculative-config '{\"method\": \"eagle3\", \"num_speculative_tokens\": 3, \"model\": \"nvidia/Llama-3.3-70B-Instruct-Eagle3\"}'" \
    --server_nodes=1 \
    --server_gpus=8 \
    --server_type=vllm \
    ++inference.tokens_to_generate=1024
```
Example evaluation metrics:

```text
--------------------------------------------- speed-bench ----------------------------------------------
evaluation_mode | num_entries | avg_tokens | gen_seconds | spec_acceptance_length | spec_acceptance_rate
pass@1 | 880 | 463 | 104 | 2.37 | 45.52
```
nemo_skills/dataset/speed-bench/__init__.py (new file):
```python
# Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# settings that define how evaluation should be done by default (all can be changed from cmdline)
REQUIRES_DATA_DIR = True
METRICS_TYPE = "specdec"
EVAL_SPLIT = "qualitative"
GENERATION_ARGS = "++prompt_format=openai ++eval_type=specdec ++inference.include_response=true"
GENERATION_MODULE = "nemo_skills.inference.eval.specdec"
```