1 change: 1 addition & 0 deletions docs/evaluation/index.md
@@ -12,6 +12,7 @@ We support many popular benchmarks and it's easy to add new in the future. The f
- [**Multilingual**](./multilingual.md): e.g. [mmlu-prox](./multilingual.md#mmlu-prox), [flores-200](./multilingual.md#flores-200), [wmt24pp](./multilingual.md#wmt24pp)
- [**Speech & Audio**](./speech-audio.md): e.g. [asr-leaderboard](./speech-audio.md#asr-leaderboard), [mmau-pro](./speech-audio.md#mmau-pro)
- [**Vision-Language Models (VLM)**](./vlm.md): e.g. [mmmu-pro](./vlm.md#mmmu-pro)
- [**Speculative Decoding (SD)**](./speculative-decoding.md): e.g. [SPEED-Bench](./speculative-decoding.md#speed-bench)

See [nemo_skills/dataset](https://github.com/NVIDIA-NeMo/Skills/blob/main/nemo_skills/dataset) where each folder is a benchmark we support.

96 changes: 96 additions & 0 deletions docs/evaluation/speculative-decoding.md
@@ -0,0 +1,96 @@
# Speculative Decoding

This section details how to evaluate speculative decoding (SD) benchmarks.
SD has emerged as a leading technique for accelerating LLM inference. By allowing a smaller draft model to propose multiple future tokens that are verified in a single forward pass by a larger target model, SD can significantly increase system throughput.

In all SD benchmarks we want to measure two quality metrics for the draft model: acceptance length (AL) and acceptance rate (AR).
A related metric is the conditional acceptance rate (or per-position acceptance rate), which measures the acceptance rate at a given draft position, conditioned on all previous tokens having been accepted.
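As a sketch of how these quantities are computed (assuming, hypothetically, per-step records of drafted vs. accepted token counts; the real servers expose cumulative counters instead, and field layouts differ):

```python
# Hypothetical sketch: computing acceptance length (AL) and acceptance rate (AR)
# from per-step speculation records. The record layout is illustrative, not an
# actual server schema.

def compute_sd_metrics(steps):
    """Each step is (num_draft_tokens, num_accepted_draft_tokens)."""
    total_draft = sum(d for d, _ in steps)
    total_accepted = sum(a for _, a in steps)
    # AL: average tokens emitted per target forward pass. One common convention
    # counts the accepted draft tokens plus the one bonus token the target
    # model emits on each verification step.
    al = (total_accepted + len(steps)) / len(steps)
    # AR: fraction of drafted tokens that the target model accepted.
    ar = total_accepted / total_draft
    return al, ar

al, ar = compute_sd_metrics([(3, 2), (3, 3), (3, 1)])
print(f"AL={al:.2f}, AR={ar:.2f}")  # AL=3.00, AR=0.67
```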

For more advanced evaluation of SD, including throughput and per-category metrics, please use the [Model Optimizer speculative decoding benchmark framework](https://github.com/NVIDIA/Model-Optimizer/tree/main/examples/specdec_bench).

## How we evaluate

!!! note
    The current evaluation supports only SGLang and vLLM servers.

The evaluation is executed by the following process:

1. Read the SD metrics from the server's `/metrics` endpoint.
2. Send the benchmark's prompts to the server.
3. Read the `/metrics` endpoint again and compute the difference from step (1) to obtain the average SD metrics (AL, AR, etc.).
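The three steps above can be sketched as follows; the counter names and the parsing are simplified assumptions, not the servers' actual metric schemas:

```python
# Hedged sketch of the metrics-diff flow: snapshot the server's cumulative SD
# counters before and after the benchmark run, then compute averages from the
# deltas. Counter names below are illustrative placeholders.
import re


def parse_counters(metrics_text, names):
    """Extract Prometheus-style counter values from a /metrics response body."""
    counters = {}
    for name in names:
        m = re.search(rf"^{name}(?:{{[^}}]*}})?\s+([0-9.eE+-]+)$", metrics_text, re.M)
        counters[name] = float(m.group(1)) if m else 0.0
    return counters


def sd_metrics_from_delta(before, after):
    """Step 3: average AL and AR from the before/after counter snapshots."""
    d = {k: after[k] - before[k] for k in after}
    al = (d["accepted_tokens"] + d["spec_steps"]) / d["spec_steps"]
    ar = d["accepted_tokens"] / d["draft_tokens"]
    return al, ar


NAMES = ["accepted_tokens", "draft_tokens", "spec_steps"]
# Step 1: snapshot /metrics before the run (response bodies shown inline here).
before = parse_counters("accepted_tokens 100\ndraft_tokens 150\nspec_steps 50\n", NAMES)
# Step 2: send the benchmark prompts to the server (elided).
# Step 3: snapshot again and diff.
after = parse_counters("accepted_tokens 340\ndraft_tokens 450\nspec_steps 150\n", NAMES)
print(sd_metrics_from_delta(before, after))  # (3.4, 0.8)
```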

!!! note
    For the `local` executor with an SGLang server, we also support a flow that writes a per-request metrics file to a local path and then calculates the SD metrics from that file. This provides per-request metrics, which can be useful in some cases. More information on this feature can be found in the [SGLang documentation](https://docs.sglang.io/advanced_features/server_arguments.html#requestmetricsexporter-configuration).
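A minimal sketch of consuming such a per-request metrics file, assuming a JSONL layout with hypothetical field names (the actual exporter schema is documented by SGLang):

```python
# Hypothetical sketch: per-request AL from a JSONL metrics file. The field
# names "accepted_tokens" and "spec_steps" are assumptions for illustration.
import json


def per_request_al(path):
    """Return the acceptance length of each request recorded in the file."""
    als = []
    with open(path) as f:
        for line in f:
            rec = json.loads(line)
            # Accepted draft tokens plus one bonus token per verification step.
            als.append((rec["accepted_tokens"] + rec["spec_steps"]) / rec["spec_steps"])
    return als
```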


## Supported Benchmarks

### SPEED-Bench

- Benchmark is defined in [`nemo_skills/dataset/speed-bench/__init__.py`](https://github.com/NVIDIA-NeMo/Skills/blob/main/nemo_skills/dataset/speed-bench/__init__.py)
- The original benchmark source is the [SPEED-Bench dataset on Hugging Face](https://huggingface.co/datasets/nvidia/SPEED-Bench).
- NOTICE: This dataset is governed by the [NVIDIA Evaluation Dataset License Agreement](https://huggingface.co/datasets/nvidia/SPEED-Bench/blob/main/License.pdf). For each dataset a user elects to use, the user is responsible for checking if the dataset license is fit for the intended purpose. The `prepare_data` script automatically fetches data from all the source datasets.

#### Data preparation

See an example of the data preparation command in the [main evaluation docs](../evaluation/index.md#using-data-on-cluster).

```shell
ns prepare_data speed-bench --data_dir=<output directory for data files> --cluster=<cluster config>
```

Other supported options:

* **config**: select which config to prepare, can be one of the splits in the dataset (e.g., `qualitative`, `throughput_2k`) or `all` to prepare all of the configs.


#### Evaluation command

An example of running Llama 3.3 70B with external draft Llama 3.2 1B using SGLang and a draft length of 3:

```bash
ns eval \
--cluster=<cluster config> \
--data_dir=<must match prepare_data parameter> \
--output_dir=<any mounted output location> \
--benchmarks=speed-bench \
--model=meta-llama/Llama-3.3-70B-Instruct \
--server_args="--speculative-algorithm STANDALONE --speculative-draft-model-path meta-llama/Llama-3.2-1B-Instruct --speculative-num-steps 3 --speculative-eagle-topk 1 --torch-compile-max-bs 32 --max-running-requests 32 --cuda-graph-max-bs 32 --mem-fraction-static 0.8" \
--server_nodes=1 \
--server_gpus=8 \
--server_type=sglang \
++inference.tokens_to_generate=1024
```

Example evaluation metrics:

```text
--------------------------------------------- speed-bench ----------------------------------------------
evaluation_mode | num_entries | avg_tokens | gen_seconds | spec_acceptance_length | spec_acceptance_rate
pass@1          | 880         | 464        | 139        | 2.78                   | 69.38
```
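As a rough sanity check on numbers like these: under the i.i.d. per-token acceptance model of Leviathan et al. (2023), a draft length `k` and per-token acceptance rate `alpha` imply an expected acceptance length of `(1 - alpha**(k + 1)) / (1 - alpha)`. This is only an approximation, since real acceptances are not i.i.d., which is why the measured AL can differ:

```python
# Sanity-check sketch (not part of the evaluation pipeline): expected AL under
# an i.i.d. per-token acceptance model with draft length k and acceptance
# rate alpha.

def expected_al(alpha, k):
    return (1 - alpha ** (k + 1)) / (1 - alpha)

# With the measured AR of ~69.38% and draft length 3, the i.i.d. model predicts
# an AL of about 2.51; the measured 2.78 is higher because acceptance events
# are correlated in practice.
print(round(expected_al(0.6938, 3), 2))  # 2.51
```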

An example of running Llama 3.3 70B with EAGLE3 using vLLM and a draft length of 3:

```bash
ns eval \
--cluster=<cluster config> \
--data_dir=<must match prepare_data parameter> \
--output_dir=<any mounted output location> \
--benchmarks=speed-bench \
--model=meta-llama/Llama-3.3-70B-Instruct \
--server_args="--speculative-config '{\"method\": \"eagle3\", \"num_speculative_tokens\": 3, \"model\": \"nvidia/Llama-3.3-70B-Instruct-Eagle3\"}'" \
--server_nodes=1 \
--server_gpus=8 \
--server_type=vllm \
++inference.tokens_to_generate=1024
```

Example evaluation metrics:

```text
--------------------------------------------- speed-bench ----------------------------------------------
evaluation_mode | num_entries | avg_tokens | gen_seconds | spec_acceptance_length | spec_acceptance_rate
pass@1          | 880         | 463        | 104        | 2.37                   | 45.52
```
1 change: 1 addition & 0 deletions mkdocs.yml
@@ -89,6 +89,7 @@ nav:
- evaluation/vlm.md
- evaluation/other-benchmarks.md
- evaluation/robustness.md
- evaluation/speculative-decoding.md
- External benchmarks: evaluation/external-benchmarks.md
- Agentic Inference:
- agentic_inference/parallel_thinking.md
20 changes: 20 additions & 0 deletions nemo_skills/dataset/speed-bench/__init__.py
@@ -0,0 +1,20 @@
# Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# settings that define how evaluation should be done by default (all can be changed from cmdline)
REQUIRES_DATA_DIR = True
METRICS_TYPE = "specdec"
EVAL_SPLIT = "qualitative"
GENERATION_ARGS = "++prompt_format=openai ++eval_type=specdec ++inference.include_response=true"
GENERATION_MODULE = "nemo_skills.inference.eval.specdec"