1 change: 1 addition & 0 deletions docs/evaluation/index.md
@@ -12,6 +12,7 @@ We support many popular benchmarks and it's easy to add new in the future. The f
- [**Multilingual**](./multilingual.md): e.g. [mmlu-prox](./multilingual.md#mmlu-prox), [flores-200](./multilingual.md#flores-200), [wmt24pp](./multilingual.md#wmt24pp)
- [**Speech & Audio**](./speech-audio.md): e.g. [asr-leaderboard](./speech-audio.md#asr-leaderboard), [mmau-pro](./speech-audio.md#mmau-pro)
- [**Vision-Language Models (VLM)**](./vlm.md): e.g. [mmmu-pro](./vlm.md#mmmu-pro)
- [**Speculative Decoding (SD)**](./speculative-decoding.md): e.g. [SPEED-Bench](./speculative-decoding.md#speed-bench)

See [nemo_skills/dataset](https://github.com/NVIDIA-NeMo/Skills/blob/main/nemo_skills/dataset) where each folder is a benchmark we support.

97 changes: 97 additions & 0 deletions docs/evaluation/speculative-decoding.md
@@ -0,0 +1,97 @@
# Speculative Decoding

This section details how to evaluate speculative decoding (SD) benchmarks.
SD has emerged as a leading technique for accelerating LLM inference. By allowing a smaller draft model to propose multiple future tokens that are verified in a single forward pass by a larger target model, SD can significantly increase system throughput.

In all SD benchmarks we measure two draft-quality metrics: acceptance length (AL) and acceptance rate (AR).
A third metric in this group is the conditional acceptance rate (also called per-position acceptance rate), which measures the acceptance rate at a given draft position conditioned on all previous tokens having been accepted.
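A minimal sketch of how these three metrics relate, using made-up per-step acceptance counts (the numbers and variable names are illustrative, not nemo_skills internals):

```python
# Illustrative sketch (made-up numbers) of AL, AR, and the conditional
# acceptance rate. Each entry records how many of the k drafted tokens
# a verification step accepted.
k = 3  # draft length: tokens proposed by the draft model per step

accepted_per_step = [3, 1, 2, 0, 3, 2]  # hypothetical per-step acceptances
steps = len(accepted_per_step)
total_accepted = sum(accepted_per_step)

# Acceptance length (AL): average tokens emitted per target forward pass
# (accepted draft tokens plus the one token the target model always emits).
al = total_accepted / steps + 1

# Acceptance rate (AR): fraction of all drafted tokens that were accepted.
ar = total_accepted / (steps * k)

# Conditional (per-position) acceptance rate: P(position p accepted | all
# earlier positions accepted). A step "reaches" position p only if it
# accepted at least p tokens before it.
cond = []
for p in range(k):
    reached = sum(1 for a in accepted_per_step if a >= p)
    accepted_here = sum(1 for a in accepted_per_step if a >= p + 1)
    cond.append(accepted_here / reached if reached else 0.0)

print(f"AL={al:.2f}  AR={ar:.2%}  conditional={[round(c, 2) for c in cond]}")
```

Note that the conditional rates typically decay with position, which is why increasing the draft length yields diminishing returns.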

For more advanced evaluation of SD, including throughput and per-category metrics, see the [Model Optimizer specdec_bench examples](https://github.com/NVIDIA/Model-Optimizer/tree/main/examples/specdec_bench).

## How we evaluate

!!! note
    The current evaluation supports only SGLang and vLLM servers.

The evaluation is executed by the following process:

1. Read the SD counters from the server's `/metrics` endpoint.
2. Send the benchmark's prompts to the server.
3. Read the `/metrics` endpoint again and compute the difference from step (1) to obtain the average SD metrics (AL, AR, etc.).
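The steps above can be sketched as follows. This is an illustrative sketch, not the actual nemo_skills implementation, and the counter names are placeholders (real names depend on the server):

```python
# Illustrative before/after diffing of `/metrics`. The counter names are
# placeholders; real names depend on the server (SGLang/vLLM).

def parse_counters(metrics_text: str) -> dict:
    """Parse Prometheus-style 'name value' lines into a dict of floats."""
    counters = {}
    for line in metrics_text.splitlines():
        if line.startswith("#") or not line.strip():
            continue  # skip comments and blanks
        name, _, value = line.rpartition(" ")
        try:
            counters[name] = float(value)
        except ValueError:
            pass  # skip malformed lines
    return counters

# In the real flow these would be two HTTP GETs against the server,
# one before and one after sending the benchmark prompts.
before = parse_counters(
    "spec_accepted_tokens_total 100\n"
    "spec_draft_tokens_total 200\n"
    "spec_verify_steps_total 50\n"
)
after = parse_counters(
    "spec_accepted_tokens_total 240\n"
    "spec_draft_tokens_total 410\n"
    "spec_verify_steps_total 120\n"
)

# Diffing isolates the benchmark's own activity from any prior requests.
delta = {name: after[name] - before[name] for name in after}
al = delta["spec_accepted_tokens_total"] / delta["spec_verify_steps_total"] + 1
ar = delta["spec_accepted_tokens_total"] / delta["spec_draft_tokens_total"]
print(f"AL={al:.2f}  AR={ar:.2%}")
```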

!!! note
    For the `local` executor and SGLang server, we also support a flow that writes a metrics file per request to a local path; the SD metrics are then calculated from this file. This gives per-request metrics, which can be useful in some cases. More information on this feature can be found in the [SGLang documentation](https://docs.sglang.io/advanced_features/server_arguments.html#requestmetricsexporter-configuration).
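A hedged sketch of consuming such a per-request file, assuming a JSONL layout with hypothetical field names (the real RequestMetricsExporter schema may differ):

```python
# Hypothetical sketch of aggregating a per-request metrics file. The JSONL
# field names below are illustrative assumptions, not the actual SGLang
# RequestMetricsExporter schema.
import json

raw_lines = [  # two lines as they might appear in the exported file
    '{"request_id": "r1", "spec_accepted_tokens": 40, "spec_verify_steps": 20}',
    '{"request_id": "r2", "spec_accepted_tokens": 90, "spec_verify_steps": 40}',
]
records = [json.loads(line) for line in raw_lines]

# Per-request acceptance length: unlike the aggregate `/metrics` flow,
# this exposes which individual prompts speculate well or badly.
per_request_al = {
    r["request_id"]: r["spec_accepted_tokens"] / r["spec_verify_steps"] + 1
    for r in records
}
print(per_request_al)
```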


## Supported Benchmarks

### SPEED-Bench

- Benchmark is defined in [`nemo_skills/dataset/speed-bench/__init__.py`](https://github.com/NVIDIA-NeMo/Skills/blob/main/nemo_skills/dataset/speed-bench/__init__.py)
- The original benchmark source is the [SPEED-Bench dataset on Hugging Face](https://huggingface.co/datasets/nvidia/SPEED-Bench).

#### License

GOVERNING TERMS: This dataset is governed by the NVIDIA Evaluation Dataset License Agreement.

ADDITIONAL INFORMATION: MIT for bigcode/humanevalpack, RUCAIBox/MMATH, RUCAIBox/BAMBOO and EQ-Bench. Apache 2.0 for Writing Bench and Spec-Bench. CC BY 4.0 for FBK-MT/MCIF. MIT and Apache 2.0 for tianyang/repobench_python_v1.1, JetBrains-Research/lca-project-level-code-completion and tianyang/repobench_java_v1.1.

NOTICE: For each dataset a user elects to use, the user is responsible for checking if the dataset license is fit for the intended purpose. The `prepare_data.py` script automatically fetches data from all the source datasets.

Additional details are in the [Hugging Face dataset repository](https://huggingface.co/datasets/nvidia/SPEED-Bench).

#### Data preparation

See an example data preparation command in the [main evaluation docs](../evaluation/index.md#using-data-on-cluster).

```shell
ns prepare_data speed-bench --data_dir=<output directory for data files> --cluster=<cluster config>
```

Other supported options:

* **config**: select which config to prepare; can be one of the splits in the dataset (e.g., `qualitative`, `throughput_2k`) or `all` to prepare all of the configs.


#### Evaluation command

An example of running Llama 3.3 70B with an external draft model (Llama 3.2 1B) using SGLang and a draft length of 3:

```bash
ns eval \
--cluster=<cluster config> \
--data_dir=<must match prepare_data parameter> \
--output_dir=<any mounted output location> \
--benchmarks=speed-bench \
--model=meta-llama/Llama-3.3-70B-Instruct \
--server_args="--speculative-algorithm STANDALONE --speculative-draft-model-path meta-llama/Llama-3.2-1B-Instruct --speculative-num-steps 3 --speculative-eagle-topk 1 --torch-compile-max-bs 32 --max-running-requests 32 --cuda-graph-max-bs 32 --mem-fraction-static 0.8" \
--server_nodes=1 \
--server_gpus=8 \
--server_type=sglang \
++inference.tokens_to_generate=1024
```

Example evaluation metrics:

```text
--------------------------------------------- speed-bench ----------------------------------------------
evaluation_mode | num_entries | avg_tokens | gen_seconds | spec_acceptance_length | spec_acceptance_rate
pass@1 | 880 | 464 | 139 | 2.78 | 69.38
```

An example of running Llama 3.3 70B with EAGLE3 using vLLM and a draft length of 3:

```bash
ns eval \
--cluster=<cluster config> \
--data_dir=<must match prepare_data parameter> \
--output_dir=<any mounted output location> \
--benchmarks=speed-bench \
--model=meta-llama/Llama-3.3-70B-Instruct \
--server_args="--speculative-config '{\"method\": \"eagle3\", \"num_speculative_tokens\": 3, \"model\": \"nvidia/Llama-3.3-70B-Instruct-Eagle3\"}'" \
--server_nodes=1 \
--server_gpus=8 \
--server_type=vllm \
++inference.tokens_to_generate=1024
```
1 change: 1 addition & 0 deletions mkdocs.yml
@@ -87,6 +87,7 @@ nav:
- evaluation/vlm.md
- evaluation/other-benchmarks.md
- evaluation/robustness.md
- evaluation/speculative-decoding.md
- External benchmarks: evaluation/external-benchmarks.md
- Agentic Inference:
- agentic_inference/parallel_thinking.md
20 changes: 20 additions & 0 deletions nemo_skills/dataset/speed-bench/__init__.py
@@ -0,0 +1,20 @@
# Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# settings that define how evaluation should be done by default (all can be changed from cmdline)
REQUIRES_DATA_DIR = True
METRICS_TYPE = "specdec"
EVAL_SPLIT = "qualitative"
GENERATION_ARGS = "++prompt_format=openai ++eval_type=specdec ++inference.include_response=true"
GENERATION_MODULE = "nemo_skills.inference.eval.specdec"