From 901ff797afb9ae5785b0794365e24a54642052c6 Mon Sep 17 00:00:00 2001 From: wili-65535 Date: Tue, 8 Apr 2025 16:52:40 +0800 Subject: [PATCH 1/7] doc/Draft-Target-Model: v1.0 Signed-off-by: wili-65535 --- docs/source/advanced/speculative-decoding.md | 248 +---------------- examples/draft_target_model/README.md | 268 ++++++++++++++++--- examples/prompt_lookup/README.md | 4 - 3 files changed, 233 insertions(+), 287 deletions(-) diff --git a/docs/source/advanced/speculative-decoding.md b/docs/source/advanced/speculative-decoding.md index 9452c581c4d..2c75e0d5ee2 100644 --- a/docs/source/advanced/speculative-decoding.md +++ b/docs/source/advanced/speculative-decoding.md @@ -3,7 +3,6 @@ - [About Speculative Sampling](#about-speculative-sampling) - [Performance Improvements](#Performance-improvements) - [Draft-Target-Model](#Draft-Target-Model) - - [Using Draft model approach with Triton Inference Server](#Using-Draft-model-approach-with-Triton-Inference-Server) - [Prompt-Lookup-Decoding](#prompt-lookup-decoding) - [Medusa](#medusa) - [Medusa Tree](#medusa-tree) @@ -40,7 +39,6 @@ TensorRT-LLM supports several approaches for generating draft tokens, including: 4. Utilizing Jacobi-like decoding to predict and verify draft tokens using the same model which does not need additional fine-tuning. Refer to [Break the Sequential Dependency of LLM Inference Using Lookahead Decoding](https://arxiv.org/pdf/2402.02057). - ## Performance Improvements It's important to note that the effectiveness of speculative decoding techniques is highly dependent @@ -52,9 +50,7 @@ tuned as TensorRT-LLM, the potential time savings are more pronounced. ## Draft-Target-Model -The Draft-Target-Model involves the use of two distinct models trained independently but sharing the same vocabulary: a smaller Draft model and a larger Target model. For example, GPT 125M / 6.7B models can serve as the Draft / Target model. - -There are two styles of using Draft-Target-Model in TensorRT-LLM now. The first one is using TensorRT-LLM-BLS in Triton, which more information and detailed steps can be found in this document. The second one is using it directly in TensorRT-LLM, which steps can be found in [examples/draft_target_model/README.md](https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/draft_target_model/README.md) and the code can be found in [examples/prompt_lookup/run_dtm_pld.py](https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/prompt_lookup/run_dtm_pld.py). +The Draft-Target-Model involves the use of two distinct models (a smaller Draft model and a larger Target model) trained independently but sharing the same vocabulary. For example, GPT 125M / 6.7B models serves as the Draft / Target model. The management of Draft and Target models is facilitated through two separate `Executor` instances. It is essential that you to coordinate the interactions between the Draft and Target models effectively. @@ -65,248 +61,12 @@ Subsequently, the prompt, now updated with the accepted tokens, is sent back to This iterative process continues until a predefined stop conditions are met. An example of this orchestration process can be found in the [TensorRT-LLM Triton backend](https://github.com/triton-inference-server/tensorrtllm_backend/blob/main/inflight_batcher_llm/client/e2e_grpc_speculative_decoding_client.py). -Configuring and executing the Draft model within the Inflight Fused Batching (IFB) framework -follows the same procedure as for any other model within IFB. 
-The `maxNewTokens` parameter should be set to the number of draft tokens in the `LlmRequest` for the Draft model query. - -When building the Target model, it is necessary to specify the `--max_draft_len --speculative_decoding_mode draft_tokens_external` option to the `trtllm-build` command. -During the Target model's inference phase in IFB, `maxNewTokens` should be set to `1`, -and the draft tokens must be set in the `draftTokens` field of the `LlmRequest` for the Target model query. - -**NOTE:** To enhance performance, especially due to the repetitive querying of Draft and Target models with requests that share a common prefix, -it is advisable to enable KV cache reuse for both models. -This can be achieved by adding the `--use_paged_context_fmha=enable` flag to the `trtllm-build` command -and setting `enableBlockReuse=true` in the `KVCacheConfig`. - -### Using Draft-Target-Model approach with Triton Inference Server - -This example is only relevant for Draft-Target-Model model method. For all other speculative decoding models, you can deploy them in Triton server in the same way as standard non-speculative autoregressive models. - -+ Draft model approach is supported since TensorRT-LLM-0.7.0 (using two separate Tritonserver to maintain draft and target model respectively), but has significant optimization in TensorRT-LLM-0.10.0 (using one Tritonserver with [Business Logic Scripting](https://github.com/triton-inference-server/python_backend?tab=readme-ov-file#business-logic-scripting), BLS). -+ The source file of Draft model with BLS can be found [here](https://github.com/triton-inference-server/tensorrtllm_backend/blob/main/all_models/inflight_batcher_llm/tensorrt_llm_bls/1/lib/decode.py). -+ This example is based on TensorRT-LLM-0.10.0 and TRTLLM-backend-0.10.0, using docker image `nvcr.io/nvidia/tritonserver:24.05-trtllm-py3`. -+ Llama-7B-hf and Llama-30B-hf are used as draft and target model respectively in this example, assuming the paths to the models' repository are `DRAFT_MODEL_PATH` and `TARGET_MODEL_PATH`. -+ Maximum number of draft tokens is set to 10 in this example. - -1. Prepare TensorRT engine for inference - + Here are the commands to build draft / target engines in FP16 or FP8. All combinations of the data type (Draft-FP16/FP8 + Target-FP16/FP8) are supported. - + `--remove_input_padding=enable --paged_kv_cache=enable` are necessary for inflight-batching. - + `--context_fmha=enable --use_paged_context_fmha=enable` are optional, but recommended for the performance. - + `--gather_generation_logits` is necessary if using generation logits for selecting tokens in target model. - + `--tp_size` can be modified set if using TP mode for draft / target model. - + `--max_batch_size` more than 1 is acceptable in general usage, but we use 1 in this example. 
- - ```bash - export MAX_DRAFT_LENGTH=10 - export COMMON_COMMAND="--max_batch_size=1 --max_input_len=2048 --max_seq_len=3072 --gpt_attention_plugin=float16 --gemm_plugin=float16 --remove_input_padding=enable --paged_kv_cache=enable --context_fmha=enable --use_paged_context_fmha=enable --gather_generation_logits" - export DRAFT_COMMAND_FP16="$COMMON_COMMAND" - export TARGET_COMMAND_FP16="$DRAFT_COMMAND_FP16 --max_draft_len=$MAX_DRAFT_LENGTH --speculative_decoding_mode draft_tokens_external" - export DRAFT_COMMAND_FP8="$COMMON_COMMAND --use_fp8_context_fmha=enable" - export TARGET_COMMAND_FP8="$DRAFT_COMMAND_FP8 --max_draft_len=$MAX_DRAFT_LENGTH --speculative_decoding_mode draft_tokens_external" - - # Build checkpoints and engines in tensorrt_llm/examples/llama/ - # FP16 mode - export DRAFT_NAME=llama-7b-fp16-tp1 - export TARGET_NAME=llama-30b-fp16-tp1 - python3 convert_checkpoint.py --model_dir=$DRAFT_MODEL_PATH --output_dir=ckpt/$DRAFT_NAME --tp_size=1 - python3 convert_checkpoint.py --model_dir=$TARGET_MODEL_PATH --output_dir=ckpt/$TARGET_NAME --tp_size=1 - trtllm-build --checkpoint_dir=ckpt/$DRAFT_NAME --output_dir=engine/draft/$DRAFT_NAME $DRAFT_COMMAND_FP16 - trtllm-build --checkpoint_dir=ckpt/$TARGET_NAME --output_dir=engine/target/$TARGET_NAME $TARGET_COMMAND_FP16 - export DRAFT_ENGINE_PATH=$(pwd)/engine/draft/$DRAFT_NAME - export TARGET_ENGINE_PATH=$(pwd)/engine/target/$TARGET_NAME - - # FP8 mode - export DRAFT_NAME=llama-7b-fp8-tp1 - export TARGET_NAME=llama-30b-fp8-tp1 - python3 ../quantization/quantize.py --model_dir=$DRAFT_MODEL_PATH --dtype float16 --qformat fp8 --kv_cache_dtype fp8 --output_dir=ckpt/$DRAFT_NAME --tp_size=1 - python3 ../quantization/quantize.py --model_dir=$TARGET_MODEL_PATH --dtype float16 --qformat fp8 --kv_cache_dtype fp8 --output_dir=ckpt/$TARGET_NAME --tp_size=1 - trtllm-build --checkpoint_dir=ckpt/$DRAFT_NAME --output_dir=engine/draft/$DRAFT_NAME $DRAFT_COMMAND_FP8 - trtllm-build --checkpoint_dir=ckpt/$TARGET_NAME --output_dir=engine/target/$TARGET_NAME $TARGET_COMMAND_FP8 - export DRAFT_ENGINE_PATH=$(pwd)/engine/draft/$DRAFT_NAME - export TARGET_ENGINE_PATH=$(pwd)/engine/target/$TARGET_NAME - ``` - -2. Edit Triton configuration - + If both draft and target model can be placed in one GPU (for example, llama-7B-FP8 + llama-30B-FP8, totally 40GiB in one H100-80GiB GPU), `DRAFT_GPU_DEVICE_IDS` and `TARGET_GPU_DEVICE_IDS` can be the same, `0` as example. It appears better performance than placing on two separate GPUs. - + Elsewise, the draft and target models can be placed in different GPUs, `DRAFT_GPU_DEVICE_IDS="0"` and `TARGET_GPU_DEVICE_IDS="1"` as example. - + Furthermore, if TP mode is used, the value of `GPU_DEVICE_IDS` can be a list, `DRAFT_GPU_DEVICE_IDS="0"` and `TARGET_GPU_DEVICE_IDS="1,2,3,4"` as example. - + For more configuration of launching models with Tritonserver, please visit [TensorRT-LLM Backend repo](https://github.com/triton-inference-server/tensorrtllm_backend/blob/main/README.md). 
- - ```bash - ACCUMULATE_TOKEN="false" - BACKEND="tensorrtllm" - BATCH_SCHEDULER_POLICY="guaranteed_no_evict" - BATCHING_STRATEGY="inflight_fused_batching" - BLS_INSTANCE_COUNT="1" - DECODING_MODE="top_k_top_p" - DECOUPLED_MODE="False" - DRAFT_GPU_DEVICE_IDS="0" - E2E_MODEL_NAME="ensemble" - ENABLE_KV_CACHE_REUSE="true" - ENGINE_PATH=$TARGET_ENGINE_PATH - EXCLUDE_INPUT_IN_OUTPUT="false" - KV_CACHE_FREE_GPU_MEM_FRACTION="0.8" - MAX_ATTENTION_WINDOW_SIZE="" - MAX_BEAM_WIDTH="1" - MAX_QUEUE_DELAY_MICROSECONDS="0" - MAX_TOKENS_IN_KV_CACHE="" - NORMALIZE_LOG_PROBS="true" - POSTPROCESSING_INSTANCE_COUNT="1" - PREPROCESSING_INSTANCE_COUNT="1" - TARGET_GPU_DEVICE_IDS="1" - TENSORRT_LLM_DRAFT_MODEL_NAME="tensorrt_llm_draft" - TENSORRT_LLM_MODEL_NAME="tensorrt_llm" - TOKENIZER_PATH=$DRAFT_MODEL_PATH - TOKENIZER_TYPE=llama - TRITON_GRPC_PORT="8001" - TRITON_HTTP_PORT="8000" - TRITON_MAX_BATCH_SIZE="4" - TRITON_METRICS_PORT="8002" - TRITON_REPO="triton_repo" - USE_DRAFT_LOGITS="false" - LOGITS_DATATYPE="TYPE_FP32" # Replace by TYPE_FP16 for FP8 model - - # Make a copy of triton repo and replace the fields in the configuration files - cd /tensorrtllm_backend/ - apt-get update && apt-get install -y build-essential cmake git-lfs - pip3 install git-lfs tritonclient grpcio - rm -rf ${TRITON_REPO} - cp -R all_models/inflight_batcher_llm ${TRITON_REPO} - python3 tools/fill_template.py -i ${TRITON_REPO}/ensemble/config.pbtxt triton_max_batch_size:${TRITON_MAX_BATCH_SIZE},logits_datatype:${LOGITS_DATATYPE} - python3 tools/fill_template.py -i ${TRITON_REPO}/preprocessing/config.pbtxt tokenizer_dir:${TOKENIZER_PATH},triton_max_batch_size:${TRITON_MAX_BATCH_SIZE},preprocessing_instance_count:${PREPROCESSING_INSTANCE_COUNT} - python3 tools/fill_template.py -i ${TRITON_REPO}/postprocessing/config.pbtxt tokenizer_dir:${TOKENIZER_PATH},triton_max_batch_size:${TRITON_MAX_BATCH_SIZE},postprocessing_instance_count:${POSTPROCESSING_INSTANCE_COUNT},logits_datatype:${LOGITS_DATATYPE} - python3 tools/fill_template.py -i ${TRITON_REPO}/tensorrt_llm_bls/config.pbtxt triton_max_batch_size:${TRITON_MAX_BATCH_SIZE},decoupled_mode:${DECOUPLED_MODE},accumulate_tokens:${ACCUMULATE_TOKEN},bls_instance_count:${BLS_INSTANCE_COUNT},tensorrt_llm_model_name:${TENSORRT_LLM_MODEL_NAME},tensorrt_llm_draft_model_name:${TENSORRT_LLM_DRAFT_MODEL_NAME},logits_datatype:${LOGITS_DATATYPE} - - # Make a copy of tensorrt_llm as configurations of draft / target models. 
- cp -R ${TRITON_REPO}/tensorrt_llm ${TRITON_REPO}/tensorrt_llm_draft - sed -i 's/name: "tensorrt_llm"/name: "tensorrt_llm_draft"/g' ${TRITON_REPO}/tensorrt_llm_draft/config.pbtxt - python3 tools/fill_template.py -i ${TRITON_REPO}/tensorrt_llm/config.pbtxt triton_backend:${BACKEND},engine_dir:${ENGINE_PATH},decoupled_mode:${DECOUPLED_MODE},max_tokens_in_paged_kv_cache:${MAX_TOKENS_IN_KV_CACHE},max_attention_window_size:${MAX_ATTENTION_WINDOW_SIZE},batch_scheduler_policy:${BATCH_SCHEDULER_POLICY},batching_strategy:${BATCHING_STRATEGY},kv_cache_free_gpu_mem_fraction:${KV_CACHE_FREE_GPU_MEM_FRACTION},exclude_input_in_output:${EXCLUDE_INPUT_IN_OUTPUT},triton_max_batch_size:${TRITON_MAX_BATCH_SIZE},max_queue_delay_microseconds:${MAX_QUEUE_DELAY_MICROSECONDS},max_beam_width:${MAX_BEAM_WIDTH},enable_kv_cache_reuse:${ENABLE_KV_CACHE_REUSE},normalize_log_probs:${NORMALIZE_LOG_PROBS},enable_chunked_context:${ENABLE_CHUNKED_CONTEXT},gpu_device_ids:${TARGET_GPU_DEVICE_IDS},decoding_mode:${DECODING_MODE},encoder_input_features_data_type:TYPE_FP16,logits_datatype:${LOGITS_DATATYPE} - python3 tools/fill_template.py -i ${TRITON_REPO}/tensorrt_llm_draft/config.pbtxt triton_backend:${BACKEND},engine_dir:${DRAFT_ENGINE_PATH},decoupled_mode:${DECOUPLED_MODE},max_tokens_in_paged_kv_cache:${MAX_TOKENS_IN_KV_CACHE},max_attention_window_size:${MAX_ATTENTION_WINDOW_SIZE},batch_scheduler_policy:${BATCH_SCHEDULER_POLICY},batching_strategy:${BATCHING_STRATEGY},kv_cache_free_gpu_mem_fraction:${KV_CACHE_FREE_GPU_MEM_FRACTION},exclude_input_in_output:${EXCLUDE_INPUT_IN_OUTPUT},triton_max_batch_size:${TRITON_MAX_BATCH_SIZE},max_queue_delay_microseconds:${MAX_QUEUE_DELAY_MICROSECONDS},max_beam_width:${MAX_BEAM_WIDTH},enable_kv_cache_reuse:${ENABLE_KV_CACHE_REUSE},normalize_log_probs:${NORMALIZE_LOG_PROBS},enable_chunked_context:${ENABLE_CHUNKED_CONTEXT},gpu_device_ids:${DRAFT_GPU_DEVICE_IDS},decoding_mode:${DECODING_MODE},encoder_input_features_data_type:TYPE_FP16,logits_datatype:${LOGITS_DATATYPE} - ``` - -3. Launch Triton server - + `--multi-model` is necessary if TP mode is used for target model. - - ```bash - python3 scripts/launch_triton_server.py \ - --model_repo=${TRITON_REPO} \ - --tensorrt_llm_model_name "tensorrt_llm,tensorrt_llm_draft" \ - --multi-model \ - --log & - ``` - - + Verbose log will be written in to file `triton_log.txt`. Triton server launches successfully if you see the output below in the file: - - ```txt - Started HTTPService at 0.0.0.0:8000 - Started GRPCInferenceService at 0.0.0.0:8001 - Started Metrics Service at 0.0.0.0:8002 - ``` - -4. Send Requests - + Prepare a JSON file `input_data.json` containing input data as below (more requests are acceptable). - - ```json - [ - { - "input": "James Best, best known for his ", - "instruction": "Continue writing the following story:", - "output": " " - } - ] - ``` - - + Use command below to launch requests for inference. - + `--num-draft-tokens` can be modified by runtime draft lengths, 4 is used in this example. - - ```bash - python3 tools/inflight_batcher_llm/speculative_decoding_test.py \ - --max-input-len 2048 \ - --dataset=input_data.json \ - --url-target=localhost:8001 \ - --url-draft=localhost:8001 \ - --draft-tensorrt-llm-model-name="${TENSORRT_LLM_DRAFT_MODEL_NAME}" \ - --target-tensorrt-llm-model-name="${TENSORRT_LLM_MODEL_NAME}" \ - --bls-speculative-tensorrt-llm-model-name="tensorrt_llm_bls" \ - --execute-bls-speculative-decoding \ - --disable-output-comparison \ - --num-draft-tokens=4 \ - --verbose - ``` - -5. 
Enable fast logits D2D transfer when `"use_draft_logits": True` - + Obtaining adjusted logits distribution from draft logits is a proposed method in the [Fast Inference from Transformers via Speculative Decoding paper](https://arxiv.org/pdf/2211.17192.pdf). Fast logits feature boosts the performance (TPS) by hiding the latency of logits transfer from draft engine to target engine. - + Fast logits feature is newly supported in TensorRT-LLM-0.15.0. - + Modify `participant_ids` entry in `tensorrt_llm/config.pbtxt` and `tensorrt_llm_draft/config.pbtxt` to suitable MPI ranks. Usually in this setting, rank 0 is reserved for the orchestrator rank; rank 1 is for draft engine; the rest of the ranks are for target engine. In this example, `particpant_ids` can be set as snippet below. Same logic also applies to TP>1 target engine. - ``` - ### In tensorrt_llm_draft/config.pbtxt - parameters: { - key: "gpu_device_ids" - value: { - string_value: "0" - } - } - parameters: { - key: "participant_ids" - value: { - string_value: "1" - } - } - ### In tensorrt_llm/config.pbtxt - parameters: { - key: "gpu_device_ids" - value: { - string_value: "1" - } - } - parameters: { - key: "participant_ids" - value: { - string_value: "2" - } - } - ``` - + Enable `speculative_decoding_fast_logits` in both `tensorrt_llm/config.pbtxt` and `tensorrt_llm_draft/config.pbtxt`. - ``` - parameters: { - key: "speculative_decoding_fast_logits" - value: { - string_value: "1" - } - } - ``` - + Fast logits feature requires Tritonserver to be launched in orchestrator mode with `--disable-spawn-process`. See [model config](https://github.com/triton-inference-server/tensorrtllm_backend/blob/main/docs/model_config.md) for more information. `--world_size` has to be set as 1 (orchestrator rank 0) + 1 (draft engine ranks) + 1 (target engine ranks). - ```bash - python3 scripts/launch_triton_server.py \ - --model_repo=$TRITON_REPO \ - --tensorrt_llm_model_name "tensorrt_llm,tensorrt_llm_draft" \ - --multi-model \ - --disable-spawn-processes \ - --world_size=3 --log & - ``` - + Send request with `use_draft_logits` to tritonserver BLS API: - ``` - curl -X POST "http://localhost:8000/v2/models/tensorrt_llm_bls/generate" \ - -H "Content-Type: application/json" \ - -d '{ - "text_input": "Continue writing the following story: James Best, best known for his", - "max_tokens": 128, - "num_draft_tokens": 10, - "use_draft_logits": true, - "stream": false - }' - ``` - + With the fast logits enabled and following optimization tips in [model configuration](https://github.com/triton-inference-server/tensorrtllm_backend/blob/main/docs/model_config.md#some-tips-for-model-configuration), speculative decoding with draft logits achieves 2.x throughput in BS1, 1.x throughput in BS16 comparing to auto-regressive decoding using Llama 3.2 1B draft and Llama 3.1 70B target. - -6. Kill Tritonserver after finishing inference - - ```bash - pkill -9 -f trtllmExecutorWorker - pkill -9 -f tritonserver - ``` +We provide two styles of runnig Draft-Target-Model now: using TensorRT-LLM-BLS in Triton Inference Server, or using TensorRT-LLM directly. Detailed steps of running can be found in [examples/draft_target_model/README.md](https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/draft_target_model/README.md) and the code can be found in [examples/prompt_lookup/run_dtm_pld.py](https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/prompt_lookup/run_dtm_pld.py). 
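+
+The loop below is a minimal Python sketch of this orchestration. The callables `generate_draft` and `verify_with_target` are placeholders standing in for the two `Executor` instances, not real TensorRT-LLM APIs; the real flow lives in the example scripts linked above.
+
+```python
+# Illustrative Draft-Target-Model loop; the two callables are hypothetical helpers.
+def draft_target_generate(prompt_ids, generate_draft, verify_with_target,
+                          draft_len=4, max_new_tokens=256, end_id=2):
+    output = list(prompt_ids)
+    produced = 0
+    while produced < max_new_tokens:
+        # Draft model proposes up to `draft_len` tokens after the current sequence.
+        draft_tokens = generate_draft(output, draft_len)
+        # Target model verifies the proposals in one pass and returns the accepted
+        # prefix plus one token of its own, so at least one token is always produced.
+        accepted = verify_with_target(output, draft_tokens)
+        output.extend(accepted)
+        produced += len(accepted)
+        if end_id in accepted:  # predefined stop condition
+            break
+    return output
+```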
## Prompt-Lookup-Decoding +The Prompt-Lookup speculative decoding directly copies from the input prompt and previous generated output as draft tokens while generating the later output. It works like Draft-Target-Model but involves only one Target LLM model without further fine-tuning. The Prompt-Lookup profit from the scenarios which have high n-gram overlap between input prompt and output, such as summarization, document QA, multi-turn chat, code editing, etc. + See document in [examples/prompt_lookup/README.md](https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/prompt_lookup/README.md) and the code can be found in [examples/prompt_lookup/run_dtm_pld.py](https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/prompt_lookup/run_dtm_pld.py). ## Medusa diff --git a/examples/draft_target_model/README.md b/examples/draft_target_model/README.md index 3a1afcf0af4..67fd3db3c42 100644 --- a/examples/draft_target_model/README.md +++ b/examples/draft_target_model/README.md @@ -1,18 +1,10 @@ -# Draft-Target-Model Speculative Decoding +# Draft-Target-Model Speculative Decoding (DTM) -This document shows how to build and run a model using Draft-Target-Model speculative decoding (also known as `Speculative-Sampling`, [`Paper`](https://arxiv.org/abs/2302.01318)) in TensorRT-LLM on single GPU, or single node multiple GPU. +This document shows how to build and run a model using DTM speculative decoding (also known as `Speculative-Sampling`, [`Paper`](https://arxiv.org/abs/2302.01318)) in TensorRT-LLM on single GPU, or single node multiple GPU. ## Overview -The Draft-Target-Model involves the use of two distinct LLM models trained independently but sharing the same vocabulary: a smaller Draft model and a larger Target model. For example, GPT 125M / 6.7B models can serve as the Draft / Target model. - -There are two styles of using Draft-Target-Model in TensorRT-LLM. The first one is using TensorRT-LLM-BLS in Triton, which more information and detailed steps can be found in [speculative decoding documentation](https://nvidia.github.io/TensorRT-LLM/advanced/speculative-decoding.html). The second one is using it directly in TensorRT-LLM, which steps can be found in this document and the code can be found in [examples/run.py](../run.py). - -The Draft-Target-Model has 4 additional hyperparameters that you need to specify to control the process of generation: -- `draft_len`: the number of tokens the draft model generated in one iteration, which the range is from 4 to 10 in common usage. Empirically, the larger the value is, the higher acceptance ratio but higher overhead is expected at the same time, so the right balance based on the models and application scenarios needs to be found. -- `draft_model_device_list`: the index list of device(s) to run the draft model. The length of it must be the same as the TP size of the draft model engine. For instances, `draft_model_device_list=[1]` means using tp_size=1 and GPU 1 for draft model, `draft_model_device_list=[4,5,6,7]` means using tp=4 and GPU from 4 to 7 for draft model. -- `target_model_device_list`: the index list of device(s) to run the target model. The length of it must be the same as the TP size of the target model engine. For instances, `draft_model_device_list=[0]` means using tp_size=1 and GPU 0 for target model, `draft_model_device_list=[2,3]` means using tp=2 and GPU from 2 to 3 for target model. -- `use_logits`: there are two methods to accept tokens proposed by draft model. 
When `use_logits=True`, the draft tokens are accepted based on the ratio of the logits from draft and target model (modified rejection sampling method in the original paper); When `use_logits=False`, the draft tokens are accepted based on per-token comparison with target predictions regardless of the logits. +We provide two styles of runnig DTM now: using TensorRT-LLM-BLS in Triton Inference Server, or using TensorRT-LLM directly. Here we introduce the detailed steps of running DTM in both workflows. ## Support Matrix * GPU Compute Capability >= 8.0 (Ampere or newer) @@ -22,51 +14,67 @@ The Draft-Target-Model has 4 additional hyperparameters that you need to specify ## Usage -### Build draft and target engines +### Build draft and target engines (the same in two workflows) -+ We use a open-source `llama-v2-7B/13B` models as both draft and target model in this example. -+ `--use_paged_context_fmha=enable` must be specified since we need KVcache reuse in this approach. ++ We use open-source `llama-7B/13B` as draft and target models in this example, assuming the paths to the models' repository are `DRAFT_MODEL_PATH` and `TARGET_MODEL_PATH`. ++ `--use_paged_context_fmha=enable` must be specified since we need KV-Cache reuse in this approach. + `--speculative_decoding_mode=draft_tokens_external` and `--max_draft_len` must be specified for target model. ++ `--use_paged_context_fmha=enable` are optional, but recommended for the performance. ++ `--gather_generation_logits` is necessary if using generation logits for selecting tokens in target model. ++ `--tp_size` can be modified set if using TP mode for draft / target model. ++ `--max_batch_size` more than 1 is acceptable in general usage, but we use 1 in this example. ```bash cd examples/llama +export DRAFT_CKPT_PATH=/workspace/ckpt-draft +export TARGET_CKPT_PATH=/workspace/ckpt-target +export DRAFT_ENGINE_PATH=/workspace/engine-draft +export TARGET_ENGINE_PATH=/workspace/engine-target +export MAX_BATCH_SIZE=4 +export MAX_DRAFT_LEN=10 +export MAX_INPUT_LEN=3200 +export MAX_SEQ_LEN=4800 python3 convert_checkpoint.py \ - --model_dir= \ - --output_dir=./ckpt-draft \ + --model_dir=${DRAFT_MODEL_PATH} \ + --output_dir=${DRAFT_CKPT_PATH} \ --dtype=float16 python3 convert_checkpoint.py \ - --model_dir= \ - --output_dir=./ckpt-target \ + --model_dir=${TARGET_MODEL_PATH} \ + --output_dir=${TARGET_CKPT_PATH} \ --dtype=float16 trtllm-build \ - --checkpoint_dir ./ckpt-draft \ - --output_dir=./draft-engine \ + --checkpoint_dir=${DRAFT_CKPT_PATH} \ + --output_dir=${DRAFT_ENGINE_PATH} \ --gemm_plugin=float16 \ --use_paged_context_fmha=enable \ - --max_batch_size=4 \ - --max_input_len=3200 \ - --max_seq_len=4800 + --max_batch_size=${MAX_BATCH_SIZE} \ + --max_input_len=${MAX_INPUT_LEN} \ + --max_seq_len=${MAX_SEQ_LEN} trtllm-build \ - --checkpoint_dir=./ckpt-target \ - --output_dir=./target-engine \ + --checkpoint_dir=${TARGET_CKPT_PATH} \ + --output_dir=${TARGET_ENGINE_PATH} \ --gemm_plugin=float16 \ --use_paged_context_fmha=enable \ --speculative_decoding_mode=draft_tokens_external \ - --max_draft_len=10 \ - --max_batch_size=4 \ - --max_input_len=3200 \ - --max_seq_len=4800 + --max_batch_size=${MAX_BATCH_SIZE} \ + --max_draft_len=${MAX_DRAFT_LEN} \ + --max_input_len=${MAX_INPUT_LEN} \ + --max_seq_len=${MAX_SEQ_LEN} ``` -### Run decoding +### TensorRT-LLM workflow + `--draft_engine_dir` and `--engine_dir` must be specified for the draft and target engines respectively. 
-+ `--draft_target_model_config` is corresponding configuration of Draft-Target-Model, we can see its usage in [util.py](../util.py).
-  + As an example, `[4,[0],[1],False]` means `draft_len=4`, device of draft model is `GPU0`, device of target model is `GPU1`, and use tokens rather than logits to accept.
++ `--draft_target_model_config` is the corresponding configuration of DTM, which bundles 4 hyperparameters that you need to specify to control the generation process:
+  - `draft_len`: the number of tokens the draft model generates in one iteration, commonly in the range from 4 to 10. Empirically, a larger value yields a higher acceptance ratio but also higher overhead, so the right balance needs to be found for the models and application scenario.
+  - `draft_model_device_list`: the index list of device(s) to run the draft model. Its length must be the same as the TP size of the draft model engine. For instance, `draft_model_device_list=[1]` means using tp_size=1 and GPU 1 for the draft model, while `draft_model_device_list=[4,5,6,7]` means using tp_size=4 and GPUs 4 to 7 for the draft model.
+  - `target_model_device_list`: the index list of device(s) to run the target model. Its length must be the same as the TP size of the target model engine. For instance, `target_model_device_list=[0]` means using tp_size=1 and GPU 0 for the target model, while `target_model_device_list=[2,3]` means using tp_size=2 and GPUs 2 to 3 for the target model.
+  - `use_logits`: there are two methods to accept tokens proposed by the draft model. When `use_logits=True`, the draft tokens are accepted based on the ratio of the logits from the draft and target models (the modified rejection sampling method in the original paper); when `use_logits=False`, the draft tokens are accepted based on per-token comparison with the target predictions, regardless of the logits.
+  - As an example, `[4,[0],[1],False]` means `draft_len=4`, the draft model runs on `GPU0`, the target model runs on `GPU1`, and tokens rather than logits are used for acceptance.
 + `--kv_cache_enable_block_reuse` must be specified for this approach.
 + Only CPP session is supported, so `--use_py_session` must not be specified.
 + `--kv_cache_free_gpu_memory_fraction` should be specified if we want to place two models on one GPU, or one of the models would use out of the GPU memory.
@@ -74,16 +82,198 @@ trtllm-build \
 + `--output_generation_logits` is optional. In original paper, we accept the tokens by comparing logits of draft and target models, so this parameter is needed. But for simplification, we can accept the tokens by comparing the output token directly, in this occasion, we can skip this parameter.
 
 ```bash
-cd examples/llama
-
-python3 ../run.py \
-    --tokenizer_dir \
-    --draft_engine_dir ./draft-engine \
-    --engine_dir ./target-engine \
+python3 examples/run.py \
+    --tokenizer_dir=${TARGET_MODEL_PATH} \
+    --draft_engine_dir=/workspace/engine-draft \
+    --engine_dir=/workspace/engine-target \
     --draft_target_model_config="[4,[0],[1],True]" \
     --max_output_len=256 \
     --kv_cache_enable_block_reuse \
-    --kv_cache_free_gpu_memory_fraction=0.4 \
+    --kv_cache_free_gpu_memory_fraction=0.5 \
     --output_generation_logits \
     --input_text="How does Draft-Sampling work?"
 ```
+
+### Triton Inference Server workflow
+
++ This example is based on TensorRT-LLM-0.18.0 and TRTLLM-backend-0.18.0 with docker image `nvcr.io/nvidia/tritonserver:25.03-trtllm-python-py3`. 
++ Draft model approach is supported since TensorRT-LLM-0.7.0 (using two separate Tritonserver to maintain draft and target model respectively), but has significant optimization in TensorRT-LLM-0.10.0 (using one Tritonserver with [Business Logic Scripting](https://github.com/triton-inference-server/python_backend?tab=readme-ov-file#business-logic-scripting), BLS). + +1. Get related repository inside the container + +```bash +git clone https://github.com/triton-inference-server/tensorrtllm_backend.git +cd tensorrtllm_backend +git checkout rel +git lfs pull +git submodule update --init --recursive +pip install -r requirements.txt +pip install SentencePiece tritonclient +``` + +2. Set necessary variables + ++ If draft and target models can be placed in one GPU (llama-7B-FP8 + llama-30B-FP8, totally 40GiB in one H100-80GiB GPU as example), `DRAFT_GPU_DEVICE_IDS` and `TARGET_GPU_DEVICE_IDS` can be the same, (`0` as example). It appears better performance than placing on two separate GPUs. ++ Elsewise, the draft and target models can be placed in different GPUs, `DRAFT_GPU_DEVICE_IDS="0"` and `TARGET_GPU_DEVICE_IDS="1"` as example. ++ Furthermore, if TP mode is used, the device ids can be a list, `DRAFT_GPU_DEVICE_IDS="0"` and `TARGET_GPU_DEVICE_IDS="1,2,3,4"` as example. + +```bash +export DRAFT_MODEL_NAME="tensorrt_llm_draft" +export TARGET_MODEL_NAME="tensorrt_llm" +export DRAFT_DEVICE_IDS="0" +export TARGET_DEVICE_IDS="1" +export TRITON_MODEL_REPO=llama_dtm +``` + +3. Edit model configuration + +```bash +rm -rf ${TRITON_MODEL_REPO} +cp -r all_models/inflight_batcher_llm/ ${TRITON_MODEL_REPO} + +python3 tools/fill_template.py -i ${TRITON_MODEL_REPO}/ensemble/config.pbtxt triton_max_batch_size:${MAX_BATCH_SIZE},logits_datatype:TYPE_FP32 +python3 tools/fill_template.py -i ${TRITON_MODEL_REPO}/preprocessing/config.pbtxt triton_max_batch_size:${MAX_BATCH_SIZE},tokenizer_dir:${HF_LLAMA_MODEL},preprocessing_instance_count:1 +python3 tools/fill_template.py -i ${TRITON_MODEL_REPO}/postprocessing/config.pbtxt triton_max_batch_size:${MAX_BATCH_SIZE},tokenizer_dir:${HF_LLAMA_MODEL},postprocessing_instance_count:1 +python3 tools/fill_template.py -i ${TRITON_MODEL_REPO}/tensorrt_llm_bls/config.pbtxt triton_max_batch_size:${MAX_BATCH_SIZE},decoupled_mode:False,bls_instance_count:1,accumulate_tokens:False,tensorrt_llm_model_name:${TARGET_MODEL_NAME},tensorrt_llm_draft_model_name:${DRAFT_MODEL_NAME},logits_datatype:TYPE_FP32 + +cp -r ${TRITON_MODEL_REPO}/tensorrt_llm ${TRITON_MODEL_REPO}/tensorrt_llm_draft +sed -i 's/name: "tensorrt_llm"/name: "tensorrt_llm_draft"/g' ${TRITON_MODEL_REPO}/tensorrt_llm_draft/config.pbtxt +python3 tools/fill_template.py -i ${TRITON_MODEL_REPO}/tensorrt_llm/config.pbtxt triton_backend:tensorrtllm,triton_max_batch_size:${MAX_BATCH_SIZE},decoupled_mode:False,max_beam_width:1,engine_dir:${TARGET_ENGINE_PATH},max_tokens_in_paged_kv_cache:2560,max_attention_window_size:2560,kv_cache_free_gpu_mem_fraction:0.5,exclude_input_in_output:True,enable_kv_cache_reuse:True,batching_strategy:inflight_fused_batching,max_queue_delay_microseconds:0,gpu_device_ids:${TARGET_DEVICE_IDS},encoder_input_features_data_type:TYPE_FP16,logits_datatype:TYPE_FP32 +python3 tools/fill_template.py -i ${TRITON_MODEL_REPO}/tensorrt_llm_draft/config.pbtxt 
triton_backend:tensorrtllm,triton_max_batch_size:${MAX_BATCH_SIZE},decoupled_mode:False,max_beam_width:1,engine_dir:${DRAFT_ENGINE_PATH},max_tokens_in_paged_kv_cache:2560,max_attention_window_size:2560,kv_cache_free_gpu_mem_fraction:0.5,exclude_input_in_output:True,enable_kv_cache_reuse:True,batching_strategy:inflight_fused_batching,max_queue_delay_microseconds:0,gpu_device_ids:${DRAFT_DEVICE_IDS},encoder_input_features_data_type:TYPE_FP16,logits_datatype:TYPE_FP32 +``` + +4. Start the triton inference server + ++ `--multi-model` is necessary if TP mode is used. ++ Verbose log will be written in to file `triton_log.txt`. + +```bash +python3 scripts/launch_triton_server.py \ + --model_repo=${TRITON_MODEL_REPO} \ + --multi-model \ + --world_size=1 \ + --log & +``` + ++ Triton server launches successfully if you see the output below in the file: + +```txt +Started HTTPService at 0.0.0.0:8000 +Started GRPCInferenceService at 0.0.0.0:8001 +Started Metrics Service at 0.0.0.0:8002 +``` + +5. Test DTM with a script ++ Prepare a JSON file `input_data.json` containing input data as below (more requests are acceptable). + +```json +[ + { + "input": "James Best, best known for his ", + "instruction": "Continue writing the following story:", + "output": " " + } +] +``` + ++ Use command below to launch requests for inference. ++ `--num-draft-tokens` can be modified by runtime draft lengths, 4 is used in this example. + +```bash +python3 tools/inflight_batcher_llm/speculative_decoding_test.py \ + --max-input-len 2500 \ + --dataset input_data.json \ + --url-control=localhost:8001 \ + --url-target=localhost:8001 \ + --url-draft=localhost:8001 \ + --draft-tensorrt-llm-model-name="${DRAFT_MODEL_NAME}" \ + --target-tensorrt-llm-model-name="${TARGET_MODEL_NAME}" \ + --bls-speculative-tensorrt-llm-model-name="tensorrt_llm_bls" \ + --execute-bls-speculative-decoding \ + --disable-output-comparison \ + --num-draft-tokens=4 \ + --use-draft-logits \ + --verbose +``` + +6. Stop triton inference server after all work is done + +```bash +pkill tritonserver +``` + +7. Advanced usage: Fast logits D2D transfer. + ++ Fast logits boosts the performance (TPS) by hiding the latency of logits transfer from draft engine to target engine supported since TensorRT-LLM-0.15.0. + ++ Modify `participant_ids` entry in `tensorrt_llm/config.pbtxt` and `tensorrt_llm_draft/config.pbtxt` to suitable MPI ranks. Usually in this setting, rank 0 is reserved for the orchestrator rank; rank 1 is for draft engine; the rest of the ranks are for target engine. In this example, `particpant_ids` can be set as snippet below. Same logic also applies to TP>1 target engine. + +```txt +### In tensorrt_llm_draft/config.pbtxt +parameters: { + key: "gpu_device_ids" + value: { + string_value: "0" + } +} +parameters: { + key: "participant_ids" + value: { + string_value: "1" + } +} +### In tensorrt_llm/config.pbtxt +parameters: { + key: "gpu_device_ids" + value: { + string_value: "1" + } +} +parameters: { + key: "participant_ids" + value: { + string_value: "2" + } +} +``` + ++ Enable `speculative_decoding_fast_logits` in both `tensorrt_llm/config.pbtxt` and `tensorrt_llm_draft/config.pbtxt`. + +```txt +parameters: { + key: "speculative_decoding_fast_logits" + value: { + string_value: "1" + } +} +``` + ++ Launched Triton Server + + Use orchestrator mode with `--disable-spawn-process`. See [model config](https://github.com/triton-inference-server/tensorrtllm_backend/blob/main/docs/model_config.md) for more information. 
+ + `--world_size` has to be set as 1 (orchestrator rank 0) + 1 (draft engine ranks) + 1 (target engine ranks). + +```bash +python3 scripts/launch_triton_server.py \ + --model_repo=${TRITON_MODEL_REPO} \ + --multi-model \ + --disable-spawn-processes \ + --world_size=3 \ + --log & +``` + ++ Send request with `use_draft_logits` to tritonserver BLS API: +``` +curl -X POST "http://localhost:8000/v2/models/tensorrt_llm_bls/generate" \ + -H "Content-Type: application/json" \ + -d '{ + "text_input": "Continue writing the following story: James Best, best known for his", + "max_tokens": 128, + "num_draft_tokens": 10, + "use_draft_logits": true, + "stream": false + }' +``` + +### Additional information + ++ With the fast logits enabled and following optimization tips in [model configuration](https://github.com/triton-inference-server/tensorrtllm_backend/blob/main/docs/model_config.md#some-tips-for-model-configuration), speculative decoding with draft logits achieves 2.x throughput in BS1, 1.x throughput in BS16 comparing to auto-regressive decoding using Llama 3.2 1B draft and Llama 3.1 70B target. diff --git a/examples/prompt_lookup/README.md b/examples/prompt_lookup/README.md index f542f50b10f..d82064aa85a 100644 --- a/examples/prompt_lookup/README.md +++ b/examples/prompt_lookup/README.md @@ -4,10 +4,6 @@ This document shows how to build and run a model using Prompt-Lookup speculative ## Overview -The Prompt-Lookup speculative decoding directly copies from the input prompt and previous generated output as draft tokens while generating the later output. It works like Draft-Target-Model but involves only one Target LLM model without further fine-tuning. - -The Prompt-Lookup profit from the scenarios which have high n-gram overlap between input prompt and output, such as summarization, document QA, multi-turn chat, code editing, etc. - The Prompt-Lookup has 3 additional hyperparameters that you need to specify to control the process of generation: - `prompt_lookup_num_tokens`: the number of tokens we extract from input prompt or previous generated output as draft tokens in one iteration, which the range is from 4 to 10 in common usage. Empirically, the larger the value is, the higher acceptance ratio but higher overhead is expected at the same time, so the right balance based on the models and application scenarios needs to be found. - `max_matching_ngram_size`: the number of tokens we get from the tail of the generated output as a pattern, which is used to match in input prompt or previous generated output. Empirically, the larger the value is, the more precise context can be matched from the existed sequence, indicating higher acceptance ratio, but the higher probability of miss-match and higher overhead appear, which fall back to normal generation (one token per iteration). 
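+
+As a rough illustration of how these two hyperparameters interact, the self-contained sketch below proposes draft tokens by copying what followed the most recent matching n-gram in the prompt and previously generated output. It is only an approximation of the idea, not the TensorRT-LLM implementation:
+
+```python
+# Propose draft tokens by copying the tokens that followed the last matching n-gram.
+def propose_draft(tokens, max_matching_ngram_size=2, prompt_lookup_num_tokens=4):
+    for n in range(max_matching_ngram_size, 0, -1):
+        pattern = tokens[-n:]
+        # Search backwards so the most recent occurrence wins.
+        for start in range(len(tokens) - n - 1, -1, -1):
+            if tokens[start:start + n] == pattern:
+                follow = tokens[start + n:start + n + prompt_lookup_num_tokens]
+                if follow:
+                    return follow  # draft tokens copied from the matched context
+    return []  # no match: fall back to normal one-token-per-iteration generation
+
+print(propose_draft([5, 8, 9, 3, 5, 8], max_matching_ngram_size=2))  # -> [9, 3, 5, 8]
+```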
From 63e8fd6d67a322808f4fe19983c63619c8877e1d Mon Sep 17 00:00:00 2001 From: wili-65535 Date: Tue, 8 Apr 2025 20:35:27 +0800 Subject: [PATCH 2/7] doc/Draft-Target-Model: v1.1 Signed-off-by: wili-65535 --- examples/draft_target_model/README.md | 42 ++++++++++++++++++++++----- 1 file changed, 35 insertions(+), 7 deletions(-) diff --git a/examples/draft_target_model/README.md b/examples/draft_target_model/README.md index 67fd3db3c42..70ce8ce2117 100644 --- a/examples/draft_target_model/README.md +++ b/examples/draft_target_model/README.md @@ -155,7 +155,7 @@ python3 scripts/launch_triton_server.py \ --log & ``` -+ Triton server launches successfully if you see the output below in the file: ++ You can see the output below in the file if Triton server launches successfully: ```txt Started HTTPService at 0.0.0.0:8000 @@ -163,21 +163,43 @@ Started GRPCInferenceService at 0.0.0.0:8001 Started Metrics Service at 0.0.0.0:8002 ``` -5. Test DTM with a script +1. Send a request for inference + +```bash +python3 inflight_batcher_llm/client/e2e_grpc_speculative_decoding_client.py \ + --url-target=localhost:8001 \ + --draft-tensorrt-llm-model-name=${DRAFT_MODEL_NAME} \ + --target-tensorrt-llm-model-name=${TARGET_MODEL_NAME} \ + --output-len=100 \ + --num-draft-tokens=4 \ + --end-id=2 \ + --pad-id=2 \ + --prompt "What is Ubuntu operation system?" +``` + ++ You can receive the following results ff everything goes smoothly. + +```txt +Final text: + What is Ubuntu operation system? +Ubuntu is a free and open source operating system that runs from the desktop, to the cloud, to all your internet connected things. Ubuntu is used by millions of people around the world who want to explore new ideas and discover new opportunities. +Ubuntu is a community developed operating system that is perfect for laptops, desktops, servers, and cloud. It is used by millions of people around the world who want to explore new ideas and discover new opportunities. +``` + +6. Test DTM with a script + Prepare a JSON file `input_data.json` containing input data as below (more requests are acceptable). ```json [ { - "input": "James Best, best known for his ", - "instruction": "Continue writing the following story:", + "input": "What is Ubuntu operation system?", + "instruction": "Answer the question shortly.", "output": " " } ] ``` + Use command below to launch requests for inference. -+ `--num-draft-tokens` can be modified by runtime draft lengths, 4 is used in this example. ```bash python3 tools/inflight_batcher_llm/speculative_decoding_test.py \ @@ -196,13 +218,19 @@ python3 tools/inflight_batcher_llm/speculative_decoding_test.py \ --verbose ``` -6. Stop triton inference server after all work is done ++ You can receive the following results ff everything goes smoothly. + +```txt +Ubuntu is a free and open source operating system. It is a Linux based operating system. ... +``` + +7. Stop triton inference server after all work is done ```bash pkill tritonserver ``` -7. Advanced usage: Fast logits D2D transfer. +8. Advanced usage: Fast logits D2D transfer. + Fast logits boosts the performance (TPS) by hiding the latency of logits transfer from draft engine to target engine supported since TensorRT-LLM-0.15.0. 
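+
++ In addition to the `curl` command shown later in this step, the request with `use_draft_logits` can also be sent from Python. The snippet below is a minimal sketch assuming the server from step 4 is listening on `localhost:8000` and that the `requests` package is installed; the exact fields of the response JSON depend on the model configuration.
+
+```python
+# Send one speculative-decoding request to the Triton BLS generate endpoint.
+import requests
+
+payload = {
+    "text_input": "Continue writing the following story: James Best, best known for his",
+    "max_tokens": 128,
+    "num_draft_tokens": 10,
+    "use_draft_logits": True,
+    "stream": False,
+}
+resp = requests.post(
+    "http://localhost:8000/v2/models/tensorrt_llm_bls/generate",
+    json=payload,
+    timeout=300,
+)
+resp.raise_for_status()
+print(resp.json())  # the generated text is returned in the response JSON
+```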
From 69b5a9683db06df37bce7a1a58ada7cef12672e8 Mon Sep 17 00:00:00 2001 From: wili-65535 Date: Tue, 8 Apr 2025 21:24:28 +0800 Subject: [PATCH 3/7] doc/Draft-Target-Model: v1.2 Signed-off-by: wili-65535 --- examples/draft_target_model/README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/examples/draft_target_model/README.md b/examples/draft_target_model/README.md index 70ce8ce2117..6c9e9994717 100644 --- a/examples/draft_target_model/README.md +++ b/examples/draft_target_model/README.md @@ -163,7 +163,7 @@ Started GRPCInferenceService at 0.0.0.0:8001 Started Metrics Service at 0.0.0.0:8002 ``` -1. Send a request for inference +5. Send a request for inference ```bash python3 inflight_batcher_llm/client/e2e_grpc_speculative_decoding_client.py \ From f3624c7315a44576375262e1a4b8e62f7bf2c7fc Mon Sep 17 00:00:00 2001 From: wili-65535 Date: Tue, 8 Apr 2025 22:01:02 +0800 Subject: [PATCH 4/7] doc/Draft-Target-Model: v1.3 Signed-off-by: wili-65535 --- examples/draft_target_model/README.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/examples/draft_target_model/README.md b/examples/draft_target_model/README.md index 6c9e9994717..d449841ad7d 100644 --- a/examples/draft_target_model/README.md +++ b/examples/draft_target_model/README.md @@ -177,7 +177,7 @@ python3 inflight_batcher_llm/client/e2e_grpc_speculative_decoding_client.py \ --prompt "What is Ubuntu operation system?" ``` -+ You can receive the following results ff everything goes smoothly. ++ You can receive the following results if everything goes smoothly. ```txt Final text: @@ -218,7 +218,7 @@ python3 tools/inflight_batcher_llm/speculative_decoding_test.py \ --verbose ``` -+ You can receive the following results ff everything goes smoothly. ++ You can receive the following results if everything goes smoothly. ```txt Ubuntu is a free and open source operating system. It is a Linux based operating system. ... From 4b04288b0b8a64e9a79f0b6f135e10b1ae40bb62 Mon Sep 17 00:00:00 2001 From: wili-65535 Date: Wed, 9 Apr 2025 13:05:28 +0800 Subject: [PATCH 5/7] doc/Draft-Target-Model: v1.4 Signed-off-by: wili-65535 --- docs/source/advanced/speculative-decoding.md | 2 +- examples/draft_target_model/README.md | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/source/advanced/speculative-decoding.md b/docs/source/advanced/speculative-decoding.md index 2c75e0d5ee2..aa4b7cf8177 100644 --- a/docs/source/advanced/speculative-decoding.md +++ b/docs/source/advanced/speculative-decoding.md @@ -61,7 +61,7 @@ Subsequently, the prompt, now updated with the accepted tokens, is sent back to This iterative process continues until a predefined stop conditions are met. An example of this orchestration process can be found in the [TensorRT-LLM Triton backend](https://github.com/triton-inference-server/tensorrtllm_backend/blob/main/inflight_batcher_llm/client/e2e_grpc_speculative_decoding_client.py). -We provide two styles of runnig Draft-Target-Model now: using TensorRT-LLM-BLS in Triton Inference Server, or using TensorRT-LLM directly. Detailed steps of running can be found in [examples/draft_target_model/README.md](https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/draft_target_model/README.md) and the code can be found in [examples/prompt_lookup/run_dtm_pld.py](https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/prompt_lookup/run_dtm_pld.py). +We provide two styles of running Draft-Target-Model now: using TensorRT-LLM-BLS in Triton Inference Server, or using TensorRT-LLM directly. 
Detailed steps of running can be found in [examples/draft_target_model/README.md](https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/draft_target_model/README.md) and the code can be found in [examples/prompt_lookup/run_dtm_pld.py](https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/prompt_lookup/run_dtm_pld.py). ## Prompt-Lookup-Decoding diff --git a/examples/draft_target_model/README.md b/examples/draft_target_model/README.md index d449841ad7d..223deb60f86 100644 --- a/examples/draft_target_model/README.md +++ b/examples/draft_target_model/README.md @@ -4,7 +4,7 @@ This document shows how to build and run a model using DTM speculative decoding ## Overview -We provide two styles of runnig DTM now: using TensorRT-LLM-BLS in Triton Inference Server, or using TensorRT-LLM directly. Here we introduce the detailed steps of running DTM in both workflows. +We provide two styles of running DTM now: using TensorRT-LLM-BLS in Triton Inference Server, or using TensorRT-LLM directly. Here we introduce the detailed steps of running DTM in both workflows. ## Support Matrix * GPU Compute Capability >= 8.0 (Ampere or newer) From f712c79be3251bbfc92c1f53cfe5ec5d13c0aeb9 Mon Sep 17 00:00:00 2001 From: wili-65535 Date: Wed, 9 Apr 2025 17:18:59 +0800 Subject: [PATCH 6/7] doc/Draft-Target-Model: v1.5 Signed-off-by: wili-65535 --- docs/source/advanced/speculative-decoding.md | 2 +- examples/draft_target_model/README.md | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/source/advanced/speculative-decoding.md b/docs/source/advanced/speculative-decoding.md index aa4b7cf8177..a601d9dd24a 100644 --- a/docs/source/advanced/speculative-decoding.md +++ b/docs/source/advanced/speculative-decoding.md @@ -50,7 +50,7 @@ tuned as TensorRT-LLM, the potential time savings are more pronounced. ## Draft-Target-Model -The Draft-Target-Model involves the use of two distinct models (a smaller Draft model and a larger Target model) trained independently but sharing the same vocabulary. For example, GPT 125M / 6.7B models serves as the Draft / Target model. +The Draft-Target-Model involves the use of two distinct models (a smaller Draft model and a larger Target model) trained independently but sharing the same vocabulary. For example, GPT 125M / 6.7B models serve as the Draft / Target model. The management of Draft and Target models is facilitated through two separate `Executor` instances. It is essential that you to coordinate the interactions between the Draft and Target models effectively. 
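+
+How the Target model decides which draft tokens to keep is controlled by the `use_logits` option described in the example README. The sketch below illustrates only the simpler token-matching rule (the `use_logits=False` path); it is an illustration, not the TensorRT-LLM implementation.
+
+```python
+# Token-matching acceptance: keep draft tokens while they agree with the target's
+# own predictions, then append the target token at the first disagreement
+# (or the bonus target token if every draft token was accepted).
+def accept_by_tokens(draft_tokens, target_tokens):
+    # target_tokens has len(draft_tokens) + 1 entries: the target's prediction at
+    # every draft position plus one extra token after the last draft position.
+    accepted = []
+    for draft, target in zip(draft_tokens, target_tokens):
+        if draft == target:
+            accepted.append(draft)
+        else:
+            accepted.append(target)  # first mismatch: take the target's token and stop
+            return accepted
+    accepted.append(target_tokens[-1])  # all drafts matched: keep the bonus token
+    return accepted
+
+print(accept_by_tokens([3, 7, 9], [3, 7, 2, 5]))  # -> [3, 7, 2]
+```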
diff --git a/examples/draft_target_model/README.md b/examples/draft_target_model/README.md index 223deb60f86..5248174a5c0 100644 --- a/examples/draft_target_model/README.md +++ b/examples/draft_target_model/README.md @@ -86,7 +86,7 @@ python3 examples/run.py \ --tokenizer_dir=${TARGET_MODEL_PATH} \ --draft_engine_dir=/workspace/engine-draft \ --engine_dir=/workspace/engine-target \ - --draft_target_model_config="[4,[0],[1],True]" \ + --draft_target_model_config="[4,[0],[1],False]" \ --max_output_len=256 \ --kv_cache_enable_block_reuse \ --kv_cache_free_gpu_memory_fraction=0.5 \ From ec270fcfe71677251d78015eff6020e2c60402ec Mon Sep 17 00:00:00 2001 From: wili-65535 Date: Wed, 9 Apr 2025 17:20:41 +0800 Subject: [PATCH 7/7] doc/Draft-Target-Model: v1.6 Signed-off-by: wili-65535 --- examples/draft_target_model/README.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/examples/draft_target_model/README.md b/examples/draft_target_model/README.md index 5248174a5c0..e7bd67bdf7d 100644 --- a/examples/draft_target_model/README.md +++ b/examples/draft_target_model/README.md @@ -132,8 +132,8 @@ rm -rf ${TRITON_MODEL_REPO} cp -r all_models/inflight_batcher_llm/ ${TRITON_MODEL_REPO} python3 tools/fill_template.py -i ${TRITON_MODEL_REPO}/ensemble/config.pbtxt triton_max_batch_size:${MAX_BATCH_SIZE},logits_datatype:TYPE_FP32 -python3 tools/fill_template.py -i ${TRITON_MODEL_REPO}/preprocessing/config.pbtxt triton_max_batch_size:${MAX_BATCH_SIZE},tokenizer_dir:${HF_LLAMA_MODEL},preprocessing_instance_count:1 -python3 tools/fill_template.py -i ${TRITON_MODEL_REPO}/postprocessing/config.pbtxt triton_max_batch_size:${MAX_BATCH_SIZE},tokenizer_dir:${HF_LLAMA_MODEL},postprocessing_instance_count:1 +python3 tools/fill_template.py -i ${TRITON_MODEL_REPO}/preprocessing/config.pbtxt triton_max_batch_size:${MAX_BATCH_SIZE},tokenizer_dir:${TARGET_MODEL_PATH},preprocessing_instance_count:1 +python3 tools/fill_template.py -i ${TRITON_MODEL_REPO}/postprocessing/config.pbtxt triton_max_batch_size:${MAX_BATCH_SIZE},tokenizer_dir:${TARGET_MODEL_PATH},postprocessing_instance_count:1 python3 tools/fill_template.py -i ${TRITON_MODEL_REPO}/tensorrt_llm_bls/config.pbtxt triton_max_batch_size:${MAX_BATCH_SIZE},decoupled_mode:False,bls_instance_count:1,accumulate_tokens:False,tensorrt_llm_model_name:${TARGET_MODEL_NAME},tensorrt_llm_draft_model_name:${DRAFT_MODEL_NAME},logits_datatype:TYPE_FP32 cp -r ${TRITON_MODEL_REPO}/tensorrt_llm ${TRITON_MODEL_REPO}/tensorrt_llm_draft