Encountered an error in forward function: slice 712 exceeds buffer size 471 #1480

Closed
1 of 4 tasks
sleepwalker2017 opened this issue Apr 22, 2024 · 5 comments

@sleepwalker2017

System Info

GPU: 2 × A30

TensorRT-LLM version: v0.9.0

Model: Vicuna 13B

Who can help?

@byshiue

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

  1. Build the engine
python convert_checkpoint.py --model_dir /data/weilong.yu/vicuna-13b/vicuna-13b-v1.5/ \
                              --output_dir ./tllm_checkpoint_2gpu_fp16 \
                              --dtype float16 --tp_size 2

trtllm-build --checkpoint_dir ./tllm_checkpoint_2gpu_fp16 \
            --output_dir ./tmp/llama/13B/trt_engines/fp16/2-gpu \
            --gemm_plugin float16 \
            --use_fused_mlp \
            --max_batch_size $1 \
            --max_input_len 2048 \
            --max_output_len 256 \
            --context_fmha enable \
            --paged_kv_cache enable \
            --use_paged_context_fmha enable \
            --remove_input_padding enable  --workers 2 \
            --use_fused_mlp
  2. Run the benchmark
mpirun -n 2 --allow-run-as-root ./gptManagerBenchmark --engine_dir ../../../examples/llama/tmp/llama/13B/trt_engines/fp16/2-gpu/ --dataset ../../../benchmarks/cpp/token-norm-dist.json --kv_cache_free_gpu_mem_fraction 0.85 --enable_kv_cache_reuse -enable_chunked_context
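
In the trtllm-build command above, $1 is a positional shell argument supplying --max_batch_size; a later comment in this thread states that 24 was used. As a minimal sketch, assuming the two build commands above are wrapped in a hypothetical build.sh that takes the batch size as its first argument:

# Hypothetical invocation; the actual wrapper script was not posted in the issue.
# The first argument is substituted for $1, i.e. --max_batch_size 24.
sh build.sh 24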

Expected behavior

No error message.

Actual behavior

sh run.sh
[TensorRT-LLM][ERROR] Encountered an error in forward function: slice 712 exceeds buffer size 471
[TensorRT-LLM][WARNING] Step function failed, continuing.
[TensorRT-LLM][ERROR] Encountered an error in forward function: slice 712 exceeds buffer size 471
[TensorRT-LLM][WARNING] Step function failed, continuing.
[TensorRT-LLM][ERROR] Encountered an error in forward function: slice 1553 exceeds buffer size 927
[TensorRT-LLM][WARNING] Step function failed, continuing.
[TensorRT-LLM][ERROR] Encountered an error in forward function: slice 1553 exceeds buffer size 927
[TensorRT-LLM][WARNING] Step function failed, continuing.
[TensorRT-LLM][ERROR] Encountered an error in forward function: slice 884 exceeds buffer size 642
[TensorRT-LLM][WARNING] Step function failed, continuing.
[TensorRT-LLM][ERROR] Encountered an error in forward function: slice 884 exceeds buffer size 642
[TensorRT-LLM][WARNING] Step function failed, continuing.
[TensorRT-LLM][ERROR] Encountered an error in forward function: slice 1192 exceeds buffer size 951
[TensorRT-LLM][WARNING] Step function failed, continuing.
[TensorRT-LLM][ERROR] Encountered an error in forward function: slice 1192 exceeds buffer size 951
[TensorRT-LLM][WARNING] Step function failed, continuing.
[TensorRT-LLM][ERROR] Encountered an error in forward function: slice 1253 exceeds buffer size 1012
[TensorRT-LLM][ERROR] Encountered an error in forward function: slice 1253 exceeds buffer size 1012
[TensorRT-LLM][WARNING] Step function failed, continuing.
[TensorRT-LLM][WARNING] Step function failed, continuing.
[BENCHMARK] num_samples 200
[BENCHMARK] total_latency(ms) 71149.43
[BENCHMARK] seq_throughput(seq/sec) 2.81
[BENCHMARK] token_throughput(token/sec) 531.37
[BENCHMARK] avg_sequence_latency(ms) 22587.76
[BENCHMARK] p99_sequence_latency(ms) 50983.86
[BENCHMARK] p90_sequence_latency(ms) 45602.29
[BENCHMARK] p50_sequence_latency(ms) 14514.95
[TensorRT-LLM][INFO] Terminate signal received, worker thread exiting.
[TensorRT-LLM][INFO] Terminate signal received, worker thread exiting.

Additional notes

None.

sleepwalker2017 added the bug (Something isn't working) label on Apr 22, 2024
@Tushar-ml
Contributor

I am getting the same issue when trying speculative decoding (Medusa) with Vicuna: after some inference, it fails with a slice exceeding buffer size 2560.

@skyCreateXian

skyCreateXian commented May 6, 2024

I encountered this issue while using speculative decoding: '[TensorRT-LLM][ERROR] Encountered an error in forward function: slice 501760 exceeds buffer size 250880'. Version 0.9.0 dev20240222000 works normally.

@pcastonguay
Collaborator

Hi, thanks for reporting this issue. I haven't been able to reproduce on latest main on 2xA100. What --max_batch_size value did you use (it's not specified in the build cmd you shared)? Thanks.

@pcastonguay
Collaborator

I also just tested on 2xA30 and cannot reproduce using latest main following the instructions shared above.

mpirun -n 2 --allow-run-as-root ./gptManagerBenchmark --engine_dir ../../../examples/llama/tmp/llama/13B/trt_engines/fp16/2-gpu/ --dataset ../../../benchmarks/cpp/token-norm-dist.json --kv_cache_free_gpu_mem_fraction 0.85 --enable_kv_cache_reuse
[BENCHMARK] num_samples 100
[BENCHMARK] num_error_samples 0

[BENCHMARK] num_samples 100
[BENCHMARK] total_latency(ms) 1506.20
[BENCHMARK] seq_throughput(seq/sec) 66.39
[BENCHMARK] token_throughput(token/sec) 995.88

[BENCHMARK] avg_sequence_latency(ms) 1116.72
[BENCHMARK] max_sequence_latency(ms) 1501.60
[BENCHMARK] min_sequence_latency(ms) 872.77
[BENCHMARK] p99_sequence_latency(ms) 1501.60
[BENCHMARK] p90_sequence_latency(ms) 1501.58
[BENCHMARK] p50_sequence_latency(ms) 900.98

@sleepwalker2017
Author

mpirun -n 2 --allow-run-as-root ./gptManagerBenchmark --engine_dir ../../../examples/llama/tmp/llama/13B/trt_engines/fp16/2-gpu/ --dataset ../../../benchmarks/cpp/token-norm-dist.json --kv_cache_free_gpu_mem_fraction 0.85 --enable_kv_cache_reuse -enable_chunked_context

Hi, this issue is reproduced when --enable_kv_cache_reuse and -enable_chunked_context are used together; the command you ran above does not include -enable_chunked_context.

I built the engine with --max_batch_size 24.
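
For reference, a sketch of the resolved build command, assuming $1 in the Reproduction section expanded to 24 (all other flags copied as originally posted):

trtllm-build --checkpoint_dir ./tllm_checkpoint_2gpu_fp16 \
            --output_dir ./tmp/llama/13B/trt_engines/fp16/2-gpu \
            --gemm_plugin float16 \
            --use_fused_mlp \
            --max_batch_size 24 \
            --max_input_len 2048 \
            --max_output_len 256 \
            --context_fmha enable \
            --paged_kv_cache enable \
            --use_paged_context_fmha enable \
            --remove_input_padding enable --workers 2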
