I'm using the FlashInfer 0.2.5 attention backend with vLLM 0.9.0.1 on CUDA 12.8, with VLLM_USE_FLASHINFER_SAMPLER=1. When the temperature differs between requests, this is what happens:
...
1|vllm-server-7011 | (VllmWorker rank=2 pid=16417) WARNING 05-31 03:56:50 [topk_topp_sampler.py:99] FlashInfer 0.2.3+ does not support per-request generators. Falling back to PyTorch-native implementation.
1|vllm-server-7011 | (VllmWorker rank=3 pid=16418) WARNING 05-31 03:56:50 [topk_topp_sampler.py:99] FlashInfer 0.2.3+ does not support per-request generators. Falling back to PyTorch-native implementation.
1|vllm-server-7011 | (VllmWorker rank=7 pid=16422) WARNING 05-31 03:56:50 [topk_topp_sampler.py:99] FlashInfer 0.2.3+ does not support per-request generators. Falling back to PyTorch-native implementation.
1|vllm-server-7011 | (VllmWorker rank=6 pid=16421) WARNING 05-31 03:56:50 [topk_topp_sampler.py:99] FlashInfer 0.2.3+ does not support per-request generators. Falling back to PyTorch-native implementation.
1|vllm-server-7011 | (VllmWorker rank=5 pid=16420) WARNING 05-31 03:56:50 [topk_topp_sampler.py:99] FlashInfer 0.2.3+ does not support per-request generators. Falling back to PyTorch-native implementation.
...
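For context, the warning suggests a per-batch dispatch roughly like the following sketch. The function name and structure here are my assumptions for illustration, not vLLM's actual API; the real check lives in vLLM's `topk_topp_sampler.py`:

```python
# Hypothetical sketch of the fallback dispatch implied by the warning above.
# Names and structure are illustrative assumptions, not vLLM's real code.

def choose_sampler_backend(per_request_generators: dict) -> str:
    """Pick a sampling backend for one batch.

    Per the warning, FlashInfer 0.2.3+ cannot honor per-request
    generators (e.g. when individual requests carry their own seeded
    RNG state), so any batch containing them must fall back to the
    PyTorch-native implementation.
    """
    if per_request_generators:
        # Mirrors the logged warning: per-request generators present,
        # so FlashInfer sampling cannot be used for this batch.
        return "pytorch-native"
    return "flashinfer"

# A batch where every request shares global sampling state can keep
# FlashInfer; a batch with even one per-request generator falls back.
print(choose_sampler_backend({}))             # flashinfer
print(choose_sampler_backend({0: object()}))  # pytorch-native
```

If this is accurate, the fallback is per-batch rather than global, which would explain why the warning repeats across workers whenever a mixed batch arrives.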
I can leave VLLM_USE_FLASHINFER_SAMPLER=1 set, and sampling falls back to the PyTorch-native implementation as the warning describes, so requests still complete.
Please let me know if per-request generator support in the FlashInfer sampler is a WIP, thanks!
Cross post:
vllm-project/vllm#18811