
FlashInfer 0.2.3+ does not support per-request generators #1104

@sirouk

Description

I'm using the FlashInfer 2.5 attention backend with vLLM 0.9.0.1 on CUDA 12.8, with VLLM_USE_FLASHINFER_SAMPLER=1 set. When the temperature differs between requests, this is what happens:

...
1|vllm-server-7011  | (VllmWorker rank=2 pid=16417) WARNING 05-31 03:56:50 [topk_topp_sampler.py:99] FlashInfer 0.2.3+ does not support per-request generators. Falling back to PyTorch-native implementation.
1|vllm-server-7011  | (VllmWorker rank=3 pid=16418) WARNING 05-31 03:56:50 [topk_topp_sampler.py:99] FlashInfer 0.2.3+ does not support per-request generators. Falling back to PyTorch-native implementation.
1|vllm-server-7011  | (VllmWorker rank=7 pid=16422) WARNING 05-31 03:56:50 [topk_topp_sampler.py:99] FlashInfer 0.2.3+ does not support per-request generators. Falling back to PyTorch-native implementation.
1|vllm-server-7011  | (VllmWorker rank=6 pid=16421) WARNING 05-31 03:56:50 [topk_topp_sampler.py:99] FlashInfer 0.2.3+ does not support per-request generators. Falling back to PyTorch-native implementation.
1|vllm-server-7011  | (VllmWorker rank=5 pid=16420) WARNING 05-31 03:56:50 [topk_topp_sampler.py:99] FlashInfer 0.2.3+ does not support per-request generators. Falling back to PyTorch-native implementation.
...
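For context, the warning fires when any request in the batch carries its own RNG (for example, a request with an explicit seed). Below is a minimal sketch, not the actual vLLM source, of the kind of dispatch that produces this behavior; the function and parameter names are illustrative assumptions:

```python
# Hedged sketch (assumed logic, not vLLM's real topk_topp_sampler.py):
# FlashInfer 0.2.3+ sampling kernels draw from a single shared RNG, so a
# batch containing any per-request generator must take the PyTorch-native
# path, which is exactly what the warning above reports.

def choose_sampler(per_request_generators: dict, flashinfer_available: bool) -> str:
    """Pick a sampling backend for the current batch.

    per_request_generators maps request indices to their private RNGs
    (e.g. requests that set an explicit seed). Any entry forces the
    PyTorch-native fallback.
    """
    if flashinfer_available and not per_request_generators:
        return "flashinfer"
    return "pytorch-native"
```

With an empty generator map the FlashInfer kernel is used; as soon as one request supplies its own generator, the whole batch falls back.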

I am able to set VLLM_USE_FLASHINFER_SAMPLER=1, but it still falls back to the Flash Attention sampler.
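For reference, a minimal sketch of the environment from this report (the variable name and value are taken from the text above; the runtime-fallback comment restates the logged warning):

```shell
# Environment described in this report (vLLM 0.9.0.1, CUDA 12.8).
export VLLM_USE_FLASHINFER_SAMPLER=1   # request FlashInfer sampling kernels
# When the batch contains per-request generators (e.g. seeded requests),
# vLLM still falls back at runtime and logs the warning shown above.
```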

Please let me know if this is a WIP, thanks!

Cross post:
vllm-project/vllm#18811
