Fix a bug in tying OPT embeddings #1

WoosukKwon · 2023-02-25T00:27:11Z

This PR fixes a bug in supporting OPT-350m/OPT-6.7b/OPT-13b and OPT-IML models.

The bug happened because our model code didn't include some methods that were required to tie the input and output embeddings.
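To make "tying the input and output embeddings" concrete, here is a minimal plain-Python sketch, not the actual model code. The class and method names (`TinyLM`, `get_input_embeddings`, `tie_weights`) are hypothetical and chosen only for illustration; the PR description does not name the methods that were missing.

```python
class TinyLM:
    """Toy language model with an embedding table and an output projection."""

    def __init__(self, vocab_size: int, hidden_size: int) -> None:
        # Input embedding: one row of `hidden_size` floats per token id.
        self.embed_tokens = [[0.0] * hidden_size for _ in range(vocab_size)]
        # The output projection starts as a separate matrix of the same shape.
        self.lm_head = [[0.0] * hidden_size for _ in range(vocab_size)]

    def get_input_embeddings(self):
        # Hypothetical accessor that the tying step relies on.
        return self.embed_tokens

    def tie_weights(self) -> None:
        # Tying: the output projection reuses the very same matrix object.
        self.lm_head = self.get_input_embeddings()

    def logits(self, hidden):
        # Score each vocabulary entry by its dot product with the hidden state.
        return [sum(w * h for w, h in zip(row, hidden)) for row in self.lm_head]


model = TinyLM(vocab_size=3, hidden_size=2)
model.tie_weights()
# Writing only the embedding table (as a checkpoint loader might) also updates the output layer.
model.embed_tokens[1][:] = [0.5, -1.0]
tied_logits = model.logits([2.0, -1.0])
```

Without the `tie_weights()` call, `lm_head` would keep its initial values after loading, which is the kind of failure a missing tying method can produce.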

BA-78554: Jurassic 2.5

* worked on jurasic2.5 configuration file, updated jurassic2_5 modeling file to support alternating experts/attn layers
* finished working the forward pass of jurassic3.py
* finished working the forward pass of jurassic3.py
* finished working the forward pass of jurassic3.py
* jurassic_3 modeling file works, uses dummy weights initialized by "dummy" flag. Tokenizer raises issues, for now copying the mixtral tokenizer
* changed default tokenizer vocab values, loading of custom .pt weight files works.
* removed notebook
* merging master to jurassic-2.5 to reset head
* Merge branch 'master' into jurassic-2.5
* align to master

Approved-by: Tomer Asida
Approved-by: Mor Zusman

Bug #1 (CRITICAL): Add missing begin() and stage() methods to KVWriteRouter
- Flash attention backend calls router.begin() and router.stage()
- KVWriteRouter only had write() and commit() methods
- Added begin() to store slot_mapping and initialize shadow buffer
- Added stage() to extract per-timestep slot and stage KV pairs
- Without these, no tokens were being staged → 0% acceptance rate

Bug #2 (MODERATE): Fix bonus token counting in accepted_lens
- valid_sampled_token_ids includes [accepted_draft_tokens..., bonus_token]
- Previous: len([bonus]) = 1, incorrectly counted as 1 accepted draft token
- Fixed: Use max(0, len(seq) - 1) to exclude bonus token from count
- Now correctly reports 0 accepted when only bonus token is present

Files modified:
- vllm/v1/kv_cache/write_router.py: Added begin() and stage() methods
- vllm/v1/worker/gpu_model_runner.py: Fixed accepted_lens calculation
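The bonus-token fix described above can be sketched in isolation as follows. The function name `accepted_draft_lens` is hypothetical; only the `max(0, len(seq) - 1)` expression and the layout of `valid_sampled_token_ids` come from the commit message.

```python
def accepted_draft_lens(valid_sampled_token_ids: list[list[int]]) -> list[int]:
    """Count accepted draft tokens per sequence, excluding the trailing bonus token."""
    # Each sequence is [accepted_draft_tokens..., bonus_token]; an empty
    # sequence is clamped to 0 rather than producing -1.
    return [max(0, len(seq) - 1) for seq in valid_sampled_token_ids]


# Three accepted drafts plus bonus, bonus only, and an empty sequence.
lens = accepted_draft_lens([[11, 12, 13, 99], [99], []])
```

With the earlier `len(seq)` counting, the second sequence would have been reported as one accepted draft token even though it holds only the bonus token.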

Bug #1: EAGLE tree proposal returned zeros for draft_logprobs
- Root cause: When using topk for tree branching, code set draft_logp_list=None, then created zeros tensor as fallback (lines 850-851)
- Fix: Compute actual log-probs from logits using log_softmax + gather
- Applied at 2 locations: root level (lines 698-704) and tree levels (lines 839-846)

Bug #2: Added diagnostic logging in rejection sampler
- Log draft_p (nonzero) min/med/max to detect zeros
- Log p_target min/med/max to detect degenerate softmax
- Helps identify if target logits are masked/filtered before sampling

Expected results after fix:
- draft_logp: -3.2/-1.6/-0.0 (real log-probs, all ≤ 0) instead of 0/0/0
- p_target: 1e-6/1e-3/0.7 (realistic distribution) instead of 1/1/1
- Acceptance rate: 30-70% instead of 0%

Files changed:
- vllm/v1/spec_decode/eagle.py: Fix draft_logp computation
- vllm/v1/sample/rejection_sampler.py: Add sanity logging
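The "log_softmax + gather" step can be illustrated without tensors. The sketch below is a plain-Python analogue operating on lists, not the code in `eagle.py`; the helper names `log_softmax` and `gather_draft_logprobs` are assumptions made for this example.

```python
import math


def log_softmax(logits: list[float]) -> list[float]:
    """Numerically stable log-softmax over one row of logits."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_norm for x in logits]


def gather_draft_logprobs(logits_rows: list[list[float]], token_ids: list[int]) -> list[float]:
    """For each row, pick the log-probability of the token that was drafted."""
    return [log_softmax(row)[tok] for row, tok in zip(logits_rows, token_ids)]


# Two draft positions: the first drafted token 0, the second drafted token 1.
draft_logp = gather_draft_logprobs([[2.0, 0.0, -1.0], [0.0, 0.0]], [0, 1])
```

Every value produced this way is at most 0, which matches the commit's expectation that real log-probs are all ≤ 0, in contrast to the all-zero fallback tensor.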

CRITICAL FIX: tau_d was reading draft_temperature (0.05) instead of target temperature from sampling_metadata (1.0).

This caused:
- tau_q = 0.05 + 0.3 = 0.35 (before)
- Logit gap = 10/0.35 = 28.6 → exp(-28.6) ≈ 0 (underflow!)
- q collapses to 0.98-1.0

After fix:
- tau_d = 1.0 (from sampling_metadata.temperature)
- tau_q = 1.0 + 0.3 = 1.3
- Logit gap = 10/1.3 = 7.7 → exp(-7.7) = 0.00045 (survives!)
- q should be in [0.5, 0.8] range

Changes:
- propose(): Store sampling_metadata as self._current_sampling_metadata
- _sample_draft_tokens(): Read tau_d from sampling_metadata, not opt_config
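The arithmetic in this commit message can be reproduced with a few lines of Python. The function name `tail_weight` and its `smoothing` parameter are hypothetical; the logit gap of 10, the offset of 0.3, and the two temperatures are the numbers quoted above.

```python
import math


def tail_weight(logit_gap: float, tau_d: float, smoothing: float = 0.3) -> float:
    """Unnormalized weight exp(-gap / tau_q) of a token `logit_gap` below the top logit."""
    tau_q = tau_d + smoothing
    return math.exp(-logit_gap / tau_q)


before = tail_weight(10.0, tau_d=0.05)  # tau_q = 0.35, scaled gap about 28.6
after = tail_weight(10.0, tau_d=1.0)    # tau_q = 1.3, scaled gap about 7.7
```

With `tau_d = 0.05` the weight is so small that it is effectively zero next to the top token, whereas with `tau_d = 1.0` it stays near the 0.00045 figure cited in the message.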

Enhanced documentation for plugin patches:

1. Patch vllm-project#1 (Usage Tracking Helper):
   - Clarified as OPTIONAL (has fallback in harmony streaming patch)
   - Changed from "REQUIRED" to "OPTIONAL"
   - Explained fallback mechanism in patched_stream_method.py
   - Marked as upstreamable (minor utility addition)

2. Patch vllm-project#3 (Harmony Token-by-Token Streaming):
   - Added detailed speculative decoding context
   - Explained Eagle draft model generates 5-10 tokens per step
   - Documented specific failures with batch processing:
     * Tool calling broken
     * Multi-channel content lost
     * Token truncation during channel transitions
   - Added before/after code examples
   - Linked to PR vllm-project#26291 (Eagle3 Multi-Channel Streaming Fix)
   - Documented upstream status and removal plan

Key insight: This patch exists because Eagle speculative decoding returns multiple tokens per step, and upstream's batch processing can't handle per-token channel switching.

Signed-off-by: Pradyun Ramadorai <[email protected]>

Summary: Running FI Cutlass moe with FI a2av backend runs into error:

```
(EngineCore_DP7 pid=104761) ERROR 11-05 14:09:51 [core.py:843] ) = self.prepare_finalize.prepare(
(EngineCore_DP7 pid=104761) ERROR 11-05 14:09:51 [core.py:843] File "/data/users/mxz/fbsource/buck-out/v2/gen/fbcode/c9838acc51201940/smart/inference_platform_sp/llm_predictor_gpu/__service__/service#link-tree/vllm/model_executor/layers/fused_moe/flashinfer_cutlass_prepare_finalize.py", line 115, in prepare
(EngineCore_DP7 pid=104761) ERROR 11-05 14:09:51 [core.py:843] flashinfer_alltoall_dispatch(
(EngineCore_DP7 pid=104761) ERROR 11-05 14:09:51 [core.py:843] File "/data/users/mxz/fbsource/buck-out/v2/gen/fbcode/c9838acc51201940/smart/inference_platform_sp/llm_predictor_gpu/__service__/service#link-tree/vllm/model_executor/layers/fused_moe/flashinfer_cutlass_prepare_finalize.py", line 239, in flashinfer_alltoall_dispatch
(EngineCore_DP7 pid=104761) ERROR 11-05 14:09:51 [core.py:843] all2all_manager.prepare_workspace,
(EngineCore_DP7 pid=104761) ERROR 11-05 14:09:51 [core.py:843] AttributeError: 'FlashInferAllToAllManager' object has no attribute 'prepare_workspace'. Did you mean: 'prepare_workspace_tensor'?
(EngineCore_DP5 pid=104759) ERROR 11-05 14:09:51 [core.py:843] EngineCore failed to start.
```

After fixing the error above, running into the following error:

```
(EngineCore_DP5 pid=821648) File "/data/users/mxz/fbsource/buck-out/v2/gen/fbcode/c9838acc51201940/smart/inference_platform_sp/llm_predictor_gpu/__service__/service#link-tree/flashinfer/fused_moe/core.py", line 817, in cutlass_fused_moe
(EngineCore_DP5 pid=821648) return get_cutlass_fused_moe_module(device_arch).cutlass_fused_moe(
(EngineCore_DP5 pid=821648) File "/data/users/mxz/fbsource/buck-out/v2/gen/fbcode/c9838acc51201940/smart/inference_platform_sp/llm_predictor_gpu/__service__/service#link-tree/flashinfer/fused_moe/core.py", line 537, in cutlass_fused_moe
(EngineCore_DP5 pid=821648) run_moe(
(EngineCore_DP5 pid=821648) File "tvm_ffi/function.pxi", line 814, in tvm_ffi.core.Function.__call__
(EngineCore_DP5 pid=821648) File "buck-out/v2/gen/fbcode/deeplearning/tvm_ffi/tvm_ffi/cython/__core__cython-lib__/19a62205b4ea2336/buck-headers/tvm_ffi_python_helpers.h", line 323, in _ZL43__pyx_pw_7tvm_ffi_4core_8Function_3__call__P7_objectS0_S0__tvm_ffi$core
(EngineCore_DP5 pid=821648) File "fbcode/deeplearning/flashinfer/csrc/fused_moe/cutlass_backend/flashinfer_cutlass_fused_moe_sm100_binding.cu", line 706, in FusedMoeRunner::GetFunction(tvm::ffi::String const&)::{lambda(tvm::ffi::TensorView, tvm::ffi::TensorView, tvm::ffi::TensorView, tvm::ffi::Optional<tvm::ffi::TensorView, void>, tvm::ffi::TensorView, tvm::ffi::Optional<tvm::ffi::TensorView, void>, tvm::ffi::TensorView, tvm::ffi::Optional<tvm::ffi::TensorView, void>, tvm::ffi::Optional<tvm::ffi::Array<tvm::ffi::Tensor, void>, void>, tvm::ffi::Optional<tvm::ffi::TensorView, void>, tvm::ffi::Optional<tvm::ffi::TensorView, void>, tvm::ffi::Optional<tvm::ffi::TensorView, void>, tvm::ffi::Optional<tvm::ffi::TensorView, void>, long, long, long, long, long, long, bool, bool, tvm::ffi::Optional<tvm::ffi::Array<long, void>, void>, bool, long)#1}::operator()(tvm::ffi::TensorView, tvm::ffi::TensorView, tvm::ffi::TensorView, tvm::ffi::Optional<tvm::ffi::TensorView, void>, tvm::ffi::TensorView, tvm::ffi::Optional<tvm::ffi::TensorView, void>, tvm::ffi::TensorView, tvm::ffi::Optional<tvm::ffi::TensorView, void>, tvm::ffi::Optional<tvm::ffi::Array<tvm::ffi::Tensor, void>, void>, tvm::ffi::Optional<tvm::ffi::TensorView, void>, tvm::ffi::Optional<tvm::ffi::TensorView, void>, tvm::ffi::Optional<tvm::ffi::TensorView, void>, tvm::ffi::Optional<tvm::ffi::TensorView, void>, long, long, long, long, long, long, bool, bool, tvm::ffi::Optional<tvm::ffi::Array<long, void>, void>, bool, long) const
(EngineCore_DP5 pid=821648) File "fbcode/deeplearning/flashinfer/csrc/fused_moe/cutlass_backend/flashinfer_cutlass_fused_moe_sm100_binding.cu", line 248, in void FusedMoeRunner::runMoe(tvm::ffi::TensorView, tvm::ffi::TensorView, tvm::ffi::TensorView, tvm::ffi::Optional<tvm::ffi::TensorView>, tvm::ffi::TensorView, tvm::ffi::Optional<tvm::ffi::TensorView>, tvm::ffi::TensorView, tvm::ffi::Optional<tvm::ffi::TensorView>, tvm::ffi::Optional<tvm::ffi::Array<tvm::ffi::Tensor>>, tvm::ffi::Optional<tvm::ffi::TensorView>, tvm::ffi::Optional<tvm::ffi::TensorView>, tvm::ffi::Optional<tvm::ffi::TensorView>, tvm::ffi::Optional<tvm::ffi::TensorView>, int64_t, int64_t, int64_t, int64_t, int64_t, int64_t, bool, bool, tvm::ffi::Optional<tvm::ffi::Array<long>>, bool, ActivationType)
(EngineCore_DP5 pid=821648) RuntimeError: Check failed: token_final_scales.value().dtype() == dl_float32 (int32 vs. float32) : Inconsistency of Tensor type: token_final_scales.value()
I1105 14:19:35.039142 822035 HealthTracker.cpp:26 req:00007fd9d4e1b100] Mark connection as healthy.
```

It seems the flashinfer moe_prepare kernel always returns an int32 tensor, so convert the type accordingly.

Differential Revision: D86345110
Signed-off-by: Xiaozhu <[email protected]>

- Add section-level state machine (in_tool_section flag)
- Implement rolling buffer for split marker detection (1KB cap)
- Suppress content between section_begin and tool_call_begin
- Support marker variants (plural/singular)
- Add error recovery for malformed sections (8KB limit)
- Preserve function contract (always return DeltaMessage)
- Fix critical bug vllm-project#1: Handle both begin/end markers in same chunk (Changed elif to if on line 237 to prevent state corruption)
- Fix critical bug vllm-project#2: Defer section exit when tool_call_end present (Prevents dropping final tool arguments and token leakage)
- Include 12 comprehensive tests (3 new tests for edge cases)

Fixes bug where text between <|tool_calls_section_begin|> and <|tool_call_begin|> leaks into reasoning_delta during streaming mode.

Also fixes two critical edge cases:
1. Section begin and end markers appearing in same chunk would leave parser stuck in in_tool_section=True, causing subsequent content to be incorrectly suppressed.
2. Tool_call_end and section_end in same chunk would cause early return before tool parsing, dropping final tool arguments and leaking special tokens into reasoning channel.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
Signed-off-by: Jscaldwell55 <[email protected]>
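The first edge case (begin and end markers in the same chunk) comes down to using two independent `if` checks instead of `if`/`elif`. The sketch below is a simplified illustration, not the parser from the commit: the `SectionTracker` class is hypothetical, the end-marker string `<|tool_calls_section_end|>` is an assumed spelling, and it assumes that within one chunk the begin marker precedes the end marker.

```python
SECTION_BEGIN = "<|tool_calls_section_begin|>"
SECTION_END = "<|tool_calls_section_end|>"  # assumed spelling of the end marker


class SectionTracker:
    """Tracks whether the stream is currently inside a tool-call section."""

    def __init__(self) -> None:
        self.in_tool_section = False

    def feed(self, chunk: str) -> bool:
        """Update state for one streamed chunk and return the new state."""
        if SECTION_BEGIN in chunk:
            self.in_tool_section = True
        # Independent `if`, not `elif`: a single chunk may carry both markers,
        # and the end marker must still be honored in that case.
        if SECTION_END in chunk:
            self.in_tool_section = False
        return self.in_tool_section


tracker = SectionTracker()
same_chunk = tracker.feed(SECTION_BEGIN + "call" + SECTION_END)
after_plain_text = tracker.feed("ordinary reasoning text")
reopened = tracker.feed(SECTION_BEGIN)
```

With `elif`, the first call would leave `in_tool_section` set to `True`, so the ordinary text that follows would be suppressed as if it were still inside the section.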

Fix OPT errors

44735b4

WoosukKwon merged commit cbf8779 into main Feb 25, 2023

WoosukKwon deleted the fix-opt branch February 25, 2023 00:29

murongweibo mentioned this pull request Jul 11, 2023

NCCL Error 5: invalid usage #427

Closed

TheBloke mentioned this pull request Jul 20, 2023

Can't launch OpenAI API server on newly installed vLLM in Docker - fastchat not found #537

Closed

CZT0 referenced this pull request in semedia-tech/vllm Sep 11, 2023

#1 Test deploying vllm

cc4f1ce

orangetin referenced this pull request in togethercomputer/vllm-ttgi Sep 14, 2023

Merge pull request #1 from winglian/longchat-args

b9012fb

add rope scaling as a cli arg so openai server can load rope scaled models

xiangyuT pushed a commit to xiangyuT/vllm that referenced this pull request Oct 18, 2023

Add function invoke call for underlying models (vllm-project#1)

9895bbd

bigPYJ1151 added a commit to bigPYJ1151/vllm that referenced this pull request Oct 30, 2023

Merge pull request vllm-project#1 from bigPYJ1151/fix_ans

b5e7066

Fix key cache block shape.

l1cacheDell pushed a commit to CaspianFang/vllm that referenced this pull request Nov 15, 2023

blora LlaMa support vllm-project#1

424df61

shanshanpt mentioned this pull request Nov 17, 2023

Run long conetxt error : CUDA error: an illegal memory access was encountered #1700

Closed

junior-zsy mentioned this pull request Nov 20, 2023

Error with 32k Long Text in chatglm2-6b-32k Model #1725

Closed

hongxiayang referenced this pull request in hongxiayang/vllm Feb 13, 2024

Fix a bug in tying OPT embeddings (#1)

2cb721d

kvikk mentioned this pull request Feb 15, 2024

ERROR: Could not build wheels for vllm, which is required to install pyproject.toml-based projects #2735

Closed

ilya-lavrenov referenced this pull request in ilya-lavrenov/vllm Feb 19, 2024

Merge pull request #1 from ilya-lavrenov/cpu-works

e3d65e0

Deterministic OpenVINO inference

daniel-geon-park added a commit to gmlwns2000/vllm-timber that referenced this pull request Apr 15, 2024

Merge pull request vllm-project#1 from DeepAuto-AI/geon-dev

d9d746e

merge code

afeldman-nm mentioned this pull request Apr 30, 2024

Adding support for encoder-decoder models, like T5 or BART #187

Closed

dlopes78 mentioned this pull request May 8, 2024

[Bug]: VLLM + tritonserver #4695

Closed

fmmoret mentioned this pull request May 8, 2024

[Bug]: Chunked prefill returning gibberish in some cases. #4697

Closed

Bellk17 added a commit to Bellk17/vllm that referenced this pull request May 10, 2024

Merge pull request vllm-project#1 from Bellk17/main

b36d574

Triton compilation fix

yuhuixu1993 mentioned this pull request Jun 2, 2024

[Bug]: loading squeezellm model #5190

Closed

afeldman-nm mentioned this pull request Jun 3, 2024

[Bug]: VLLM_ATTENTION_BACKEND set to ROCM_FLASH only in GHA environment, overriding automatic backend selection; this breaks other kernel unit tests. #5208

Closed

ykim362 referenced this pull request in ykim362/vllm Jun 17, 2024

Wenxh/fp8 on a100 v5 (#1)

aca4a33

Group Gemm Version

xiejibing mentioned this pull request Jun 24, 2024

[Bug]: vLLM 0.4.2 8xH100 init failed #5785

Closed

llmpros mentioned this pull request Jun 27, 2024

[Frontend]: Support base64 embedding #5935

Merged

Juelianqvq mentioned this pull request Jul 3, 2024

[Bug]: Flashinfer stuck with CUDA Graph #6086

Closed

oliver-li mentioned this pull request Jul 5, 2024

[Bug]: NCCL hangs and causes timeout #5484

Closed

This was referenced Jul 5, 2024

Support W4A8 quantization for vllm #5218

Merged

[Bug]: call for stack trace for "Watchdog caught collective operation timeout" #6042

Closed

fernandaspets mentioned this pull request Aug 8, 2025

[Bug]: --tensor-parallel-size 2 seems broken for Blackwell 6000 pro since version 10 #22479

Open

crischeng mentioned this pull request Aug 12, 2025

[Bug]: CUDA error during nsys profile : unspecified launch failure #22746

Closed

1 task

bbartels pushed a commit to bbartels/vllm that referenced this pull request Aug 14, 2025

Merge pull request vllm-project#1 from RichardoMrMu/feat-trace-v1-aft…

a7414f7

…ermerge feat:trace v1

JeffreyWong20 mentioned this pull request Aug 19, 2025

[Bug]: [TPU] profiling_tpu/profiling.py example crashed when runs on vllm_tpu docker #23194

Closed

1 task

ruisearch42 mentioned this pull request Aug 22, 2025

[Bug]: VLLM_ALL2ALL_BACKEND=naive hangs/crashes on multi nodes when serving DeepSeekV3 #23448

Open

1 task

Tar-ive mentioned this pull request Aug 24, 2025

feat: Add TPU v6e architecture-adaptive attention backend #23507

Open

16 tasks

shaamil101-etched mentioned this pull request Aug 25, 2025

[Bug]: vLLM server timeout due to multiprocessing communication error #23582

Open

1 task

ZJY0516 mentioned this pull request Aug 31, 2025

[Bug]: CUDA error when serving MiniCPM-V model #23954

Closed

wyn1015 mentioned this pull request Sep 19, 2025

[Bug]: assortment of warnings / errors coming out of vllm basic python inference script #18634

Open

1 task

LinWang-avivia mentioned this pull request Sep 24, 2025

[Bug]: Sequence Parallelism and Async TP disabled by default #25277

Open

4 tasks

zhanghb55 mentioned this pull request Sep 25, 2025

[Bug]: Pipeline parallel (pp>1) crashes with CUDA illegal memory access #25650

Open

1 task

This was referenced Oct 7, 2025

[Performance]: Use int over list[int] as output_tokens to reduce GC overhead #26369

Open

[Core] Bookkeeping optimization: Batchify updates 1D numpy arrays (e.g. num_tokens, num_tokens_no_spec) #25801

Open

tina0852 mentioned this pull request Oct 11, 2025

[Bug]: Since version 0.9.2 comes with nccl built-in, using PCIE causes sys errors. How to disable nccl in vllm for versions after 0.9.2? #26607

Open

1 task

Michel-debug mentioned this pull request Oct 23, 2025

[Bug]: qwen3-vl-2b after ms-swift fine-tuning lance errors #27405

Closed

1 task

Moondon69 mentioned this pull request Oct 23, 2025

[Bug]: vLLM crashes with SIGABRT on Intel Arc B-series (Battlemage) GPUs during model inspection #27408

Closed

1 task

Flink-ddd mentioned this pull request Oct 23, 2025

Fix(llm): Abort orphaned requests when llm.chat() batch fails Fixes #26081 #27420

Merged

whwangovo mentioned this pull request Oct 23, 2025

[Bug]: vLLM (TP=8) on 235B model triggers "CUDA error: unspecified launch failure" and persistent "ERR!" state in nvidia-smi #27430

Open

1 task

FragranceHUST mentioned this pull request Nov 5, 2025

[Bug]: EngineCore died unexpectedly When Inference llama(generate) #23517

Open

1 task

yitingw1 pushed a commit to yitingw1/vllm that referenced this pull request Nov 5, 2025

allocate DP allgather tensor in forward context (vllm-project#1)

abf7ec4

--------- Signed-off-by: Wuxun Zhang <[email protected]>

acodercat mentioned this pull request Nov 10, 2025

[Bugfix] Add strong reference to CUDA pluggable allocator callbacks #23477

Merged

4 tasks
