Merged
67 commits
dd0c893
Align dev/vllm-align with upstream vLLM changes
tzhouam Mar 10, 2026
3630bdf
Update Dockerfile.ci to install vLLM from a new commit and simplify w…
tzhouam Mar 10, 2026
e150a1b
debug: empty commit to test CI pipeline (vLLM 156e33553ccd)
tzhouam Mar 10, 2026
7706132
debug: empty commit to test CI pipeline (vLLM 156e33553ccd)
tzhouam Mar 10, 2026
cfbdb57
debug: empty commit to test CI pipeline (vLLM 156e33553ccd)
tzhouam Mar 10, 2026
8ac369c
debug: empty commit to test CI pipeline (vLLM 156e33553ccd)
tzhouam Mar 10, 2026
cf0a7a5
Update Dockerfile.ci to install additional dependencies for flashinfe…
tzhouam Mar 10, 2026
9dc5d15
debug: empty commit to test CI pipeline (vLLM ddbb0d230a35)
tzhouam Mar 10, 2026
5c845ce
debug: empty commit to test CI pipeline (vLLM ddbb0d230a35)
tzhouam Mar 10, 2026
4014e76
debug: empty commit to test CI pipeline (vLLM ddbb0d230a35)
tzhouam Mar 10, 2026
dd0ec55
debug: empty commit to test CI pipeline (vLLM ddbb0d230a35)
tzhouam Mar 10, 2026
b0cf788
debug: empty commit to test CI pipeline (vLLM ddbb0d230a35)
tzhouam Mar 10, 2026
cee2b4b
debug: empty commit to test CI pipeline (vLLM ddbb0d230a35)
tzhouam Mar 10, 2026
1c0a71e
Refactor Dockerfile.ci to consolidate package installation commands i…
tzhouam Mar 10, 2026
a9862f1
Revert "Refactor Dockerfile.ci to consolidate package installation co…
tzhouam Mar 10, 2026
20c102d
debug: empty commit to test CI pipeline (vLLM ddbb0d230a35)
tzhouam Mar 10, 2026
ad12921
Merge remote-tracking branch 'origin/main' into dev/vllm-align
tzhouam Mar 11, 2026
91135f1
Merge remote-tracking branch 'origin/main' into dev/vllm-align
tzhouam Mar 11, 2026
e47c5dd
Merge remote-tracking branch 'origin/main' into dev/vllm-align
tzhouam Mar 11, 2026
08a2673
rebase: align vllm-omni with vLLM 84e436ed1c94
tzhouam Mar 11, 2026
11b6c16
rebase: align vllm-omni with vLLM 84e436ed1c94
tzhouam Mar 11, 2026
72cb4f6
Adjust GPU memory utilization and device settings in qwen2_5_omni_ci.…
tzhouam Mar 12, 2026
e17fb0e
Refactor OpenAI serving components and enhance multimodal handling
tzhouam Mar 13, 2026
d911475
Fix output queue handling in OmniStage to return None if queue is clo…
tzhouam Mar 13, 2026
62c8695
Update GPUGenerationModelRunner to clarify KV connector metadata hand…
tzhouam Mar 13, 2026
d3db476
Merge branch 'main' into dev/vllm-align
tzhouam Mar 13, 2026
97899c6
Merge remote-tracking branch 'origin/main' into dev/vllm-align
tzhouam Mar 13, 2026
f137a1e
Merge remote-tracking branch 'origin/main' into dev/vllm-align
tzhouam Mar 16, 2026
3a508d7
rebase: align vllm-omni with vLLM e9163b536e72
tzhouam Mar 16, 2026
aae101f
Refactor import statement for OpenAIServingEmbedding in api_server.py
tzhouam Mar 16, 2026
9c22e3a
Refactor import statement for OpenAIServingEmbedding in api_server.py…
tzhouam Mar 16, 2026
a26e568
Merge remote-tracking branch 'origin/main' into dev/vllm-align
tzhouam Mar 16, 2026
de6bcc7
Update Dockerfile.ci to use a new vllm wheel URL and refactor Mooncak…
tzhouam Mar 16, 2026
57ac7eb
rebase: align vllm-omni with vLLM 2754231ba3a7
tzhouam Mar 16, 2026
8e81e60
Refactor Dockerfile.ci to install vLLM from source using precompiled …
tzhouam Mar 16, 2026
22a2bd3
Upgrade flashinfer-cubin and flashinfer-jit-cache versions
tzhouam Mar 16, 2026
83b05ce
Refactor Dockerfile.ci to install vLLM using a precompiled wheel from…
tzhouam Mar 17, 2026
b0e36f9
Update Dockerfile.ci to force reinstall vLLM from a precompiled wheel…
tzhouam Mar 17, 2026
0eccb77
Merge remote-tracking branch 'origin/main' into dev/vllm-align
tzhouam Mar 17, 2026
ac937b8
Refactor Dockerfile to install vLLM from source
tzhouam Mar 17, 2026
722ad6b
rebase: align vllm-omni with vLLM 0a0a1a198be8
tzhouam Mar 17, 2026
b7a36d1
Merge branch 'dev/vllm-align' of https://github.com/vllm-project/vllm…
tzhouam Mar 17, 2026
83effe3
Update Dockerfile.ci to include an additional extra index URL for pip…
tzhouam Mar 17, 2026
35b3dd6
Refactor Dockerfile.ci to install vLLM from a dynamically retrieved p…
tzhouam Mar 17, 2026
cc4ade4
Refactor shutdown logic in OmniLLM and stage worker to ensure proper …
tzhouam Mar 17, 2026
004c60a
Refactor shutdown and cleanup logic in OmniRunner and OmniStage to im…
tzhouam Mar 17, 2026
8a31e78
Increase timeout for Diffusion Model Test and Diffusion Images API Lo…
tzhouam Mar 17, 2026
7a68bd3
Refactor cu_seqlens handling in Qwen3Omni_VisionTransformer to align …
tzhouam Mar 18, 2026
dc6f2fb
Enhance cu_seqlens computation in Qwen3Omni_VisionTransformer by swit…
tzhouam Mar 18, 2026
342a209
Merge remote-tracking branch 'origin/main' into dev/vllm-align
tzhouam Mar 19, 2026
4f75ccc
Merge remote-tracking branch 'origin/main' into dev/vllm-align
tzhouam Mar 19, 2026
034d579
Update vLLM precompiled wheel version and refactor various components…
tzhouam Mar 19, 2026
22b6703
Merge remote-tracking branch 'origin/main' into dev/vllm-align
tzhouam Mar 19, 2026
f83df01
Remove redundant assignment of rope_theta in Qwen3OmniMoeTalker initi…
tzhouam Mar 19, 2026
37e055e
Merge remote-tracking branch 'remotes/origin/main' into dev/vllm-align
tzhouam Mar 19, 2026
367ce63
Merge remote-tracking branch 'remotes/origin/main' into dev/vllm-align
tzhouam Mar 19, 2026
3c52432
Merge branch 'main' into dev/vllm-align
tzhouam Mar 20, 2026
0d23f68
revert changes in qwen 3 tts
tzhouam Mar 20, 2026
b8f736c
Remove obsolete image and YAML configuration files for Qwen2.5 and Qw…
tzhouam Mar 20, 2026
6ff31eb
Update rope parameters in Qwen3OmniMoeTalker to include rope_theta fo…
tzhouam Mar 20, 2026
66c31a3
Merge branch 'main' into dev/rebase-v0.18.0
tzhouam Mar 20, 2026
521f178
[Refactor] Remove unused parameters from AsyncOmni class constructor …
tzhouam Mar 20, 2026
9538e3f
[NPU] Upgrade to v0.18.0
gcanlin Mar 20, 2026
869a593
Merge branch 'main' into dev/rebase-v0.18.0
tzhouam Mar 20, 2026
826b3b3
Merge branch 'main' into dev/rebase-v0.18.0
tzhouam Mar 20, 2026
e805487
Merge branch 'main' into dev/rebase-v0.18.0
tzhouam Mar 21, 2026
c27f1c4
Merge branch 'main' into dev/rebase-v0.18.0
tzhouam Mar 21, 2026
4 changes: 2 additions & 2 deletions .buildkite/test-amd.yaml
@@ -1,7 +1,7 @@
steps:

- label: "Diffusion Model Test"
timeout_in_minutes: 20
timeout_in_minutes: 30
agent_pool: mi325_2
depends_on: amd-build
mirror_hardwares: [amdproduction]
@@ -11,7 +11,7 @@ steps:
- pytest -s -v tests/e2e/offline_inference/test_t2i_model.py

- label: "Diffusion Images API LoRA E2E"
timeout_in_minutes: 20
timeout_in_minutes: 30
agent_pool: mi325_1
depends_on: amd-build
mirror_hardwares: [amdproduction]
4 changes: 2 additions & 2 deletions .buildkite/test-merge.yml
@@ -17,7 +17,7 @@ steps:
- "/fsx/hf_cache:/fsx/hf_cache"

- label: "Diffusion Model Test"
timeout_in_minutes: 20
timeout_in_minutes: 30
Collaborator:

Why are we modifying these settings?

Collaborator (Author):

The running time is slightly longer than 20 minutes. I guess this is caused by the Docker image being much larger than the one on the main branch.

depends_on: upload-merge-pipeline
commands:
- pytest -s -v tests/e2e/offline_inference/test_t2i_model.py -m "advanced_model and diffusion" --run-level "advanced_model"
@@ -35,7 +35,7 @@ steps:
- "/fsx/hf_cache:/fsx/hf_cache"

- label: "Diffusion Images API LoRA E2E"
timeout_in_minutes: 20
timeout_in_minutes: 30
depends_on: upload-merge-pipeline
commands:
- pytest -s -v tests/e2e/online_serving/test_images_generations_lora.py
2 changes: 1 addition & 1 deletion .buildkite/test-ready.yml
@@ -36,7 +36,7 @@ steps:
- label: "Diffusion Model Test"
depends_on: upload-ready-pipeline
commands:
- timeout 20m pytest -s -v tests/e2e/offline_inference/test_t2i_model.py -m "core_model and diffusion" --run-level "core_model"
Collaborator:

The same question here.

Collaborator (Author):

Same timeout problem.

- timeout 30m pytest -s -v tests/e2e/offline_inference/test_t2i_model.py -m "core_model and diffusion" --run-level "core_model"
agents:
queue: "gpu_1_queue"
plugins:
25 changes: 22 additions & 3 deletions docker/Dockerfile.ci
@@ -11,10 +11,29 @@ RUN apt-get update && \
apt-get clean && \
rm -rf /var/lib/apt/lists/*

# Install vllm-omni into the same uv-managed Python environment used by the base image.
# Use bash -c so that $(python3 -c ...) is expanded inside the container.
RUN uv pip install --system --no-cache-dir ".[dev]"
RUN uv pip uninstall --system -y vllm || true

# Install vLLM from precompiled wheel at the selected commit.
# Must use direct URL because the wheel has a PEP 440 local version identifier
# (e.g. +g0a0a1a198) which pip/uv refuse to install from a PEP 503 package index.
ENV VLLM_PRECOMPILED_WHEEL_COMMIT=89138b21cc246ae944c741d5c399c148e2b770ab
RUN VLLM_WHEEL_URL=$(python3 -c "import urllib.request,re; \
html=urllib.request.urlopen('https://wheels.vllm.ai/${VLLM_PRECOMPILED_WHEEL_COMMIT}/vllm/').read().decode(); \
m=re.search(r'>(\S+x86_64\.whl)<',html); \
print('https://wheels.vllm.ai/${VLLM_PRECOMPILED_WHEEL_COMMIT}/'+m.group(1).replace('+','%2B'))") && \
echo "Installing vLLM from: ${VLLM_WHEEL_URL}" && \
uv pip install --system --force-reinstall "${VLLM_WHEEL_URL}"

RUN uv pip install --system ".[dev]"

RUN uv pip install --system --upgrade \
"flashinfer-cubin==0.6.6" \
"nvidia-cublas-cu12==12.9.1.4" \
Member:

Do we still need the cuBLAS version upgrade?

"numpy==2.2.6"

RUN uv pip install --system --upgrade \
"flashinfer-jit-cache==0.6.6" \
--index-url https://flashinfer.ai/whl/cu129
RUN ln -sf /usr/bin/python3 /usr/bin/python

ENTRYPOINT []
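
The wheel lookup in the `Dockerfile.ci` hunk above packs its logic into a single `python3 -c` call. As a rough sketch only, the same steps (find the first `x86_64` wheel link in an index page, then percent-encode the `+` of the local version) might look like the following. The function name `resolve_wheel_url`, the sample HTML, and the `example.invalid` base URL are hypothetical and not part of this PR; the explicit check for a missing match is an addition that the Dockerfile one-liner does not have.

```python
import re


def resolve_wheel_url(index_html: str, base_url: str) -> str:
    """Build a direct URL for the first x86_64 wheel linked in an index page."""
    # Same pattern as the Dockerfile one-liner: link text ending in x86_64.whl.
    match = re.search(r">(\S+x86_64\.whl)<", index_html)
    if match is None:
        # Assumption: fail loudly instead of raising AttributeError on None.
        raise ValueError("no x86_64 wheel found in index page")
    # '+' starts the PEP 440 local version segment; percent-encode it for the URL.
    filename = match.group(1).replace("+", "%2B")
    return f"{base_url.rstrip('/')}/{filename}"


# Hypothetical index page content and base URL, for illustration only.
SAMPLE_HTML = '<a href="x">vllm-0.0.1+gabc1234-cp38-abi3-manylinux1_x86_64.whl</a>'
url = resolve_wheel_url(SAMPLE_HTML, "https://example.invalid/abc1234")
print(url)
```

Keeping the lookup in a pure function like this makes it testable without network access; the Dockerfile version fetches the index page with `urllib.request.urlopen` before applying the same regex.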
6 changes: 3 additions & 3 deletions docs/getting_started/installation/npu/npu.inc.md
@@ -68,18 +68,18 @@ We are keeping [issue #886](https://github.com/vllm-project/vllm-omni/issues/886
You can also build vLLM-Omni from the latest main branch if you want to use the latest features or bug fixes. (But sometimes it will break for a while. You can check [issue #886](https://github.com/vllm-project/vllm-omni/issues/886) for the status of the latest commit of vLLM-Omni main branch on NPU.)

```bash
# Pin vLLM version to 0.17.0
# Pin vLLM version to 0.18.0
cd /vllm-workspace/vllm
git pull origin main
git fetch origin --tags
git checkout v0.17.0
git checkout v0.18.0
VLLM_TARGET_DEVICE=empty pip install -v -e .

# Because vllm-ascend has not yet entered continuous development and has not been officially released, we need to pin it to a specific commit. Please note that this commit may change over time.
cd /vllm-workspace/vllm-ascend
git pull origin main
git fetch origin --tags
git checkout v0.17.0
git checkout 1e05c4908f31737bc4eef865a9f351d030a77c9d
pip install -v -e .

# Install vLLM-Omni from the latest main branch
27 changes: 22 additions & 5 deletions tests/conftest.py
@@ -1910,6 +1910,7 @@ def generate_multimodal(
def _cleanup_process(self):
try:
keywords = ["enginecore"]
matched = []

for proc in psutil.process_iter(["pid", "name", "cmdline", "username"]):
try:
@@ -1922,16 +1923,32 @@ def _cleanup_process(self):

if is_process:
print(f"Found vllm process: PID={proc.pid}, cmd={cmdline[:100]}")
matched.append(proc)
except (psutil.NoSuchProcess, psutil.AccessDenied):
pass

try:
proc.terminate()
time.sleep(2)
except Exception:
proc.kill()
for proc in matched:
try:
proc.terminate()
except (psutil.NoSuchProcess, psutil.AccessDenied):
pass

_, still_alive = psutil.wait_procs(matched, timeout=5)
for proc in still_alive:
try:
proc.kill()
except (psutil.NoSuchProcess, psutil.AccessDenied):
pass

if still_alive:
_, stubborn = psutil.wait_procs(still_alive, timeout=3)
if stubborn:
print(f"Warning: failed to kill residual vllm pids: {[p.pid for p in stubborn]}")
else:
print(f"Force-killed residual vllm pids: {[p.pid for p in still_alive]}")
elif matched:
print(f"Terminated vllm pids: {[p.pid for p in matched]}")

except Exception as e:
print(f"Error in psutil vllm cleanup: {e}")
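
The new cleanup logic above terminates every matched process first, waits, and only then kills the survivors. A minimal sketch of that escalation pattern follows, using the standard library's `subprocess` instead of `psutil` (an assumption made to keep the example self-contained). The helper name `stop_processes` is hypothetical; its timeout defaults mirror the 5 s and 3 s waits in the hunk.

```python
import subprocess
import sys
import time


def stop_processes(procs, term_timeout=5.0, kill_timeout=3.0):
    """Terminate all processes, then kill any that outlive the grace period.

    Returns a (terminated, killed) pair of lists.
    """
    # Phase 1: ask every live process to exit before waiting on any of them.
    for proc in procs:
        if proc.poll() is None:
            proc.terminate()

    # Phase 2: wait against one shared deadline so the grace period does not
    # grow with the number of processes.
    deadline = time.monotonic() + term_timeout
    terminated, killed = [], []
    for proc in procs:
        remaining = max(0.0, deadline - time.monotonic())
        try:
            proc.wait(timeout=remaining)
            terminated.append(proc)
        except subprocess.TimeoutExpired:
            # Phase 3: force-kill the stragglers and reap them.
            proc.kill()
            proc.wait(timeout=kill_timeout)
            killed.append(proc)
    return terminated, killed


# Two throwaway child processes that would otherwise sleep for a minute.
children = [
    subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
    for _ in range(2)
]
terminated, killed = stop_processes(children)
print(f"terminated={len(terminated)} killed={len(killed)}")
```

Sending the terminate signal to all processes before waiting on any of them bounds the total grace period, which is the same reason the hunk collects processes into `matched` before calling `psutil.wait_procs`.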

7 changes: 6 additions & 1 deletion tests/e2e/offline_inference/test_diffusion_lora.py
@@ -24,6 +24,7 @@
# This test is specific to Z-Image LoRA behavior. Keep it focused on a single
# model to reduce runtime and avoid extra downloads.
models = ["Tongyi-MAI/Z-Image-Turbo"]
DIFFUSION_INIT_TIMEOUT_S = 600


@pytest.mark.parametrize("model_name", models)
@@ -76,7 +77,11 @@ def _write_zimage_lora(adapter_dir: Path) -> str:
)
return str(adapter_dir)

m = Omni(model=model_name)
m = Omni(
model=model_name,
stage_init_timeout=DIFFUSION_INIT_TIMEOUT_S,
init_timeout=DIFFUSION_INIT_TIMEOUT_S,
)
try:
# high resolution may cause OOM on L4
height = 256
13 changes: 12 additions & 1 deletion tests/e2e/online_serving/test_images_generations_lora.py
@@ -28,6 +28,7 @@
os.environ["VLLM_WORKER_MULTIPROC_METHOD"] = "spawn"

MODEL = "Tongyi-MAI/Z-Image-Turbo"
DIFFUSION_INIT_TIMEOUT_S = 600


PROMPT = "a photo of a cat sitting on a laptop keyboard"
@@ -37,7 +38,17 @@

@pytest.fixture(scope="module")
def omni_server():
with OmniServer(MODEL, ["--num-gpus", "1"]) as server:
with OmniServer(
MODEL,
[
"--num-gpus",
"1",
"--stage-init-timeout",
str(DIFFUSION_INIT_TIMEOUT_S),
"--init-timeout",
str(DIFFUSION_INIT_TIMEOUT_S),
],
) as server:
yield server


4 changes: 2 additions & 2 deletions tests/e2e/stage_configs/qwen2_5_omni_ci.yaml
@@ -45,7 +45,7 @@ stage_args:
max_model_len: 16384
max_num_batched_tokens: 16384
max_num_seqs: 1
gpu_memory_utilization: 0.9
gpu_memory_utilization: 0.4
skip_mm_profiling: true
enforce_eager: true
trust_remote_code: true
@@ -72,7 +72,7 @@ stage_args:
model_arch: Qwen2_5OmniForConditionalGeneration
worker_type: generation
scheduler_cls: vllm_omni.core.sched.omni_generation_scheduler.OmniGenerationScheduler
gpu_memory_utilization: 0.9 #increase the gpu memory utilization to enable the test on H800
gpu_memory_utilization: 0.5 #increase the gpu memory utilization to enable the test on H800
enforce_eager: true
trust_remote_code: true
enable_prefix_caching: false
Expand Down
Original file line number Diff line number Diff line change
@@ -123,6 +123,7 @@ def make_mock_model(hidden: int = 8):
cfg.video_token_index = VIDEO_TOKEN_ID
cfg.audio_token_index = AUDIO_TOKEN_ID
model.config = cfg
model._has_oov_mm_tokens = False

def fake_lm_embed(ids: torch.Tensor) -> torch.Tensor:
# Use .clone() so the tensor is contiguous (expand() creates a strided
@@ -137,13 +138,12 @@ def fake_lm_embed(ids: torch.Tensor) -> torch.Tensor:

model._embed_text_input_ids = lambda *a, **kw: SupportsMultiModal._embed_text_input_ids(model, *a, **kw)

def fake_super_embed(ids, mm_embs=None, *, is_multimodal=None, handle_oov_mm_token=False):
def fake_super_embed(ids, mm_embs=None, *, is_multimodal=None):
return SupportsMultiModal.embed_input_ids(
model,
ids,
mm_embs,
is_multimodal=is_multimodal,
handle_oov_mm_token=handle_oov_mm_token,
)

model.embed_input_ids = lambda *a, **kw: Qwen2_5OmniThinkerForConditionalGeneration.embed_input_ids(model, *a, **kw)
1 change: 1 addition & 0 deletions vllm_omni/core/sched/omni_ar_scheduler.py
@@ -364,6 +364,7 @@ def update_from_output(
if stopped_preempted_reqs:
# This is a rare case and unlikely to impact performance.
self.waiting.remove_requests(stopped_preempted_reqs)
self.skipped_waiting.remove_requests(stopped_preempted_reqs)

# [Main] Handle failed KV load requests
if failed_kv_load_req_ids and not self.recompute_kv_load_failures:
15 changes: 11 additions & 4 deletions vllm_omni/core/sched/omni_generation_scheduler.py
@@ -247,6 +247,15 @@ def schedule(self) -> SchedulerOutput:
)

total_num_scheduled_tokens = sum(num_scheduled_tokens.values())

# Record the request ids scheduled in this step (v0.14.0 behavior).
self.prev_step_scheduled_req_ids.clear()
self.prev_step_scheduled_req_ids.update(num_scheduled_tokens.keys())

new_block_ids_to_zero = (
(self.kv_cache_manager.take_new_block_ids() or None) if self.needs_kv_cache_zeroing else None
)

scheduler_output = SchedulerOutput(
scheduled_new_reqs=new_reqs_data,
scheduled_cached_reqs=cached_reqs_data,
@@ -258,12 +267,9 @@ def schedule(self) -> SchedulerOutput:
finished_req_ids=self.finished_req_ids,
free_encoder_mm_hashes=self.encoder_cache_manager.get_freed_mm_hashes(),
preempted_req_ids=set(),
new_block_ids_to_zero=new_block_ids_to_zero,
)

# Record the request ids scheduled in this step (v0.14.0 behavior).
self.prev_step_scheduled_req_ids.clear()
self.prev_step_scheduled_req_ids.update(num_scheduled_tokens.keys())

# KVTransfer: package metadata
if self.connector is not None:
meta = self.connector.build_connector_meta(scheduler_output)
@@ -496,6 +502,7 @@ def update_from_output(
if stopped_preempted_reqs:
# This is a rare case and unlikely to impact performance.
self.waiting.remove_requests(stopped_preempted_reqs)
self.skipped_waiting.remove_requests(stopped_preempted_reqs)

# Handle failed KV load requests
if failed_kv_load_req_ids and not self.recompute_kv_load_failures:
11 changes: 6 additions & 5 deletions vllm_omni/diffusion/quantization/base.py
@@ -31,14 +31,15 @@ class DiffusionQuantizationConfig(ABC):
# The underlying vLLM config instance
_vllm_config: "QuantizationConfig | None" = None

def get_name(self) -> str:
@classmethod
def get_name(cls) -> str:
"""Return the quantization method name (e.g., 'fp8', 'int8').

By default, delegates to the underlying vLLM config instance.
Delegates to the underlying vLLM config class's get_name().
"""
if self._vllm_config is not None:
return self._vllm_config.get_name()
raise NotImplementedError("Subclass must initialize _vllm_config or override get_name().")
if cls.quant_config_cls is not None:
return cls.quant_config_cls.get_name()
raise NotImplementedError("Subclass must set quant_config_cls or override get_name().")
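
The change above turns `get_name` into a classmethod that reads a class-level `quant_config_cls` instead of an instance-level `_vllm_config`. A minimal sketch of that pattern, with stand-in classes that are purely hypothetical (`UpstreamFp8Config`, `BaseDiffusionQuantConfig`, and `Fp8DiffusionConfig` are not names from this PR or from vLLM), might look like this:

```python
from abc import ABC


class UpstreamFp8Config:
    """Hypothetical stand-in for an upstream quantization config class."""

    @classmethod
    def get_name(cls) -> str:
        return "fp8"


class BaseDiffusionQuantConfig(ABC):
    # Subclasses point this at the upstream config class they wrap.
    quant_config_cls = None

    @classmethod
    def get_name(cls) -> str:
        """Return the method name by delegating to the wrapped config class."""
        if cls.quant_config_cls is not None:
            return cls.quant_config_cls.get_name()
        raise NotImplementedError("Subclass must set quant_config_cls or override get_name().")


class Fp8DiffusionConfig(BaseDiffusionQuantConfig):
    quant_config_cls = UpstreamFp8Config


# The name is available on the class itself, with no instance required.
name = Fp8DiffusionConfig.get_name()
print(name)
```

The practical difference from the instance-method version is that callers such as registries can look up the name before any config object has been constructed.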

def get_vllm_quant_config(self) -> "QuantizationConfig | None":
"""Return the underlying vLLM QuantizationConfig for linear layers."""
34 changes: 14 additions & 20 deletions vllm_omni/distributed/kv_transfer/monkey_patch.py
@@ -8,6 +8,7 @@

from __future__ import annotations

import importlib
import logging
import sys
from dataclasses import dataclass
@@ -30,30 +31,23 @@ class PatchedRecvReqMeta:

def _import_mooncake_module():
"""Import MooncakeConnector module, supporting both vLLM >=0.16 and older."""
try:
from vllm.distributed.kv_transfer.kv_connector.v1.mooncake import mooncake_connector

return mooncake_connector
except ImportError:
pass
try:
from vllm.distributed.kv_transfer.kv_connector.v1 import mooncake_connector

return mooncake_connector
except ImportError:
return None
for mod_path in (
"vllm.distributed.kv_transfer.kv_connector.v1.mooncake.mooncake_connector",
"vllm.distributed.kv_transfer.kv_connector.v1.mooncake_connector",
):
try:
return importlib.import_module(mod_path)
except (ImportError, ModuleNotFoundError):
continue
return None
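
The rewritten `_import_mooncake_module` tries a list of candidate module paths in order and returns the first one that imports. A small standalone sketch of the same fallback loop follows, using standard-library module names so it runs anywhere. The helper name `import_first_available` and the deliberately missing `no_such_pkg_xyz` package are hypothetical.

```python
import importlib


def import_first_available(candidates):
    """Return the first importable module from candidates, or None."""
    for mod_path in candidates:
        try:
            return importlib.import_module(mod_path)
        except ImportError:
            # ModuleNotFoundError is a subclass of ImportError, so catching
            # ImportError alone covers both cases.
            continue
    return None


# The first path does not exist, so the loop falls through to the second.
mod = import_first_available(("no_such_pkg_xyz.submodule", "json"))
print(mod.__name__ if mod is not None else "nothing importable")
```

Passing dotted paths as strings to `importlib.import_module` lets one loop replace a chain of nested `try`/`except` import blocks, which is the simplification this hunk makes.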


def _create_patched_mooncake_connector():
"""Return a subclass of MooncakeConnector with remote_request_id support."""
try:
from vllm.distributed.kv_transfer.kv_connector.v1.mooncake.mooncake_connector import (
MooncakeConnector as _OriginalMooncakeConnector,
)
except (ImportError, AttributeError):
from vllm.distributed.kv_transfer.kv_connector.v1.mooncake_connector import (
MooncakeConnector as _OriginalMooncakeConnector,
)
_mc_mod = _import_mooncake_module()
if _mc_mod is None:
raise ImportError("Cannot import MooncakeConnector from upstream vLLM")
_OriginalMooncakeConnector = _mc_mod.MooncakeConnector

class PatchedMooncakeConnector(_OriginalMooncakeConnector):
"""Fixes request-ID mismatch in PD disaggregation by injecting