Closed

Changes from all commits
82 commits
8293182
wip
LucasWilkinson May 19, 2025
37c9bab
enable naive microbatching
LucasWilkinson May 19, 2025
df8f889
support MLA
LucasWilkinson May 20, 2025
f93bdd3
support more args in dp example
LucasWilkinson May 20, 2025
9ccfd09
fix dummy mode
LucasWilkinson May 20, 2025
020269c
added multhreading support
SageMoore May 20, 2025
ffb740a
manually manage stream
LucasWilkinson May 20, 2025
04f11d9
working but only on the same stream
LucasWilkinson May 21, 2025
2259b47
use vllm current_stream
LucasWilkinson May 21, 2025
9c60a62
tp1 working multistream tp > 1 broken
LucasWilkinson May 21, 2025
2a7f25f
fix hang
SageMoore May 21, 2025
a8439e2
dp working no yields
LucasWilkinson May 22, 2025
00f526f
seperate gpu wait
LucasWilkinson May 22, 2025
18bf91e
wip
LucasWilkinson May 23, 2025
2dc3b8b
wip
LucasWilkinson May 23, 2025
9edd082
debugging hang
LucasWilkinson May 23, 2025
952f3c5
tone down prints
LucasWilkinson May 23, 2025
e4419df
better debug utils
LucasWilkinson May 23, 2025
37bdf9f
better logging
LucasWilkinson May 23, 2025
020d9b0
fix dp=2 tp=2 hang
SageMoore May 26, 2025
2f39206
add comment
LucasWilkinson May 27, 2025
7b31e8a
wip seperate comm and compute threads
LucasWilkinson May 27, 2025
a743a35
fixes
LucasWilkinson May 27, 2025
f0b66d6
prints
LucasWilkinson May 27, 2025
5cc573e
misc fixes
SageMoore May 29, 2025
895a6c2
one a2a kernel per microbatch group
SageMoore May 30, 2025
5b0249b
various fixes
SageMoore May 30, 2025
62da375
more fixes
SageMoore May 30, 2025
252bf08
debugging
SageMoore May 31, 2025
0323e29
misc cleanups to prepare for rebase
SageMoore Jun 2, 2025
8f59252
misc cleanups to prepare for rebase
SageMoore Jun 2, 2025
90e46ee
misc cleanups to prepare for rebase
SageMoore Jun 2, 2025
065816d
misc cleanups to prepare for rebase
SageMoore Jun 2, 2025
6645882
comment prepare input
SageMoore Jun 2, 2025
d6eca0c
remove modular kernel
SageMoore Jun 2, 2025
21d9529
revert offline_inference/basic.py
SageMoore Jun 2, 2025
8ea80fc
revert offline_inference/basic.py
SageMoore Jun 2, 2025
92e0cc7
format
SageMoore Jun 2, 2025
44a595f
config format
SageMoore Jun 2, 2025
d4b502a
mla format
SageMoore Jun 2, 2025
8332924
dp format
SageMoore Jun 2, 2025
243eac5
forward context format
SageMoore Jun 2, 2025
d463976
pplx format
SageMoore Jun 2, 2025
e34e441
fa format
SageMoore Jun 2, 2025
919eef9
temporarily remove enable_microbatching
SageMoore Jun 2, 2025
2731e8c
temporarily remove enable_microbatching
SageMoore Jun 2, 2025
18e7d6c
Merge branch 'main' of https://github.com/neuralmagic/vllm into lwilk…
SageMoore Jun 3, 2025
539c0c3
first round of fixes
SageMoore Jun 3, 2025
5f4a501
more fixes
SageMoore Jun 3, 2025
e080e06
fix pplx a2a
SageMoore Jun 3, 2025
2e3484c
debugging
SageMoore Jun 3, 2025
f8848bb
misc fixes. lm_eval still gets a wrong answer but it no longer hangs
SageMoore Jun 4, 2025
8a75b3a
added support for ubatch padding. not working
SageMoore Jun 5, 2025
a8675b7
ubatch padding should work now
SageMoore Jun 5, 2025
a00dabc
more padding work. still gets the wrong answer
SageMoore Jun 6, 2025
05ddc34
misc padding fixes
SageMoore Jun 6, 2025
60499f6
padding is getting correctness but there are still some edgecases tri…
SageMoore Jun 7, 2025
e6e3407
fix ubatch padding to account for the case where the padding would re…
SageMoore Jun 8, 2025
642bf2d
Merge branch 'main' of https://github.com/neuralmagic/vllm into lwilk…
SageMoore Jun 8, 2025
ef3c01c
fix using the same buffer across ubatches
LucasWilkinson Jun 9, 2025
d682f5e
wip cudagraphs
SageMoore Jun 12, 2025
b74c731
more hacking
SageMoore Jun 12, 2025
1d112d9
misc changes
SageMoore Jun 17, 2025
0889f66
Merge branch 'main' of https://github.com/neuralmagic/vllm into lwilk…
SageMoore Jun 18, 2025
ff2dd13
more fixes
SageMoore Jun 18, 2025
a4def24
setup deepepll for ubatching
SageMoore Jun 24, 2025
930efd0
yields now work with deepep_ll
SageMoore Jun 24, 2025
96c0c4e
added initial code for cuda graph capturing ubatches
SageMoore Jun 24, 2025
97dbafa
fix correctness issue with full-cudagraphs + attn splitting
SageMoore Jun 24, 2025
144b148
initial full cudagraphs support. normal runs are working. ubatching d…
SageMoore Jun 25, 2025
44a2b34
add attention splitting to dummy runs
SageMoore Jun 25, 2025
e2ba707
factored out some of the context creation code along with misc commet…
SageMoore Jun 25, 2025
0e2b4bd
more refactoring
SageMoore Jun 25, 2025
54deb61
delete any notion of dummy_ubatch
SageMoore Jun 25, 2025
78228a6
refactor a bunch of misc parameters into a UbatchMetadata class
SageMoore Jun 26, 2025
af68574
reintegrate full cudagraphs
SageMoore Jun 26, 2025
4672c72
capture works replay does not
SageMoore Jun 28, 2025
d833982
random push
SageMoore Jun 30, 2025
88fbc04
misc
liewegas Jun 30, 2025
f6db576
Merge remote-tracking branch 'origin/main' into lwilkinson/attn-slicing
LucasWilkinson Jul 1, 2025
9ce5044
wip
LucasWilkinson Jul 1, 2025
b6b44d2
no cuda-graph for now
LucasWilkinson Jul 1, 2025
64 changes: 64 additions & 0 deletions examples/basic-ub.py
@@ -0,0 +1,64 @@
# SPDX-License-Identifier: Apache-2.0

import logging
import os

from vllm import LLM, SamplingParams

# Sample prompts.
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
# Configure logging level for vllm (optional, uses VLLM_LOGGING_LEVEL env var).
logging_level = os.getenv("VLLM_LOGGING_LEVEL", "").upper()
if logging_level:
    logging.basicConfig(level=getattr(logging, logging_level, logging.INFO))

# Create a sampling params object, optionally limiting output tokens via MAX_TOKENS env var.
param_kwargs = {"temperature": 0.8, "top_p": 0.95}
max_tokens_env = os.getenv("MAX_TOKENS")
if max_tokens_env is not None:
    try:
        param_kwargs["max_tokens"] = int(max_tokens_env)
    except ValueError:
        raise ValueError(f"Invalid MAX_TOKENS value: {max_tokens_env}")
sampling_params = SamplingParams(**param_kwargs)


def main():
    # Create an LLM.
    model = "deepseek-ai/DeepSeek-V2-Lite"
    # model = "facebook/opt-125m"
    llm = LLM(model=model,
              enforce_eager=True,
              compilation_config=2,
              ###############
              trust_remote_code=True,
              max_model_len=1024,
              #load_format="dummy",
              ###############
              #tensor_parallel_size=1,
              data_parallel_size=2,
              enable_expert_parallel=True,
              ###############
              #enable_microbatching=True,
              )
    # Generate texts from the prompts.
    # The output is a list of RequestOutput objects
    # that contain the prompt, generated text, and other information.
    outputs = llm.generate(prompts, sampling_params)
    # Print the outputs.
    print("\nGenerated Outputs:\n" + "-" * 60)
    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print(f"Prompt: {prompt!r}")
        print(f"Output: {generated_text!r}")
        print("-" * 60)


if __name__ == "__main__":
    main()
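The new example reads two optional environment variables (VLLM_LOGGING_LEVEL and MAX_TOKENS). As a usage sketch only, assuming it is launched from the repository root with the model weights already available, it could be driven like this; the values below are illustrative, not part of the PR:

# Usage sketch for examples/basic-ub.py: set the optional environment
# variables the script reads, then launch it in a subprocess.
import os
import subprocess
import sys

env = dict(os.environ, VLLM_LOGGING_LEVEL="DEBUG", MAX_TOKENS="32")
subprocess.run([sys.executable, "examples/basic-ub.py"], env=env, check=True)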
151 changes: 64 additions & 87 deletions examples/offline_inference/data_parallel.py
@@ -32,66 +32,43 @@
import os
from time import sleep

from vllm import LLM, SamplingParams
from vllm.utils import get_open_port
from vllm import LLM, EngineArgs, SamplingParams
from vllm.utils import FlexibleArgumentParser, get_open_port


def parse_args():
    import argparse

    parser = argparse.ArgumentParser(description="Data Parallel Inference")
    parser.add_argument(
        "--model",
        type=str,
        default="ibm-research/PowerMoE-3b",
        help="Model name or path",
    )
    parser.add_argument("--dp-size", type=int, default=2, help="Data parallel size")
    parser.add_argument("--tp-size", type=int, default=2, help="Tensor parallel size")
    parser.add_argument(
        "--node-size", type=int, default=1, help="Total number of nodes"
    )
    parser.add_argument(
        "--node-rank", type=int, default=0, help="Rank of the current node"
    )
    parser.add_argument(
        "--master-addr", type=str, default="", help="Master node IP address"
    )
    parser.add_argument("--master-port", type=int, default=0, help="Master node port")
    parser.add_argument(
        "--enforce-eager", action="store_true", help="Enforce eager mode execution."
    )
    parser.add_argument(
        "--trust-remote-code", action="store_true", help="Trust remote code."
    )
    parser.add_argument(
        "--max-num-seqs",
        type=int,
        default=64,
        help=("Maximum number of sequences to be processed in a single iteration."),
    )
    parser.add_argument(
        "--gpu-memory-utilization",
        type=float,
        default=0.8,
        help=("Fraction of GPU memory vLLM is allowed to allocate (0.0, 1.0]."),
    )
    parser = FlexibleArgumentParser()
    EngineArgs.add_cli_args(parser)
    parser.set_defaults(model="ibm-research/PowerMoE-3b")
    parser.add_argument("--dp-size",
                        type=int,
                        default=2,
                        help="Data parallel size")
    parser.add_argument("--tp-size",
                        type=int,
                        default=2,
                        help="Tensor parallel size")
    parser.add_argument("--node-size",
                        type=int,
                        default=1,
                        help="Total number of nodes")
    parser.add_argument("--node-rank",
                        type=int,
                        default=0,
                        help="Rank of the current node")
    parser.add_argument("--master-addr",
                        type=str,
                        default="",
                        help="Master node IP address")
    parser.add_argument("--master-port",
                        type=int,
                        default=0,
                        help="Master node port")
    return parser.parse_args()


def main(
    model,
    dp_size,
    local_dp_rank,
    global_dp_rank,
    dp_master_ip,
    dp_master_port,
    GPUs_per_dp_rank,
    enforce_eager,
    trust_remote_code,
    max_num_seqs,
    gpu_memory_utilization,
):
def main(args, dp_size, local_dp_rank, global_dp_rank, dp_master_ip,
         dp_master_port, GPUs_per_dp_rank):
    os.environ["VLLM_DP_RANK"] = str(global_dp_rank)
    os.environ["VLLM_DP_RANK_LOCAL"] = str(local_dp_rank)
    os.environ["VLLM_DP_SIZE"] = str(dp_size)
@@ -107,7 +84,11 @@ def main(
"The president of the United States is",
"The capital of France is",
"The future of AI is",
] * 100
] * 10
# import random
# import string
# prompts = [''.join(random.choices(string.ascii_letters, k=128)) for _ in range(2048)]


# with DP, each rank should process different prompts.
# usually all the DP ranks process a full dataset,
@@ -131,18 +112,18 @@ def start(rank):
    # sampling params. here we set different max_tokens for different
    # ranks for demonstration.
    sampling_params = SamplingParams(
        temperature=0.8, top_p=0.95, max_tokens=[16, 20][global_dp_rank % 2]
        temperature=0.8, top_p=0.95, max_tokens=[20, 16][global_dp_rank % 2]
    )

    # Fixed params
    args.pop("tensor_parallel_size")
    args.pop("enable_expert_parallel")

    # Create an LLM.
    llm = LLM(
        model=model,
        tensor_parallel_size=GPUs_per_dp_rank,
        enforce_eager=enforce_eager,
        enable_expert_parallel=True,
        trust_remote_code=trust_remote_code,
        max_num_seqs=max_num_seqs,
        gpu_memory_utilization=gpu_memory_utilization,
        **args,
    )
    outputs = llm.generate(prompts, sampling_params)
    # Print the outputs.
@@ -162,19 +143,22 @@ def start(rank):


if __name__ == "__main__":
    args = parse_args()

    dp_size = args.dp_size
    tp_size = args.tp_size
    node_size = args.node_size
    node_rank = args.node_rank
    args = vars(parse_args())

    dp_size = args.pop("dp_size")
    tp_size = args.pop("tp_size")
    node_size = args.pop("node_size")
    node_rank = args.pop("node_rank")

    if node_size == 1:
        dp_master_ip = "127.0.0.1"
        dp_master_port = get_open_port()
        args.pop("master_addr")
        args.pop("master_port")
    else:
        dp_master_ip = args.master_addr
        dp_master_port = args.master_port
        dp_master_ip = args.pop("master_addr")
        dp_master_port = args.pop("master_port")

    assert dp_size % node_size == 0, "dp_size should be divisible by node_size"
    dp_per_node = dp_size // node_size
@@ -183,29 +167,22 @@ def start(rank):

    procs = []
    for local_dp_rank, global_dp_rank in enumerate(
        range(node_rank * dp_per_node, (node_rank + 1) * dp_per_node)
    ):
        proc = Process(
            target=main,
            args=(
                args.model,
                dp_size,
                local_dp_rank,
                global_dp_rank,
                dp_master_ip,
                dp_master_port,
                tp_size,
                args.enforce_eager,
                args.trust_remote_code,
                args.max_num_seqs,
                args.gpu_memory_utilization,
            ),
        )
            range(node_rank * dp_per_node, (node_rank + 1) * dp_per_node)):
        proc = Process(target=main,
                       args=(
                           args,
                           dp_size,
                           local_dp_rank,
                           global_dp_rank,
                           dp_master_ip,
                           dp_master_port,
                           tp_size,
                       ))
        proc.start()
        procs.append(proc)
    exit_code = 0
    for proc in procs:
        proc.join(timeout=300)
        proc.join(timeout=1200)
        if proc.exitcode is None:
            print(f"Killing process {proc.pid} that didn't stop within 5 minutes.")
            proc.kill()
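For readers following the data_parallel.py refactor above, here is a minimal, self-contained sketch of the pattern it switches to: engine flags are registered by EngineArgs.add_cli_args on a FlexibleArgumentParser, script-only flags are popped out of the parsed-args dict, and whatever remains is forwarded to LLM as keyword arguments. The run() helper and the prompt below are illustrative additions, not part of the PR:

# Sketch of the argument-forwarding pattern used by the refactored script.
from vllm import LLM, EngineArgs
from vllm.utils import FlexibleArgumentParser


def parse_args():
    parser = FlexibleArgumentParser()
    EngineArgs.add_cli_args(parser)  # registers every engine flag (--model, ...)
    parser.set_defaults(model="ibm-research/PowerMoE-3b")
    # Script-only flag; it must be popped before the rest is handed to LLM().
    parser.add_argument("--dp-size", type=int, default=2,
                        help="Data parallel size")
    return parser.parse_args()


def run(engine_kwargs: dict):
    # Everything left in the dict is a genuine engine argument.
    llm = LLM(**engine_kwargs)
    outputs = llm.generate(["The capital of France is"])
    print(outputs[0].outputs[0].text)


if __name__ == "__main__":
    args = vars(parse_args())
    dp_size = args.pop("dp_size")  # consumed by the launcher, not the engine
    run(args)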
1 change: 1 addition & 0 deletions vllm/compilation/cuda_piecewise_backend.py
@@ -106,6 +106,7 @@ def check_for_ending_compilation(self):
            end_monitoring_torch_compile(self.vllm_config)

    def __call__(self, *args) -> Any:
        # logger.info("CUDA BACKEND CALL")
        if not self.first_run_finished:
            self.first_run_finished = True
            self.check_for_ending_compilation()
2 changes: 2 additions & 0 deletions vllm/compilation/decorators.py
@@ -157,6 +157,7 @@ def __init__(self, *, vllm_config: VllmConfig, prefix: str = '', **kwargs):
            vllm_config.compilation_config.level in [
                CompilationLevel.NO_COMPILATION, CompilationLevel.DYNAMO_AS_IS
            ] or not supports_dynamo()
        self.do_not_compile = True
        if self.do_not_compile:
            return
        compilation_counter.num_models_seen += 1
@@ -170,6 +171,7 @@ def __call__(self, *args, **kwargs):
        # e.g. TPU has the compilation logic in model runner, so we don't
        # need to compile the model inside.
        if self.do_not_compile or torch.compiler.is_compiling():
            # logger.info("SKIPPING COMPILATION")
            return self.forward(*args, **kwargs)

        # the first compilation needs to have dynamic shapes marked
26 changes: 26 additions & 0 deletions vllm/config.py
@@ -1816,6 +1816,18 @@ class ParallelConfig:
    disable_custom_all_reduce: bool = False
    """Disable the custom all-reduce kernel and fall back to NCCL."""

    enable_microbatching: bool = False
    """Enable microbatching for the model executor."""

    always_microbatch_if_enabled: bool = True
    """Always microbatch if microbatching is enabled. Easier to sync between
    DP workers."""

    microbatching_token_threshold: int = 4
    """The threshold for microbatching. If the number of tokens in the
    request is greater than this threshold, microbatching will be used.
    Otherwise, the request will be processed in a single batch."""

    tokenizer_pool_config: Optional[TokenizerPoolConfig] = None
    """This parameter is deprecated and will be removed in a future release.
    Please remove it from your configs"""
@@ -4564,6 +4576,20 @@ def __post_init__(self):
"cascade attention. Disabling cascade attention.")
self.model_config.disable_cascade_attn = True

if self.parallel_config.enable_microbatching and \
self.compilation_config.level >= CompilationLevel.PIECEWISE:
# Microbatching is not supported with piecewise compilation yet.
# More specifically piecewise cuda-graphs
logger.warning_once(
"Piecewise compilation is not supported with "
"microbatching. Disabling piecewiseching compilation.")
self.compilation_config.level = CompilationLevel.NO_COMPILATION
if not self.model_config.enforce_eager:
self.compilation_config.full_cuda_graph = True
logger.warning_once(
"Enabling fullcudagraphs for microbatching"
)

disable_chunked_prefill_reasons: list[str] = []

if self.model_config and self.model_config.pooler_config:
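To make the new ParallelConfig knobs above concrete, here is a hypothetical sketch of setting them programmatically. How they are ultimately exposed is still in flux in this PR (the basic-ub.py example keeps enable_microbatching commented out, and one commit "temporarily remove[s] enable_microbatching"), so the construction below is an assumption for illustration, not the PR's supported interface:

# Hypothetical: constructing a ParallelConfig with the fields added above.
from vllm.config import ParallelConfig

parallel_config = ParallelConfig(
    data_parallel_size=2,
    enable_expert_parallel=True,
    # New fields from this PR:
    enable_microbatching=True,           # split each batch into microbatches
    always_microbatch_if_enabled=True,   # easier to sync across DP workers
    microbatching_token_threshold=4,     # only microbatch above this many tokens
)
print(parallel_config.enable_microbatching)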