[DisaggEverything] Tokens in<>out /generate endpoint
#24261
Conversation
This pull request has merge conflicts that must be resolved before it can be merged.
EDIT: Moved to #22817 (comment)
I see that if it's a different audience, it may be better suited to a different HTTP service scoped to that audience and purpose. I had similar feedback about an earlier version of HTTP metadata exchange for the Nixl connector, but the latest version seems to have moved it to its own HTTP service: #22274

If it is desired to keep this on the existing OpenAI API, I think it'd be nice if we used namespacing to make it clear which APIs are our own custom ones vs. our implementation of APIs defined by OpenAI. One option would be something like
We're still discussing with @smarterclayton the full spectrum of intended use cases.
I understand. Would you be in favor of a separate entrypoint altogether? My motivation for keeping things inside the OAI one was to enable easy access to the other endpoints, which are not exclusive, at least in this early stage.
It's probably fine to keep within the same API. It doesn't seem harmful to expose (like maybe internal infrastructure metadata exchange would be).
Fair point. I just think it'd be nice to make it clear where we're copying OpenAI vs. defining our own completely independent APIs. It could be
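As a generic illustration of the namespacing idea (the prefix below is a placeholder, not the naming this thread settled on), vLLM-specific routes could be grouped under their own router:

```python
# Illustrative only: grouping custom (non-OpenAI) routes under a dedicated
# prefix with FastAPI; the prefix is hypothetical, not the adopted naming.
from fastapi import APIRouter, FastAPI

app = FastAPI()
custom = APIRouter(prefix="/custom/v1")  # hypothetical namespace


@custom.post("/generate")
async def generate(payload: dict) -> dict:
    # A tokens-in/tokens-out handler would live here.
    return {"token_ids": []}


app.include_router(custom)
```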
(force-pushed 240f870 to 115b87f)
@russellb Changed naming to the one you suggested. Let me know if there's something else I should change in this PR in your view; looking to move this forward.
I'm looking forward to this feature! Question: will this endpoint propagate
This pull request has merge conflicts that must be resolved before it can be merged.
(force-pushed 115b87f to fdecd4f)
(force-pushed fdecd4f to 0a500ee)
mgoin left a comment
LGTM to start the structure, nice work. Just some cleanup nits
Future work: we should pull out these APIs into a separate folder, like in this refactor #28040
```python
token_ids = output.token_ids
out_logprobs = output.logprobs

# sampling_params.logprobs == req.top_logprobs
```
Cruft, or we should assert this?
It was more a way of noting that this is the same as in completions but under a different name, since logprobs is a bit overloaded; redacted.
This pull request has merge conflicts that must be resolved before it can be merged.
```python
class UtilityResult:
    """Wrapper for special handling when serializing/deserializing."""

    def __init__(self, r: Any = None):
        self.result = r
```
To avoid a circular import error; also, I believe this belongs in utils anyway.
Commits: msgspec+pydantic ser mixin, example script, tests, support lora, tokens-only cli arg enforcing tokens-only+abort endpoint, stop string tests, remove openai prefix from oaiservingtoken (Signed-off-by: NickLucche, Harry Mellor)
(force-pushed 934664e to 43617b6)
Thanks for the review @mgoin, addressed your comments.
mgoin left a comment
Great work!
```python
# EngineArgs (arg_utils.py): additions relevant to this comment thread.
# The surrounding unchanged CLI-argument and config-construction code shown
# in the expanded diff view is omitted here.
tokens_only: bool = False

...

# In create_engine_config():
if self.tokens_only and not model_config.skip_tokenizer_init:
    model_config.skip_tokenizer_init = True
    logger.info("Skipping tokenizer initialization for tokens-only mode.")
```
Why does `tokens_only` need to be an engine arg?
This is more of a UX change: the tokenization skip depends on model_config, but ideally in a disaggregated setup you want a more general toggle to just ensure/signal that you're deploying a tokens-in/out instance.
I was actually planning to leave this flag for toggling optimizations that are "disaggregated-everything specific".
Overview
First step in implementing the "Disaggregated Everything" proposal #22817.
This PR focuses on the tokens-in/tokens-out `/generate` component of that proposal. In particular, it introduces:
- `GenerateRequest`/`GenerateResponse` interface. NOTE: `SamplingParams` can now be validated and deserialized within a pydantic message (eg input-only). Check out `PydanticMsgspecMixin` (see the sketch after this list).
- `/generate` tokens-only endpoint, mirroring `/v1/chat/completions` for the most part.
- `--tokens-only` "modality" for starting up the server, mostly intended to simplify UX.
- `/abort_requests` endpoint, see below.
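As a rough sketch of the mixin idea (the classes below are illustrative, not the PR's actual `PydanticMsgspecMixin`; a toy struct stands in for vLLM's `SamplingParams`), the point is letting a msgspec-backed type be validated inside a pydantic request model:

```python
# Illustrative only: validating a msgspec Struct (standing in for
# vllm.SamplingParams) as a field of a pydantic request model.
import msgspec
from pydantic import BaseModel, ConfigDict, field_validator


class SamplingParams(msgspec.Struct):  # toy stand-in, not vLLM's class
    temperature: float = 1.0
    max_tokens: int | None = None


class GenerateRequest(BaseModel):
    model_config = ConfigDict(arbitrary_types_allowed=True)

    prompt_token_ids: list[int]
    sampling_params: SamplingParams = SamplingParams()

    @field_validator("sampling_params", mode="before")
    @classmethod
    def _coerce_sampling_params(cls, v: object) -> SamplingParams:
        # Accept either an existing struct or a plain JSON-style dict.
        return v if isinstance(v, SamplingParams) else msgspec.convert(v, SamplingParams)


req = GenerateRequest(prompt_token_ids=[1, 2, 3], sampling_params={"max_tokens": 8})
```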
Implementation Details
To get a "tokenizer-free" endpoint, one can already use `--skip_tokenizer_init` and/or the `detokenize: False` sampling option, forcing the use of the basic `IncrementalDetokenizer`.
In order to make UX easier for a Disaggregated Everything setup, a `--tokens-only` option is added, which enforces the two flags above. This way the Detokenizer is optional, as intended in the initial design.
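For intuition, here is a small offline analogue of that tokens-in/tokens-out mode using the existing `LLM` API (a sketch; the model name is arbitrary, and this is not the server path added by this PR):

```python
# Offline sketch of the same idea: token ids in, token ids out, with the
# tokenizer never initialized and detokenization disabled.
from vllm import LLM, SamplingParams, TokensPrompt

llm = LLM(model="Qwen/Qwen2.5-0.5B-Instruct", skip_tokenizer_init=True)
params = SamplingParams(max_tokens=16, detokenize=False)

outputs = llm.generate(TokensPrompt(prompt_token_ids=[1, 2, 3]), params)
print(outputs[0].outputs[0].token_ids)  # generated token ids, no text
```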
```
INFO 09-10 13:36:17 [arg_utils.py:1281] Skipping tokenizer initialization for tokens-only mode.
```
Furthermore, it enables the `/abort_requests` endpoint.
`/abort_requests` is a solution to the detection of stop strings, which is one of the main challenges in getting a real "tokenizer-free" endpoint. Currently this is done in the AsyncLLM output_handler_loop, followed by an IPC abort request back to the EngineCore, like so:
With this Disaggregated Everything setup, we task the "Coordinator" (to be implemented in a follow-up PR) with detokenization. Hence, the "generate" instance needs to act more as a "remote EngineCore". The workflow is the following:
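Purely as an illustrative sketch (every endpoint payload, field name, and the streaming wire format below are assumptions, not this PR's actual schema), a detokenizing coordinator might stream token ids, detokenize them locally, and call `/abort_requests` once a stop string shows up:

```python
# Illustrative coordinator-side logic only; payloads are assumed.
import json

import requests
from transformers import AutoTokenizer

GEN_URL = "http://localhost:8000"  # tokens-only "generate" instance
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
stop, request_id, text = "###", "req-0", ""

with requests.post(
    f"{GEN_URL}/generate",
    json={
        "request_id": request_id,                      # assumed field
        "prompt_token_ids": tokenizer.encode("Hello"),
        "sampling_params": {"max_tokens": 256, "detokenize": False},
        "stream": True,                                # assumed field
    },
    stream=True,
) as resp:
    for line in resp.iter_lines():
        if not line:
            continue
        chunk = json.loads(line)                       # assumed line-delimited JSON
        text += tokenizer.decode(chunk.get("token_ids", []))
        if stop in text:
            # Coordinator detected the stop string; tell the generate
            # instance to drop the in-flight request.
            requests.post(
                f"{GEN_URL}/abort_requests",
                json={"request_ids": [request_id]},    # assumed payload
            )
            break
```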
How to test
Run the example script added in this PR, or the new server tests, among others.
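As a rough illustration (the field names below are assumptions, not necessarily the exact request/response schema of this PR), a tokens-in/tokens-out call against the new endpoint might look like:

```python
# Hypothetical request against a server started with --tokens-only;
# field names are assumptions for illustration.
import requests

resp = requests.post(
    "http://localhost:8000/generate",
    json={
        "prompt_token_ids": [1, 15043, 3186],  # pre-tokenized prompt
        "sampling_params": {"max_tokens": 16, "temperature": 0.0},
    },
)
print(resp.json())  # expected to contain generated token ids, not text
```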
Follow up PRs:
- `MultiModalFeatureSpec` input, will add once the Renderer effort progresses.