
Conversation

@JustinTong0323 (Collaborator) commented Jun 20, 2025

Motivation

Modifications

This pull request streamlines the OpenAI serving layer by eliminating batch request handling from its core entrypoints. This significant refactoring aims to simplify the codebase, making it more maintainable and easier to understand, while also incorporating specific improvements to streaming responses and tool call management.

  • API Refinement: Refactored OpenAI serving entrypoints (for chat, completions, and embeddings) to exclusively handle single requests, removing previous support for batch processing (see the sketch after this list).
  • Code Simplification: Simplified internal request conversion logic (_convert_to_internal_request) across all serving entrypoints by removing batch-specific loops and parameters.
  • Streaming Improvements: Restructured streaming response generation in chat and completion serving into dedicated asynchronous generator methods for cleaner code, and improved tool call ID generation for streaming.
  • Test Updates: Updated unit tests to align with the new single-request model and adopted asyncio.run for proper execution of asynchronous tests.
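
As a rough illustration of the first two bullets, a minimal sketch of the single-request conversion (the prompt/params extraction is simplified stand-in logic; only the shape of the change is the point):

```python
# Hedged sketch of the single-request shape, not the exact sglang code.
# Before the refactor, _convert_to_internal_request accepted a list of
# requests and returned parallel lists; now it maps one OpenAI request
# to one internal request.
from sglang.srt.managers.io_struct import GenerateReqInput  # existing io_struct class


def _convert_to_internal_request(request):
    # `request` is an OpenAI-style ChatCompletionRequest; the extraction
    # below is illustrative, not the real template/params handling.
    return GenerateReqInput(
        text=str(request.messages),
        sampling_params={"temperature": request.temperature},
        stream=request.stream,
    )
```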

Checklist

… validation.

This commit refactors the OpenAI serving entrypoints to handle single requests instead of lists of requests.

This simplifies the code and makes it easier to understand and maintain.

Signed-off-by: Xinyuan Tong <[email protected]>
gemini-code-assist[bot] commented; the comment was marked as outdated.

@JustinTong0323 changed the title from "Refactors OpenAI serving entrypoint to remove batch requests" to "Refine OpenAI serving entrypoint to remove batch requests" on Jun 20, 2025
gemini-code-assist[bot] commented; the comment was marked as outdated.

Ensures that streaming responses including usage data are formatted to match OpenAI's API. Specifically, when `include_usage` is true, it first sends a chunk with `finish_reason` but no usage, followed by a chunk with usage but empty choices. This aligns with OpenAI's specification for streaming responses.

Signed-off-by: Xinyuan Tong <[email protected]>
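
In sketch form, with plain dicts standing in for the server's response models:

```python
# Illustrative only: the final chunks of a stream when include_usage is
# true, matching the ordering described above.
def final_chunks(finish_reason, usage_info, include_usage):
    # 1) Last choice-bearing chunk: finish_reason set, no usage field.
    yield {
        "object": "chat.completion.chunk",
        "choices": [{"index": 0, "delta": {}, "finish_reason": finish_reason}],
    }
    if include_usage:
        # 2) Trailing chunk: empty choices, usage populated.
        yield {
            "object": "chat.completion.chunk",
            "choices": [],
            "usage": usage_info,
        }
```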
Reorders the streaming chunks to first send a chunk with the finish reason, followed by a separate chunk with usage information when `include_usage` is specified. This change aligns the streaming behavior with the OpenAI API.

Signed-off-by: Xinyuan Tong <[email protected]>
Fixes an issue where tool call IDs were not handled correctly during streaming.

The tool call ID is now generated only once per tool call, and subsequent chunks use null ID and name for argument deltas. This ensures correct identification and handling of tool calls in streaming scenarios.

Signed-off-by: Xinyuan Tong <[email protected]>
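
A hedged sketch of the fixed behavior (the 24-hex-character ID matches the `uuid.uuid4().hex[:24]` scheme noted in the changelog below; the `call_` prefix and helper name are illustrative):

```python
import uuid

# The ID and function name appear only on the first chunk of a tool call;
# later chunks carry argument deltas with both fields set to None.
def make_tool_call_delta(index, is_first, name, args_fragment):
    return {
        "index": index,
        "id": f"call_{uuid.uuid4().hex[:24]}" if is_first else None,
        "function": {
            "name": name if is_first else None,
            "arguments": args_fragment,  # streamed incrementally
        },
    }
```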
JustinTong0323 and others added 7 commits June 20, 2025 06:17
Removes code branches that are no longer needed due to changes in the tokenizer manager's behavior.

Signed-off-by: Xinyuan Tong <[email protected]>
Signed-off-by: Xinyuan Tong <[email protected]>
Ensures the stream_buffer is always updated with the latest state.

This change addresses a potential issue where the stream buffer might not reflect the most recent updates, particularly when dealing with delta information in streaming responses. By consistently updating the stream buffer, the code maintains a correct and consistent state throughout the streaming process.

Signed-off-by: Xinyuan Tong <[email protected]>
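
The invariant, in a self-contained sketch (names assumed):

```python
def fold_deltas(incoming_deltas):
    # The buffer must reflect every delta seen so far before any parsing
    # (tool calls, reasoning) inspects it, so update it unconditionally.
    stream_buffer = ""
    for delta_text in incoming_deltas:
        stream_buffer += delta_text
        yield stream_buffer  # downstream parsers see the full text so far

# e.g. list(fold_deltas(["Hel", "lo"])) == ["Hel", "Hello"]
```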
Refactors the `_convert_to_internal_request` methods in the OpenAI serving classes to remove the `request_id` parameter.

The request ID is no longer needed during the conversion process, simplifying the method signatures and internal logic.

Signed-off-by: Xinyuan Tong <[email protected]>
Signed-off-by: Xinyuan Tong <[email protected]>
@JustinTong0323 (Collaborator, Author):

local test:
[screenshot of local test results]

JustinTong0323 and others added 4 commits June 20, 2025 06:55
Signed-off-by: Xinyuan Tong <[email protected]>
Addresses an issue in streaming chat completion where incorrect state was maintained when handling multiple concurrent requests.

The changes ensure that state variables like `is_first`, `stream_buffer`, `n_prev_token`, `prompt_tokens`, `completion_tokens`, and `cached_tokens` are properly tracked for each request by using dictionaries indexed by the request index.

Signed-off-by: Xinyuan Tong <[email protected]>
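
A condensed sketch of that fix (chunk shape assumed; the real code tracks more counters and emits response chunks):

```python
# Per-choice state keyed by index, so interleaved chunks from request.n > 1
# or concurrent choices never share bookkeeping.
async def stream_choices(generator):
    is_firsts: dict[int, bool] = {}
    stream_buffers: dict[int, str] = {}
    completion_tokens: dict[int, int] = {}
    async for content in generator:  # assumed: dicts with "index" and "delta"
        idx = content["index"]
        if is_firsts.get(idx, True):
            is_firsts[idx] = False
            # first chunk for this choice: emit the role-opening delta here
        stream_buffers[idx] = stream_buffers.get(idx, "") + content["delta"]
        completion_tokens[idx] = completion_tokens.get(idx, 0) + 1
```

The same per-index treatment applies to the completions endpoint, as noted later in this thread.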
gemini-code-assist[bot] commented; the comment was marked as resolved.

@JustinTong0323 (Collaborator, Author):

/gemini summary

@gemini-code-assist (Contributor):

Summary of Changes

This pull request streamlines the OpenAI serving layer by eliminating batch request handling from its core entrypoints. This significant refactoring aims to simplify the codebase, making it more maintainable and easier to understand, while also incorporating specific improvements to streaming responses and tool call management.

Highlights

  • API Refinement: Refactored OpenAI serving entrypoints (for chat, completions, and embeddings) to exclusively handle single requests, removing previous support for batch processing.
  • Code Simplification: Simplified internal request conversion logic (_convert_to_internal_request) across all serving entrypoints by removing batch-specific loops and parameters.
  • Streaming Improvements: Restructured streaming response generation in chat and completion serving into dedicated asynchronous generator methods for cleaner code, and improved tool call ID generation for streaming.
  • Test Updates: Updated unit tests to align with the new single-request model and adopted asyncio.run for proper execution of asynchronous tests.
Changelog
  • python/sglang/srt/code_completion_parser.py
    • Updated import from ChatCompletionRequest to CompletionRequest.
    • Changed generate_completion_prompt_from_request to accept CompletionRequest.
  • python/sglang/srt/entrypoints/openai/serving_base.py
    • Removed List from typing imports.
    • Modified _convert_to_internal_request signature to remove request_id parameter and return a single OpenAIServingRequest.
    • Updated call to _convert_to_internal_request in handle_request to reflect the single-request model.
  • python/sglang/srt/entrypoints/openai/serving_chat.py
    • Removed List from typing imports, added AsyncGenerator.
    • Removed the _validate_request method.
    • Refactored _convert_to_internal_request to process a single ChatCompletionRequest directly, removing batch-related loops and logic.
    • Extracted streaming response generation into a new _generate_chat_stream async generator.
    • Simplified _handle_non_streaming_request and _build_chat_response signatures by removing redundant parameters (cache_report, tool_call_parser, reasoning_parser).
    • Changed tool call ID generation to use uuid.uuid4().hex[:24].
    • Improved tool call ID and function name handling in _process_tool_call_stream for streaming.
  • python/sglang/srt/entrypoints/openai/serving_completions.py
    • Added logging and AsyncGenerator imports, removed Optional.
    • Removed the _validate_request method.
    • Refactored _convert_to_internal_request to handle a single CompletionRequest, removing batch processing logic.
    • Added a warning for echo and logprobs incompatibility.
    • Extracted streaming response generation into a new _generate_completion_stream async generator.
    • Simplified _handle_non_streaming_request and _build_completion_response signatures by removing redundant cache_report parameter.
    • Simplified echo and logprobs handling in _build_completion_response.
  • python/sglang/srt/entrypoints/openai/serving_embedding.py
    • Simplified _validate_request by removing validation for list of list inputs.
    • Modified _convert_to_internal_request signature to remove request_id parameter.
    • Added logic to treat single-element list string inputs as single strings in _convert_to_internal_request (see the sketch after this changelog).
  • test/srt/openai/test_serving_chat.py
    • Updated calls to _convert_to_internal_request to pass single request objects.
    • Commented out test_tool_call_request_conversion and test_tool_choice_none (pending re-enablement).
  • test/srt/openai/test_serving_completions.py
    • Updated calls to _convert_to_internal_request to pass single request objects.
  • test/srt/openai/test_serving_embedding.py
    • Updated calls to _convert_to_internal_request to pass single request objects.
    • Commented out rid assertions.
    • Wrapped async test methods in asyncio.run(run_test()).
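
For the embedding input normalization flagged in the changelog, a hedged sketch of the assumed logic:

```python
# Assumed logic: a one-element list containing a single string is unwrapped
# and treated as a plain string prompt; everything else passes through.
def normalize_embedding_input(input_data):
    if (
        isinstance(input_data, list)
        and len(input_data) == 1
        and isinstance(input_data[0], str)
    ):
        return input_data[0]
    return input_data
```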
Activity
  • The gemini-code-assist[bot] provided initial automated review comments on type hint updates and argument changes for _convert_to_internal_request across various serving files.
  • The gemini-code-assist[bot] also suggested a fix for stream_buffer not updating in serving_chat.py and a fix for tool call ID handling in streaming.
  • CatherineSue engaged in a discussion about removing the request_id parameter from _convert_to_internal_request (referencing issue [Bug] Inconsistent rid handling in OpenAI-Compatible Server #7374), which was subsequently removed by the author.
  • CatherineSue raised a concern about the change of streaming state variables (is_firsts, stream_buffers, n_prev_tokens) from dictionaries to single variables in serving_chat.py, noting it might break for request.n > 1. The author, JustinTong0323, acknowledged this as a valid point and plans to fix it.
  • CatherineSue also provided feedback on tool call ID generation for streaming, which the author addressed.
  • CatherineSue suggested improvements for serving_embedding.py regarding single-element list string inputs and removal of a validation method.
  • The gemini-code-assist[bot] later raised high-priority concerns that the rid (request ID) is not being correctly propagated and set for GenerateReqInput and EmbeddingReqInput objects in serving_chat.py, serving_completions.py, and serving_embedding.py.
  • The gemini-code-assist[bot] noted that tool call tests in test_serving_chat.py were commented out and should be re-enabled and updated.
  • JustinTong0323 posted a screenshot of local test results.
  • There are still unresolved comments regarding the request.n > 1 issue for streaming state variables and the rid propagation.

@CatherineSue (Collaborator):

@JustinTong0323 serving_completions.py might have the issue with index too.

Refactors the OpenAI completion streaming endpoint to correctly handle multiple streams. This change addresses an issue where concurrent streams were not being managed independently, leading to incorrect output. It introduces per-stream buffers and token counters to ensure accurate and isolated responses for each stream.

Signed-off-by: Xinyuan Tong <[email protected]>
Updates the DeltaMessage instantiation in the OpenAI serving chat to include an empty content string for the assistant role. This change ensures that the message structure is correctly formed for processing in the chat completion response stream.

Signed-off-by: Xinyuan Tong <[email protected]>
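
A minimal sketch of that instantiation (`DeltaMessage` is the protocol model named above; the import path is an assumption):

```python
from sglang.srt.entrypoints.openai.protocol import DeltaMessage  # assumed path

# Explicit empty string rather than None, so the opening chunk of the
# stream always carries a well-formed assistant message.
delta = DeltaMessage(role="assistant", content="")
```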
@zhyncs merged commit 0998808 into sgl-project:main on Jun 20, 2025 (57 of 90 checks passed).
@justHungryMan:

Hi, I was effectively utilizing OpenAI's batch inference feature, but is it getting deprecated?

whybeyoung pushed a commit to whybeyoung/sglang that referenced this pull request Jun 24, 2025
@CatherineSue (Collaborator):

> Hi, I was effectively utilizing OpenAI's batch inference feature, but is it getting deprecated?

Hi @justHungryMan, the original online batch support is not production-ready. We plan to switch to offline batch support; please check out #7427 for details. Feel free to leave comments there!

chenxijun1029 pushed a commit to chenxijun1029/sglang that referenced this pull request Jul 17, 2025
pi314ever pushed a commit to pi314ever/sglang that referenced this pull request Jul 17, 2025
* Use seq_len_fill_value in the cuda graph runners (sgl-project#7233)

* support custom weight loader for model runner (sgl-project#7122)

Co-authored-by: kavioyu <[email protected]>

* Fix AMD speculative decoding (sgl-project#7252)

* [Refactor] OAI Server components (sgl-project#7167)

Signed-off-by: Xinyuan Tong <[email protected]>

* OAI Server Skeleton & Core Utility Endpoints (sgl-project#7179)

* [amd] Opt dsv3 moe (sgl-project#7160)

Co-authored-by: wunhuang <[email protected]>

* update ci node for xeon (sgl-project#7265)

* feat: mtp support dp-attention (sgl-project#6081)

Co-authored-by: austindeng <[email protected]>
Co-authored-by: tianqilin.99 <[email protected]>
Co-authored-by: Qiaolin Yu <[email protected]>
Co-authored-by: ch-wan <[email protected]>

* support qwen2 running on ascend npu device (sgl-project#7022)

Co-authored-by: 刁莹煜 <[email protected]>

* Fix Deepseek R1 0528 FP4 tensor name mismatch issue during weights loading. (sgl-project#7164)

* bugfix(tool call ebnf): Fix EBNF generation for optional function parameters (sgl-project#7283)

* Fix AWQ Dequant and Weight Loading of deepseek v2 (sgl-project#6842)

* fix: resolve b200 dsv3 mtp issue (sgl-project#7286)

* ci: Fix test_ebnf_generate_all_optional_function_params (sgl-project#7288)

* fix: only enable flash_attn test on sm80 sm90 (sgl-project#7289)

* [PD] Support get local ip from NIC for PD disaggregation (sgl-project#7237)

Signed-off-by: Shangming Cai <[email protected]>

* [PD] Add custom memory pool option to support Mooncake PD with NVLink  (sgl-project#7264)

Signed-off-by: Shangming Cai <[email protected]>

* Upstreaming hicache bug fixes (sgl-project#7267)

* Update python API of activation, topk, norm and rope and remove vllm dependency (sgl-project#6614)

Co-authored-by: Wu, Chunyuan <[email protected]>
Co-authored-by: jianan-gu <[email protected]>
Co-authored-by: sdp <[email protected]>

* Fix hicache benchmark script bug - some sampled input_request is [] (sgl-project#7300)

* chore: change logs from `INFO` to `DEBUG` for dp and add force quit for tokenizer manager (sgl-project#7251)

* update invalid link in doc (sgl-project#7297)

* Fix mini_lb for PD with long output: limit chunk size of decode response (sgl-project#7301)

Signed-off-by: ch-tiger1 <[email protected]>
Co-authored-by: ch-tiger1 <[email protected]>

* Fix profiler error when there are idle passes (sgl-project#7003)

* [pd] optimize dockerfile for pd disaggregation (sgl-project#7319)

Co-authored-by: zhyncs <[email protected]>

* Merge PDLB (Prefill-Decode Load Balancer) into SGLang Router (sgl-project#7096)

* Add more refactored openai test & in CI (sgl-project#7284)

* fix: resolve blackwell deepep image issue (sgl-project#7331)

* add seed in CPU UTs to avoid flaky failure (sgl-project#7333)

* Multi-Stage Awake: Support Resume and Pause KV Cache and Weights separately (sgl-project#7099)

* Reintroduce tiny fix sampler error when prob is not contiguous (sgl-project#7354)

* [Refactor] Clean up radix cache related API (sgl-project#7303)

Co-authored-by: Zhiqiang Xie <[email protected]>

* Put `_normalize_rid` before other normalization in `io_struct` (sgl-project#7363)

* [PD] Transfer hidden states for mtp when disaggregation (sgl-project#7242)

* [Bugfix][PD] Set conclude state before clear when failure happens (sgl-project#7362)

Signed-off-by: Shangming Cai <[email protected]>

* docs: update installation (sgl-project#7366)

* [Docker] optimize dockerfile remove deepep and blackwell merge it to… (sgl-project#7343)

Co-authored-by: Yineng Zhang <[email protected]>

* Clean unused import for mimo mtp model (sgl-project#7370)

* [Bugfix]Fix hang bug using dp attention with HiRadixCache (sgl-project#7159)

Signed-off-by: huanglong <[email protected]>

* [Doc] add embedding rerank doc (sgl-project#7364)

* Fix judgment condition for enabling Deepseek V3/R1 shared expert fusion optimization (sgl-project#7371)

* Feat/refactor embedding server (sgl-project#7322)

* Purge VerlEngine (sgl-project#7326)

Signed-off-by: Ata Fatahi <[email protected]>

* support return logprobs for pipeline (sgl-project#7356)

Co-authored-by: Zhang Kaihong <[email protected]>

* [PD] Optimize custom mem pool usage and bump mooncake version (sgl-project#7393)

Signed-off-by: Shangming Cai <[email protected]>

* Support THUDM/GLM-4-0414 (GLM-Z1) Glm4ForCausalLM architecture. (sgl-project#5485)

* Refine OpenAI serving entrypoint to remove batch requests (sgl-project#7372)

Signed-off-by: Xinyuan Tong <[email protected]>
Co-authored-by: Chang Su <[email protected]>

* [Feature] Comprehensive Hybrid Parallelism Support (sgl-project#6389)

* [DeepSeekNextN] fix: residual of head norm can be None (sgl-project#7398)

* [OAI refactor] Add rerank and score serving (sgl-project#7399)

Co-authored-by: Chang Su <[email protected]>

* [OAI Server Refactor] [ChatCompletions & Completions] Implement UsageInfo Processor (sgl-project#7360)

Co-authored-by: Chang Su <[email protected]>

* Fix All-Gather under world size one (sgl-project#7219)

* Optimize DP attn scheduling for speculative decoding (sgl-project#7285)

* Update usage_processor.py (sgl-project#7402)

* Fix 7285 Merge Conflicts (sgl-project#7403)

* chore: upgrade mooncake-transfer-engine 0.3.4 (sgl-project#7401)

* [OAI Server Refactor] [ChatCompletions & Completions] Support Return Hidden State (sgl-project#7329)

Signed-off-by: keru <[email protected]>

* Remove batches api in docs & example (sgl-project#7400)

* [BugFix]: fix EmbeddingReqInput single input error (sgl-project#7396)

* [BugFix]fix qwen25 invoke function call streaming responses with curly braces as the starting indicator (sgl-project#7394)

* fix overlap pagecount (sgl-project#6984)

Co-authored-by: Zhiqiang Xie <[email protected]>

* fix: Fix CI test_function_call_parser.py (sgl-project#7425)

* Fix CPU offloading for MLA memory pool (sgl-project#7409)

* [fix] PD disaggregation when enable mtp and tp!=dp (sgl-project#7420)

* feat(oai refactor): Replace `openai_api` with `entrypoints/openai`  (sgl-project#7351)

Co-authored-by: Jin Pan <[email protected]>

* Refactor LoRAManager and LoRAMemoryPool state management logic for dynamic LoRA loading support (sgl-project#7412)

* refactor(test): reorganize OpenAI test file structure (sgl-project#7408)

* [minor] simplify the `TokenToKVPoolAllocator` (sgl-project#7414)

* Tiny add logging for GC  (sgl-project#7406)

* FlashInfer NVFP4 MoE with EP & 2-stream shared expert (sgl-project#7327)

Co-authored-by: JieXin Liang <[email protected]>
Co-authored-by: alcanderian <[email protected]>

* Remove copy after bmm (sgl-project#7441)

* Fix torch compile run (sgl-project#7391)

Co-authored-by: wunhuang <[email protected]>
Co-authored-by: Sai Enduri <[email protected]>

* [misc] Add PD service discovery support in router (sgl-project#7361)

* add fused moe config for qwen3 in triton3.3.1 (sgl-project#7445)

* Fix CUDA Graph Check under Deepep with DP FFN (sgl-project#7451)

* Update hyperparameter_tuning.md (sgl-project#7454)

* feat: integrate deepgemm into EPMoE (sgl-project#6821)

Co-authored-by: tianqilin.99 <[email protected]>
Co-authored-by: TianQiLin666666 <[email protected]>
Co-authored-by: Cheng Wan <[email protected]>

* Solve docker build failed in the virtual machine (sgl-project#7290)

Co-authored-by: wunhuang <[email protected]>
Co-authored-by: Sai Enduri <[email protected]>
Co-authored-by: HAI <[email protected]>

* Fix a bug in BatchTokenIDOut & Misc style and dependency updates (sgl-project#7457)

* [CI] Upgrade mooncake to 0.3.4.post1 to fix 8 gpu tests (sgl-project#7472)

Signed-off-by: Shangming Cai <[email protected]>

* Fix prefill OOM due to wrong token calculation when page > 1  (sgl-project#7397)

* feat(func_call): Add more check in `BaseFormatDetector.parse_streaming_increment` (sgl-project#7479)

* Fix dtype for idle input in spec decoding (sgl-project#7456)

* update mooncake in dockerfile (sgl-project#7480)

* kvcache io kernels and test case (sgl-project#7382)

* [perf] slightly improve DeepSeek-R1-FP4 TP8 (sgl-project#7481)

* Quick fix for DeepGemm requant to also cover MTP. (sgl-project#7378)

* Support weight loading without mmap (sgl-project#7469)

* ci: Revert openai_server related tests in AMD suites (sgl-project#7449)

* Performance: Enable cuda graph for dp idle batch (sgl-project#7269)

Co-authored-by: austindeng <[email protected]>
Co-authored-by: Cheng Wan <[email protected]>
Co-authored-by: ch-wan <[email protected]>

* bugfix: Prevent global mutation of conv.stop_str across requests (sgl-project#7347)

Co-authored-by: Chang Su <[email protected]>

* Fix RequestValidationError response format (sgl-project#7487)

* Fix MTP with Deepseek R1 Fp4 (sgl-project#7376)

* chore: bump sgl-kernel v0.2.0 (sgl-project#7490)

* chore: bump v0.4.8 (sgl-project#7493)

* [AMD] add aiter fused moe in DeepEP path (sgl-project#7268)

* enable aiter_biased_grouped_topk kernel (sgl-project#7423)

* [PD Disaggregation] replace transfer with batch transfer for better performance (sgl-project#7236)

* Remove cumsum_buffer initialization (sgl-project#7439)

* [benchmark] fbgemm benchmark support bandwidth report and support fbgemm_cutlass_gmm (sgl-project#7422)

* Support multi-thread model weight loading (sgl-project#7277)

* [PD] NIXL: Register kv args in advance and cleanup finished requests (sgl-project#6717)

* fix: Add `--model` as an alias for `--model-path` in server_args (sgl-project#7505)

* misc: Improvement to serving_chat.py and add more ut (sgl-project#7489)

* Fuse sorted_token_ids padding to moe_align_block_size kernel (sgl-project#7437)

* [OAI] patch origin request_id logic (sgl-project#7508)

* [PD][Spec] Fix hidden state transfer for spec decode (sgl-project#7516)

Signed-off-by: Shangming Cai <[email protected]>

* EPLB support for MTP (sgl-project#7510)

* clean duplicate code (sgl-project#7512)

* [ci] add router benchmark script and CI (sgl-project#7498)

* fix: force synchronization between TP workers when update_weights (sgl-project#6626)

Co-authored-by: dangkai.dk <[email protected]>

* [CPU] [BF16] Call fused_experts_cpu, weight_packed_linear and bmm_cpu kernel in DeepSeek model (sgl-project#6641)

Co-authored-by: Thien Tran <[email protected]>

* [CI] Upgrade mooncake to v0.3.4.post2 to fix potential slice failed bug (sgl-project#7522)

Signed-off-by: Shangming Cai <[email protected]>

* npu fused op (sgl-project#7386)

Co-authored-by: Li Junwen <[email protected]>

* feat: send kvmetrics from sglang scheduler (sgl-project#6721)

* [PD] Add different TP sizes support for no-MLA models (sgl-project#6793)

Co-authored-by: shangmingc <[email protected]>
Co-authored-by: Shangming Cai <[email protected]>

* enable aiter fp8 blockscale quant (sgl-project#7520)

* take aiter get_rope back (sgl-project#7521)

* Fix typo of flash_cache (sgl-project#7513)

* feat: add return hidden_states at async generation (sgl-project#7507)

* minor: 'role' must be system/assistant/tool, but case insensitive for now (sgl-project#7499)

* Fix FP8 KV Cache Support in FA3 Backend (sgl-project#7148)

* Fix gathered_buffer issues in tbo (sgl-project#7531)

* [PD] Raise error for incompatible mooncake version and some minor fixes (sgl-project#7527)

Signed-off-by: Shangming Cai <[email protected]>

* [CMake] Fix sgl-kernel CMakeLists for Blackwell (sgl-project#7543)

* Add Tencent HunYuanMoEV1 model support (sgl-project#7549)

* Update seed in CPU UTs to avoid flaky failure with single test (sgl-project#7544)

* chore: improve ci bug reporting (sgl-project#7542)

* chore: remove vlm unnecessary import (sgl-project#7541)

Signed-off-by: Xinyuan Tong <[email protected]>
Co-authored-by: yhyang201 <[email protected]>
Co-authored-by: Mick <[email protected]>

* chore: bump v0.4.8.post1 (sgl-project#7559)

* [PD][NIXL] Set is_sorted=False to fix NIXL_ERR_NOT_FOUND (sgl-project#7330)

* [Fix] incorrect assert in EPLB (sgl-project#7575)

* Updates Gemma3n MLP layer to adapt latest transformers version (sgl-project#7573)

Signed-off-by: Xinyuan Tong <[email protected]>

* Fix MTP error when enabling two-batch overlap  (sgl-project#7569)

* Add e2e test for multi instance multi stage memory release/resume occupation (sgl-project#7208)

Signed-off-by: Ata Fatahi <[email protected]>

* [CI] Add CI Testing for Prefill-Decode Disaggregation with Router (sgl-project#7540)

* Updates transformers and timm dependencies (sgl-project#7577)

Signed-off-by: Xinyuan Tong <[email protected]>

* feat: support compatibility between MTP and two-batch-overlap (sgl-project#7225)

Co-authored-by: Cheng Wan <[email protected]>

* Move multimodal processors into a separate folder (sgl-project#7581)

* Fix broken CI TestVILAServer (sgl-project#7610)

* [router] add centralized configuration module for sgl-router (sgl-project#7588)

* Fix: Minicpm (sgl-project#7612)

Signed-off-by: Xinyuan Tong <[email protected]>

* Hybrid kv cache for LLaMA4 (sgl-project#6563)

Co-authored-by: Cheng Wan <[email protected]>
Co-authored-by: tarinkk <[email protected]>
Co-authored-by: tarinkk <[email protected]>
Co-authored-by: Hanming Lu <[email protected]>

* [CPU] add optimizations for INT8 and FP8 DeepSeek (sgl-project#6769)

Co-authored-by: Zheng, Beilei <[email protected]>

* Tiny add logs for expert location updater (sgl-project#7308)

* Fix flakiness in LoRA batch test. (sgl-project#7552)

* [BUG] fix local_rank in initialize_dp_attention (sgl-project#7584)

* Support dynamic LoRA loading / unloading in engine/server API (sgl-project#7446)

* [PD] Respect sampling_params.max_new_tokens when PD disaggregation is activated (sgl-project#7598)

Signed-off-by: Shangming Cai <[email protected]>

* fix unit tests (sgl-project#7618)

* Let ep_scatter support arbitrary strides / ue8m0 format (sgl-project#7309)

* Let EP prefill support new DeepGEMM (sgl-project#7310)

* docs: add gb200 nvl72 and a16z grant (sgl-project#7620)

* oai: Adds support for OpenAI chat completions API in bench_serving (sgl-project#7036)

Signed-off-by: Xinyuan Tong <[email protected]>
Co-authored-by: yhyang201 <[email protected]>
Co-authored-by: Mick <[email protected]>

* [bugfix] Remove PR comment posting from Rust benchmark workflow (sgl-project#7625)

* [Minor] clean up multimodal processor and tokenizer manager (sgl-project#7624)

* Add dsv3 fused a gemm to sgl-kernel (sgl-project#7630)

* Add @mickqian as the CODEOWNERS of multimodal (sgl-project#7636)

* Fix stream reasoning parser and Adds Kimi reasoning parser  (sgl-project#7432)

Signed-off-by: Xinyuan Tong <[email protected]>

* Fix sgl-router startup crash (sgl-project#7619)

* [bugfix] fix runtime dropping panic in editable (sgl-project#7628)

* Move files related to EPLB (sgl-project#7580)

* [misc] reduce weird rope_scaling_factor warning (sgl-project#7176)

* [AMD] Add unit-test-sgl-kernel-amd to AMD CI (sgl-project#7539)

* Update CODEOWNERS (sgl-project#7640)

* [EAGLE] remove a wrong adjustment for page_size > 1 & topk > 1 in server_args.py (sgl-project#7643)

* [CPU] add c++ kernel to bind CPU cores and memory node (sgl-project#7524)

* Improve streaming, log_level, memory report, weight loading, and benchmark script (sgl-project#7632)

Co-authored-by: Kan Wu <[email protected]>

* Add dsv3 router gemm kernel (sgl-project#7627)

* chore: upgrade flashinfer v0.2.7 jit (sgl-project#7663)

* [doc] update lws doc for pd (sgl-project#7318)

* Fix: sync prepare_fp8_layer_for_marlin with latest vllm changes (sgl-project#7648)

* Add small requirements for benchmark/parse_result tools (sgl-project#7671)

* [CPU] remove process_group from inputs of shm_allreduce and shm_allgather (sgl-project#7486)

* chore: bump sgl-kernel v0.2.1 (sgl-project#7675)

* support llama4 eagle3  (sgl-project#6985)

Co-authored-by: shuaills <[email protected]>
Co-authored-by: Shenggui Li <[email protected]>
Co-authored-by: Yingyi Huang <[email protected]>
Co-authored-by: yizhang2077 <[email protected]>

* Refactor mm processors and Enable mixed modality processing (sgl-project#7629)

Signed-off-by: Xinyuan Tong <[email protected]>

* upgrade sgl kernel to 0.2.1 for main (sgl-project#7676)

* add description for llama4 eagle3 (sgl-project#7688)

* fix(model loader): use safe_open to prevent file handle leaks. (sgl-project#7684)

* chore: upgrade flashinfer v0.2.7.post1 (sgl-project#7698)

* Improve error handling for requests with unloaded LoRA path(s) (sgl-project#7642)

* Apply dsv3_fused_a_gemm kernel (sgl-project#7635)

* Fix GPTQMarlinMoE (sgl-project#7697)

* [1/n] apply wna16marlin kernel in moe weight only quantization (sgl-project#7683)

Co-authored-by: 晟海 <[email protected]>
Co-authored-by: yych0745 <[email protected]>
Co-authored-by: HandH1998 <[email protected]>
Co-authored-by: 弋云 <[email protected]>
Co-authored-by: walker-ai <[email protected]>

* Apply dsv3 router gemm kernel for deepseek-r1 fp4 (sgl-project#7677)

* [AMD] Temporarily disable test_no_overlap_scheduler and test_vision_chunked_prefill (sgl-project#7717)

* [RL] add --skip-warmup (sgl-project#7416)

* [RL] support update_weights_from_distributed with different group and multiple weights (sgl-project#7292)

* [router] add --log-level to sgl-router (sgl-project#6512)

* [b200] support trt-llm allreduce fuse rms_norm_add kernel (sgl-project#7621)

* [CPU] Bind threads and numa node for each TP rank (sgl-project#6549)

Co-authored-by: srinarayan-srikanthan <[email protected]>

* Support non-contiguous query input for extend/decode attention (sgl-project#7462)

* Support updating weights at once by stopping all requests (sgl-project#6698)

Signed-off-by: Tianyu Zhou <[email protected]>
Co-authored-by: Zilin Zhu <[email protected]>

* Fix num_tokens_pre_allocated in disaggregation log (sgl-project#7714)

* [CPU] [sgl-kernel] set dispatch key of initialize to CatchAll (sgl-project#7734)

* [CPU] fix all_reduce and all_gather (sgl-project#6770)

Co-authored-by: blzheng <[email protected]>

* fix awq and dsv3 fused gemm compatible (sgl-project#7735)

* [CI][Router] Fix bench_one_batch_server for pd router test (sgl-project#7731)

Signed-off-by: Shangming Cai <[email protected]>

* Add CUTLASS FP8 Blockscale MoE kernel for Hopper architecture (sgl-project#7278)

Co-authored-by: HydraQYH <[email protected]>
Co-authored-by: TianQiLin666666 <[email protected]>

* fix dsv3 fused proj check  (sgl-project#7738)

* Ascend attention backend(PA&MLA) (sgl-project#7722)

Co-authored-by: Maksim <[email protected]>
Co-authored-by: VDV1985 <[email protected]>

* [fix] fix dsv3_router_gemm filter (sgl-project#7750)

* [CPU] refine CPU integration code (sgl-project#7647)

* [CPU] support the case where num_attention_heads or intermediate_size is not divisible by the TP size (sgl-project#6771)

* support qwen3 dense model dp attention (sgl-project#7681)

* [optimize] add two stream norm for qwen3 (sgl-project#7740)

Co-authored-by: ispobock <[email protected]>

* feat: use D2D instead of H2H in pp (sgl-project#7673)

Co-authored-by: alpha-baby <[email protected]>

* [Bug] add flashinfer bool check for fusedmoe in Qwen moe models (sgl-project#7723)

* [fix] put cpu in the first priority in get_device() (sgl-project#7752)

* [optimize] fuse renormalize into moe_topk_softmax (sgl-project#7744)

Co-authored-by: ispobock <[email protected]>

* chore: bump sgl-kernel 0.2.2 (sgl-project#7755)

* fix CI: update native api ipynb (sgl-project#7754)

Signed-off-by: Xinyuan Tong <[email protected]>

* fuse renormal into moe topk softmax kernel python code (sgl-project#7751)

Co-authored-by: ispobock <[email protected]>
Co-authored-by: zhyncs <[email protected]>

* Remove type conversion and fix id map in topk (sgl-project#7759)

* Add V2-lite model test (sgl-project#7390)

Co-authored-by: DiweiSun <[email protected]>

* refactor llama4 dp attention logic (sgl-project#7729)

* fix(docs): fix the broken link in `docs/references/production_metrics.md` (sgl-project#7741)

Signed-off-by: rudeigerc <[email protected]>

* [fix] update bench_speculative.py for compatibility (sgl-project#7764)

Signed-off-by: Kay Yan <[email protected]>

* Move mem_fraction_static adjustment for multimodal models to `server_args.py` & Fix session control & Other cleanups (sgl-project#7748)

* [RL] Add --nccl-port to prevent port conflict (sgl-project#7418)

* [RL] add pause and continue generation for async rl training (sgl-project#7419)

* [Fix] Alloc return type error (sgl-project#7778)

Signed-off-by: Capronir <[email protected]>

* [feat] Support EAGLE3 for Qwen (sgl-project#7745)

Co-authored-by: 纬杭 <[email protected]>
Co-authored-by: zyksir <[email protected]>

* saving hidden_states.clone() (sgl-project#7705)

* [1/n]: add cutlass W4A8 moe kernel for hopper architecture (sgl-project#7772)

Signed-off-by: yangsijia.614 <[email protected]>
Co-authored-by: yicwang <[email protected]>

* add model: qwen2-audio (sgl-project#7596)

* Optimize Hopper CUTLASS FP8 Blockwise Grouped GEMM Kernel in Small K Scenario (sgl-project#7782)

* Embedding parallel by attn_tp (sgl-project#7623)

* fix: fix apply_shuffle_mul_sum (sgl-project#7444)

* chore: bump sgl-kernel v0.2.3 (sgl-project#7784)

* fix: use nvidia-nccl-cu12 2.27.5 (sgl-project#7787)

* DP Attention with Auto DeepEP Dispatch (sgl-project#7222)

* chore: upgrade sgl-kernel v0.2.3 (sgl-project#7786)

* Fix incorrect spec_num_draft_tokens in draft_extend (sgl-project#7757)

* [fix] fix misusing of is_cuda (sgl-project#7790)

* Add treemask mode to build_eagle_tree & release sgl-kernel 0.2.3 (sgl-project#7756)

Co-authored-by: Pranjal Shankhdhar <[email protected]>

* chore: bump sgl-kernel v0.2.4 (sgl-project#7800)

* ci: fix port args (sgl-project#7792)

* Fix CI test OOM issue. (sgl-project#7799)

* chore: upgrade sgl-kernel v0.2.4 (sgl-project#7801)

* chore: bump v0.4.9 (sgl-project#7802)

* fix merge conflict issue

* fix hpu attention NoneType issue

* fix alignment

* fix alignment2

* CI failure fixes

* fix attention-backend choices

---------

Signed-off-by: Xinyuan Tong <[email protected]>
Signed-off-by: Shangming Cai <[email protected]>
Signed-off-by: ch-tiger1 <[email protected]>
Signed-off-by: huanglong <[email protected]>
Signed-off-by: Ata Fatahi <[email protected]>
Signed-off-by: keru <[email protected]>
Signed-off-by: Tianyu Zhou <[email protected]>
Signed-off-by: rudeigerc <[email protected]>
Signed-off-by: Kay Yan <[email protected]>
Signed-off-by: Capronir <[email protected]>
Signed-off-by: yangsijia.614 <[email protected]>
Signed-off-by: Mohit Sinha <[email protected]>
Co-authored-by: Lianmin Zheng <[email protected]>
Co-authored-by: KavioYu <[email protected]>
Co-authored-by: kavioyu <[email protected]>
Co-authored-by: Xinyuan Tong <[email protected]>
Co-authored-by: yhyang201 <[email protected]>
Co-authored-by: kk <[email protected]>
Co-authored-by: wunhuang <[email protected]>
Co-authored-by: DiweiSun <[email protected]>
Co-authored-by: u4lr451 <[email protected]>
Co-authored-by: austindeng <[email protected]>
Co-authored-by: tianqilin.99 <[email protected]>
Co-authored-by: Qiaolin Yu <[email protected]>
Co-authored-by: ch-wan <[email protected]>
Co-authored-by: Yijie Zhu <[email protected]>
Co-authored-by: 刁莹煜 <[email protected]>
Co-authored-by: Charles Chen <[email protected]>
Co-authored-by: Chang Su <[email protected]>
Co-authored-by: AniZpZ <[email protected]>
Co-authored-by: Yineng Zhang <[email protected]>
Co-authored-by: shangmingc <[email protected]>
Co-authored-by: Zhiqiang Xie <[email protected]>
Co-authored-by: YanbingJiang <[email protected]>
Co-authored-by: Wu, Chunyuan <[email protected]>
Co-authored-by: jianan-gu <[email protected]>
Co-authored-by: sdp <[email protected]>
Co-authored-by: Binyao Jiang <[email protected]>
Co-authored-by: ishandhanani <[email protected]>
Co-authored-by: linzhuo <[email protected]>
Co-authored-by: ch-tiger1 <[email protected]>
Co-authored-by: ch-tiger1 <[email protected]>
Co-authored-by: fzyzcjy <[email protected]>
Co-authored-by: ybyang <[email protected]>
Co-authored-by: Simo Lin <[email protected]>
Co-authored-by: Jinn <[email protected]>
Co-authored-by: Stefan He <[email protected]>
Co-authored-by: DarkSharpness <[email protected]>
Co-authored-by: Atream <[email protected]>
Co-authored-by: Li Hui <[email protected]>
Co-authored-by: Huang Long <[email protected]>
Co-authored-by: woodx <[email protected]>
Co-authored-by: Ata Fatahi <[email protected]>
Co-authored-by: strgrb <[email protected]>
Co-authored-by: Zhang Kaihong <[email protected]>
Co-authored-by: Wenbo Yang <[email protected]>
Co-authored-by: Chang Su <[email protected]>
Co-authored-by: Cheng Wan <[email protected]>
Co-authored-by: Keyang Ru <[email protected]>
Co-authored-by: ehuaa <[email protected]>
Co-authored-by: pansicheng <[email protected]>
Co-authored-by: Liangsheng Yin <[email protected]>
Co-authored-by: Jin Pan <[email protected]>
Co-authored-by: Lifu Huang <[email protected]>
Co-authored-by: Trevor Morris <[email protected]>
Co-authored-by: JieXin Liang <[email protected]>
Co-authored-by: alcanderian <[email protected]>
Co-authored-by: Ke Bao <[email protected]>
Co-authored-by: Sai Enduri <[email protected]>
Co-authored-by: Yi Zhang <[email protected]>
Co-authored-by: xutizhou <[email protected]>
Co-authored-by: TianQiLin666666 <[email protected]>
Co-authored-by: HAI <[email protected]>
Co-authored-by: Yuhong Guo <[email protected]>
Co-authored-by: huangtingwei <[email protected]>
Co-authored-by: Alex Sun <[email protected]>
Co-authored-by: valarLip <[email protected]>
Co-authored-by: Francis <[email protected]>
Co-authored-by: Xiaoyu Zhang <[email protected]>
Co-authored-by: xianzhiT <[email protected]>
Co-authored-by: yilian49 <[email protected]>
Co-authored-by: DangKai <[email protected]>
Co-authored-by: dangkai.dk <[email protected]>
Co-authored-by: Thien Tran <[email protected]>
Co-authored-by: ll819214 <[email protected]>
Co-authored-by: Li Junwen <[email protected]>
Co-authored-by: zixuanzhang226 <[email protected]>
Co-authored-by: Hongbo Xu <[email protected]>
Co-authored-by: shangmingc <[email protected]>
Co-authored-by: eigen <[email protected]>
Co-authored-by: mlmz <[email protected]>
Co-authored-by: Ruihang Lai <[email protected]>
Co-authored-by: Meng, Peng <[email protected]>
Co-authored-by: Mick <[email protected]>
Co-authored-by: yhyang201 <[email protected]>
Co-authored-by: tarinkk <[email protected]>
Co-authored-by: tarinkk <[email protected]>
Co-authored-by: tarinkk <[email protected]>
Co-authored-by: Hanming Lu <[email protected]>
Co-authored-by: Zheng, Beilei <[email protected]>
Co-authored-by: Sheng Qi <[email protected]>
Co-authored-by: finetune <[email protected]>
Co-authored-by: Hubert Lu <[email protected]>
Co-authored-by: Kan Wu <[email protected]>
Co-authored-by: Baizhou Zhang <[email protected]>
Co-authored-by: narutolhy <[email protected]>
Co-authored-by: lukec <[email protected]>
Co-authored-by: shuaills <[email protected]>
Co-authored-by: Shenggui Li <[email protected]>
Co-authored-by: Yingyi Huang <[email protected]>
Co-authored-by: Simon_CQK <[email protected]>
Co-authored-by: Kyungmin Lee <[email protected]>
Co-authored-by: 晟海 <[email protected]>
Co-authored-by: yych0745 <[email protected]>
Co-authored-by: HandH1998 <[email protected]>
Co-authored-by: 弋云 <[email protected]>
Co-authored-by: walker-ai <[email protected]>
Co-authored-by: Zilin Zhu <[email protected]>
Co-authored-by: srinarayan-srikanthan <[email protected]>
Co-authored-by: Albert <[email protected]>
Co-authored-by: Ziming Huang <[email protected]>
Co-authored-by: ayrnb <[email protected]>
Co-authored-by: HydraQYH <[email protected]>
Co-authored-by: ronnie_zheng <[email protected]>
Co-authored-by: Maksim <[email protected]>
Co-authored-by: VDV1985 <[email protected]>
Co-authored-by: ispobock <[email protected]>
Co-authored-by: TianyuZhang1214 <[email protected]>
Co-authored-by: alpha-baby <[email protected]>
Co-authored-by: Yuchen Cheng <[email protected]>
Co-authored-by: Kay Yan <[email protected]>
Co-authored-by: Caproni <[email protected]>
Co-authored-by: Ximingwang-09 <[email protected]>
Co-authored-by: 纬杭 <[email protected]>
Co-authored-by: zyksir <[email protected]>
Co-authored-by: SijiaYang <[email protected]>
Co-authored-by: yicwang <[email protected]>
Co-authored-by: Leng Yue <[email protected]>
Co-authored-by: Qi Yuhang <[email protected]>
Co-authored-by: Gang Chen <[email protected]>
Co-authored-by: Pranjal Shankhdhar <[email protected]>
Co-authored-by: jay <[email protected]>
@JustinTong0323 deleted the refactor_openai_serving_remove_batch branch on July 18, 2025, 23:22