Conversation

@ishandhanani
Contributor

@ishandhanani ishandhanani commented Sep 19, 2025

Summary by CodeRabbit

  • New Features

    • Added support for text-based Chat Completions requests alongside token-based inputs.
    • Introduced streaming for both token and text responses.
    • Added configurable host binding and bootstrap port in the disaggregated launch script.
  • Improvements

    • Now honors the CLI setting for tokenizer initialization without overriding it.
    • Automatically builds prompts when tokenizer is enabled.
    • Broader compatibility for sampling parameters across request formats.
    • Streamed responses include timestamps and model metadata.

@copy-pr-bot

copy-pr-bot bot commented Sep 19, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@ishandhanani ishandhanani marked this pull request as ready for review September 19, 2025 06:21
@ishandhanani ishandhanani requested review from a team as code owners September 19, 2025 06:21
@coderabbitai
Contributor

coderabbitai bot commented Sep 19, 2025

Walkthrough

Adds bootstrap host/port flags to disagg.sh; stops forcing skip_tokenizer_init in args parsing; updates protocol models to allow either token or OpenAI chat request; adjusts registration to select tokenizer mode and pass new runtime params; refactors handlers to route inputs via a new helper and split streaming into token vs text paths.

Changes

  • Launch scripts (components/backends/sglang/launch/disagg.sh): Adds --host 0.0.0.0 and --disaggregation-bootstrap-port 12345 to both invocations; reorders arguments.
  • Argument parsing (components/backends/sglang/src/dynamo/sglang/args.py): Removes the auto-setting and warning for skip_tokenizer_init; preserves the CLI-provided value.
  • Protocol models (components/backends/sglang/src/dynamo/sglang/protocol.py): Drops PreprocessedRequest.batch_token_ids; widens DisaggPreprocessedRequest.request to Union[PreprocessedRequest, ChatCompletionRequest]; adjusts imports.
  • Registration/runtime wiring (components/backends/sglang/src/dynamo/sglang/register.py): Computes input/output types based on skip_tokenizer_init; passes kv_cache_block_size, migration_limit, and runtime_config; warns when the text/chat path is used.
  • Request handling (components/backends/sglang/src/dynamo/sglang/request_handlers/{handler_base,decode_handler,prefill_handler}.py): Introduces a skip_tokenizer_init flag and _get_input_param to choose token_ids vs. prompt; prefill/generate calls now spread input_param; decode streaming is split into token and text paths; builds OpenAI-style chunks for text; adjusts sampling-param building; adds minor debug prints/timestamps.
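
A minimal sketch of the register.py branching summarized above (ModelInput, ModelType, and register_llm come from the dynamo bindings cited later in this review; the keyword names and config attributes are assumptions, not the exact code):

from dynamo._core import ModelInput, ModelType, register_llm

async def register(config):
    if config.server_args.skip_tokenizer_init:
        # Token path: the frontend pre-tokenizes, the worker receives token_ids.
        model_input = ModelInput.Tokens
        model_type = ModelType.Chat | ModelType.Completions
    else:
        # Text/chat path: the worker applies the chat template itself.
        model_input = ModelInput.Text
        model_type = ModelType.Chat
    await register_llm(
        model_input,
        model_type,
        kv_cache_block_size=config.server_args.page_size,  # per the summary above
        migration_limit=config.migration_limit,            # assumed attribute
        runtime_config=config.runtime_config,              # assumed attribute
    )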

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant C as Client
  participant H as Prefill/Decode Handler
  participant B as BaseWorkerHandler
  participant TM as TokenizerManager
  participant E as Engine

  Note over C,H: Request may be token-based or OpenAI chat (messages)

  C->>H: request (sampling, stream, disagg bootstrap)
  H->>B: _get_input_param(request)
  alt skip_tokenizer_init = True (token path)
    B-->>H: { input_ids }
  else skip_tokenizer_init = False (text/chat path)
    B->>TM: apply_chat_template(messages)
    TM-->>B: prompt (text)
    B-->>H: { prompt }
  end

  H->>E: async_generate(**input_param, sampling_params, stream, bootstrap...)
  alt Stream tokens
    E-->>H: { output_ids, finish_reason? } (stream)
    H-->>C: token deltas / finish
  else Stream text (OpenAI-style)
    E-->>H: { text, finish_reason? } (stream)
    H-->>C: chat.completion.chunk deltas / finish
  end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Poem

A whisk of flags, a hop to host,
Two paths diverge—tokens, text—at most.
I twitch my ears at streaming light,
Chat or ids, both chew just right.
With bootstrap ports and runtime cheer,
This bunny ships without a fear. 🐇✨

Pre-merge checks

❌ Failed checks (2 warnings)
  • Description Check: ⚠️ Warning. The pull request has no description; the repository's template (Overview, Details, Where should the reviewer start, Related Issues) was not filled out, leaving reviewers without rationale, change details, or guidance on where to inspect and how to test. Because all required sections are missing, the description check fails. Resolution: populate the PR using the provided template. Add an Overview that states the goal; a Details section listing the key code changes (notably the removal of batch_token_ids, the DisaggPreprocessedRequest request union, the runtime switch for tokenizer/detokenizer, handler input branching, and the args/launch-script changes); a "Where should the reviewer start?" pointing to files like register.py, handler_base.py, decode_handler.py, protocol.py, args.py, and the launch script; and a Related Issues entry plus brief testing steps to validate both token-based and chat-based flows, with any migration notes.
  • Docstring Coverage: ⚠️ Warning. Docstring coverage is 36.36%, below the required threshold of 80.00%. You can run @coderabbitai generate docstrings to improve coverage.
✅ Passed checks (1 passed)
  • Title Check: ✅ Passed. The title "feat: allow framework tokenization/detokenization" is concise and accurately captures the primary intent of the changeset: enabling use of the framework's tokenizer/detokenizer and switching the runtime/handlers to support framework tokenization instead of forcing skip_tokenizer_init. It maps directly to the edits in registration, handler input paths, and argument parsing.

Tip

👮 Agentic pre-merge checks are now available in preview!

Pro plan users can now enable pre-merge checks in their settings to enforce checklists before merging PRs.

  • Built-in checks – Quickly apply ready-made checks to enforce title conventions, require pull request descriptions that follow templates, validate linked issues for compliance, and more.
  • Custom agentic checks – Define your own rules using CodeRabbit’s advanced agentic capabilities to enforce organization-specific policies and workflows. For example, you can instruct CodeRabbit’s agent to verify that API documentation is updated whenever API schema files are modified in a PR. Note: Up to 5 custom checks are currently allowed during the preview period. Pricing for this feature will be announced in a few weeks.

Please see the documentation for more information.

Example:

reviews:
  pre_merge_checks:
    custom_checks:
      - name: "Undocumented Breaking Changes"
        mode: "warning"
        instructions: |
          Pass/fail criteria: All breaking changes to public APIs, CLI flags, environment variables, configuration keys, database schemas, or HTTP/GraphQL endpoints must be documented in the "Breaking Change" section of the PR description and in CHANGELOG.md. Exclude purely internal or private changes (e.g., code not exported from package entry points or explicitly marked as internal).

Please share your feedback with us on this Discord post.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.


Comment @coderabbitai help to get the list of available commands and usage tips.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (4)
components/backends/sglang/src/dynamo/sglang/request_handlers/handler_base.py (1)

39-53: Remove raw print and avoid leaking request content.

Printing full requests risks PII leakage and breaks logging hygiene. Use structured logging (or drop logging entirely) and guard for missing keys.

-    def _get_input_param(self, request: dict) -> dict:
-        """Get the appropriate input parameter for SGLang"""
-        print(request)
+    def _get_input_param(self, request: dict) -> dict:
+        """Get the appropriate input parameter for SGLang."""
         if self.skip_tokenizer_init:
-            return {"input_ids": request["token_ids"]}
+            try:
+                return {"input_ids": request["token_ids"]}
+            except KeyError as e:
+                raise ValueError("Missing 'token_ids' when skip_tokenizer_init=True") from e
         else:
-            # use sglang's chat templating itself but leave tokenization to the 
-            # interal engine's TokenizerManager
+            # Use sglang's chat templating but leave tokenization to the
+            # internal engine's TokenizerManager
             prompt = self.engine.tokenizer_manager.tokenizer.apply_chat_template(
                 request["messages"],
                 tokenize=False,
                 add_generation_prompt=True
             )
             return {"prompt": prompt}
components/backends/sglang/src/dynamo/sglang/register.py (1)

35-45: Add runtime_config to Python stub (regenerate stubs)

Rust binding exposes runtime_config (lib/bindings/python/rust/lib.rs — #[pyo3(signature = (..., runtime_config=None, ...))]) but the stub lib/bindings/python/src/dynamo/_core.pyi omits it. Update the .pyi or regenerate the Python bindings so async def register_llm(..., runtime_config=...) matches the runtime; callers (e.g. components/backends/sglang/src/dynamo/sglang/register.py) currently pass runtime_config.
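
A hedged sketch of the missing stub entry (everything except runtime_config is illustrative; regenerate from the pyo3 signature rather than hand-editing if possible):

# lib/bindings/python/src/dynamo/_core.pyi (illustrative shape)
async def register_llm(
    model_input: ModelInput,
    model_type: ModelType,
    *,
    kv_cache_block_size: int | None = None,
    migration_limit: int = 0,
    runtime_config: dict | None = None,  # the parameter missing from the current stub
) -> None: ...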

components/backends/sglang/src/dynamo/sglang/request_handlers/decode_handler.py (2)

69-81: Bug: request is a JSON string; decode and pass nested fields; fix payload sent to prefill

Current code passes the raw string into helpers and forwards a string under "request" to the prefill worker, which expects a dict (prefill_handler.generate uses req["request"]). Decode first; pass req["request"] to _get_input_param; send a structured dict to the prefill worker.

-    async def generate(self, request: str):
-        sampling_params = self._build_sampling_params(request)
-        input_param = self._get_input_param(request)
+    async def generate(self, request: str):
+        req = msgspec.json.decode(request, type=dict) if isinstance(request, (str, bytes, bytearray)) else request
+        sampling_params = self._build_sampling_params(req)
+        input_param = self._get_input_param(req["request"])
@@
-            prefill_stream = await self.prefill_client.generate(
-                DisaggPreprocessedRequest(
-                    request=request,
-                    sampling_params=sampling_params,
-                ).model_dump_json()
-            )
+            prefill_stream = await self.prefill_client.generate(
+                DisaggPreprocessedRequest(
+                    request=req["request"],
+                    sampling_params=sampling_params,
+                ).model_dump_json()
+            )

Also applies to: 90-97


137-166: Remove debug print (PII risk) and make text streaming multi-choice safe

  • print(res) can leak user content. Use debug logging or drop it.
  • Track emitted lengths per index to support multiple choices; include "role" only on first chunk per OpenAI convention.
-    async def _process_text_stream(self, stream_source):
-        """Process stream for text input mode"""
-        count = 0  
-
-        async for res in stream_source:
-            print(res)
-            index = res.get("index", 0)
-            text = res.get("text", "")
-            
-            finish_reason = res["meta_info"]["finish_reason"]
-            finish_reason_type = finish_reason["type"] if finish_reason else None
-            next_count = len(text)
-            delta = text[count:]
-            
-            choice_data = {
-                "index": index,
-                "delta": {"role": "assistant", "content": delta},
-                "finish_reason": finish_reason_type,
-            }
-            
-            response = {
-                "id": res["meta_info"]["id"],
-                "created": int(time.time()),
-                "choices": [choice_data],
-                "model": self.config.server_args.served_model_name,
-                "object": "chat.completion.chunk",
-            }
-            yield response
-            count = next_count
+    async def _process_text_stream(self, stream_source):
+        """Process stream for text input mode"""
+        emitted_len_by_index: dict[int, int] = {}
+
+        async for res in stream_source:
+            index = res.get("index", 0)
+            text = res.get("text", "") or ""
+
+            finish_reason = res["meta_info"]["finish_reason"]
+            finish_reason_type = finish_reason["type"] if finish_reason else None
+
+            prev_len = emitted_len_by_index.get(index, 0)
+            next_len = len(text)
+            delta_text = text[prev_len:]
+
+            delta_payload = {"content": delta_text}
+            if prev_len == 0:
+                delta_payload["role"] = "assistant"
+
+            choice_data = {
+                "index": index,
+                "delta": delta_payload,
+                "finish_reason": finish_reason_type,
+            }
+
+            response = {
+                "id": res["meta_info"]["id"],
+                "created": int(time.time()),
+                "choices": [choice_data],
+                "model": self.config.server_args.served_model_name,
+                "object": "chat.completion.chunk",
+            }
+            yield response
+            emitted_len_by_index[index] = next_len
🧹 Nitpick comments (5)
components/backends/sglang/launch/disagg.sh (1)

29-33: Parameterize host/port; consider managing the decode PID as well.

  • Hardcoding 0.0.0.0 and 12345 is brittle and can be unsafe in shared environments. Make them env‑overridable.
  • Decode runs in the foreground and isn’t included in cleanup. Either background it and track DECODE_PID or document the intent.

Apply this diff to improve resiliency and cleanup:

@@
 #!/bin/bash
+set -Eeuo pipefail
@@
 # run prefill worker
 python3 -m dynamo.sglang \
@@
-  --disaggregation-bootstrap-port 12345 \
-  --host 0.0.0.0 \
+  --disaggregation-bootstrap-port "${BOOTSTRAP_PORT:-12345}" \
+  --host "${HOST:-0.0.0.0}" \
   --disaggregation-transfer-backend nixl \
   --skip-tokenizer-init &
 PREFILL_PID=$!
@@
-# run decode worker
-CUDA_VISIBLE_DEVICES=1 python3 -m dynamo.sglang \
+# run decode worker
+CUDA_VISIBLE_DEVICES=1 python3 -m dynamo.sglang \
@@
-  --disaggregation-bootstrap-port 12345 \
-  --host 0.0.0.0 \
+  --disaggregation-bootstrap-port "${BOOTSTRAP_PORT:-12345}" \
+  --host "${HOST:-0.0.0.0}" \
   --disaggregation-transfer-backend nixl \
-  --skip-tokenizer-init
+  --skip-tokenizer-init &
+DECODE_PID=$!
@@
 cleanup() {
   echo "Cleaning up background processes..."
-  kill $DYNAMO_PID $PREFILL_PID 2>/dev/null || true
-  wait $DYNAMO_PID $PREFILL_PID 2>/dev/null || true
+  kill $DYNAMO_PID $PREFILL_PID ${DECODE_PID:-} 2>/dev/null || true
+  wait $DYNAMO_PID $PREFILL_PID ${DECODE_PID:-} 2>/dev/null || true
   echo "Cleanup complete."
 }

If keeping decode in the foreground by design, please confirm signal/cleanup behavior (Ctrl‑C on the script should also terminate prefill and ingress).

Also applies to: 43-46

components/backends/sglang/src/dynamo/sglang/register.py (1)

48-50: Log full traceback on registration failure.

Use logging.exception to capture stack traces.

-    except Exception as e:
-        logging.error(f"Failed to register with runtime config: {e}")
+    except Exception:
+        logging.exception("Failed to register with runtime config")
         return False
components/backends/sglang/src/dynamo/sglang/protocol.py (1)

4-5: Fix import order (pre-commit isort failure).

Move third‑party imports after stdlib and keep a blank line between groups.

-from typing import List, Optional, Union
-from sglang.srt.entrypoints.openai.protocol import ChatCompletionRequest
+from typing import List, Optional, Union
+
+from pydantic import BaseModel, Field
+from sglang.srt.entrypoints.openai.protocol import ChatCompletionRequest
-
-from pydantic import BaseModel, Field

Run: pre-commit run -a

components/backends/sglang/src/dynamo/sglang/request_handlers/decode_handler.py (2)

44-67: Make sampling param builder robust to multiple frontend shapes

  • If sampling_params already provided, use it directly.
  • Token format: accept either max_new_tokens or max_tokens.
  • OpenAI format: forward stop if present.
     def _build_sampling_params(self, request: dict) -> dict:
-        """Build sampling params depending on request from frontend"""
-        if self.skip_tokenizer_init:
+        """Build sampling params depending on request from frontend"""
+        # If already normalized by caller/aggregator, just pass through.
+        if "sampling_params" in request and isinstance(request["sampling_params"], dict):
+            return request["sampling_params"]
+        if self.skip_tokenizer_init:
             # Token-based request format
             sampling_opts = request.get("sampling_options", {})
             stop_conditions = request.get("stop_conditions", {})
             
             param_mapping = {
                 "temperature": sampling_opts.get("temperature"),
                 "top_p": sampling_opts.get("top_p"), 
                 "top_k": sampling_opts.get("top_k"),
-                "max_new_tokens": stop_conditions.get("max_tokens"),
+                "max_new_tokens": stop_conditions.get("max_new_tokens", stop_conditions.get("max_tokens")),
                 "ignore_eos": stop_conditions.get("ignore_eos"),
             }
         else:
             # OpenAI request format
             param_mapping = {
                 "temperature": request.get("temperature"),
                 "top_p": request.get("top_p"),
                 "top_k": request.get("top_k"), 
                 "max_new_tokens": request.get("max_tokens"),
+                "stop": request.get("stop"),
             }
         
         return {k: v for k, v in param_mapping.items() if v is not None}

125-134: Preserve original KeyError context and trim message

Raise from the caught KeyError to satisfy Ruff B904 and shorten message per TRY003.

-                except KeyError:
-                    raise ValueError(
-                        f"Missing 'output_ids' in response. Response keys: {list(res.keys())}"
-                    )
+                except KeyError as err:
+                    raise ValueError("Missing 'output_ids' in stream response") from err
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 10bfb73 and e858d16.

📒 Files selected for processing (7)
  • components/backends/sglang/launch/disagg.sh (2 hunks)
  • components/backends/sglang/src/dynamo/sglang/args.py (0 hunks)
  • components/backends/sglang/src/dynamo/sglang/protocol.py (2 hunks)
  • components/backends/sglang/src/dynamo/sglang/register.py (1 hunks)
  • components/backends/sglang/src/dynamo/sglang/request_handlers/decode_handler.py (3 hunks)
  • components/backends/sglang/src/dynamo/sglang/request_handlers/handler_base.py (1 hunks)
  • components/backends/sglang/src/dynamo/sglang/request_handlers/prefill_handler.py (1 hunks)
💤 Files with no reviewable changes (1)
  • components/backends/sglang/src/dynamo/sglang/args.py
🧰 Additional context used
🧬 Code graph analysis (4)
components/backends/sglang/src/dynamo/sglang/request_handlers/prefill_handler.py (1)
components/backends/sglang/src/dynamo/sglang/request_handlers/handler_base.py (1)
  • _get_input_param (39-52)
components/backends/sglang/src/dynamo/sglang/request_handlers/handler_base.py (2)
components/backends/sglang/src/dynamo/sglang/request_handlers/decode_handler.py (2)
  • generate (69-116)
  • cleanup (39-42)
components/backends/sglang/src/dynamo/sglang/request_handlers/prefill_handler.py (2)
  • generate (49-72)
  • cleanup (30-33)
components/backends/sglang/src/dynamo/sglang/register.py (2)
lib/bindings/python/src/dynamo/_core.pyi (3)
  • ModelInput (844-846)
  • ModelType (848-850)
  • register_llm (864-878)
lib/bindings/python/rust/lib.rs (1)
  • register_llm (146-215)
components/backends/sglang/src/dynamo/sglang/request_handlers/decode_handler.py (2)
components/backends/sglang/src/dynamo/sglang/request_handlers/prefill_handler.py (1)
  • generate (49-72)
components/backends/sglang/src/dynamo/sglang/request_handlers/handler_base.py (2)
  • generate (33-34)
  • _get_input_param (39-52)
🪛 GitHub Actions: Pre Merge Validation of (ai-dynamo/dynamo/refs/pull/3134/merge) by ishandhanani.
components/backends/sglang/src/dynamo/sglang/protocol.py

[error] 2-9: pre-commit: isort reformatted protocol.py (import order).

components/backends/sglang/src/dynamo/sglang/request_handlers/handler_base.py

[error] 42-52: pre-commit: black reformatted handler_base.py (code formatting).

components/backends/sglang/src/dynamo/sglang/request_handlers/decode_handler.py

[error] 47-56: pre-commit: black reformatted decode_handler.py (region 1).


[error] 60-69: pre-commit: black reformatted decode_handler.py (region 2).

🪛 Ruff (0.12.2)
components/backends/sglang/src/dynamo/sglang/request_handlers/handler_base.py

36-37: BaseWorkerHandler.cleanup is an empty method in an abstract base class, but has no abstract decorator

(B027)
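
Two hedged ways to resolve B027 on the base class (class and method names follow the handler discussed above; pick one):

from abc import ABC, abstractmethod

class BaseWorkerHandler(ABC):
    @abstractmethod
    async def generate(self, request): ...

    # Option A: make cleanup abstract so every subclass implements it.
    @abstractmethod
    def cleanup(self) -> None: ...

    # Option B: keep a deliberate no-op default and suppress the finding:
    # def cleanup(self) -> None:  # noqa: B027
    #     """Optional hook; subclasses override to release resources."""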

components/backends/sglang/src/dynamo/sglang/request_handlers/decode_handler.py

129-131: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


129-131: Avoid specifying long messages outside the exception class

(TRY003)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Build and Test - dynamo
🔇 Additional comments (7)
components/backends/sglang/src/dynamo/sglang/register.py (2)

27-35: Tokenizer mode switch logic: LGTM.

Defaulting to Tokens|Chat+Completions and downgrading to Text|Chat when skip_tokenizer_init=False is coherent with the runtime.

Please confirm that v1/completions is intentionally unavailable when Text mode is selected (skip_tokenizer_init=False).


35-45: Pass-through of runtime and cache params: LGTM.

Using page_size for kv_cache_block_size and wiring migration_limit/runtime_config looks correct.

components/backends/sglang/src/dynamo/sglang/request_handlers/prefill_handler.py (1)

61-65: Switch to _get_input_param and kwargs expansion: LGTM.

This cleanly supports both token and text inputs.

components/backends/sglang/src/dynamo/sglang/protocol.py (1)

47-50: Union request type: LGTM.

Allowing PreprocessedRequest or ChatCompletionRequest aligns with the new input routing.

Confirm downstream consumers only rely on “token_ids” or “messages” accordingly.

components/backends/sglang/src/dynamo/sglang/request_handlers/handler_base.py (2)

30-30: Source of truth for skip_tokenizer_init: LGTM.

Reading from config.server_args is consistent with CLI behavior.


42-52: Black formatting failure.

Pre-commit reformatted this block; run pre-commit to stabilize formatting.

Run: pre-commit run -a

components/backends/sglang/src/dynamo/sglang/request_handlers/decode_handler.py (1)

47-56: Run pre-commit (Black) locally and commit formatting changes

CI flagged Black reformatting in components/backends/sglang/src/dynamo/sglang/request_handlers/decode_handler.py (lines 47–56 and 60–69). Locally run:

git config --global user.email "[email protected]"
git config --global user.name "Your Name"
pip install pre-commit
pre-commit install
pre-commit run -a
git add -A && git commit -m "chore: apply black to decode_handler.py"

@ishandhanani
Contributor Author

ishandhanani commented Sep 19, 2025

@coderabbitai can you do a thorough analysis of whether we will hit any performance snags on the skip-tokenizer-init path?

@coderabbitai
Contributor

coderabbitai bot commented Sep 19, 2025

Tip

For best results, initiate chat on the files or code changes.

🧩 Analysis chain

I'll conduct a thorough performance analysis of the skip-tokenizer-init path. Let me examine the codebase to understand the implementation details and identify potential performance implications.

@ishandhanani: Based on my analysis of the codebase, here's a comprehensive performance analysis of the skip-tokenizer-init path:

Performance Analysis Summary

Skip-Tokenizer-Init Path (skip_tokenizer_init=True)

Advantages:

  1. Eliminates Tokenization Overhead - Bypasses SGLang's tokenizer completely, using pre-tokenized token_ids directly
  2. Reduced CPU Usage - No text-to-token conversion happening in the inference pipeline
  3. Deterministic Input Size - Token arrays have predictable memory footprint
  4. Faster Input Processing - Direct token passing via {"input_ids": request["token_ids"]}

Potential Performance Concerns:

1. Memory Serialization Overhead

The token_ids arrays need to be serialized/deserialized across network boundaries and process boundaries. Large token sequences could create:

  • Higher network bandwidth usage
  • Increased serialization/deserialization CPU cost
  • Larger memory footprint during transport
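
A rough, self-contained way to gauge that size cost (values are made up; real ratios depend on the tokenizer and content):

import json

text = "Explain KV-cache block sizing in one paragraph. " * 40
token_ids = list(range(1000))  # stand-in for real tokenizer output

text_payload = json.dumps({"messages": [{"role": "user", "content": text}]})
token_payload = json.dumps({"token_ids": token_ids})

# JSON-encoded int arrays often run 5-7 bytes per token, which can exceed
# the UTF-8 text they encode -- the transport cost flagged above.
print(len(text_payload), len(token_payload))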

2. Streaming Performance Split

The code introduces two separate streaming paths:

# Token streaming - processes output_ids arrays
async def _process_token_stream(self, stream_source): ...

# Text streaming - constructs chat completion responses
async def _process_text_stream(self, stream_source): ...

The token streaming path has potential issues:

  • Array slicing operations on every stream update: res["output_ids"][self.num_output_tokens_so_far:]
  • Growing memory usage with num_output_tokens_so_far tracking
  • Additional array operations vs. simple text concatenation

3. Protocol Layer Overhead

The DisaggPreprocessedRequest now supports Union[PreprocessedRequest, ChatCompletionRequest], which:

  • Adds runtime type checking overhead
  • Increases serialization complexity
  • May cause cache misses in type dispatch
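
The dispatch cost can be seen in a minimal pydantic model (the two request classes here are stand-ins, not the real protocol types):

from typing import Union
from pydantic import BaseModel

class PreprocessedRequest(BaseModel):  # stand-in
    token_ids: list[int]

class ChatCompletionRequest(BaseModel):  # stand-in
    messages: list[dict]

class DisaggPreprocessedRequest(BaseModel):
    request: Union[PreprocessedRequest, ChatCompletionRequest]

# Pydantic tries union members until one validates; a discriminated union
# (see the recommendation further below) makes this a single tag lookup.
req = DisaggPreprocessedRequest(request={"token_ids": [1, 2, 3]})
print(type(req.request).__name__)  # PreprocessedRequest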

4. Debug Overhead

The code includes print(res) statements in the streaming path that should be removed for production:

async def _process_text_stream(self, stream_source):
    async for res in stream_source:
        print(res)  # This should be removed

5. Engine Integration

The **input_param expansion in both prefill and decode handlers adds a small overhead vs. direct parameter passing, though this is likely negligible.
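
A quick check that the expansion really is negligible (illustrative micro-benchmark):

import timeit

def generate(input_ids=None, prompt=None):
    pass

ids = [1, 2, 3]
param = {"input_ids": ids}

print("**kwargs:", timeit.timeit(lambda: generate(**param), number=1_000_000))
print("direct:  ", timeit.timeit(lambda: generate(input_ids=ids), number=1_000_000))
# Both are sub-microsecond per call; the difference is noise next to inference.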

Recommendations for Performance Optimization

  1. Remove Debug Prints - The print(res) statements should be removed or gated behind debug flags

  2. Optimize Token Stream Processing - Consider using iterative approaches instead of array slicing for large sequences

  3. Memory Pool for Token Arrays - Pre-allocate token array buffers to reduce GC pressure

  4. Benchmark Both Paths - Run performance tests comparing:

    • Small sequences (< 512 tokens)
    • Medium sequences (512-2048 tokens)
    • Large sequences (> 2048 tokens)
  5. Profile Serialization - Measure the actual network/serialization overhead of token arrays vs. text
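
For recommendation 5, a starting point could be a micro-benchmark like this (payload shapes are assumptions):

import json, time

token_ids = list(range(2048))
text = "lorem ipsum " * 400

def bench(payload, n=1000):
    start = time.perf_counter()
    for _ in range(n):
        json.dumps(payload)
    return time.perf_counter() - start

print("token array:", bench({"token_ids": token_ids}))
print("text:       ", bench({"text": text}))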

Overall Assessment

The skip-tokenizer-init path should generally be faster for CPU-bound workloads since it eliminates tokenization overhead. However, it may have higher memory and network costs for large sequences. The performance trade-off will depend heavily on:

  • Input sequence lengths
  • Network bandwidth between disaggregated components
  • Available CPU vs. memory/network resources
  • Tokenizer complexity (some tokenizers are much more expensive than others)

I recommend running benchmarks with your typical workload patterns to validate these theoretical performance characteristics.


🏁 Script executed:

#!/bin/bash
# First, let's examine the key files to understand the tokenization paths
echo "=== Examining BaseWorkerHandler._get_input_param implementation ==="
cat components/backends/sglang/src/dynamo/sglang/request_handlers/handler_base.py

echo -e "\n=== Examining DecodeWorkerHandler streaming implementations ==="  
cat components/backends/sglang/src/dynamo/sglang/request_handlers/decode_handler.py

echo -e "\n=== Examining PrefillWorkerHandler changes ==="
cat components/backends/sglang/src/dynamo/sglang/request_handlers/prefill_handler.py

Length of output: 10644


🏁 Script executed:

#!/bin/bash
# Let's also examine the protocol changes and registration logic
echo "=== Protocol changes - examining data structures ==="
cat components/backends/sglang/src/dynamo/sglang/protocol.py

echo -e "\n=== Registration logic changes ==="
cat components/backends/sglang/src/dynamo/sglang/register.py

echo -e "\n=== Args parsing changes ==="
cat components/backends/sglang/src/dynamo/sglang/args.py

Length of output: 12938


🏁 Script executed:

#!/bin/bash
# Let's look for any SGLang tokenizer manager related code and imports
echo "=== Finding tokenizer-related imports and usage ==="
rg -n "tokenizer" --type py components/backends/sglang/

echo -e "\n=== Looking for performance-critical patterns ==="
rg -n "async|await" --type py components/backends/sglang/src/dynamo/sglang/request_handlers/

echo -e "\n=== Checking for any timing or performance logging ==="
rg -n "time\.|performance|latency|duration" --type py components/backends/sglang/

Length of output: 19218


🏁 Script executed:

#!/bin/bash
# Let's examine the disaggregation launch script changes
echo "=== Disaggregation launch script ==="
cat components/backends/sglang/launch/disagg.sh

echo -e "\n=== Looking for any batch processing or memory allocation patterns ==="
rg -n -A3 -B3 "batch|allocation|memory" --type py components/backends/sglang/src/dynamo/sglang/

echo -e "\n=== Checking for any serialization/deserialization code ==="
rg -n -A3 -B3 "serialize|deserialize|json|pickle" --type py components/backends/sglang/src/dynamo/sglang/

Length of output: 4174


Based on my thorough analysis of the codebase, here's a comprehensive performance assessment of the skip-tokenizer-init path:

Detailed Performance Analysis

Skip-Tokenizer-Init=True Path Performance Profile

🚀 Performance Advantages:

  1. Eliminates Tokenizer Overhead

    # BYPASSED when skip_tokenizer_init=True
    prompt = self.engine.tokenizer_manager.tokenizer.apply_chat_template(
        request["messages"], tokenize=False, add_generation_prompt=True
    )
    • No chat template processing
    • No tokenizer access through engine.tokenizer_manager.tokenizer
    • Direct token array usage: {"input_ids": request["token_ids"]}
  2. Simplified Input Processing

    • Single dict lookup vs. complex chat templating
    • Predictable memory footprint from token arrays

🔍 Identified Performance Bottlenecks

1. Critical: Array Slicing in Token Stream Processing

async def _process_token_stream(self, stream_source):
    num_output_tokens_so_far = 0
    async for res in stream_source:
        # PERFORMANCE BOTTLENECK: Array slicing on every iteration
        out = {"token_ids": res["output_ids"][num_output_tokens_so_far:]}
        num_output_tokens_so_far = next_total_toks

Impact: each stream chunk carries the full cumulative output_ids array, so per-chunk handling cost grows with the total output length; over a long stream this aggregates to quadratic copying, even though the suffix slice itself only copies the new tokens.
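
A toy model of that cost (the cumulative list(range(...)) stands in for the engine re-sending the full output_ids on every chunk):

import time

def stream_cost(total_tokens, chunk=16):
    sent = 0
    start = time.perf_counter()
    for i in range(0, total_tokens, chunk):
        output_ids = list(range(i + chunk))  # cumulative array, rebuilt per chunk
        _delta = output_ids[sent:]           # the slice copies only the suffix
        sent = len(output_ids)
    return time.perf_counter() - start

for n in (1_000, 10_000, 100_000):
    print(n, f"{stream_cost(n):.4f}s")
# Runtime grows roughly with n^2: the quadratic term comes from materializing
# the cumulative array, not from the suffix slice.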

2. Serialization Overhead in Disaggregation

# Heavy serialization for disaggregated requests
prefill_stream = await self.prefill_client.generate(
    DisaggPreprocessedRequest(
        request=request,
        sampling_params=sampling_params,
    ).model_dump_json()  # PERFORMANCE COST: Pydantic serialization
)

Impact:

  • Union[PreprocessedRequest, ChatCompletionRequest] type requires runtime dispatch
  • Full request serialization across network boundaries
  • Larger payload sizes for token arrays vs. text strings

3. Debug Overhead (Production Issue)

def _get_input_param(self, request: dict) -> dict:
    print(request)  # REMOVE FOR PRODUCTION
    
async def _process_text_stream(self, stream_source):
    async for res in stream_source:
        print(res)  # REMOVE FOR PRODUCTION

Impact: Synchronous I/O operations in async context, potential logging buffer overhead.

4. JSON Processing Overhead

# In prefill handler
req = msgspec.json.decode(request, type=dict)  # Fast but still overhead

Impact: While msgspec is fast, still adds deserialization cost for every request.
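
To size that overhead, one could compare msgspec against the stdlib on a representative payload (payload shape is an assumption):

import json, time
import msgspec

payload = json.dumps({"token_ids": list(range(2048)), "sampling_params": {}}).encode()

t0 = time.perf_counter()
for _ in range(10_000):
    msgspec.json.decode(payload, type=dict)
print("msgspec:", time.perf_counter() - t0)

t0 = time.perf_counter()
for _ in range(10_000):
    json.loads(payload)
print("stdlib: ", time.perf_counter() - t0)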

📊 Performance Comparison Analysis

| Aspect | skip_tokenizer_init=True | skip_tokenizer_init=False |
| --- | --- | --- |
| Input processing | Direct token lookup (fast) | Chat template + tokenization (slow) |
| Memory usage | Higher (token arrays) | Lower (text strings) |
| Network I/O | Higher bandwidth | Lower bandwidth |
| CPU usage | Lower (no tokenization) | Higher (tokenization) |
| Streaming | Cumulative-array copying overhead | Text concatenation (efficient) |

🎯 Specific Performance Recommendations

Critical Fixes:

  1. Optimize Token Stream Processing

    # Track emitted length and skip chunks with no new tokens (the suffix slice copies only the delta):
    async def _process_token_stream(self, stream_source):
        last_output_len = 0
        async for res in stream_source:
            current_tokens = res["output_ids"]
            if len(current_tokens) > last_output_len:
                new_tokens = current_tokens[last_output_len:]
                yield {"token_ids": new_tokens}
                last_output_len = len(current_tokens)
  2. Remove Debug Statements

    # Replace print statements with conditional logging
    if logging.getLogger().isEnabledFor(logging.DEBUG):
        logging.debug(f"Request: {request}")

Architecture Optimizations:

  1. Streaming Buffer Pool - Pre-allocate token buffers to reduce GC pressure

  2. Lazy Serialization - Only serialize request fields that changed for disaggregation

  3. Type Hint Optimization - Consider separate request classes instead of Union types
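
For item 3, a discriminated union keeps one wire type while restoring O(1) dispatch (the kind tag and class names here are hypothetical):

from typing import Annotated, Literal, Union
from pydantic import BaseModel, Field

class TokenRequest(BaseModel):
    kind: Literal["tokens"] = "tokens"
    token_ids: list[int]

class ChatRequest(BaseModel):
    kind: Literal["chat"] = "chat"
    messages: list[dict]

class DisaggRequest(BaseModel):
    # Pydantic dispatches on "kind" directly instead of trying each member.
    request: Annotated[Union[TokenRequest, ChatRequest], Field(discriminator="kind")]

req = DisaggRequest.model_validate(
    {"request": {"kind": "chat", "messages": [{"role": "user", "content": "hi"}]}}
)
print(type(req.request).__name__)  # ChatRequest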

⚡ Expected Performance Characteristics

Skip-tokenizer-init=True will be faster when:

  • Tokenization is expensive (large vocabulary, complex chat templates)
  • CPU is the bottleneck vs. memory/network
  • Small to medium output sequences (< 2048 tokens)

Skip-tokenizer-init=False may be faster when:

  • Network bandwidth is limited
  • Very long output sequences (> 4096 tokens)
  • Memory pressure is high
  • Tokenization is lightweight

🔬 Benchmarking Recommendations

Test both paths with:

  1. Sequence lengths: 128, 512, 1024, 2048, 4096 tokens
  2. Concurrent requests: 1, 10, 50, 100
  3. Network conditions: local, cross-AZ, limited bandwidth
  4. Different models: fast vs. slow tokenizers
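
A minimal harness for that matrix might look like this (URL, model name, and payload shape are assumptions; any async HTTP client works):

import asyncio, time
import aiohttp

URL = "http://localhost:8000/v1/chat/completions"  # hypothetical frontend

async def one(session, n_tokens):
    payload = {"model": "test-model", "max_tokens": n_tokens,
               "messages": [{"role": "user", "content": "benchmark"}]}
    async with session.post(URL, json=payload) as resp:
        await resp.read()

async def sweep():
    async with aiohttp.ClientSession() as session:
        for n in (128, 512, 1024, 2048, 4096):
            for conc in (1, 10, 50, 100):
                t0 = time.perf_counter()
                await asyncio.gather(*(one(session, n) for _ in range(conc)))
                print(f"tokens={n} conc={conc} {time.perf_counter() - t0:.2f}s")

asyncio.run(sweep())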

The performance winner will depend heavily on your specific deployment characteristics and workload patterns.

🐰✨
Looking thoughtfully at the performance analysis with focused rabbit determination


✏️ Learnings added
Learnt from: ishandhanani
URL: https://github.com/ai-dynamo/dynamo/pull/3134

Timestamp: 2025-09-19T07:32:44.200Z
Learning: The skip_tokenizer_init=True path in SGLang backend bypasses tokenization but has array slicing overhead in _process_token_stream that creates O(n) memory copying on every stream chunk, potentially causing quadratic behavior for long sequences.

Learnt from: ishandhanani
URL: https://github.com/ai-dynamo/dynamo/pull/3134

Timestamp: 2025-09-19T07:32:44.200Z
Learning: The SGLang backend has debug print statements in _get_input_param() and _process_text_stream() methods that should be removed for production as they cause synchronous I/O in async contexts.

Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.

@ishandhanani
Contributor Author

/ok to test 99ba666

@rmccorm4
Contributor

/ok to test a33fbe9

Contributor

@rmccorm4 rmccorm4 left a comment


Approving to unblock, but we probably want to lean towards a uniform CLI arg across backends for specifying that the backend will do the preprocessing, similar to other Dynamo-specific CLI args that don't go through the framework argparser.

@ishandhanani
Contributor Author

/ok to test f27c5e5

@ishandhanani ishandhanani enabled auto-merge (squash) September 24, 2025 04:14
@ishandhanani ishandhanani merged commit f9be2e9 into main Sep 24, 2025
14 of 15 checks passed
@ishandhanani ishandhanani deleted the ishan/tokenizer-affects-input branch September 24, 2025 04:47
jasonqinzhou pushed a commit that referenced this pull request Sep 24, 2025
athreesh pushed a commit that referenced this pull request Sep 24, 2025
jasonqinzhou pushed a commit that referenced this pull request Sep 24, 2025
kylehh pushed a commit that referenced this pull request Sep 25, 2025
