feat(server): add serve log-level flag #57

Open
XiaoPengMei wants to merge 1 commit into raullenchai:main from XiaoPengMei:add-log-level-flag-50

@XiaoPengMei

Closes #50

Summary

  • add --log-level to both rapid-mlx serve and python -m vllm_mlx.server
  • apply the selected level to Python logging and pass the normalized value through to uvicorn.run
  • add focused tests that verify both entrypoints expose the new flag

Testing

  • pytest tests/test_harmony_parsers.py -k log_level
  • manual QA: invoked serve_command() with --log-level WARNING using a stubbed server and verified uvicorn.run(..., log_level='warning') plus root logger level 30

raullenchai pushed a commit that referenced this pull request Mar 26, 2026
…waybarrios#180)

* feat: MLLM+MTP per-request routing for text and vision

When both --mllm and --enable-mtp are set, SimpleEngine builds a
parallel mlx_lm TextModel sharing the VLM backbone weights (zero-copy).
Text-only requests route to mlx_lm with MTP speculative decoding;
media requests route to the mlx_vlm MLLM path.

Key components:
- text_model_from_vlm.py: Build mlx_lm TextModel from VLM weights
- Per-request routing in stream_chat() via _has_media_content()
- _stream_generate_text() for MTP-accelerated text generation
- MTP passthrough: --enable-mtp flag through CLI/server/engine/LLM

Tested on Qwen3.5-35B-A3B VLM+MTP (8-bit):
- Text (MTP): 65.3 tok/s
- Vision (MLLM): 63.8 tok/s
- Memory: 38.7 GB (zero-copy, same as single model)

* feat: system prompt KV caching for SimpleEngine MTP text path

Persist backbone KV cache after prefilling system prompt tokens.
On subsequent requests with the same system prompt, restore the
snapshot and only prefill the suffix (user + history) tokens.

For a 10K-token system prompt on the 122B model, this saves ~57s
per request by avoiding redundant system prompt prefill.

Implementation:
- Detect system prefix via ChatML boundary markers
- Hash prefix text for cache key validation
- On cache miss: prefill system tokens, snapshot backbone KV state
- On cache hit: restore snapshot into fresh cache, send suffix only
- Token prefix validation ensures correct split at tokenization boundary
- Single-entry cache (one system prompt at a time)
- Stats exposed via get_stats() → system_kv_cache
- Cache cleared on stop(), invalidated on system prompt change
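
The single-entry behaviour described above can be sketched as follows. This is only an illustration: `SystemPromptCache`, its methods, and the opaque `snapshot` object are hypothetical names, not the ones used in SimpleEngine, and the real implementation also validates the token prefix.

```python
import hashlib
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class _Entry:
    key: str        # hash of the system prefix text
    snapshot: Any   # stand-in for the saved backbone KV state


class SystemPromptCache:
    """Hypothetical single-entry cache keyed by a hash of the system prefix."""

    def __init__(self) -> None:
        self._entry: Optional[_Entry] = None
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(system_text: str) -> str:
        return hashlib.sha256(system_text.encode("utf-8")).hexdigest()

    def lookup(self, system_text: str) -> Optional[Any]:
        """Return the stored snapshot on a hit, or None on a miss."""
        if self._entry is not None and self._entry.key == self._key(system_text):
            self.hits += 1
            return self._entry.snapshot
        self.misses += 1
        return None

    def store(self, system_text: str, snapshot: Any) -> None:
        """Replace whatever was cached; only one system prompt is kept."""
        self._entry = _Entry(self._key(system_text), snapshot)

    def clear(self) -> None:
        """Drop the entry, e.g. when the engine stops."""
        self._entry = None
```

Storing a snapshot for a different system prompt overwrites the previous one, which is what "invalidated on system prompt change" amounts to in a single-entry design.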

* feat: SpecPrefill — attention-based sparse prefill for TTFT reduction

Uses a small draft model to identify important prompt tokens via attention
scoring, then sparse-prefills the target model with only those tokens while
preserving original positional encoding via manual RoPE. Reduces TTFT
2.8-3.1x on 122B and 1.8x on 35B at 20% keep rate.

Implementation:
- specprefill.py: Core module with score_tokens(), select_chunks(),
  sparse_prefill(), cleanup_rope() (~640 lines)
- SimpleEngine integration: draft model loading, threshold-based activation,
  composition with system prompt KV cache, graceful fallback on error
- Per-request API: specprefill (bool) + specprefill_keep_pct (float)
  via extra_body for per-request control
- CLI: --specprefill, --specprefill-threshold, --specprefill-keep-pct,
  --specprefill-draft-model, --prefill-step-size

Closes waybarrios#179. Related: waybarrios#178 (TTFT), #57 (speculative decoding).

* feat: multi-architecture support for SpecPrefill scoring and sparse prefill

Add support for three model architecture families with auto-detection:

- Qwen3.5: gate split + q_norm + RoPE (existing, now refactored)
- Nemotron-H: content-based attention (no RoPE), mixer attr, compacted cache
- GPT-OSS/Llama: standard q_proj + RoPE (GQA, YarnRoPE compatible)

Key changes:
- Architecture-specific query extractors (_qwen35, _llama, _nemotron_h)
- Auto-detection in score_tokens() via model attributes (q_norm/rope/mixer)
- _get_attn_module()/_set_attn_module() abstract self_attn vs mixer access
- _find_attention_layers() handles block_type="*" (Nemotron-H attention)
- _build_layer_to_cache_map() handles compacted cache indexing
- sparse_prefill() skips RoPE patching for architectures without it
- cleanup_rope() is no-op for RoPE-less architectures
- Remove score_tokens_self() stub (CritiPrefill not viable for MoE)

Tested on Qwen3.5 4B (positions + pipeline). Nemotron-H and GPT-OSS
code paths ready for empirical validation.

* fix: handle GPT-OSS sliding window caches and head attribute naming

Two bugs found during cross-architecture testing on GPT-OSS 120B:

1. _llama_extract_queries() used eager evaluation in getattr fallback
   chain: getattr(attn, "num_attention_heads", attn.num_heads) evaluates
   attn.num_heads before checking if num_attention_heads exists. Fixed to
   use safe nested getattr with None default.

2. _compute_importance() concatenated score matrices with different
   shapes when mixing sliding window (128-token RotatingKVCache) and
   full attention (unlimited KVCache) layers. Fixed by skipping layers
   whose cache spans fewer tokens than the full prompt.
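
The first bug is a general Python pitfall: the default argument to `getattr` is evaluated before `getattr` is called. A minimal reproduction, using a hypothetical `Attn` class in place of the real attention module:

```python
class Attn:
    """Hypothetical attention module that only defines num_attention_heads."""

    num_attention_heads = 64


def heads_eager(attn):
    # attn.num_heads is evaluated first, so this raises AttributeError
    # even though num_attention_heads is present.
    return getattr(attn, "num_attention_heads", attn.num_heads)


def heads_safe(attn):
    # Nested lookups with a None default: the fallback attribute is only
    # read when the primary one is missing.
    heads = getattr(attn, "num_attention_heads", None)
    if heads is None:
        heads = getattr(attn, "num_heads", None)
    return heads
```

Here `heads_safe(Attn())` returns 64, while `heads_eager(Attn())` raises `AttributeError`.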

Validated on GPT-OSS 120B + 20B draft: importance-based selection
produces coherent output while uniform selection degrades, confirming
scoring signal from 18 full-attention layers is sufficient.

* fix: preserve tail tokens for models with RotatingKVCache

Models with sliding window attention (e.g., GPT-OSS alternating
sliding/full layers) use RotatingKVCache that evicts old entries.
When sparse prefill inserts more tokens than the window size, the
cache loses context needed for decode.

sparse_prefill() now auto-detects RotatingKVCache and augments the
selection to include the last max_size positions, ensuring sliding
window layers have valid recent context.
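
A rough sketch of that augmentation step, assuming the selection is a list of token positions. The helper name `augment_with_tail` and its parameters are illustrative only and do not come from specprefill.py:

```python
def augment_with_tail(selected, prompt_len, window_size):
    """Add the last `window_size` prompt positions to the selected indices.

    Hypothetical helper: `selected` holds the positions chosen by importance
    scoring, and `window_size` stands in for the rotating cache's max_size.
    """
    tail_start = max(0, prompt_len - window_size)
    tail = range(tail_start, prompt_len)
    # Union, then sort so positions stay in prompt order for prefill.
    return sorted(set(selected) | set(tail))
```

For example, `augment_with_tail([0, 3, 7], prompt_len=10, window_size=3)` returns `[0, 3, 7, 8, 9]`.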

Validated: GPT-OSS 120B + 20B draft produces coherent output on
2294-token prompts (was garbage before this fix). Qwen3.5 and
Nemotron-H unaffected (no RotatingKVCache in their cache).

* feat: SpecPrefill support for non-MTP models (standard LLM path)

Add _stream_generate_specprefill() method for models that don't use MTP
speculative decoding (Nemotron, GPT-OSS, etc). The existing SpecPrefill
integration only worked in the MTP text path (_stream_generate_text).

Changes:
- stream_generate() now pops specprefill/specprefill_keep_pct from kwargs
  and dispatches to the new method when conditions are met
- _stream_generate_specprefill() follows the same pattern as the MTP path:
  score → select → sparse_prefill → autoregressive generation
- Graceful fallback to normal generation on any error
- Per-request overrides (specprefill, specprefill_keep_pct) via extra_body
- Threshold and upper-bound checks identical to MTP path
@raullenchai (Owner) left a comment

Thanks for the PR, @XiaoPengMei! The feature itself is straightforward and welcome. A few things to address before merge:

Issues

P1: Lowercase input is rejected

choices=["DEBUG", "INFO", "WARNING", "ERROR"] means --log-level debug (lowercase) is rejected by argparse. Most CLI tools accept either case. Fix: add type=str.upper to the argument so any casing is accepted:

serve_parser.add_argument(
    "--log-level",
    type=str.upper,
    choices=["DEBUG", "INFO", "WARNING", "ERROR"],
    default="INFO",
    ...
)

This also makes normalize_log_level() unnecessary — argparse handles it.
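
A self-contained way to check this behaviour is sketched below. argparse applies `type` before it validates `choices`, so the uppercased value is what gets compared. `build_parser` is a hypothetical stand-in for the real serve parser, and the help text is made up for the example:

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical minimal parser; the real one has many more options.
    parser = argparse.ArgumentParser(prog="serve")
    parser.add_argument(
        "--log-level",
        type=str.upper,  # normalise casing before the choices check
        choices=["DEBUG", "INFO", "WARNING", "ERROR"],
        default="INFO",
        help="Logging verbosity (any casing accepted)",
    )
    return parser


args = build_parser().parse_args(["--log-level", "debug"])
print(args.log_level)  # prints DEBUG
```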

P2: Root logger modification is too broad

logging.getLogger().setLevel(...) changes the root logger, which affects every library (httpx, uvicorn internals, asyncio, etc.). At DEBUG this floods output with noise unrelated to vllm-mlx.

Instead, only set the vllm_mlx logger hierarchy:

def configure_logging(log_level: str) -> str:
    level = getattr(logging, log_level, logging.INFO)
    logging.getLogger("vllm_mlx").setLevel(level)
    return log_level.lower()  # uvicorn wants lowercase
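
To illustrate the scoping, here is a small runnable check. It assumes the package's modules create their loggers under the `vllm_mlx` namespace (for example via `logging.getLogger(__name__)`); the logger names `vllm_mlx.server` and `some_third_party_lib` are only examples:

```python
import logging


def configure_logging(log_level: str) -> str:
    # Only the vllm_mlx hierarchy is touched; the root logger keeps its level.
    level = getattr(logging, log_level.upper(), logging.INFO)
    logging.getLogger("vllm_mlx").setLevel(level)
    return log_level.lower()  # lowercase form for uvicorn


uvicorn_level = configure_logging("DEBUG")

# Child loggers inherit the level from the "vllm_mlx" parent.
child = logging.getLogger("vllm_mlx.server")
# Loggers outside the hierarchy still follow the root logger.
other = logging.getLogger("some_third_party_lib")

print(child.getEffectiveLevel() == logging.DEBUG)
print(other.getEffectiveLevel() == logging.getLogger().getEffectiveLevel())
```

One caveat with this approach: propagated records are filtered by handler levels rather than by the root logger's level, so a handler still has to be attached (for instance through `logging.basicConfig`) for DEBUG records to show up.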

P3: Tests are source-code grep, not behavioral

The tests inspect source code strings ('"--log-level"' in source). This passes even if the flag is broken at runtime. A better approach would be to actually parse args through argparse and verify the result. Also, these tests belong in a CLI/server test file, not test_harmony_parsers.py.
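
As a sketch of what behavioral tests could look like, the functions below exercise a parser directly. `build_serve_parser` is a hypothetical helper that stands in for however the project exposes its serve parser; real tests would import the actual one:

```python
import argparse


def build_serve_parser() -> argparse.ArgumentParser:
    # Hypothetical stand-in; replace with the project's real parser factory.
    parser = argparse.ArgumentParser(prog="serve")
    parser.add_argument(
        "--log-level",
        type=str.upper,
        choices=["DEBUG", "INFO", "WARNING", "ERROR"],
        default="INFO",
    )
    return parser


def test_log_level_accepts_lowercase():
    args = build_serve_parser().parse_args(["--log-level", "warning"])
    assert args.log_level == "WARNING"


def test_log_level_defaults_to_info():
    args = build_serve_parser().parse_args([])
    assert args.log_level == "INFO"


def test_log_level_rejects_unknown_value():
    # argparse reports invalid choices by exiting with status 2.
    try:
        build_serve_parser().parse_args(["--log-level", "verbose"])
    except SystemExit as exc:
        assert exc.code == 2
    else:
        raise AssertionError("expected argparse to reject 'verbose'")
```

Unlike a source-string check, these fail if the flag is renamed, mis-typed, or wired with the wrong choices.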

Minor

  • normalize_log_level() is just .upper() — can be removed if you use type=str.upper in argparse.
  • Missing blank line between configure_logging() and # Global engine instance comment.

Development

Successfully merging this pull request may close these issues.

Add --log-level CLI flag