Add periodic KV-canary stats logging and kernel-run-counter health check #26821

Merged
fzyzcjy merged 1 commit into main from
tom/pr_chain/tom/kv_canary_revert_reversed/add-periodic-kv-canary-stats-logging-and-kernel-run-counter-health-check
on May 31, 2026

@fzyzcjy
Collaborator

@fzyzcjy fzyzcjy commented May 31, 2026

Add two always-on KV-canary observability add-ons, both pure observers
of CanaryDeviceState counters and the sweep orchestrator (nothing
depends on their output):

  • Periodic stats logging
    (python/sglang/srt/kv_canary/runner/stats_logger.py): a
    PeriodicCanaryStatsLogger that prints, every N forward steps, the
    number of protected tokens, sweep passes, cumulative violations and
    the count of active launch tags, driven once per outer step via a
    delayed device-to-host (D2H) handler so the read never blocks the forward pass. Gated by
    CanaryConfig.stats_print_every_n_steps (env
    SGLANG_KV_CANARY_STATS_PRINT_EVERY_N_STEPS, 0 disables).

  • Kernel-run-counter health check
    (python/sglang/srt/kv_canary/runner/health_checker.py): a
    KernelRunCounterHealthChecker that watches the per-tag kernel run
    counters and warns when a canary kernel stops advancing (e.g. a tag
    silently goes un-launched), catching wiring regressions that would
    otherwise go unnoticed. Unit test in
    test/registered/kv_canary/test_self_unit_runner_health.py.
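
The stall detection behind the kernel-run-counter health check can be illustrated with a minimal sketch. This is not the code in health_checker.py: the class name `RunCounterStallDetector`, the `max_stalled_ticks` threshold, the logger name, and the plain-dict counter input are all assumptions made for illustration. It only shows the general idea of comparing per-tag counters between ticks and warning when one stops advancing.

```python
import logging
from typing import Dict, List

logger = logging.getLogger("kv_canary_sketch")  # hypothetical logger name


class RunCounterStallDetector:
    """Illustrative stand-in; not the PR's KernelRunCounterHealthChecker."""

    def __init__(self, max_stalled_ticks: int = 3) -> None:
        # Assumed policy: warn after this many consecutive ticks without progress.
        self._max_stalled_ticks = max_stalled_ticks
        self._last_seen: Dict[str, int] = {}
        self._stalled_ticks: Dict[str, int] = {}

    def tick(self, counters: Dict[str, int]) -> List[str]:
        """Compare counters with the previous tick; return tags considered stalled."""
        stalled: List[str] = []
        for tag, value in counters.items():
            previous = self._last_seen.get(tag)
            if previous is not None and value <= previous:
                self._stalled_ticks[tag] = self._stalled_ticks.get(tag, 0) + 1
            else:
                # First observation, or the counter advanced: reset the streak.
                self._stalled_ticks[tag] = 0
            self._last_seen[tag] = value
            if self._stalled_ticks[tag] >= self._max_stalled_ticks:
                logger.warning(
                    "canary tag %r has not advanced for %d ticks",
                    tag,
                    self._stalled_ticks[tag],
                )
                stalled.append(tag)
        return stalled
```

In this sketch the caller passes a host-side snapshot of the counters once per outer step. How the real checker obtains its counters from CanaryDeviceState is not shown in this PR description.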

Both are constructed alongside the other per-step runners in
CanaryManager and ticked once per outer step.
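
As a rough illustration of the periodic stats logging described above, the sketch below ticks once per step, captures a snapshot every N steps, and emits it on the following tick. The class `PeriodicStatsSketch`, the `read_stats` and `emit` callables, and the output format are hypothetical. The one-tick delay only imitates the idea of a deferred read; a real D2H path would use an asynchronous device copy, which this pure-Python sketch does not attempt.

```python
from typing import Callable, Dict, Optional


class PeriodicStatsSketch:
    """Illustrative stand-in; not the PR's PeriodicCanaryStatsLogger."""

    def __init__(
        self,
        every_n_steps: int,
        read_stats: Callable[[], Dict[str, int]],
        emit: Callable[[str], None] = print,
    ) -> None:
        self._every_n = every_n_steps  # 0 disables, mirroring the env var semantics
        self._read_stats = read_stats
        self._emit = emit
        self._step = 0
        self._pending: Optional[Dict[str, int]] = None

    def tick(self) -> None:
        if self._every_n <= 0:
            return
        # Emit the snapshot captured on an earlier tick (the "delayed" part).
        if self._pending is not None:
            body = ", ".join(f"{k}={v}" for k, v in sorted(self._pending.items()))
            self._emit(f"canary stats: {body}")
            self._pending = None
        self._step += 1
        if self._step % self._every_n == 0:
            # Copy now, print later, so the tick that captures never formats or prints.
            self._pending = dict(self._read_stats())
```

A usage example under the same assumptions: with `every_n_steps=2`, the snapshot taken on step 2 is printed during step 3.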


CI States

Latest PR Test (Base): 🚫 Run #26700542488
Latest PR Test (Extra): ❌ Run #26700542417

@fzyzcjy
Collaborator Author

fzyzcjy commented May 31, 2026

/tag-and-rerun-ci

@gemini-code-assist
Contributor

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

@fzyzcjy
Collaborator Author

fzyzcjy commented May 31, 2026

verify-rebased: PASS

Rebased source tree matches the PR head tree (chain content lives on the PR branch; main does not yet reflect the last group).

| | SHA | Tree |
|---|---|---|
| Rebased source (tag verify/rebased/20260531T015946Z) | 34bc9ecc86ed | c643a57b595e822a34161d550bb4043d0c06addb |
| PR head (tom/pr_chain/tom/kv_canary_revert_reversed/add-periodic-kv-canary-stats-logging-and-kernel-run-counter-health-check) | 24a51e3faf13 | c643a57b595e822a34161d550bb4043d0c06addb |
| upstream/main | 7dd19ae3d8ba | 35c87847a6bdf8cec6fa63a2eeea9740061a68d6 |

Reproduce locally (the rebase tag persists after this run):

git fetch upstream 24a51e3faf13dd35365e39542677cb4496c93dcd
REB_TREE=$(git rev-parse 'verify/rebased/20260531T015946Z^{tree}')
PR_TREE=$(git rev-parse '24a51e3faf13dd35365e39542677cb4496c93dcd^{tree}')
MAIN_TREE=$(git rev-parse 'upstream/main^{tree}')
echo "REB_TREE  = $REB_TREE"
echo "PR_TREE   = $PR_TREE"
echo "MAIN_TREE = $MAIN_TREE"

Generated by single_commit_pr_chain.py verify-rebased.

@fzyzcjy fzyzcjy merged commit f220c72 into main May 31, 2026
65 of 92 checks passed
@fzyzcjy fzyzcjy deleted the tom/pr_chain/tom/kv_canary_revert_reversed/add-periodic-kv-canary-stats-logging-and-kernel-run-counter-health-check branch May 31, 2026 02:00
xjpang pushed a commit to xjpang/sglang that referenced this pull request Jun 2, 2026
mqhc2020 pushed a commit to mqhc2020/sglang that referenced this pull request Jun 2, 2026
alphabetc1 pushed a commit to alphabetc1/sglang that referenced this pull request Jun 4, 2026
jeynmann pushed a commit to jeynmann/sglang that referenced this pull request Jun 4, 2026
edwingao28 pushed a commit to edwingao28/sglang that referenced this pull request Jun 7, 2026