Add lifecycle hooks for model health checks and shutdown#595

Open
chand1012 wants to merge 7 commits into mostlygeek:main from chand1012:feat/swap-hooks

Conversation

@chand1012

Adds two new lifecycle hooks to ModelConfig that run at specific points in each model's lifecycle:

  • afterHealthy: Runs once after the model passes its health check, before it transitions to StateReady. Blocks the model from accepting requests until the hook completes. Failure is logged as a warning but doesn't prevent startup.

  • beforeStop: Runs right before the upstream process is killed. Blocks shutdown until completion (or failure), but the process is killed regardless of outcome.

Both hooks inherit the upstream process environment and log output through the process logger.

Key Changes

  • proxy/config/model_config.go: Added AfterHealthy and BeforeStop fields
  • proxy/process.go: Added runHookCommand helper; wired hooks into start() and stopCommand()
  • proxy/process_test.go: Added TestProcess_AfterHealthyHook and TestProcess_BeforeStopHook
  • config.example.yaml: Documented both new fields with curl examples for llama.cpp slots API
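Sketched below is what a minimal config entry using both hooks might look like, based on the description above (the model name, slot index, and filenames are placeholders; the curl calls target the llama.cpp slots API, as documented in config.example.yaml):

```yaml
models:
  my-model:
    cmd: llama-server --port ${PORT} --slot-save-path /tmp/slots --model my-model.gguf
    # one-shot: runs after the health check passes, before StateReady;
    # failure is logged as a warning but does not block startup
    afterHealthy: curl -s -X POST "http://localhost:${PORT}/slots/0?action=restore" -d '{"filename": "my-model.bin"}'
    # runs right before the upstream process is killed; the process is
    # killed regardless of the hook's outcome
    beforeStop: curl -s -X POST "http://localhost:${PORT}/slots/0?action=save" -d '{"filename": "my-model.bin"}'
```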

Copilot AI and others added 5 commits March 19, 2026 21:47
Add two new lifecycle hooks to ModelConfig that run at specific points
in each model's lifecycle:

- afterHealthy: one-shot command that runs after the health check
  passes, before the process transitions to StateReady. Blocks the
  model from accepting requests until it completes. Failure is logged
  as a warning but does not prevent startup.

- beforeStop: command that runs right before the upstream process is
  killed. Blocks shutdown until it completes (or fails), but the
  process is killed regardless of the outcome.

Both hooks inherit the upstream process environment and log output
through the process logger.

Primary use case is loading/saving llama.cpp prompt cache slots.

- proxy/config/model_config.go: add AfterHealthy and BeforeStop fields
- proxy/process.go: add runHookCommand helper; wire hooks into start()
  and stopCommand()
- proxy/process_test.go: add TestProcess_AfterHealthyHook and
  TestProcess_BeforeStopHook
- config.example.yaml: document both new fields

Co-authored-by: chand1012 <3521582+chand1012@users.noreply.github.com>
Replace llama-cli command examples with curl calls to the llama.cpp
slots API, which is the correct way to save/restore prompt cache state
via the server's HTTP interface.

Co-authored-by: chand1012 <3521582+chand1012@users.noreply.github.com>
- move afterHealthy hook to run after health check instead of after state ready
- expand ${PORT} and other macros in afterHealthy and beforeStop fields
@coderabbitai

coderabbitai bot commented Mar 20, 2026

No actionable comments were generated in the recent review. 🎉


Walkthrough

Adds two optional model lifecycle hooks—afterHealthy (one-shot after health check) and beforeStop (runs before termination). Changes include config example docs, two new ModelConfig fields, macro/${PORT} substitution/validation updates, runtime hook execution with timeout, and tests verifying behavior.

Changes

  • Configuration example (config.example.yaml): Added commented/example entries documenting afterHealthy and beforeStop for the models.docker-llama example.
  • Model config & loading (proxy/config/model_config.go, proxy/config/config.go): Added afterHealthy and beforeStop fields to ModelConfig; extended macro substitution, ${PORT} detection/substitution, and unknown-macro validation to include these fields.
  • Process lifecycle (proxy/process.go): Added runHookCommand() helper (30s timeout, sanitization, env inheritance, logging, proc attrs); invokes afterHealthy after the health check and beforeStop during stop; failures are logged as warnings.
  • Tests (proxy/process_test.go): Added TestProcess_AfterHealthyHook and TestProcess_BeforeStopHook (skipped on Windows) to assert hooks run at the expected times.
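Based on the summary above, here is a standalone sketch of what the 30-second-timeout hook runner might look like. This is an approximation, not the PR's code: the real helper is a *Process method that uses config.SanitizeCommand, the process logger, the upstream command's environment, and setProcAttributes; here those are stood in for by strings.Fields, stdout/stderr, and os.Environ().

```go
package main

import (
	"context"
	"fmt"
	"os"
	"os/exec"
	"strings"
	"time"
)

// runHookCommand runs a hook command with a 30s deadline so a hung hook
// (e.g. a stalled curl call) cannot block startup or shutdown forever.
// Sketch only: the real code parses via config.SanitizeCommand and logs
// through the process logger instead of stdout/stderr.
func runHookCommand(hookCmd string) error {
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()

	args := strings.Fields(hookCmd) // stand-in for config.SanitizeCommand
	if len(args) == 0 {
		return fmt.Errorf("empty hook command")
	}

	cmd := exec.CommandContext(ctx, args[0], args[1:]...)
	cmd.Stdout = os.Stdout       // real code: p.processLogger
	cmd.Stderr = os.Stderr       // real code: p.processLogger
	cmd.Env = os.Environ()       // real code: inherits the upstream process env
	return cmd.Run()             // the process is killed when ctx expires
}
```

Per the summary, this helper is invoked from start() after the health check (afterHealthy) and from stopCommand() before the process is killed (beforeStop), with any error logged as a warning rather than treated as fatal.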


@coderabbitai bot left a comment

🧹 Nitpick comments (1)
proxy/process.go (1)

671-688: Consider adding a timeout for hook command execution.

The hook command runs without a timeout. If a hook hangs (e.g., network issue with a curl call), it will block startup or shutdown indefinitely. Consider using exec.CommandContext with a configurable or reasonable default timeout.

♻️ Suggested approach using context with timeout
-func (p *Process) runHookCommand(hookCmd string) error {
+func (p *Process) runHookCommand(hookCmd string) error {
+	// Use a timeout to prevent hooks from blocking indefinitely
+	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
+	defer cancel()
+
 	args, err := config.SanitizeCommand(hookCmd)
 	if err != nil {
 		return fmt.Errorf("failed to sanitize hook command %q: %v", hookCmd, err)
 	}
 
-	cmd := exec.Command(args[0], args[1:]...)
+	cmd := exec.CommandContext(ctx, args[0], args[1:]...)
 	cmd.Stdout = p.processLogger
 	cmd.Stderr = p.processLogger
 	if p.cmd != nil {
 		cmd.Env = p.cmd.Env
 	}
 	setProcAttributes(cmd)
 
 	return cmd.Run()
 }

Alternatively, this could be a configurable per-hook timeout if flexibility is needed.



@candrews

This would be great.

The use case I'm exploring is saving the KV cache on shutdown and restoring it on startup, as described at ggml-org/llama.cpp#13606, which should make model switching a lot faster.

@chand1012
Author

@candrews that's my use case as well, currently using it on my inference server and it's working great!

@candrews

@candrews that's my use case as well, currently using it on my inference server and it's working great!

Can you share more of how you set that up? And is it working well for you?

I'd rather start with what you already have rather than reinvent the wheel :)

@chand1012
Author


Sure! Unfortunately it doesn't work for multimodal models (yet), but here's a snippet of my config.

models:
  qwopus-35b-moe:
    afterHealthy: "curl -s -X POST \"http://localhost:${PORT}/slots/1?action=restore\" -H \"Content-Type: application/json\" -d '{\"filename\": \"qwopus-35b-moe.bin\"}'"
    beforeStop: "curl -s -X POST \"http://localhost:${PORT}/slots/1?action=save\" -H \"Content-Type: application/json\" -d '{\"filename\": \"qwopus-35b-moe.bin\"}'"
    cmd: |
      ${build_dir}/bin/llama-server 
      --model ${models_dir}/qwen3.5/Qwopus-35B-A3B.Q4_K_M.gguf
      --chat-template-file ${models_dir}/qwen3.5/qwen3.5_chat_template.jinja
      --slot-save-path ${cache_dir}
      --host 127.0.0.1
      --port ${PORT}
      --n-gpu-layers -1
      --ctx-size ${128k_context}
      --cache-type-k q4_0
      --cache-type-v q4_0
      --flash-attn on
      --ubatch-size ${ubatch_size}
      --batch-size 2048
      --jinja
      --no-context-shift
      --keep -1
      --temp 0.7
      --top-p 0.95
      --top-k 20
      --min-p 0.0
      --presence-penalty 0
      --repeat-penalty 1.0
    aliases:
      - qwopus-moe
    ttl: 0

  omnicoder:
    afterHealthy: "curl -s -X POST \"http://localhost:${PORT}/slots/1?action=restore\" -H \"Content-Type: application/json\" -d '{\"filename\": \"omnicoder.bin\"}'"
    beforeStop: "curl -s -X POST \"http://localhost:${PORT}/slots/1?action=save\" -H \"Content-Type: application/json\" -d '{\"filename\": \"omnicoder.bin\"}'"
    cmd: |
      ${build_dir}/bin/llama-server 
      --model ${models_dir}/omnicoder/omnicoder-9b-q5_k_m.gguf
      --slot-save-path ${cache_dir}
      --host 127.0.0.1
      --port ${PORT}
      --n-gpu-layers -1
      --ctx-size ${256k_context}
      --cache-type-k q4_0
      --cache-type-v q4_0
      --flash-attn on
      --ubatch-size ${ubatch_size}
      --batch-size 2048
      --jinja
      --no-context-shift
      --keep -1
      --temp 0.6
      --top-p 0.95
      --top-k 20
      --min-p 0.0
      --presence-penalty 0
      --repeat-penalty 1.0
    ttl: 0

@candrews

I'm thinking using a for loop to save/restore all slots would be even better (using a macro to define the number of slots).

it doesn't work for multimodal models (yet)

You can turn off multimodal with --no-mmproj and then it works great.

ggml-org/llama.cpp#19466 is the issue to follow
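A dry-run sketch of the loop idea above. NUM_SLOTS, PORT, and MODEL are hypothetical variables standing in for the macros llama-swap would expand; the actual curl call is left as a comment so the sketch only prints what it would do:

```shell
#!/bin/sh
# Hypothetical beforeStop script: save every slot instead of hard-coding slot 1.
NUM_SLOTS="${NUM_SLOTS:-4}"
PORT="${PORT:-8080}"
MODEL="${MODEL:-mymodel}"

i=0
while [ "$i" -lt "$NUM_SLOTS" ]; do
  # In a real hook this would be the actual call, e.g.:
  # curl -s -X POST "http://localhost:${PORT}/slots/${i}?action=save" \
  #   -H "Content-Type: application/json" \
  #   -d "{\"filename\": \"${MODEL}-slot${i}.bin\"}"
  echo "would save slot ${i} to ${MODEL}-slot${i}.bin"
  i=$((i + 1))
done
```

A matching afterHealthy script would do the same loop with action=restore. Note the hooks as merged take a single command string, so a loop like this would live in a helper script invoked by the hook.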
