feat(agent-server): Add Skills, AGENTS.md, MCP support + SSE race fix by gary149 · Pull Request #13 · gary149/llama-agent

gary149 · 2026-01-08T16:59:13Z

Summary

Add Skills discovery support (agentskills.io spec) - per-session based on working_dir
Add AGENTS.md discovery support (agents.md spec) - per-session based on working_dir
Add MCP server support (Unix only) - global server-level initialization
Fix SSE streaming race condition that caused mutex crashes (use-after-free)
Fix Skills working_dir defaulting to "." to match CLI behavior

Changes

| File | Description |
| --- | --- |
| `agent-server.cpp` | MCP initialization at server startup |
| `agent-session.cpp` | Skills/AGENTS.md discovery in session constructor |
| `agent-session.h` | Extended config with enable_skills, enable_agents_md, extra_skills_paths |
| `agent-routes.cpp` | SSE shared_ptr fix, new session config parsing |

API Changes

New session creation options:

{
  "enable_skills": true,
  "skills_paths": ["/extra/skills/path"],
  "enable_agents_md": true
}

Test plan

Build succeeds
SSE streaming no longer crashes with "mutex lock failed"
Skills discovered from ./.llama-agent/skills and ~/.llama-agent/skills
AGENTS.md discovered from working directory
MCP tools registered at server startup (Unix)

🤖 Generated with Claude Code

Add the three missing features to reach CLI feature parity:
- MCP servers: Initialize at server startup (Unix only), tools registered globally in tool registry for all sessions
- Skills: Discover per-session based on working_dir, inject skills_prompt_section into agent_config
- AGENTS.md: Discover per-session based on working_dir, inject agents_md_prompt_section into agent_config

New session creation options:
- enable_skills (bool): Enable/disable skill discovery
- enable_agents_md (bool): Enable/disable AGENTS.md discovery
- skills_paths (array): Additional skill search paths

Files modified:
- agent-server.cpp: Add MCP initialization at startup
- agent-session.h: Add config fields and storage members
- agent-session.cpp: Add discovery logic in constructor
- agent-routes.cpp: Parse new session options
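As a rough illustration of the per-session AGENTS.md step, the sketch below reads `AGENTS.md` from the session's working directory and turns it into a prompt section. The function name, the heading text, and the choice to look only in the working directory itself are assumptions made for this example and are not taken from the PR's code. It needs C++17 for `std::filesystem`.

```cpp
#include <filesystem>
#include <fstream>
#include <sstream>
#include <string>

namespace fs = std::filesystem;

// Hypothetical helper: returns the text to inject into the agent prompt,
// or an empty string when discovery is disabled or no AGENTS.md exists.
std::string agents_md_prompt_section(const std::string & working_dir,
                                     bool enable_agents_md) {
    if (!enable_agents_md) {
        return "";
    }
    // Assumption: only <working_dir>/AGENTS.md is checked, no parent lookup.
    const fs::path file = fs::path(working_dir) / "AGENTS.md";
    std::ifstream in(file);
    if (!in) {
        return "";  // file missing or unreadable: nothing to inject
    }
    std::ostringstream body;
    body << in.rdbuf();
    // The heading below is illustrative, not the server's actual wording.
    return "# Project instructions (AGENTS.md)\n\n" + body.str();
}
```

A session constructor could call this once with its `working_dir` and store the result next to the rest of its agent configuration, so that sessions with different working directories each get their own instructions.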

Fix use-after-free bug where sse_stream_res could be destroyed by the HTTP framework while the worker thread was still calling callbacks.

Root cause: Raw pointer captured in callback lambda, but response object lifetime controlled by HTTP framework's on_complete() callback.

Solution: Use shared_ptr to ensure response object lives until both:
1. HTTP framework is done streaming
2. Worker thread callback is done

Add sse_shared_wrapper struct that holds shared_ptr and forwards calls to the underlying sse_stream_res.
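The ownership pattern described above can be sketched in isolation. In the example below, `sse_stream_res` is a simplified stand-in rather than the server's real type, and `simulate_stream` is a hypothetical driver. The "framework" drops its reference while the worker is still sending, and the response stays alive because the worker's wrapper holds its own `shared_ptr`.

```cpp
#include <cstddef>
#include <memory>
#include <mutex>
#include <string>
#include <thread>
#include <utility>
#include <vector>

// Simplified stand-in for the SSE response object (assumption, not the real type).
struct sse_stream_res {
    std::mutex mtx;
    std::vector<std::string> events;

    void send(const std::string & data) {
        std::lock_guard<std::mutex> lock(mtx);
        events.push_back("data: " + data + "\n\n");
    }

    std::size_t count() {
        std::lock_guard<std::mutex> lock(mtx);
        return events.size();
    }
};

// Holds shared ownership of the response and forwards calls to it.
struct sse_shared_wrapper {
    std::shared_ptr<sse_stream_res> res;

    void send(const std::string & data) const { res->send(data); }
    std::size_t count() const { return res->count(); }
};

// Hypothetical driver: returns how many events the worker delivered.
std::size_t simulate_stream(int n_events) {
    auto framework_ref = std::make_shared<sse_stream_res>();
    sse_shared_wrapper wrapper{framework_ref};

    std::size_t delivered = 0;
    // The worker owns its own reference through the moved-in wrapper.
    std::thread worker([w = std::move(wrapper), n_events, &delivered]() {
        for (int i = 0; i < n_events; ++i) {
            w.send("token " + std::to_string(i));
        }
        delivered = w.count();
    });

    // Stands in for the framework's on_complete(): it releases its reference,
    // possibly while the worker is still running.
    framework_ref.reset();

    worker.join();  // join() makes the worker's write to `delivered` visible
    return delivered;
}
```

With a raw pointer captured in the lambda, the `reset()` call would destroy the object and its mutex while the worker could still be locking it. That matches the "mutex lock failed" symptom named in the test plan.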

Previously, Skills discovery skipped the project-local path when working_dir was empty. Now defaults to "." to match CLI behavior, ensuring project-local skills are discovered even without explicit working_dir configuration.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
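A minimal sketch of the defaulting behavior, assuming a hypothetical helper `skill_search_paths` that builds the list of directories to scan. The `.llama-agent/skills` locations come from the test plan; the ordering (project-local, then user-level, then extra paths) and the explicit `home_dir` parameter are assumptions made for illustration. It needs C++17 for `std::filesystem`.

```cpp
#include <filesystem>
#include <string>
#include <vector>

namespace fs = std::filesystem;

// Hypothetical helper: directories to scan for skills in one session.
std::vector<fs::path> skill_search_paths(const std::string & working_dir,
                                         const std::string & home_dir,
                                         const std::vector<std::string> & extra_paths) {
    // An empty working_dir falls back to "." so the project-local path is
    // still searched, mirroring the fix described above.
    const fs::path base = working_dir.empty() ? fs::path(".") : fs::path(working_dir);

    std::vector<fs::path> paths;
    paths.push_back(base / ".llama-agent" / "skills");                    // project-local
    if (!home_dir.empty()) {
        paths.push_back(fs::path(home_dir) / ".llama-agent" / "skills");  // user-level
    }
    for (const auto & p : extra_paths) {
        paths.emplace_back(p);                                            // from skills_paths
    }
    return paths;
}
```

Calling `skill_search_paths("", "/home/user", {"/extra/skills/path"})` yields `./.llama-agent/skills` as the first entry, whereas the pre-fix behavior would have omitted the project-local path entirely.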

* FlashAttention (#13)
* Add inplace softmax
* Move rms_norm to split row approach
* Update debug for supports_op
* clean up debug statements
* neg f16xf32xip builds and runs, haven't actually run a model that uses neg kernel yet though
* neg passes backend test
* unary operators pass ggml tests
* rms_norm double declaration bug atoned
* abides by editor-config
* removed vestigial files
* fixed autoconfig
* All operators (including xielu) working
* removed unnecessary checking if node->src[1] exists for unary operators
* responded and dealt with PR comments
* implemented REPL_Template support and removed bug in unary operators kernel
* formatted embed wgsl and ggml-webgpu.cpp
* Faster tensors (#8): Add fast matrix and matrix/vector multiplication.
* Use map for shader replacements instead of pair of strings
* Wasm (#9)
* webgpu : fix build on emscripten
* more debugging stuff
* test-backend-ops: force single thread on wasm
* fix single-thread case for init_tensor_uniform
* use jspi
* add pthread
* test: remember to set n_thread for cpu backend
* Add buffer label and enable dawn-specific toggles to turn off some checks
* Intermediate state
* Fast working f16/f32 vec4
* Working float fast mul mat
* Clean up naming of mul_mat to match logical model, start work on q mul_mat
* Setup for subgroup matrix mat mul
* Basic working subgroup matrix
* Working subgroup matrix tiling
* Handle weirder sg matrix sizes (but still % sg matrix size)
* Working start to gemv
* working f16 accumulation with shared memory staging
* Print out available subgroup matrix configurations
* Vectorize dst stores for sg matrix shader
* Gemv working scalar
* Minor set_rows optimization (#4)
* updated optimization, fixed errors
* non vectorized version now dispatches one thread per element
* Simplify
* Change logic for set_rows pipelines

---------

Co-authored-by: Neha Abbas <nehaabbas@macbookpro.lan>
Co-authored-by: Neha Abbas <nehaabbas@ReeseLevines-MacBook-Pro.local>
Co-authored-by: Reese Levine <reeselevine1@gmail.com>

* Comment on dawn toggles
* Working subgroup matrix code for (semi)generic sizes
* Remove some comments
* Cleanup code
* Update dawn version and move to portable subgroup size
* Try to fix new dawn release
* Update subgroup size comment
* Only check for subgroup matrix configs if they are supported
* Add toggles for subgroup matrix/f16 support on nvidia+vulkan
* Make row/col naming consistent
* Refactor shared memory loading
* Move sg matrix stores to correct file
* Working q4_0
* Formatting
* Work with emscripten builds
* Fix test-backend-ops emscripten for f16/quantized types
* Use emscripten memory64 to support get_memory
* Add build flags and try ci

---------

Co-authored-by: Xuan Son Nguyen <son@huggingface.co>

* Remove extra whitespace
* Move wasm single-thread logic out of test-backend-ops for cpu backend
* Disable multiple threads for emscripten single-thread builds in ggml_graph_plan
* Refactored pipelines and workgroup calculations (#10)
* refactored pipelines
* refactored workgroup calculation
* removed commented out block of prior maps
* Clean up ceiling division pattern

---------

Co-authored-by: Neha Abbas <nehaabbas@eduroam-169-233-141-223.ucsc.edu>
Co-authored-by: Reese Levine <reeselevine1@gmail.com>

* Start work on flash attention
* Shader structure set up (many bugs still)
* debugging
* Working first test
* Working with head grouping, head sizes to 128, logit softcap, mask/sinks enabled, f32
* Generalize softmax to work with multiple subgroups, f16 accumulation, mask shared memory tiling
* Start work on integrating pre-wgsl
* Separate structs/initial shader compilation library into separate files
* Work on compilation choices for flashattention
* Work on subgroup matrix/tile size portability
* subgroup size agnostic online softmax
* Cleanups, quantization types
* more cleanup
* fix wasm build
* Refactor flashattention to increase parallelism, use direct loads for KV in some cases
* Checkpoint
* formatting
* Update to account for default kv cache padding
* formatting shader
* Add workflow for ggml-ci webgpu
* Try passing absolute path to dawn in ggml-ci
* Avoid error on device destruction, add todos for proper cleanup
* Fix unused warning
* Forgot one parameter unused
* Move some flashattn computation to f32 for correctness

gary149 and others added 3 commits January 8, 2026 17:30

gary149 merged commit 0aa3de7 into master Jan 8, 2026

gary149 merged 3 commits into master from feature/agent-server-full-parity
