
agent: add thread-safe parallel subagent support with prefix caching #11

Merged
gary149 merged 3 commits into feature/subagent-support from feature/parallel-subagents-with-prefix-caching
Jan 8, 2026

Conversation

gary149 (Owner) commented Jan 8, 2026

Summary

  • Enable multiple background subagents to run concurrently without crashes
  • Add KV cache prefix sharing to optimize memory usage and prompt processing time
  • Implement buffered output so interleaved output from parallel tasks stays readable

Changes

Thread-Safe Console Output

  • Add g_console_mutex to protect global console state
  • Add output_guard RAII class for atomic multi-line output
  • Prevents crashes from concurrent console::set_display() calls
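The mutex-plus-RAII scheme above can be sketched as follows. Names follow the PR (`g_console_mutex`, `output_guard`), but the bodies are illustrative assumptions, not the PR's actual code:

```cpp
#include <cassert>
#include <mutex>
#include <string>
#include <thread>
#include <vector>

// Hypothetical sketch: a vector of strings stands in for console state.
static std::mutex g_console_mutex;
static std::vector<std::string> g_console_lines;

// RAII guard: holds the console lock for the lifetime of a multi-line write.
struct output_guard {
    std::lock_guard<std::mutex> lock;
    output_guard() : lock(g_console_mutex) {}
};

// Both lines are emitted atomically: no other thread's output can land
// between them, because the guard is held for the whole call.
void print_pair(const std::string & a, const std::string & b) {
    output_guard guard;
    g_console_lines.push_back(a);
    g_console_lines.push_back(b);
}
```

Without the guard, two subagents calling a multi-line print concurrently could interleave halfway through a message; with it, each multi-line write is a critical section.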

Buffered Output System

  • New subagent_output_buffer class collects output per task
  • New subagent_output_manager singleton manages buffer lifecycles
  • Background tasks output with [task-id] prefixes for readability
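A minimal sketch of the buffering scheme: class names follow the PR (`subagent_output_buffer`, `subagent_output_manager`), but the internals are assumed for illustration:

```cpp
#include <map>
#include <string>
#include <vector>

// Collects output segments for one background task.
struct subagent_output_buffer {
    std::string task_id;
    std::vector<std::string> segments;

    void append(const std::string & s) { segments.push_back(s); }

    // On flush, each line gets a [task-id] prefix so interleaved output
    // from parallel tasks stays attributable to its task.
    std::string render() const {
        std::string out;
        for (const auto & s : segments) {
            out += "[" + task_id + "] " + s + "\n";
        }
        return out;
    }
};

// Owns one buffer per task id for the lifetime of the task.
struct subagent_output_manager {
    std::map<std::string, subagent_output_buffer> buffers;

    subagent_output_buffer & get(const std::string & task_id) {
        auto & buf = buffers[task_id];
        buf.task_id = task_id;
        return buf;
    }
};
```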

KV Cache Prefix Caching

  • Add base_system_prompt field to tool_context
  • Subagent prompts start with parent's base prompt (~400 tokens)
  • Enables automatic KV cache reuse via llama.cpp server's prefix detection
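The prefix-sharing idea reduces to building each subagent prompt on top of the parent's base prompt. `base_system_prompt` is the field named in the PR; the build function below is a hypothetical sketch:

```cpp
#include <string>

// Append task-specific instructions to the parent's base system prompt so
// the server's longest-prefix match can reuse the cached KV entries.
std::string build_subagent_prompt(const std::string & base_system_prompt,
                                  const std::string & task_instructions) {
    // The shared part must be byte-identical: any edit to the prefix
    // invalidates the cache match from that point onward.
    return base_system_prompt + "\n\n" + task_instructions;
}
```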

Test plan

  • Build with cmake --build build --target llama-agent
  • Run single subagent task (synchronous) - verify output unchanged
  • Run multiple background subagents in parallel - verify no crashes
  • Check token stats show cached tokens for subagents

🤖 Generated with Claude Code

- Track subagent token stats (input/output/cached) in subagent_result
- Collect stats from nested agent_loop after subagent completes
- Add session_stats_ptr to tool_context for parent stats updates
- Update /stats command to show main vs subagent token breakdown
- Reset stats on /clear command

This commit enables multiple subagents to run concurrently without crashes
and optimizes memory usage through KV cache prefix sharing.

## Thread-Safe Console Output

- Add `g_console_mutex` to protect global console state in `console.cpp`
- Add `output_guard` RAII class for atomic multi-line output operations
- Add `set_display_unlocked()` internal function to avoid deadlocks when
  lock is already held
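The locked/unlocked split is the standard way to avoid self-deadlock with a non-recursive mutex: the public function takes the lock, while the `_unlocked` variant assumes the caller already holds it. A sketch, with the function names from the commit but assumed bodies and signatures:

```cpp
#include <mutex>

static std::mutex g_console_mutex;
static int g_display_mode = 0;

// Internal: caller must already hold g_console_mutex.
static void set_display_unlocked(int mode) {
    g_display_mode = mode; // real code would also emit ANSI codes, etc.
}

// Public: takes the lock itself.
void set_display(int mode) {
    std::lock_guard<std::mutex> lock(g_console_mutex);
    set_display_unlocked(mode);
}

// A multi-step operation that already holds the lock must call the
// unlocked variant -- calling set_display() here would deadlock,
// because std::mutex is not recursive.
void print_colored(int mode) {
    std::lock_guard<std::mutex> lock(g_console_mutex);
    set_display_unlocked(mode);
    // ... write output ...
    set_display_unlocked(0);
}
```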

## Buffered Output for Background Tasks

- New `subagent_output_buffer` class collects output segments per task
- New `subagent_output_manager` singleton manages buffer lifecycles
- Buffered output flushes atomically with task-ID prefixes for
  readable interleaved output from parallel subagents

## Subagent Display Updates

- Add dual-mode output: direct for synchronous, buffered for background
- Update print functions to accept optional buffer parameter
- Scope class now supports both direct and buffered constructors
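The dual-mode pattern can be sketched like this: with no buffer, the scope writes straight to the console (synchronous case); with a buffer, it collects output for a later atomic flush (background case). The class and member names here are illustrative, not the PR's exact API:

```cpp
#include <cstdio>
#include <string>
#include <vector>

struct display_scope {
    std::vector<std::string> * buffer; // nullptr => direct/synchronous mode

    explicit display_scope(std::vector<std::string> * buf = nullptr)
        : buffer(buf) {}

    void print(const std::string & s) {
        if (buffer) {
            buffer->push_back(s);  // background: collect for atomic flush
        } else {
            std::puts(s.c_str());  // synchronous: write directly
        }
    }
};
```

The optional-buffer parameter keeps one code path for both modes, so the synchronous output stays byte-for-byte unchanged.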

## KV Cache Prefix Sharing

- Add `base_system_prompt` field to `tool_context` for sharing prompt prefix
- Main agent stores base prompt (~400 tokens) containing identity and tools
- Subagent prompts now start with parent's base prompt to enable automatic
  KV cache prefix detection and reuse by llama.cpp server

The prefix caching optimization reduces prompt processing time for subagents
by reusing cached tokens from the shared system prompt prefix.
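For intuition, prefix detection amounts to a longest-common-prefix check over token ids; the server reuses the matched cached tokens and only processes the tail. This is an illustrative minimal version, not llama.cpp's actual implementation:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Number of leading tokens shared between the cached sequence and the
// new prompt; only prompt tokens past this point need processing.
size_t common_prefix_len(const std::vector<int32_t> & cached,
                         const std::vector<int32_t> & prompt) {
    size_t n = 0;
    while (n < cached.size() && n < prompt.size() && cached[n] == prompt[n]) {
        ++n;
    }
    return n;
}
```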
@gary149 gary149 changed the base branch from master to feature/subagent-support January 8, 2026 10:37
@gary149 gary149 merged commit 79614c1 into feature/subagent-support Jan 8, 2026
gary149 pushed a commit that referenced this pull request Feb 10, 2026
…d per-thread state (ggml-org#18976)

* Squashed commit of the following:

commit b3c6bf4
Author: Abhijit Ramesh <abhijitramesh2k@gmail.com>
Date:   Mon Dec 1 18:29:00 2025 -0800

    ggml webgpu: fix xielu parameter passing (#11)

    The XIELU operation was incorrectly using static_cast to convert
    float parameters to uint32_t, which converted numeric values instead
    of preserving IEEE 754 bit patterns. This caused incorrect values
    to be interpreted by the GPU shader.

    * Use reinterpret_cast to preserve float bit patterns when passing
      through uint32_t params buffer
    * Update WGSL shader parameter types from u32 to f32
    * Re-enable XIELU support (was disabled due to numerical issues)

    Fixes NMSE test failures for XIELU operation on WebGPU backend.
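The bug class described above (numeric conversion vs. bit-pattern preservation when passing floats through a u32 params buffer) can be illustrated in a generic sketch, separate from the PR's actual code:

```cpp
#include <cstdint>
#include <cstring>

// WRONG: numeric conversion -- 1.5f becomes 1u, the fraction is lost,
// and the shader reinterpreting the u32 as f32 sees garbage.
uint32_t pack_value_cast(float f) {
    return static_cast<uint32_t>(f);
}

// RIGHT: copy the raw IEEE 754 bits so the shader can reinterpret
// them as f32 unchanged (same effect as C++20 std::bit_cast).
uint32_t pack_bit_pattern(float f) {
    uint32_t u;
    std::memcpy(&u, &f, sizeof(u));
    return u;
}

float unpack_bit_pattern(uint32_t u) {
    float f;
    std::memcpy(&f, &u, sizeof(f));
    return f;
}
```

Round-tripping through `pack_bit_pattern`/`unpack_bit_pattern` is lossless, which is exactly what a typed-as-u32 uniform buffer needs when the shader side declares the field as `f32`.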

commit 5ca9b5e
Author: neha-ha <137219201+neha-ha@users.noreply.github.com>
Date:   Tue Nov 18 12:17:00 2025 -0800

    Refactored pipelines and workgroup calculations (#10)

    * refactored pipelines

    * refactored workgroup calculation

    * removed commented out block of prior maps

    * Clean up ceiling division pattern

    ---------

    Co-authored-by: Neha Abbas <nehaabbas@eduroam-169-233-141-223.ucsc.edu>
    Co-authored-by: Reese Levine <reeselevine1@gmail.com>

Author: James Contini <jamescontini@gmail.com>
Date:   Wed Oct 29 23:13:06 2025 -0700

    formatted embed wgsl and ggml-webgpu.cpp

commit e1f6bae
Author: James Contini <jamescontini@gmail.com>
Date:   Wed Oct 29 23:08:37 2025 -0700

    implemented REPL_Template support and removed bug in unary operators kernel

commit 8c70b8f
Author: James Contini <jamescontini@gmail.com>
Date:   Wed Oct 15 16:14:20 2025 -0700

    responded and dealt with PR comments

commit f9282c6
Author: James Contini <jamescontini@gmail.com>
Date:   Sun Oct 12 13:41:41 2025 -0700

    removed unnecessary checking if node->src[1] exists for unary operators

commit 4cf28d7
Author: James Contini <jamescontini@gmail.com>
Date:   Sun Oct 12 13:32:45 2025 -0700

    All operators (including xielu) working

commit 74c6add
Author: James Contini <jamescontini@gmail.com>
Date:   Fri Oct 10 13:16:48 2025 -0700

    fixed autoconfig

commit 3627499
Author: James Contini <jamescontini@gmail.com>
Date:   Fri Oct 10 13:10:46 2025 -0700

    removed vestigial files

commit cb08583
Author: James Contini <jamescontini@gmail.com>
Date:   Fri Oct 10 12:59:32 2025 -0700

    abides by editor-config

commit 5360e28
Author: James Contini <jamescontini@gmail.com>
Date:   Fri Oct 10 12:45:57 2025 -0700

    rms_norm double declaration bug atoned

commit 7b09baa
Merge: 8a6ec84 74b8fc1
Author: James Contini <jamescontini@gmail.com>
Date:   Fri Oct 10 11:50:03 2025 -0700

    resolving merge conflicts

commit 8a6ec84
Author: James Contini <jamescontini@gmail.com>
Date:   Wed Oct 8 18:06:47 2025 -0700

    unary operators pass ggml tests

commit c3ae382
Author: James Contini <jamescontini@gmail.com>
Date:   Wed Oct 1 16:22:40 2025 -0700

    neg passes backend test

commit aa1c9b2
Author: James Contini <jamescontini@gmail.com>
Date:   Tue Sep 30 23:55:27 2025 -0700

    neg f16xf32xip builds and runs; haven't actually run a model that uses the neg kernel yet

Co-authored-by: James Contini <jamescontini@gmail.com>
Co-authored-by: Neha Abbas <neabbas@ucsc.edu>
Co-authored-by: Abhijit Ramesh <abhijitramesh2k@gmail.com>

* Remove extra code and format

* Add ops documentation (finally)

* ggml webgpu: add SOFTPLUS unary operator

Implements SOFTPLUS (log(1 + exp(x))) with f16/f32 support. Uses f32
precision for intermediate calculations to prevent f16 overflow.

* Add shader implementation and 4 variants (f32/f16, inplace/non-inplace)
* Register pipelines and device support
* Follow Vulkan backend numerical stability pattern
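The stability concern can be shown in scalar C++ (the real kernel is WGSL; this is just the math). Naive `log(1 + exp(x))` overflows `exp()` for large `x`; the equivalent form below never exponentiates a positive argument, and `log1p` keeps precision when `exp(-|x|)` is tiny:

```cpp
#include <algorithm>
#include <cmath>

// Numerically stable softplus: log(1 + exp(x))
//   rewritten as max(x, 0) + log1p(exp(-|x|)).
float softplus_stable(float x) {
    return std::max(x, 0.0f) + std::log1p(std::exp(-std::fabs(x)));
}
```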

* ggml webgpu: add EXPM1 unary operator

Implements EXPM1 (exp(x) - 1) with f16/f32 support.

* Add shader implementation and 4 variants (f32/f16, inplace/non-inplace)
* Register pipelines and device support
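A dedicated EXPM1 op exists because, for `|x|` near zero, `exp(x)` is approximately 1.0 and the subtraction `exp(x) - 1` cancels nearly all significant bits. A scalar illustration of the difference:

```cpp
#include <cmath>

// Catastrophic cancellation: for tiny x, exp(x) rounds to exactly 1.0,
// so the subtraction returns 0 instead of ~x.
double naive_expm1(double x)  { return std::exp(x) - 1.0; }

// std::expm1 computes exp(x) - 1 directly, without the cancellation.
double stable_expm1(double x) { return std::expm1(x); }
```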

* ggml webgpu: add FLOOR unary operator

Implements FLOOR (rounds down to nearest integer) with f16/f32 support.

* Add shader implementation and 4 variants (f32/f16, inplace/non-inplace)
* Register pipelines and device support

* ggml webgpu: add CEIL unary operator

Implements CEIL (rounds up to nearest integer) with f16/f32 support.

* Add shader implementation and 4 variants (f32/f16, inplace/non-inplace)
* Register pipelines and device support

* ggml webgpu: add ROUND unary operator

Implements ROUND (rounds to nearest integer) with f16/f32 support.

* Add shader implementation and 4 variants (f32/f16, inplace/non-inplace)
* Register pipelines and device support

* ggml webgpu: add TRUNC unary operator

Implements TRUNC (truncates towards zero) with f16/f32 support.

* Add shader implementation and 4 variants (f32/f16, inplace/non-inplace)
* Register pipelines and device support

* docs : update WebGPU support for unary operators (FLOOR, CEIL, ROUND, TRUNC, EXPM1, SOFTPLUS)

* Updates to webgpu get_memory

* Move shared state (webgpu_context) and device creation out of registration context, device context, and buffer context, and move into backend context

* Small cleanup

* Move Instance, Device, Adapter, Device creation, and capabilities to global state while moving Queue, pipelines, and buffers to per-thread state.

* Cleanups

* More cleanup

* Move staging_buf mutex to global context

* Resolve merge

* Resolve merge

* Resolve merge

* Clean up merge errors, delete forward declaration, and run clang-format

* Rename device_init to backend_init

* Move webgpu_context to backend_context

* Move buffer context members into global context and refactor function calls

* Run clang-format

* Remove comments

* Move parameter buffers to per-thread, add single memset_tensor param buf

* Fix CI compilation issue

* Fix builds for emscripten not supporting subgroups

* cleanup

* cleanup

---------

Co-authored-by: Reese Levine <reeselevine1@gmail.com>