
agent: add thread-safe parallel subagent support with prefix caching #11

Merged
gary149 merged 3 commits into feature/subagent-support from feature/parallel-subagents-with-prefix-caching
Jan 8, 2026

Conversation

gary149 (Owner) commented Jan 8, 2026

Summary

  • Enable multiple background subagents to run concurrently without crashes
  • Add KV cache prefix sharing to optimize memory usage and prompt processing time
  • Implement buffered output so interleaved output from parallel tasks stays readable

Changes

Thread-Safe Console Output

  • Add g_console_mutex to protect global console state
  • Add output_guard RAII class for atomic multi-line output
  • Prevents crashes from concurrent console::set_display() calls
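The mutex-plus-RAII scheme above can be sketched as follows. Names follow the PR (`g_console_mutex`, `output_guard`), but the bodies are illustrative assumptions, not the PR's actual code:

```cpp
#include <cassert>
#include <mutex>
#include <string>
#include <thread>
#include <vector>

// Hypothetical sketch: a vector of strings stands in for console state.
static std::mutex g_console_mutex;
static std::vector<std::string> g_console_lines;

// RAII guard: holds the console lock for the lifetime of a multi-line write.
struct output_guard {
    std::lock_guard<std::mutex> lock;
    output_guard() : lock(g_console_mutex) {}
};

// Both lines are emitted atomically: no other thread's output can land
// between them, because the guard is held for the whole call.
void print_pair(const std::string & a, const std::string & b) {
    output_guard guard;
    g_console_lines.push_back(a);
    g_console_lines.push_back(b);
}
```

Without the guard, two subagents calling a multi-line print concurrently could interleave halfway through a message; with it, each multi-line write is a critical section.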

Buffered Output System

  • New subagent_output_buffer class collects output per task
  • New subagent_output_manager singleton manages buffer lifecycles
  • Background tasks output with [task-id] prefixes for readability
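A minimal sketch of the buffering scheme: class names follow the PR (`subagent_output_buffer`, `subagent_output_manager`), but the internals are assumed for illustration:

```cpp
#include <map>
#include <string>
#include <vector>

// Collects output segments for one background task.
struct subagent_output_buffer {
    std::string task_id;
    std::vector<std::string> segments;

    void append(const std::string & s) { segments.push_back(s); }

    // On flush, each line gets a [task-id] prefix so interleaved output
    // from parallel tasks stays attributable to its task.
    std::string render() const {
        std::string out;
        for (const auto & s : segments) {
            out += "[" + task_id + "] " + s + "\n";
        }
        return out;
    }
};

// Owns one buffer per task id for the lifetime of the task.
struct subagent_output_manager {
    std::map<std::string, subagent_output_buffer> buffers;

    subagent_output_buffer & get(const std::string & task_id) {
        auto & buf = buffers[task_id];
        buf.task_id = task_id;
        return buf;
    }
};
```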

KV Cache Prefix Caching

  • Add base_system_prompt field to tool_context
  • Subagent prompts start with parent's base prompt (~400 tokens)
  • Enables automatic KV cache reuse via llama.cpp server's prefix detection
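The prefix-sharing idea reduces to building each subagent prompt on top of the parent's base prompt. `base_system_prompt` is the field named in the PR; the build function below is a hypothetical sketch:

```cpp
#include <string>

// Append task-specific instructions to the parent's base system prompt so
// the server's longest-prefix match can reuse the cached KV entries.
std::string build_subagent_prompt(const std::string & base_system_prompt,
                                  const std::string & task_instructions) {
    // The shared part must be byte-identical: any edit to the prefix
    // invalidates the cache match from that point onward.
    return base_system_prompt + "\n\n" + task_instructions;
}
```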

Test plan

  • Build with cmake --build build --target llama-agent
  • Run single subagent task (synchronous) - verify output unchanged
  • Run multiple background subagents in parallel - verify no crashes
  • Check token stats show cached tokens for subagents

🤖 Generated with Claude Code

- Track subagent token stats (input/output/cached) in subagent_result
- Collect stats from nested agent_loop after subagent completes
- Add session_stats_ptr to tool_context for parent stats updates
- Update /stats command to show main vs subagent token breakdown
- Reset stats on /clear command

This commit enables multiple subagents to run concurrently without crashes
and optimizes memory usage through KV cache prefix sharing.

## Thread-Safe Console Output

- Add `g_console_mutex` to protect global console state in `console.cpp`
- Add `output_guard` RAII class for atomic multi-line output operations
- Add `set_display_unlocked()` internal function to avoid deadlocks when
  lock is already held
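The locked/unlocked split is the standard way to avoid self-deadlock with a non-recursive mutex: the public function takes the lock, while the `_unlocked` variant assumes the caller already holds it. A sketch, with the function names from the commit but assumed bodies and signatures:

```cpp
#include <mutex>

static std::mutex g_console_mutex;
static int g_display_mode = 0;

// Internal: caller must already hold g_console_mutex.
static void set_display_unlocked(int mode) {
    g_display_mode = mode; // real code would also emit ANSI codes, etc.
}

// Public: takes the lock itself.
void set_display(int mode) {
    std::lock_guard<std::mutex> lock(g_console_mutex);
    set_display_unlocked(mode);
}

// A multi-step operation that already holds the lock must call the
// unlocked variant -- calling set_display() here would deadlock,
// because std::mutex is not recursive.
void print_colored(int mode) {
    std::lock_guard<std::mutex> lock(g_console_mutex);
    set_display_unlocked(mode);
    // ... write output ...
    set_display_unlocked(0);
}
```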

## Buffered Output for Background Tasks

- New `subagent_output_buffer` class collects output segments per task
- New `subagent_output_manager` singleton manages buffer lifecycles
- Buffered output flushes atomically with task-ID prefixes for
  readable interleaved output from parallel subagents

## Subagent Display Updates

- Add dual-mode output: direct for synchronous, buffered for background
- Update print functions to accept optional buffer parameter
- Scope class now supports both direct and buffered constructors
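The dual-mode pattern can be sketched like this: with no buffer, the scope writes straight to the console (synchronous case); with a buffer, it collects output for a later atomic flush (background case). The class and member names here are illustrative, not the PR's exact API:

```cpp
#include <cstdio>
#include <string>
#include <vector>

struct display_scope {
    std::vector<std::string> * buffer; // nullptr => direct/synchronous mode

    explicit display_scope(std::vector<std::string> * buf = nullptr)
        : buffer(buf) {}

    void print(const std::string & s) {
        if (buffer) {
            buffer->push_back(s);  // background: collect for atomic flush
        } else {
            std::puts(s.c_str());  // synchronous: write directly
        }
    }
};
```

The optional-buffer parameter keeps one code path for both modes, so the synchronous output stays byte-for-byte unchanged.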

## KV Cache Prefix Sharing

- Add `base_system_prompt` field to `tool_context` for sharing prompt prefix
- Main agent stores base prompt (~400 tokens) containing identity and tools
- Subagent prompts now start with parent's base prompt to enable automatic
  KV cache prefix detection and reuse by llama.cpp server

The prefix caching optimization reduces prompt processing time for subagents
by reusing cached tokens from the shared system prompt prefix.
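For intuition, prefix detection amounts to a longest-common-prefix check over token ids; the server reuses the matched cached tokens and only processes the tail. This is an illustrative minimal version, not llama.cpp's actual implementation:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Number of leading tokens shared between the cached sequence and the
// new prompt; only prompt tokens past this point need processing.
size_t common_prefix_len(const std::vector<int32_t> & cached,
                         const std::vector<int32_t> & prompt) {
    size_t n = 0;
    while (n < cached.size() && n < prompt.size() && cached[n] == prompt[n]) {
        ++n;
    }
    return n;
}
```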
@gary149 gary149 changed the base branch from master to feature/subagent-support January 8, 2026 10:37
@gary149 gary149 merged commit 79614c1 into feature/subagent-support Jan 8, 2026
gary149 pushed a commit that referenced this pull request Feb 10, 2026
…d per-thread state (ggml-org#18976)

* Squashed commit of the following:

commit b3c6bf4
Author: Abhijit Ramesh <abhijitramesh2k@gmail.com>
Date:   Mon Dec 1 18:29:00 2025 -0800

    ggml webgpu: fix xielu parameter passing (#11)

    The XIELU operation was incorrectly using static_cast to convert
    float parameters to uint32_t, which converted numeric values instead
    of preserving IEEE 754 bit patterns. This caused incorrect values
    to be interpreted by the GPU shader.

    * Use reinterpret_cast to preserve float bit patterns when passing
      through uint32_t params buffer
    * Update WGSL shader parameter types from u32 to f32
    * Re-enable XIELU support (was disabled due to numerical issues)

    Fixes NMSE test failures for XIELU operation on WebGPU backend.
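The bug class described above (numeric conversion vs. bit-pattern preservation when passing floats through a u32 params buffer) can be illustrated in a generic sketch, separate from the PR's actual code:

```cpp
#include <cstdint>
#include <cstring>

// WRONG: numeric conversion -- 1.5f becomes 1u, the fraction is lost,
// and the shader reinterpreting the u32 as f32 sees garbage.
uint32_t pack_value_cast(float f) {
    return static_cast<uint32_t>(f);
}

// RIGHT: copy the raw IEEE 754 bits so the shader can reinterpret
// them as f32 unchanged (same effect as C++20 std::bit_cast).
uint32_t pack_bit_pattern(float f) {
    uint32_t u;
    std::memcpy(&u, &f, sizeof(u));
    return u;
}

float unpack_bit_pattern(uint32_t u) {
    float f;
    std::memcpy(&f, &u, sizeof(f));
    return f;
}
```

Round-tripping through `pack_bit_pattern`/`unpack_bit_pattern` is lossless, which is exactly what a typed-as-u32 uniform buffer needs when the shader side declares the field as `f32`.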

commit 5ca9b5e
Author: neha-ha <137219201+neha-ha@users.noreply.github.com>
Date:   Tue Nov 18 12:17:00 2025 -0800

    Refactored pipelines and workgroup calculations (#10)

    * refactored pipelines

    * refactored workgroup calculation

    * removed commented out block of prior maps

    * Clean up ceiling division pattern

    ---------

    Co-authored-by: Neha Abbas <nehaabbas@eduroam-169-233-141-223.ucsc.edu>
    Co-authored-by: Reese Levine <reeselevine1@gmail.com>

Author: James Contini <jamescontini@gmail.com>
Date:   Wed Oct 29 23:13:06 2025 -0700

    formatted embed wgsl and ggml-webgpu.cpp

commit e1f6bae
Author: James Contini <jamescontini@gmail.com>
Date:   Wed Oct 29 23:08:37 2025 -0700

    implemented REPL_Template support and removed bug in unary operators kernel

commit 8c70b8f
Author: James Contini <jamescontini@gmail.com>
Date:   Wed Oct 15 16:14:20 2025 -0700

    responded and dealt with PR comments

commit f9282c6
Author: James Contini <jamescontini@gmail.com>
Date:   Sun Oct 12 13:41:41 2025 -0700

    removed unnecessary checking if node->src[1] exists for unary operators

commit 4cf28d7
Author: James Contini <jamescontini@gmail.com>
Date:   Sun Oct 12 13:32:45 2025 -0700

    All operators (including xielu) working

commit 74c6add
Author: James Contini <jamescontini@gmail.com>
Date:   Fri Oct 10 13:16:48 2025 -0700

    fixed autoconfig

commit 3627499
Author: James Contini <jamescontini@gmail.com>
Date:   Fri Oct 10 13:10:46 2025 -0700

    removed vestigial files

commit cb08583
Author: James Contini <jamescontini@gmail.com>
Date:   Fri Oct 10 12:59:32 2025 -0700

    abides by editor-config

commit 5360e28
Author: James Contini <jamescontini@gmail.com>
Date:   Fri Oct 10 12:45:57 2025 -0700

    rms_norm double declaration bug atoned

commit 7b09baa
Merge: 8a6ec84 74b8fc1
Author: James Contini <jamescontini@gmail.com>
Date:   Fri Oct 10 11:50:03 2025 -0700

    resolving merge conflicts

commit 8a6ec84
Author: James Contini <jamescontini@gmail.com>
Date:   Wed Oct 8 18:06:47 2025 -0700

    unary operators pass ggml tests

commit c3ae382
Author: James Contini <jamescontini@gmail.com>
Date:   Wed Oct 1 16:22:40 2025 -0700

    neg passes backend test

commit aa1c9b2
Author: James Contini <jamescontini@gmail.com>
Date:   Tue Sep 30 23:55:27 2025 -0700

    neg f16xf32xip builds and runs; haven't actually run a model that uses the neg kernel yet

Co-authored-by: James Contini <jamescontini@gmail.com>
Co-authored-by: Neha Abbas <neabbas@ucsc.edu>
Co-authored-by: Abhijit Ramesh <abhijitramesh2k@gmail.com>

* Remove extra code and format

* Add ops documentation (finally)

* ggml webgpu: add SOFTPLUS unary operator

Implements SOFTPLUS (log(1 + exp(x))) with f16/f32 support. Uses f32
precision for intermediate calculations to prevent f16 overflow.

* Add shader implementation and 4 variants (f32/f16, inplace/non-inplace)
* Register pipelines and device support
* Follow Vulkan backend numerical stability pattern
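The stability concern can be shown in scalar C++ (the real kernel is WGSL; this is just the math). Naive `log(1 + exp(x))` overflows `exp()` for large `x`; the equivalent form below never exponentiates a positive argument, and `log1p` keeps precision when `exp(-|x|)` is tiny:

```cpp
#include <algorithm>
#include <cmath>

// Numerically stable softplus: log(1 + exp(x))
//   rewritten as max(x, 0) + log1p(exp(-|x|)).
float softplus_stable(float x) {
    return std::max(x, 0.0f) + std::log1p(std::exp(-std::fabs(x)));
}
```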

* ggml webgpu: add EXPM1 unary operator

Implements EXPM1 (exp(x) - 1) with f16/f32 support.

* Add shader implementation and 4 variants (f32/f16, inplace/non-inplace)
* Register pipelines and device support
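A dedicated EXPM1 op exists because, for `|x|` near zero, `exp(x)` is approximately 1.0 and the subtraction `exp(x) - 1` cancels nearly all significant bits. A scalar illustration of the difference:

```cpp
#include <cmath>

// Catastrophic cancellation: for tiny x, exp(x) rounds to exactly 1.0,
// so the subtraction returns 0 instead of ~x.
double naive_expm1(double x)  { return std::exp(x) - 1.0; }

// std::expm1 computes exp(x) - 1 directly, without the cancellation.
double stable_expm1(double x) { return std::expm1(x); }
```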

* ggml webgpu: add FLOOR unary operator

Implements FLOOR (rounds down to nearest integer) with f16/f32 support.

* Add shader implementation and 4 variants (f32/f16, inplace/non-inplace)
* Register pipelines and device support

* ggml webgpu: add CEIL unary operator

Implements CEIL (rounds up to nearest integer) with f16/f32 support.

* Add shader implementation and 4 variants (f32/f16, inplace/non-inplace)
* Register pipelines and device support

* ggml webgpu: add ROUND unary operator

Implements ROUND (rounds to nearest integer) with f16/f32 support.

* Add shader implementation and 4 variants (f32/f16, inplace/non-inplace)
* Register pipelines and device support

* ggml webgpu: add TRUNC unary operator

Implements TRUNC (truncates towards zero) with f16/f32 support.

* Add shader implementation and 4 variants (f32/f16, inplace/non-inplace)
* Register pipelines and device support

* docs : update WebGPU support for unary operators (FLOOR, CEIL, ROUND, TRUNC, EXPM1, SOFTPLUS)

* Updates to webgpu get_memory

* Move shared state (webgpu_context) and device creation out of registration context, device context, and buffer context, and move into backend context

* Small cleanup

* Move Instance, Device, Adapter, Device creation, and capabilities to global state while moving Queue, pipelines, and buffers to per-thread state.

* Cleanups

* More cleanup

* Move staging_buf mutex to global context

* Resolve merge

* Resolve merge

* Resolve merge

* Clean up merge errors, delete forward declaration, and run clang-format

* Rename device_init to backend_init

* Move webgpu_context to backend_context

* Move buffer context members into global context and refactor function calls

* Run clang-format

* Remove comments

* Move parameter buffers to per-thread, add single memset_tensor param buf

* Fix CI compilation issue

* Fix builds for emscripten not supporting subgroups

* cleanup

* cleanup

---------

Co-authored-by: Reese Levine <reeselevine1@gmail.com>