feat(agent): add subagent support with task tool by gary149 · Pull Request #10 · gary149/llama-agent

gary149 · 2026-01-08T09:23:06Z

Summary

Add task tool for spawning specialized subagents with restricted tool access
Implement four subagent types: explore (read-only), plan (architecture design), general (multi-step tasks), bash (command execution)
Add visual tree rendering for nested subagent tool calls
Support background execution with run_in_background parameter and resume for checking status
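
To make the background/resume behaviour above concrete, here is a minimal sketch of a task registry built on `std::async`. It is not the code from this PR: `background_tasks`, `task_state`, `task_status` and the `task-N` ID format are hypothetical names chosen for illustration, and the real `run_in_background` / `resume` handling may differ.

```cpp
#include <chrono>
#include <functional>
#include <future>
#include <map>
#include <string>
#include <utility>

// Hypothetical status values; the PR does not list the real ones.
enum class task_state { UNKNOWN, RUNNING, DONE };

struct task_status {
    task_state  state;
    std::string result;  // only meaningful when state == DONE
};

// Hypothetical registry of background tasks (illustrative, not the PR's runner).
class background_tasks {
public:
    // Start `work` on another thread and return an ID the caller can resume with.
    std::string start(std::function<std::string()> work) {
        std::string id = "task-" + std::to_string(next_id_++);
        tasks_[id] = std::async(std::launch::async, std::move(work));
        return id;
    }

    // Non-blocking check: RUNNING while the task is in flight, DONE (with the
    // result) once it finished, UNKNOWN for IDs that were never issued or
    // whose result was already collected.
    task_status resume(const std::string & id) {
        auto it = tasks_.find(id);
        if (it == tasks_.end()) {
            return {task_state::UNKNOWN, ""};
        }
        if (it->second.wait_for(std::chrono::seconds(0)) != std::future_status::ready) {
            return {task_state::RUNNING, ""};
        }
        task_status done{task_state::DONE, it->second.get()};
        tasks_.erase(it);  // a result is handed out exactly once in this sketch
        return done;
    }

private:
    int next_id_ = 1;
    std::map<std::string, std::future<std::string>> tasks_;
};
```

In this sketch the caller keeps the returned ID and polls `resume(id)` until the state is no longer `RUNNING`. Erasing the entry once the result is collected is a design choice of the sketch, not something the PR states.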

Implementation Details

New Files

tools/agent/subagent/subagent-types.h/.cpp - Type definitions and configurations
tools/agent/subagent/subagent-display.h/.cpp - Visual tree rendering with RAII scope management
tools/agent/subagent/subagent-runner.h/.cpp - Execution engine with background task support
tools/agent/tools/tool-task.cpp - Task tool implementation

Key Features

Nested agent_loop: Subagents create nested agent_loop instances sharing the same server_context
Filtered tools: Each subagent type has restricted tool access via to_chat_tools_filtered()
Bash restrictions: EXPLORE subagents only allow read-only commands (ls, cat, grep, git status, etc.)
Depth tracking: Prevents infinite subagent recursion (configurable via --max-subagent-depth)
Background execution: Start tasks without blocking, resume later to check status or get results
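
The tool filtering and bash restrictions listed above could look roughly like this. The sketch assumes tools are identified by name and that read-only commands are matched by prefix. The tool names, the per-type allowlists and both helper functions are hypothetical; only the command examples (`ls`, `cat`, `grep`, `git status`) come from the description, and this is not the implementation of `to_chat_tools_filtered()`.

```cpp
#include <set>
#include <string>
#include <vector>

enum class subagent_type { EXPLORE, PLAN, GENERAL, BASH };

// Hypothetical per-type allowlists; the real tool names and sets may differ.
inline std::set<std::string> allowed_tools(subagent_type type) {
    switch (type) {
        case subagent_type::EXPLORE: return {"read", "glob", "grep", "bash"};
        case subagent_type::PLAN:    return {"read", "glob", "grep"};
        case subagent_type::BASH:    return {"bash"};
        case subagent_type::GENERAL: return {"read", "glob", "grep", "bash", "write", "edit"};
    }
    return {};
}

// Keep only the tools the given subagent type is allowed to see.
inline std::vector<std::string> filter_tools(const std::vector<std::string> & all,
                                             subagent_type type) {
    const std::set<std::string> allowed = allowed_tools(type);
    std::vector<std::string> kept;
    for (const auto & name : all) {
        if (allowed.count(name) > 0) {
            kept.push_back(name);
        }
    }
    return kept;
}

// Conservative read-only check: reject shell metacharacters that could chain
// or redirect, then require the command to start with an allowlisted prefix.
inline bool is_read_only_command(const std::string & cmd) {
    if (cmd.find_first_of(";|&><`$") != std::string::npos) {
        return false;
    }
    static const std::vector<std::string> prefixes = {"ls", "cat", "grep", "git status"};
    for (const auto & prefix : prefixes) {
        if (cmd == prefix) {
            return true;
        }
        if (cmd.compare(0, prefix.size() + 1, prefix + " ") == 0) {
            return true;
        }
    }
    return false;
}
```

Matching on `prefix + " "` rather than the bare prefix avoids accepting unrelated commands such as `lsblk`. Rejecting all pipes and redirects is stricter than strictly necessary, which is a deliberate trade-off in this sketch.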

CLI Options

--max-subagent-depth N - Set maximum nesting depth (default: 3)
--no-subagents - Disable subagent support entirely
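
As a rough illustration of how these two flags might feed the depth check, the sketch below parses them into a small options struct and gates spawning on it. `subagent_options`, `parse_subagent_flags` and `can_spawn` are hypothetical names, and treating the main agent as depth 0 is an assumption; the PR only states the flag names and the default of 3.

```cpp
#include <cstddef>
#include <cstdlib>
#include <string>
#include <vector>

// Hypothetical options struct; the default depth of 3 is taken from the PR text.
struct subagent_options {
    bool enabled   = true;
    int  max_depth = 3;
};

// Parse the two flags out of an argument list. Returns false on a missing or
// malformed depth value; unrelated arguments are ignored.
inline bool parse_subagent_flags(const std::vector<std::string> & args,
                                 subagent_options & opts) {
    for (std::size_t i = 0; i < args.size(); ++i) {
        if (args[i] == "--no-subagents") {
            opts.enabled = false;
        } else if (args[i] == "--max-subagent-depth") {
            if (i + 1 >= args.size()) {
                return false;
            }
            const std::string & value = args[i + 1];
            char * end = nullptr;
            const long parsed = std::strtol(value.c_str(), &end, 10);
            if (end == value.c_str() || *end != '\0' || parsed < 0) {
                return false;
            }
            opts.max_depth = static_cast<int>(parsed);
            ++i;  // skip the value we just consumed
        }
    }
    return true;
}

// Assumption: the main agent runs at depth 0 and each nested subagent adds 1.
inline bool can_spawn(const subagent_options & opts, int current_depth) {
    return opts.enabled && current_depth < opts.max_depth;
}
```

Under these assumptions, `--max-subagent-depth 1` would let the main agent spawn subagents but stop those subagents from spawning their own.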

Test plan

Build with cmake --build build --target llama-agent -j
Test synchronous subagent: prompt with "Use the task tool with type explore to find all .cpp files"
Test background execution: "Start a background task to explore the codebase"
Verify bash restrictions: EXPLORE subagent should reject write commands
Test depth limiting: nested subagents should respect max depth

🤖 Generated with Claude Code

Add ability to spawn specialized subagents for autonomous task execution. Subagents run with restricted tool access and return results to the parent.

Features:
- Four subagent types: explore (read-only), plan, general, bash
- Filtered tool access per subagent type
- Bash command restrictions for read-only subagents (explore)
- Background execution with run_in_background parameter
- Resume functionality to check status/get results of background tasks
- Depth limiting to prevent infinite subagent recursion
- Visual tree rendering for nested tool calls
- Shared interrupt flag for Ctrl+C propagation

New files:
- tools/agent/subagent/subagent-types.{h,cpp} - Type definitions
- tools/agent/subagent/subagent-display.{h,cpp} - Visual output
- tools/agent/subagent/subagent-runner.{h,cpp} - Execution engine
- tools/agent/tools/tool-task.cpp - Task tool implementation

Modified:
- agent-loop: Added subagent constructor with filtered tools
- tool-registry: Added execute_filtered() for bash restrictions
- agent.cpp: Added --max-subagent-depth and --no-subagents flags
- console: Added DISPLAY_TYPE_SUBAGENT
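
The shared interrupt flag for Ctrl+C propagation can be pictured as one atomic boolean that every nested loop holds a reference to. The sketch below assumes that shape; `interrupt_flag`, `loop_sketch` and its members are hypothetical and stand in for the real agent_loop, whose constructor is not shown in this PR page.

```cpp
#include <atomic>
#include <memory>

// One flag shared by the parent loop and every subagent loop it spawns.
using interrupt_flag = std::shared_ptr<std::atomic<bool>>;

// Hypothetical stand-in for an agent loop; only the interrupt handling is modelled.
struct loop_sketch {
    interrupt_flag interrupted;
    int            depth;

    // A child loop copies the pointer, so it observes the same flag as its parent.
    loop_sketch spawn_child() const {
        return loop_sketch{interrupted, depth + 1};
    }

    // Run up to max_steps iterations, checking the flag before each one.
    // Returns the number of iterations actually completed.
    int run(int max_steps) {
        int completed = 0;
        while (completed < max_steps && !interrupted->load()) {
            ++completed;  // a real loop would do one model/tool round trip here
        }
        return completed;
    }
};
```

In a real program a SIGINT handler would set the flag, and every loop at every depth would then stop at its next check. How the PR wires the handler to the flag is not described here.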

- Track subagent token stats (input/output/cached) in subagent_result
- Collect stats from nested agent_loop after subagent completes
- Add session_stats_ptr to tool_context for parent stats updates
- Update /stats command to show main vs subagent token breakdown
- Reset stats on /clear command
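
A minimal sketch of the accounting this commit describes, assuming the stats are plain counters that are summed per session. `token_stats` and `session_stats_sketch` are hypothetical types; the actual fields of `subagent_result` and the session stats struct are not shown on this page.

```cpp
#include <cstdint>

// Hypothetical counter triple matching the input/output/cached split above.
struct token_stats {
    std::int64_t input  = 0;
    std::int64_t output = 0;
    std::int64_t cached = 0;

    token_stats & operator+=(const token_stats & other) {
        input  += other.input;
        output += other.output;
        cached += other.cached;
        return *this;
    }
};

// Hypothetical session-level view: main agent and subagents are tracked
// separately so a /stats-style command can print both and their sum.
struct session_stats_sketch {
    token_stats main_agent;
    token_stats subagents;

    // Called once a subagent has finished and its counters are known.
    void add_subagent_result(const token_stats & result) { subagents += result; }

    token_stats combined() const {
        token_stats total = main_agent;
        total += subagents;
        return total;
    }

    // Mirrors the "reset stats on /clear" behaviour.
    void reset() {
        main_agent = token_stats();
        subagents  = token_stats();
    }
};
```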

This commit enables multiple subagents to run concurrently without crashes and optimizes memory usage through KV cache prefix sharing.

## Thread-Safe Console Output
- Add `g_console_mutex` to protect global console state in `console.cpp`
- Add `output_guard` RAII class for atomic multi-line output operations
- Add `set_display_unlocked()` internal function to avoid deadlocks when lock is already held

## Buffered Output for Background Tasks
- New `subagent_output_buffer` class collects output segments per task
- New `subagent_output_manager` singleton manages buffer lifecycles
- Buffered output flushes atomically with task-ID prefixes for readable interleaved output from parallel subagents

## Subagent Display Updates
- Add dual-mode output: direct for synchronous, buffered for background
- Update print functions to accept optional buffer parameter
- Scope class now supports both direct and buffered constructors

## KV Cache Prefix Sharing
- Add `base_system_prompt` field to `tool_context` for sharing prompt prefix
- Main agent stores base prompt (~400 tokens) containing identity and tools
- Subagent prompts now start with parent's base prompt to enable automatic KV cache prefix detection and reuse by llama.cpp server

The prefix caching optimization reduces prompt processing time for subagents by reusing cached tokens from the shared system prompt prefix.

…refix-caching agent: add thread-safe parallel subagent support with prefix caching

* FlashAttention (#13) * Add inplace softmax * Move rms_norm to split row approach * Update debug for supports_op * clean up debug statements * neg f16xf32xip builds and runs, havent actually ran a model that uses neg kernel yet though * neg passes backend test * unary operators pass ggml tests * rms_norm double declaration bug atoned * abides by editor-config * removed vestigial files * fixed autoconfig * All operators (inlcluding xielu) working * removed unnecesarry checking if node->src[1] exists for unary operators * responded and dealt with PR comments * implemented REPL_Template support and removed bug in unary operators kernel * formatted embed wgsl and ggml-webgpu.cpp * Faster tensors (#8) Add fast matrix and matrix/vector multiplication. * Use map for shader replacements instead of pair of strings * Wasm (#9) * webgpu : fix build on emscripten * more debugging stuff * test-backend-ops: force single thread on wasm * fix single-thread case for init_tensor_uniform * use jspi * add pthread * test: remember to set n_thread for cpu backend * Add buffer label and enable dawn-specific toggles to turn off some checks * Intermediate state * Fast working f16/f32 vec4 * Working float fast mul mat * Clean up naming of mul_mat to match logical model, start work on q mul_mat * Setup for subgroup matrix mat mul * Basic working subgroup matrix * Working subgroup matrix tiling * Handle weirder sg matrix sizes (but still % sg matrix size) * Working start to gemv * working f16 accumulation with shared memory staging * Print out available subgroup matrix configurations * Vectorize dst stores for sg matrix shader * Gemv working scalar * Minor set_rows optimization (#4) * updated optimization, fixed errors * non vectorized version now dispatches one thread per element * Simplify * Change logic for set_rows pipelines --------- Co-authored-by: Neha Abbas <nehaabbas@macbookpro.lan> Co-authored-by: Neha Abbas <nehaabbas@ReeseLevines-MacBook-Pro.local> Co-authored-by: Reese Levine 
<reeselevine1@gmail.com> * Comment on dawn toggles * Working subgroup matrix code for (semi)generic sizes * Remove some comments * Cleanup code * Update dawn version and move to portable subgroup size * Try to fix new dawn release * Update subgroup size comment * Only check for subgroup matrix configs if they are supported * Add toggles for subgroup matrix/f16 support on nvidia+vulkan * Make row/col naming consistent * Refactor shared memory loading * Move sg matrix stores to correct file * Working q4_0 * Formatting * Work with emscripten builds * Fix test-backend-ops emscripten for f16/quantized types * Use emscripten memory64 to support get_memory * Add build flags and try ci --------- Co-authored-by: Xuan Son Nguyen <son@huggingface.co> * Remove extra whitespace * Move wasm single-thread logic out of test-backend-ops for cpu backend * Disable multiple threads for emscripten single-thread builds in ggml_graph_plan * Refactored pipelines and workgroup calculations (#10) * refactored pipelines * refactored workgroup calculation * removed commented out block of prior maps * Clean up ceiling division pattern --------- Co-authored-by: Neha Abbas <nehaabbas@eduroam-169-233-141-223.ucsc.edu> Co-authored-by: Reese Levine <reeselevine1@gmail.com> * Start work on flash attention * Shader structure set up (many bugs still) * debugging * Working first test * Working with head grouping, head sizes to 128, logit softcap, mask/sinks enabled, f32 * Generalize softmax to work with multiple subgroups, f16 accumulation, mask shared memory tiling * Start work on integrating pre-wgsl * Separate structs/initial shader compilation library into separate files * Work on compilation choices for flashattention * Work on subgroup matrix/tile size portability * subgroup size agnostic online softmax * Cleanups, quantization types * more cleanup * fix wasm build * Refactor flashattention to increase parallelism, use direct loads for KV in somce cases * Checkpoint * formatting * Update to 
account for default kv cache padding * formatting shader * Add workflow for ggml-ci webgpu * Try passing absolute path to dawn in ggml-ci * Avoid error on device destruction, add todos for proper cleanup * Fix unused warning * Forgot one parameter unused * Move some flashattn computation to f32 for correctness

…d per-thread state (ggml-org#18976) * Squashed commit of the following: commit b3c6bf4 Author: Abhijit Ramesh <abhijitramesh2k@gmail.com> Date: Mon Dec 1 18:29:00 2025 -0800 ggml webgpu: fix xielu parameter passing (#11) The XIELU operation was incorrectly using static_cast to convert float parameters to uint32_t, which converted numeric values instead of preserving IEEE 754 bit patterns. This caused incorrect values to be interpreted by the GPU shader. * Use reinterpret_cast to preserve float bit patterns when passing through uint32_t params buffer * Update WGSL shader parameter types from u32 to f32 * Re-enable XIELU support (was disabled due to numerical issues) Fixes NMSE test failures for XIELU operation on WebGPU backend. commit 5ca9b5e Author: neha-ha <137219201+neha-ha@users.noreply.github.com> Date: Tue Nov 18 12:17:00 2025 -0800 Refactored pipelines and workgroup calculations (#10) * refactored pipelines * refactored workgroup calculation * removed commented out block of prior maps * Clean up ceiling division pattern --------- Co-authored-by: Neha Abbas <nehaabbas@eduroam-169-233-141-223.ucsc.edu> Co-authored-by: Reese Levine <reeselevine1@gmail.com> Author: James Contini <jamescontini@gmail.com> Date: Wed Oct 29 23:13:06 2025 -0700 formatted embed wgsl and ggml-webgpu.cpp commit e1f6bae Author: James Contini <jamescontini@gmail.com> Date: Wed Oct 29 23:08:37 2025 -0700 implemented REPL_Template support and removed bug in unary operators kernel commit 8c70b8f Author: James Contini <jamescontini@gmail.com> Date: Wed Oct 15 16:14:20 2025 -0700 responded and dealt with PR comments commit f9282c6 Author: James Contini <jamescontini@gmail.com> Date: Sun Oct 12 13:41:41 2025 -0700 removed unnecesarry checking if node->src[1] exists for unary operators commit 4cf28d7 Author: James Contini <jamescontini@gmail.com> Date: Sun Oct 12 13:32:45 2025 -0700 All operators (inlcluding xielu) working commit 74c6add Author: James Contini <jamescontini@gmail.com> Date: Fri 
Oct 10 13:16:48 2025 -0700 fixed autoconfig commit 3627499 Author: James Contini <jamescontini@gmail.com> Date: Fri Oct 10 13:10:46 2025 -0700 removed vestigial files commit cb08583 Author: James Contini <jamescontini@gmail.com> Date: Fri Oct 10 12:59:32 2025 -0700 abides by editor-config commit 5360e28 Author: James Contini <jamescontini@gmail.com> Date: Fri Oct 10 12:45:57 2025 -0700 rms_norm double declaration bug atoned commit 7b09baa Merge: 8a6ec84 74b8fc1 Author: James Contini <jamescontini@gmail.com> Date: Fri Oct 10 11:50:03 2025 -0700 resolving merge conflicts commit 8a6ec84 Author: James Contini <jamescontini@gmail.com> Date: Wed Oct 8 18:06:47 2025 -0700 unary operators pass ggml tests commit c3ae382 Author: James Contini <jamescontini@gmail.com> Date: Wed Oct 1 16:22:40 2025 -0700 neg passes backend test commit aa1c9b2 Author: James Contini <jamescontini@gmail.com> Date: Tue Sep 30 23:55:27 2025 -0700 neg f16xf32xip builds and runs, havent actually ran a model that uses neg kernel yet though Co-authored-by: James Contini <jamescontini@gmail.com> Co-authored-by: Neha Abbas <neabbas@ucsc.edu> Co-authored-by: Abhijit Ramesh <abhijitramesh2k@gmail.com> * Remove extra code and format * Add ops documentation (finally) * ggml webgpu: add SOFTPLUS unary operator Implements SOFTPLUS (log(1 + exp(x))) with f16/f32 support. Uses f32 precision for intermediate calculations to prevent f16 overflow. * Add shader implementation and 4 variants (f32/f16, inplace/non-inplace) * Register pipelines and device support * Follow Vulkan backend numerical stability pattern * ggml webgpu: add EXPM1 unary operator Implements EXPM1 (exp(x) - 1) with f16/f32 support. * Add shader implementation and 4 variants (f32/f16, inplace/non-inplace) * Register pipelines and device support * ggml webgpu: add FLOOR unary operator Implements FLOOR (rounds down to nearest integer) with f16/f32 support. 
* Add shader implementation and 4 variants (f32/f16, inplace/non-inplace) * Register pipelines and device support * ggml webgpu: add CEIL unary operator Implements CEIL (rounds up to nearest integer) with f16/f32 support. * Add shader implementation and 4 variants (f32/f16, inplace/non-inplace) * Register pipelines and device support * ggml webgpu: add ROUND unary operator Implements ROUND (rounds to nearest integer) with f16/f32 support. * Add shader implementation and 4 variants (f32/f16, inplace/non-inplace) * Register pipelines and device support * ggml webgpu: add TRUNC unary operator Implements TRUNC (truncates towards zero) with f16/f32 support. * Add shader implementation and 4 variants (f32/f16, inplace/non-inplace) * Register pipelines and device support * docs : update WebGPU support for unary operators (FLOOR, CEIL, ROUND, TRUNC, EXPM1, SOFTPLUS) * Updates to webgpu get_memory * Move shared state (webgpu_context) and device creation out of registration context, device context, and buffer context, and move into backend context * Small cleanup * Move Instance, Device, Adapter, Device creation, and capabilities to global state while moving Queue, pipelines, and buffers to per-thread state. * Cleanups * More cleanup * Move staging_buf mutex to global context * Resolve merge * Resolve merge * Resolve merge * Clean up merge errors, delete forward declaration, and run clang-format * Rename device_init to backend_init * Move webgpu_context to backend_context * Move buffer context members into global context and refactor function calls * Run clang-format * Remove commends * Move parameter buffers to per-thread, add single memset_tensor param buf * Fix CI compilation issue * Fix builds for emscripten not supporting subgroups * cleanup * cleanup --------- Co-authored-by: Reese Levine <reeselevine1@gmail.com>

gary149 added 5 commits January 8, 2026 10:17

docs(agent): add subagents section to README

15036b8

Merge pull request #11 from gary149/feature/parallel-subagents-with-p…

79614c1

…refix-caching agent: add thread-safe parallel subagent support with prefix caching

gary149 merged commit 95cdb42 into master Jan 8, 2026

gary149 merged 5 commits into master from feature/subagent-support
