feat(developer_shell): Add output_tmpfile parameter to shell tool
#2817
Conversation
```rust
n = stdout_reader.read_until(b'\n', &mut stdout_buf), if !stdout_done => {
    if n? == 0 {
        stdout_done = true;
    } else {
```
I think this was actually a pre-existing race condition where basically we could lose the last hunk of a process output.
This is a giant problem in general with select!. What would make this code a lot more elegant I think is to iterate over a stream - https://docs.rs/tokio-stream/0.1.17/tokio_stream/wrappers/struct.LinesStream.html
And then we'd unify those two streams into a single stream of `enum Message { Stdout(String), Stderr(String) }`, which is one pattern recommended by https://blog.yoshuawuyts.com/futures-concurrency-3 - we then just drive that stream to completion, and don't need to carry these buffers.
OK I went ahead and did this and yeah I think the result is a lot cleaner!
The goal is to lower the number of tokens the LLM needs to parse for large commands like `make` and `cargo build`. Most of the time it only needs stderr if they fail, and if they *don't* fail there's usually no need to parse the output at all. As noted in the text, this takes an optional regular expression; for key tasks like `cargo build` I think a regex filter of `^(warning|error):` works well here, avoiding a latency hit where the LLM may need to check the redirected output file for warnings. I also changed things so that we don't have a hardcoded command output limit; instead we just automatically redirect to a tempfile anyway if the output is large. Finally, as part of the implementation of this I cleaned up the way we read from stdout/stderr, greatly deduplicating code by iterating over a stream. Signed-off-by: Colin Walters <walters@verbum.org>
Force-pushed f69017e to 748c48f
@cgwalters yeah, missed this one somehow - something like this could work well, very nice. It is clear goose gobbles up tokens like candy. I have this change here: #3750, which doesn't require any tool signature change and is in line with how other things work (we can layer on smarts; I like the idea of looking for patterns and rules and heuristics, eg stack traces etc).

Related to this, we have a similar issue with text_view_file, which can easily swallow too much vs pushing back (that probably does need a signature change). In general, I have found that having the tools return feedback vs up-front prompting can work quite well and doesn't add work to the LLM up front to work out what to do when (ie if something is large, you can tell it at that point with instructions on how to not hit that, or how to override). If you are interested in updating this branch I would happily take a look - I think we really need things like this; you are totally right.
Thanks for looking at this. The out-of-the-box developer_shell and other core MCP tools are IMO a really key and sensitive part of what tools like goose provide (in combination with the model). Probably what we should actually do is start with a survey of what similar tools provide (e.g. Gemini CLI, Claude Code, Cursor etc). Just as one example, Gemini's
Yes. And thanks for reviewing my PRs. I would like to advocate for more usage of Goose in my team and organization, and it's helpful. Again, I can't make hard promises, but if I can demonstrate engagement it helps make the argument.
@cgwalters yeah - 100% agree on a survey; I think we should start from that and work backwards? I like the idea of multiple matches, and even separating out styles of editor (ie we can use editor models, which do many powerful things IF you enable them - which Cursor does, for example; they may be a better approach than being too clever with editing, but then it is worth seeing what is out there and what works well). @cgwalters also, what is your nick on Discord? (if on there)
I updated https://github.com/cgwalters/cgwalters/blob/main/README.md for this (it's colinwalters)
@cgwalters do you want to look at this again - or should we take another run at it given the subsequent changes? We can focus on shell in one and, like you said, do some work on the editor in the others (I know editor view needs work so it doesn't choke on long lines - curious how others do it).
Closing this for now, as things have happened in other PRs (but we have much more follow-on work to do - though we should do that in an issue or discussion, I think?)
Yes, #3750 basically ended up re-implementing this in a different way, right? I think the logic here is more elegant, though, in that instead of buffering a giant string in memory and then writing it to a temporary file at the end, the code in this PR automatically switched to a temporary file while we were reading from the child process. And actually, now that I compare the two more carefully, I see https://github.com/block/goose/pull/3750/files#r2257616339. I can try looking at re-doing this on main at some point.
Stream command output directly to a temp file when it exceeds 50KB, avoiding memory bloat for commands with huge output. Inspired by block/goose#2817.

New features:
- output_filter param: pass a regex (e.g. "^(warning|error):") to capture only matching lines inline while full output goes to file
- Automatic GC: only keep the 3 most recent output files to avoid polluting $HOME

The agent is informed that Grep and Read can be used to analyze the temp files without triggering GC.

Assisted-by: OpenCode (Opus 4.5)
Signed-off-by: Colin Walters <walters@verbum.org>
Stream command output directly to a temp file when it exceeds 50KB, avoiding memory bloat for commands with huge output. Inspired by block/goose#2817.

New feature:
- output_filter param: pass a regex (e.g. "^(warning|error):") to capture only matching lines inline while full output goes to file

The agent is informed that Grep and Read can be used to analyze the temp files.

UI now shows filtered output stats (e.g. "Filtered: 7 matches from 2.1 KB (1.9 KB omitted)") so users can see how much output was filtered and the total size.

Assisted-by: OpenCode (Opus 4.5)
Signed-off-by: Colin Walters <walters@verbum.org>
Stream command output directly to a temp file when it exceeds 50KB, avoiding memory bloat for commands with huge output. Inspired by block/goose#2817.

New feature:
- output_filter param: pass a regex (e.g. "^(warning|error):") to capture only matching lines inline while full output goes to file

Both filtered and non-filtered modes buffer in memory until the threshold is exceeded, avoiding unnecessary filesystem I/O for small output.

The agent is informed that Grep and Read can be used to analyze the temp files.

UI now shows filtered output stats (e.g. "Filtered: 7 matches from 2.1 KB (1.9 KB omitted)") so users can see how much output was filtered and the total size.

Assisted-by: OpenCode (Claude Sonnet 4)
The goal is to lower the amount of tokens the LLM needs to parse for large commands like `make` and `cargo build`. Most of the time they only need stderr if they fail, and if they *don't* fail then there's usually no need to parse the output at all.

As noted in the text, this takes an optional regular expression, and for key tasks like `cargo build` I think a regex filter of `^(warning|error):` works well here, avoiding a latency hit where the LLM may need to check the redirected output file for warnings.

I also changed things so that we don't have a hardcoded command output limit; instead we just automatically redirect to a tempfile anyway if it's large.
This is my first contribution to this project, but looking forward to more!
Benchmarking
Of course I asked Goose to benchmark itself this way, by comparing token results from a build of itself when using `output_tmpfile` vs not. It claimed it worked, with up to a 10x reduction in processed tokens. But I didn't do serious benchmarking, and as I've been playing with things more I think I'm getting throttled by Gemini... p95 latency there in the console shot up to 60+ seconds (maybe it's just me?). My next step here is going to be playing with local models.
output_tmpfilevs not. It claimed it worked with an up to 10x reduction in processed tokens.But I didn't do serious benchmarking, and as I've been playing with things more I think I'm getting throttled by Gemini...p95% latency there in the console shot up to 60+ seconds (maybe it's just me?). My next step here is going to be playing with local models.