@dcieslak19973
Contributor

Addresses #3382

Summary of Changes

  1. Tree-sitter Code Indexing Feature (Goose CLI)
  • New Feature: Adds experimental Tree-sitter-based code indexing to Goose CLI, supporting Rust, Python, JavaScript, TypeScript, Go, C++, Java, C#, and Swift.
  • New Command: goose repo index for indexing repositories and generating a .goose-repo-index.jsonl file with extracted code entities (classes, functions, relationships, docstrings, etc.).
  • New Module: repo.rs implements the indexer.
  • New Documentation: tree-sitter-indexing.md details supported languages, extracted entities, and upgrade guidance.
  2. Dependency Updates
  • Cargo.toml / Cargo.lock:
    • Adds dependencies for tree-sitter and language grammars (see above).
    • Adds walkdir and ignore for efficient file traversal.
    • Updates and reorders some existing dependencies for compatibility.
  3. CLI Enhancements
  • New Subcommand: Adds Repo subcommand to Goose CLI, with an index action for repository indexing.
  • Refactoring: Updates CLI argument parsing to support new subcommands and options.
  4. Example and Test Data
  • Adds Example Indexed Data: repo-index.jsonl and language-specific example source files for testing and demonstration.
  5. Internal Improvements
  • Entity Extraction: Richer extraction for Go, Python, and C++ (fields, generics, decorators, etc.).
  • Parent/Child Relationships: Improved tracking of parent entities for functions/methods.
  • Docstring/Comment Extraction: For Python and Rust, docstrings and doc comments are included in the index.
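
To make the JSONL index idea concrete, here is a minimal sketch of how entity records could be serialized one per line and searched by name. The field names (`file`, `kind`, `name`, `parent`, `line`) and the helper functions are assumptions for illustration only; the real schema is whatever repo.rs emits and tree-sitter-indexing.md documents.

```python
import json
from typing import Dict, Iterable, List


def dump_jsonl(entities: Iterable[Dict]) -> str:
    """Serialize entities as JSON Lines: one compact JSON object per line."""
    return "".join(json.dumps(entity, sort_keys=True) + "\n" for entity in entities)


def load_jsonl(text: str) -> List[Dict]:
    """Parse JSON Lines text back into a list of entities, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def find_by_name(entities: Iterable[Dict], needle: str) -> List[Dict]:
    """Case-insensitive substring match on the (hypothetical) `name` field."""
    needle = needle.lower()
    return [e for e in entities if needle in e.get("name", "").lower()]


# Hypothetical records; the values are made up for illustration.
SAMPLE = [
    {"file": "src/python/test.py", "kind": "class", "name": "Greeter", "parent": None, "line": 1},
    {"file": "src/python/test.py", "kind": "method", "name": "greet", "parent": "Greeter", "line": 4},
]
```

Because each record sits on its own line, plain line-oriented tools can also search the file without parsing the whole index.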

Impact

  • LLM/AI Assistant Support: Lays the foundation for advanced code navigation, search, and summarization features.
  • Multi-language Support: Enables entity-level code understanding for a wide range of languages.
  • Experimental: Marked as experimental; subject to change as the Tree-sitter ecosystem evolves.

Copilot AI review requested due to automatic review settings July 13, 2025 05:21
@dcieslak19973 dcieslak19973 changed the title from "Tree-sitter Indexer" to "DRAFT: Tree-sitter Indexer" Jul 13, 2025
Contributor

Copilot AI left a comment

Pull Request Overview

Adds an experimental Tree-sitter–based repo indexing feature to the Goose CLI, including a new goose repo index command, multi‐language support examples, and accompanying documentation and dependency updates.

  • Introduces a new repo subcommand in cli.rs and its implementation in repo.rs to index code entities into a JSONL file
  • Adds example source files and a sample .goose-repo-index.jsonl output under examples/example-treesitter-repo/
  • Updates Cargo.toml to include Tree-sitter and ignore/walkdir dependencies and adds detailed docs in tree-sitter-indexing.md

Reviewed Changes

Copilot reviewed 18 out of 19 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| examples/example-treesitter-repo/src/typescript/test.ts | Adds TypeScript example demonstrating classes and functions |
| examples/example-treesitter-repo/src/tsx/test.tsx | Adds TSX React component example |
| examples/example-treesitter-repo/src/* | Adds multi-language sample files (Swift, Rust, Python, JS, Java, Go, C#, C++) |
| examples/example-treesitter-repo/repo-index.jsonl | Provides an example indexed output |
| documentation/docs/experimental/tree-sitter-indexing.md | Documents the new indexing feature |
| crates/goose-cli/src/commands/repo.rs | Implements Tree-sitter entity extraction logic |
| crates/goose-cli/src/commands/mod.rs | Registers the new repo module |
| crates/goose-cli/src/cli.rs | Adds RepoCommand::Index and command dispatch |
| crates/goose-cli/Cargo.toml | Adds Tree-sitter and traversal dependencies |
Comments suppressed due to low confidence (2)

crates/goose-cli/Cargo.toml:10

  • [nitpick] The [[bin]] section specifying the Goose binary was removed, which may affect how the binary is built or named. Restore or verify this section to ensure the CLI binary still builds as expected.
[lints]

@dcieslak19973 dcieslak19973 force-pushed the goose-3382-treesitter-index-20250712a branch from c142a79 to b50ef31 Compare July 14, 2025 02:06
@michaelneale michaelneale marked this pull request as draft July 14, 2025 04:32
@dcieslak19973 dcieslak19973 changed the title from "DRAFT: Tree-sitter Indexer" to "Feature: Tree-sitter Indexer" Jul 14, 2025
@dcieslak19973
Contributor Author

@michaelneale - I think the code is currently in a state ready for discussion, both of the code and the concept. I'm open to feedback; I've not done benchmarks, but conceptually the idea is to help goose get better context into the LLM. Potentially down the road, goose could keep this data (or similar data) in memory (or update the file in the background) as it makes changes to a codebase.
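
One possible shape for the "update the file in the background" idea is to re-parse only files whose content hash changed since the last pass. This is a rough sketch under that assumption; `plan_reindex` and the path-to-digest map are hypothetical names, not something in this PR.

```python
import hashlib
from typing import Dict, List, Tuple


def digest(content: bytes) -> str:
    """Content hash used to detect whether a file changed since the last pass."""
    return hashlib.sha256(content).hexdigest()


def plan_reindex(
    previous: Dict[str, str], current_files: Dict[str, bytes]
) -> Tuple[List[str], List[str], Dict[str, str]]:
    """Compare stored digests with current file contents.

    Returns (paths to re-parse, paths to drop from the index, new digest map).
    """
    new_digests = {path: digest(data) for path, data in current_files.items()}
    # New files and files whose digest differs both need to be parsed again.
    changed = sorted(p for p, d in new_digests.items() if previous.get(p) != d)
    # Files that disappeared should have their entities removed from the index.
    removed = sorted(p for p in previous if p not in new_digests)
    return changed, removed, new_digests
```

A full rebuild is the special case where `previous` is empty, so every file is reported as changed.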

@dcieslak19973
Contributor Author

Any thoughts on accepting this or changes necessary to accept?

@michaelneale
Collaborator

@dcieslak19973 I like this idea - but wonder if it can go in the main goose crate potentially so it isn't limited to the GUI - would be nice to be able to make these maps from anywhere - are you interested in that?

@michaelneale michaelneale self-assigned this Aug 8, 2025
@dcieslak19973
Contributor Author

Yes, I can look into doing that. Not sure if I'll have time this weekend, but probably within the next 2 weeks

@michaelneale
Collaborator

thanks - will keep an eye open, curious how this goes, good research IMO

@dcieslak19973
Contributor Author

dcieslak19973 commented Aug 11, 2025 via email

@dcieslak19973 dcieslak19973 force-pushed the goose-3382-treesitter-index-20250712a branch from 4cd65e9 to df33b72 Compare August 12, 2025 15:18
@dcieslak19973 dcieslak19973 marked this pull request as ready for review August 12, 2025 15:28
@dcieslak19973
Contributor Author

I found some time late last night and this morning to Copilot through the request to add this to the UI as well.

@michaelneale michaelneale added the "p1" (Priority 1 - High, supports roadmap) and "performance" (Performance related) labels and removed the "waiting" label Aug 13, 2025
@powershell.exe -Command "if (Test-Path ./target/x86_64-pc-windows-gnu/release/goosed.exe) { \
Write-Host 'Copying Windows binary and DLLs to ui/desktop/src/bin...'; \
Copy-Item -Path './target/x86_64-pc-windows-gnu/release/goosed.exe' -Destination './ui/desktop/src/bin/' -Force; \
if (Test-Path ./target/x86_64-pc-windows-gnu/release/goose.exe) { \
Collaborator

not sure if this is related to this change?

Contributor Author

Oops, missed CoPilot going a little crazy there.

@michaelneale
Collaborator

@dcieslak19973 OK, this looks like a good start. It took 1s to index goose; I also tried a very, very large monorepo, and that came to about a minute and 300M or so, which is acceptable.

What would be next for consuming this index, for example a codesearch tool that exposes it via an appropriate MCP? Something like Zoekt, but perhaps Rust-native? Any ideas?

@dcieslak19973
Contributor Author

Admittedly, I don't have a great plan for what's next. Some things, which are probably not mutually exclusive:

  1. Do additional research/analysis on how aider uses this technique. This implementation is really just based on reading https://aider.chat/docs/repomap.html.
  2. The jsonl file isn't meant to be permanent, though it offers a potentially easy integration path when goose's LLM wants to do a code search: in theory grep, ripgrep, etc. would find contents in this file.
  3. So, replace (or maybe augment) the jsonl by getting this into memory, probably in some form of vector database where the additional metadata (line numbers, etc.) could be stored with the embeddings. Then present goose with a codesearch tool that does an embedding search.
  4. Probably some way to do an incremental update of item 3 as goose changes the codebase. To start, the easiest thing might be to just trigger full rebuilds, and then look at smarter approaches for larger codebases that would benefit from incremental updates.
  5. Potentially add something like PageRank (on functions, classes, etc.) to this process. When goose wants to understand the codebase at a high level, feeding in (maybe through another tool) the "important" features of the codebase would be useful.
  6. If you follow claude code's work, they are "spawning" agents (I use this loosely, as I'm not totally sure how they implement it). Goose could follow suit; maybe there'd be a background "agent" that updates the vector store after changes, etc.
  7. I've mentioned Zoekt. I think that would be a separate implementation from this, and one where a "codesearch" tool/agent could use a both-and search over the in-memory vector store and an in-memory Zoekt instance.
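
As a rough illustration of the PageRank idea in item 5, the sketch below runs plain power-iteration PageRank over a symbol reference graph. It assumes the index can be turned into a mapping from each symbol to the symbols it references; the graph shape, the function name, and the damping factor of 0.85 are assumptions for illustration, not what this PR or aider implements.

```python
from typing import Dict, List


def pagerank(
    edges: Dict[str, List[str]], damping: float = 0.85, iterations: int = 50
) -> Dict[str, float]:
    """Power-iteration PageRank over a reference graph.

    `edges[a]` lists the symbols that `a` references (calls, inherits from, etc.).
    """
    nodes = set(edges)
    for targets in edges.values():
        nodes.update(targets)
    if not nodes:
        return {}
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by symbols with no outgoing references is spread evenly.
        dangling = sum(rank[node] for node in nodes if not edges.get(node))
        nxt = {node: (1.0 - damping) / n + damping * dangling / n for node in nodes}
        for source, targets in edges.items():
            if targets:
                # Each referenced symbol receives an equal share of the source's rank.
                share = damping * rank[source] / len(targets)
                for target in targets:
                    nxt[target] += share
        rank = nxt
    return rank
```

Sorting symbols by the returned score gives a candidate list of "important" entities to feed to the model first; symbols referenced from many places rank higher than entry points that nothing references.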

@dcieslak19973 dcieslak19973 force-pushed the goose-3382-treesitter-index-20250712a branch 2 times, most recently from d33acc8 to 730ade7 Compare August 17, 2025 15:42
@dcieslak19973
Contributor Author

@michaelneale - I think this is closer to a better form. There's probably still some stuff to clean up in the cli (maybe add a slash command) and the UI

@dcieslak19973 dcieslak19973 force-pushed the goose-3382-treesitter-index-20250712a branch from 1042cdc to 28228d8 Compare August 18, 2025 00:03
…er config read

Consolidated changes: safer window config read, binary path checks, completion logs and notifications for repo-index; added renderer IPC event.

Signed-off-by: dcieslak19973 <[email protected]>
@dcieslak19973 dcieslak19973 reopened this Aug 18, 2025
@dcieslak19973 dcieslak19973 requested a review from a team as a code owner August 18, 2025 00:08
@@ -0,0 +1,241 @@
# Repository Search with Tree-sitter Indexing (Experimental)
Collaborator

missing frontmatter:

Suggested change
# Repository Search with Tree-sitter Indexing (Experimental)
---
title: Repository Search with Tree-sitter Indexing
sidebar_label: Repository Search with Tree-sitter Indexing
sidebar_position: 5
---

@@ -0,0 +1,241 @@
# Repository Search with Tree-sitter Indexing (Experimental)

> Experimental repository indexing: graph + PageRank + blended fuzzy symbol search across multiple languages.
Collaborator

Suggested change
> Experimental repository indexing: graph + PageRank + blended fuzzy symbol search across multiple languages.

@michaelneale michaelneale marked this pull request as draft August 18, 2025 00:29
Collaborator

Hi! Thanks for the comprehensive documentation. However, I think this needs to be restructured for our user-facing docs site. The current version reads more like internal technical documentation than a user guide.

Key issues:

  • Missing the "why": Users won't understand what problem this solves or when they'd use it
  • Too implementation-focused: Details about PageRank algorithms, data models, and Tree-sitter versions aren't relevant to users
  • No clear use cases: When would someone run goose repo index? What does it enable?

Suggested restructure:

  1. Start with the problem: What does this solve? (e.g., "Help Goose better understand your codebase structure")
  2. Simple use case: "Run this when you want Goose to have deeper knowledge of your project"
  3. Basic usage: Simple example with expected output
  4. Benefits: What can Goose do better after indexing?
  5. Supported languages

Move to separate docs (if necessary):

  • Implementation details
  • Technical architecture
  • Performance tuning
  • Contributing guidelines

Could you revise this to focus on the user experience?

Contributor Author

Thanks for this feedback. I've split this documentation into a user-oriented guide and a more technical implementation-detail document.

…ove manual index cmd; add observability docs & events; add background_index module; gate UI menu by ALPHA_FEATURES
@michaelneale
Collaborator

@dcieslak19973 we also recently removed the lancedb vector store as a required goose dependency so that might be causing some of the conflicts here. BTW I am working on an mcp that will use it which is external to goose, which could possibly be complemented by tree sitter as well - would you consider collaborating there?

@dcieslak19973
Contributor Author

Could be. I also wasn't using devcontainers which may be part of it as well (not to mention my inexperience with Rust and typescript and the goose codebase).

While I'm somewhat sympathetic to the idea of making this an MCP in terms of modularity, etc., I do feel strongly that this should be part of the core functionality of goose, maybe as part of the "developer" built-ins, perhaps because I view goose as a tool similar to claude-code, aider, OpenCode, etc., and less as a general-purpose AI tool. Ultimately, the decision is yours/your team's as project maintainer.

For reference, aider includes repomap functionality directly in their codebase:

OpenCode (the sst version) includes tools like grep and even LSPs as part of their core functionality:

And I do believe that this functionality should represent a huge improvement over simple greps. Ideally, we'd run the goose coding benchmarks with this functionality enabled and disabled and get a comparison (and, I could be proven wrong)

@michaelneale
Collaborator

michaelneale commented Aug 22, 2025

@dcieslak19973 yeah, I think I agree with you - I have an MCP I am working on that uses lancedb, which makes sense as an MCP as it is search, but this seems more core to the developer experience. What I am missing is how we search this index and how we surface it. The built-in developer MCP is one way, or we could easily have an additional built-in one to turn on (which goose will know to turn on if needed).

I did test this on a very, very large internal monorepo and it is workable, just a little heavy. What I think we should do is redo this in the goose-mcp crate, not the cli, which would make it less conflict-heavy. WDYT?

If you like, can shift to a branch, and I can get you access to push to that and collaborate there?

@dcieslak19973
Contributor Author

Sounds good. Apologies if the work is a bit rough as I've not done much GitHub collabs (nor Rust for that matter), but I did sleep at a Holiday Inn Express once and I have access to some LLMs

@michaelneale
Collaborator

@dcieslak19973 "I did sleep at a Holiday Inn Express once and I have access to some LLMs" - from my experience staying at Holiday Inns, I was never on a holiday. Seemed very much like a work trip to me.

Yes, I am keeping this open as we need to open a new body of work with tree sitter, and I am looking at things like "zoekt" for fast code search so we can quickly have tools that find code, index code, and also find symbols and relationships. Still not sure about embeddings and vector DBs, as I haven't had much luck with them scaling to larger codebases (but I know it must be possible).

@michaelneale
Collaborator

thanks - we will continue the work over here: #4530 in the developer MCP - with tree sitter
