Feature: Tree-sitter Indexer #3389
Pull Request Overview
Adds an experimental Tree-sitter–based repo indexing feature to the Goose CLI, including a new goose repo index command, multi‐language support examples, and accompanying documentation and dependency updates.
- Introduces a new `repo` subcommand in `cli.rs` and its implementation in `repo.rs` to index code entities into a JSONL file
- Adds example source files and a sample `.goose-repo-index.jsonl` output under `examples/example-treesitter-repo/`
- Updates `Cargo.toml` to include Tree-sitter and ignore/walkdir dependencies and adds detailed docs in `tree-sitter-indexing.md`
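To make the "one entity per line" output concrete, here is a minimal sketch of writing index records as JSONL using only the standard library. The `Entity` struct, its field names, and the helper functions are assumptions for illustration; the schema that `repo.rs` actually emits may differ.

```rust
use std::io::{self, Write};

/// Hypothetical entity record; the real schema in repo.rs may differ.
pub struct Entity {
    pub kind: String, // e.g. "function" or "class"
    pub name: String,
    pub file: String,
    pub line: usize,
}

/// Escape a string so it can be embedded in a JSON string literal.
fn json_escape(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            '\t' => out.push_str("\\t"),
            // Remaining control characters use the \uXXXX form.
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out
}

/// Serialize one entity as a single JSON object (no trailing newline).
pub fn to_jsonl_line(e: &Entity) -> String {
    format!(
        "{{\"kind\":\"{}\",\"name\":\"{}\",\"file\":\"{}\",\"line\":{}}}",
        json_escape(&e.kind),
        json_escape(&e.name),
        json_escape(&e.file),
        e.line
    )
}

/// Write all entities to `out`, one JSON object per line.
pub fn write_index<W: Write>(out: &mut W, entities: &[Entity]) -> io::Result<()> {
    for e in entities {
        writeln!(out, "{}", to_jsonl_line(e))?;
    }
    Ok(())
}
```

Because each line is an independent JSON object, a consumer can stream the file line by line without loading the whole index into memory.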
Reviewed Changes
Copilot reviewed 18 out of 19 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| examples/example-treesitter-repo/src/typescript/test.ts | Adds TypeScript example demonstrating classes and functions |
| examples/example-treesitter-repo/src/tsx/test.tsx | Adds TSX React component example |
| examples/example-treesitter-repo/src/* | Adds multi-language sample files (Swift, Rust, Python, JS, Java, Go, C#, C++) |
| examples/example-treesitter-repo/repo-index.jsonl | Provides an example indexed output |
| documentation/docs/experimental/tree-sitter-indexing.md | Documents the new indexing feature |
| crates/goose-cli/src/commands/repo.rs | Implements Tree-sitter entity extraction logic |
| crates/goose-cli/src/commands/mod.rs | Registers the new repo module |
| crates/goose-cli/src/cli.rs | Adds RepoCommand::Index and command dispatch |
| crates/goose-cli/Cargo.toml | Adds Tree-sitter and traversal dependencies |
Comments suppressed due to low confidence (2)
crates/goose-cli/Cargo.toml:10
- [nitpick] The `[[bin]]` section specifying the Goose binary was removed, which may affect how the binary is built or named. Restore or verify this section to ensure the CLI binary still builds as expected.
[lints]
c142a79 to b50ef31
@michaelneale - I think the code is currently in a state ready for discussion, both of the code and the concept. I'm open to feedback; I've not done benchmarks, but conceptually the idea is to help goose get better context into the LLM. Potentially down the road, goose could keep this data (or similar data) in memory (or update the file in the background) as it makes changes to a codebase.

Any thoughts on accepting this or changes necessary to accept?

@dcieslak19973 I like this idea - but wonder if it can go in the main goose crate potentially so isn't limited to the GUI - would be nice to be able to make these maps from anywhere - you interested in that?

Yes, I can look into doing that. Not sure if I'll have time this weekend, but probably within the next 2 weeks

thanks - will keep an eye open, curious how this goes, good research IMO

I'm probably not going to be able to get to updating this until Labor Day at the earliest.
One direction I could see this going is to use Zoekt (the tech SourceGraph developed) to keep the codebase in memory in a very searchable format. I don't know if a Rust implementation exists, so there'd probably be some additional effort involved in porting the Go to Rust.
See: https://github.com/sourcegraph/zoekt/tree/main

4cd65e9 to df33b72

I found some time late last night and this morning to CoPilot thru the request to add this to the UI as well |
@powershell.exe -Command "if (Test-Path ./target/x86_64-pc-windows-gnu/release/goosed.exe) { \
Write-Host 'Copying Windows binary and DLLs to ui/desktop/src/bin...'; \
Copy-Item -Path './target/x86_64-pc-windows-gnu/release/goosed.exe' -Destination './ui/desktop/src/bin/' -Force; \
if (Test-Path ./target/x86_64-pc-windows-gnu/release/goose.exe) { \
not sure if this is related to this change?
Oops, missed CoPilot going a little crazy there.

@dcieslak19973 ok this looks like a good start, took 1s to index goose, I tried a very very very large mono repo and it took about a minute and 300M or so, which is acceptable. What would be next for consuming this index for a codesearch tool to expose it via an appropriate MCP? Something like Zoekt but perhaps Rust native? Any ideas?
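As one possible starting point for consuming the index, here is a small sketch of fuzzy symbol lookup over entity names using case-insensitive subsequence matching. The function names and the scoring rule (fewer skipped characters is better) are assumptions for illustration; this is not Zoekt's algorithm and not code from this PR.

```rust
/// Score how well `query` matches `candidate` as a case-insensitive subsequence.
/// Returns None when the query characters do not all appear in order.
/// Lower is better: the score counts candidate characters skipped before
/// the last query character was matched.
pub fn fuzzy_score(query: &str, candidate: &str) -> Option<usize> {
    let mut wanted = query.chars().flat_map(char::to_lowercase).peekable();
    let mut skipped = 0;
    for c in candidate.chars().flat_map(char::to_lowercase) {
        match wanted.peek() {
            Some(&w) if w == c => {
                wanted.next();
            }
            Some(_) => skipped += 1,
            None => break, // whole query already matched
        }
    }
    if wanted.peek().is_none() {
        Some(skipped)
    } else {
        None
    }
}

/// Return the names that match `query`, best matches first.
/// Ties on score are broken by preferring shorter names.
pub fn search<'a>(query: &str, names: &[&'a str]) -> Vec<&'a str> {
    let mut hits: Vec<(usize, &'a str)> = names
        .iter()
        .filter_map(|&name| fuzzy_score(query, name).map(|score| (score, name)))
        .collect();
    hits.sort_by_key(|&(score, name)| (score, name.len()));
    hits.into_iter().map(|(_, name)| name).collect()
}
```

A tool exposed over MCP could load the names from the JSONL index once and call something like `search` per request; a linear scan of this kind is only a baseline and would likely need a real index for very large repositories.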
Admittedly, I don't have a great plan for what's next. Some things, which are probably not mutually exclusive:

d33acc8 to 730ade7

@michaelneale - I think this is closer to a better form. There's probably still some stuff to clean up in the cli (maybe add a slash command) and the UI |
1042cdc to 28228d8
…er config read Consolidated changes: safer window config read, binary path checks, completion logs and notifications for repo-index; added renderer IPC event. Signed-off-by: dcieslak19973 <[email protected]>
@@ -0,0 +1,241 @@
# Repository Search with Tree-sitter Indexing (Experimental)
missing frontmatter:
# Repository Search with Tree-sitter Indexing (Experimental)
---
title: Repository Search with Tree-sitter Indexing
sidebar_label: Repository Search with Tree-sitter Indexing
sidebar_position: 5
---
@@ -0,0 +1,241 @@
# Repository Search with Tree-sitter Indexing (Experimental)

> Experimental repository indexing: graph + PageRank + blended fuzzy symbol search across multiple languages.
> Experimental repository indexing: graph + PageRank + blended fuzzy symbol search across multiple languages.
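For readers unfamiliar with the "graph + PageRank" part of that description, here is a minimal sketch of PageRank by power iteration over a symbol-reference graph. The adjacency-list input, the function name, and the handling of nodes without outgoing edges are assumptions for illustration and are not taken from the PR's implementation.

```rust
/// PageRank by power iteration.
/// `edges[i]` lists the nodes that node `i` references (for example, calls).
/// Every target index must be smaller than `edges.len()`.
pub fn pagerank(edges: &[Vec<usize>], damping: f64, iterations: usize) -> Vec<f64> {
    let n = edges.len();
    if n == 0 {
        return Vec::new();
    }
    // Start from a uniform distribution.
    let mut rank = vec![1.0 / n as f64; n];
    for _ in 0..iterations {
        // Every node receives the same "teleport" share.
        let mut next = vec![(1.0 - damping) / n as f64; n];
        // Rank held by nodes with no outgoing edges is spread evenly.
        let mut dangling = 0.0;
        for (src, targets) in edges.iter().enumerate() {
            if targets.is_empty() {
                dangling += rank[src];
            } else {
                let share = damping * rank[src] / targets.len() as f64;
                for &dst in targets {
                    next[dst] += share;
                }
            }
        }
        let dangling_share = damping * dangling / n as f64;
        for r in next.iter_mut() {
            *r += dangling_share;
        }
        rank = next;
    }
    rank
}
```

With a damping factor such as 0.85, symbols referenced from many places end up with higher rank, which is one way search results could be weighted toward central code.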
Hi! Thanks for the comprehensive documentation. However, I think this needs to be restructured for our user-facing docs site. The current version reads more like internal technical documentation than a user guide.
Key issues:
- Missing the "why": Users won't understand what problem this solves or when they'd use it
- Too implementation-focused: Details about PageRank algorithms, data models, and Tree-sitter versions aren't relevant to users
- No clear use cases: When would someone run `goose repo index`? What does it enable?
Suggested restructure:
- Start with the problem: What does this solve? (e.g., "Help Goose better understand your codebase structure")
- Simple use case: "Run this when you want Goose to have deeper knowledge of your project"
- Basic usage: Simple example with expected output
- Benefits: What can Goose do better after indexing?
- Supported languages
Move to separate docs (if necessary):
- Implementation details
- Technical architecture
- Performance tuning
- Contributing guidelines
Could you revise this to focus on the user experience?
Thanks for this feedback. I've split the documentation into a user-oriented page and a more technical, implementation-focused one.
…ove manual index cmd; add observability docs & events; add background_index module; gate UI menu by ALPHA_FEATURES

@dcieslak19973 we also recently removed the lancedb vector store as a required goose dependency so that might be causing some of the conflicts here. BTW I am working on an MCP that will use it which is external to goose, which could possibly be complemented by tree sitter as well - would you consider collaborating there?

Could be. I also wasn't using devcontainers which may be part of it as well (not to mention my inexperience with Rust and TypeScript and the goose codebase). While I'm somewhat sympathetic to the idea of making this an MCP in terms of modularity, etc., I do feel strongly that this should be part of the core functionality of goose. For reference, OpenCode (the sst version) includes tools like
And I do believe that this functionality should represent a huge improvement over simple greps. Ideally, we'd run the goose coding benchmarks with this functionality enabled and disabled and get a comparison (and, I could be proven wrong)

@dcieslak19973 yeah I think I agree with you - I have an MCP I am working on that uses lancedb, which makes sense as an MCP as it is search, but this seems more core to the developer experience. What I am missing is how we search this index, how we surface it? The built-in developer MCP is one way, or we can easily have an additional built-in one to turn on (which goose will know to turn on if needed). I did test this on a very very large internal mono repo and it is workable, just a little heavy. What I think we should do is re-do this but in the goose-mcp crate, not the cli, which would make it less conflict-heavy. WDYT? If you like, can shift to a branch, and I can get you access to push to that and collaborate there?

Sounds good. Apologies if the work is a bit rough as I've not done much GitHub collabs (nor Rust for that matter), but I did sleep at a Holiday Inn Express once and I have access to some LLMs

@dcieslak19973 "did sleep at a Holiday Inn Express once and I have access to some LLMs" - from my experience staying at holiday inns I was never on a holiday. Seemed very much like a work trip to me. Yes I am keeping this open as we need to open a new body of work with tree sitter - and looking at things like "zoekt" for fast code search so we can quickly have tools that find code, index code and also find symbols and relationships. Still not sure about embeddings and vector DBs as I haven't had much luck with them scaling to larger code bases (but I know it must be possible).

thanks - we will continue the work over here: #4530 in the developer MCP - with tree sitter
Addresses #3382
Summary of Changes
- `goose repo index` for indexing repositories and generating a `.goose-repo-index.jsonl` file with extracted code entities (classes, functions, relationships, docstrings, etc.).
- `repo.rs` implements the indexer.
- `tree-sitter-indexing.md` details supported languages, extracted entities, and upgrade guidance.
- `Cargo.toml` / `Cargo.lock`: `tree-sitter` and language grammars (see above); `walkdir` and `ignore` for efficient file traversal.
- `repo-index.jsonl` and language-specific example source files for testing and demonstration.
Impact