Codebase Indexer

A production-ready tool for indexing and searching GitHub repositories using AST-based chunking, vector embeddings, and semantic search. It creates vector embeddings of code chunks and stores them in FAISS vector indexes, which enables natural-language code search and retrieval.

Overview

Codebase Indexer is a standalone tool designed for developers and teams who need to efficiently search and understand large codebases. It uses tree-sitter for AST-based code parsing and supports incremental updates through Merkle tree-based change detection, making it ideal for maintaining up-to-date indexes of rapidly evolving projects.

Key Features

Multi-language Support: Supports 25+ programming languages including Python, Java, JavaScript, TypeScript, C/C++, Go, Rust, Ruby, Bash, and more
AST-based Code Chunking: Intelligent code chunking that respects language syntax and structure using tree-sitter
Incremental Updates: Only re-indexes changed files using Merkle tree-based change detection for efficient operations
Semantic Search: Find relevant code snippets using natural language queries powered by vector embeddings
Efficient Storage: FAISS vector databases optimized for fast similarity search at scale
Multi-repository Management: Index and search across multiple repositories with isolated namespaces
Rich CLI Interface: User-friendly command-line interface with progress tracking and detailed feedback
Rate Limiting: Built-in rate limiting for API calls to ensure reliable operation
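
To illustrate the incremental-update idea, here is a minimal sketch of Merkle-style change detection. It is not the code in merkle_tree.py; the function names (snapshot, root_of, changed_files) and the leaf format are assumptions made for this example. Comparing root hashes tells you whether anything changed at all, and comparing per-file hashes tells you which files need re-indexing.

```python
import hashlib
from typing import Dict, List, Tuple


def snapshot(files: Dict[str, bytes]) -> Dict[str, str]:
    """Map each path to the SHA-256 of its content (hypothetical helper)."""
    return {path: hashlib.sha256(data).hexdigest() for path, data in files.items()}


def root_of(snap: Dict[str, str]) -> str:
    """Fold the per-file hashes into a single Merkle root."""
    # Leaves are built from sorted paths so the root does not depend on dict order.
    level = [
        hashlib.sha256(f"{path}\0{digest}".encode()).hexdigest()
        for path, digest in sorted(snap.items())
    ]
    if not level:
        return hashlib.sha256(b"").hexdigest()
    while len(level) > 1:
        if len(level) % 2 == 1:
            level.append(level[-1])  # duplicate the last node on odd-sized levels
        level = [
            hashlib.sha256((level[i] + level[i + 1]).encode()).hexdigest()
            for i in range(0, len(level), 2)
        ]
    return level[0]


def changed_files(old: Dict[str, str], new: Dict[str, str]) -> Tuple[List[str], List[str]]:
    """Return (added or modified paths, removed paths), both sorted."""
    to_index = sorted(p for p, h in new.items() if old.get(p) != h)
    removed = sorted(p for p in old if p not in new)
    return to_index, removed
```

If the two roots match, the whole comparison can be skipped; otherwise only the paths returned by changed_files need to be re-chunked and re-embedded.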

Installation

Prerequisites

Python 3.8 or higher
GitHub Personal Access Token
2GB+ RAM recommended for large repositories

Setup

Clone the repository:

git clone https://github.com/yourusername/CodebaseIndexer.git
cd CodebaseIndexer

Install dependencies:

pip install -r requirements.txt

Configure environment variables by creating a .env file:

GITHUB_TOKEN=your_github_personal_access_token
VECTOR_STORE_DIR=./VectorStore  # Optional: defaults to ./VectorStore
MERKLE_DB_PATH=./merkle_trees.db  # Optional: defaults to ./merkle_trees.db

To obtain a GitHub Personal Access Token, visit GitHub Settings > Developer Settings > Personal Access Tokens.
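
As a rough sketch of how these three variables might be read, the snippet below applies the documented defaults and fails early when the token is missing. The Settings class and load_settings function are hypothetical names, not part of this project, and the sketch assumes the variables are already present in the process environment (it does not parse the .env file itself).

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    github_token: str
    vector_store_dir: str
    merkle_db_path: str


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Read settings from a mapping (os.environ by default)."""
    env = os.environ if env is None else env
    token = env.get("GITHUB_TOKEN", "").strip()
    if not token:
        raise ValueError("GITHUB_TOKEN is required")
    return Settings(
        github_token=token,
        # Defaults mirror the ones documented above.
        vector_store_dir=env.get("VECTOR_STORE_DIR", "./VectorStore"),
        merkle_db_path=env.get("MERKLE_DB_PATH", "./merkle_trees.db"),
    )
```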

Usage

Indexing a Repository

Index a new repository or update an existing one:

python main.py --repo owner/repo

You can also use the full GitHub URL:

python main.py --repo https://github.com/owner/repo
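
Since both forms are accepted, the tool presumably normalizes them to a single identifier. The following is one possible way to do that, not the actual parsing in main.py; normalize_repo is a hypothetical helper, and accepting a trailing .git or slash is an assumption.

```python
import re

# Optional GitHub URL prefix, then owner/repo, then an optional ".git" or "/".
_REPO_RE = re.compile(
    r"^(?:https?://github\.com/)?"
    r"([A-Za-z0-9_.-]+)/([A-Za-z0-9_.-]+?)"
    r"(?:\.git)?/?$"
)


def normalize_repo(value: str) -> str:
    """Return 'owner/repo' for either the short form or a full GitHub URL."""
    match = _REPO_RE.match(value.strip())
    if match is None:
        raise ValueError(f"not a GitHub repository reference: {value!r}")
    return f"{match.group(1)}/{match.group(2)}"
```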

Exclude specific directories from indexing:

python main.py --repo owner/repo --exclude node_modules,dist,build,.git
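
A minimal sketch of how a comma-separated exclude list could be applied to file paths. It assumes that an excluded name matches a directory at any depth in the path; the README does not state the exact matching rule, and parse_excludes and is_excluded are hypothetical names.

```python
from pathlib import PurePosixPath
from typing import Set


def parse_excludes(arg: str) -> Set[str]:
    """Split 'a,b,c' into a set of directory names, ignoring empty entries."""
    return {name.strip() for name in arg.split(",") if name.strip()}


def is_excluded(path: str, excluded: Set[str]) -> bool:
    """True if any directory component of the path is in the excluded set."""
    # parts[:-1] drops the file name so only directories are compared.
    return any(part in excluded for part in PurePosixPath(path).parts[:-1])
```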

Searching a Repository

Search for code snippets using natural language queries:

python main.py --repo owner/repo --search "function to parse JSON files"

Search results include:

File path and line numbers
Similarity score
Chunk type (function, class, etc.) and language
Preview of the code snippet
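
The similarity score comes from comparing the query embedding with each stored chunk embedding. The sketch below shows that ranking step with a brute-force, pure-Python cosine similarity in place of FAISS, using tiny made-up vectors rather than real embeddings. The function names and the (chunk_id, vector) layout are assumptions for this example.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(
    query: Sequence[float],
    chunks: Sequence[Tuple[str, Sequence[float]]],
    k: int = 3,
) -> List[Tuple[float, str]]:
    """Score every chunk against the query and return the k best matches."""
    scored = [(cosine(query, vector), chunk_id) for chunk_id, vector in chunks]
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:k]
```

A vector index such as FAISS serves the same purpose but avoids scanning every vector in pure Python, which matters once an index holds many thousands of chunks.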

Deleting an Index

Remove a repository index to free up disk space:

python main.py --repo owner/repo --delete

Verbose Logging

Enable detailed logging for debugging or monitoring:

python main.py --repo owner/repo --verbose

Code Chunking Strategy

The system employs a multi-tiered chunking approach to handle diverse file types:

AST-based Chunking (Primary): Uses tree-sitter to parse code into syntax trees and extract semantic units like functions, classes, and methods. This preserves code context and structure.
Recursive Text Splitting (Fallback): For languages without tree-sitter support or when AST parsing fails, uses language-aware recursive splitting with configurable separators.
Simple Chunking (Final Fallback): For markdown, text files, and other non-code content, splits by paragraphs and sentences.

All chunks respect the configured token limits (default: 1000 tokens per chunk with 200 token overlap) to ensure optimal embedding quality.
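
To make the token limit and overlap concrete, here is a small sketch of a sliding-window split over an already tokenized sequence. It is only an illustration of the 1000/200 defaults, not the chunker in indexer.py, and split_with_overlap is a hypothetical name. How the text is tokenized is left out; any list of tokens works.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def split_with_overlap(
    tokens: Sequence[T], max_tokens: int = 1000, overlap: int = 200
) -> List[Sequence[T]]:
    """Split tokens into windows of max_tokens that share `overlap` tokens."""
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must be >= 0 and smaller than max_tokens")
    step = max_tokens - overlap  # each new window starts this far after the last
    chunks: List[Sequence[T]] = []
    start = 0
    while start < len(tokens):
        chunks.append(tokens[start:start + max_tokens])
        if start + max_tokens >= len(tokens):
            break  # the current window already reaches the end
        start += step
    return chunks
```

With the defaults, each window begins 800 tokens after the previous one, so the last 200 tokens of one chunk reappear at the start of the next.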

Supported Languages

The indexer supports the following languages with tree-sitter AST parsing:

Compiled Languages: C, C++, Java, Go, Rust
Scripting Languages: Python, JavaScript, TypeScript, Ruby, Bash, Perl, PHP, Groovy
Build Systems: Make, Gradle
Markup/Data: HTML, XML, YAML, JSON, TOML, INI, SQL
Documentation: Markdown, Plain Text

Configuration

Default settings can be customized via environment variables:

Embedding Model: openai/text-embedding-3-large (via GitHub Models)
Max Tokens per Chunk: 1000 tokens
Overlap Tokens: 200 tokens
Max File Size: 1MB
Vector Similarity: Cosine similarity
Rate Limits: 15 requests/minute, 5 concurrent, 64000 tokens/request
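
The 15 requests/minute limit can be pictured as a sliding window over recent call times. The class below is a sketch of that idea only: it is not the limiter used by this project, SlidingWindowLimiter is a hypothetical name, and the concurrency cap and per-request token limit are not modeled. The clock is injectable so the behavior can be checked without sleeping.

```python
import time
from collections import deque
from typing import Callable, Deque


class SlidingWindowLimiter:
    """Allow at most max_calls acquisitions within any `period` seconds."""

    def __init__(
        self,
        max_calls: int = 15,
        period: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._max_calls = max_calls
        self._period = period
        self._clock = clock
        self._calls: Deque[float] = deque()  # timestamps of recent calls

    def wait_time(self) -> float:
        """Seconds until the next call is allowed (0.0 if allowed now)."""
        now = self._clock()
        # Drop timestamps that have left the window.
        while self._calls and now - self._calls[0] >= self._period:
            self._calls.popleft()
        if len(self._calls) < self._max_calls:
            return 0.0
        return self._period - (now - self._calls[0])

    def try_acquire(self) -> bool:
        """Record a call if the window has room; return whether it was allowed."""
        if self.wait_time() > 0.0:
            return False
        self._calls.append(self._clock())
        return True
```

A caller that must not drop requests could sleep for wait_time() seconds and then retry try_acquire().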

Contributing

Contributions are welcome. Please ensure your code follows the existing style and includes appropriate tests.

License

This project is licensed under the Apache-2.0 License. See the LICENSE file for details.

Acknowledgments

The codebase indexing approach is inspired by techniques described in How Cursor Indexes Codebases Fast, which provides insights into efficient codebase indexing for AI-powered development tools.

Repository: anirudhsengar/CodebaseIndexer
