"I like them big, I like them CONTRIBUTING" ~ Moto Moto, probably
Welcome fellow CHONKer! We're excited that you want to contribute to Chonkie. Whether you're fixing bugs, adding features, or improving documentation, every contribution makes Chonkie a better library for everyone.
- Check the issues: Look for existing issues or open a new one to start a discussion.
- Read the docs: Familiarize yourself with Chonkie's docs and core concepts.
- Set up your environment: Follow our development setup guide below.
- Fork and clone the repository:
git clone https://github.com/your-username/chonkie.git
cd chonkie
- Create a virtual environment and install dependencies:
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install with dev dependencies
pip install -e ".[dev]"
# If working on semantic features, also install semantic dependencies
pip install -e ".[dev,semantic]"
# For all features
pip install -e ".[dev,all]"
We use pytest for testing. Our tests are configured via pyproject.toml
. Before submitting a PR, make sure all tests pass:
# Run all tests
pytest
# Run specific test file
pytest tests/test_token_chunker.py
# Run tests with coverage
pytest --cov=chonkie
Chonkie uses ruff for code formatting and linting. Our configuration in pyproject.toml
enforces:
- Code formatting (
F
) - Import sorting (
I
) - Documentation style (
D
) - Docstring coverage (
DOC
)
# Run ruff
ruff check .
# Run ruff with auto-fix
ruff check --fix .
We use Google-style docstrings. Example:
def chunk_text(text: str, chunk_size: int = 512) -> List[str]:
"""Split text into chunks of specified size.
Args:
text: Input text to chunk
chunk_size: Maximum size of each chunk
Returns:
List of text chunks
Raises:
ValueError: If chunk_size <= 0
"""
pass
-
Branch Naming: Use descriptive branch names:
feature/description
for new featuresfix/description
for bug fixesdocs/description
for documentation changes
-
Commit Messages: Write clear commit messages:
feat: add batch processing to WordChunker
- Implement batch_process method
- Add tests for batch processing
- Update documentation
- Dependencies: If adding new dependencies:
- Core dependencies go in
project.dependencies
- Optional features go in
project.optional-dependencies
- Development tools go in the
dev
optional dependency group
- Core dependencies go in
Chonkie's package structure is:
src/
βββ chonkie/
βββ chunker/ # Chunking implementations
βββ embeddings/ # Embedding implementations
βββ refinery/ # Refinement utilities
Look for issues labeled good-first-issue
. These are great starting points for new contributors.
- Improve existing docs
- Add examples
- Fix typos
- Add tutorials
- Implement new chunking strategies
- Optimize existing chunkers
- Add new tokenizer support
- Improve test coverage
- Profile and optimize code
- Add benchmarks
- Improve memory usage
- Enhance batch processing
- Add new features to the library
- Add new optional dependencies
- Look for [FEAT] labels in issues, especially by Chonkie Maintainers
Current development dependencies are (as of January 1, 2025):
[project.optional-dependencies]
dev = [
"pytest>=6.2.0",
"datasets>=1.14.0",
"transformers>=4.0.0",
"ruff>=0.0.265"
]
Additional optional dependencies:
model2vec
: For model2vec embeddingsst
: For sentence-transformersopenai
: For OpenAI embeddingssemantic
: For semantic featuresall
: All optional dependencies
- All PRs need at least one review
- Maintainers will review for:
- Code quality (via ruff)
- Test coverage
- Performance impact
- Documentation completeness
- Adherence to principles
- Questions? Open an issue or ask in Discord
- Bugs? Open an issue or report in Discord
- Chat? Join our Discord!
- Email? Contact [email protected]
Every contribution helps make Chonkie better! We appreciate your time and effort in helping make Chonkie the CHONKiest it can be!
Remember:
"A journey of a thousand CHONKs begins with a single commit" ~ Ancient Proverb, probably