feat: add self-query functionality #163
base: main
Pull Request Overview
This PR adds self-query functionality to the RAGLite library, enabling automatic extraction of metadata filters from natural language queries to improve search precision.
- Implements `_self_query` function that uses LLM to extract metadata filters from queries
- Adds metadata tracking in the database with a new `Metadata` table
- Integrates self-query capability into the retrieval pipeline with a configurable flag
Reviewed Changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| src/raglite/_config.py | Adds self_query boolean flag to RAGLiteConfig |
| src/raglite/_database.py | Defines new Metadata table for tracking available metadata values |
| src/raglite/_insert.py | Implements metadata aggregation and database updates during document insertion |
| src/raglite/_rag.py | Adds core self-query functionality and integrates it into retrieval pipeline |
| tests/test_insert.py | Tests metadata tracking functionality |
| tests/test_rag.py | Tests self-query extraction and retrieval integration |
Could you add a PR description? You can edit Robbe's first comment to add it. @jirastorza
Co-authored-by: Copilot <[email protected]>
Some small comments, but one big topic. I propose to have a sync, when you have the time, to align.
…a storage, simplified self_query and insert logic.
LGTM
We will probably need a more difficult dataset to benchmark this and see a gain in performance. I think we are already finding all chunks from the right document, so it is just a matter of ordering them, which the filter won't help with. So the result is not so surprising, in my opinion. Also, could you make sure the PR description is up to date? For example, did you incorporate the change that ensures all metadata used for filtering has its values stored as a list? If so, it should be mentioned in the description. @jirastorza
Self-Query: Automatic Metadata Filter Extraction
This pull request introduces a robust self-query feature to RAGLite, enabling automatic extraction of metadata filters from natural language queries using an LLM. This enhancement allows users to search more intuitively without manually specifying metadata filters, while also introducing a comprehensive metadata management system.
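To make the idea concrete, here is a minimal, self-contained sketch of how a catalog of metadata values could be used to ground an LLM-proposed filter. It does not use RAGLite's actual code: `build_catalog`, `ground_filter`, and the sample metadata are hypothetical names and data chosen for illustration, and the LLM call is replaced by a hard-coded dictionary.

```python
from collections import defaultdict
from typing import Any


def build_catalog(documents: list[dict[str, Any]]) -> dict[str, set[str]]:
    """Collect every metadata field and the unique values seen for it (hypothetical helper)."""
    catalog: dict[str, set[str]] = defaultdict(set)
    for metadata in documents:
        for field, value in metadata.items():
            # Normalize scalars to lists so every field is handled uniformly.
            values = value if isinstance(value, list) else [value]
            catalog[field].update(str(v) for v in values)
    return dict(catalog)


def ground_filter(
    proposed: dict[str, Any], catalog: dict[str, set[str]]
) -> dict[str, list[str]]:
    """Keep only the field/value pairs that actually exist in the catalog (hypothetical helper)."""
    grounded: dict[str, list[str]] = {}
    for field, value in proposed.items():
        values = value if isinstance(value, list) else [value]
        allowed = catalog.get(field, set())
        valid = [str(v) for v in values if str(v) in allowed]
        if valid:
            grounded[field] = valid
    return grounded


# Illustrative metadata for two documents (not taken from the PR).
catalog = build_catalog(
    [
        {"author": "Ada", "topic": ["contracts", "licensing"]},
        {"author": "Grace", "topic": "contracts"},
    ]
)

# Stand-in for the filter an LLM might propose for the query
# "What did Ada write about licensing in 2020?"
proposed = {"author": "Ada", "topic": ["licensing", "tax"], "year": 2020}
grounded = ground_filter(proposed, catalog)
print(grounded)
```

In this sketch, the unknown `year` field and the unseen `tax` value are dropped, so only filters backed by stored metadata would reach the search step. How RAGLite itself validates or applies the extracted filters may differ.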
Key Features
🔍 Self-Query Functionality
- `Metadata` table provides the LLM with available metadata fields and their possible values, ensuring generated filters are valid and grounded
- … `vector_search` and `keyword_search` methods
- … `RAGLiteConfig(self_query=True)` (disabled by default)
📊 Metadata Management System
- … `_adapt_metadata` utility
- `Metadata` table tracks all metadata fields and their allowed unique values, providing a catalog of available filters for self-query
Performance Benchmarks
Dataset: CUAD (Contract Understanding Atticus Dataset)
Settings: Default RAGLite benchmarking configuration
Usage Example