feat: add self-query functionality #163

Robbe-Superlinear · 2025-10-13T11:52:02Z

Self-Query: Automatic Metadata Filter Extraction

This pull request adds a self-query feature to RAGLite that uses an LLM to extract metadata filters from natural language queries, so users can search without manually specifying filters. It also adds a Metadata table that catalogs the metadata fields and values available for filtering.

Key Features

🔍 Self-Query Functionality

Automatic metadata extraction: Extracts metadata filters (e.g., manufacturer, year, type) directly from natural language queries
Context-aware filtering: The Metadata table provides the LLM with available metadata fields and their possible values, ensuring generated filters are valid and grounded
Integration: Works with both vector_search and keyword_search methods
Configurable: Enable via RAGLiteConfig(self_query=True) (disabled by default)
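
The "context-aware filtering" idea above can be illustrated with a minimal sketch. It assumes the catalog is a plain dictionary from field name to known values and that the LLM returns a candidate filter with list values. The names `catalog` and `ground_filter` are hypothetical and are not part of RAGLite's API; the actual implementation in this PR may differ.

```python
from typing import Any

# Hypothetical catalog: metadata field -> values seen at insertion time.
catalog: dict[str, list[Any]] = {
    "manufacturer": ["Audi", "Honda", "Chevrolet"],
    "year": [2015, 2022, 2023],
    "type": ["electric", "sedan", "truck"],
}


def ground_filter(
    candidate: dict[str, list[Any]], catalog: dict[str, list[Any]]
) -> dict[str, list[Any]]:
    """Keep only the fields and values that exist in the catalog."""
    grounded: dict[str, list[Any]] = {}
    for field, values in candidate.items():
        allowed = catalog.get(field)
        if allowed is None:
            continue  # Unknown field: drop it rather than filter on it.
        kept = [value for value in values if value in allowed]
        if kept:
            grounded[field] = kept
    return grounded


# A candidate filter as an LLM might propose it (illustrative only).
candidate = {"manufacturer": ["Audi", "Tesla"], "color": ["red"]}
print(ground_filter(candidate, catalog))  # {'manufacturer': ['Audi']}
```

Dropping unknown fields and values, rather than raising an error, means a hallucinated filter degrades to a broader search instead of an empty result.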

📊 Metadata Management System

Normalized storage: All document and chunk metadata values are now stored as lists via the _adapt_metadata utility
Metadata tracking: New Metadata table tracks all metadata fields and their allowed unique values, providing a catalog of available filters for self-query
Automatic aggregation: The Metadata table is updated during document insertion
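
As a rough illustration of the normalization and aggregation described above, the sketch below wraps scalar metadata values in lists and collects the unique values per field. The helpers `normalize_metadata` and `aggregate_catalog` are hypothetical stand-ins; they are not the PR's `_adapt_metadata` utility or its database logic, which may behave differently.

```python
from typing import Any


def normalize_metadata(metadata: dict[str, Any]) -> dict[str, list[Any]]:
    """Store every value as a list: scalars are wrapped, lists/tuples are copied."""
    return {
        key: list(value) if isinstance(value, (list, tuple)) else [value]
        for key, value in metadata.items()
    }


def aggregate_catalog(all_metadata: list[dict[str, Any]]) -> dict[str, list[Any]]:
    """Collect the unique values per field, preserving first-seen order."""
    catalog: dict[str, list[Any]] = {}
    for metadata in all_metadata:
        for field, values in normalize_metadata(metadata).items():
            known = catalog.setdefault(field, [])
            for value in values:
                if value not in known:
                    known.append(value)
    return catalog


docs_metadata = [
    {"manufacturer": "Audi", "year": 2022, "type": "electric"},
    {"manufacturer": "Honda", "year": 2023, "type": ["sedan", "hatchback"]},
    {"manufacturer": "Audi", "year": 2015, "type": "sedan"},
]
print(aggregate_catalog(docs_metadata)["manufacturer"])  # ['Audi', 'Honda']
```

Storing every value as a list lets single-valued and multi-valued fields share one matching rule at query time.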

Performance Benchmarks

Dataset: CUAD (Contract Understanding Atticus Dataset)
Settings: Default RAGLite benchmarking configuration

Configuration	MAP	MRR	Answers Found	Chunks Retrieved	Avg Chunks/Query	Chunks Retrieved vs main
main	0.6386	0.6386	89.19%	4,440	40.0	—
self-query (LLM)	0.6254	0.6254	88.29%	2,941	26.5	-33.8%
self-query (regex)	0.6230	0.6230	88.29%	2,941	26.5	-33.8%
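
The -33.8% figure in the last column matches the relative change in chunks retrieved, which the short check below reproduces from the table's own numbers. The query count is not stated in the table; it is inferred here from main's total and per-query average, so treat it as an assumption.

```python
# Values copied from the benchmark table above.
chunks_main = 4440
chunks_self_query = 2941
avg_per_query_main = 40.0

# Inferred, not stated in the table: number of benchmark queries.
num_queries = chunks_main / avg_per_query_main

# Relative change in retrieved chunks compared to main.
relative_change = (chunks_self_query - chunks_main) / chunks_main
avg_per_query_self_query = chunks_self_query / num_queries

print(f"{relative_change:.1%}")           # -33.8%
print(f"{avg_per_query_self_query:.1f}")  # 26.5
```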

Usage Example

from raglite import Document, RAGLiteConfig, insert_documents, rag
from raglite._search import _self_query
# Configure with self-query enabled
my_config = RAGLiteConfig(
    db_url="duckdb:///raglite.db",
    llm="gpt-4.1-nano",
    embedder="text-embedding-3-small",
    self_query=True,  # Enable automatic metadata extraction
)
# Insert documents with metadata
car_docs = [
    Document.from_text(
        "# Audi e-tron\nThe Audi e-tron is a fully electric mid-size luxury crossover SUV.",
        manufacturer="Audi",
        year=2022,
        type="electric",
    ),
    Document.from_text(
        "# Honda Civic\nThe Honda Civic is a line of cars manufactured by Honda since 1972.",
        manufacturer="Honda",
        year=2023,
        type="sedan",
    ),
    Document.from_text(
        "# Chevrolet Silverado\nThe Chevrolet Silverado is a range of trucks by General Motors.",
        manufacturer="Chevrolet",
        year=2015,
        type="truck",
    ),
]
insert_documents(car_docs, config=my_config)
# Query naturally - metadata filters extracted automatically
query = "What car does Audi offer?"
metadata_filter = _self_query(query, config=my_config)
print(metadata_filter)  # {'manufacturer': ['Audi']}
# Use in RAG pipeline
messages = [{"role": "user", "content": query}]
chunk_spans = []
stream = rag(messages, on_retrieval=lambda x: chunk_spans.extend(x), config=my_config)
for update in stream:
    print(update, end="")

Copilot

Pull Request Overview

This PR adds self-query functionality to the RAGLite library, enabling automatic extraction of metadata filters from natural language queries to improve search precision.

Implements _self_query function that uses LLM to extract metadata filters from queries
Adds metadata tracking in the database with a new Metadata table
Integrates self-query capability into the retrieval pipeline with a configurable flag

Reviewed Changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.

File	Description
src/raglite/_config.py	Adds `self_query` boolean flag to RAGLiteConfig
src/raglite/_database.py	Defines new Metadata table for tracking available metadata values
src/raglite/_insert.py	Implements metadata aggregation and database updates during document insertion
src/raglite/_rag.py	Adds core self-query functionality and integrates it into retrieval pipeline
tests/test_insert.py	Tests metadata tracking functionality
tests/test_rag.py	Tests self-query extraction and retrieval integration

emilradix · 2025-10-15T12:00:56Z

Could you add a PR description? You can edit Robbe's first comment to add it. @jirastorza

Co-authored-by: Copilot <[email protected]>

Robbe-Superlinear

Some small comments, plus one bigger topic. I propose we have a sync to align on it when you have the time.

Robbe-Superlinear

LGTM

emilradix · 2025-10-22T08:55:01Z

We will probably need a more difficult dataset to benchmark this and see a gain in performance. I think we are already finding all the chunks from the right document, so it is just a matter of ordering them, which the filter won't help with. So the result is not so surprising, in my opinion.

Also, could you make sure the PR description is up to date? For example, did you incorporate the change that ensures all metadata used for filtering has its values stored as a list? If so, it should be mentioned in the description. @jirastorza
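
The benchmark observation in this comment can be made concrete with a toy example: if every retrieved chunk already comes from the right document, a document-level metadata filter removes nothing and the ranking metric cannot change. The chunk ids, document names, and helper below are made up for illustration and are not taken from the CUAD benchmark.

```python
def reciprocal_rank(ranked_chunks: list[dict[str, str]], relevant_id: str) -> float:
    """Return 1/rank of the first relevant chunk, or 0.0 if it is absent."""
    for rank, chunk in enumerate(ranked_chunks, start=1):
        if chunk["id"] == relevant_id:
            return 1.0 / rank
    return 0.0


def apply_filter(ranked_chunks: list[dict[str, str]], doc: str) -> list[dict[str, str]]:
    """Keep only the chunks from the given document, preserving order."""
    return [chunk for chunk in ranked_chunks if chunk["doc"] == doc]


# Case 1: all retrieved chunks are from the right document already.
same_doc = [
    {"id": "c3", "doc": "contract_A"},
    {"id": "c1", "doc": "contract_A"},
    {"id": "c7", "doc": "contract_A"},
]
print(
    reciprocal_rank(same_doc, "c1"),
    reciprocal_rank(apply_filter(same_doc, "contract_A"), "c1"),
)  # 0.5 0.5

# Case 2: a chunk from another document outranks the relevant one.
mixed = [
    {"id": "c9", "doc": "contract_B"},
    {"id": "c3", "doc": "contract_A"},
    {"id": "c1", "doc": "contract_A"},
]
improved = reciprocal_rank(apply_filter(mixed, "contract_A"), "c1") > reciprocal_rank(mixed, "c1")
print(improved)  # True
```

Only the second case, where chunks from other documents compete for the top ranks, gives the filter room to improve MRR, which is why a harder dataset is suggested.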

jirastorza added 3 commits October 9, 2025 14:16

feat: add self-query functionality

11a3850

fix: modified self_query_prompt

f954bca

fix: modified self_query_prompt

f0e66da

Robbe-Superlinear commented Oct 13, 2025

Robbe-Superlinear assigned Robbe-Superlinear and jirastorza Oct 13, 2025

jirastorza added 2 commits October 14, 2025 07:17

fix: code simplification

2e6c436

fix: test rag

3507ad5

jirastorza marked this pull request as ready for review October 14, 2025 13:01

jirastorza marked this pull request as draft October 14, 2025 13:03

fix: add self_query option to config and update tool calling logic.

238d3a1

jirastorza marked this pull request as ready for review October 14, 2025 13:56

emilradix requested a review from Copilot October 15, 2025 10:46

Copilot AI reviewed Oct 15, 2025

jirastorza and others added 8 commits October 15, 2025 14:02

fix: corret logger

b8055da

Co-authored-by: Copilot <[email protected]>

fix: linting

9e32790

fix: simplify rag test.

c8e4fa9

fix: remove repetitive self_query call.

ff97cd2

fix: move self_query to _search.py

e12ed5b

fix: modify test structure.

752ea2b

fix: allow list metadata values.

b0b46a6

fix: allow list type metadata handling.

5d575e9

Robbe-Superlinear commented Oct 17, 2025

jirastorza added 5 commits October 17, 2025 11:40

fix: reduce MetadataValues to hashable types, modify document metadata storage, simplified self_query and insert logic.

b32f070

fix: adapt test.

f937fe6

fix: adapt test case to changes.

f68d1c7

fix: additional test fix.

ecbcae2

fix: database chunk and document metadata.

fb5a01b

Robbe-Superlinear removed their assignment Oct 22, 2025

Robbe-Superlinear commented Oct 22, 2025

fix: update README.

15a6000
