@Robbe-Superlinear Robbe-Superlinear commented Oct 13, 2025

Self-Query: Automatic Metadata Filter Extraction

This pull request adds a self-query feature to RAGLite that uses an LLM to extract metadata filters automatically from natural language queries. Users can search without specifying metadata filters by hand, and a new metadata management system tracks which metadata fields and values are available for filtering.

Key Features

🔍 Self-Query Functionality

  • Automatic metadata extraction: Extracts metadata filters (e.g., manufacturer, year, type) directly from natural language queries
  • Context-aware filtering: The Metadata table provides the LLM with available metadata fields and their possible values, ensuring generated filters are valid and grounded
  • Integration: Works with both vector_search and keyword_search methods
  • Configurable: Enable via RAGLiteConfig(self_query=True) (disabled by default)
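
To illustrate what "valid and grounded" can mean in practice, here is a minimal sketch that checks an LLM-proposed filter against a catalog of known fields and values. It is not the code in this PR: the function name `ground_filter` and the plain-dict catalog are assumptions made for illustration only.

```python
from collections.abc import Mapping

MetadataFilter = dict[str, list[object]]


def ground_filter(
    proposed: Mapping[str, object],
    catalog: Mapping[str, list[object]],
) -> MetadataFilter:
    """Keep only the fields and values that appear in the metadata catalog (hypothetical helper)."""
    grounded: MetadataFilter = {}
    for field, value in proposed.items():
        allowed = catalog.get(field)
        if allowed is None:
            continue  # Unknown field: not in the catalog, so drop it.
        values = value if isinstance(value, list) else [value]
        kept = [v for v in values if v in allowed]
        if kept:
            grounded[field] = kept
    return grounded


# Illustrative catalog, mirroring the fields used in the usage example of this PR.
catalog = {
    "manufacturer": ["Audi", "Honda", "Chevrolet"],
    "year": [2015, 2022, 2023],
    "type": ["electric", "sedan", "truck"],
}
# A proposed filter with one valid entry, one unknown field, and one unknown value.
proposed = {"manufacturer": "Audi", "color": ["red"], "year": [1999]}
print(ground_filter(proposed, catalog))  # {'manufacturer': ['Audi']}
```

Dropping unknown fields and values, rather than raising an error, means a bad extraction widens the search instead of returning nothing. Whether the PR takes the same approach is not stated here.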

📊 Metadata Management System

  • Normalized storage: All document and chunk metadata values are now stored as lists via the _adapt_metadata utility
  • Metadata tracking: New Metadata table tracks all metadata fields and their allowed unique values, providing a catalog of available filters for self-query
  • Automatic aggregation: The Metadata table is updated during document insertion
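
The two ideas above, list-valued metadata and a catalog of unique values per field, can be sketched as follows. This is an assumption-laden illustration rather than the PR's implementation: `as_list_values` and `build_catalog` are hypothetical names, and the actual `_adapt_metadata` utility and Metadata table may behave differently.

```python
from collections.abc import Iterable, Mapping


def as_list_values(metadata: Mapping[str, object]) -> dict[str, list[object]]:
    """Wrap scalar values in a list; copy list, tuple, and set values into a list."""
    normalized: dict[str, list[object]] = {}
    for field, value in metadata.items():
        if isinstance(value, (list, tuple, set)):
            normalized[field] = list(value)
        else:
            normalized[field] = [value]
    return normalized


def build_catalog(documents: Iterable[Mapping[str, object]]) -> dict[str, list[object]]:
    """Collect the unique values seen for each metadata field, in first-seen order."""
    catalog: dict[str, list[object]] = {}
    for metadata in documents:
        for field, values in as_list_values(metadata).items():
            seen = catalog.setdefault(field, [])
            for value in values:
                if value not in seen:
                    seen.append(value)
    return catalog


# Illustrative metadata only; the third entry mixes scalar and list values.
docs = [
    {"manufacturer": "Audi", "year": 2022, "type": "electric"},
    {"manufacturer": "Honda", "year": 2023, "type": "sedan"},
    {"manufacturer": "Audi", "year": 2015, "type": ["truck", "electric"]},
]
print(build_catalog(docs))
# {'manufacturer': ['Audi', 'Honda'], 'year': [2022, 2023, 2015], 'type': ['electric', 'sedan', 'truck']}
```

Storing every value as a list lets a filter treat single-valued and multi-valued fields the same way.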

Performance Benchmarks

Dataset: CUAD (Contract Understanding Atticus Dataset)
Settings: Default RAGLite benchmarking configuration

| Configuration | MAP | MRR | Answers Found | Chunks Retrieved | Avg/Query | vs Main |
|---|---|---|---|---|---|---|
| main | 0.6386 | 0.6386 | 89.19% | 4,440 | 40.0 | |
| self-query (LLM) | 0.6254 | 0.6254 | 88.29% | 2,941 | 26.5 | -33.8% |
| self-query (regex) | 0.6230 | 0.6230 | 88.29% | 2,941 | 26.5 | -33.8% |
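
The "Avg/Query" and "vs Main" figures can be cross-checked from the chunk counts. The short calculation below assumes that "vs Main" is the relative change in chunks retrieved and infers the number of benchmark queries from the main row; neither is stated explicitly above.

```python
# Figures copied from the benchmark results above.
main_chunks = 4440
self_query_chunks = 2941
main_avg_per_query = 40.0

# Assumption: the query count is inferred from the main row, not reported directly.
num_queries = main_chunks / main_avg_per_query
self_query_avg = self_query_chunks / num_queries
relative_change = (self_query_chunks - main_chunks) / main_chunks

print(f"queries: {num_queries:.0f}")                   # queries: 111
print(f"self-query avg/query: {self_query_avg:.1f}")   # self-query avg/query: 26.5
print(f"vs main: {relative_change:+.1%}")              # vs main: -33.8%
```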

Usage Example

from raglite import Document, RAGLiteConfig, insert_documents, rag
from raglite._search import _self_query
# Configure with self-query enabled
my_config = RAGLiteConfig(
    db_url="duckdb:///raglite.db",
    llm="gpt-4.1-nano",
    embedder="text-embedding-3-small",
    self_query=True,  # Enable automatic metadata extraction
)
# Insert documents with metadata
car_docs = [
    Document.from_text(
        "# Audi e-tron\nThe Audi e-tron is a fully electric mid-size luxury crossover SUV.",
        manufacturer="Audi",
        year=2022,
        type="electric",
    ),
    Document.from_text(
        "# Honda Civic\nThe Honda Civic is a line of cars manufactured by Honda since 1972.",
        manufacturer="Honda",
        year=2023,
        type="sedan",
    ),
    Document.from_text(
        "# Chevrolet Silverado\nThe Chevrolet Silverado is a range of trucks by General Motors.",
        manufacturer="Chevrolet",
        year=2015,
        type="truck",
    ),
]
insert_documents(car_docs, config=my_config)
# Query naturally - metadata filters extracted automatically
query = "What car does Audi offer?"
metadata_filter = _self_query(query, config=my_config)
print(metadata_filter)  # {'manufacturer': ['Audi']}
# Use in RAG pipeline
messages = [{"role": "user", "content": query}]
chunk_spans = []
stream = rag(messages, on_retrieval=lambda x: chunk_spans.extend(x), config=my_config)
for update in stream:
    print(update, end="")

@jirastorza jirastorza marked this pull request as ready for review October 14, 2025 13:01
@jirastorza jirastorza marked this pull request as draft October 14, 2025 13:03
@jirastorza jirastorza marked this pull request as ready for review October 14, 2025 13:56
@emilradix emilradix requested a review from Copilot October 15, 2025 10:46

@Copilot Copilot AI left a comment

Pull Request Overview

This PR adds self-query functionality to the RAGLite library, enabling automatic extraction of metadata filters from natural language queries to improve search precision.

  • Implements a _self_query function that uses an LLM to extract metadata filters from queries
  • Adds metadata tracking in the database with a new Metadata table
  • Integrates self-query capability into the retrieval pipeline with a configurable flag

Reviewed Changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.

| File | Description |
|---|---|
| src/raglite/_config.py | Adds self_query boolean flag to RAGLiteConfig |
| src/raglite/_database.py | Defines new Metadata table for tracking available metadata values |
| src/raglite/_insert.py | Implements metadata aggregation and database updates during document insertion |
| src/raglite/_rag.py | Adds core self-query functionality and integrates it into retrieval pipeline |
| tests/test_insert.py | Tests metadata tracking functionality |
| tests/test_rag.py | Tests self-query extraction and retrieval integration |

@emilradix
Contributor

Could you add a PR description? You can edit Robbe's first comment to add it. @jirastorza

Author

@Robbe-Superlinear Robbe-Superlinear left a comment

A few small comments and one bigger topic. I propose we have a sync to align when you have the time.

@Robbe-Superlinear Robbe-Superlinear removed their assignment Oct 22, 2025
Author

@Robbe-Superlinear Robbe-Superlinear left a comment

LGTM

@emilradix
Contributor

emilradix commented Oct 22, 2025

We will probably need a more difficult dataset for benchmarking this to see a gain in performance. I think we are already finding all chunks from the right document, so it is just a matter of ordering them, which the filter won't help with. So the result is not so surprising, in my opinion.

Also, could you make sure the PR description is up to date? For example, did you incorporate the change that ensures all metadata used for filtering has its values stored as a list? If so, it should be mentioned in the description. @jirastorza
