-
Notifications
You must be signed in to change notification settings - Fork 594
Open
Labels
Description
Enhanced Database Schema Ingestion Task
NOTE: This issue is part of Contribute-to-Win. Please comment first to get assigned. Read the details here
Problem Statement
The current relational database ingestion (migrate_relational_database
) takes an entire database and ingests all data into Cognee, which:
- Creates excessive noise in the knowledge graph
- Provides limited semantic value for most use cases
- Makes the graph unwieldy for schema-based reasoning
- Doesn't focus on the structural knowledge that's most valuable
Proposed Solution
Create a lightweight ingest_database_schema
task that captures only database schema structure plus representative sample data into a dedicated "database_schema" nodeset.
Requirements
Core Functionality
- Extract database schema information (tables, columns, relationships)
- Sample limited representative data per table (3-5 rows max)
- Store schema knowledge in isolated "database_schema" nodeset
- Preserve foreign key relationships between schema entities
- Support multiple database providers (SQLite, PostgreSQL, etc.)
Integration Points
- Add to existing ingestion pipeline options
- Support in
cognee.add()
withschema_only=True
parameter - Enable querying via
SearchType.CODE
for schema-specific searches - Compatible with existing database configuration system
Technical Design
New Data Models (cognee/tasks/schema/models.py
)
class DatabaseSchema(DataPoint):
"""Represents a complete database schema with sample data"""
schema_name: str
database_type: str # sqlite, postgres, etc.
tables: Dict[str, Dict] # Reuse existing schema format from SqlAlchemyAdapter
sample_data: Dict[str, List[Dict]] # Limited examples per table
extraction_timestamp: datetime
metadata: dict = {"index_fields": ["schema_name", "database_type"]}
class SchemaTable(DataPoint):
"""Represents an individual table schema with relationships"""
table_name: str
schema_name: str
columns: List[Dict] # Column definitions with types
primary_key: Optional[str]
foreign_keys: List[Dict] # Foreign key relationships
sample_rows: List[Dict] # Max 3-5 example rows
row_count_estimate: Optional[int] # Actual table size
metadata: dict = {"index_fields": ["table_name", "schema_name"]}
class SchemaRelationship(DataPoint):
"""Represents relationships between tables"""
source_table: str
target_table: str
relationship_type: str # "foreign_key", "one_to_many", etc.
source_column: str
target_column: str
metadata: dict = {"index_fields": ["source_table", "target_table"]}
Core Task Implementation (cognee/tasks/schema/ingest_database_schema.py
)
async def ingest_database_schema(
database_config: Dict,
schema_name: str = "default",
max_sample_rows: int = 5,
node_set: List[str] = ["database_schema"]
) -> List[DataPoint]:
"""
Ingest database schema with sample data into dedicated nodeset
Args:
database_config: Database connection configuration
schema_name: Name identifier for this schema
max_sample_rows: Maximum sample rows per table
node_set: Target nodeset (default: ["database_schema"])
Returns:
List of created DataPoint objects
"""
Key Functions
- Schema Extraction: Reuse existing
SqlAlchemyAdapter.extract_schema()
- Sample Data Collection: Query limited rows per table with
LIMIT
- Relationship Mapping: Create explicit relationship nodes for foreign keys
- Nodeset Assignment: Ensure all schema entities go to dedicated nodeset