Skip to content

Add task to ingest relational schema and examples instead of entire database #1350

@Vasilije1990

Description

@Vasilije1990

Enhanced Database Schema Ingestion Task

NOTE: This issue is part of Contribute-to-Win. Please comment first to get assigned. Read the details here

Problem Statement

The current relational database ingestion (migrate_relational_database) takes an entire database and ingests all data into Cognee, which:

  • Creates excessive noise in the knowledge graph
  • Provides limited semantic value for most use cases
  • Makes the graph unwieldy for schema-based reasoning
  • Doesn't focus on the structural knowledge that's most valuable

Proposed Solution

Create a lightweight ingest_database_schema task that captures only database schema structure plus representative sample data into a dedicated "database_schema" nodeset.

Requirements

Core Functionality

  • Extract database schema information (tables, columns, relationships)
  • Sample limited representative data per table (3-5 rows max)
  • Store schema knowledge in isolated "database_schema" nodeset
  • Preserve foreign key relationships between schema entities
  • Support multiple database providers (SQLite, PostgreSQL, etc.)

Integration Points

  • Add to existing ingestion pipeline options
  • Support in cognee.add() with schema_only=True parameter
  • Enable querying via SearchType.CODE for schema-specific searches
  • Compatible with existing database configuration system

Technical Design

New Data Models (cognee/tasks/schema/models.py)

class DatabaseSchema(DataPoint):
    """Represents a complete database schema with sample data"""
    schema_name: str
    database_type: str  # sqlite, postgres, etc.
    tables: Dict[str, Dict]  # Reuse existing schema format from SqlAlchemyAdapter
    sample_data: Dict[str, List[Dict]]  # Limited examples per table
    extraction_timestamp: datetime
    metadata: dict = {"index_fields": ["schema_name", "database_type"]}

class SchemaTable(DataPoint):
    """Represents an individual table schema with relationships"""
    table_name: str
    schema_name: str
    columns: List[Dict]  # Column definitions with types
    primary_key: Optional[str]
    foreign_keys: List[Dict]  # Foreign key relationships
    sample_rows: List[Dict]  # Max 3-5 example rows
    row_count_estimate: Optional[int]  # Actual table size
    metadata: dict = {"index_fields": ["table_name", "schema_name"]}

class SchemaRelationship(DataPoint):
    """Represents relationships between tables"""
    source_table: str
    target_table: str
    relationship_type: str  # "foreign_key", "one_to_many", etc.
    source_column: str
    target_column: str
    metadata: dict = {"index_fields": ["source_table", "target_table"]}

Core Task Implementation (cognee/tasks/schema/ingest_database_schema.py)

async def ingest_database_schema(
    database_config: Dict,
    schema_name: str = "default",
    max_sample_rows: int = 5,
    node_set: List[str] = ["database_schema"]
) -> List[DataPoint]:
    """
    Ingest database schema with sample data into dedicated nodeset
    
    Args:
        database_config: Database connection configuration
        schema_name: Name identifier for this schema
        max_sample_rows: Maximum sample rows per table
        node_set: Target nodeset (default: ["database_schema"])
    
    Returns:
        List of created DataPoint objects
    """

Key Functions

  1. Schema Extraction: Reuse existing SqlAlchemyAdapter.extract_schema()
  2. Sample Data Collection: Query limited rows per table with LIMIT
  3. Relationship Mapping: Create explicit relationship nodes for foreign keys
  4. Nodeset Assignment: Ensure all schema entities go to dedicated nodeset

Metadata

Metadata

Assignees

Labels

Type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions