Extract and store document/chunk structure and relationships #3450

jacopo-chevallard · 2024-11-04T15:41:48Z

Currently, document chunks are stored individually in our vector database (PGVector); the only relationship we record is the one between a chunk and its original document.

We should expand this to extract the document layout (headers, footers, tables, images, captions, …) and the relationships (chunk --> page --> file, previous_chunk --> chunk --> next_chunk, …), and store them in a database; see our scheme.
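
To make the proposal concrete, here is a minimal sketch of what a chunk record with these relationships could look like. It is not the scheme referenced above: the `Chunk` class, its field names (`file_id`, `page_number`, `layout_type`, `previous_chunk_id`, `next_chunk_id`) and the `link_chunks` helper are all hypothetical, and persistence to PGVector or any other database is left out.

```python
from dataclasses import dataclass, field
from typing import List, Optional
from uuid import UUID, uuid4


@dataclass
class Chunk:
    """Hypothetical chunk record; field names are illustrative assumptions."""
    text: str
    file_id: UUID                  # chunk --> file
    page_number: int               # chunk --> page
    layout_type: str = "text"      # e.g. "header", "footer", "table", "image", "caption"
    chunk_id: UUID = field(default_factory=uuid4)
    previous_chunk_id: Optional[UUID] = None
    next_chunk_id: Optional[UUID] = None


def link_chunks(chunks: List[Chunk]) -> List[Chunk]:
    """Set previous/next ids on chunks that are already in reading order."""
    for prev, nxt in zip(chunks, chunks[1:]):
        prev.next_chunk_id = nxt.chunk_id
        nxt.previous_chunk_id = prev.chunk_id
    return chunks
```

With records like these, retrieval could start from a matched chunk and follow `previous_chunk_id` / `next_chunk_id` to pull in neighbouring context, or filter on `layout_type` to treat tables and captions differently from body text.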

linear · 2024-11-04T15:41:50Z

CORE-278 Extract and save document/chunk structure and relationships

jacopo-chevallard added the rag: ingestion label Nov 4, 2024 — with Linear

jacopo-chevallard self-assigned this Nov 4, 2024

dosubot bot added the enhancement New feature or request label Nov 4, 2024

jacopo-chevallard changed the title ~~Extract and save document/chunk structure and relationships~~ Extract and store document/chunk structure and relationships Nov 4, 2024
