Extract and store document/chunk structure and relationships #3450

Open
jacopo-chevallard opened this issue Nov 4, 2024 — with Linear · 1 comment
Assignees
Labels
enhancement New feature or request rag: ingestion

Comments

Collaborator

jacopo-chevallard commented Nov 4, 2024

Currently, document chunks are stored individually in our vector database (PGVector); the only relationship we record is the one between a chunk and its original document.

We should expand this to extract the document layout (headers, footers, tables, images, captions, …) and the relationships between elements (chunk --> page --> file, previous_chunk --> chunk --> next_chunk, …), and store both in a database; see our scheme.
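
To make the proposal concrete, here is a minimal sketch of what a per-chunk record could look like. It is not the linked scheme: `LayoutType`, `ChunkRecord`, `link_chunks` and all field names are hypothetical, chosen only to mirror the layout types and relationships listed above. Persistence (PGVector or otherwise) is left out on purpose.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional
from uuid import UUID, uuid4


class LayoutType(str, Enum):
    """Hypothetical layout categories, taken from the list in this issue."""
    HEADER = "header"
    FOOTER = "footer"
    TABLE = "table"
    IMAGE = "image"
    CAPTION = "caption"
    TEXT = "text"  # assumption: a fallback for plain body text


@dataclass
class ChunkRecord:
    """One chunk plus its structural links (all names are illustrative)."""
    file_id: UUID        # chunk --> file
    page_number: int     # chunk --> page
    text: str
    layout_type: LayoutType = LayoutType.TEXT
    chunk_id: UUID = field(default_factory=uuid4)
    previous_chunk_id: Optional[UUID] = None  # previous_chunk --> chunk
    next_chunk_id: Optional[UUID] = None      # chunk --> next_chunk


def link_chunks(chunks: List[ChunkRecord]) -> List[ChunkRecord]:
    """Fill previous/next ids, assuming `chunks` is already in reading order."""
    for prev, nxt in zip(chunks, chunks[1:]):
        prev.next_chunk_id = nxt.chunk_id
        nxt.previous_chunk_id = prev.chunk_id
    return chunks
```

With records like these, a retriever could walk `previous_chunk_id` / `next_chunk_id` to pull in neighbouring context for a matched chunk, or filter by `layout_type` to skip headers and footers. Whether the links live alongside the embeddings or in a separate table is a design choice this sketch does not settle.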

linear bot commented Nov 4, 2024

dosubot bot added the "enhancement" (New feature or request) label on Nov 4, 2024
jacopo-chevallard changed the title from "Extract and save document/chunk structure and relationships" to "Extract and store document/chunk structure and relationships" on Nov 4, 2024