examples: Make RAG examples a bit more generic and demoable #3085

Merged
merged 4 commits into from
Feb 8, 2024

Collaborator

@antoniivanov antoniivanov commented Feb 6, 2024

For confluence-reader:

For embed-ingest-job-example:

  • parameterized the table names used in the job
  • added clean-up of deleted rows (though afterwards I realised it's redundant
    for now, as we need to drop the table first because the postgres ingestion
    does not support upserts (updates))
  • the embedding job is written in such a generic way that there's no
    need to tie it to confluence at all; it would work for any dataset
  • added multiple TODOs for missing features; the job could be
    generalized even further if our ingestion framework improves
  • renamed embed-ingest-job-example to pgvector-embedder to better show its responsibilities
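
The drop-then-reingest behaviour described in the list above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the job's actual code: it uses the standard-library `sqlite3` module as a stand-in for Postgres, and the function name `reingest` and the table name `doc_embeddings_demo` are hypothetical.

```python
import sqlite3


def reingest(conn, table, rows):
    """Drop and recreate `table`, then insert `rows` of (id, text)."""
    # Table names cannot be bound as SQL parameters, so validate the
    # configured name before formatting it into the statement.
    if not table.isidentifier():
        raise ValueError(f"unsafe table name: {table!r}")
    # Without upserts, the simplest way to reflect updates and deletions
    # is to rebuild the whole table on every run.
    conn.execute(f"DROP TABLE IF EXISTS {table}")
    conn.execute(f"CREATE TABLE {table} (id INTEGER PRIMARY KEY, text TEXT)")
    conn.executemany(f"INSERT INTO {table} (id, text) VALUES (?, ?)", rows)
    conn.commit()


# Hypothetical table name, passed in rather than hard-coded in the job.
conn = sqlite3.connect(":memory:")
reingest(conn, "doc_embeddings_demo", [(1, "first"), (2, "second")])
# A later run in which document 2 was deleted and document 1 was edited:
reingest(conn, "doc_embeddings_demo", [(1, "first, edited")])
```

Because the table is rebuilt on each run, rows for deleted documents disappear on their own, which is why a separate clean-up step is redundant until upserts are supported.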

Collaborator

@duyguHsnHsn duyguHsnHsn left a comment

Looks good to me! Should we create tickets for some of the TODOs?

Contributor

@yonitoo yonitoo left a comment

Awesome generalization. LGTM.
A bit of a side note, but I recently read this SentenceTransformer issue suggesting that stop-word removal and lemmatization (both part of our text cleaning) are not needed for transformer models; passing the original text is recommended instead. Maybe we can remove the whole cleaning step and embed the text directly (this could be done here or as part of another story).

@antoniivanov
Collaborator Author

Looks good to me! Should we create tickets for some of the TODOs?

Not yet. We will create the needed tickets as we build a backlog for the next milestones.

@antoniivanov
Collaborator Author

Awesome generalization. LGTM. A bit of a side note, but I recently read this SentenceTransformer issue suggesting that stop-word removal and lemmatization (both part of our text cleaning) are not needed for transformer models; passing the original text is recommended instead. Maybe we can remove the whole cleaning step and embed the text directly (this could be done here or as part of another story).

Ok. Maybe separately. But I also want to have some cleaning logic, because it's something you would expect to have in a pipeline, and we need to figure out how to handle it properly.
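
One way to keep a cleaning step in the pipeline while still allowing the original text to be embedded is to make the cleaning optional. The snippet below is a minimal sketch under assumptions, not the job's actual cleaning code: the function name `prepare_text`, the `remove_stop_words` flag, and the tiny stop-word list are all hypothetical, and lemmatization is left out entirely.

```python
import re

# Tiny illustrative stop-word list (an assumption for this sketch);
# a real pipeline would use a fuller list from an NLP library.
STOP_WORDS = {"a", "an", "the", "is", "of", "to", "and"}


def prepare_text(text, remove_stop_words=False):
    """Normalize whitespace; optionally drop stop words before embedding."""
    # Whitespace normalization is harmless for transformer models,
    # so it is always applied.
    text = re.sub(r"\s+", " ", text).strip()
    if not remove_stop_words:
        # Default: pass the (nearly) original text to the embedding model.
        return text
    kept = [word for word in text.split(" ") if word.lower() not in STOP_WORDS]
    return " ".join(kept)


raw = prepare_text("  The job   is generic ")
cleaned = prepare_text("  The job   is generic ", remove_stop_words=True)
```

With a flag like this, the cleaning logic stays in place for datasets or models that benefit from it, and the default path embeds the text essentially unchanged.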

- the recursive method for finding pages was crashing, so it was replaced
with more CQL
- added passing of a parent id so we can take only a few pages for demo
purposes
- noted bugs and issues in the code and added TODOs
- parameterized the table names used in the job
- added clean-up of deleted rows (though I realised it's redundant for now, as
we need to drop the table first because the postgres ingestion does not
support upserts (updates))
- the embedding job is written in such a generic way that there's no
need to tie it to confluence at all; it would work for any dataset
- added multiple TODOs for missing features; the job could be
generalized even further if our ingestion framework improves
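
The CQL-based page lookup with an optional parent id, mentioned in the commit message above, could look roughly like the sketch below. It only builds the query string and makes no Confluence API call. The function name `build_page_cql` and the space key `DEMO` are hypothetical, and the query actually used in the PR may differ; `space`, `type`, and `ancestor` are standard CQL fields.

```python
def build_page_cql(space_key, parent_id=None):
    """Build a CQL query for pages in a space, optionally under one ancestor."""
    clauses = [f'space = "{space_key}"', "type = page"]
    if parent_id is not None:
        # `ancestor` matches every descendant of the given page, so one
        # query replaces a recursive walk over child pages.
        clauses.append(f"ancestor = {int(parent_id)}")
    return " and ".join(clauses)


all_pages = build_page_cql("DEMO")
demo_subtree = build_page_cql("DEMO", parent_id="12345")
```

Restricting the query to one parent page keeps the result set small, which suits demo runs.
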
@antoniivanov antoniivanov changed the title Make RAG examples a bit more generic and demoable examples: Make RAG examples a bit more generic and demoable Feb 8, 2024
@antoniivanov antoniivanov merged commit 62961c7 into main Feb 8, 2024
8 of 10 checks passed
@antoniivanov antoniivanov deleted the person/aivanov/rag branch February 8, 2024 13:22
4 participants