core[patch]: improve index/aindex api when batch_size<n_docs #25754
could we add unit tests for this?
@baskaryan yeah sure!
Hi @eyurtsev can i do something to make it easier to merge?
Standby we're working on the 0.3 release so there was an effective code freeze for the past 2 weeks
Hi @federico-pisanu, Thank you for the PR! In order to merge the changes, the code would need to do the following:
@federico-pisanu let me double check the existing unit tests to make sure they were wrong
@federico-pisanu OK confirmed -- no action is required from you at this time. I'll re-apply changes on aindex
@eyurtsev Isn't it already done?
@federico-pisanu yep didn't notice -- looks good added unit tests to cover the optimization -- i.e., we get different results based on which batch was mutated
…in-ai#25754)
- **Description:** prevent the index function from re-indexing an entire source document when nothing has changed.
- **Issue:** langchain-ai#22135

I worked on a solution to this issue that is a compromise between being cheap and being fast.
In the previous code, when batch_size is smaller than the number of docs from a given source, almost the entire source is deleted (all documents from that source except those in the first batch).
My solution deletes documents from the vector store and record manager only if at least one document has changed for that source.
Hope this can help!

---------
Co-authored-by: Eugene Yurtsev <[email protected]>
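The behaviour described above can be illustrated with a small, self-contained toy model. This is only a sketch under stated assumptions, not the code from this PR: `Doc`, `doc_key`, `toy_index`, the `only_clean_changed` flag and the plain dict standing in for the vector store and record manager are all hypothetical names invented for the example.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Doc:
    source: str
    content: str


def doc_key(doc: Doc) -> str:
    """Content-based key: an edited document gets a new key."""
    return hashlib.sha256(f"{doc.source}\x00{doc.content}".encode()).hexdigest()


def toy_index(docs, store, *, batch_size, run_id, only_clean_changed):
    """Hypothetical stand-in for one incremental indexing run.

    `store` maps key -> (source, run_id of the last run that saw the key).
    Returns counters for added, skipped and deleted documents.
    """
    stats = {"added": 0, "skipped": 0, "deleted": 0}
    for start in range(0, len(docs), batch_size):
        batch = docs[start : start + batch_size]
        changed_sources = set()
        for doc in batch:
            key = doc_key(doc)
            if key in store:
                stats["skipped"] += 1  # already indexed, only refresh the marker
            else:
                stats["added"] += 1  # new or edited content
                changed_sources.add(doc.source)
            store[key] = (doc.source, run_id)
        # Previous behaviour: clean every source seen in the batch.
        # Described fix: clean only sources with at least one changed document.
        sources = changed_sources if only_clean_changed else {d.source for d in batch}
        stale = [k for k, (src, seen) in store.items() if src in sources and seen < run_id]
        for k in stale:
            del store[k]
        stats["deleted"] += len(stale)
    return stats
```

Calling `toy_index` twice on the same four documents from one source with `batch_size=2` shows the difference: with `only_clean_changed=False` the second run deletes and re-adds the documents outside the first batch, while with `only_clean_changed=True` it skips all four. In this toy model the amount of work still depends on which batch contains an edited document, and a document removed from a source is only cleaned up once another document from that source changes.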
…angchain-ai#25754)" This reverts commit 2538963.