Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

core[patch]: improve index/aindex api when batch_size<n_docs #25754

Conversation

federico-pisanu
Copy link
Contributor

I worked on a solution to this issue that is a compromise between being cheap and being fast.
In the previous code, when batch_size is greater than the number of docs from a certain source almost the entire source is deleted (all documents from that source except for the documents in the first batch)
My solution deletes documents from vector store and record manager only if at least one document has changed for that source.

Hope this can help!

@dosubot dosubot bot added the size:S This PR changes 10-29 lines, ignoring generated files. label Aug 26, 2024
Copy link

vercel bot commented Aug 26, 2024

The latest updates on your projects. Learn more about Vercel for Git ↗︎

1 Skipped Deployment
Name Status Preview Comments Updated (UTC)
langchain ⬜️ Ignored (Inspect) Visit Preview Sep 30, 2024 8:52pm

@dosubot dosubot bot added Ɑ: core Related to langchain-core 🤖:bug Related to a bug, vulnerability, unexpected error with an existing feature labels Aug 26, 2024
Copy link
Collaborator

@baskaryan baskaryan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could we add unit tests for this?

@baskaryan baskaryan added the needs test PR needs to be updated with tests label Aug 28, 2024
@dosubot dosubot bot added size:M This PR changes 30-99 lines, ignoring generated files. and removed size:S This PR changes 10-29 lines, ignoring generated files. labels Aug 28, 2024
@federico-pisanu
Copy link
Contributor Author

@baskaryan yeah sure!
I checked the current tests and noticed that they where not properly testing the functionality. Patching the timestamps but not changing the patch between the "index" calls it was not properly simulating consequent calls. The test were passing but they were not testing the right thing.
Updating the tests was sufficient to check that the new code works and fixes the old behavior.

@federico-pisanu
Copy link
Contributor Author

Hi @eyurtsev can i do something to make it easier to merge?

@eyurtsev
Copy link
Collaborator

Standby we're working on the 0.3 release so there was an effective code freeze for the past 2 weeks

@eyurtsev eyurtsev assigned eyurtsev and unassigned baskaryan Sep 18, 2024
@eyurtsev
Copy link
Collaborator

Hi @federico-pisanu,

Thank you for the PR! In order to merge the changes, the code would need to do the following:

  1. Apply similar changes to the aindex API
  2. Revert modifications to existing unit tests (these modifications make it look like the PR is introducing a new bug!)
  3. Add a new unit tests that cover the relevant scenario.

@eyurtsev
Copy link
Collaborator

@federico-pisanu let me double check the existing unit tests to make sure they were wrong

@eyurtsev
Copy link
Collaborator

@federico-pisanu OK confirmed -- no action is required from you at this time. I'll re-apply changes on aindex

@federico-pisanu
Copy link
Contributor Author

@eyurtsev Isn't it already done?

@dosubot dosubot bot added size:L This PR changes 100-499 lines, ignoring generated files. and removed size:M This PR changes 30-99 lines, ignoring generated files. labels Sep 18, 2024
@dosubot dosubot bot added the lgtm PR looks good. Use to confirm that a PR is ready for merging. label Sep 18, 2024
@eyurtsev
Copy link
Collaborator

@federico-pisanu yep didn't notice -- looks good added unit tests to cover the optimization -- i.e., we get different results based on which batch was mutated

@eyurtsev eyurtsev changed the title core: fix index api when batch_size<n_docs core[patch]: improve index/aindex api when batch_size<n_docs Sep 18, 2024
@eyurtsev eyurtsev enabled auto-merge (squash) September 18, 2024 15:55
@eyurtsev eyurtsev merged commit 2538963 into langchain-ai:master Sep 30, 2024
85 checks passed
Sheepsta300 pushed a commit to Sheepsta300/langchain that referenced this pull request Oct 1, 2024
…in-ai#25754)

- **Description:** prevent index function to re-index entire source
document even if nothing has changed.
- **Issue:** langchain-ai#22135

I worked on a solution to this issue that is a compromise between being
cheap and being fast.
In the previous code, when batch_size is greater than the number of docs
from a certain source almost the entire source is deleted (all documents
from that source except for the documents in the first batch)
My solution deletes documents from vector store and record manager only
if at least one document has changed for that source.

Hope this can help!

---------

Co-authored-by: Eugene Yurtsev <[email protected]>
KeiichiHirobe added a commit to KeiichiHirobe/langchain that referenced this pull request Dec 13, 2024
eyurtsev pushed a commit that referenced this pull request Dec 13, 2024
I reported the bug 2 weeks ago here:
#28447

I believe this is a critical bug for the indexer, so I submitted a PR to
revert the change and added unit tests to prevent similar bugs from
being introduced in the future.

@eyurtsev Could you check this?
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
🤖:bug Related to a bug, vulnerability, unexpected error with an existing feature Ɑ: core Related to langchain-core lgtm PR looks good. Use to confirm that a PR is ready for merging. needs test PR needs to be updated with tests size:L This PR changes 100-499 lines, ignoring generated files.
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

3 participants