core[patch]: improve index/aindex api when batch_size<n_docs #25754

federico-pisanu · 2024-08-26T14:30:41Z

Description: prevent the index function from re-indexing an entire source document when nothing has changed.
Issue: Bug in Indexing Function Causes Inconsistent Document Deletion #22135

I worked on a solution to this issue that is a compromise between being cheap and being fast.
In the previous code, when batch_size is smaller than the number of docs from a given source, almost the entire source is deleted (all documents from that source except for the documents in the first batch).
My solution deletes documents from the vector store and the record manager only if at least one document has changed for that source.
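
The behavior described above can be illustrated with a toy model. This is a minimal sketch, not LangChain code: `Doc`, `toy_index`, and the `only_if_changed` flag are hypothetical names introduced only for illustration, and a plain dict stands in for the record manager. It assumes cleanup runs after every batch and removes records of the batch's sources whose timestamp is older than the current run.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Doc:
    key: str     # stands in for the document's content hash
    source: str  # stands in for the source id


def toy_index(docs, records, batch_size, now, only_if_changed):
    """Toy incremental indexer (hypothetical, not the LangChain API).

    `records` maps key -> (source, updated_at). Returns (added, deleted).
    """
    added = deleted = 0
    for start in range(0, len(docs), batch_size):
        batch = docs[start:start + batch_size]
        changed_sources = set()
        for doc in batch:
            if doc.key not in records:
                added += 1  # a real indexer would write to the vector store here
                changed_sources.add(doc.source)
            records[doc.key] = (doc.source, now)  # refresh the timestamp
        batch_sources = {doc.source for doc in batch}
        # Old behavior: clean up every source seen in the batch.
        # Sketched fix: clean up only sources with at least one changed doc.
        cleanup_sources = changed_sources if only_if_changed else batch_sources
        stale = [key for key, (src, ts) in records.items()
                 if src in cleanup_sources and ts < now]
        for key in stale:
            del records[key]
            deleted += 1
    return added, deleted


docs = [Doc(f"d{i}", "src") for i in range(4)]

old = {}
print(toy_index(docs, old, batch_size=2, now=1, only_if_changed=False))  # (4, 0)
print(toy_index(docs, old, batch_size=2, now=2, only_if_changed=False))  # (2, 2)

new = {}
print(toy_index(docs, new, batch_size=2, now=1, only_if_changed=True))   # (4, 0)
print(toy_index(docs, new, batch_size=2, now=2, only_if_changed=True))   # (0, 0)
```

In this model, the second unchanged run under the old behavior deletes and re-adds every document outside the first batch, while the changed-sources variant does no work.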

Hope this can help!

baskaryan

could we add unit tests for this?

federico-pisanu · 2024-08-28T07:53:14Z

@baskaryan yeah sure!
I checked the current tests and noticed that they were not properly testing the functionality. They patched the timestamps but did not change the patch between the "index" calls, so they were not properly simulating consecutive calls. The tests were passing, but they were not testing the right thing.
Updating the tests was sufficient to check that the new code works and fixes the old behavior.

federico-pisanu · 2024-09-13T13:15:00Z

Hi @eyurtsev can i do something to make it easier to merge?

eyurtsev · 2024-09-13T23:34:46Z

Stand by. We're working on the 0.3 release, so there has been an effective code freeze for the past 2 weeks.

eyurtsev · 2024-09-18T15:06:06Z

Hi @federico-pisanu,

Thank you for the PR! In order to merge the changes, the code would need to do the following:

Apply similar changes to the aindex API
Revert modifications to existing unit tests (these modifications make it look like the PR is introducing a new bug!)
Add new unit tests that cover the relevant scenario.

eyurtsev · 2024-09-18T15:07:22Z

@federico-pisanu let me double check the existing unit tests to make sure they were wrong

eyurtsev · 2024-09-18T15:11:24Z

@federico-pisanu OK confirmed -- no action is required from you at this time. I'll re-apply changes on aindex

federico-pisanu · 2024-09-18T15:19:13Z

@eyurtsev Isn't it already done?

eyurtsev · 2024-09-18T15:53:57Z

@federico-pisanu yep, I didn't notice. Looks good. I added unit tests to cover the optimization, i.e., we get different results based on which batch was mutated.

…in-ai#25754) - **Description:** prevent the index function from re-indexing an entire source document when nothing has changed. - **Issue:** langchain-ai#22135 I worked on a solution to this issue that is a compromise between being cheap and being fast. In the previous code, when batch_size is smaller than the number of docs from a given source, almost the entire source is deleted (all documents from that source except for the documents in the first batch). My solution deletes documents from the vector store and the record manager only if at least one document has changed for that source. Hope this can help! --------- Co-authored-by: Eugene Yurtsev <[email protected]>

…angchain-ai#25754)" This reverts commit 2538963.

@eyurtsev

I reported the bug 2 weeks ago here: #28447. I believe this is a critical bug for the indexer, so I submitted a PR to revert the change and added unit tests to prevent similar bugs from being introduced in the future. @eyurtsev Could you check this?

fixed index api when batch_size<n_doc

1c48dab

dosubot bot added the size:S This PR changes 10-29 lines, ignoring generated files. label Aug 26, 2024

dosubot bot added Ɑ: core Related to langchain-core 🤖:bug Related to a bug, vulnerability, unexpected error with an existing feature labels Aug 26, 2024

federico-pisanu mentioned this pull request Aug 26, 2024

Bug in Indexing Function Causes Inconsistent Document Deletion #22135

Closed

5 tasks

baskaryan reviewed Aug 28, 2024

baskaryan added the needs test PR needs to be updated with tests label Aug 28, 2024

fixed tests

0ab19f9

dosubot bot added size:M This PR changes 30-99 lines, ignoring generated files. and removed size:S This PR changes 10-29 lines, ignoring generated files. labels Aug 28, 2024

efriis assigned baskaryan Aug 30, 2024

baskaryan requested a review from eyurtsev September 2, 2024 21:55

eyurtsev assigned eyurtsev and unassigned baskaryan Sep 18, 2024

eyurtsev added 2 commits September 18, 2024 11:40

Merge branch 'master' into fix-index-api-when-batch_size-<-n_doc

79476ee

x

8e5aa39

dosubot bot added size:L This PR changes 100-499 lines, ignoring generated files. and removed size:M This PR changes 30-99 lines, ignoring generated files. labels Sep 18, 2024

eyurtsev approved these changes Sep 18, 2024

dosubot bot added the lgtm PR looks good. Use to confirm that a PR is ready for merging. label Sep 18, 2024

eyurtsev changed the title ~~core: fix index api when batch_size<n_docs~~ core[patch]: improve index/aindex api when batch_size<n_docs Sep 18, 2024

eyurtsev enabled auto-merge (squash) September 18, 2024 15:55

eyurtsev added 3 commits September 30, 2024 16:37

Merge branch 'master' into fix-index-api-when-batch_size-<-n_doc

96f967c

Merge branch 'master' into fix-index-api-when-batch_size-<-n_doc

2b44727

x

7c615b6

eyurtsev merged commit 2538963 into langchain-ai:master Sep 30, 2024
85 checks passed

KeiichiHirobe mentioned this pull request Dec 2, 2024

[Bug] [core/indexer] PR #25754 seems like introducing bugs #28447

Open

5 tasks

KeiichiHirobe added a commit to KeiichiHirobe/langchain that referenced this pull request Dec 13, 2024

Revert "core[patch]: improve index/aindex api when batch_size<n_docs (l…

aeec698

…angchain-ai#25754)" This reverts commit 2538963.

eyurtsev mentioned this pull request Dec 13, 2024

[core/indexer] Reverts PR #25754 and add unit tests #28702

Merged
