core: add new clean up strategy "scoped_full" to indexing #28505
Conversation
This looks very reasonable!
If you'd like to merge:
- Could you introduce this as a new cleanup mode?
- Add unit-tests
@@ -259,6 +256,11 @@ def index(
    specify a custom vector_field:
    upsert_kwargs={"vector_field": "embedding"}

    .. versionadded:: 0.3.10

    scoped_full_cleanup: This argument is valid only when `cleanup` is Full.
How about we turn this into another cleanup mode so we don't introduce another parameter that only works conditionally?
This looks like `cleanup == 'full_scoped'` or something like that.
And we'd document:
- In the IMPORTANT section -- that full_scoped is a solution to the problem of batch cleanup being best effort
- That this solution keeps track of the source ids in memory (probably fine for most use cases in terms of memory consumption) -- it would require parallelizing for 10M+ docs anyway
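For illustration, here is a hedged sketch of how the suggested mode might be invoked through the existing `index()` API. The mode name `"scoped_full"` is taken from this PR's title, and the snippet assumes a langchain version that includes it; the fake embedding and in-memory vector store are just stand-ins to make the example self-contained.

```python
from langchain.indexes import SQLRecordManager, index
from langchain_core.documents import Document
from langchain_core.embeddings import DeterministicFakeEmbedding
from langchain_core.vectorstores import InMemoryVectorStore

# The record manager tracks which documents have already been written.
record_manager = SQLRecordManager("demo_ns", db_url="sqlite:///record_manager.sql")
record_manager.create_schema()

vector_store = InMemoryVectorStore(DeterministicFakeEmbedding(size=256))

# This run only re-loads documents from one source.
docs = [
    Document(page_content="updated content", metadata={"source": "doc_a.txt"}),
]

# With a scoped-full mode, only stale records whose source id ("doc_a.txt"
# here) was seen during this run are deleted; records belonging to sources
# the loader did not return are left untouched, unlike cleanup="full".
result = index(
    docs,
    record_manager,
    vector_store,
    cleanup="scoped_full",
    source_id_key="source",
)
print(result)  # dict with num_added / num_updated / num_skipped / num_deleted
```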
Pushed a minor doc-string update

@KeiichiHirobe thank you, looks great!
Note that this PR is currently a Draft, so I haven't changed the `aindex` function or added test code for my change. Once we agree on the direction, I will add those commits.

`batch_size` is very difficult to choose: a large value like >10000 puts load on the VectorDB and RecordManager, while a small value deletes records unnecessarily and leads to redundant work, as the IMPORTANT section says. On the other hand, we can't use `full` because in our use case the loader returns only a subset of the dataset. I suspect many people are in the same situation as us.

So, as one possible solution, I would like to introduce a new argument, `scoped_full_cleanup`. This argument is valid only when `cleanup` is Full. If True, full cleanup deletes all documents that haven't been updated AND that are associated with source ids seen during indexing. The default is False, so this change keeps backward compatibility.
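For clarity, a minimal sketch (not the library's actual implementation) of the deletion rule described above, contrasted with plain `full` cleanup, which would also delete records belonging to sources the loader never returned in this run:

```python
# A record is deleted only if it was NOT upserted in this run
# AND its source id WAS seen in this run (scoped-full behaviour).
def should_delete(updated_this_run: bool,
                  record_source_id: str,
                  source_ids_seen_this_run: set[str]) -> bool:
    return (not updated_this_run) and record_source_id in source_ids_seen_this_run

# Example: the loader only returned documents for "doc_a.txt" this run.
seen = {"doc_a.txt"}
print(should_delete(False, "doc_a.txt", seen))  # True  -> stale record of a re-loaded source
print(should_delete(False, "doc_b.txt", seen))  # False -> source not loaded this run, kept
print(should_delete(True,  "doc_a.txt", seen))  # False -> freshly upserted record, kept
```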