
Bug in Indexing Function Causes Inconsistent Document Deletion #22135

Closed · 5 tasks done
ericvaillancourt opened this issue May 24, 2024 · 7 comments
Labels: 03 enhancement, documentation
@ericvaillancourt

Checked other resources

  • I added a very descriptive title to this issue.
  • I searched the LangChain documentation with the integrated search.
  • I used the GitHub search to find a similar question and didn't find it.
  • I am sure that this is a bug in LangChain rather than my code.
  • The bug is not resolved by updating to the latest stable version of LangChain (or the specific integration package).

Example Code

Here is an example that demonstrates the problem:

If I change the batch_size in api.py to a value that is larger than the number of elements in my list, everything works fine. By default, the batch_size is set to 100, and only the first 100 elements are handled correctly.

from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_core.documents import Document
from langchain.indexes import SQLRecordManager, index

embeddings = OpenAIEmbeddings()

documents = []


# Build 200 documents that all share the same source, so they span
# more than one default batch (batch_size=100).
for i in range(1, 201):
    page_content = f"data {i}"
    metadata = {"source": "test.txt"}
    document = Document(page_content=page_content, metadata=metadata)
    documents.append(document)


collection_name = "test_index"

vectorstore = Chroma(
    collection_name=collection_name,
    persist_directory="emb",
    embedding_function=embeddings,
)
namespace = f"chroma/{collection_name}"
record_manager = SQLRecordManager(
    namespace, db_url="sqlite:///record_manager_cache.sql"
)

record_manager.create_schema()

idx = index(
    documents,
    record_manager,
    vectorstore,
    cleanup="incremental",
    source_id_key="source",
)
# First run, expected: {'num_added': 200, 'num_updated': 0, 'num_skipped': 0, 'num_deleted': 0}
# and that's what we get.
print(idx)
idx = index(
    documents,
    record_manager,
    vectorstore,
    cleanup="incremental",
    source_id_key="source",
)
# Second run, expected: {'num_added': 0, 'num_updated': 0, 'num_skipped': 200, 'num_deleted': 0}
# but we get:           {'num_added': 100, 'num_updated': 0, 'num_skipped': 100, 'num_deleted': 100}
print(idx)

Error Message and Stack Trace (if applicable)

No response

Description

I've encountered a bug in the index function of LangChain when processing documents. The function behaves inconsistently across runs, leading to unexpected deletions. When running it twice in a row without any changes to the data, the first run indexes all documents as expected. On the second run, however, only the first batch of documents (batch_size=100) is correctly identified as already indexed and skipped, while the remaining documents are mistakenly deleted and re-indexed.

System Info

langchain==0.1.20
langchain-community==0.0.38
langchain-core==0.1.52
langchain-openai==0.1.7
langchain-postgres==0.0.4
langchain-text-splitters==0.0.2
langgraph==0.0.32
langsmith==0.1.59

Python 3.11.7

Platform: Windows 11

@dosubot bot added the Ɑ: vector store, 🔌: chroma, 🔌: openai, and 🤖:bug labels May 24, 2024
@eyurtsev
Collaborator

Apologies, the documentation is out of date on this. For the indexing function to completely avoid redundant work, all the docs corresponding to a particular source need to be in the same batch. I'll try to update the documentation.

If that criterion isn't met, it will end up doing some redundant work, but it should still converge to the correct end state. The indexing logic optimizes for minimizing the amount of time that duplicated content exists in the index.
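
Based on that explanation, a minimal workaround sketch: make the batch large enough that every document sharing a source lands in the same batch. This assumes index accepts a batch_size keyword (it defaults to 100 in the versions listed in this issue):

# Workaround sketch: keep all docs from one source in a single batch.
idx = index(
    documents,
    record_manager,
    vectorstore,
    cleanup="incremental",
    source_id_key="source",
    batch_size=len(documents),  # assumes the `batch_size` kwarg; default is 100
)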

@eyurtsev eyurtsev added needs documentation PR needs to be updated with documentation documentation Improvements or additions to documentation 03 enhancement Enhancement of existing functionality and removed needs documentation PR needs to be updated with documentation Ɑ: vector store Related to vector store module 🤖:bug Related to a bug, vulnerability, unexpected error with an existing feature 🔌: chroma Primarily related to ChromaDB integrations 🔌: openai Primarily related to OpenAI integrations labels May 24, 2024
@eyurtsev eyurtsev self-assigned this May 24, 2024
@ericvaillancourt
Author

OK, but the batch size is set to 100. What if one source has more than 100 docs? The end result is still OK, but does it re-calculate the embeddings?

@magaton

magaton commented May 29, 2024

Hello, I am also hitting this problem.
If I do not increase batch_size in the indexer to be greater than the number of documents, I get deletes and adds even though I did not change anything in the directory I am loading.
If batch_size is greater than the number of loaded documents, then the skip happens and everything is fine.

So something does not seem right here.

@ericvaillancourt
Author

ericvaillancourt commented Jun 5, 2024

I created my own indexing system to solve the problem. It is a bit more sophisticated because it is meant to be used with a multi-vector retriever. I have written an article on Medium.

You can find the code on my GitHub.

And watch the video on YouTube.

@federico-pisanu
Contributor

Hi! I also ran into this problem and worked on a solution in this PR.
@eyurtsev, I hope this can be helpful.

eyurtsev added a commit that referenced this issue Sep 30, 2024
- **Description:** prevent the index function from re-indexing an entire
source document even if nothing has changed.
- **Issue:** #22135

I worked on a solution to this issue that is a compromise between being
cheap and being fast.
In the previous code, when the number of docs from a certain source exceeds
batch_size, almost the entire source is deleted (all documents from that
source except those in the first batch).
My solution deletes documents from the vector store and record manager only
if at least one document has changed for that source.

Hope this can help!

---------

Co-authored-by: Eugene Yurtsev <[email protected]>
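
To make that concrete, here is a rough sketch of the per-source change detection the commit describes. This is illustrative only, not the PR's actual code: doc_key stands in for LangChain's internal document hashing, and only RecordManager.exists is assumed from the real API.

from collections import defaultdict

def sources_needing_reindex(docs, record_manager, doc_key, source_id_key="source"):
    # Illustrative sketch, not the PR's code: group incoming docs by source,
    # then flag a source for deletion/re-indexing only when at least one of
    # its docs is unseen, i.e. its hash key is absent from the record manager.
    by_source = defaultdict(list)
    for doc in docs:
        by_source[doc.metadata[source_id_key]].append(doc)
    changed = []
    for source, source_docs in by_source.items():
        keys = [doc_key(d) for d in source_docs]
        if not all(record_manager.exists(keys)):  # any unseen key => changed
            changed.append(source)
    return changed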
Sheepsta300 pushed a commit to Sheepsta300/langchain that referenced this issue Oct 1, 2024 (…in-ai#25754), with the same description as above.

dosubot bot commented Nov 25, 2024

Hi, @ericvaillancourt. I'm Dosu, and I'm helping the LangChain team manage their backlog. I'm marking this issue as stale.

Issue Summary:

  • You reported a bug in the LangChain indexing function related to inconsistent document deletion with the default batch_size of 100.
  • Eyurtsev acknowledged outdated documentation and suggested batching all documents from a source together.
  • Other users, including magaton and federico-pisanu, experienced the same issue, with federico-pisanu proposing a solution via a pull request.
  • You developed a custom indexing system and shared resources on Medium and GitHub to address the problem.

Next Steps:

  • Please let me know if this issue is still relevant to the latest version of the LangChain repository. If so, you can keep the discussion open by commenting here.
  • Otherwise, this issue will be automatically closed in 7 days.

Thank you for your understanding and contribution!

@dosubot bot added the stale label Nov 25, 2024
@dosubot bot closed this as not planned (won't fix, can't repro, duplicate, stale) Dec 2, 2024
@dosubot bot removed the stale label Dec 2, 2024
@duartecaldascardoso

duartecaldascardoso commented Dec 17, 2024

Hi!
Is there any reason the PR from @federico-pisanu was reverted rather than fixed? This problem still occurs, and as far as I understand his solution followed the standards and addressed the underlying problem. It seems it introduced bugs according to #28447, but since the problem is still there, shouldn't this be fixed instead of reverted?
Thanks!
