
Bug in Indexing Function Causes Inconsistent Document Deletion #22135

Closed · 5 tasks done
ericvaillancourt opened this issue May 24, 2024 · 7 comments
Labels: 03 enhancement, documentation
@ericvaillancourt

Checked other resources

  • I added a very descriptive title to this issue.
  • I searched the LangChain documentation with the integrated search.
  • I used the GitHub search to find a similar question and didn't find it.
  • I am sure that this is a bug in LangChain rather than my code.
  • The bug is not resolved by updating to the latest stable version of LangChain (or the specific integration package).

Example Code

Here is an example that demonstrates the problem:

If I change the batch_size in api.py to a value that is larger than the number of elements in my list, everything works fine. By default, the batch_size is set to 100, and only the first 100 elements are handled correctly.

from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_core.documents import Document
from langchain.indexes import SQLRecordManager, index

embeddings = OpenAIEmbeddings()

documents = []


# Build 200 documents that all share the same source, so they span
# more than one default batch (batch_size=100).
for i in range(1, 201):
    page_content = f"data {i}"
    metadata = {"source": "test.txt"}
    document = Document(page_content=page_content, metadata=metadata)
    documents.append(document)


collection_name = "test_index"

vectorstore = Chroma(
    collection_name=collection_name,
    persist_directory="emb",
    embedding_function=embeddings,
)
namespace = f"chroma/{collection_name}"
record_manager = SQLRecordManager(
    namespace, db_url="sqlite:///record_manager_cache.sql"
)

record_manager.create_schema()

idx = index(
    documents,
    record_manager,
    vectorstore,
    cleanup="incremental",
    source_id_key="source",
)
# First run, expected: {'num_added': 200, 'num_updated': 0, 'num_skipped': 0, 'num_deleted': 0}
# and that's what we get.
print(idx)
idx = index(
    documents,
    record_manager,
    vectorstore,
    cleanup="incremental",
    source_id_key="source",
)
# Second run, expected: {'num_added': 0, 'num_updated': 0, 'num_skipped': 200, 'num_deleted': 0}
# but we get:           {'num_added': 100, 'num_updated': 0, 'num_skipped': 100, 'num_deleted': 100}
print(idx)

Error Message and Stack Trace (if applicable)

No response

Description

I've encountered a bug in the index function of LangChain when processing documents. The function behaves inconsistently across runs, leading to unexpected deletions. When running it twice in a row without any changes to the data, the first run indexes all documents as expected. On the second run, however, only the first batch of documents (batch_size=100) is correctly identified as already indexed and skipped, while the remaining documents are mistakenly deleted and re-indexed.

System Info

langchain==0.1.20
langchain-community==0.0.38
langchain-core==0.1.52
langchain-openai==0.1.7
langchain-postgres==0.0.4
langchain-text-splitters==0.0.2
langgraph==0.0.32
langsmith==0.1.59

Python 3.11.7

Platform: Windows 11

@dosubot bot added the Ɑ: vector store, 🔌: chroma, 🔌: openai, and 🤖:bug labels May 24, 2024
@eyurtsev
Collaborator

Apologies, the documentation is out of date on this. For the indexing function to completely avoid redundant work, all the docs corresponding to a particular source need to be in the same batch. I'll try to update the documentation.

If that criterion isn't met, it will end up doing some redundant work, but it should still converge to the correct end state. The indexing logic optimizes for minimizing the amount of time that duplicated content exists in the index.
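
Based on that explanation, a minimal workaround sketch: make the batch large enough that every document sharing a source lands in the same batch. This assumes index accepts a batch_size keyword (it defaults to 100 in the versions listed in this issue):

# Workaround sketch: keep all docs from one source in a single batch.
idx = index(
    documents,
    record_manager,
    vectorstore,
    cleanup="incremental",
    source_id_key="source",
    batch_size=len(documents),  # assumes the `batch_size` kwarg; default is 100
)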

@eyurtsev eyurtsev added needs documentation PR needs to be updated with documentation documentation Improvements or additions to documentation 03 enhancement Enhancement of existing functionality and removed needs documentation PR needs to be updated with documentation Ɑ: vector store Related to vector store module 🤖:bug Related to a bug, vulnerability, unexpected error with an existing feature 🔌: chroma Primarily related to ChromaDB integrations 🔌: openai Primarily related to OpenAI integrations labels May 24, 2024
@eyurtsev eyurtsev self-assigned this May 24, 2024
@ericvaillancourt
Author

OK, but the batch size is set to 100. What if one source has more than 100 docs? The end result is still OK, but does it re-calculate the embeddings?

@magaton

magaton commented May 29, 2024

Hello, I am also hitting this problem.
If I do not increase batch_size in the indexer to be greater than the number of documents, I get deletes and adds even though I did not change anything in the directory I am loading.
If batch_size is greater than the number of loaded documents, then the skip happens and everything is fine.

So something does not seem right here.

@ericvaillancourt
Author

ericvaillancourt commented Jun 5, 2024

I created my own indexing system to solve the problem. It is a bit more sophisticated because it is meant to be used with a multi-vector retriever. I have written an article on Medium.

You can find the code on my GitHub.

And watch the video on YouTube.

@federico-pisanu
Contributor

Hi! I also ran into this problem and worked on a solution in this PR.
@eyurtsev, I hope this can be helpful.

eyurtsev added a commit that referenced this issue Sep 30, 2024
- **Description:** prevent the index function from re-indexing an entire
source document even if nothing has changed.
- **Issue:** #22135

I worked on a solution to this issue that is a compromise between being
cheap and being fast.
In the previous code, when the number of docs from a certain source exceeds
batch_size, almost the entire source is deleted (all documents from that
source except those in the first batch).
My solution deletes documents from the vector store and record manager only
if at least one document has changed for that source.

Hope this can help!

---------

Co-authored-by: Eugene Yurtsev <[email protected]>
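
To make that concrete, here is a rough sketch of the per-source change detection the commit describes. This is illustrative only, not the PR's actual code: doc_key stands in for LangChain's internal document hashing, and only RecordManager.exists is assumed from the real API.

from collections import defaultdict

def sources_needing_reindex(docs, record_manager, doc_key, source_id_key="source"):
    # Illustrative sketch, not the PR's code: group incoming docs by source,
    # then flag a source for deletion/re-indexing only when at least one of
    # its docs is unseen, i.e. its hash key is absent from the record manager.
    by_source = defaultdict(list)
    for doc in docs:
        by_source[doc.metadata[source_id_key]].append(doc)
    changed = []
    for source, source_docs in by_source.items():
        keys = [doc_key(d) for d in source_docs]
        if not all(record_manager.exists(keys)):  # any unseen key => changed
            changed.append(source)
    return changed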
Sheepsta300 pushed a commit to Sheepsta300/langchain that referenced this issue Oct 1, 2024 (…in-ai#25754), with the same description as above.

dosubot bot commented Nov 25, 2024

Hi, @ericvaillancourt. I'm Dosu, and I'm helping the LangChain team manage their backlog. I'm marking this issue as stale.

Issue Summary:

  • You reported a bug in the LangChain indexing function related to inconsistent document deletion with the default batch_size of 100.
  • Eyurtsev acknowledged outdated documentation and suggested batching all documents from a source together.
  • Other users, including magaton and federico-pisanu, experienced the same issue, with federico-pisanu proposing a solution via a pull request.
  • You developed a custom indexing system and shared resources on Medium and GitHub to address the problem.

Next Steps:

  • Please let me know if this issue is still relevant to the latest version of the LangChain repository. If so, you can keep the discussion open by commenting here.
  • Otherwise, this issue will be automatically closed in 7 days.

Thank you for your understanding and contribution!

@dosubot bot added the stale label Nov 25, 2024
@dosubot bot closed this as not planned (won't fix, can't repro, duplicate, stale) Dec 2, 2024
@dosubot bot removed the stale label Dec 2, 2024
@duartecaldascardoso

duartecaldascardoso commented Dec 17, 2024

Hi!
Is there any reason the PR from @federico-pisanu was reverted rather than fixed? This problem still occurs, and as far as I understand his solution followed the standards and addressed the underlying problem. It seems it introduced bugs according to #28447, but since the problem is still there, shouldn't this be fixed instead of reverted?
Thanks!
