Bug in Indexing Function Causes Inconsistent Document Deletion #22135
Comments
Apologies, the documentation is out of date on this. For the indexing function to completely avoid redundant work, all the docs corresponding to a particular source need to be in the same batch; I'll try to update the documentation. If that criterion isn't met, the function will end up doing some redundant work, but it should still reach the correct end state. The indexing logic optimizes for the amount of time that duplicated content exists in the index.
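To make that concrete, here is a minimal sketch of keeping each source's docs within a single batch (the connection string, collection name, and documents below are placeholders; adjust them to your own setup): sort the docs by source and pick a `batch_size` at least as large as the largest per-source group.

```python
from collections import Counter

from langchain.indexes import SQLRecordManager, index
from langchain_core.documents import Document
from langchain_openai import OpenAIEmbeddings
from langchain_postgres import PGVector

CONN = "postgresql+psycopg://user:pass@localhost:5432/db"  # placeholder DSN

vectorstore = PGVector(
    embeddings=OpenAIEmbeddings(),
    collection_name="demo",
    connection=CONN,
)
record_manager = SQLRecordManager("pgvector/demo", db_url=CONN)
record_manager.create_schema()

# 150 chunks sharing one source: with the default batch_size=100 they
# would span two batches, which is exactly the situation described above.
docs = [
    Document(page_content=f"chunk {i}", metadata={"source": "report.pdf"})
    for i in range(150)
]

# Keep docs from the same source adjacent, and size batches so the
# largest per-source group fits into a single batch.
docs.sort(key=lambda d: d.metadata["source"])
max_per_source = max(Counter(d.metadata["source"] for d in docs).values())

result = index(
    docs,
    record_manager,
    vectorstore,
    cleanup="incremental",
    source_id_key="source",
    batch_size=max(100, max_per_source),
)
print(result)  # {'num_added': ..., 'num_updated': ..., 'num_skipped': ..., 'num_deleted': ...}
```

The trade-off is memory and per-call payload size: a larger `batch_size` means more documents embedded and upserted per round trip.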
OK, but the batch size is set to 100 by default. What if one source has more than 100 docs? The end result is still OK, but does it re-calculate the embeddings?
Hello, I am also hitting this problem, so something seems not to be right here.
- **Description:** prevent the index function from re-indexing an entire source document even when nothing has changed.
- **Issue:** #22135

I worked on a solution to this issue that is a compromise between being cheap and being fast. In the previous code, when the number of docs from a certain source is greater than batch_size, almost the entire source is deleted (all documents from that source except for the documents in the first batch). My solution deletes documents from the vector store and record manager only if at least one document has changed for that source. Hope this can help!

Co-authored-by: Eugene Yurtsev <[email protected]>
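A rough sketch of the idea behind that fix (hypothetical helper names and a plain-dict document shape for illustration, not the actual PR code): group the batch by source, and only mark a source for cleanup when at least one of its documents hashes to content that is not already recorded.

```python
import hashlib


def content_hash(doc: dict) -> str:
    """Stable content hash, standing in for the record manager's key."""
    return hashlib.sha256(doc["page_content"].encode()).hexdigest()


def sources_needing_cleanup(batch: list[dict], indexed_hashes: set[str]) -> set[str]:
    """Return the sources in this batch with at least one new or changed document."""
    return {
        doc["source"]
        for doc in batch
        if content_hash(doc) not in indexed_hashes
    }


# Usage: a fully unchanged batch triggers no deletions, even when the
# source's documents span several batches.
already_indexed = {content_hash({"page_content": f"chunk {i}"}) for i in range(150)}
second_run_batch = [
    {"page_content": f"chunk {i}", "source": "report.pdf"} for i in range(100)
]
assert sources_needing_cleanup(second_run_batch, already_indexed) == set()
```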
Hi, @ericvaillancourt. I'm Dosu, and I'm helping the LangChain team manage their backlog. I'm marking this issue as stale.

Issue Summary:
Next Steps:
Thank you for your understanding and contribution!
Hi!
Checked other resources
Example Code
Here is an example that demonstrates the problem:
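The original snippet is not preserved in this copy of the issue. A self-contained sketch that reproduces the described behavior might look like the following; it uses fake embeddings and an in-memory vector store, which assumes a langchain-core version that ships InMemoryVectorStore and DeterministicFakeEmbedding.

```python
from langchain.indexes import SQLRecordManager, index
from langchain_core.documents import Document
from langchain_core.embeddings import DeterministicFakeEmbedding
from langchain_core.vectorstores import InMemoryVectorStore

vectorstore = InMemoryVectorStore(DeterministicFakeEmbedding(size=32))
record_manager = SQLRecordManager("test/demo", db_url="sqlite:///record_manager_cache.sql")
record_manager.create_schema()

# More docs from one source than the default batch_size of 100.
docs = [
    Document(page_content=f"chunk {i}", metadata={"source": "report.pdf"})
    for i in range(150)
]

for run in (1, 2):
    result = index(
        docs,
        record_manager,
        vectorstore,
        cleanup="incremental",
        source_id_key="source",
    )
    print(run, result)

# Expected on run 2: num_skipped=150, num_deleted=0.
# Reported behavior: the 50 docs beyond the first batch are deleted and re-added.
```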
If I change the `batch_size` in `api.py` to a value that is larger than the number of elements in my list, everything works fine. By default, the `batch_size` is set to 100, and only the first 100 elements are handled correctly.

Error Message and Stack Trace (if applicable)
No response
Description
I've encountered a bug in the index function of LangChain when processing documents. The function behaves inconsistently across multiple runs, leading to unexpected deletions of documents. Specifically, when running the function twice in a row without any changes to the data, the first run indexes all documents as expected. On the second run, however, only the first batch of documents (batch_size=100) is correctly identified as already indexed and skipped, while the remaining documents are mistakenly deleted and re-indexed.
System Info
langchain==0.1.20
langchain-community==0.0.38
langchain-core==0.1.52
langchain-openai==0.1.7
langchain-postgres==0.0.4
langchain-text-splitters==0.0.2
langgraph==0.0.32
langsmith==0.1.59
Python 3.11.7
Platform: Windows 11