core[patch]: throw exception indexing code if deletion fails in vectorstore #28103
bf6fea9 to ab2c261
The changes made are valid.
delete_ok = vector_store.delete(ids)
if delete_ok is not None and delete_ok is False:
    msg = "delete failed"
    raise Exception(msg)
We'll need to do the following:
- Improve the error message.
- Use a more specific exception (you can subclass from LangChainException).
- The if/elif chain is unterminated; it needs an else branch that raises a not-implemented error.
It's not clear that desired behavior for most users is to raise an exception by default. I'd probably expect that most of the time, what users would want is an error logged via the python logger. In some cases, users may want to configure this to be stricter and raise an exception.
What exception did you bump into?
Very frequently things fail because:
- Server is just down (OK to raise for this)
- Transient issue (e.g., network connectivity dropped or the client issues too many requests) -- these types of errors need to be retried at the vectorstore/document indexer implementation level
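The three requested changes could look roughly like the sketch below. It is not the code that was merged: `IndexingException`, `delete_or_raise`, and both stub classes are hypothetical names, `LangChainException` is redefined locally so the snippet runs without LangChain installed, and the `num_failed` key on the document-index response is an assumed shape.

```python
from typing import Dict, List, Optional


class LangChainException(Exception):
    """Local stand-in for the LangChain base exception (assumption)."""


class IndexingException(LangChainException):
    """Hypothetical name for a deletion failure raised by the indexing code."""


class StubVectorStore:
    """Stand-in store whose delete() returns True, False, or None."""

    def __init__(self, result: Optional[bool]) -> None:
        self.result = result

    def delete(self, ids: List[str]) -> Optional[bool]:
        return self.result


class StubDocumentIndex:
    """Stand-in index whose delete() reports a failure count (assumed shape)."""

    def __init__(self, num_failed: int) -> None:
        self.num_failed = num_failed

    def delete(self, ids: List[str]) -> Dict[str, int]:
        return {"num_failed": self.num_failed}


def delete_or_raise(destination: object, ids: List[str]) -> None:
    """Delete ids and raise a specific exception if the store reports failure."""
    if isinstance(destination, StubVectorStore):
        delete_ok = destination.delete(ids)
        # None means "no status reported"; only an explicit False is a failure.
        if delete_ok is False:
            msg = f"Vector store failed to delete {len(ids)} document(s)."
            raise IndexingException(msg)
    elif isinstance(destination, StubDocumentIndex):
        response = destination.delete(ids)
        num_failed = response.get("num_failed", 0)
        if num_failed > 0:
            msg = f"Document index failed to delete {num_failed} document(s)."
            raise IndexingException(msg)
    else:
        # Terminates the if/elif chain for unsupported destination types.
        msg = f"Unsupported destination type: {type(destination).__name__}"
        raise NotImplementedError(msg)
```

Subclassing a library-wide base exception lets callers catch either the specific failure or any LangChain error, and the message names the component that failed instead of a bare "delete failed".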
Thank you for your feedback.
> I'd probably expect that most of the time, what users would want is an error logged via the python logger.
I see what you mean, but I still believe we should throw an exception here.
Suppose a user updates documents and the indexer library fails to delete records in the vector database, but proceeds to the next step without stopping and deletes the corresponding records in the record manager. This can happen with the current logic. If it does, the records in the vector database will never be cleaned up, right? Users will then see stale data. I think we should avoid this situation. If we throw here, the next batch execution will fix the situation once the vector database has recovered.
Actually, this might have already happened in our environment: the number of records in Qdrant was slightly higher than the number of records in the record manager's table, for some reason. I can't think of any other explanation for such an inconsistency. Of course, we always update records in the vector database and the record manager only through the indexer library. If you have any other possible reasons in mind, please let me know.
To fix this, we had to manually delete dangling records in the vector database.
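The failure mode described above can be reproduced with a small simulation. This is a sketch under stated assumptions, not the indexer's real code: `FlakyStore`, `cleanup`, and `simulate` are hypothetical names, and a plain set stands in for the record manager.

```python
from typing import Dict, List, Set, Tuple


class FlakyStore:
    """Hypothetical vector store: delete() returns False while it is down."""

    def __init__(self) -> None:
        self.docs: Dict[str, str] = {}
        self.down = False

    def delete(self, ids: List[str]) -> bool:
        if self.down:
            return False
        for doc_id in ids:
            self.docs.pop(doc_id, None)
        return True


def cleanup(
    store: FlakyStore, record_keys: Set[str], stale: List[str], strict: bool
) -> None:
    """Delete stale ids from the store, then from the record-manager stand-in."""
    delete_ok = store.delete(stale)
    if strict and delete_ok is False:
        raise RuntimeError("vector store deletion failed")
    # Without the strict check, this runs even though the store kept the data.
    record_keys.difference_update(stale)


def simulate(strict: bool) -> Tuple[Set[str], Set[str]]:
    """Run one batch during an outage and one after recovery; return final state."""
    store = FlakyStore()
    store.docs["old"] = "stale text"
    record_keys = {"old"}
    for down in (True, False):
        store.down = down
        # Each batch only knows about ids the record manager still tracks.
        stale = sorted(record_keys)
        try:
            cleanup(store, record_keys, stale, strict)
        except RuntimeError:
            pass  # the batch aborts and the record manager is left untouched
    return set(store.docs), record_keys
```

With `strict=False` the document survives in the store after the record manager has forgotten it, so no later batch can find it. With `strict=True` the first batch aborts and the second one removes the document from both places.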
> Server is just down (OK to raise for this)

This is the situation I have in mind. I am not sure, but I guess cloud services for vector databases are less stable than traditional databases like RDS.
OK, let's make it stricter for now!
If we were creating a more relaxed mode, we'd need to skip cleaning of the record manager if a vectorstore deletion fails (to avoid the situation you're describing). The information could still be surfaced via logger.error and potentially an additional field in IndexingResult.
Feel free to add if you think useful, but I'm OK with the stricter solution for now.
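For illustration, a relaxed mode along those lines might look like the sketch below. Everything here is an assumption rather than an existing API: `relaxed_cleanup` and `StubStore` are hypothetical, a set stands in for the record manager, and `num_delete_failed` is an invented name for the extra result field.

```python
import logging
from typing import Dict, List, Set

logger = logging.getLogger(__name__)


class StubStore:
    """Hypothetical store whose delete() succeeds or fails on demand."""

    def __init__(self, fail: bool) -> None:
        self.fail = fail

    def delete(self, ids: List[str]) -> bool:
        return not self.fail


def relaxed_cleanup(
    store: StubStore, record_keys: Set[str], stale_ids: List[str]
) -> Dict[str, int]:
    """Log a failed deletion and keep the record-manager entries for a retry."""
    result = {"num_deleted": 0, "num_delete_failed": 0}
    if not stale_ids:
        return result
    if store.delete(stale_ids) is False:
        logger.error("Failed to delete %d document(s)", len(stale_ids))
        result["num_delete_failed"] = len(stale_ids)
        # Skip record-manager cleanup so the next run can retry these ids.
        return result
    record_keys.difference_update(stale_ids)
    result["num_deleted"] = len(stale_ids)
    return result
```

Because the record-manager entries are kept whenever the store reports failure, the two sides stay consistent and a later run can pick up the same ids again.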
Happy to merge if we can make the changes outlined above! A unit test will be required as well for this to be merged.
@eyurtsev I submitted two additional commits. Could you check it?
8bb0adf to 6697096
Here's the abstraction for vectorstores: https://github.com/langchain-ai/langchain/blob/master/libs/core/langchain_core/vectorstores/base.py#L124-L124
Regarding what vectorstore implementations should return: if a record cannot be found but the operation is executed successfully by the server, the value from delete should be True (not False). If users bump into issues due to this PR, we should double check that the vectorstore implemented the correct semantics.
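A minimal sketch of those semantics, assuming a hypothetical in-memory store (`InMemoryStore` and its `available` flag are illustrative and not part of LangChain):

```python
from typing import Dict, List


class InMemoryStore:
    """Hypothetical store illustrating the intended delete() return values."""

    def __init__(self) -> None:
        self.docs: Dict[str, str] = {}
        self.available = True  # set to False to simulate a backend failure

    def delete(self, ids: List[str]) -> bool:
        if not self.available:
            # The operation itself could not be executed: report failure.
            return False
        for doc_id in ids:
            # A missing id is not an error; the request still succeeded.
            self.docs.pop(doc_id, None)
        return True
```

Under these semantics, an indexer that raises on `False` only fails when the backend could not carry out the request, not when an id was already gone.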
@KeiichiHirobe thank you for the contribution! If you're using the indexing code, and are interested in contributing more... the one obvious thing that's missing is a delete API! A user may just want to delete data from the index (e.g., data that hasn't been updated in a while, or delete by source id etc.)
Interesting. Actually, we implemented it partially in our code base like this:
I think I can work on this in a few months.
The delete methods in the VectorStore and DocumentIndex interfaces return a status indicating the result. Therefore, we can assume that their implementations don't throw exceptions but instead return a result indicating whether the delete operations have failed. The current implementation doesn't check the returned value, so I modified it to throw an exception when the operation fails.