Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

core[patch]: throw exception indexing code if deletion fails in vectorstore #28103

Merged
merged 5 commits into from
Dec 13, 2024

Conversation

KeiichiHirobe
Copy link
Contributor

The delete methods in the VectorStore and DocumentIndex interfaces return a status indicating the result. Therefore, we can assume that their implementations don't throw exceptions but instead return a result indicating whether the delete operations have failed. The current implementation doesn't check the returned value, so I modified it to throw an exception when the operation fails.

@dosubot dosubot bot added the size:M This PR changes 30-99 lines, ignoring generated files. label Nov 14, 2024
Copy link

vercel bot commented Nov 14, 2024

The latest updates on your projects. Learn more about Vercel for Git ↗︎

1 Skipped Deployment
Name Status Preview Comments Updated (UTC)
langchain ⬜️ Ignored (Inspect) Visit Preview Dec 13, 2024 8:46pm

@Vruddhi18
Copy link
Contributor

The changes made are valid.

delete_ok = vector_store.delete(ids)
if delete_ok is not None and delete_ok is False:
msg = "delete failed"
raise Exception(msg)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We'll need to do the following:

  1. Improve the error message
  2. Use a more specific exception (you can subclass from LangChainException).
  3. If elif is untermedinated and needs to have a branch for the else that will raise a not implemented error.

It's not clear that desired behavior for most users is to raise an exception by default. I'd probably expect that most of the time, what users would want is an error logged via the python logger. In some cases, users may want to configure this to be stricter and raise an exception.


What exception did you bump into?

Very frequently things fail because:

  1. Serer is just down (OK to raise for this)
  2. Transient issue (e.g., network connectivity dropped or client issues too many requests) -- these types of errors need to be retried at the implementation vectorstore/document indexer implementation level

Copy link
Contributor Author

@KeiichiHirobe KeiichiHirobe Dec 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for your feedback.

I'd probably expect that most of the time, what users would want is an error logged via the python logger.

I know what you mean, but still, I believe we should throw an exception here.
Let's say users update documents, and the indexer library fails to delete records in a vector database and proceed to the next step without stopping, then deletes the records in the record manager. This may happen in the current logic. If it happens, the records in the vector database won't be cleaned up forever, right? Users will see stale data due to that. I think we should avoid this situation. If we throw it here, the next batch execution will fix the situation if the vector database has been recovered.
Actually, this might have already happend in our environment because the number of records in Qdrant was a little bit more than the number of records in the table of record manager for some reasons. I can't think of any other reason why such an inconsistency would occur. Of course, we always update records in the vector database and the record manager only through the indexer library. If you have any other possible reasons in mind, please let me know.
To fix this, we had to manually delete dangling records in the vector database.

Serer is just down (OK to raise for this)

I suppose this situation. I am not sure, but I guess cloud services for vector databases are less stable than traditional databases like RDS.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, let's make it stricter for now!

If we were creating a more relaxed mode, we'd need to skip cleaning of the record manager if a vectorstore deletion fails (to avoid the situation you're describing). The information could still be surfaced via logger.error and potentially an additional field in IndexingResult.

Feel free to add if you think useful, but I'm OK with the stricter solution for now.

@eyurtsev eyurtsev changed the title check if the deletion succeeded in indexing langchain[minor]: check if the deletion succeeded in indexing Dec 11, 2024
@eyurtsev
Copy link
Collaborator

eyurtsev commented Dec 11, 2024

Happy to merge if we can make the changes outlined above!

A unit test will be required as well for this to be merged.

@dosubot dosubot bot added size:L This PR changes 100-499 lines, ignoring generated files. and removed size:M This PR changes 30-99 lines, ignoring generated files. labels Dec 13, 2024
@KeiichiHirobe
Copy link
Contributor Author

@eyurtsev I submitted two additional commits. Could you check it?

@KeiichiHirobe KeiichiHirobe force-pushed the check-delete-status branch 2 times, most recently from 8bb0adf to 6697096 Compare December 13, 2024 16:28
@dosubot dosubot bot added the lgtm PR looks good. Use to confirm that a PR is ready for merging. label Dec 13, 2024
@eyurtsev eyurtsev changed the title langchain[minor]: check if the deletion succeeded in indexing core[patch]: throw exception indexing code if deletion fails in vectorstore Dec 13, 2024
@eyurtsev
Copy link
Collaborator

Here's the abstraction for vectorstores: https://github.com/langchain-ai/langchain/blob/master/libs/core/langchain_core/vectorstores/base.py#L124-L124

Vectorstore implementations should return False only when an exception occurs when trying to delete a record (e.g., can't reach database, bad query etc etc.)

If a record cannot be found but the operation is executed successfully by the server, the value from delete should be True (not False).

If users bump into issues due to this PR, we should double check that the vectorstore implemented the correct semantics.

@eyurtsev eyurtsev merged commit 67fd554 into langchain-ai:master Dec 13, 2024
81 checks passed
@eyurtsev
Copy link
Collaborator

@KeiichiHirobe thank you for the contribution!

If you're using the indexing code, and are interested in contributing more... the one obvious thing that's missing is a delete API! A user may just want to delete data from the index (e.g., data that hasn't been updated in a while, or delete by source id etc.)

@KeiichiHirobe
Copy link
Contributor Author

@eyurtsev

A user may just want to delete data from the index (e.g., data that hasn't been updated in a while, or delete by source id etc.)

Interesting. Actually, we implemented it partially in our code base like this:

    delta = timedelta(minutes=15)
    updated_before = datetime.now(timezone.utc) - delta
    num_deleted = 0
    uids_to_delete: list[str] = []
    while uids_to_delete := record_manager.list_keys(
        before=updated_before.timestamp(), limit=1000
    ):
        vector_store.delete(uids_to_delete)
        record_manager.delete_keys(uids_to_delete)
        num_deleted += len(uids_to_delete)
    logger.info(f"Deleted {num_deleted} stale docs updated before {updated_before}")

I think I can work on this in a few months.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
lgtm PR looks good. Use to confirm that a PR is ready for merging. size:L This PR changes 100-499 lines, ignoring generated files.
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

3 participants