Skip to content

Conversation

@aperepel
Copy link
Contributor

@aperepel aperepel commented Oct 9, 2024

With a larger batch size for add_documents (e.g. 1000), the embeddings service may silently fail and return nothing for some entries. This lead to the more cryptic error:

[values_dict[key][i] for key in values_dict]
     ~~~~~~~~~~~~~~~~^^^
IndexError: list index out of range

Add additional size validation and a suggestion on how to remedy.

Thank you for opening a Pull Request! Before submitting your PR, there are a few things you can do to make sure it goes smoothly:

  • Make sure to open an issue as a bug/issue before writing your code! That way we can discuss the change, evaluate designs, and agree on the general idea
  • Ensure the tests and linter pass
  • Code coverage does not decrease (if any source code was changed)
  • Appropriate docs were updated (if necessary)

Fixes #<issue_number_goes_here> 🦕

With a larger batch size for `add_documents` (e.g. 1000), the embeddings service may silently fail and return nothing for some entries. This lead to the more cryptic error:
```
[values_dict[key][i] for key in values_dict]
     ~~~~~~~~~~~~~~~~^^^
IndexError: list index out of range
```
Add additional size validation and a suggestion on how to remedy.
@aperepel aperepel requested review from a team as code owners October 9, 2024 20:12
@product-auto-label product-auto-label bot added the api: spanner Issues related to the googleapis/langchain-google-spanner-python API. label Oct 9, 2024
@averikitsch
Copy link
Collaborator

/gcbrun

@averikitsch averikitsch changed the title Validate the embeddings size to catch silent embeddings batch failures fix: Validate the embeddings size to catch silent embeddings batch failures Oct 10, 2024
@gauravpurohit06
Copy link
Contributor

@aperepel, thank you raising the pull request. Question: In your PR you are adding a validation check, which is required for sure. On a separate note: Have you tried diving it into batches and then generating the embeddings if it's failing for large batches?

Copy link
Contributor

@gauravpurohit06 gauravpurohit06 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes look good to me.

@aperepel
Copy link
Contributor Author

@aperepel, thank you raising the pull request. Question: In your PR you are adding a validation check, which is required for sure. On a separate note: Have you tried diving it into batches and then generating the embeddings if it's failing for large batches?

Yes, that's what we've been doing. On a larger batch (usually around 1000 items), the embeddings service returned ~20-25% loss. The 100 size batch works perfectly stable, however.

We do use retry logic everywhere we run micro-batches, but if you were asking about pro-actively splitting it after a failure - that's probably too many LLM calls, as we must retry a whole split.

@averikitsch
Copy link
Collaborator

Tests are failing due to code coverage. Can you add a tests for your changes?

@kurtisvg
Copy link
Collaborator

kurtisvg commented Nov 1, 2024

@gauravpurohit06 @vishwarajanand FYI

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

api: spanner Issues related to the googleapis/langchain-google-spanner-python API.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants