Skip to content

Conversation

@jteijema
Copy link

This converts any sequence (including pandas Series, if present) to a list. See #3371

This converts any sequence (including pandas Series, if present) to a list
@tomaarsen
Copy link
Member

Hello!

I appreciate you opening this PR, but I'm wary to accept this. I believe this form of blanket conversion to a list is very risky, as getting an error (e.g. from not having a list) is often helpful to avoid people accidentally entering the incorrect data. In this case, with a pd.Series, it's simply a case of the user having to convert it to a list (df["text"].to_list()).

For example, with this addition, users might pass e.g. this:

embedding = model.encode({
    "query1": "What is the capital of France?",
    "query2": "Where is the Eiffel Tower?",
    "document1": "Paris is the capital of France.",
})
print(embedding)
print(embedding.shape)

and it'll return totally reasonable seeming embeddings:

[[-0.0364329   0.03470342 -0.04595987 ...  0.04639664 -0.00478114
   0.09428656]
 [-0.03682578  0.02331674 -0.06834685 ...  0.01864047 -0.03758197
   0.02956604]
 [-0.01828033  0.07142022 -0.06651124 ...  0.03860461 -0.00139438
   0.03592813]]
[[-0.01771457  0.01135074 -0.02019599 ...  0.00480803 -0.05520521
   0.05263555]
 [ 0.01172293  0.03644514 -0.01706037 ...  0.01640886 -0.06969328
   0.05715602]
 [-0.10640715  0.04474735  0.01445298 ...  0.06645758  0.03369779
   0.00820341]]
(3, 384)

But the similarities are totally off, because we're actually embedding query1, query2, and document1:

similarity = model.similarity(embedding[:2], embedding[-1])
print(similarity)
# tensor([[0.4347],
#         [0.3837]])

Instead of the actual texts, which would instead give:

# tensor([[0.8561],
#         [0.3081]])

With other words, I would rather not do this.

  • Tom Aarsen

@jteijema
Copy link
Author

jteijema commented Jun 15, 2025

Hi @tomaarsen, thank you for the well put response. I agree fully with your assessment. How would you feel about making the returned error in #3371 more informative? This is the issue I was facing, and KeyError: np.int64(1) was not informative enough for me to asses directly what the issue at hand was. How would you feel about type checking the input?

@tomaarsen
Copy link
Member

Hello!

Apologies for the delay! What kind of checking would you propose? Checking if the input is a pandas Series would require adding extra imports that I'd like to avoid, and checking that the inputs are not e.g. a list, tuple, etc. also feels risky because there might be users passing custom objects that do turn into valid inputs when converting to a list.
Especially as I'm interested in improving multimodality support in the future. I think perhaps my preference is to not make any changes.

  • Tom Aarsen

@jteijema
Copy link
Author

jteijema commented Jul 3, 2025

Thank you for the response Tom, I can see how type checking is not the solution. An even less restrictive option would be to capture the KeyError and add a little more information to the error, explaining that the provided object has a non-consecutive index perhaps? This would follow the function, as this is the reason for the error, and would be a little more informative than KeyError: np.int64(1).

The object given to the encode function could either be intended for iteration or for direct indexing. I initially assumed it would be iterated over, which led to my confusion. Making it clear that the error arises from trying to access a non-consecutive index would help clarify the source of the issue. This additional context in the error message would be beneficial in understanding what went wrong.

Thank you for your response, I love the package and appreciate your work. If your preference goes to not making any changes then I fully understand.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants