Ensure encode() handles Series and other sequence types #3372

jteijema · 2025-05-22T11:59:46Z

This converts any sequence (including pandas Series, if present) to a list. See #3371

This converts any sequence (including pandas Series, if present) to a list

tomaarsen · 2025-05-23T10:33:50Z

Hello!

I appreciate you opening this PR, but I'm wary to accept this. I believe this form of blanket conversion to a list is very risky, as getting an error (e.g. from not having a list) is often helpful to avoid people accidentally entering the incorrect data. In this case, with a pd.Series, it's simply a case of the user having to convert it to a list (df["text"].to_list()).

For example, with this addition, users might pass e.g. this:

embedding = model.encode({
    "query1": "What is the capital of France?",
    "query2": "Where is the Eiffel Tower?",
    "document1": "Paris is the capital of France.",
})
print(embedding)
print(embedding.shape)

and it'll return totally reasonable seeming embeddings:

[[-0.0364329   0.03470342 -0.04595987 ...  0.04639664 -0.00478114
   0.09428656]
 [-0.03682578  0.02331674 -0.06834685 ...  0.01864047 -0.03758197
   0.02956604]
 [-0.01828033  0.07142022 -0.06651124 ...  0.03860461 -0.00139438
   0.03592813]]
[[-0.01771457  0.01135074 -0.02019599 ...  0.00480803 -0.05520521
   0.05263555]
 [ 0.01172293  0.03644514 -0.01706037 ...  0.01640886 -0.06969328
   0.05715602]
 [-0.10640715  0.04474735  0.01445298 ...  0.06645758  0.03369779
   0.00820341]]
(3, 384)

But the similarities are totally off, because we're actually embedding query1, query2, and document1:

similarity = model.similarity(embedding[:2], embedding[-1])
print(similarity)
# tensor([[0.4347],
#         [0.3837]])

Instead of the actual texts, which would instead give:

# tensor([[0.8561],
#         [0.3081]])

With other words, I would rather not do this.

Tom Aarsen

jteijema · 2025-06-15T20:13:53Z

Hi @tomaarsen, thank you for the well put response. I agree fully with your assessment. How would you feel about making the returned error in #3371 more informative? This is the issue I was facing, and KeyError: np.int64(1) was not informative enough for me to asses directly what the issue at hand was. How would you feel about type checking the input?

tomaarsen · 2025-06-26T09:01:28Z

Hello!

Apologies for the delay! What kind of checking would you propose? Checking if the input is a pandas Series would require adding extra imports that I'd like to avoid, and checking that the inputs are not e.g. a list, tuple, etc. also feels risky because there might be users passing custom objects that do turn into valid inputs when converting to a list.
Especially as I'm interested in improving multimodality support in the future. I think perhaps my preference is to not make any changes.

Tom Aarsen

jteijema · 2025-07-03T12:56:51Z

Thank you for the response Tom, I can see how type checking is not the solution. An even less restrictive option would be to capture the KeyError and add a little more information to the error, explaining that the provided object has a non-consecutive index perhaps? This would follow the function, as this is the reason for the error, and would be a little more informative than KeyError: np.int64(1).

The object given to the encode function could either be intended for iteration or for direct indexing. I initially assumed it would be iterated over, which led to my confusion. Making it clear that the error arises from trying to access a non-consecutive index would help clarify the source of the issue. This additional context in the error message would be beneficial in understanding what went wrong.

Thank you for your response, I love the package and appreciate your work. If your preference goes to not making any changes then I fully understand.

Ensure encode() handles Series and other sequence types

032a5a8

This converts any sequence (including pandas Series, if present) to a list

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Ensure encode() handles Series and other sequence types #3372

Ensure encode() handles Series and other sequence types #3372

Uh oh!

jteijema commented May 22, 2025

Uh oh!

tomaarsen commented May 23, 2025

Uh oh!

jteijema commented Jun 15, 2025 •

edited

Loading

Uh oh!

tomaarsen commented Jun 26, 2025

Uh oh!

jteijema commented Jul 3, 2025 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Ensure encode() handles Series and other sequence types #3372

Are you sure you want to change the base?

Ensure encode() handles Series and other sequence types #3372

Uh oh!

Conversation

jteijema commented May 22, 2025

Uh oh!

tomaarsen commented May 23, 2025

Uh oh!

jteijema commented Jun 15, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

tomaarsen commented Jun 26, 2025

Uh oh!

jteijema commented Jul 3, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

jteijema commented Jun 15, 2025 •

edited

Loading

jteijema commented Jul 3, 2025 •

edited

Loading