
fix for empty string input to openai/text-embedding-3-large #2634

Closed
ayush1298 wants to merge 4 commits into embeddings-benchmark:main from ayush1298:fix_openai_models

Conversation

@ayush1298
Collaborator

Fix for #1650.
I don't have an OpenAI API key to test whether it works correctly. @KennethEnevoldsen, can you please test it when you get time?

Code Quality

  • Code Formatted: Format the code using make lint to maintain consistent style.

Documentation

  • Updated Documentation: Add or update documentation to reflect the changes introduced in this PR.

Testing

  • New Tests Added: Write tests to cover new functionality. Validate with make test-with-coverage.
  • Tests Passed: Run tests locally using make test or make test-with-coverage to ensure no existing functionality is broken.

@ayush1298
Collaborator Author

@KennethEnevoldsen I have added a fix based on what you explained. Also, since testing it means calling model.encode, an OpenAI API key is required.
One more thing: I tested this using an OpenAI API key (borrowed from a friend), but I am still getting an error for empty strings. When I checked, the error was occurring even before the encode method was called, and I was not able to find its cause. Can you please have a look at it?

Contributor

@KennethEnevoldsen KennethEnevoldsen left a comment


I added some suggestions. Do give it another try, and send me a debug trace if it doesn't work.


return np.array(all_embeddings)
all_embeddings = np.array(all_embeddings)
final_embeddings = np.zeros((len(sentences), self._embed_dim), dtype=np.float32)
Contributor


This already fills in the rows for the empty texts (they stay zero).

You can then simply fill in the text embeddings using:

mask = [i for i, t in enumerate(text) if t.strip()]
final_emb[mask, :] = text_emb

Here is a small sample to show the idea:

import numpy as np

matrix1 = np.zeros((5, 4))
matrix2 = np.random.rand(3, 4)
matrix1[[0, 2, 4],:] = matrix2
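
A slightly fuller sketch of the same idea, applied to a list of texts rather than hard-coded row indices. The helper name `scatter_embeddings` and the dummy `text_emb` values are hypothetical and only illustrate the indexing; this is not the code from the PR.

```python
import numpy as np


def scatter_embeddings(texts, text_emb, embed_dim):
    """Return a (len(texts), embed_dim) array; rows for empty texts stay zero.

    Hypothetical helper for illustration: `text_emb` is assumed to hold one
    row per non-empty text, in the original order.
    """
    # Positions of texts that contain something other than whitespace.
    keep = [i for i, t in enumerate(texts) if t.strip()]
    if len(keep) != len(text_emb):
        raise ValueError("text_emb must have one row per non-empty text")

    final_emb = np.zeros((len(texts), embed_dim), dtype=np.float32)
    if keep:
        # Integer-list indexing writes each embedding row to its original slot.
        final_emb[keep, :] = text_emb
    return final_emb


texts = ["", "foo", "  ", "bar"]
text_emb = np.ones((2, 4), dtype=np.float32)  # dummy embeddings for "foo", "bar"
final_emb = scatter_embeddings(texts, text_emb, embed_dim=4)
print(final_emb.shape)
```

The length check is there because a mismatch between the mask and the embedding rows would otherwise surface as a less readable broadcasting error.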

Collaborator Author

I didn't get this one. The current implementation first creates the whole array of zeros and then sets the text embeddings only at positions where non-empty text is present. I think you are suggesting the same thing.

@ayush1298
Collaborator Author

from __future__ import annotations
import os
import numpy as np
from mteb.models.openai_models import OpenAIWrapper

os.environ["OPENAI_API_KEY"] = "API-KEY"
wrapper = OpenAIWrapper(
    model_name="text-embedding-3-small",
    max_tokens=8191,
    embed_dim=1536,
)

# 3) call encode on empty, non‑empty, mixed
cases = [
    # ["hello world"],   
    ["", "foo", ""], 
]
print("Testing OpenAIWrapper with different cases:")

for case in cases:
    print(f"\nTesting case: {case!r}")
    try:
        out = wrapper.encode(case)
        print("output:", out)
        print("Shape:", out.shape)

    except Exception as e:
        print(f"Error: {e}")

@KennethEnevoldsen Getting the following output when running the above testing code:

Testing OpenAIWrapper with different cases:

Testing case: ['hello world']
output: [[-0.00676333 -0.03919632  0.03417581 ... -0.01964353 -0.01937133
  -0.02247135]]
Shape: (1, 1536)

Testing case: ['', 'foo', '']
Error: Error code: 400 - {'error': {'message': "'$.input' is invalid. Please check the API reference: https://platform.openai.com/docs/api-reference.", 'type': 'invalid_request_error', 'param': None, 'code': None}}

So the code runs, but it still cannot handle empty strings. I put some logging statements at the start of the encode method, but when the input contains empty strings, the error is raised even before encode is called.
Can you have a look at it now?
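
For reference, a minimal sketch of the behaviour the test above is checking for, assuming the 400 response comes from empty strings reaching the embeddings request. The names `encode_skipping_empty` and `fake_embed` are hypothetical; `fake_embed` is a local stand-in for the API call, so the sketch runs without an API key and says nothing about where in the real wrapper the request is made.

```python
import numpy as np


def encode_skipping_empty(sentences, embed_fn, embed_dim):
    """Embed only the non-empty sentences; empty ones get an all-zero row.

    `embed_fn` is any callable mapping a list of strings to a list of
    vectors (an assumption made for this sketch).
    """
    keep = [i for i, s in enumerate(sentences) if s.strip()]
    out = np.zeros((len(sentences), embed_dim), dtype=np.float32)
    if keep:
        # Empty strings are never passed to embed_fn.
        vectors = embed_fn([sentences[i] for i in keep])
        out[keep, :] = np.asarray(vectors, dtype=np.float32)
    return out


def fake_embed(batch):
    # Stand-in for the remote call: rejects empty strings, as in the error above.
    if any(not s.strip() for s in batch):
        raise ValueError("'$.input' is invalid.")
    return [[float(len(s))] * 3 for s in batch]


result = encode_skipping_empty(["", "foo", ""], fake_embed, embed_dim=3)
print(result.shape)
```

If a sketch like this passes while the real call still fails, that would suggest the empty strings reach the API through a code path that runs before the filtering, which a full traceback should reveal.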

@KennethEnevoldsen
Contributor

Thanks for taking the time on this @ayush1298. I built upon your earlier PR and made a fix: #2676

@ayush1298 ayush1298 deleted the fix_openai_models branch May 9, 2025 10:44
@ayush1298
Collaborator Author

Thanks for taking the time on this @ayush1298. I built upon your earlier PR and made a fix: #2676

Thanks @KennethEnevoldsen
