Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Removed redundant and potentially error cause validation for single doc OpenAI embedding #3819

Closed
wants to merge 17 commits into from

Conversation

Hase-U
Copy link
Contributor

@Hase-U Hase-U commented Apr 30, 2023

The line here should actually be a length comparison with the text as token
https://github.com/hwchase17/langchain/blob/18ec22fe56049aaea446406daab6d66d172dd48f/langchain/embeddings/openai.py#L210

But realistically, there is no need to use a function like len(encode(text)) here, and we can use self._get_len_safe_embeddings by default.
All langchain users will need to install tiktoken, but it's natural to think that using tiktoken is also necessary when using openai's embedding.

So the difference from the current situation is

  • tiktoken needs to be installed.
  • all docs will be encoded at the client side and sent to openai.

@Hase-U
Copy link
Contributor Author

Hase-U commented Apr 30, 2023

Also, this PR means that the use of _get_len_safe_embeddings is completely defaulted.
So please note that this PR over here needs to be merged first. (#3778)

@Hase-U
Copy link
Contributor Author

Hase-U commented Apr 30, 2023

#3811

Also, as pointed out here, embeddings/openai.py imports tiktoken in a different way than elsewhere in langchain, so I adjusted it accordingly.
However, since this is a problem of how to write the code, I think that nothing has changed functionally.
the attribute error seems to be caused by other reason

Copy link

@shawnesquivel shawnesquivel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I reached the same conclusion on my fork. #3811 has more documentation on why this is the right fix.

@shawnesquivel
Copy link

shawnesquivel commented May 4, 2023

Was your OpenAIEmbeddings model parameter OK with the default (text-embedding-ada-002)? Personally, I had to change model to "gpt2" as per #3811.

File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/tiktoken/registry.py", line 60, in get_encoding
    raise ValueError(f"Unknown encoding {encoding_name}")
ValueError: Unknown encoding text-embedding-ada-002

https://github.com/hwchase17/langchain/blob/master/langchain/embeddings/openai.py#L107
Change:
model: str = "gpt2"

@dosubot dosubot bot added Ɑ: embeddings Related to text embedding models module 🤖:improvement Medium size change to existing code to handle new use-cases labels Jul 14, 2023
@leo-gan
Copy link
Collaborator

leo-gan commented Sep 13, 2023

@Hase-U Hi , could you, please, resolve the merging issues? After that ping me and I push this PR for the review. Thanks!

@efriis
Copy link
Member

efriis commented Nov 7, 2023

Closing because the PR wouldn't line up with the current directory structure of the library (would need to be in /libs/langchain/langchain instead of /langchain). Feel free to reopen against the current head if it's still relevant!

@efriis efriis closed this Nov 7, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Ɑ: embeddings Related to text embedding models module 🤖:improvement Medium size change to existing code to handle new use-cases
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants