Tiktoken import bug? #3811

Closed
sudowoodo200 opened this issue Apr 29, 2023 · 11 comments

Comments

@sudowoodo200

https://github.com/hwchase17/langchain/blob/adcad98bee03ac8486f328b4f316017a6ccfc808/langchain/embeddings/openai.py#L159

I'm getting a "no attribute" error for tiktoken.model. I believe this is because tiktoken has changed its module layout, per the code here. Should the call be changed to tiktoken.encoding_for_model(self.model)?

@shawnesquivel

Having this issue as well.

@shawnesquivel

Changing to tiktoken.encoding_for_model(self.model) as you recommended gave me this error:

AttributeError: module 'tiktoken' has no attribute 'encoding_for_model'

I'm on tiktoken 0.1.2.
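
The two errors above suggest that different tiktoken installs expose different entry points. A minimal sketch of a version-tolerant lookup follows. `resolve_encoding` is a hypothetical helper, not part of tiktoken or LangChain, and the `cl100k_base` fallback is an assumption (it appears in the encoding list printed later in this thread). The module is passed in as `tk` so the helper can be tried without tiktoken installed.

```python
def resolve_encoding(tk, model_name, fallback_encoding="cl100k_base"):
    """Return an encoding for model_name using whichever API `tk` exposes.

    `tk` is the imported tiktoken module (or a stand-in with the same
    attributes). `fallback_encoding` is an assumed default, not a value
    taken from LangChain.
    """
    lookup = getattr(tk, "encoding_for_model", None)
    if lookup is not None:
        try:
            return lookup(model_name)
        except KeyError:
            # tiktoken raises KeyError when it has no mapping for the model.
            pass
    # Installs without encoding_for_model, or unknown models, end up here.
    return tk.get_encoding(fallback_encoding)
```

With a real install this would be called as `resolve_encoding(tiktoken, "text-embedding-ada-002")` after `import tiktoken`.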

@shawnesquivel

shawnesquivel commented May 4, 2023

Here is the package's __init__.py from the tiktoken source:

from .core import Encoding as Encoding
from .registry import get_encoding as get_encoding
from .registry import list_encoding_names as list_encoding_names

So list_encoding_names is available, and calling it returns the encoding names that this version accepts.
I therefore changed langchain/embeddings/openai.py as follows.
Old:

            # encoding = tiktoken.model.encoding_for_model(self.model)

New:

            # Print the encoding names this install accepts; I got ['gpt2', 'r50k_base', 'p50k_base', 'cl100k_base']
            print(tiktoken.list_encoding_names())
            encoding = tiktoken.get_encoding(self.model)

In the source code, the model defaults to model: str = "text-embedding-ada-002". That is a model name rather than an encoding name, so get_encoding rejects it:

File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/tiktoken/registry.py", line 60, in get_encoding
    raise ValueError(f"Unknown encoding {encoding_name}")
ValueError: Unknown encoding text-embedding-ada-002

In your own project code, e.g., main.py, you must then pass one of the names from the list_encoding_names output as the model. I set model="gpt2" and it worked (you can use any of these: ['gpt2', 'r50k_base', 'p50k_base', 'cl100k_base']).

    import os
    from langchain.embeddings import OpenAIEmbeddings

    # model defaults to 'text-embedding-ada-002', which results in the unknown encoding error
    embeddings = OpenAIEmbeddings(
        model="gpt2", openai_api_key=os.environ.get("OPENAI_API_KEY")
    )
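
The ValueError in this comment comes from passing a model name where get_encoding expects an encoding name. A minimal sketch of keeping the two apart follows. `MODEL_TO_ENCODING` and `to_encoding_name` are hypothetical names, and the table holds only one entry, assuming that text-embedding-ada-002 is tokenized with cl100k_base.

```python
# Hypothetical lookup table; only one entry is filled in, extend as needed.
MODEL_TO_ENCODING = {"text-embedding-ada-002": "cl100k_base"}


def to_encoding_name(name, known_encodings):
    """Map a model name or an encoding name to an encoding name.

    `known_encodings` is expected to be the list returned by
    tiktoken.list_encoding_names().
    """
    if name in known_encodings:
        return name  # already an encoding name, e.g. "gpt2"
    try:
        return MODEL_TO_ENCODING[name]
    except KeyError:
        raise ValueError(f"No known encoding for {name!r}") from None
```

The result could then be passed to `tiktoken.get_encoding(...)`, which would leave the `model` argument of `OpenAIEmbeddings` as the embeddings model name instead of replacing it with an encoding name.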

@shawnesquivel

shawnesquivel commented May 4, 2023

In summary, my proposed changes that made it work for me are:

https://github.com/hwchase17/langchain/blob/master/langchain/embeddings/openai.py#L188
Change:
encoding = tiktoken.model.encoding_for_model(self.model)
to:
encoding = tiktoken.get_encoding(self.model)

https://github.com/hwchase17/langchain/blob/master/langchain/embeddings/openai.py#L107
Change the default to:
model: str = "gpt2"

@dmytrokarpovych

@shawnesquivel did you create a PR with this fix?

@shawnesquivel

@dmytrokarpovych

There is an open PR that is somewhat related: #3819

I don't think it includes the change to the default model that I made, though.

@rahdor

rahdor commented Jun 13, 2023

I had this issue and was able to resolve it by installing faiss-cpu:

pip install faiss-cpu

@heavenkiller2018

I have the same problem, and @rahdor, your suggestion had no effect for me 🤡

@Avigin

Avigin commented Aug 30, 2023

I also received this message, with more detail: "most likely due to a circular import"

After tracing the imports, I found that my local .py file had the same name as a module being imported: "token.py"
After I renamed the local file, it worked without problems.
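
A local file named like an importable module can shadow it, which matches the "circular import" hint above. A minimal sketch of a check for such name collisions follows. `find_shadowing_scripts` is a hypothetical helper, and `sys.stdlib_module_names` is only available on Python 3.10 and later, so the sketch falls back to an empty set on older interpreters.

```python
import sys
from pathlib import Path


def find_shadowing_scripts(directory, extra_names=("tiktoken",)):
    """Return names of .py files in `directory` that collide with module names.

    Checks against the standard library (where the interpreter lists it)
    plus any third-party names given in `extra_names`.
    """
    # sys.stdlib_module_names exists on Python 3.10+; otherwise use an empty set.
    reserved = set(getattr(sys, "stdlib_module_names", ())) | set(extra_names)
    return sorted(
        path.name
        for path in Path(directory).glob("*.py")
        if path.stem in reserved
    )
```

Running `find_shadowing_scripts(".")` from the project directory would list candidates to rename.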


dosubot bot commented Nov 29, 2023

Hi, @sudowoodo200. I'm Dosu, and I'm helping the LangChain team manage their backlog. I wanted to let you know that we are marking this issue as stale.

Based on my understanding, the issue is about a bug in the import of the tiktoken library. The suggested change in the import code to tiktoken.encoding_for_model(self.model) did not work for one user, but they found a workaround by using tiktoken.get_encoding(self.model) instead. Another user mentioned that installing faiss-cpu resolved the issue for them. There is an open pull request related to this issue, but it does not include the default model fix.

If this issue is still relevant to the latest version of the LangChain repository, please let us know by commenting on the issue. Otherwise, feel free to close the issue yourself. If we don't hear back from you, the issue will be automatically closed in 7 days.

Thank you for your understanding and contribution to the LangChain project. If you have any further questions or concerns, please don't hesitate to reach out.

Best regards,
Dosu

@dosubot dosubot bot added the stale label Nov 29, 2023
@dosubot dosubot bot closed this as not planned Dec 6, 2023
@dosubot dosubot bot removed the stale label Dec 6, 2023
@pepijnolivier

I named my Python script tiktoken.py, which is what caused the error.
I renamed it and it works now.
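
To confirm that a script is shadowing the real package, one option is to ask Python where the name would be loaded from. This is a rough sketch under that assumption. `resolves_inside` is a hypothetical helper built on the standard library's `importlib.util.find_spec`.

```python
import importlib.util
from pathlib import Path


def resolves_inside(module_name, directory):
    """Return True if importing `module_name` would load a file under `directory`."""
    spec = importlib.util.find_spec(module_name)
    # Built-in, frozen, and namespace modules have no file on disk to compare.
    if spec is None or not spec.origin or spec.origin in ("built-in", "frozen"):
        return False
    origin = Path(spec.origin).resolve()
    return Path(directory).resolve() in origin.parents
```

If `resolves_inside("tiktoken", ".")` returns True when run from the project directory, the import is picking up a local file rather than the installed package.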
