
Add support for different models in num_tokens_from_text function #90

Closed

Conversation

@vidhula17 (Collaborator) commented Oct 3, 2023

Why are these changes needed?

This PR extends the num_tokens_from_text function to support a wider range of language models beyond the "gpt-x" series, making the token-counting code more flexible and easier for the community to extend with new models.

Related issue number

Closes #63


@vidhula17 (Collaborator, Author) commented Oct 3, 2023

@microsoft-github-policy-service agree

@thinkall (Collaborator) left a comment

Thank you very much @vidhula17 for the PR, nice job! I've left some comments, could you please address them? Let me know if you need any help.

Thanks again for your contribution!

"""Return the number of tokens used by a text for different models."""

# Define token counts for known models
known_models = {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why gpt-3.5-turbo-0301 is not in the known model?

"gpt-4-0613": (3, 1),
"gpt-4-32k-0613": (3, 1),
}


We can add a parameter to the function, say model_token: dict = None, and add the code below to support customizing a model's tokens_per_message without modifying the code here.

if isinstance(model_token, dict):
    known_models.update(model_token)

The parameter can be passed in retrieve_config in autogen/autogen/agentchat/contrib/retrieve_user_proxy_agent.py
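
A minimal sketch of how the function could incorporate this suggestion (only the two gpt-4 entries and the model_token merge come from this review; the signature, the gpt-3.5-turbo entry, and the final counting line are assumptions):

import tiktoken


def num_tokens_from_text(text: str, model: str = "gpt-3.5-turbo-0613", model_token: dict = None) -> int:
    """Return the number of tokens used by a text for different models."""
    # (tokens_per_message, tokens_per_name) for known models.
    known_models = {
        "gpt-3.5-turbo-0613": (3, 1),
        "gpt-4-0613": (3, 1),
        "gpt-4-32k-0613": (3, 1),
    }
    # Merge caller-supplied entries so new models need no code change here.
    if isinstance(model_token, dict):
        known_models.update(model_token)
    if model not in known_models:
        raise NotImplementedError(f"num_tokens_from_text() is not implemented for model {model}.")
    # tokens_per_name only matters when counting chat messages with a "name" field.
    tokens_per_message, tokens_per_name = known_models[model]
    encoding = tiktoken.encoding_for_model(model)
    return len(encoding.encode(text)) + tokens_per_message

Callers could then register a new model through retrieve_config, e.g. retrieve_config={"model_token": {"my-custom-model": (3, 1)}, ...}; the key name model_token here is an assumption.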

Comment on lines +55 to +63

if model == "your-new-model-name":
    tokens_per_message = 3
    tokens_per_name = 1
else:
    raise NotImplementedError(
        f"num_tokens_from_text() is not implemented for model {model}. See "
        f"https://github.com/openai/openai-python/blob/main/chatml.md for information on how messages are "
        f"converted to tokens."
    )

Suggested change

-if model == "your-new-model-name":
-    tokens_per_message = 3
-    tokens_per_name = 1
-else:
-    raise NotImplementedError(
-        f"num_tokens_from_text() is not implemented for model {model}. See "
-        f"https://github.com/openai/openai-python/blob/main/chatml.md for information on how messages are "
-        f"converted to tokens."
-    )
+tokens_per_message = 3
+tokens_per_name = 1
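
In other words, models not found in known_models would fall back to the defaults of 3 tokens per message and 1 token per name instead of raising NotImplementedError.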

    )

# Use tiktoken to calculate the number of tokens in the text
encoding = tiktoken.encoding_for_model(model)

Suggested change

-encoding = tiktoken.encoding_for_model(model)
+try:
+    encoding = tiktoken.encoding_for_model(model)
+except KeyError:
+    logger.warning("Warning: model not found. Using cl100k_base encoding.")
+    encoding = tiktoken.get_encoding("cl100k_base")

A try...except is needed here.
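
For context: tiktoken.encoding_for_model raises KeyError for model names it does not recognize, so any custom model registered via model_token would otherwise crash on this line; cl100k_base is the encoding used by the gpt-3.5-turbo and gpt-4 families, which makes it a sensible fallback.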

Comment on lines +17 to +18

with self.assertRaises(NotImplementedError):
    num_tokens_from_text(text, model)

This test needs to be updated: with the suggested changes above, unknown models fall back instead of raising NotImplementedError.
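
A sketch of what the updated test might look like under the suggested fallback behavior (the module path follows autogen/retrieve_utils.py from the coverage report below; the test name and model name are placeholders):

import unittest

from autogen.retrieve_utils import num_tokens_from_text


class TestNumTokensFromText(unittest.TestCase):
    def test_unknown_model_falls_back(self):
        # With the suggested changes, an unrecognized model no longer raises
        # NotImplementedError; it falls back to the cl100k_base encoding and
        # the default (tokens_per_message, tokens_per_name) of (3, 1).
        count = num_tokens_from_text("hello world", model="some-unknown-model")
        self.assertGreater(count, 0)


if __name__ == "__main__":
    unittest.main()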

@thinkall (Collaborator) commented Oct 3, 2023

Also, the code format check failed. Please run pre-commit install in the root folder of your local repo; that enables automatic code formatting for your changes.
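
(After installing the hook, pre-commit run --all-files will also apply the configured formatters to the whole repo in one pass, which is handy for fixing files changed before the hook was installed.)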

@thinkall self-assigned this Oct 3, 2023
@codecov-commenter commented Oct 4, 2023

Codecov Report

Merging #90 (0130a98) into main (5ff85a3) will increase coverage by 0.30%.
The diff coverage is 75.00%.

@@            Coverage Diff             @@
##             main      #90      +/-   ##
==========================================
+ Coverage   41.03%   41.33%   +0.30%     
==========================================
  Files          17       17              
  Lines        2091     2083       -8     
  Branches      469      467       -2     
==========================================
+ Hits          858      861       +3     
+ Misses       1156     1145      -11     
  Partials       77       77              
Flag        Coverage Δ
unittests   41.23% <75.00%> (+0.30%) ⬆️

Flags with carried forward coverage won't be shown.

Files                       Coverage Δ
autogen/retrieve_utils.py   68.75% <75.00%> (+5.05%) ⬆️

@thinkall mentioned this pull request Oct 8, 2023
@thinkall closed this Oct 9, 2023
jackgerrits pushed a commit that referenced this pull request Oct 2, 2024
* Cancellation for model client #90

* format

* Use future