
Add support for different models in num_tokens_from_text function #90

Closed

Conversation

@vidhula17 (Collaborator) commented Oct 3, 2023

Why are these changes needed?

This PR extends the num_tokens_from_text function to support a wider range of language models beyond the "gpt-x" series, making the token-counting code more flexible and easier for the community to extend with new models.

Related issue number

Closes #63


@vidhula17 (Collaborator, Author) commented Oct 3, 2023

@microsoft-github-policy-service agree

@thinkall (Collaborator) left a comment

Thank you very much @vidhula17 for the PR, nice job! I've left some comments, could you please address them? Let me know if you need any help.

Thanks again for your contribution!

"""Return the number of tokens used by a text for different models."""

# Define token counts for known models
known_models = {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why gpt-3.5-turbo-0301 is not in the known model?

"gpt-4-0613": (3, 1),
"gpt-4-32k-0613": (3, 1),
}


We can add a parameter to the function, say model_token: dict = None, and add the code below to support customizing a model's tokens_per_message without modifying the code here.

if isinstance(model_token, dict):
    known_models.update(model_token)

The parameter can be passed in retrieve_config in autogen/autogen/agentchat/contrib/retrieve_user_proxy_agent.py
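
A minimal sketch of how the function could incorporate this suggestion (only the two gpt-4 entries and the model_token merge come from this review; the signature, the gpt-3.5-turbo entry, and the final counting line are assumptions):

import tiktoken


def num_tokens_from_text(text: str, model: str = "gpt-3.5-turbo-0613", model_token: dict = None) -> int:
    """Return the number of tokens used by a text for different models."""
    # (tokens_per_message, tokens_per_name) for known models.
    known_models = {
        "gpt-3.5-turbo-0613": (3, 1),
        "gpt-4-0613": (3, 1),
        "gpt-4-32k-0613": (3, 1),
    }
    # Merge caller-supplied entries so new models need no code change here.
    if isinstance(model_token, dict):
        known_models.update(model_token)
    if model not in known_models:
        raise NotImplementedError(f"num_tokens_from_text() is not implemented for model {model}.")
    # tokens_per_name only matters when counting chat messages with a "name" field.
    tokens_per_message, tokens_per_name = known_models[model]
    encoding = tiktoken.encoding_for_model(model)
    return len(encoding.encode(text)) + tokens_per_message

Callers could then register a new model through retrieve_config, e.g. retrieve_config={"model_token": {"my-custom-model": (3, 1)}, ...}; the key name model_token here is an assumption.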

Comment on lines +55 to +63

if model == "your-new-model-name":
    tokens_per_message = 3
    tokens_per_name = 1
else:
    raise NotImplementedError(
        f"num_tokens_from_text() is not implemented for model {model}. See "
        f"https://github.com/openai/openai-python/blob/main/chatml.md for information on how messages are "
        f"converted to tokens."
    )

Suggested change

-if model == "your-new-model-name":
-    tokens_per_message = 3
-    tokens_per_name = 1
-else:
-    raise NotImplementedError(
-        f"num_tokens_from_text() is not implemented for model {model}. See "
-        f"https://github.com/openai/openai-python/blob/main/chatml.md for information on how messages are "
-        f"converted to tokens."
-    )
+tokens_per_message = 3
+tokens_per_name = 1
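
In other words, models not found in known_models would fall back to the defaults of 3 tokens per message and 1 token per name instead of raising NotImplementedError.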

    )

# Use tiktoken to calculate the number of tokens in the text
encoding = tiktoken.encoding_for_model(model)

Suggested change

-encoding = tiktoken.encoding_for_model(model)
+try:
+    encoding = tiktoken.encoding_for_model(model)
+except KeyError:
+    logger.warning("Warning: model not found. Using cl100k_base encoding.")
+    encoding = tiktoken.get_encoding("cl100k_base")

A try...except is needed here.
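
For context: tiktoken.encoding_for_model raises KeyError for model names it does not recognize, so any custom model registered via model_token would otherwise crash on this line; cl100k_base is the encoding used by the gpt-3.5-turbo and gpt-4 families, which makes it a sensible fallback.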

Comment on lines +17 to +18

with self.assertRaises(NotImplementedError):
    num_tokens_from_text(text, model)

This test needs to be updated: with the suggested changes above, unknown models fall back instead of raising NotImplementedError.
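
A sketch of what the updated test might look like under the suggested fallback behavior (the module path follows autogen/retrieve_utils.py from the coverage report below; the test name and model name are placeholders):

import unittest

from autogen.retrieve_utils import num_tokens_from_text


class TestNumTokensFromText(unittest.TestCase):
    def test_unknown_model_falls_back(self):
        # With the suggested changes, an unrecognized model no longer raises
        # NotImplementedError; it falls back to the cl100k_base encoding and
        # the default (tokens_per_message, tokens_per_name) of (3, 1).
        count = num_tokens_from_text("hello world", model="some-unknown-model")
        self.assertGreater(count, 0)


if __name__ == "__main__":
    unittest.main()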

@thinkall (Collaborator) commented Oct 3, 2023

Also, the code format check failed. Please run pre-commit install in the root folder of your local repo; that enables automatic code formatting for your changes.
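
(After installing the hook, pre-commit run --all-files will also apply the configured formatters to the whole repo in one pass, which is handy for fixing files changed before the hook was installed.)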

@thinkall self-assigned this Oct 3, 2023
@codecov-commenter commented Oct 4, 2023

Codecov Report

Merging #90 (0130a98) into main (5ff85a3) will increase coverage by 0.30%.
The diff coverage is 75.00%.

@@            Coverage Diff             @@
##             main      #90      +/-   ##
==========================================
+ Coverage   41.03%   41.33%   +0.30%     
==========================================
  Files          17       17              
  Lines        2091     2083       -8     
  Branches      469      467       -2     
==========================================
+ Hits          858      861       +3     
+ Misses       1156     1145      -11     
  Partials       77       77              
Flag        Coverage Δ
unittests   41.23% <75.00%> (+0.30%) ⬆️

Flags with carried forward coverage won't be shown.

Files                       Coverage Δ
autogen/retrieve_utils.py   68.75% <75.00%> (+5.05%) ⬆️

@thinkall mentioned this pull request Oct 8, 2023
@thinkall closed this Oct 9, 2023
jackgerrits pushed a commit that referenced this pull request Oct 2, 2024
* Cancellation for model client #90

* format

* Use future