
feat(tokenizer): add support for open source llm tokenizers #1701

Closed

Conversation


@Halpph Halpph commented Feb 16, 2024

Hello everyone, I saw the Contributing guide, but for some reason the tests would always fail with:

```
_____________ ERROR collecting test/test_function_utils.py _____________
test/test_function_utils.py:288: in <module>
    class Currency(BaseModel):
pydantic/main.py:197: in pydantic.main.ModelMetaclass.__new__
    ???
pydantic/fields.py:497: in pydantic.fields.ModelField.infer
    ???
pydantic/fields.py:469: in pydantic.fields.ModelField._get_field_info
    ???
E   ValueError: `Field` default cannot be set in `Annotated` for 'amount'
```

I've tried, but I'm very busy and haven't managed to make it work so far. I hope you can take a look, and I'll try running it again.
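For reference, here is a minimal reproduction of the error above, inferred from the traceback (the `Currency` model and the `amount` field appear in the collected test; the exact test code is an assumption): pydantic v1 rejects a default value set via `Field` inside `Annotated`.

```python
# Hypothetical reproduction, assuming pydantic v1.x and Python 3.9+.
from typing import Annotated
from pydantic import BaseModel, Field

class Currency(BaseModel):
    # Under pydantic v1, defining this class raises:
    #   ValueError: `Field` default cannot be set in `Annotated` for 'amount'
    amount: Annotated[float, Field(default=0.0)]

# A v1-compatible form moves the default outside Annotated:
#   amount: Annotated[float, Field()] = 0.0
```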

Why are these changes needed?

This PR solves the following issue: https://github.com/microsoft/autogen/issues/1666
Previously, when serving open-source LLMs we always tokenized with cl100k_base; this PR adds support for each model's native tokenizer, selected by specifying it in the OAI_CONFIG_LIST.
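For illustration, a config entry might look like the sketch below. This is hypothetical: the `tokenizer` key name and the model ID are assumptions for illustration, not necessarily the exact field this PR introduces.

```python
# Hypothetical OAI_CONFIG_LIST entry for a locally served open-source model.
# "model", "base_url", and "api_key" are standard autogen config fields;
# the "tokenizer" key is an assumed name for the field added by this PR.
config_list = [
    {
        "model": "meta-llama/Llama-2-7b-chat-hf",
        "base_url": "http://localhost:8000/v1",
        "api_key": "NULL",
        # Count tokens with the model's native tokenizer instead of
        # falling back to tiktoken's cl100k_base encoding.
        "tokenizer": "meta-llama/Llama-2-7b-chat-hf",
    }
]
```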

Related issue number

Closes #1666

Checks

Unfortunately, I didn't manage to run the checks because of the error mentioned above.

@Halpph
Author

Halpph commented Feb 16, 2024

@microsoft-github-policy-service agree

@codecov-commenter

codecov-commenter commented Feb 16, 2024

Codecov Report

Attention: 16 lines in your changes are missing coverage. Please review.

Comparison is base (aea5bed) 39.62% compared to head (a842f46) 15.93%.

| Files | Patch % | Lines |
|---|---|---|
| autogen/token_count_utils.py | 27.27% | 16 Missing ⚠️ |
Additional details and impacted files

```diff
@@             Coverage Diff             @@
##             main    #1701       +/-   ##
===========================================
- Coverage   39.62%   15.93%   -23.70%
===========================================
  Files          57       57
  Lines        6006     6024       +18
  Branches     1338     1457      +119
===========================================
- Hits         2380      960     -1420
- Misses       3433     5027     +1594
+ Partials      193       37      -156
```
| Flag | Coverage Δ |
|---|---|
| unittests | 15.91% <27.27%> (-23.71%) ⬇️ |

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@Halpph Halpph marked this pull request as draft February 16, 2024 15:09
@sonichi sonichi requested a review from SDcodehub February 18, 2024 06:07
@sonichi sonichi added the models Pertains to using alternate, non-GPT, models (e.g., local models, llama, etc.) label Feb 18, 2024
@Halpph
Author

Halpph commented Feb 19, 2024

@olgavrou @AaronWard @kevin666aa @SDcodehub

Hello everyone, this is a workaround I implemented for now so that I could continue with my work, but I'd like to discuss a proper implementation with you before starting the refactoring.
For instance, we could use some kind of strategy pattern, so that the chosen strategy is responsible for fetching the right tokenizer for each case (see the sketch below). What do you think?
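To make the idea concrete, here is a minimal sketch of the strategy pattern I have in mind. All names are hypothetical, not existing autogen APIs: each strategy wraps one tokenizer family, and a small factory picks a strategy from the model config.

```python
# Minimal sketch of the proposed strategy pattern; all names are hypothetical.
from typing import Protocol

class TokenizerStrategy(Protocol):
    """Anything that can count tokens for a piece of text."""
    def count_tokens(self, text: str) -> int: ...

class TiktokenStrategy:
    """Default strategy: tiktoken's cl100k_base, i.e. the current behavior."""
    def __init__(self, encoding: str = "cl100k_base"):
        import tiktoken
        self._enc = tiktoken.get_encoding(encoding)

    def count_tokens(self, text: str) -> int:
        return len(self._enc.encode(text))

class HuggingFaceStrategy:
    """Strategy for open-source models with a native Hugging Face tokenizer."""
    def __init__(self, model_name: str):
        from transformers import AutoTokenizer
        self._tok = AutoTokenizer.from_pretrained(model_name)

    def count_tokens(self, text: str) -> int:
        return len(self._tok.encode(text))

def get_tokenizer_strategy(config: dict) -> TokenizerStrategy:
    """Pick a strategy from a (hypothetical) "tokenizer" config field."""
    if "tokenizer" in config:
        return HuggingFaceStrategy(config["tokenizer"])
    return TiktokenStrategy()
```

token_count_utils could then request a strategy once per config entry and delegate all counting to it, keeping the tiktoken fallback as the default so existing behavior is unchanged.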


gitguardian bot commented Jul 20, 2024

️✅ There are no secrets present in this pull request anymore.

If these secrets were true positives and are still valid, we highly recommend that you revoke them.
Once a secret has been leaked into a git repository, you should consider it compromised, even if it was deleted immediately.
Find more information about the risks here.


🦉 GitGuardian detects secrets in your source code to help developers and security teams secure the modern development process. You are seeing this because you or someone else with access to this repository has authorized GitGuardian to scan your pull request.

@gagb gagb closed this Aug 26, 2024
@gagb gagb reopened this Aug 28, 2024
@ekzhu ekzhu changed the base branch from main to 0.2 October 2, 2024 18:30
@ekzhu
Collaborator

ekzhu commented Oct 2, 2024

@Halpph is this still the workaround you are using?

@Halpph
Author

Halpph commented Oct 3, 2024

I'm not actively using it at the moment, but yes.

@jackgerrits jackgerrits added the 0.2 Issues which are related to the pre 0.4 codebase label Oct 4, 2024
@rysweet
Collaborator

rysweet commented Oct 10, 2024

This PR is against AutoGen 0.2. AutoGen 0.2 has been moved to the 0.2 branch. Please rebase your PR on the 0.2 branch or update it to work with the new AutoGen 0.4 that is now in main.

@rysweet rysweet added the awaiting-op-response Issue or pr has been triaged or responded to and is now awaiting a reply from the original poster label Oct 10, 2024
@rysweet
Collaborator

rysweet commented Oct 11, 2024

@Halpph If you can update the branch to resolve the conflicts and we can get CI to pass, we can look at bringing this forward.

@rysweet
Collaborator

rysweet commented Oct 18, 2024

Closing as stale; please reopen if you would like to bring it up to date.

@rysweet rysweet closed this Oct 18, 2024
Development

Successfully merging this pull request may close these issues.

[Issue]: Autogen count tokens badly when using served Open Source Models