[FEAT] Add string tokenize expression #2503
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## main #2503 +/- ##
=======================================
Coverage ? 63.31%
=======================================
Files ? 978
Lines ? 108843
Branches ? 0
=======================================
Hits ? 68915
Misses ? 39928
Partials ? 0
56053a9 to 43d1373
@Vince7778 Do we have any plans to eventually support Hugging Face tokenizers as well?
I think for now I won't, since Hugging Face has a much different API and a lot more customization. Supporting it seems like it would be a pretty big chore. If there's enough demand we probably can, but for now using UDFs would be better.
20e8f4e to f63add8
This involves removing the series impl of tokenize and changing tests.
26337f1 to 7e6b299
Allows users to tokenize a string column using tiktoken and a variety of encoders.
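For illustration only, here is a minimal plain-Python sketch of what tokenizing a string column amounts to: apply an encoder to each value and return a list of token ids per row. This is not the expression added by this PR. The names `tokenize_column` and `toy_encode` are hypothetical, the byte-level encoder is a stand-in so the snippet runs without downloads, and passing nulls through unchanged is an assumption rather than confirmed behavior.

```python
from typing import Callable, List, Optional, Sequence


def tokenize_column(
    values: Sequence[Optional[str]],
    encode: Callable[[str], List[int]],
) -> List[Optional[List[int]]]:
    """Apply `encode` to every string in a column.

    Assumption: null (None) entries pass through unchanged.
    """
    return [None if value is None else encode(value) for value in values]


def toy_encode(text: str) -> List[int]:
    # Hypothetical stand-in encoder: one token id per UTF-8 byte.
    # A real BPE encoder would merge bytes into larger tokens.
    return list(text.encode("utf-8"))


column = ["hi", None, ""]
tokens = tokenize_column(column, toy_encode)
print(tokens)

# With tiktoken installed, a real encoder can be plugged in the same way:
#   import tiktoken
#   enc = tiktoken.get_encoding("cl100k_base")
#   tokenize_column(column, enc.encode)
```

Taking the encoder as a callable keeps the column logic independent of any one tokenizer library, which is also roughly how a UDF-based approach would wrap a different tokenizer.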
Todo list:
Things that could be done in the future: