
[FEAT] Add string tokenize expression #2503

Merged · 24 commits into main from conor/tokenize-expr · Jul 19, 2024

Conversation

@Vince7778 (Contributor) commented Jul 11, 2024

Allows users to tokenize a string column using tiktoken and a variety of encoders.

Todo list:

  • Support for builtin models (cl100k_base, p50k_base, etc)
  • Support for loading models from a token file
  • Support for downloading models from the cloud
  • More tests
  • Fix error handling
  • Pattern argument
  • Special token support
  • All the tests
  • Update docs

Things that could be done in the future:

  • Add caching for token files so that each process doesn't have to download them separately.
  • Make a fork of tiktoken-rs for various fixes (accessing private fields, fixing unwraps, optimizing etc)
  • Add support for huggingface tokenizers
  • Add more granular special token support (custom inputs, using a subset of them)
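To illustrate what the new expression does conceptually, here is a toy greedy longest-match tokenizer over a tiny hand-made vocabulary. This is only a sketch of the general technique: real tiktoken encoders use byte-pair merges over a large learned vocabulary, and the names and vocabulary below are hypothetical, not the PR's actual API.

```python
# Illustrative only: a greedy longest-match tokenizer over a tiny made-up
# vocabulary, approximating what a tiktoken-style encoder does at scale.
TOY_VOCAB = {
    "hello": 0, "hell": 1, "he": 2, "lo": 3,
    " ": 4, "world": 5, "wor": 6, "ld": 7,
}

def encode(text: str) -> list[int]:
    """Greedily match the longest vocabulary entry at each position."""
    ids = []
    i = 0
    while i < len(text):
        for j in range(len(text), i, -1):  # try longest pieces first
            piece = text[i:j]
            if piece in TOY_VOCAB:
                ids.append(TOY_VOCAB[piece])
                i = j
                break
        else:
            raise ValueError(f"untokenizable character at {i}: {text[i]!r}")
    return ids

def decode(ids: list[int]) -> str:
    """Invert the vocabulary and concatenate the pieces back together."""
    inv = {v: k for k, v in TOY_VOCAB.items()}
    return "".join(inv[t] for t in ids)

print(encode("hello world"))          # → [0, 4, 5]
print(decode(encode("hello world")))  # → hello world
```

Encoding in a DataFrame engine applies this per row over a string column, producing a column of integer lists; decoding is the round trip back.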

@github-actions github-actions bot added the enhancement New feature or request label Jul 11, 2024
codecov bot commented Jul 11, 2024

Codecov Report

Attention: Patch coverage is 90.71926% with 40 lines in your changes missing coverage. Please review.

Please upload report for BASE (main@346868e). Learn more about missing BASE report.

Additional details and impacted files

Impacted file tree graph

@@           Coverage Diff           @@
##             main    #2503   +/-   ##
=======================================
  Coverage        ?   63.31%           
=======================================
  Files           ?      978           
  Lines           ?   108843           
  Branches        ?        0           
=======================================
  Hits            ?    68915           
  Misses          ?    39928           
  Partials        ?        0           
Files Coverage Δ
daft/expressions/expressions.py 93.82% <100.00%> (ø)
src/daft-core/src/series/ops/utf8.rs 93.54% <100.00%> (ø)
src/daft-functions/src/lib.rs 53.33% <100.00%> (ø)
src/daft-functions/src/tokenize/mod.rs 100.00% <100.00%> (ø)
src/daft-functions/src/tokenize/special_tokens.rs 100.00% <100.00%> (ø)
src/daft-functions/src/tokenize/bpe.rs 96.35% <96.35%> (ø)
src/daft-functions/src/tokenize/encode.rs 87.09% <87.09%> (ø)
src/daft-functions/src/tokenize/decode.rs 73.56% <73.56%> (ø)

@github-actions github-actions bot added the documentation Improvements or additions to documentation label Jul 12, 2024
@Vince7778 Vince7778 force-pushed the conor/tokenize-expr branch 2 times, most recently from 56053a9 to 43d1373 Compare July 15, 2024 23:38
@universalmind303 (Collaborator) commented Jul 16, 2024

@Vince7778 Do we have any plans on eventually supporting huggingface tokenizers as well?

@Vince7778 (Contributor, Author) replied:
I think for now I won't, since Hugging Face has a much different API and far more customization options, so supporting it would be a fairly large undertaking. If there's enough demand we probably can, but for now using UDFs would be the better route.
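For context on the UDF fallback mentioned above: a Hugging Face tokenizer can be applied per batch with a user-defined function. The sketch below shows only the batch-function shape, with a placeholder `word_to_id` lookup standing in for a real `tokenizers` call; the Daft `@daft.udf` wrapping is omitted so the sketch stays dependency-free, and all names here are illustrative.

```python
# Sketch of the UDF fallback: a batch function mapping each string in a
# column to a list of token ids. A real version would call a Hugging Face
# `tokenizers` Tokenizer here; this placeholder just sums UTF-8 byte values
# of whitespace-split words as a deterministic stand-in for a vocab lookup.
def tokenize_batch(texts: list[str]) -> list[list[int]]:
    def word_to_id(word: str) -> int:
        # Hypothetical stand-in for a learned vocabulary lookup.
        return sum(word.encode("utf-8")) % 50_000

    return [[word_to_id(w) for w in t.split()] for t in texts]

print(tokenize_batch(["hello world", "daft"]))  # → [[532, 552], [415]]
```

In Daft, such a function would be registered as a UDF and applied to the string column, trading the built-in expression's performance for the flexibility of an arbitrary tokenizer.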

@Vince7778 Vince7778 marked this pull request as ready for review July 17, 2024 19:41
@Vince7778 Vince7778 enabled auto-merge (squash) July 18, 2024 17:54
@Vince7778 Vince7778 merged commit 876fca8 into main Jul 19, 2024
44 checks passed
@Vince7778 Vince7778 deleted the conor/tokenize-expr branch July 19, 2024 22:40
Labels
documentation Improvements or additions to documentation enhancement New feature or request
2 participants