Improve nvtext::tokenize_with_vocabulary performance #18522

Merged

rapids-bot[bot] merged 9 commits into rapidsai:branch-25.06 from davidwendt:vocab-tokenize-perf on May 5, 2025
Conversation

@davidwendt
Contributor

Description

Part of an ongoing effort to refactor common split/tokenize code and improve performance.
This change improves the performance of the vocabulary tokenizer, mostly for smaller strings (<= 128 bytes), by using an intermediate buffer to hold the per-row token counts before computing the output offsets.
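
For context, here is a minimal sketch of the count-then-scan pattern described above. It is not the libcudf kernel: the `count_tokens_fn` functor and the single-space delimiter are illustrative assumptions; the point is that per-row token counts land in an intermediate buffer first, and a scan then turns those counts into output offsets.

```cpp
// Minimal sketch of the count-then-scan pattern (not the libcudf kernel).
// Assumes a single-space delimiter; count_tokens_fn is illustrative only.
#include <thrust/device_vector.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/scan.h>
#include <thrust/transform.h>

#include <cstdio>
#include <string>
#include <vector>

// Count delimiter-separated tokens in one row of a flattened strings column.
struct count_tokens_fn {
  char const* d_chars;   // all characters, concatenated
  int const* d_offsets;  // row i spans [d_offsets[i], d_offsets[i+1])
  __device__ int operator()(int row) const
  {
    int count     = 0;
    bool in_token = false;
    for (int i = d_offsets[row]; i < d_offsets[row + 1]; ++i) {
      bool const is_delim = (d_chars[i] == ' ');
      if (!is_delim && !in_token) { ++count; }
      in_token = !is_delim;
    }
    return count;
  }
};

int main()
{
  // Flattened strings column: ["hello world", "a b c", ""]
  std::string const chars = "hello worlda b c";
  std::vector<int> const offsets{0, 11, 16, 16};
  int const num_rows = static_cast<int>(offsets.size()) - 1;

  thrust::device_vector<char> d_chars(chars.begin(), chars.end());
  thrust::device_vector<int> d_offsets(offsets);

  // 1) Intermediate buffer holding the token count for every row
  thrust::device_vector<int> token_counts(num_rows);
  thrust::transform(thrust::counting_iterator<int>(0),
                    thrust::counting_iterator<int>(num_rows),
                    token_counts.begin(),
                    count_tokens_fn{d_chars.data().get(), d_offsets.data().get()});

  // 2) A scan converts the counts into output offsets (out_offsets[0] == 0)
  thrust::device_vector<int> out_offsets(num_rows + 1, 0);
  thrust::inclusive_scan(token_counts.begin(), token_counts.end(), out_offsets.begin() + 1);

  for (int i = 0; i <= num_rows; ++i) {
    std::printf("offset[%d] = %d\n", i, static_cast<int>(out_offsets[i]));
  }
  return 0;
}
```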

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@davidwendt added the 2 - In Progress (Currently a work in progress), libcudf (Affects libcudf (C++/CUDA) code), strings (strings issues (C++ and Python)), improvement (Improvement / enhancement to an existing function), and non-breaking (Non-breaking change) labels on Apr 17, 2025
@davidwendt self-assigned this on Apr 17, 2025
@copy-pr-bot

copy-pr-bot bot commented Apr 17, 2025

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@davidwendt
Contributor Author

/ok to test

@davidwendt
Contributor Author

davidwendt commented Apr 17, 2025

Benchmarks show a performance improvement from the specialized logic for smaller strings:

## [0] NVIDIA RTX A6000

| max_width | num_rows |   Ref Time |   Cmp Time |         Diff |   %Diff |
|-----------|----------|------------|------------|--------------|---------|
|    32     |  32768   | 194.769 us | 168.589 us |   -26.181 us | -13.44% |
|    64     |  32768   | 305.154 us | 245.225 us |   -59.929 us | -19.64% |
|    128    |  32768   | 573.142 us | 425.584 us |  -147.559 us | -25.75% |
|    256    |  32768   | 337.069 us | 338.036 us |     0.967 us |   0.29% |
|    32     |  262144  | 311.163 us | 300.832 us |   -10.331 us |  -3.32% |
|    64     |  262144  | 524.604 us | 498.134 us |   -26.469 us |  -5.05% |
|    128    |  262144  |   1.080 ms | 919.996 us |  -160.119 us | -14.82% |
|    256    |  262144  |   1.776 ms |   1.774 ms |    -2.098 us |  -0.12% |
|    32     | 2097152  |   1.471 ms |   1.424 ms |   -46.918 us |  -3.19% |
|    64     | 2097152  |   2.840 ms |   2.631 ms |  -208.705 us |  -7.35% |
|    128    | 2097152  |   6.575 ms |   5.292 ms | -1282.198 us | -19.50% |
|    256    | 2097152  |  13.569 ms |  13.521 ms |   -47.806 us |  -0.35% |
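
For reference, a hypothetical call site for the API being benchmarked here. The `load_vocabulary`/`tokenize_with_vocabulary` signatures below are written from memory of `nvtext/tokenize.hpp` and should be checked against the current header; the test `strings_column_wrapper` is used only to keep the example short.

```cpp
// Hypothetical call site for nvtext::tokenize_with_vocabulary; verify the
// exact signatures against the current nvtext/tokenize.hpp header.
#include <nvtext/tokenize.hpp>

#include <cudf/scalar/scalar.hpp>
#include <cudf/strings/strings_column_view.hpp>
#include <cudf_test/column_wrapper.hpp>  // test utility, used here for brevity

int main()
{
  // Vocabulary and input strings
  cudf::test::strings_column_wrapper vocabulary({"brown", "fox", "jumps", "the"});
  cudf::test::strings_column_wrapper input(
    {"the quick brown fox jumps over the lazy dog"});

  auto const vocab = nvtext::load_vocabulary(cudf::strings_column_view(vocabulary));

  // Tokens not found in the vocabulary map to the default id (-1 here)
  auto const delimiter = cudf::string_scalar(" ");
  auto const result    = nvtext::tokenize_with_vocabulary(
    cudf::strings_column_view(input), *vocab, delimiter, -1);

  // result is a LIST column of token ids, one list per input row
  return 0;
}
```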

@davidwendt
Contributor Author

/ok to test

@davidwendt
Contributor Author

/ok to test

@davidwendt added the 3 - Ready for Review (Ready for review by team) label and removed the 2 - In Progress (Currently a work in progress) label on Apr 25, 2025
@davidwendt
Contributor Author

/ok to test

@davidwendt
Contributor Author

/ok to test

@davidwendt marked this pull request as ready for review on April 29, 2025 16:39
@davidwendt requested a review from a team as a code owner on April 29, 2025 16:39
@davidwendt
Contributor Author

/merge

@rapids-bot[bot] merged commit a5ed0e6 into rapidsai:branch-25.06 on May 5, 2025
112 checks passed
@davidwendt deleted the vocab-tokenize-perf branch on May 5, 2025 15:10
vyasr added a commit to vyasr/cudf that referenced this pull request May 6, 2025

Labels

  • 3 - Ready for Review Ready for review by team
  • improvement Improvement / enhancement to an existing function
  • libcudf Affects libcudf (C++/CUDA) code.
  • non-breaking Non-breaking change
  • strings strings issues (C++ and Python)
