Improve nvtext::tokenize_with_vocabulary performance #18522

Merged

rapids-bot[bot] merged 9 commits into rapidsai:branch-25.06 from davidwendt:vocab-tokenize-perf on May 5, 2025
Conversation

@davidwendt
Contributor

Description

Part of an ongoing effort to refactor common split/tokenize code and improve performance.
This change improves the performance of the vocabulary tokenizer, mostly for smaller strings (<= 128 bytes), by using an intermediate buffer to hold the per-row token counts before computing the output offsets.
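
For context, here is a minimal sketch of the count-then-scan pattern described above. It is not the libcudf kernel: the `count_tokens_fn` functor and the single-space delimiter are illustrative assumptions; the point is that per-row token counts land in an intermediate buffer first, and a scan then turns those counts into output offsets.

```cpp
// Minimal sketch of the count-then-scan pattern (not the libcudf kernel).
// Assumes a single-space delimiter; count_tokens_fn is illustrative only.
#include <thrust/device_vector.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/scan.h>
#include <thrust/transform.h>

#include <cstdio>
#include <string>
#include <vector>

// Count delimiter-separated tokens in one row of a flattened strings column.
struct count_tokens_fn {
  char const* d_chars;   // all characters, concatenated
  int const* d_offsets;  // row i spans [d_offsets[i], d_offsets[i+1])
  __device__ int operator()(int row) const
  {
    int count     = 0;
    bool in_token = false;
    for (int i = d_offsets[row]; i < d_offsets[row + 1]; ++i) {
      bool const is_delim = (d_chars[i] == ' ');
      if (!is_delim && !in_token) { ++count; }
      in_token = !is_delim;
    }
    return count;
  }
};

int main()
{
  // Flattened strings column: ["hello world", "a b c", ""]
  std::string const chars = "hello worlda b c";
  std::vector<int> const offsets{0, 11, 16, 16};
  int const num_rows = static_cast<int>(offsets.size()) - 1;

  thrust::device_vector<char> d_chars(chars.begin(), chars.end());
  thrust::device_vector<int> d_offsets(offsets);

  // 1) Intermediate buffer holding the token count for every row
  thrust::device_vector<int> token_counts(num_rows);
  thrust::transform(thrust::counting_iterator<int>(0),
                    thrust::counting_iterator<int>(num_rows),
                    token_counts.begin(),
                    count_tokens_fn{d_chars.data().get(), d_offsets.data().get()});

  // 2) A scan converts the counts into output offsets (out_offsets[0] == 0)
  thrust::device_vector<int> out_offsets(num_rows + 1, 0);
  thrust::inclusive_scan(token_counts.begin(), token_counts.end(), out_offsets.begin() + 1);

  for (int i = 0; i <= num_rows; ++i) {
    std::printf("offset[%d] = %d\n", i, static_cast<int>(out_offsets[i]));
  }
  return 0;
}
```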

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@davidwendt added the 2 - In Progress (Currently a work in progress), libcudf (Affects libcudf (C++/CUDA) code), strings (strings issues (C++ and Python)), improvement (Improvement / enhancement to an existing function), and non-breaking (Non-breaking change) labels on Apr 17, 2025
@davidwendt self-assigned this on Apr 17, 2025
@copy-pr-bot

copy-pr-bot bot commented Apr 17, 2025

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@davidwendt
Contributor Author

/ok to test

@davidwendt
Contributor Author

davidwendt commented Apr 17, 2025

Benchmarks show a performance improvement from the specialized logic for smaller strings:

## [0] NVIDIA RTX A6000

| max_width | num_rows |   Ref Time |   Cmp Time |         Diff |   %Diff |
|-----------|----------|------------|------------|--------------|---------|
|    32     |  32768   | 194.769 us | 168.589 us |   -26.181 us | -13.44% |
|    64     |  32768   | 305.154 us | 245.225 us |   -59.929 us | -19.64% |
|    128    |  32768   | 573.142 us | 425.584 us |  -147.559 us | -25.75% |
|    256    |  32768   | 337.069 us | 338.036 us |     0.967 us |   0.29% |
|    32     |  262144  | 311.163 us | 300.832 us |   -10.331 us |  -3.32% |
|    64     |  262144  | 524.604 us | 498.134 us |   -26.469 us |  -5.05% |
|    128    |  262144  |   1.080 ms | 919.996 us |  -160.119 us | -14.82% |
|    256    |  262144  |   1.776 ms |   1.774 ms |    -2.098 us |  -0.12% |
|    32     | 2097152  |   1.471 ms |   1.424 ms |   -46.918 us |  -3.19% |
|    64     | 2097152  |   2.840 ms |   2.631 ms |  -208.705 us |  -7.35% |
|    128    | 2097152  |   6.575 ms |   5.292 ms | -1282.198 us | -19.50% |
|    256    | 2097152  |  13.569 ms |  13.521 ms |   -47.806 us |  -0.35% |
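
For reference, a hypothetical call site for the API being benchmarked here. The `load_vocabulary`/`tokenize_with_vocabulary` signatures below are written from memory of `nvtext/tokenize.hpp` and should be checked against the current header; the test `strings_column_wrapper` is used only to keep the example short.

```cpp
// Hypothetical call site for nvtext::tokenize_with_vocabulary; verify the
// exact signatures against the current nvtext/tokenize.hpp header.
#include <nvtext/tokenize.hpp>

#include <cudf/scalar/scalar.hpp>
#include <cudf/strings/strings_column_view.hpp>
#include <cudf_test/column_wrapper.hpp>  // test utility, used here for brevity

int main()
{
  // Vocabulary and input strings
  cudf::test::strings_column_wrapper vocabulary({"brown", "fox", "jumps", "the"});
  cudf::test::strings_column_wrapper input(
    {"the quick brown fox jumps over the lazy dog"});

  auto const vocab = nvtext::load_vocabulary(cudf::strings_column_view(vocabulary));

  // Tokens not found in the vocabulary map to the default id (-1 here)
  auto const delimiter = cudf::string_scalar(" ");
  auto const result    = nvtext::tokenize_with_vocabulary(
    cudf::strings_column_view(input), *vocab, delimiter, -1);

  // result is a LIST column of token ids, one list per input row
  return 0;
}
```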

@davidwendt
Contributor Author

/ok to test

@davidwendt
Contributor Author

/ok to test

@davidwendt added the 3 - Ready for Review (Ready for review by team) label and removed the 2 - In Progress (Currently a work in progress) label on Apr 25, 2025
@davidwendt
Contributor Author

/ok to test

@davidwendt
Contributor Author

/ok to test

@davidwendt marked this pull request as ready for review on April 29, 2025 16:39
@davidwendt requested a review from a team as a code owner on April 29, 2025 16:39
@davidwendt
Contributor Author

/merge

@rapids-bot[bot] merged commit a5ed0e6 into rapidsai:branch-25.06 on May 5, 2025
112 checks passed
@davidwendt deleted the vocab-tokenize-perf branch on May 5, 2025 15:10
vyasr added a commit to vyasr/cudf that referenced this pull request May 6, 2025

Labels

  • 3 - Ready for Review Ready for review by team
  • improvement Improvement / enhancement to an existing function
  • libcudf Affects libcudf (C++/CUDA) code.
  • non-breaking Non-breaking change
  • strings strings issues (C++ and Python)
