
Conversation

@thomasw21 (Member) commented on Jul 25, 2021

This allows us to leverage HF tokenizers' batch_encode method. I observed a 30% speedup on my Colab (with 2 workers ... so I don't know how it translates to c4 with 16/32 workers), so we might need to test with long runs? Also, #18 should be able to leverage this feature nicely, as each worker writes directly to disk.
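
For context, a minimal sketch of what batch encoding with an HF tokenizer can look like (not the PR's actual diff; the gpt2 tokenizer and the document list are placeholders):

```python
# Minimal sketch: batch-encode a chunk of documents instead of one encode()
# call per document. Tokenizer name and contents of `docs` are placeholders.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # hypothetical tokenizer choice

docs = ["first document ...", "second document ..."]  # one chunk of documents

# Calling a fast tokenizer on a list tokenizes the whole batch at once,
# which is where the per-worker speedup would come from.
batch = tokenizer(docs)
ids_per_doc = batch["input_ids"]  # list of token-id lists, one per document
```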

@thomasw21 requested a review from @TevenLeScao on Jul 25, 2021, 20:43
@stas00 (Contributor) commented on Jul 26, 2021

FWIW, testing this branch I see no difference in speed. Unless you meant it improves when used with #18?

@stas00 mentioned this pull request on Jul 26, 2021
@thomasw21 (Member, Author) commented:

I think it's clear that in the current case the tokenizer is not the bottleneck, i.e. otherwise adding workers would help. I'm hoping it's going to be faster with #18.

@thomasw21 (Member, Author) commented:

Okay, so the code makes no difference if the flag --split-sentences isn't set. I'll have to think a bit more about whether we want to improve this by batching on chunks of 25 documents. This also depends heavily on whether the tokenizer we end up using has an efficient batch implementation (maybe an HF tokenizer?). Since it's unclear, I'd advocate closing this PR and re-opening it when we start thinking about which tokenizer to use.
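
For illustration only, a rough sketch of the chunk-of-25 idea discussed above: group documents into fixed-size chunks and run one batched tokenizer call per chunk. The chunk size, tokenizer choice, and helper names are assumptions, not the PR's code.

```python
# Sketch: stream documents in chunks of 25 and batch-encode each chunk.
# Only pays off if the tokenizer has an efficient batch implementation.
from itertools import islice
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # placeholder tokenizer

def chunks(iterable, size=25):
    """Yield successive lists of `size` documents from an iterable."""
    it = iter(iterable)
    while chunk := list(islice(it, size)):
        yield chunk

def encode_in_chunks(documents):
    """Yield one token-id list per document, tokenizing chunk by chunk."""
    for chunk in chunks(documents, size=25):
        yield from tokenizer(chunk)["input_ids"]
```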

@thomasw21 closed this on Aug 1, 2021
adammoody pushed a commit to adammoody/Megatron-DeepSpeed that referenced this pull request Dec 20, 2021