
Conversation

@thomasw21 (Member) commented on Jul 25, 2021

This allows us to leverage HF tokenizers' batch_encode method. I observed a 30% speedup on my Colab (with 2 workers ... so I don't know how it translates to c4 with 16/32 workers), so we might need to test with long runs? Also, #18 should be able to leverage this feature nicely, as each worker writes directly to disk.
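
For context, a minimal sketch of what batch encoding with an HF tokenizer can look like (not the PR's actual diff; the gpt2 tokenizer and the document list are placeholders):

```python
# Minimal sketch: batch-encode a chunk of documents instead of one encode()
# call per document. Tokenizer name and contents of `docs` are placeholders.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # hypothetical tokenizer choice

docs = ["first document ...", "second document ..."]  # one chunk of documents

# Calling a fast tokenizer on a list tokenizes the whole batch at once,
# which is where the per-worker speedup would come from.
batch = tokenizer(docs)
ids_per_doc = batch["input_ids"]  # list of token-id lists, one per document
```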

@thomasw21 requested a review from @TevenLeScao on Jul 25, 2021, 20:43
@stas00 (Contributor) commented on Jul 26, 2021

FWIW, testing this branch I see no difference in speed. Unless you meant it improves when used with #18?

@stas00 mentioned this pull request on Jul 26, 2021
@thomasw21 (Member, Author) commented:

I think it's clear that in the current case the tokenizer is not the bottleneck, i.e. otherwise adding workers would help. I'm hoping it's going to be faster with #18.

@thomasw21 (Member, Author) commented:

Okay, so the code makes no difference if the flag --split-sentences isn't set. I'll have to think a bit more about whether we want to improve this by batching on chunks of 25 documents. This also depends heavily on whether the tokenizer we end up using has an efficient batch implementation (maybe an HF tokenizer?). Since it's unclear, I'd advocate closing this PR and re-opening it when we start thinking about which tokenizer to use.
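
For illustration only, a rough sketch of the chunk-of-25 idea discussed above: group documents into fixed-size chunks and run one batched tokenizer call per chunk. The chunk size, tokenizer choice, and helper names are assumptions, not the PR's code.

```python
# Sketch: stream documents in chunks of 25 and batch-encode each chunk.
# Only pays off if the tokenizer has an efficient batch implementation.
from itertools import islice
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # placeholder tokenizer

def chunks(iterable, size=25):
    """Yield successive lists of `size` documents from an iterable."""
    it = iter(iterable)
    while chunk := list(islice(it, size)):
        yield chunk

def encode_in_chunks(documents):
    """Yield one token-id list per document, tokenizing chunk by chunk."""
    for chunk in chunks(documents, size=25):
        yield from tokenizer(chunk)["input_ids"]
```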

@thomasw21 closed this on Aug 1, 2021
adammoody pushed a commit to adammoody/Megatron-DeepSpeed that referenced this pull request Dec 20, 2021