Hello, I'm trying to use the BlingFire tools to build a tokenization model for CLIP out of the existing vocab.json/merges.txt files available here: https://huggingface.co/openai/clip-vit-base-patch32/tree/main
I tried the same approach given for RoBERTa: https://github.com/microsoft/BlingFire/tree/master/ldbsrc/roberta However, the export_vocab script expects a Ġ prefix in the vocabulary, whereas CLIP's vocabulary uses </w> as a suffix, not a prefix. I modified the script to detect a trailing </w> instead of a leading Ġ when appending 0x2581 (https://github.com/microsoft/BlingFire/blob/master/ldbsrc/gpt2/export_vocab.py#L91), but this gives slightly different results than the Hugging Face tokenizer when dealing with punctuation:
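For illustration, here is a minimal sketch of the kind of vocab rewrite I mean (hypothetical helper, not the actual export_vocab.py change): tokens carrying CLIP's `</w>` end-of-word suffix get the suffix replaced with the U+2581 boundary marker. Note the semantic mismatch this exposes: GPT-2/RoBERTa's Ġ marks the *start* of a word, while CLIP's `</w>` marks the *end*, so the two conventions are not interchangeable one-to-one.

```python
# Hypothetical sketch: convert CLIP-style end-of-word vocab entries
# ("cat</w>") into entries that use U+2581 as an explicit boundary
# marker ("cat\u2581"), mirroring what export_vocab.py does for the
# Gdot prefix in GPT-2/RoBERTa vocabularies.

EOW = "</w>"          # CLIP's end-of-word suffix
BOUNDARY = "\u2581"   # LOWER ONE EIGHTH BLOCK, the marker BlingFire appends

def convert_clip_token(token: str) -> str:
    """Replace a trailing </w> with the U+2581 boundary marker."""
    if token.endswith(EOW):
        return token[: -len(EOW)] + BOUNDARY
    return token

if __name__ == "__main__":
    for t in ["cat</w>", "photo", ",</w>"]:
        print(repr(convert_clip_token(t)))
```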
Input string: "a photo of a really, functistaner big cat."
Hugging Face: [49406, 320, 1125, 539, 320, 1414, 267, 8679, 555, 2203, 528, 1205, 2368, 269, 49407]
BlingFire: 320 1125 539 320 1414 11 1499 66 555 2203 517 1205 2368 13
Is there some way to make BlingFire support the CLIP version of the tokenizer?
My current scripts and reproduction steps: https://github.com/dkalinowski/BlingFire/tree/clip/ldbsrc/clip