Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Switch to multimap based nfd_map due to compile time issues #5799

Merged
merged 3 commits into from
Mar 1, 2024

Conversation

iamlemec
Copy link
Collaborator

Fixes issues with #5740. Yields same tokenizer outcomes as before but brings compile time back to normal. Performance also seems to be unchanged. Though in testing I did notice that #5740 itself does appear to reduce perfomance substantially, so there is hopefully a better long-term solution here.

unicode.h Outdated Show resolved Hide resolved
@cebtenzzre
Copy link
Collaborator

cebtenzzre commented Feb 29, 2024

Compile times would likely be reduced by making unicode.h a proper .cpp file instead of a header full of static tables and functions - especially if you include the time to compile the tests.

The rust implementation that apage43 linked uses minimally perfect hash tables for faster lookup (all keys are guaranteed to have a different hash so collisions are impossible) - if lookup is a bottleneck, maybe it would be worth trying to implement something similar. (Somebody should make a flamegraph first to confirm - maybe I will if I have time.)

@iamlemec
Copy link
Collaborator Author

Turns out we were spending an inordinate amount of time creating std::locale("utf-8") with every single character in to_lower. Made it static and we're pretty close to the old performance now.

Copy link
Owner

@ggerganov ggerganov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's still not back to original level, but much better:

# before
real	0m5.506s
user	0m5.352s
sys	    0m0.101s

# after
real	0m6.396s
user	0m6.216s
sys	0m0.109s

@ggerganov ggerganov merged commit 9600d59 into ggerganov:master Mar 1, 2024
59 checks passed
hazelnutcloud pushed a commit to hazelnutcloud/llama.cpp that referenced this pull request Mar 10, 2024
* switch to multimap based nfd_map due to compile time issues

* simplify multimap keys

* dont construct new locale every time
jordankanter pushed a commit to jordankanter/llama.cpp that referenced this pull request Mar 13, 2024
* switch to multimap based nfd_map due to compile time issues

* simplify multimap keys

* dont construct new locale every time
hodlen pushed a commit to hodlen/llama.cpp that referenced this pull request Apr 1, 2024
* switch to multimap based nfd_map due to compile time issues

* simplify multimap keys

* dont construct new locale every time
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants