Improve BERT tokenization for accented characters and non-latin scripts #5740
Conversation
Much better, though there are still some differences:
--- good_tokens.txt 2024-02-26 16:40:56.425215339 -0500
+++ lcpp_tokens.txt 2024-02-26 16:40:56.757218189 -0500
@@ -200554,7 +200554,6 @@
1337: ष
29870: ##ल
29869: ##र
-29879: ##ो
29863: ##न
1317: ग
1000: "
@@ -200633,7 +200632,11 @@
29836: ##و
29817: ##ت
25573: ##ا
-100: [UNK]
+1282: س
+23673: ##ل
+29836: ##و
+15394: ##د
+29836: ##و
23856: kota
16183: sal
6784: ##ud
@@ -254648,7 +254651,6 @@
3747: influence
1997: of
2332: king
-1097: æ
10760: ##the
20850: ##lb
19058: ##ald
Also, can we format nfd_map better in the source code? At a quick glance, it doesn't even appear to be sorted.
Sorted the NFD map and capped at 200 chars width. As for the tokenization variances:
Other than that, seeing some other variances, such as …
The Unicode standard has a lot to say about this: https://unicode.org/faq/casemap_charprop.html#casemap
There must be an MIT-licensed case mapping implementation somewhere that we could use if we wanted to be more correct.
This is C++ source; why not use the standard library for casing operations? https://en.cppreference.com/w/cpp/locale/tolower
Excellent idea. Ok, so I'm going to use:
std::tolower((wchar_t)code, std::locale("en_US.UTF-8"))
Works on Unix as …
C++11 has char32_t, which I think does what you need.
For some reason, if I cast it to anything other than wchar_t it throws.
Apparently even as of C++20, char32_t is only implemented for the std::codecvt facet, not the std::ctype facet. std::use_facet will throw std::bad_cast if the locale does not provide the given facet, so that's what you are seeing.
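A small illustration of the facet point above (a hedged sketch, not code from the PR; it assumes the en_US.UTF-8 locale is installed, and the char32_t behavior is what libstdc++ exhibits as described above):

#include <iostream>
#include <locale>

int main() {
    // std::tolower(charT, const std::locale &) looks up std::ctype<charT> with
    // std::use_facet; locales are only required to provide std::ctype<char> and
    // std::ctype<wchar_t>, so the char32_t lookup typically throws std::bad_cast.
    std::locale loc("en_US.UTF-8");  // throws std::runtime_error if the locale is missing

    wchar_t w = std::tolower((wchar_t)0x00C7, loc);  // U+00C7 'Ç' -> expect U+00E7 'ç'
    std::wcout << std::hex << (unsigned)w << L"\n";

    try {
        char32_t c = std::tolower((char32_t)0x00C7, loc);  // needs std::ctype<char32_t>
        (void)c;
    } catch (const std::bad_cast &) {
        std::wcout << L"no std::ctype<char32_t> facet in this locale\n";
    }
    return 0;
}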
Co-authored-by: Jared Van Bortel <[email protected]>
This PR increased the compile time of llama.cpp:
time c++ -std=c++11 -fPIC -O3 -pthread -I. -Icommon -D_XOPEN_SOURCE=600 -D_DARWIN_C_SOURCE -DNDEBUG -c llama.cpp -o llama.o  # before
real 0m5.506s
user 0m5.352s
sys 0m0.101s
# after
real 0m29.161s
user 0m27.767s
sys 0m1.029s
We should fix this ASAP. The problem stems from the large NFD map.
Yup, seeing roughly 8s -> 20s on Linux. I suppose it's not happy about the map-vector setup. Looking into alternative data structures now.
Kinda tricky, but I have it working and back to normal compile times when using:

typedef std::pair<uint32_t, uint32_t> CodePos;

struct CodePosCompare {
    bool operator()(const CodePos & lhs, const CodePos & rhs) const {
        return lhs.first < rhs.first;
    }
};

static const std::multimap<CodePos, uint32_t, CodePosCompare> nfd_map = {
    {{0xC0, 0}, 0x41}, {{0xC0, 1}, 0x300}, {{0xC1, 0}, 0x41}, {{0xC1, 1}, 0x301}, ...
};

The docs say that insertion order is respected when items are equivalent, so the decomposition entries for each codepoint stay in the order they were written.
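A usage sketch building on the snippet above (the helper name is hypothetical, not from the PR): because CodePosCompare compares only the codepoint, all entries for one codepoint are equivalent keys, and equal_range returns them in insertion order, i.e. the full decomposition in sequence.

#include <cstdint>
#include <vector>

// hypothetical helper: append the NFD decomposition of cp, or cp itself if
// the table has no entry for it (assumes the nfd_map declared above)
static void append_nfd(std::vector<uint32_t> & out, uint32_t cp) {
    auto range = nfd_map.equal_range({cp, 0});  // second field is ignored by the comparator
    if (range.first == range.second) {
        out.push_back(cp);                      // no decomposition for this codepoint
        return;
    }
    for (auto it = range.first; it != range.second; ++it) {
        out.push_back(it->second);              // base char first, then combining marks
    }
}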
* implement nfd for stripping accents in wpm tokenizer
* sort nfd map; reuse iterator
* use builtin tolower
* add locale include
* Simplify to_lower cases

Co-authored-by: Jared Van Bortel <[email protected]>

---------

Co-authored-by: Jared Van Bortel <[email protected]>
}
return text2;
#endif
return std::tolower(wchar_t(code), std::locale("en_US.UTF-8"));
@iamlemec This depends on the system having the en_US.UTF-8 locale available, which is not always true in practice: nomic-ai/gpt4all#2160
Even if std::locale("") is available and not C, it could be e.g. tr_TR.UTF-8, which has different rules about how to uppercase/lowercase i and I.
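For illustration only (a hedged sketch, not part of the PR), the Turkish rule shows up directly through the standard library when that locale is installed; on glibc, tr_TR.UTF-8 lowercases 'I' to dotless 'ı' (U+0131) rather than 'i':

#include <iostream>
#include <locale>

int main() {
    try {
        std::locale tr("tr_TR.UTF-8");        // assumes the locale is installed
        wchar_t low = std::tolower(L'I', tr); // Turkish: 'I' lowercases to 'ı' (U+0131)
        std::wcout << std::hex << (unsigned)low << L"\n";  // typically prints 131 on glibc
    } catch (const std::runtime_error &) {
        std::wcout << L"tr_TR.UTF-8 locale is not available on this system\n";
    }
    return 0;
}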
Also, it's not safe to assume that lowercasing one letter at a time is the same as lowercasing the whole string. See the notes at cppreference and the special-cased entry for capital sigma in UnicodeData.
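To make the sigma case concrete (an illustrative sketch, not the PR's code): U+03A3 'Σ' lowercases to final 'ς' (U+03C2) at the end of a word and to 'σ' (U+03C3) elsewhere, which a per-character tolower cannot express.

#include <cstddef>
#include <cstdint>
#include <vector>

// crude illustration of the context-sensitive rule; the "end of word" test
// (next codepoint absent or outside the Greek block) is deliberately naive
static uint32_t lower_capital_sigma(const std::vector<uint32_t> & cps, size_t i) {
    bool word_final = (i + 1 == cps.size()) ||
                      !(cps[i + 1] >= 0x0370 && cps[i + 1] <= 0x03FF);
    return word_final ? 0x03C2 /* ς */ : 0x03C3 /* σ */;
}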
I believe this addresses most of the issues involved in #5496. Adds in a (big) table to unicode.h with NFD codepoint mappings for accented characters and uses this in tokenization. Also addresses some other issues regarding conversion to lowercase in Greek and Cyrillic, as well as punctuation padding.
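As a rough sketch of the idea (hypothetical names, not the functions this PR actually adds): decompose each codepoint with the NFD table, then drop combining marks so that only the base letters reach the WPM tokenizer. The range check below covers only the basic Combining Diacritical Marks block, which is enough for the Latin accents in question.

#include <cstdint>
#include <vector>

// U+0300 .. U+036F: Combining Diacritical Marks
static bool is_combining_mark(uint32_t cp) {
    return cp >= 0x0300 && cp <= 0x036F;
}

// input is already NFD-decomposed (e.g. via the nfd_map table); keep base
// letters and drop the accents
static std::vector<uint32_t> strip_accents(const std::vector<uint32_t> & decomposed) {
    std::vector<uint32_t> out;
    out.reserve(decomposed.size());
    for (uint32_t cp : decomposed) {
        if (!is_combining_mark(cp)) {
            out.push_back(cp);
        }
    }
    return out;
}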