Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve BERT tokenization for accented characters and non-latin scripts #5740

Merged
merged 5 commits into from
Feb 28, 2024

Conversation

iamlemec
Copy link
Collaborator

I believe this addresses most of the issues involved in #5496. Adds in a (big) table to unicode.h with NFD codepoint mappings for accented characters and uses this in tokenization. Also addresses some other issues regarding conversion to lowercase in greek and cyrillic as well as puncuation padding.

@cebtenzzre
Copy link
Collaborator

cebtenzzre commented Feb 26, 2024

Much better, though there are still some differences:

--- good_tokens.txt	2024-02-26 16:40:56.425215339 -0500
+++ lcpp_tokens.txt	2024-02-26 16:40:56.757218189 -0500
@@ -200554,7 +200554,6 @@
 1337: ष
 29870: ##ल
 29869: ##र
-29879: ##ो
 29863: ##न
 1317: ग
 1000: "
@@ -200633,7 +200632,11 @@
 29836: ##و
 29817: ##ت
 25573: ##ا
-100: [UNK]
+1282: س
+23673: ##ل
+29836: ##و
+15394: ##د
+29836: ##و
 23856: kota
 16183: sal
 6784: ##ud
@@ -254648,7 +254651,6 @@
 3747: influence
 1997: of
 2332: king
-1097: æ
 10760: ##the
 20850: ##lb
 19058: ##ald
  1. षेलुरोङ् from the Wikipedia page on Manila becomes षेलुरोन ् ग ् via wiki.test.raw mojibake. Not sure that it matters.
  2. كوتا سلودوڠ from the same page is partially ignored by the official implementation for some reason (probably a bug?)
  3. Æthelbald from the page on Tatwine, Æ is dropped instead of being correctly converted to lowercase. Also, shouldn't we prevent the next token from merging with the previous word in cases like this? (king thelbald would be better than kingthelbald)

@cebtenzzre
Copy link
Collaborator

Also, can we format nfd_map better in the source code? At a quick glance, it doesn't even appear to be sorted.

llama.cpp Outdated Show resolved Hide resolved
@iamlemec
Copy link
Collaborator Author

Sorted the NFD map and capped at 200 chars width. As for the tokenization variances:

  1. Looks like is an accent character and we properly drop it.
  2. Yeah, I think we're closer to correct here.
  3. Right now I'm just hard coding standard Latin, Greek, and Cyrillic letters for lowercasing. There doesn't seem to be any systematic list of lowercase-able letters I can find. Open to ideas here.

Other than that, seeing some other variances, such as Đ doesn't have an NFD entry and various Devangari cases, but overall looking good.

@cebtenzzre
Copy link
Collaborator

3. There doesn't seem to be any systematic list of lowercase-able letters I can find. Open to ideas here.

The Unicode standard has a lot to say about this: https://unicode.org/faq/casemap_charprop.html#casemap

There must be an MIT-licensed case mapping implementation somewhere that we could use if we wanted to be more correct.

@cebtenzzre
Copy link
Collaborator

This is C++ source, why not use the standard library for casing operations? https://en.cppreference.com/w/cpp/locale/tolower

@iamlemec
Copy link
Collaborator Author

Excellent idea. Ok, so I'm going to use:

std::tolower((wchar_t)code, std::locale("en_US.UTF-8"))

Works on Unix as wchar_t is 32 bits, but on Windows wchar_t is only 16 bits, so we just don't touch anything above 0xFFFF. There are a very small number of capital letters above 0xFFFF, but they don't appear to be from any scripts currently in wide use. Also checked the BGE and Nomic codepoints and none are in that range.

@cebtenzzre
Copy link
Collaborator

cebtenzzre commented Feb 27, 2024

Works on Unix as wchar_t is 32 bits, but on Windows wchar_t is only 16 bits, so we just don't touch anything above 0xFFFF.

C++11 has char32_t, which I think does what you need.

llama.cpp Show resolved Hide resolved
@iamlemec
Copy link
Collaborator Author

For some reason if I cast it to anything other than wchar_t, it will terminate on a std::bad_cast (on Linux). I don't have access to a Windows machine, so I can't really test different options there. Though the CI says the current version at least compiles.

@cebtenzzre
Copy link
Collaborator

cebtenzzre commented Feb 27, 2024

For some reason if I cast it to anything other than wchar_t, it will terminate on a std::bad_cast (on Linux).

Apparently even as of C++20, char32_t is only implemented for the std::codecvt facet, not the std::ctype facet. std::use_facet will throw std::bad_cast if the locale does not provide the given facet, so that's what you are seeing.

llama.cpp Outdated Show resolved Hide resolved
Co-authored-by: Jared Van Bortel <[email protected]>
@ggerganov ggerganov merged commit 177628b into ggerganov:master Feb 28, 2024
59 checks passed
@ggerganov
Copy link
Owner

@iamlemec

This PR increased the compile time for llama.cpp source file from 5s to 30s on M2 Ultra:

time c++ -std=c++11 -fPIC -O3 -pthread  -I. -Icommon -D_XOPEN_SOURCE=600 -D_DARWIN_C_SOURCE -DNDEBUG -c llama.cpp -o llama.o
# before
real	0m5.506s
user	0m5.352s
sys	    0m0.101s

# after
real	0m29.161s
user	0m27.767s
sys	    0m1.029s

We should fix this ASAP. The problem stems from the large NFD map

@iamlemec
Copy link
Collaborator Author

Yup, seeing roughly 8s -> 20s on Linux. I suppose it's not happy about the map-vector setup. Looking into alternative data structures now.

@iamlemec
Copy link
Collaborator Author

Kinda tricky, but I have it working and back to normal compile times when nfd_map is

typedef std::pair<uint32_t, uint32_t> CodePos;
struct CodePosCompare {
    bool operator()(const CodePos & lhs, const CodePos & rhs) const {
        return lhs.first < rhs.first;
    }
};
static const std::multimap<CodePos, uint32_t, CodePosCompare> nfd_map = {
    {{0xC0, 0}, 0x41}, {{0xC0, 1}, 0x300}, {{0xC1, 0}, 0x41}, {{0xC1, 1}, 0x301}, ...
};

The docs say that the insertion order is respected when items are equivalent, so equal_range will give the correct results.

jordankanter pushed a commit to jordankanter/llama.cpp that referenced this pull request Mar 13, 2024
* implement nfd for stripping accents in wpm tokenizer

* sort nfd map; reuse iterator

* use builtin tolower

* add locale include

* Simplify to_lower cases

Co-authored-by: Jared Van Bortel <[email protected]>

---------

Co-authored-by: Jared Van Bortel <[email protected]>
}
return text2;
#endif
return std::tolower(wchar_t(code), std::locale("en_US.UTF-8"));
Copy link
Collaborator

@cebtenzzre cebtenzzre Mar 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@iamlemec This depends on the system having the en_US.UTF-8 locale available, which is not always true in practice: nomic-ai/gpt4all#2160

Even if std::locale("") is available and not C, it could be e.g. tr_TR.UTF-8, which has different rules about how to uppercase/lowercase i and I.

Copy link
Collaborator

@cebtenzzre cebtenzzre Mar 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, it's not safe to assume that lowercasing one letter at a time is the same as lowercasing the whole string. See the notes at cppreference and the special-cased entry for capital sigma in UnicodeData.

hodlen pushed a commit to hodlen/llama.cpp that referenced this pull request Apr 1, 2024
* implement nfd for stripping accents in wpm tokenizer

* sort nfd map; reuse iterator

* use builtin tolower

* add locale include

* Simplify to_lower cases

Co-authored-by: Jared Van Bortel <[email protected]>

---------

Co-authored-by: Jared Van Bortel <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants