Improve BERT tokenization for accented characters and non-latin scripts #5740

iamlemec · 2024-02-26T21:04:28Z

I believe this addresses most of the issues involved in #5496. Adds in a (big) table to unicode.h with NFD codepoint mappings for accented characters and uses this in tokenization. Also addresses some other issues regarding conversion to lowercase in greek and cyrillic as well as puncuation padding.

cebtenzzre · 2024-02-26T21:59:28Z

Much better, though there are still some differences:

--- good_tokens.txt	2024-02-26 16:40:56.425215339 -0500
+++ lcpp_tokens.txt	2024-02-26 16:40:56.757218189 -0500
@@ -200554,7 +200554,6 @@
 1337: ष
 29870: ##ल
 29869: ##र
-29879: ##ो
 29863: ##न
 1317: ग
 1000: "
@@ -200633,7 +200632,11 @@
 29836: ##و
 29817: ##ت
 25573: ##ا
-100: [UNK]
+1282: س
+23673: ##ل
+29836: ##و
+15394: ##د
+29836: ##و
 23856: kota
 16183: sal
 6784: ##ud
@@ -254648,7 +254651,6 @@
 3747: influence
 1997: of
 2332: king
-1097: æ
 10760: ##the
 20850: ##lb
 19058: ##ald

षेलुरोङ् from the Wikipedia page on Manila becomes षेलुरोन ् ग ् via wiki.test.raw mojibake. Not sure that it matters.
كوتا سلودوڠ from the same page is partially ignored by the official implementation for some reason (probably a bug?)
Æthelbald from the page on Tatwine, Æ is dropped instead of being correctly converted to lowercase. Also, shouldn't we prevent the next token from merging with the previous word in cases like this? (king thelbald would be better than kingthelbald)

cebtenzzre · 2024-02-26T22:04:53Z

Also, can we format nfd_map better in the source code? At a quick glance, it doesn't even appear to be sorted.

llama.cpp

iamlemec · 2024-02-27T07:26:31Z

Sorted the NFD map and capped at 200 chars width. As for the tokenization variances:

Looks like ो is an accent character and we properly drop it.
Yeah, I think we're closer to correct here.
Right now I'm just hard coding standard Latin, Greek, and Cyrillic letters for lowercasing. There doesn't seem to be any systematic list of lowercase-able letters I can find. Open to ideas here.

Other than that, seeing some other variances, such as Đ doesn't have an NFD entry and various Devangari cases, but overall looking good.

cebtenzzre · 2024-02-27T16:42:13Z

3. There doesn't seem to be any systematic list of lowercase-able letters I can find. Open to ideas here.

The Unicode standard has a lot to say about this: https://unicode.org/faq/casemap_charprop.html#casemap

There must be an MIT-licensed case mapping implementation somewhere that we could use if we wanted to be more correct.

cebtenzzre · 2024-02-27T16:46:37Z

This is C++ source, why not use the standard library for casing operations? https://en.cppreference.com/w/cpp/locale/tolower

iamlemec · 2024-02-27T17:42:37Z

Excellent idea. Ok, so I'm going to use:

std::tolower((wchar_t)code, std::locale("en_US.UTF-8"))

Works on Unix as wchar_t is 32 bits, but on Windows wchar_t is only 16 bits, so we just don't touch anything above 0xFFFF. There are a very small number of capital letters above 0xFFFF, but they don't appear to be from any scripts currently in wide use. Also checked the BGE and Nomic codepoints and none are in that range.

cebtenzzre · 2024-02-27T18:16:33Z

Works on Unix as wchar_t is 32 bits, but on Windows wchar_t is only 16 bits, so we just don't touch anything above 0xFFFF.

C++11 has char32_t, which I think does what you need.

llama.cpp

iamlemec · 2024-02-27T18:45:48Z

For some reason if I cast it to anything other than wchar_t, it will terminate on a std::bad_cast (on Linux). I don't have access to a Windows machine, so I can't really test different options there. Though the CI says the current version at least compiles.

cebtenzzre · 2024-02-27T19:04:11Z

For some reason if I cast it to anything other than wchar_t, it will terminate on a std::bad_cast (on Linux).

Apparently even as of C++20, char32_t is only implemented for the std::codecvt facet, not the std::ctype facet. std::use_facet will throw std::bad_cast if the locale does not provide the given facet, so that's what you are seeing.

llama.cpp

Co-authored-by: Jared Van Bortel <[email protected]>

ggerganov · 2024-02-29T09:15:43Z

@iamlemec

This PR increased the compile time for llama.cpp source file from 5s to 30s on M2 Ultra:

time c++ -std=c++11 -fPIC -O3 -pthread  -I. -Icommon -D_XOPEN_SOURCE=600 -D_DARWIN_C_SOURCE -DNDEBUG -c llama.cpp -o llama.o

# before
real	0m5.506s
user	0m5.352s
sys	    0m0.101s

# after
real	0m29.161s
user	0m27.767s
sys	    0m1.029s

We should fix this ASAP. The problem stems from the large NFD map

iamlemec · 2024-02-29T14:24:00Z

Yup, seeing roughly 8s -> 20s on Linux. I suppose it's not happy about the map-vector setup. Looking into alternative data structures now.

iamlemec · 2024-02-29T17:38:20Z

Kinda tricky, but I have it working and back to normal compile times when nfd_map is

typedef std::pair<uint32_t, uint32_t> CodePos;
struct CodePosCompare {
    bool operator()(const CodePos & lhs, const CodePos & rhs) const {
        return lhs.first < rhs.first;
    }
};
static const std::multimap<CodePos, uint32_t, CodePosCompare> nfd_map = {
    {{0xC0, 0}, 0x41}, {{0xC0, 1}, 0x300}, {{0xC1, 0}, 0x41}, {{0xC1, 1}, 0x301}, ...
};

The docs say that the insertion order is respected when items are equivalent, so equal_range will give the correct results.

* implement nfd for stripping accents in wpm tokenizer * sort nfd map; reuse iterator * use builtin tolower * add locale include * Simplify to_lower cases Co-authored-by: Jared Van Bortel <[email protected]> --------- Co-authored-by: Jared Van Bortel <[email protected]>

cebtenzzre · 2024-03-25T17:06:12Z

llama.cpp

        }
-        return text2;
+#endif
+        return std::tolower(wchar_t(code), std::locale("en_US.UTF-8"));


@iamlemec This depends on the system having the en_US.UTF-8 locale available, which is not always true in practice: nomic-ai/gpt4all#2160

Even if std::locale("") is available and not C, it could be e.g. tr_TR.UTF-8, which has different rules about how to uppercase/lowercase i and I.

Also, it's not safe to assume that lowercasing one letter at a time is the same as lowercasing the whole string. See the notes at cppreference and the special-cased entry for capital sigma in UnicodeData.

* implement nfd for stripping accents in wpm tokenizer * sort nfd map; reuse iterator * use builtin tolower * add locale include * Simplify to_lower cases Co-authored-by: Jared Van Bortel <[email protected]> --------- Co-authored-by: Jared Van Bortel <[email protected]>

implement nfd for stripping accents in wpm tokenizer

9c996e3

cebtenzzre reviewed Feb 26, 2024

View reviewed changes

llama.cpp Outdated Show resolved Hide resolved

sort nfd map; reuse iterator

2dd36d6

ggerganov approved these changes Feb 27, 2024

View reviewed changes

ggerganov requested a review from cebtenzzre February 27, 2024 14:13

cebtenzzre approved these changes Feb 27, 2024

View reviewed changes

use builtin tolower

6b33a09

cebtenzzre reviewed Feb 27, 2024

View reviewed changes

llama.cpp Show resolved Hide resolved

add locale include

9242cf1

cebtenzzre reviewed Feb 27, 2024

View reviewed changes

llama.cpp Outdated Show resolved Hide resolved

Simplify to_lower cases

801abe5

Co-authored-by: Jared Van Bortel <[email protected]>

cebtenzzre approved these changes Feb 27, 2024

View reviewed changes

ggerganov merged commit 177628b into ggerganov:master Feb 28, 2024
59 checks passed

hiepxanh mentioned this pull request Feb 29, 2024

BERT wordpiece tokenizer differers from official HF implementation #5496

Closed

ggerganov mentioned this pull request Feb 29, 2024

Attempt to fix android build #5752

Merged

iamlemec mentioned this pull request Feb 29, 2024

Switch to multimap based nfd_map due to compile time issues #5799

Merged

phymbert mentioned this pull request Mar 6, 2024

server: tests: embeddings use a real embeddings model #5908

Merged

cebtenzzre reviewed Mar 25, 2024

View reviewed changes

cebtenzzre mentioned this pull request Mar 25, 2024

wpm : portable unicode tolower #6305

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improve BERT tokenization for accented characters and non-latin scripts #5740

Improve BERT tokenization for accented characters and non-latin scripts #5740

iamlemec commented Feb 26, 2024

cebtenzzre commented Feb 26, 2024 •

edited

Loading

cebtenzzre commented Feb 26, 2024

iamlemec commented Feb 27, 2024

cebtenzzre commented Feb 27, 2024

cebtenzzre commented Feb 27, 2024

iamlemec commented Feb 27, 2024

cebtenzzre commented Feb 27, 2024 •

edited

Loading

iamlemec commented Feb 27, 2024

cebtenzzre commented Feb 27, 2024 •

edited

Loading

ggerganov commented Feb 29, 2024

iamlemec commented Feb 29, 2024

iamlemec commented Feb 29, 2024

cebtenzzre Mar 25, 2024 •

edited

Loading

cebtenzzre Mar 25, 2024 •

edited

Loading

Improve BERT tokenization for accented characters and non-latin scripts #5740

Improve BERT tokenization for accented characters and non-latin scripts #5740

Conversation

iamlemec commented Feb 26, 2024

cebtenzzre commented Feb 26, 2024 • edited Loading

cebtenzzre commented Feb 26, 2024

iamlemec commented Feb 27, 2024

cebtenzzre commented Feb 27, 2024

cebtenzzre commented Feb 27, 2024

iamlemec commented Feb 27, 2024

cebtenzzre commented Feb 27, 2024 • edited Loading

iamlemec commented Feb 27, 2024

cebtenzzre commented Feb 27, 2024 • edited Loading

ggerganov commented Feb 29, 2024

iamlemec commented Feb 29, 2024

iamlemec commented Feb 29, 2024

cebtenzzre Mar 25, 2024 • edited Loading

Choose a reason for hiding this comment

cebtenzzre Mar 25, 2024 • edited Loading

Choose a reason for hiding this comment

cebtenzzre commented Feb 26, 2024 •

edited

Loading

cebtenzzre commented Feb 27, 2024 •

edited

Loading

cebtenzzre commented Feb 27, 2024 •

edited

Loading

cebtenzzre Mar 25, 2024 •

edited

Loading

cebtenzzre Mar 25, 2024 •

edited

Loading