This package tokenizes (splits) words, sentences and graphemes, based on Unicode text segmentation (UAX #29), for Unicode version 15.0.0.
Any time our code operates on individual words, we are tokenizing. Often, we do it ad hoc, such as splitting on spaces, which gives inconsistent results. The Unicode standard is better: it is multi-lingual, and handles punctuation, special characters, etc.
```shell
dotnet add package UAX29
```
```csharp
using UAX29;
using System.Text;

var example = "Hello, 🌏 world. 你好,世界.";

// The tokenizer can split words, graphemes or sentences.
// It operates on strings, UTF-8 bytes, and streams.
var words = Split.Words(example);

// Iterate over the tokens
foreach (var word in words)
{
    // word is ReadOnlySpan<char>
    // If you need it back as a string:
    Console.WriteLine(word.ToString());
}

/*
Hello
,
🌏
world
.
你
好
,
世
界
.
*/
```
```csharp
var utf8bytes = Encoding.UTF8.GetBytes(example);
var graphemes = Split.Graphemes(utf8bytes);

// Iterate over the tokens
foreach (var grapheme in graphemes)
{
    // grapheme is a ReadOnlySpan<byte> of UTF-8 bytes
    // If you need it back as a string:
    var s = Encoding.UTF8.GetString(grapheme);
    Console.WriteLine(s);
}

/*
H
e
l
l
o
,
🌏
w
o
r
l
d
.
你
好
,
世
界
.
*/
```
There are also optional extension methods in the spirit of `string.Split`:

```csharp
using UAX29.Extensions;

example.SplitWords();
```

For UTF-8 bytes, pass `byte[]`, `Span<byte>` or `Stream`; the resulting tokens will be `ReadOnlySpan<byte>`.

For strings/chars, pass `string`, `char[]`, `Span<char>` or `TextReader`/`StreamReader`; the resulting tokens will be `ReadOnlySpan<char>`.

If you have `Memory<byte|char>`, pass `Memory.Span`.
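As a sketch of the byte-based path, here is stream input tokenized into UTF-8 byte spans. The overload shape follows the description above (a `Stream` in, `ReadOnlySpan<byte>` tokens out) but is not verified against the library:

```csharp
using System;
using System.IO;
using System.Text;
using UAX29;

// A sketch of the stream path: tokens from byte-based inputs are
// ReadOnlySpan<byte>, decoded back to strings here for display.
using var stream = new MemoryStream(Encoding.UTF8.GetBytes("Hello, world"));
var words = Split.Words(stream);
foreach (var word in words)
{
    Console.WriteLine(Encoding.UTF8.GetString(word));
}
```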
We use the official Unicode test suites.
This is the same spec that is implemented in Lucene's StandardTokenizer.
When tokenizing words, I get around 120 MB/s on my MacBook M2. For typical text, that's around 30 million tokens/s. Benchmarks
The tokenizer is implemented as a `ref struct`, so you should see zero allocations for static text such as `byte[]` or `string`/`char`.
Calling `Split.Words` returns a lazy enumerator, and will not allocate per token. There are `ToList` and `ToArray` methods for convenience, which will allocate.
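The allocation trade-off above can be made explicit. In this sketch, the `foreach` uses the lazy, allocation-free enumerator, and each `ToString()` call is a deliberate per-token allocation, presumably similar to what `ToList`/`ToArray` would do for you:

```csharp
using System.Collections.Generic;
using UAX29;

// The foreach itself allocates nothing per token; converting each
// span to a string is an explicit, per-token allocation.
var materialized = new List<string>();
foreach (var word in Split.Words("One two three"))
{
    materialized.Add(word.ToString()); // allocation happens only here
}
```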
For `Stream` or `TextReader`/`StreamReader`, a buffer needs to be allocated behind the scenes. You can specify its size when calling `Split.Words`. You can also optionally pass your own `byte[]` or `char[]` to do your own allocation, perhaps with `ArrayPool`. Or, you can re-use the buffer by calling `SetStream` on an existing tokenizer, which will avoid re-allocation.
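A sketch of the buffer re-use path described above. I'm assuming the value returned by `Split.Words` is the tokenizer that exposes `SetStream`, as the text implies:

```csharp
using System;
using System.IO;
using System.Text;
using UAX29;

// Tokenize one stream, then point the same tokenizer (and its
// internal buffer) at another, avoiding a second buffer allocation.
var first = new MemoryStream(Encoding.UTF8.GetBytes("Hello, world"));
var tokens = Split.Words(first);
foreach (var token in tokens)
{
    Console.WriteLine(Encoding.UTF8.GetString(token));
}

var second = new MemoryStream(Encoding.UTF8.GetBytes("Goodbye, world"));
tokens.SetStream(second); // re-uses the existing buffer
foreach (var token in tokens)
{
    Console.WriteLine(Encoding.UTF8.GetString(token));
}
```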
Pass `Options.OmitWhitespace` if you would like whitespace-only tokens not to be returned (for words only).
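For example (assuming the option is passed as an extra argument to `Split.Words`; the exact parameter position isn't shown here):

```csharp
using System;
using UAX29;

// Sketch: with Options.OmitWhitespace, whitespace-only tokens are
// skipped, leaving "Hello", ",", "world". Argument position assumed.
foreach (var word in Split.Words("Hello, world", Options.OmitWhitespace))
{
    Console.WriteLine(word.ToString());
}
```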
The tokenizer expects valid (decodable) UTF-8 bytes or UTF-16 chars as input. We make an effort to ensure that all bytes will be returned even if invalid, i.e. to be lossless in any case, though the resulting tokenization may not be useful. Garbage in, garbage out.
Renamed methods:

`Tokenizer.GetWords(input)` → `Split.Words(input)`

Renamed package, namespace and methods:

`dotnet add package uax29.net` → `dotnet add package UAX29`

`using uax29` → `using UAX29`

`Tokenizer.Create(input)` → `Tokenizer.GetWords(input)`

`Tokenizer.Create(input, TokenType.Graphemes)` → `Tokenizer.GetGraphemes(input)`
I previously implemented this for Go.
The .NET Core standard library has a similar enumerator for graphemes, `StringInfo.GetTextElementEnumerator`.
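For comparison, a minimal example of the standard library's enumerator:

```csharp
using System;
using System.Globalization;

// StringInfo.GetTextElementEnumerator iterates the text elements
// (grapheme clusters) of a string.
var enumerator = StringInfo.GetTextElementEnumerator("Hello, 🌏");
while (enumerator.MoveNext())
{
    Console.WriteLine(enumerator.Current); // one text element per line
}
```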