Releases: isi-nlp/ulf-tokenizer
ulf-tokenizer version 1.3.10
Changes in version 1.3.10:
- Handles Georgian text.
- Normalizes non-standard spaces to ASCII space.
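
The space normalization above can be illustrated with a short Python sketch. This is not the tokenizer's own code; `normalize_spaces` is a hypothetical helper, and it assumes that "non-standard spaces" means the characters in Unicode general category Zs (no-break space, em space, ideographic space, and so on).

```python
import unicodedata


def normalize_spaces(text: str) -> str:
    """Replace every Unicode space separator (category Zs) with an ASCII space.

    Hypothetical helper for illustration only. Tabs and newlines are control
    characters (category Cc), so they are left untouched.
    """
    return "".join(
        " " if unicodedata.category(ch) == "Zs" else ch
        for ch in text
    )


if __name__ == "__main__":
    # NO-BREAK SPACE, EM SPACE, and IDEOGRAPHIC SPACE all become U+0020.
    print(normalize_spaces("a\u00a0b\u2003c\u3000d"))
```

Whether the tokenizer also rewrites tabs or zero-width characters is not stated in the release notes, so the sketch leaves them alone.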
Yet more Cyrillic improvements
Changes in version 1.3.9:
- More improvements in handling of Cyrillic text,
especially punctuation at the start and end of words,
and splitting of run-together Cyrillic words such as сізден.Алдын (using capitalization).
- Better handling of the numero sign, middle dot, and bullet.
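
The capitalization-based split noted for version 1.3.9 might look roughly like the following Python sketch. It is an assumption about the rule, not the tokenizer's actual logic: `split_on_capital` is a hypothetical helper that separates a period only when a lowercase letter precedes it and an uppercase letter follows it.

```python
def split_on_capital(token: str) -> str:
    """Put spaces around a period that sits between a lowercase letter
    and an uppercase letter (hypothetical rule, for illustration only)."""
    pieces = []
    for i, ch in enumerate(token):
        inside = 0 < i < len(token) - 1
        if ch == "." and inside and token[i - 1].islower() and token[i + 1].isupper():
            pieces.append(" . ")
        else:
            pieces.append(ch)
    return "".join(pieces)


if __name__ == "__main__":
    print(split_on_capital("сізден.Алдын"))  # period is split off
    print(split_on_capital("example.com"))   # no capital follows, so unchanged
```

Using `str.islower` and `str.isupper` keeps the check Unicode-aware, so it applies to Cyrillic and Latin letters alike.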
More Cyrillic improvements: name initials; number.Word
- Н.И.Вавлов → Н . И . Вавлов
- 1.Жер → 1 . Жер
- VI.Үйге → VI . Үйге
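
The three examples above can be reproduced with a small Python sketch. It is a minimal illustration under stated assumptions, not the tokenizer's implementation: `split_initials` and `is_short_prefix` are hypothetical names, and the rule assumed here is that a token is split at its periods when every piece before the last is a single uppercase letter, a number, or a Roman numeral, and the last piece starts with an uppercase letter.

```python
import re

# Roman numerals are matched loosely as runs of the numeral letters.
ROMAN = re.compile(r"[IVXLCDM]+")


def is_short_prefix(piece: str) -> bool:
    """True for a single uppercase letter, a run of digits, or a Roman numeral."""
    return (
        (len(piece) == 1 and piece.isupper())
        or piece.isdigit()
        or ROMAN.fullmatch(piece) is not None
    )


def split_initials(token: str) -> str:
    """Split 'initial.initial.Name' and 'number.Word' tokens at their periods."""
    parts = token.split(".")
    # Require at least one period and no empty pieces (e.g. a trailing period).
    if len(parts) < 2 or not all(parts):
        return token
    head, last = parts[:-1], parts[-1]
    if all(is_short_prefix(p) for p in head) and last[0].isupper():
        return " . ".join(parts)
    return token


if __name__ == "__main__":
    for tok in ["Н.И.Вавлов", "1.Жер", "VI.Үйге", "3.14"]:
        print(tok, "->", split_initials(tok))
```

Requiring an uppercase letter after the last period is what keeps decimals such as 3.14 intact in this sketch.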
better Cyrillic text tokenization
New version 1.3.7:
- Better handling of Cyrillic text, especially hyphenated tokens.
- Better handling of some em/en-dashes, replacement character at beginning or end of token.
v1.3.6
gitignore