Releases: isi-nlp/ulf-tokenizer

ulf-tokenizer version 1.3.10

24 Apr 04:56

Changes in version 1.3.10:

  • Handles Georgian text.
  • Normalizes non-standard spaces to ASCII space.
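
As an illustration of the space normalization noted above, here is a minimal Python sketch. It is not the tokenizer's own code: the function name `normalize_spaces` is hypothetical, and it assumes that "non-standard spaces" means the Unicode space separators (general category Zs) other than U+0020. The tokenizer's actual character set may differ.

```python
import re

# Assumption: "non-standard spaces" are the Unicode Zs characters other than
# the ASCII space: NO-BREAK SPACE, OGHAM SPACE MARK, U+2000..U+200A,
# NARROW NO-BREAK SPACE, MEDIUM MATHEMATICAL SPACE, IDEOGRAPHIC SPACE.
NON_STANDARD_SPACES = re.compile("[\u00A0\u1680\u2000-\u200A\u202F\u205F\u3000]")


def normalize_spaces(text: str) -> str:
    """Replace each non-standard space character with one ASCII space."""
    return NON_STANDARD_SPACES.sub(" ", text)


if __name__ == "__main__":
    sample = "a\u00A0b\u2003c"  # NO-BREAK SPACE and EM SPACE between letters
    print(repr(normalize_spaces(sample)))
```

Each character is replaced one-for-one, so runs of spaces are preserved rather than collapsed.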

Yet more Cyrillic improvements

06 Dec 06:48

Changes in version 1.3.9:

  • More improvements in the handling of Cyrillic text,
    especially punctuation at the start and end of words,
    and splitting of Cyrillic words such as сізден.Алдын (using capitalization).
  • Better handling of the numero sign (№), middle dot (·), and bullet (•).
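
The capitalization-based split can be sketched in Python as follows. This is an assumption-laden illustration, not the tokenizer's implementation: the helper names are hypothetical, "Cyrillic" is approximated by the basic Cyrillic block (U+0400 to U+04FF), and the output format with a spaced period is assumed by analogy with the examples in the other release notes.

```python
import re

# A period with one word character on each side.
PERIOD_BETWEEN = re.compile(r"(\w)\.(\w)")


def is_cyrillic(ch: str) -> bool:
    """Approximation: treat the basic Cyrillic block as 'Cyrillic'."""
    return "\u0400" <= ch <= "\u04FF"


def split_on_capital(token: str) -> str:
    """Split at a period that joins a lowercase and an uppercase Cyrillic letter."""
    def repl(match: re.Match) -> str:
        left, right = match.group(1), match.group(2)
        if (is_cyrillic(left) and is_cyrillic(right)
                and left.islower() and right.isupper()):
            return f"{left} . {right}"
        return match.group(0)  # leave other periods untouched
    return PERIOD_BETWEEN.sub(repl, token)


if __name__ == "__main__":
    print(split_on_capital("сізден.Алдын"))
```

Requiring an uppercase letter after the period is what keeps strings such as `example.com` intact in this sketch.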

More Cyrillic improvements: name initials; number.Word

01 Dec 19:08

  • Н.И.Вавлов → Н . И . Вавлов
  • 1.Жер → 1 . Жер
  • VI.Үйге → VI . Үйге
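
The three examples above can be reproduced with a short Python sketch. It is only an illustration under stated assumptions, not the tokenizer's actual rules: the function name `split_initials_and_numbers` is hypothetical, and the sketch splits only when every period-separated prefix is a single uppercase Cyrillic letter, a run of digits, or a Roman numeral, and the final part starts with an uppercase Cyrillic letter.

```python
import re

# Assumed prefix shapes: digits, a Roman numeral, or one Cyrillic letter.
PREFIX = re.compile(r"^(?:\d+|[IVXLCDM]+|[\u0400-\u04FF])$")


def split_initials_and_numbers(token: str) -> str:
    """Separate periods in tokens such as initials+surname or number.Word."""
    parts = token.split(".")
    if len(parts) < 2 or not all(parts):
        return token  # no period, or an empty piece such as a trailing period
    first = parts[-1][0]
    if not ("\u0400" <= first <= "\u04FF" and first.isupper()):
        return token  # the last part must start with an uppercase Cyrillic letter
    for prefix in parts[:-1]:
        if not (PREFIX.match(prefix) and (prefix.isupper() or prefix.isdigit())):
            return token
    return " . ".join(parts)


if __name__ == "__main__":
    for tok in ["Н.И.Вавлов", "1.Жер", "VI.Үйге", "3.14"]:
        print(tok, "->", split_initials_and_numbers(tok))
```

Checking the part after the last period keeps decimal numbers such as `3.14` unsplit in this sketch.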

better Cyrillic text tokenization

30 Nov 08:11

New version 1.3.7:

  • Better handling of Cyrillic text, especially hyphenated tokens.
  • Better handling of some em/en-dashes, replacement character at beginning or end of token.

v1.3.6

14 Feb 00:21
gitignore