Hello, Tokenex!

# Tokenex - An extended cross-language, cross-platform, flexible tokenizer

Thang Duong (issues: e-mail me [email protected])

# tl;dr: Here are some selling points:

Configurable. It can remove or keep whitespace, apply or skip regex substitutions, and, in the near future, use your own list of words and regexes.
By default, it applies some simple regex substitutions. For example, numbers and ordinals are replaced with placeholder tokens.
Works in C++, Python, C#.
Has some tools to tokenize training data.
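
To illustrate the kind of default substitution described above, here is a minimal sketch in plain Python. It does not use Tokenex itself, and the placeholder names `<ord>` and `<num>` are assumptions made for this example; the placeholders Tokenex actually emits may differ.

```python
import re

# Hypothetical placeholder names; Tokenex's real placeholders may differ.
ORDINAL = re.compile(r"\b\d+(?:st|nd|rd|th)\b")
NUMBER = re.compile(r"\b\d+(?:\.\d+)?\b")


def normalize(text: str) -> str:
    """Replace ordinals first, then plain numbers, with placeholder tokens."""
    text = ORDINAL.sub("<ord>", text)
    return NUMBER.sub("<num>", text)


print(normalize("The 3rd release shipped in 2018"))
# prints: The <ord> release shipped in <num>
```

Ordinals are handled before numbers so that the ordinal pattern sees the original digits.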

Hello, Tokenex!

Download v0.1 package here: https://dlwteam.visualstudio.com/6894e6c5-e313-4afd-9c46-95fbfd40a7a8/_api/_versioncontrol/itemContent?repositoryId=e6217287-ad4a-452a-b169-a75286a95971&path=%2Fdist%2Ftokenex-0.1-py3-none-any.whl&version=GBmaster&contentOnly=false&__v=5

Go to where you downloaded it and run:

sudo pip3 install tokenex-0.1-py3-none-any.whl

For Python3

Using the default config

from tokenex.tokenizer import Tokenizer
t = Tokenizer()                                # default config
t.tokenize('hello, world')

Keep space and don't use regex

from tokenex.tokenizer import Tokenizer
t = Tokenizer(3)                               # config option 3 keeps everything
t.tokenize('hello, world')

## Everything below here requires checking out the repo

To play with this:

Shell Test App

cd tokenex
./test "hello===world! It is about to be 2018! We are one year closer to 2020. I hope the 2020s will be better than 2010s"

Python

cd tokenex
python test.py "hello===world! It is about to be 2018! We are one year closer to 2020. I hope the 2020s will be better than 2010s"

C# Test App

cd tokenex
tokenizercsTest.exe "hello===world! It is about to be 2018! We are one year closer to 2020. I hope the 2020s will be better than 2010s"

The output should look something like this:

hello === world ! It is about to be ! We are one year closer to . I hope the will be better than
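
The splitting visible in that output (words separated from runs of punctuation, whitespace dropped) can be approximated with a single regular expression. This is only a sketch of the idea, not Tokenex's implementation; the name `split_tokens` is hypothetical, and the sketch performs no number substitution.

```python
import re
from typing import List

# A token is either a run of word characters or a run of punctuation.
# Whitespace matches neither alternative, so it is dropped.
TOKEN = re.compile(r"\w+|[^\w\s]+")


def split_tokens(text: str) -> List[str]:
    """Split text into word tokens and punctuation-run tokens."""
    return TOKEN.findall(text)


print(" ".join(split_tokens("hello===world! It is about to be 2018!")))
# prints: hello === world ! It is about to be 2018 !
```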

How to use the Python API?

Add the ./tokenex directory to your PYTHONPATH environment variable.

(in Linux)

export PYTHONPATH=$PYTHONPATH:./tokenex

(in Windows)

set PYTHONPATH=%PYTHONPATH%;.\tokenex
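
If changing the environment is inconvenient, the same effect can usually be had from inside a script by extending `sys.path` before the import. This is a generic Python technique rather than something the repo documents, and the `./tokenex` path assumes the script runs from the repo's base directory.

```python
import os
import sys

# Assumed layout: ./tokenex is the compiled-output directory of the checkout,
# relative to the current working directory.
tokenex_dir = os.path.abspath("./tokenex")
if tokenex_dir not in sys.path:
    sys.path.append(tokenex_dir)

# After this, `import tokenizer` should resolve as it does with PYTHONPATH set.
```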

See ./tokenex/test.py for an example.

To use the default tokenizer settings:

import tokenizer as t
tok = t.Tokenizer(0)                           # config option 0 is standard
tokens = tok.tokenize("hello, world!")
print(' '.join(tokens))

This configuration discards whitespace (spaces, \r, \n, \t, etc.) and automatically transliterates UTF-8 text to ASCII.
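
For readers unfamiliar with transliteration to ASCII, the following sketch shows one common approach using only the Python standard library. It is an illustration, not Tokenex's method; the function name `to_ascii` is hypothetical, and Tokenex may map characters differently.

```python
import unicodedata


def to_ascii(text: str) -> str:
    """Decompose accented characters, then drop anything that is not ASCII."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


print(to_ascii("café naïve"))
# prints: cafe naive
```

Characters with no ASCII decomposition are silently dropped by this approach, so a production transliterator typically needs an explicit mapping table as well.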

To just tokenize and keep everything unchanged:

import tokenizer as t
tok = t.Tokenizer(3)                           # config option 3 keeps everything
tokens = tok.tokenize("hello, world!")
print(' '.join(tokens))
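
One way to picture "keep everything" is a tokenizer whose tokens concatenate back to the original input. The sketch below makes whitespace runs into tokens of their own; whether Tokenex's option 3 represents whitespace this way is an assumption, and `split_keep_all` is a hypothetical name.

```python
import re
from typing import List

# Whitespace runs, word runs, and punctuation runs together cover every
# character, so no input is lost.
TOKEN_OR_SPACE = re.compile(r"\s+|\w+|[^\w\s]+")


def split_keep_all(text: str) -> List[str]:
    """Tokenize without discarding or rewriting any character."""
    return TOKEN_OR_SPACE.findall(text)


text = "hello,  world!\n"
tokens = split_keep_all(text)
print(tokens)
# prints: ['hello', ',', '  ', 'world', '!', '\n']
print("".join(tokens) == text)
# prints: True
```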

For additional details, see the class definition in tokenex/tokenizer.py.

How to use the C# API?

Reference the tokenizerCS package (.\tokenex\tokenizerCS.dll) and import its namespace:

using Tokenizex;

Then:

Tokenizer tok = new Tokenizer(0);           // change this config option to 3 to keep everything
var result = tok.TokenizeEx("hello, world!");
foreach (var token in result)
        System.Console.WriteLine("{0}", token);

How to tokenize a large txt training data file?

Use the program .\tokenex\tokenize.exe (or ./tokenex/tokenize for Linux).

In Linux (and it should be similar in Windows):

./tokenex/tokenize >
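
The same job can be sketched in Python by streaming the file one line at a time, so memory use stays flat regardless of file size. This is not the `tokenize` program; `tokenize_stream` and `regex_tokenize` are hypothetical names, and the regex tokenizer is a stand-in for whatever tokenizer callable you pass in.

```python
import io
import re
from typing import Callable, Iterable, List, TextIO


def tokenize_stream(
    lines: Iterable[str],
    out: TextIO,
    tokenize: Callable[[str], List[str]],
) -> int:
    """Write one space-joined line of tokens per input line; return the line count."""
    count = 0
    for line in lines:
        out.write(" ".join(tokenize(line)) + "\n")
        count += 1
    return count


def regex_tokenize(line: str) -> List[str]:
    """Stand-in tokenizer: words and punctuation runs, whitespace dropped."""
    return re.findall(r"\w+|[^\w\s]+", line)


# In-memory demo; with real data, pass file objects opened with encoding="utf-8".
src = io.StringIO("hello, world!\nsecond line.\n")
dst = io.StringIO()
count = tokenize_stream(src, dst, regex_tokenize)
print(dst.getvalue(), end="")
# prints:
# hello , world !
# second line .
```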

Relevant subdirectories:

libtokenizer: The core tokenization code. When compiled, it generates a .lib on Windows and a .a on Linux.
tokenex: A snapshot of compiled output for both Windows and Linux.
tokenize: A program to tokenize text files.
tokenizer: A wrapper around libtokenizer that builds a dynamic library (.dll on Windows, .so on Linux).
tokenizercs: A C# wrapper around the tokenizer dynamic library.
tokenizercsTest: A C# test app demonstrating how to use tokenizercs.

Building in Windows

Open tokenex.sln in Visual Studio and build all projects.

Building in Linux

(assuming you're already in the base directory)

make

Then check the ./bin directory for the output files.
