A tokenizer for Go based on dictionary and Bigram language models. (Currently only Chinese segmentation is supported.)
I wanted a simple tokenizer with no unnecessary overhead that uses only the standard library, follows good practices, and is well tested.
- Supports Maximum Matching (illustrated by the sketch after this list)
- Supports Minimum Matching
- Supports Reverse Maximum Matching
- Supports Reverse Minimum Matching
- Supports Bidirectional Maximum Matching
- Supports Bidirectional Minimum Matching
- Supports stop-token filtering
- Supports custom word filters
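For intuition about these strategies, here is a minimal, self-contained sketch of forward maximum matching (the first item above) against a toy dictionary. It illustrates the algorithm family only; the `maxMatch` helper is hypothetical and is not gotokenizer's implementation. Minimum matching takes the shortest dictionary hit instead, the reverse variants scan from the end of the text, and the bidirectional variants run both directions and pick the better segmentation.

```go
package main

import "fmt"

// maxMatch is a toy illustration of forward maximum matching: at each
// position, take the longest dictionary word that matches, falling back
// to a single rune when nothing matches. This sketches the algorithm
// family only; it is NOT gotokenizer's implementation.
func maxMatch(text string, dict map[string]bool, maxLen int) []string {
	runes := []rune(text)
	var tokens []string
	for i := 0; i < len(runes); {
		n := maxLen
		if rest := len(runes) - i; rest < n {
			n = rest
		}
		// Try the longest candidate first, shrinking until a hit.
		matched := 1
		for l := n; l > 1; l-- {
			if dict[string(runes[i:i+l])] {
				matched = l
				break
			}
		}
		tokens = append(tokens, string(runes[i:i+matched]))
		i += matched
	}
	return tokens
}

func main() {
	dict := map[string]bool{"分词": true, "分词器": true, "算法": true}
	fmt.Println(maxMatch("分词器算法", dict, 3))
	// Output: [分词器 算法]
}
```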
To install:

```
go get -u github.com/xujiajun/gotokenizer
```
Usage:

```go
package main

import (
	"fmt"

	"github.com/xujiajun/gotokenizer"
)

func main() {
	text := "gotokenizer是一款基于字典和Bigram模型纯go语言编写的分词器,支持6种分词算法。支持stopToken过滤和自定义word过滤功能。"
	dictPath := "/Users/xujiajun/go/src/github.com/xujiajun/gotokenizer/data/zh/dict.txt"

	// NewMaxMatch uses NumAndLetterWordFilter as its default wordFilter.
	mm := gotokenizer.NewMaxMatch(dictPath)
	// Load the dictionary before tokenizing.
	mm.LoadDict()

	fmt.Println(mm.Get(text))
	// [gotokenizer 是 一款 基于 字典 和 Bigram 模型 纯 go 语言 编写 的 分词器 , 支持 6 种 分词 算法 。 支持 stopToken 过滤 和 自定义 word 过滤 功能 。] <nil>

	// Enable stop-token filtering.
	mm.EnabledFilterStopToken = true
	mm.StopTokens = gotokenizer.NewStopTokens()
	stopTokenDicPath := "/Users/xujiajun/go/src/github.com/xujiajun/gotokenizer/data/zh/stop_tokens.txt"
	mm.StopTokens.Load(stopTokenDicPath)

	fmt.Println(mm.Get(text))
	// [gotokenizer 一款 字典 Bigram 模型 go 语言 编写 分词器 支持 6 种 分词 算法 支持 stopToken 过滤 自定义 word 过滤 功能] <nil>

	fmt.Println(mm.GetFrequency(text))
	// map[6:1 种:1 算法:1 过滤:2 支持:2 Bigram:1 模型:1 编写:1 gotokenizer:1 go:1 分词器:1 分词:1 word:1 功能:1 一款:1 语言:1 stopToken:1 自定义:1 字典:1] <nil>
}
```
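The feature list mentions custom word filters (NewMaxMatch defaults to NumAndLetterWordFilter), but the filter interface itself is not shown here. Independent of that hook, Get returns a plain []string, so you can always post-filter the tokens yourself. A minimal sketch; the `dropShortTokens` helper below is hypothetical, not part of the package:

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// dropShortTokens post-filters a token slice, keeping only tokens
// longer than one rune. It can be applied to the []string that
// mm.Get returns, independent of gotokenizer's built-in wordFilter.
func dropShortTokens(tokens []string) []string {
	kept := make([]string, 0, len(tokens))
	for _, t := range tokens {
		if utf8.RuneCountInString(t) > 1 {
			kept = append(kept, t)
		}
	}
	return kept
}

func main() {
	tokens := []string{"gotokenizer", "是", "一款", "分词器"}
	fmt.Println(dropShortTokens(tokens))
	// Output: [gotokenizer 一款 分词器]
}
```

To plug a filter into the package's own wordFilter hook instead, check the package source or tests for the exact interface it expects.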
For more examples, see the tests.
If you'd like to help out with the project, you can open a pull request.
gotokenizer is open-source software licensed under the Apache-2.0 license.
This package is inspired by the following: