Ruby bindings for Cppjieba. Requires a C++11 compiler (gcc 4.8+).
The trie used by Cppjieba has high memory usage: with the default dictionary it takes about 120 MB.
Add this line to your application's Gemfile:
gem 'cppjieba_rb', require: false
Or pin a version:
gem 'cppjieba_rb', '~> 0.4.4', require: false
Or install it yourself:
$ gem install cppjieba_rb
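Since the Gemfile entries above use `require: false`, the gem is only loaded when the application requires it explicitly. Given the memory cost noted earlier, one option is to defer both the `require` and the segmenter construction until first use. The following is a minimal sketch of that idea. `LazyOnce` is a hypothetical helper, not part of cppjieba_rb, and it is an assumption that the dictionary cost is paid at require or construction time.

```ruby
# Hypothetical helper, not part of cppjieba_rb: runs the builder block once,
# on first use, and returns the same object on every later call.
class LazyOnce
  def initialize(&builder)
    @builder = builder
    @mutex = Mutex.new
    @built = false
    @value = nil
  end

  def get
    @mutex.synchronize do
      unless @built
        @value = @builder.call
        @built = true
      end
      @value
    end
  end
end

# In an application the block would be something like:
#   LazyOnce.new { require 'cppjieba_rb'; CppjiebaRb::Segment.new }
# A stand-in object is used here so the sketch runs without the gem installed.
build_count = 0
segmenter = LazyOnce.new do
  build_count += 1
  Object.new
end

first = segmenter.get
second = segmenter.get
# Both calls return the same object, and the block ran only once.
```

The mutex keeps two threads from building the object at the same time, which matters if the first use happens inside a multi-threaded server.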
The segmentation modes are described in the cppjieba project.
Mix Segment mode (HMM with Max Prob, default):
require 'cppjieba_rb'
seg = CppjiebaRb::Segment.new # equivalent to "CppjiebaRb::Segment.new(mode: :mix)"
words = seg.segment("令狐冲是云计算行业的专家")
# ["令狐冲", "是", "云计算", "行业", "的", "专家"]
Alternatively, use the convenience method:
CppjiebaRb.segment('令狐冲是云计算行业的专家', mode: :mix)
HMM or Max probability (mp) Segment mode:
seg = CppjiebaRb::Segment.new(mode: :hmm) # or mode: :mp
seg.segment("令狐冲是云计算行业的专家")
require 'cppjieba_rb'
CppjiebaRb.segment_tag("《忍者蝙蝠侠》续集《忍者蝙蝠侠vs极道联盟》发布角色预告片。")
# {"《"=>"x", "忍者"=>"n", "蝙蝠侠"=>"n", "》"=>"x", "续集"=>"v", "vs"=>"eng", "极道"=>"x", "联盟"=>"j", "发布"=>"v", "角色"=>"n", "预告片"=>"n", "。"=>"x"}
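The tagging result shown above is a plain Hash from word to part-of-speech tag, so it can be filtered with ordinary Hash methods. Note that in the sample, words that occur twice in the input (such as "忍者") appear once in the result. The sketch below picks out words by tag. It works on a literal copy of the sample output so that it runs without the gem, and `words_with_tag` is a hypothetical helper, not part of cppjieba_rb.

```ruby
# Sample result copied from the output above; in real use this would be
# the return value of CppjiebaRb.segment_tag(...).
tags = {
  "《" => "x", "忍者" => "n", "蝙蝠侠" => "n", "》" => "x", "续集" => "v",
  "vs" => "eng", "极道" => "x", "联盟" => "j", "发布" => "v",
  "角色" => "n", "预告片" => "n", "。" => "x"
}

# Hypothetical helper: return the words whose tag is in `wanted`,
# in the order they appear in the hash.
def words_with_tag(tags, wanted)
  tags.select { |_word, tag| wanted.include?(tag) }.keys
end

nouns = words_with_tag(tags, ["n"])
# nouns is ["忍者", "蝙蝠侠", "角色", "预告片"]
```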
require 'cppjieba_rb'
CppjiebaRb.extract_keyword("山西退沙村的明代鼓楼,在今年初被拆掉了。", 5)
# [
# ["退沙村", 11.739204307083542],
# ["拆掉", 9.65218240993],
# ["鼓楼", 9.37888907493],
# ["今年初", 8.89004235788],
# ["明代", 6.52667579263]
# ]
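The keyword result shown above is an array of `[word, weight]` pairs, which makes it easy to threshold by weight or turn into a lookup table. The sketch below does both on a literal copy of the sample output, so it runs without the gem. `keywords_above` is a hypothetical helper, not part of cppjieba_rb, and the cutoff of 9.0 is an arbitrary value chosen for illustration.

```ruby
# Sample result copied from the output above; in real use this would be
# the return value of CppjiebaRb.extract_keyword(text, 5).
keywords = [
  ["退沙村", 11.739204307083542],
  ["拆掉", 9.65218240993],
  ["鼓楼", 9.37888907493],
  ["今年初", 8.89004235788],
  ["明代", 6.52667579263]
]

# Hypothetical helper: keep only the words whose weight is at least min_weight.
def keywords_above(pairs, min_weight)
  pairs.select { |_word, weight| weight >= min_weight }.map(&:first)
end

strong = keywords_above(keywords, 9.0)
# strong is ["退沙村", "拆掉", "鼓楼"]

# Array#to_h turns the pairs into a word => weight lookup table.
weights = keywords.to_h
```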
- Fork it ( https://github.com/erickguan/cppjieba_rb/fork )
- Create your feature branch (git checkout -b my-new-feature)
- Commit your changes (git commit -am 'Add some feature')
- Push to the branch (git push origin my-new-feature)
- Create new Pull Request
- Include the 367w dict and provide an option to use it.
- cppjieba is implemented with a trie, which consumes a lot of memory.