leomax_tokenizer

This repository is a learning exercise that studies fast_tokenizer.

Build environment

Ubuntu

gcc-10.5

macOS

clang-14.0.3

Tokenization algorithms

WordPiece

  • Vocabulary file for testing tokenization:
wget https://bj.bcebos.com/paddlenlp/models/transformers/ernie/vocab.txt
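
For orientation, here is a minimal sketch of the greedy longest-match-first procedure that WordPiece tokenizers commonly use, applied to a vocabulary file with one token per line such as the vocab.txt downloaded above. It is not this repository's code and not the fast_tokenizer API. The names `LoadVocab` and `WordPieceTokenize` are hypothetical, and the `##` continuation prefix and `[UNK]` token are assumed from the usual BERT-style convention. The sketch matches on raw bytes and skips normalization and pre-tokenization.

```cpp
#include <cstddef>
#include <fstream>
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical helper: reads one token per line (assumed vocab.txt layout).
std::unordered_set<std::string> LoadVocab(const std::string& path) {
  std::unordered_set<std::string> vocab;
  std::ifstream in(path);
  std::string line;
  while (std::getline(in, line)) {
    // Tolerate Windows line endings.
    if (!line.empty() && line.back() == '\r') line.pop_back();
    if (!line.empty()) vocab.insert(line);
  }
  return vocab;
}

// Greedy longest-match-first WordPiece on a single, already pre-split word.
// Assumptions: "##" marks a piece that continues a word, and a word that
// cannot be fully covered by vocabulary entries maps to the unknown token.
std::vector<std::string> WordPieceTokenize(
    const std::string& word,
    const std::unordered_set<std::string>& vocab,
    const std::string& unk_token = "[UNK]",
    const std::string& continuation_prefix = "##") {
  std::vector<std::string> pieces;
  std::size_t start = 0;
  while (start < word.size()) {
    std::size_t end = word.size();
    std::string match;
    // Try the longest remaining substring first, then shrink from the right.
    while (end > start) {
      std::string candidate = word.substr(start, end - start);
      if (start > 0) candidate = continuation_prefix + candidate;
      if (vocab.count(candidate) > 0) {
        match = candidate;
        break;
      }
      --end;
    }
    // No vocabulary entry matches at this position: the whole word is unknown.
    if (match.empty()) return {unk_token};
    pieces.push_back(match);
    start = end;
  }
  return pieces;
}
```

With a toy vocabulary containing `un`, `##aff`, and `##able`, calling `WordPieceTokenize("unaffable", vocab)` returns the pieces `un`, `##aff`, `##able`. A word with no matching prefix returns the single token `[UNK]`.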