Source code and corpora for the paper "Iterated Dilated Convolutions for Chinese Word Segmentation", published in the NNW journal.
It implements the following four models for Chinese word segmentation (CWS):
- Bi-LSTM
- Bi-LSTM-CRF
- ID-CNN
- ID-CNN-CRF
Requirements:
- Python >= 3.6
- TensorFlow >= 1.2
Both CPU and GPU are supported. GPU training is 10 times faster than CPU training.
Run the following script to convert the corpus to a TensorFlow dataset:
$ ./scripts/make.sh
Then train and evaluate a model:
$ ./scripts/run.sh $dataset $model
$dataset can be pku, msr, asSC or cityuSC. $model can be cnn or bilstm.
For example:
$ ./scripts/run.sh pku cnn
This trains a cnn model on the pku dataset, then evaluates its performance on the test set.
To enable the CRF layer, simply append --viterbi to your command, e.g.
$ ./scripts/run.sh pku cnn --viterbi
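The commands above can be combined to cover every dataset and model. The following is a minimal sketch, not part of the repository: it loops over the four datasets and two model names listed above, with and without --viterbi, which presumably corresponds to the four models named at the top of this README. The RUN_SCRIPT and DRY_RUN variables and the run_all function are hypothetical names introduced for illustration. By default the script only prints the commands it would run.

```shell
#!/bin/sh
# Sketch: enumerate every dataset/model/CRF combination for scripts/run.sh.
# RUN_SCRIPT, DRY_RUN and run_all are illustrative names, not part of the repository.
RUN_SCRIPT="${RUN_SCRIPT:-./scripts/run.sh}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, 0 = actually run them

run_all() {
    for dataset in pku msr asSC cityuSC; do
        for model in cnn bilstm; do
            # An empty value means no CRF layer; --viterbi enables it.
            for crf in "" "--viterbi"; do
                if [ "$DRY_RUN" = "1" ]; then
                    echo "$RUN_SCRIPT $dataset $model${crf:+ $crf}"
                else
                    # $crf is left unquoted on purpose so an empty value adds no argument.
                    "$RUN_SCRIPT" "$dataset" "$model" $crf
                fi
            done
        done
    done
}

run_all
```

Set DRY_RUN=0 to launch the runs one after another. Since each combination trains a separate model, it may be worth checking the printed list first.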
- Corpora are from SIGHAN05, converted to Simplified Chinese via HanLP. Note that the SIGHAN datasets should only be used for research purposes.
- Model implementations are adapted from https://github.com/iesl/dilated-cnn-ner by Emma Strubell.