CRF chunking with word representations
--------------------------------------
scripts and steps by Joseph Turian
A standard benchmark task in NLP is chunking (shallow parsing), which was
the subject of the CoNLL 2000 shared task. Training a CRF is a standard
approach to this task (Sha & Pereira 2003). In fact, many CRF
implementations include instructions for reimplementing the Sha & Pereira
chunker with identical features:
crfsgd: http://leon.bottou.org/projects/sgd
crf++: http://crfpp.sourceforge.net/
CRFsuite: http://www.chokkan.org/software/crfsuite/
We use CRFsuite because its feature generation code is simple to modify,
making it easy to add new features.
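For illustration, CRFsuite's example chunker drives feature generation
from a list of templates. A minimal sketch in that style (the exact
templates used by this repo's scripts may differ):

    # Sketch of CRFsuite-style feature generation (illustrative only;
    # see this repo's scripts for the actual templates).
    import crfutils  # ships with CRFsuite's example/ directory

    templates = [
        (('w', -1),), (('w', 0),), (('w', 1),),        # word unigrams
        (('pos', -1),), (('pos', 0),), (('pos', 1),),  # POS unigrams
        (('w', -1), ('w', 0)),                         # a word bigram
    ]

    def feature_extractor(X):
        # X is a list of per-token attribute dicts; apply_templates
        # expands the templates into string features stored under each
        # token's 'F' key.
        crfutils.apply_templates(X, templates)
        if X:
            X[0]['F'].append('__BOS__')
            X[-1]['F'].append('__EOS__')

Adding a new feature is then just a matter of attaching a new attribute
to each token and listing it in templates.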
This repository provides instructions and scripts for adding word
representations (Brown clusters and/or word embeddings) to the training
features.
INSTALLATION:
-------------
Download and install CRFsuite: http://www.chokkan.org/software/crfsuite/
You will need my common Python library:
https://github.com/turian/common
Go into data/ and download the CoNLL train and test files:
cd data/
wget http://www.cnts.ua.ac.be/conll2000/chunking/train.txt.gz
wget http://www.cnts.ua.ac.be/conll2000/chunking/test.txt.gz
gunzip *.gz
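Each line of train.txt and test.txt holds one token as "word POS
chunk-tag" in BIO notation (e.g. "Confidence NN B-NP"), with blank lines
between sentences. A minimal reader, for reference (not one of this
repo's scripts):

    def read_conll2000(path):
        """Yield sentences as lists of (word, pos, chunk) triples."""
        sentence = []
        for line in open(path):
            line = line.strip()
            if not line:
                if sentence:
                    yield sentence
                    sentence = []
            else:
                word, pos, chunk = line.split()
                sentence.append((word, pos, chunk))
        if sentence:
            yield sentence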
Download word representations:
cd representations/
wget http://pylearn.org/turian/brown-clusters/brown-rcv1.clean.tokenized-CoNLL03.txt-c100-freq1.txt
wget http://pylearn.org/turian/brown-clusters/brown-rcv1.clean.tokenized-CoNLL03.txt-c320-freq1.txt
wget http://pylearn.org/turian/brown-clusters/brown-rcv1.clean.tokenized-CoNLL03.txt-c1000-freq1.txt
wget http://pylearn.org/turian/brown-clusters/brown-rcv1.clean.tokenized-CoNLL03.txt-c3200-freq1.txt
wget http://pylearn.org/turian/embeddings-ACL2010-20100116-redo-baseline-with-100dims/model-1750000000.LEARNING_RATE%3d1e-09.EMBEDDING_LEARNING_RATE%3d1e-06.EMBEDDING_SIZE%3d200.txt.gz
wget http://pylearn.org/turian/embeddings-ACL2010-20100116-redo-baseline-with-100dims/model-2030000000.LEARNING_RATE%3d1e-09.EMBEDDING_LEARNING_RATE%3d1e-06.EMBEDDING_SIZE%3d100.txt.gz
wget http://pylearn.org/turian/embeddings-ACL2010-20100116-redo-baseline-with-100dims/model-2270000000.LEARNING_RATE%3d1e-09.EMBEDDING_LEARNING_RATE%3d1e-06.txt.gz
wget http://pylearn.org/turian/embeddings-ACL2010-20100116-redo-baseline-with-100dims/model-2280000000.LEARNING_RATE%3d1e-08.EMBEDDING_LEARNING_RATE%3d1e-07.EMBEDDING_SIZE%3d25.txt.gz
ln -s model-1750000000.LEARNING_RATE=1e-09.EMBEDDING_LEARNING_RATE=1e-06.EMBEDDING_SIZE=200.txt.gz cw-embeddings-200dim.txt.gz
ln -s model-2030000000.LEARNING_RATE=1e-09.EMBEDDING_LEARNING_RATE=1e-06.EMBEDDING_SIZE=100.txt.gz cw-embeddings-100dim.txt.gz
ln -s model-2270000000.LEARNING_RATE=1e-09.EMBEDDING_LEARNING_RATE=1e-06.txt.gz cw-embeddings-50dim.txt.gz
ln -s model-2280000000.LEARNING_RATE=1e-08.EMBEDDING_LEARNING_RATE=1e-07.EMBEDDING_SIZE=25.txt.gz cw-embeddings-25dim.txt.gz
wget http://pylearn.org/turian/hlbl_reps_clean_1.rcv1.clean.tokenized-CoNLL03.case-intact.txt.gz
wget http://pylearn.org/turian/hlbl_reps_clean_2.50d.rcv1.clean.tokenized-CoNLL03.case-intact.txt.gz
ln -s hlbl_reps_clean_1.rcv1.clean.tokenized-CoNLL03.case-intact.txt.gz hlbl-embeddings-100dim.txt.gz
ln -s hlbl_reps_clean_2.50d.rcv1.clean.tokenized-CoNLL03.case-intact.txt.gz hlbl-embeddings-50dim.txt.gz
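As a rough sketch of what consuming these files involves (assumptions,
not verified against this repo's loaders: the Brown-cluster files follow
Percy Liang's wcluster output format of bitstring, word, and frequency
separated by tabs, and the embedding files hold a word followed by its
vector components on each line):

    import gzip

    def load_brown_clusters(path):
        # Assumed format: "bitstring<TAB>word<TAB>frequency" per line.
        clusters = {}
        for line in open(path):
            bits, word, freq = line.rstrip('\n').split('\t')
            clusters[word] = bits
        return clusters

    def load_embeddings(path):
        # Assumed format: one word per line, followed by its
        # whitespace-separated vector components.
        embeddings = {}
        with gzip.open(path, 'rt') as f:
            for line in f:
                fields = line.split()
                embeddings[fields[0]] = [float(x) for x in fields[1:]]
        return embeddings

Brown clusters are typically used as prefix features (e.g. the first 4,
6, 10, and 20 bits of the path), following Turian et al. (2010).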
BATCH EVALUATIONS
-----------------
./scripts/train-and-evaluate.py --name baseline --dev --l2 2
WARNING: Every time you change the --features parameter, you should also
change the --name.
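A hypothetical batch sweep over l2 values, using only the flags shown
above (the grid values and names are illustrative):

    # Hypothetical l2 sweep on the dev split; each run gets its own --name.
    import subprocess

    for l2 in [1, 2, 3.2, 5]:
        subprocess.check_call(['./scripts/train-and-evaluate.py',
                               '--name', 'baseline-l2-%s' % l2,
                               '--dev', '--l2', str(l2)])

    # Assumption: omitting --dev trains on the full training set.
    subprocess.check_call(['./scripts/train-and-evaluate.py',
                           '--name', 'baseline-final', '--l2', '3.2'])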
NOTE:
-----
CRFsuite has benchmark results on the CoNLL shared task:
http://www.chokkan.org/software/crfsuite/benchmark.html
However, I did not achieve a comparable F1 score on the CoNLL test set
until I used the following parameters (possible_transitions and
possible_states are CRFsuite options that also generate transition and
state features not observed in the training data):
Dev F1   Test F1   Params
94.04    93.63     l2=2
94.03    93.65     l2=3.2, possible_transitions=1
94.15    93.73     l2=3.2, possible_transitions=1, possible_states=1
94.16    93.79     SGD, l2=3.2, possible_transitions=1, possible_states=1
I tuned the l2 penalty on the dev set (a held-out subset of the training
data), then trained on the entire training set using that penalty.