GitHub - Waino/morfessor-emprune: Morfessor EM+Prune


Morfessor 2.0 - Quick start
===========================


Installation
------------

Morfessor 2.0 is installed using the setuptools library for Python. To
build and install the module and scripts to default paths, type ::

    python setup.py install

For details, see http://docs.python.org/install/


Documentation
-------------

User instructions for Morfessor 2.0 are available in the docs directory
as Sphinx source files (see http://sphinx-doc.org/). Instructions on how
to build the documentation can be found in docs/README.

The documentation is also available online at http://morfessor.readthedocs.org/

Morfessor EM+Prune
------------------

This branch includes the modifications to Morfessor that enable
training using Expectation Maximization and Pruning.

Morfessor EM+Prune training achieves a better Morfessor cost than the
earlier local search algorithm.

A simple usage example ::

    # Create 1M substring seed lexicon directly from a pretokenized corpus
    freq_substr.py --lex-size 1000000 < corpus > freq_substr.1M
    
    # Perform Morfessor EM+Prune training. Autotuning with 10k lexicon size.
    morfessor \
        --em-prune freq_substr.1M \
        -t corpus \
        --num-morph-types 10000 \
        --save-segmentation emprune.model
    
    # Segment data using the Viterbi algorithm
    morfessor-segment \
        testdata \
        --em-prune emprune.model \
        --output segmented.testdata

Additional options for freq_substr.py ::

    --traindata-list
        Training data is a list of word types preceded by counts, not a corpus.

    --prune-redundant "-1"
        Setting prune-redundant to -1 disables pre-pruning of redundant substrings.
        Note the quotes, to prevent the dash from being interpreted as a flag.

    --forcesplit-before XYZ
        Force a splitting point before the characters X, Y and Z
    --forcesplit-after XYZ
        Force a splitting point after the characters X, Y and Z
    --forcesplit-both XYZ
        Force a splitting point both before and after the characters X, Y and Z
        Note that hyphens are NOT force split by default anymore.
        To get the same forcesplitting as Baseline,
        specify --forcesplit-both "-"
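
The count-list input accepted with --traindata-list can be produced from a
corpus in a few lines. The following is a minimal sketch, not part of
Morfessor: the function names are hypothetical, and the single-space
separator between count and word is an assumption based on the description
"word types preceded by counts".

```python
from collections import Counter


def count_word_types(lines):
    """Count whitespace-separated word types over an iterable of lines."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts


def format_count_list(counts):
    """Return one "count word" string per word type.

    Most frequent types come first; ties are broken alphabetically.
    The single-space separator is an assumption, not a documented format.
    """
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [f"{count} {word}" for word, count in ordered]


corpus = ["the cat sat", "the cat ran"]
count_list = format_count_list(count_word_types(corpus))
print("\n".join(count_list))
```

The resulting lines could be written to a file and passed to freq_substr.py
or to EM+Prune training together with --traindata-list.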

Additional options for EM+Prune training ::

    --traindata-list
        Training data is a list of word types preceded by counts, not a corpus.

    --prune-criterion {mdl,autotune,lexicon}
        mdl: (alpha-weighted) Minimum Description Length pruning.
        autotune: MDL with automatic tuning of alpha for lexicon size.
                  If you want a fixed lexicon size, use this.
                  Use --num-morph-types to specify size of lexicon.
        lexicon: lexicon size with omitted prior or pretuned alpha.
                 You probably want "autotune" instead.

    --num-morph-types N
        Goal lexicon size.

    --prune-proportion 0.2
        Proportion of the lexicon to prune in each epoch.

    --em-subepochs 3
        How many sub-epochs of EM to perform.

    --expected-freq-threshold 0.5
        Also prune subwords with expected count less than this.

    --lateen {none,full,prune}
        Lateen EM training mode.
        none: "soft" EM (default)
        full: Lateen-EM
        prune: EM+Viterbi-prune

    --no-bayesianify
        Leave out the Bayesian EM exp digamma transformation of expected counts.

    --no-lexicon-cost
        Omit prior entirely.

    --freq-distr-cost {baseline,omit}
        Frequency distribution prior to use.
        baseline: Approximate Morfessor Baseline prior (default).
        omit: set frequency distribution cost to zero.

    --save-pseudomodel
        Use the trained EM+Prune model to segment the training data,
        and save the resulting segmentation as if it were a Morfessor Baseline model.
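
To get a rough feel for how --prune-proportion interacts with the seed
lexicon size and --num-morph-types, the sketch below counts how many pruning
epochs a fixed proportion would need. It is a back-of-the-envelope estimate
under simplifying assumptions, not the actual training loop: it ignores
--expected-freq-threshold, the MDL criterion, and any other stopping rule,
and the function name is hypothetical.

```python
def estimate_prune_epochs(seed_size, target_size, prune_proportion):
    """Estimate the number of pruning epochs needed to reach target_size.

    Assumes each epoch removes exactly prune_proportion of the current
    lexicon (rounded down) and that the final epoch stops at target_size.
    """
    if not 0.0 < prune_proportion < 1.0:
        raise ValueError("prune_proportion must be between 0 and 1")
    size = seed_size
    epochs = 0
    while size > target_size:
        # Remove at least one entry so that the loop always terminates.
        pruned = max(1, int(size * prune_proportion))
        size = max(target_size, size - pruned)
        epochs += 1
    return epochs


# Values from the usage example: 1M seed lexicon, 10k goal, 0.2 per epoch.
epochs = estimate_prune_epochs(1_000_000, 10_000, 0.2)
print(epochs)
```

A larger proportion reduces the number of epochs at the price of coarser
pruning steps.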


Additional options for segmentation ::

    --sample-nbest
        Sample alternative segmentations from n-best list.
        Approximates --sample, but is much faster.

    --sample
        Sample from full distribution. You probably want --sample-nbest instead.

    --sampling-temperature 0.5
        (Inverted) temperature parameter for sampling. (1.0 = unsmoothed)
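
The effect of an inverted temperature on sampling from an n-best list can be
illustrated as follows. This is a sketch of the general technique (raising
each probability to the power of the inverse temperature and renormalizing),
assumed here for illustration; the actual Morfessor EM+Prune implementation
may differ in its details. All function names and the example scores are
hypothetical.

```python
import math
import random


def smoothed_distribution(logprobs, inv_temperature):
    """Return normalized probabilities proportional to p ** inv_temperature.

    An inverse temperature of 1.0 leaves the distribution unsmoothed;
    values below 1.0 flatten it.
    """
    scaled = [inv_temperature * lp for lp in logprobs]
    peak = max(scaled)  # subtract the maximum for numerical stability
    weights = [math.exp(s - peak) for s in scaled]
    total = sum(weights)
    return [w / total for w in weights]


def sample_segmentation(nbest, inv_temperature, rng=random):
    """Sample one segmentation from a list of (segmentation, logprob) pairs."""
    segmentations = [seg for seg, _ in nbest]
    logprobs = [lp for _, lp in nbest]
    probs = smoothed_distribution(logprobs, inv_temperature)
    return rng.choices(segmentations, weights=probs, k=1)[0]


# Hypothetical 2-best list with made-up probabilities.
nbest = [(["un", "believ", "able"], math.log(0.8)),
         (["unbeliev", "able"], math.log(0.2))]
unsmoothed = smoothed_distribution([lp for _, lp in nbest], 1.0)
flattened = smoothed_distribution([lp for _, lp in nbest], 0.5)
sampled = sample_segmentation(nbest, 0.5, rng=random.Random(0))
```

With an inverse temperature of 0.5, the less likely alternative is sampled
more often than its unsmoothed probability would suggest.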

A note on pretokenization and boundary markers:

Morfessor EM+Prune is typically used with *word* boundary markers (which mark where the whitespace should go), rather than the *morph* boundary markers (which mark word-internal boundaries) used by previous Morfessor versions.
Make sure that the word boundary markers are present in the corpus or word count lists used for Morfessor EM+Prune training, and also in
the input to Morfessor EM+Prune during segmentation.
Some ways to achieve this are to use the pyonmttok library with spacer_annotate=True and joiner_annotate=False,
or the dynamicdata dataloader with pretokenize=True.
This will insert '▁' (Unicode LOWER ONE EIGHTH BLOCK, U+2581) as word boundary markers.
Also remember to adjust your detokenization post-processing script accordingly.
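
The idea of word boundary markers and the matching detokenization step can be
sketched in plain Python. This is a simplified, whitespace-only stand-in and
does not reproduce the behaviour of pyonmttok or dynamicdata. In particular,
leaving the first word of a line unmarked is an assumption of this sketch.
Both function names are hypothetical.

```python
SPACER = "\u2581"  # LOWER ONE EIGHTH BLOCK, used as word boundary marker


def mark_word_boundaries(line):
    """Split a line on whitespace and prefix every non-initial word with SPACER.

    Assumption of this sketch: the first word of the line stays unmarked.
    """
    words = line.split()
    if not words:
        return []
    return [words[0]] + [SPACER + word for word in words[1:]]


def detokenize(subwords):
    """Join subwords and turn each SPACER back into a single space."""
    return "".join(subwords).replace(SPACER, " ")


marked = mark_word_boundaries("the unbelievable story")
# A hypothetical segmentation of the marked words into subwords.
segmented = ["the", SPACER + "un", "believ", "able", SPACER + "story"]
restored = detokenize(segmented)
```

Because the markers carry all whitespace information, the subwords can be
concatenated without knowing which ones were word-internal.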


Contact
-------

Questions or feedback? Email: [email protected]


Citing
------

If you use the Morfessor EM+Prune training algorithm, please cite

@inproceedings{gronroos2020morfessor,
    title = {Morfessor {EM+Prune}: Improved Subword Segmentation with Expectation Maximization and Pruning},
    author = {Gr{\"o}nroos, Stig-Arne and Virpioja, Sami and Kurimo, Mikko},
    year = {2020},
    month = {may},
    address = {Marseilles, France},
    booktitle = {Proceedings of the 12th Language Resources and Evaluation Conference},
    publisher = {ELRA},
}

ArXiv preprint available online at

https://arxiv.org/abs/2003.03131


For the original Morfessor 2.0 Python implementation, please cite

@techreport{virpioja2013morfessor,
    address = {Helsinki, Finland},
    type = {Report},
    title = {Morfessor 2.0: Python Implementation and Extensions for Morfessor Baseline},
    language = {eng},
    number = {25/2013 in Aalto University publication series SCIENCE + TECHNOLOGY},
    institution = {Department of Signal Processing and Acoustics, Aalto University},
    author = {Virpioja, Sami and Smit, Peter and Grönroos, Stig-Arne and Kurimo, Mikko},
    year = {2013},
    pages = {38}
}

The report is available online at 

http://urn.fi/URN:ISBN:978-952-60-5501-5