-
Notifications
You must be signed in to change notification settings - Fork 2
Morfessor EM+Prune
License
Waino/morfessor-emprune
Folders and files
Name | Name | Last commit message | Last commit date | |
---|---|---|---|---|
Repository files navigation
Morfessor 2.0 - Quick start =========================== Installation ------------ Morfessor 2.0 is installed using setuptools library for Python. To build and install the module and scripts to default paths, type python setup.py install For details, see http://docs.python.org/install/ Documentation ------------- User instructions for Morfessor 2.0 are available in the docs directory as Sphinx source files (see http://sphinx-doc.org/). Instructions how to build the documentation can be found in docs/README. The documentation is also available on-line at http://morfessor.readthedocs.org/ Morfessor EM+Prune ------------------ This branch includes the modifications to Morfessor that enable training using Expectation Maximization and Pruning. Morfessor EM+Prune training achieves better Morfessor cost than the earlier local search algorithm. A simple usage example :: # Create 1M substring seed lexicon direct from a pretokenized corpus freq_substr.py --lex-size 1000000 < corpus > freq_substr.1M # Perform Morfessor EM+Prune training. Autotuning with 10k lexicon size. morfessor \ --em-prune freq_substr.1M \ -t corpus \ --num-morph-types 10000 \ --save-segmentation emprune.model # Segment data using the Viterbi algorithm morfessor-segment \ testdata \ --em-prune emprune.model \ --output segmented.testdata Additional options for freq_substr.py :: --traindata-list Training data is a list of word types preceded by counts, not a corpus. --prune-redundant "-1" Setting prune-redundant to -1 disables pre-pruning of redundant substrings. Note the quotes, to prevent the dash from being interpreted as a flag. --forcesplit-before XYZ Force a splitting point before the characters X, Y and Z --forcesplit-after XYZ Force a splitting point after the characters X, Y and Z --forcesplit-both XYZ Force a splitting point both before and after the characters X, Y and Z Note that hyphens are NOT force split by default anymore, to get the same forcesplitting as Baseline, you need to specify --forcesplit-both "-" Additional options for EM+Prune training :: --traindata-list Training data is a list of word types preceded by counts, not a corpus. --prune-criterion {mdl,autotune,lexicon} mdl: (alpha-weighted) Minimum Description Length pruning. autotune: MDL with automatic tuning of alpha for lexicon size. If you want a fixed lexicon size, use this. Use --num-morph-types to specify size of lexicon. lexicon: lexicon size with omitted prior or pretuned alpha. You probably want "autotune" instead. --num-morph-types N Goal lexicon size. --prune-proportion 0.2 How large proportion of lexicon to prune in each epoch. --em-subepochs 3 How many sub-epochs of EM to perform. --expected-freq-threshold 0.5 Also prune subwords with expected count less than this. --lateen {none,full,prune} Lateen EM training mode. none: "soft" EM (default) full: Lateen-EM prune: EM+Viterbi-prune --no-bayesianify Leave out the Bayesian EM exp digamma transformation of expected counts. --no-lexicon-cost Omit prior entirely. --freq-distr-cost {baseline,omit} Frequency distribution prior to use. baseline: Approximate Morfessor Baseline prior (default). omit: set frequency distribution cost to zero. --save-pseudomodel use the trained EM+Prune model to segment the training data, and save the resulting segmentation as if it was a Morfessor Baseline model. Additional options for segmentation :: --sample-nbest Sample alternative segmentations from n-best list. Approximates --sample, but is much faster. --sample Sample from full distribution. You probably want --sample-nbest instead. --sampling-temperature 0.5 (Inverted) temperature parameter for sampling. (1.0 = unsmoothed) A note on pretokenization and boundary markers :: Morfessor EM+Prune is typically used with *word* boundary markers (marks where the whitespace should go), rather than the *morph* boundary markers (marks word-internal boundaries) used by previous Morfessors. Make sure that the word boundary markers are present in the corpus / word count lists used for Morfessor EM+Prune training, and also in the input to Morfessor EM+Prune during segmentation. Some ways to achieve this is to use the pyonmttok library with spacer_annotate=True and joiner_annotate=False, or the dynamicdata dataloader with pretokenize=True. This will insert '▁' (unicode lower one eight block \u2581) as word boundary markers. Also remember to adjust your detokenization post-processing script appropriately. Contact ------- Questions or feedback? Email: [email protected] Citing ------ If you use the Morfessor EM+Prune training algorithm, please cite @inproceedings{gronroos2020morfessor, title={Morfessor {EM+Prune}: Improved Subword Segmentation with Expectation Maximization and Pruning}, author = {Gr{\"o}nroos, Stig-Arne and Sami Virpioja and Mikko Kurimo}, year = {2020}, month = {may}, address = {Marseilles, France}, booktitle = {Proceedings of the 12th Language Resources and Evaluation Conference}, publisher = {ELRA}, } ArXiv preprint available online at https://arxiv.org/abs/2003.03131 For the original Morfessor 2.0: Python implementation, please cite @techreport{virpioja2013morfessor, address = {Helsinki, Finland}, type = {Report}, title = {Morfessor 2.0: Python Implementation and Extensions for Morfessor Baseline}, language = {eng}, number = {25/2013 in Aalto University publication series SCIENCE + TECHNOLOGY}, institution = {Department of Signal Processing and Acoustics, Aalto University}, author = {Virpioja, Sami and Smit, Peter and Grönroos, Stig-Arne and Kurimo, Mikko}, year = {2013}, pages = {38} } The report is available online at http://urn.fi/URN:ISBN:978-952-60-5501-5
About
Morfessor EM+Prune
Topics
Resources
License
Stars
Watchers
Forks
Releases
No releases published
Packages 0
No packages published