Inference on new corpus by trained alignments #46

Open
juncaofish opened this issue Dec 3, 2019 · 7 comments

@juncaofish

I've already trained on a large parallel corpus to get word alignments. How can I use these trained alignments to infer translation probabilities for a new corpus?

@hieuhoang

You have to run a phrase-table extraction algorithm with the corpus and the alignment as input, e.g. steps 4, 5, and 6 of the Moses training pipeline:
http://www.statmt.org/moses/?n=FactoredTraining.HomePage
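
To make the "corpus and alignment as input" idea concrete, here is a minimal Python sketch of the simplest of those steps: estimating word translation probabilities by relative frequency from aligned sentence pairs. It assumes the fast_align formats (`source ||| target` lines and `i-j` alignment links with 0-based source and target indices). The function name is hypothetical, and the sketch ignores unaligned words, so it is not a substitute for the Moses scripts.

```python
from collections import Counter, defaultdict


def lexical_translation_table(sentence_pairs, alignments):
    """Estimate p(target_word | source_word) by relative frequency.

    sentence_pairs: iterable of "source ||| target" strings.
    alignments: iterable of "i-j i-j ..." strings, one per sentence pair,
                where i indexes the source tokens and j the target tokens.
    Unaligned words are ignored (an assumption of this sketch).
    """
    pair_counts = Counter()
    source_counts = Counter()
    for pair_line, align_line in zip(sentence_pairs, alignments):
        source_text, _, target_text = pair_line.partition("|||")
        source = source_text.split()
        target = target_text.split()
        for link in align_line.split():
            i, j = (int(index) for index in link.split("-"))
            pair_counts[(source[i], target[j])] += 1
            source_counts[source[i]] += 1

    table = defaultdict(dict)
    for (source_word, target_word), count in pair_counts.items():
        table[source_word][target_word] = count / source_counts[source_word]
    return dict(table)


if __name__ == "__main__":
    pairs = ["das Haus ||| the house", "ein Haus ||| a home"]
    links = ["0-0 1-1", "0-0 1-1"]
    print(lexical_translation_table(pairs, links)["Haus"])
```

Phrase extraction and phrase scoring build on the same inputs but work on multi-word spans, which is why the Moses pipeline is the practical route for a full phrase table.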

@nomadlx

nomadlx commented Jan 6, 2020

Have a look at the file force_align.py. I guess this script is meant to align a new corpus using previously trained conditional probabilities.

@stribizhev

Has anyone got a working demo script? Ideally one for a model that supports SentencePiece tokenization.

@Brucewuzhang

I rewrote the source code in pure Python (I can't share it, for certain reasons). I think anyone can implement fast_align after reading the source code. My suggestion is not to use statistical word alignment models with SentencePiece-based tokenization; in my view they are not compatible. Statistical word alignment models can still be useful, depending on your purpose.

@bricksdont

It is unclear whether the original question is about

a) word-aligning a corpus with a previously trained fast_align model (nomadlx assumed this was the case)
b) obtaining translation probabilities for phrases or sentences given word translation probabilities (Hieu assumed this was the case)

If a), you might find the following useful. To train a fast_align model:

https://gist.github.com/bricksdont/7a9ac764d874b90853eff88d53971033

and to apply a trained model:

https://gist.github.com/bricksdont/0d1718c7c3fc05714b582afe4c3b5005
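
For readers who cannot open the gists, the two-stage workflow can be sketched as command lines built in Python. This is not the content of the gists. It is a rough sketch based on the usage described in the fast_align README as far as I recall it, so the flags and argument order should be checked against your own checkout. The helper names, the `prefix` naming scheme, and the default heuristic are assumptions made for illustration.

```python
def train_commands(corpus, prefix, fast_align="./fast_align"):
    """Build argv lists for forward and reverse training runs.

    Assumption: -p saves the model parameters, and the stderr of each run
    must be redirected to <prefix>.fwd_err / <prefix>.rev_err by the caller,
    because force_align.py expects those log files as arguments.
    """
    forward = [fast_align, "-i", corpus, "-d", "-v", "-o",
               "-p", prefix + ".fwd_params"]
    reverse = [fast_align, "-i", corpus, "-r", "-d", "-v", "-o",
               "-p", prefix + ".rev_params"]
    return forward, reverse


def force_align_command(prefix, heuristic="grow-diag-final-and",
                        script="./force_align.py"):
    """Build the argv list for aligning new text with a trained model.

    New sentence pairs are read from stdin and alignments are written to
    stdout, so the caller is responsible for wiring up those streams.
    """
    return [script,
            prefix + ".fwd_params", prefix + ".fwd_err",
            prefix + ".rev_params", prefix + ".rev_err",
            heuristic]


if __name__ == "__main__":
    for command in train_commands("corpus.f-e", "model"):
        print(" ".join(command))
    print(" ".join(force_align_command("model")))
```

The lists can be passed to `subprocess.run` with `stdout` and `stderr` pointed at the appropriate files. Building them separately keeps the sketch inspectable without fast_align installed.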

@bricksdont

Edge cases that can break force_align.py:

  • The script only works with Python 2 at the moment.
  • If the input corpus has lines where the source or target side is empty, such as
    This is a test. |||
    the process hangs indefinitely.
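
One way to guard against the empty-side case is to filter the corpus before it reaches force_align.py. The following is a minimal Python sketch, assuming the usual `source ||| target` line format. The function name is hypothetical, and recording the skipped line numbers is a design choice of this sketch so that the output can later be re-synchronized with the original corpus.

```python
def split_usable_pairs(lines):
    """Separate alignable sentence pairs from lines that would stall alignment.

    Returns (kept, skipped): `kept` holds the lines whose source and target
    sides are both non-empty, `skipped` holds the 0-based indices of lines
    with a missing separator or an empty side.
    """
    kept, skipped = [], []
    for index, line in enumerate(lines):
        text = line.rstrip("\n")
        source, separator, target = text.partition("|||")
        if separator and source.strip() and target.strip():
            kept.append(text)
        else:
            skipped.append(index)
    return kept, skipped


if __name__ == "__main__":
    corpus = [
        "This is a test. ||| Das ist ein Test.",
        "This is a test. |||",
    ]
    kept, skipped = split_usable_pairs(corpus)
    print(kept)
    print(skipped)
```

After aligning only the kept lines, an empty alignment line can be inserted at each skipped index so that the alignment file stays line-parallel with the original corpus.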

@liesun1994

Same issue as #33.
