Reproduce probabilities and backoffs from python #405
Comments
I get different results from both you and KenLM, but I believe KenLM is making a mistake here. What I got with KenLM on your corpus:
What I got with my own Python script:
I believe KenLM is miscalculating the discounts for 1-grams, because it miscounts the number of 1-grams whose adjusted count is 1 and the number whose adjusted count is 2.
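To make the dependence on those counts concrete, here is a minimal sketch of how the discounts follow from the counts of counts. It assumes the standard closed-form modified Kneser-Ney estimates (Chen and Goodman), and the function name and input mapping are hypothetical, not taken from KenLM or from the script in this issue:

```python
from collections import Counter


def modified_kn_discounts(adjusted_counts):
    """Return (D1, D2, D3+) for one n-gram order.

    `adjusted_counts` is a hypothetical mapping from n-gram to its
    adjusted count. Assumes the closed-form estimates
    Y = n1 / (n1 + 2 * n2) and D_k = k - (k + 1) * Y * n_{k+1} / n_k,
    where n_k is the number of n-grams with adjusted count exactly k.
    """
    count_of_counts = Counter(adjusted_counts.values())
    n = {k: count_of_counts.get(k, 0) for k in (1, 2, 3, 4)}
    # The estimates divide by n1, n2 and n3; a toy corpus can easily
    # leave one of them at zero, in which case they are undefined.
    if min(n[1], n[2], n[3]) == 0:
        raise ValueError(f"cannot estimate discounts from counts of counts {n}")
    y = n[1] / (n[1] + 2 * n[2])
    return tuple(k - (k + 1) * y * n[k + 1] / n[k] for k in (1, 2, 3))
```

Because n1 and n2 enter both Y and every D_k, a difference of one in either count shifts all the 1-gram probabilities and backoffs.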
There are two other causes for the discrepancy:
While this is a faithful implementation of the second equation in Sec 3.3 of this paper:
I'm trying to write a Python script that computes the probabilities and backoffs the same way kenLM does. The goal is to reproduce the same outputs, given the same corpus.
However, no matter how much I read the documentation and the paper, I can't get it to work. I would love some external help to get it working and reproduce the same results as kenLM.
I'm testing on a toy corpus. Here is the content of test.txt:
I can train a LM using kenLM with the following command:
lmplz --text test.txt --arpa test.lm -o 2
Now in the test.lm file, I can access the probabilities and backoffs computed for each 1-gram and 2-gram.
Here is my Python script to compute the probabilities and backoffs:
I followed the formulas from this paper.
But after running this script, I get different probabilities and backoffs. For example, for the 1-gram went:
kenLM gives p=-0.6726411 and backoff=-0.033240937
my script gives p=-0.6292122373715772 and backoff=-0.022929960646239318
For the 2-gram You go:
kenLM gives p=-0.4305645
my script gives p=-0.3960932540172504
What is the reason for these discrepancies?
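For anyone who wants to compare every entry rather than two hand-picked ones, the following is a rough sketch of a reader for the ARPA output. It assumes the usual ARPA layout (a `\N-grams:` header per order, then one line per n-gram holding a log10 probability, the N words, and an optional log10 backoff); `parse_arpa` is a hypothetical helper, not part of kenLM:

```python
def parse_arpa(lines):
    """Map each n-gram (tuple of words) to (log10_prob, log10_backoff).

    `lines` is any iterable of text lines, e.g. an open file handle.
    The backoff is None when the entry has no backoff column.
    """
    entries = {}
    order = None  # n-gram order of the section being read, if any
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith("\\"):
            # Section markers: \data\, \1-grams:, \2-grams:, \end\
            if line.endswith("-grams:"):
                order = int(line[1:line.index("-")])
            else:
                order = None
            continue
        if order is None:
            continue  # e.g. the "ngram 1=..." lines under \data\
        fields = line.split()
        log_prob = float(fields[0])
        words = tuple(fields[1:1 + order])
        # Anything left after the words is the backoff weight.
        backoff = float(fields[1 + order]) if len(fields) > 1 + order else None
        entries[words] = (log_prob, backoff)
    return entries
```

With the file from the command above, `parse_arpa(open("test.lm", encoding="utf-8"))` would return a dictionary that can be diffed key by key against the script's own log10 values.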