Reproduce probabilities and backoffs from python #405

Open
astariul opened this issue Dec 12, 2022 · 2 comments

@astariul

I'm trying to write a Python script that computes the probabilities and backoffs the same way KenLM does.

The goal is to reproduce the same outputs, given the same corpus.

However, no matter how much I read the documentation and the paper, I can't get the numbers to match. I would appreciate some help reproducing KenLM's results.


I'm testing on a toy corpus. Here is the content of test.txt:

I went to
I go to
You went to
You go to
Joe went to
Joe go to
Anna went ski
Anna went ski

I can train an LM using KenLM with the following command: lmplz --text test.txt --arpa test.lm -o 2

Now, in the test.lm file, I can see the probabilities and backoffs computed for each 1-gram and 2-gram.
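
For reference, each entry in test.lm is one line containing the log10 probability, the n-gram itself, and (except at the highest order) the log10 backoff:

```
\1-grams:
<log10 prob>	<word>	<log10 backoff>
...

\2-grams:
<log10 prob>	<word1> <word2>
...
\end\
```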


Here is my Python script to compute the probabilities and backoffs:

from collections import Counter
import math

from nltk.util import ngrams as compute_ngrams


def main(order=2):
    # Count N-grams
    c = Counter()
    with open("test.txt", "r") as f:
        for line in f:
            processed_line = line.strip().split(" ")
            processed_line = ["<s>"] + processed_line + ["</s>"]

            for ngram in compute_ngrams(processed_line, order):
                c[ngram] += 1

    # Adjust counts
    last_order = c
    a = Counter(c)
    for _ in range(1, order):
        new_order = Counter()
        for ngram, count in last_order.items():
            if count > 0:
                suffix = ngram[1:]
                new_order[suffix] += 1
        a.update(new_order)
        last_order = new_order

    # Make sure <s> is in there (although its count is 0)
    a[("<s>",)] = 0

    # Compute smoothing statistics
    t = [Counter() for i in range(order + 1)]
    for ngram, count in a.items():
        if 1 <= count <= 4:
            t[len(ngram)][count] += 1

    # Create discount function
    def discount(n, k):
        if k == 0:
            return 0
        
        if k >= 3:
            k = 3

        y = t[n][1] / (t[n][1] + 2 * t[n][2])
        return k - (k + 1) * y * t[n][k + 1] / t[n][k]

    # Just checking (https://github.com/kpu/kenlm/blob/master/lm/builder/adjust_counts.cc#L60)
    for n in range(1, order + 1):
        for i in range(0, 5):
            assert 0 <= discount(n, i) <= i, f"{discount(n, i)}"

    # Compute pseudo-probabilities and backoff
    u = {}
    b = {}
    for ngram, count in a.items():
        o = len(ngram)
        prefix = ngram[:-1]
        prefix_count = 0
        pcount = Counter()

        # One pass to compute the denominator + backoff terms
        for ngram2, count2 in a.items():
            # if (len(prefix) == 0 and len(ngram2) == 1) or ngram2[:-1] == prefix:
            if ngram2[:-1] == prefix: # and len(ngram2) == o:
                prefix_count += count2
                pcount[count2] += 1

        u[ngram] = (count - discount(o, count)) / prefix_count
        b[prefix] = sum(discount(o, i) * pcount[i] for i in range(1, 3 + 1)) / prefix_count

    # Add empty backoff for maximum order and for </s>
    for ngram in u:
        if len(ngram) == order:
            b[ngram] = 0
    b[("</s>",)] = 0

    # Add unknown word
    u[("<unk>",)] = 0
    b[("<unk>",)] = 0

    # Interpolate to compute the final probabilities
    p = {}

    # First, unigrams
    vocab = [ngram for ngram in u if len(ngram) == 1]
    vocab_size = len(vocab)
    for ngram in vocab:
        p[ngram] = u[ngram] + b[tuple()] / vocab_size

    # Then, compute ngrams, order by order
    for i in range(1, order):
        o = i + 1

        for ngram in u:
            # Skip ngram of wrong order (either they were already computed, or will be computed later)
            if len(ngram) != o:
                continue

            prefix = ngram[:-1]
            suffix = ngram[1:]

            p[ngram] = u[ngram] + b[prefix] * p[suffix]

    # Display
    print("N-gram probabilities & backoffs (and pseudo-probabilities):\n")
    for ngram in p:
        print(f"{ngram} -> {math.log10(p[ngram])} --- {math.log10(b[ngram]) if b[ngram] > 0 else 0}")


if __name__ == "__main__":
    main()

I followed the formulas from this paper.
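
For reference, these are the formulas I'm implementing, as I read them from Sec 3.3 (with $a(\cdot)$ the adjusted counts and $D_n$ the discounts):

$\displaystyle u(w_n \mid w_1^{n-1}) = \frac{a(w_1^n) - D_n\bigl(a(w_1^n)\bigr)}{\sum_x a(w_1^{n-1} x)}$

$\displaystyle b(w_1^{n-1}) = \frac{\sum_{i=1}^{3} D_n(i)\,\bigl|\{x : a(w_1^{n-1} x) = i\}\bigr|}{\sum_x a(w_1^{n-1} x)}$

$\displaystyle p(w_n \mid w_1^{n-1}) = u(w_n \mid w_1^{n-1}) + b(w_1^{n-1})\, p(w_n \mid w_2^{n-1})$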

But after running this script, I get different probabilities and backoffs. For example, for the 1-gram went:

  • KenLM gives p=-0.6726411 and backoff=-0.033240937
  • My script gives p=-0.6292122373715772 and backoff=-0.022929960646239318

For the 2-gram You go:

  • KenLM gives p=-0.4305645
  • My script gives p=-0.3960932540172504

What is the reason for such discrepancies?

@MaigoAkisame

I get different results from both you and KenLM, but I believe KenLM is making a mistake here.

What I got with KenLM on your corpus:

-0.6726411      went    -0.022929981
-0.4034029      You go

What I got with my own Python script:

-0.614294447744974      went    -0.022929960646239318
-0.39343950744536116    You go

I believe KenLM is miscalculating the discounts for 1-grams.
KenLM has: D1=0.4 D2=1.6 D3+=1.4
I have: D1=0.5555555555555556, D2=1.1666666666666665, D3+=0.7777777777777777

And this is because KenLM miscounts the number of 1-grams with adjusted counts = 1 and 2.
Using the same method as in Issue #427, I find that KenLM prints s.n[1] == 4 and s.n[2] == 3, i.e. KenLM thinks there are 4 1-grams with adjusted count = 1, and 3 1-grams with adjusted count = 2.
But actually, there are 5 1-grams with adjusted count = 1 (I, You, Joe, Anna, ski) -- these all occur after only 1 type of token;
and 2 1-grams with adjusted count = 2 (to and </s>) -- these occur after 2 types of tokens.
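
Here is a minimal sketch (my own check, not KenLM code) that computes the adjusted unigram counts on the toy corpus and the discounts that follow from them:

```python
from collections import Counter, defaultdict

# Toy corpus from the issue
corpus = [
    "I went to", "I go to", "You went to", "You go to",
    "Joe went to", "Joe go to", "Anna went ski", "Anna went ski",
]

# For a bigram model, the adjusted count of a unigram w is the number of
# distinct tokens that ever precede w (<s> itself never appears as w here).
left_contexts = defaultdict(set)
for line in corpus:
    tokens = ["<s>"] + line.split() + ["</s>"]
    for prev, word in zip(tokens, tokens[1:]):
        left_contexts[word].add(prev)

adjusted = {w: len(ctx) for w, ctx in left_contexts.items()}
n = Counter(adjusted.values())   # n[k] = number of unigrams with adjusted count k
print(adjusted)                  # I, You, Joe, Anna, ski -> 1; to, </s> -> 2; go -> 3; went -> 4
print(n[1], n[2])                # 5 and 2, not KenLM's 4 and 3

# Modified Kneser-Ney discounts: D_k = k - (k + 1) * Y * n[k+1] / n[k],
# with Y = n[1] / (n[1] + 2 * n[2])
Y = n[1] / (n[1] + 2 * n[2])
for k in (1, 2, 3):
    print(f"D{k} = {k - (k + 1) * Y * n[k + 1] / n[k]}")
# -> D1 = 0.5555..., D2 = 1.1666..., D3 = 0.7777...
```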

@MaigoAkisame

MaigoAkisame commented Apr 14, 2023

There are two other causes for the discrepancy:

  1. KenLM does not include <s> when calculating the vocabulary size, while your program does.
    I think KenLM's approach makes more sense -- <s> is never to be predicted, so we don't need to assign any probability to it.

  2. When calculating the backoff, you're only summing the probability mass discounted from ngrams with adjusted count <= 3:

        b[prefix] = sum(discount(o, i) * pcount[i] for i in range(1, 3 + 1)) / prefix_count

This is a faithful implementation of the second equation in Sec 3.3 of this paper:
$\displaystyle b(w_1^{n-1}) = \frac{\sum_{i=1}^{3} D_n(i)\,\bigl|\{x : a(w_1^{n-1}x) = i\}\bigr|}{\sum_x a(w_1^{n-1} x)}$ (<-- lol GitHub's rendering of the summation symbol)
However, I think the equation in the paper is wrong.
The upper limit of the summation should be infinity (with $D_n(i) = D_n(3)$ for $i \geq 3$). Otherwise, the probability mass discounted from n-grams with adjusted count > 3 goes nowhere, and the sum of the unnormalized probabilities and the backoff comes out less than 1.
I think KenLM's implementation treats the upper limit as infinity; a corrected version of the backoff line is sketched below.
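
For concreteness, here is one way the backoff line in the script above could be changed so the sum effectively runs to infinity (a sketch of what I believe KenLM does, not its actual code; it reuses the script's own discount(), which already clamps its second argument to 3):

```python
# Sum the discounted probability mass over every adjusted count observed after
# the prefix; discount(o, k) returns D3+ for any k >= 3.
b[prefix] = sum(discount(o, k) * freq for k, freq in pcount.items()) / prefix_count
```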
