
Subsampling of frequent words #8

thomalm opened this issue Dec 21, 2020 · 0 comments

thomalm commented Dec 21, 2020

I was looking through your implementation of subsampling of frequent words in https://github.com/will-thompson-k/deeplearning-nlp-models/blob/master/nlpmodels/utils/elt/skipgram_dataset.py#L68, specifically how you generate the sampling table in get_word_discard_probas(). It looks like your implementation differs slightly from the original word2vec implementation: https://github.com/tmikolov/word2vec/blob/20c129af10659f7c50e86e3be406df663beff438/word2vec.c#L407.
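
For context, the paper gives a probability of discarding a word, P(w) = 1 - sqrt(t / f(w)) with f(w) the word's relative frequency, while the C code computes a probability of keeping it that has an extra additive term. A rough side-by-side (my paraphrase; count / total is my notation, not a name from either codebase):

import numpy as np

def discard_proba_paper(count, total, t=1e-5):
    # Paper: P(discard w) = 1 - sqrt(t / f(w)), with f(w) the relative frequency.
    return 1.0 - np.sqrt(t / (count / total))

def keep_proba_c_code(count, total, sample=1e-5):
    # word2vec.c: keep if rand < (sqrt(count / (sample*total)) + 1) * (sample*total) / count,
    # which simplifies to sqrt(sample / f(w)) + sample / f(w) -- note the extra term.
    return (np.sqrt(count / (sample * total)) + 1) * (sample * total) / count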

Something like this worked for me when passing a collections.Counter (or any dict) with the item counts:

import numpy as np

def sampling_probabilities(item_counts, sample=1e-5):
    # Per-item probability of *keeping* an occurrence, matching the word2vec C code:
    # p(w) = (sqrt(count / (sample * total)) + 1) * (sample * total) / count
    counts = np.array(list(item_counts.values()), dtype=np.float64)
    total_count = counts.sum()
    threshold = sample * total_count
    probabilities = (np.sqrt(counts / threshold) + 1) * threshold / counts
    # Values can exceed 1.0 for rare items; clipping is only useful
    # if you wish to plot the probability distribution.
    # probabilities = np.minimum(probabilities, 1.0)
    return {k: probabilities[i] for i, k in enumerate(item_counts.keys())}
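
To be concrete, here is how I use it, on a made-up toy corpus, subsampling each occurrence independently the way the C code does:

from collections import Counter
import numpy as np

tokens = ["the"] * 1000 + ["cat"] * 50 + ["sat"] * 5  # toy corpus
probas = sampling_probabilities(Counter(tokens), sample=1e-3)

rng = np.random.default_rng(0)
# Keep each occurrence independently with its word's keep probability.
kept = [t for t in tokens if rng.random() < probas[t]]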

Using sample=1e-5 on one of my smaller datasets, I get around a 17% chance of keeping the most frequent item. This will of course vary a lot from dataset to dataset. There is a StackOverflow thread discussing the sampling: https://stackoverflow.com/questions/58772768/word2vec-subsampling-implementation.
