
Subsampling of frequent words #8

thomalm opened this issue Dec 21, 2020 · 0 comments

thomalm commented Dec 21, 2020

I was looking through your implementation of subsampling of frequent words in https://github.com/will-thompson-k/deeplearning-nlp-models/blob/master/nlpmodels/utils/elt/skipgram_dataset.py#L68, specifically how you generate the sampling table in get_word_discard_probas(). It looks like your implementation differs slightly from the original word2vec implementation: https://github.com/tmikolov/word2vec/blob/20c129af10659f7c50e86e3be406df663beff438/word2vec.c#L407.
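
For context, the paper gives a probability of discarding a word, P(w) = 1 - sqrt(t / f(w)) with f(w) the word's relative frequency, while the C code computes a probability of keeping it that has an extra additive term. A rough side-by-side (my paraphrase; count / total is my notation, not a name from either codebase):

import numpy as np

def discard_proba_paper(count, total, t=1e-5):
    # Paper: P(discard w) = 1 - sqrt(t / f(w)), with f(w) the relative frequency.
    return 1.0 - np.sqrt(t / (count / total))

def keep_proba_c_code(count, total, sample=1e-5):
    # word2vec.c: keep if rand < (sqrt(count / (sample*total)) + 1) * (sample*total) / count,
    # which simplifies to sqrt(sample / f(w)) + sample / f(w) -- note the extra term.
    return (np.sqrt(count / (sample * total)) + 1) * (sample * total) / count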

Something like this worked for me when passing a collections.Counter (or any dict) with the item counts:

import numpy as np

def sampling_probabilities(item_counts, sample=1e-5):
    # Per-item probability of *keeping* an occurrence, matching the word2vec C code:
    # p(w) = (sqrt(count / (sample * total)) + 1) * (sample * total) / count
    counts = np.array(list(item_counts.values()), dtype=np.float64)
    total_count = counts.sum()
    threshold = sample * total_count
    probabilities = (np.sqrt(counts / threshold) + 1) * threshold / counts
    # Values can exceed 1.0 for rare items; clipping is only useful
    # if you wish to plot the probability distribution.
    # probabilities = np.minimum(probabilities, 1.0)
    return {k: probabilities[i] for i, k in enumerate(item_counts.keys())}
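
To be concrete, here is how I use it, on a made-up toy corpus, subsampling each occurrence independently the way the C code does:

from collections import Counter
import numpy as np

tokens = ["the"] * 1000 + ["cat"] * 50 + ["sat"] * 5  # toy corpus
probas = sampling_probabilities(Counter(tokens), sample=1e-3)

rng = np.random.default_rng(0)
# Keep each occurrence independently with its word's keep probability.
kept = [t for t in tokens if rng.random() < probas[t]]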

Using sample=1e-5 on one of my smaller datasets, I get around a 17% chance of keeping the most frequent item. This will of course vary a lot from dataset to dataset. There is a StackOverflow thread discussing the sampling: https://stackoverflow.com/questions/58772768/word2vec-subsampling-implementation.
