I was looking through your implementation of subsampling of frequent words in https://github.com/will-thompson-k/deeplearning-nlp-models/blob/master/nlpmodels/utils/elt/skipgram_dataset.py#L68, and specifically how you generate your sampling table in nlpmodels/utils/vocabulary.py (line 159 at commit d3afac4). Something like this worked for me when passing a collections.Counter or dict with the item counts:
import numpy as np

def sampling_probabilities(item_counts, sample=1e-5):
    # Keep probability per item, following the word2vec subsampling formula:
    # p_keep = (sqrt(f / sample) + 1) * sample / f, where f = count / total_count.
    counts = np.array(list(item_counts.values()))
    total_count = counts.sum()
    probabilities = (np.sqrt(counts / (sample * total_count)) + 1) * (sample * total_count) / counts
    # Only useful if you wish to plot the probability distribution
    # probabilities = np.minimum(probabilities, 1.0)
    return {k: probabilities[i] for i, k in enumerate(item_counts.keys())}
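For reference, here is a minimal usage sketch; the toy token list is made up purely to illustrate the call, and on a corpus this small the keep probabilities are not meaningful:

from collections import Counter

# Hypothetical toy corpus; in practice item_counts would come from your real token stream.
tokens = ["the", "the", "the", "the", "cat", "sat", "on", "the", "mat"]
item_counts = Counter(tokens)

probs = sampling_probabilities(item_counts, sample=1e-5)
# probs maps each token to its keep probability; the formula is calibrated for large corpora,
# so on a toy corpus every value is tiny and this only demonstrates the call signature.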
Using 1e-5 for sampling on one of my smaller datasets, I get around a 17% chance of keeping the most frequent item. This will of course differ a lot from dataset to dataset. There is a StackOverflow thread discussing the sampling: https://stackoverflow.com/questions/58772768/word2vec-subsampling-implementation.
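For completeness, a sketch of how I apply these keep probabilities during training; the subsample_tokens helper is my own illustration, not code from the repository:

import random

def subsample_tokens(tokens, probs):
    # Keep each token occurrence with probability probs[token] (word2vec-style subsampling).
    # Tokens missing from probs (e.g. out-of-vocabulary) are kept unchanged here.
    return [t for t in tokens if random.random() < probs.get(t, 1.0)]

kept_tokens = subsample_tokens(tokens, probs)  # reusing tokens/probs from the sketch above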