Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Dimensionality reduction #2

Open
ebaggott opened this issue Jun 1, 2018 · 1 comment
Open

Dimensionality reduction #2

ebaggott opened this issue Jun 1, 2018 · 1 comment

Comments

@ebaggott
Copy link

ebaggott commented Jun 1, 2018

This is great! How can one incorporate dimensionality reduction into the pipeline? For substantive and speed reasons, I'd like to exclude the most and least common words:

corpus = st.CorpusFromPandas(df,
category_col='country',
text_col='text',
nlp=nlp,
# can we discard 1st and 99th percentile of words here?
).build()

@JasonKessler
Copy link
Owner

Thanks!

Right now there's no way to exclude terms during corpus construction. However, after the corpus is constructed, you can easily remove outlying terms. For example:

# Remove bigrams from corpus.
corpus = corpus.get_unigram_corpus() 

# Create a pandas Series indexed on words containing their frequencies
term_frequencies = corpus.get_term_freq_df().sum(axis=1)

# Get the terms in the 99th and 1st percentiles
terms_99th_pctl = term_frequencies[term_frequencies >= np.percentile(term_frequencies, 99)].index
terms_1st_pctl = term_frequencies[term_frequencies <= np.percentile(term_frequencies, 1)].index

# Remove them from the corpus
reduced_corpus = corpus.remove_terms(terms_99th_pctl | terms_1st_pctl)

Hope this helps!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants