Dimensionality reduction #2

ebaggott · 2018-06-01T16:49:32Z

This is great! How can one incorporate dimensionality reduction into the pipeline? For substantive and speed reasons, I'd like to exclude the most and least common words:

corpus = st.CorpusFromPandas(df,
                             category_col='country',
                             text_col='text',
                             nlp=nlp,
                             # can we discard 1st and 99th percentile of words here?
                             ).build()

JasonKessler · 2018-06-01T17:29:23Z

Thanks!

Right now there's no way to exclude terms during corpus construction. However, after the corpus is constructed, you can easily remove outlying terms. For example:

import numpy as np

# Remove bigrams from the corpus, keeping only unigrams.
corpus = corpus.get_unigram_corpus()

# Create a pandas Series, indexed on terms, holding each term's total frequency
# summed across all categories.
term_frequencies = corpus.get_term_freq_df().sum(axis=1)

# Get the terms at or above the 99th percentile and at or below the 1st percentile.
terms_99th_pctl = term_frequencies[term_frequencies >= np.percentile(term_frequencies, 99)].index
terms_1st_pctl = term_frequencies[term_frequencies <= np.percentile(term_frequencies, 1)].index

# Remove them from the corpus. Index.union is used rather than `|`, which
# no longer performs a set union on a pandas Index in recent pandas versions.
reduced_corpus = corpus.remove_terms(terms_99th_pctl.union(terms_1st_pctl))
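To see what the percentile cutoffs select before running them on a real corpus, here is a minimal sketch on toy data. The helper name `outlying_terms` and the counts are hypothetical, made up for illustration and not part of scattertext; the Series simply stands in for the output of `corpus.get_term_freq_df().sum(axis=1)`. The 10th and 90th percentiles are used only because the toy vocabulary has ten terms.

```python
import numpy as np
import pandas as pd


def outlying_terms(term_frequencies, low_pct, high_pct):
    """Hypothetical helper: return terms at or below the low_pct percentile
    or at or above the high_pct percentile of the frequency distribution."""
    lower = np.percentile(term_frequencies, low_pct)
    upper = np.percentile(term_frequencies, high_pct)
    too_rare = term_frequencies[term_frequencies <= lower].index
    too_common = term_frequencies[term_frequencies >= upper].index
    return too_common.union(too_rare)


# Made-up counts standing in for corpus.get_term_freq_df().sum(axis=1)
term_frequencies = pd.Series(
    {"the": 500, "of": 300, "trade": 12, "tariff": 9, "treaty": 7,
     "export": 5, "quota": 3, "embargo": 2, "levy": 1, "cartel": 1}
)

terms_to_remove = outlying_terms(term_frequencies, low_pct=10, high_pct=90)
kept = term_frequencies.drop(terms_to_remove)

print(sorted(terms_to_remove))
print(len(kept), "of", len(term_frequencies), "terms kept")
```

Because the comparison is `<=`, every term tied at the lower cutoff is dropped. Here both terms with a count of 1 are removed, which is 20% of the vocabulary rather than 10%. Real corpora usually have many terms that occur only once, so the low-frequency cut may remove far more than 1% of terms. The resulting index can be passed to `corpus.remove_terms(...)` as in the snippet above.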

Hope this helps!
