This is great! How can one incorporate dimensionality reduction into the pipeline? For substantive and speed reasons, I'd like to exclude the most and least common words:
corpus = st.CorpusFromPandas(df,
category_col='country',
text_col='text',
nlp=nlp,
# can we discard 1st and 99th percentile of words here?
).build()
Right now there's no way to exclude terms during corpus construction. However, after the corpus is constructed, you can easily remove outlying terms. For example:
import numpy as np

# Remove bigrams from corpus.
corpus = corpus.get_unigram_corpus()

# Create a pandas Series indexed on words containing their frequencies
term_frequencies = corpus.get_term_freq_df().sum(axis=1)

# Get the terms in the 99th and 1st percentiles
terms_99th_pctl = term_frequencies[term_frequencies >= np.percentile(term_frequencies, 99)].index
terms_1st_pctl = term_frequencies[term_frequencies <= np.percentile(term_frequencies, 1)].index

# Remove them from the corpus
# (.union() is used because `|` on Index objects is no longer a set union in pandas 2.x)
reduced_corpus = corpus.remove_terms(terms_99th_pctl.union(terms_1st_pctl))
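To see what the percentile cut-offs select before touching a real corpus, here is a minimal sketch of the same filtering step on a toy frequency Series. It uses only pandas and NumPy. The helper name `outlying_terms` and the toy counts are made up for illustration and are not part of scattertext.

```python
import numpy as np
import pandas as pd


def outlying_terms(term_frequencies: pd.Series,
                   low_pct: float = 1,
                   high_pct: float = 99) -> pd.Index:
    """Return terms at or below the low percentile, or at or above the high one.

    Hypothetical helper for illustration; not a scattertext function.
    """
    low_cut = np.percentile(term_frequencies, low_pct)
    high_cut = np.percentile(term_frequencies, high_pct)
    # Boolean mask marking both tails of the frequency distribution
    mask = (term_frequencies <= low_cut) | (term_frequencies >= high_cut)
    return term_frequencies[mask].index


# Toy frequencies standing in for corpus.get_term_freq_df().sum(axis=1)
toy = pd.Series({'the': 1000, 'cat': 40, 'dog': 35, 'runs': 20, 'zyzzyva': 1})
print(list(outlying_terms(toy)))  # ['the', 'zyzzyva']
```

The resulting Index could then be passed to `corpus.remove_terms(...)` as in the reply above. One caveat: in real corpora many terms tend to share the lowest count (often 1), so a `<=` comparison against the 1st percentile may drop far more than 1% of the vocabulary. It is worth checking the length of the returned Index before removing anything.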