A brief introduction to Text Processing and Analysis using Python.
- Use existing datasets
  - direct download
  - via API
- Scrape website
- Remove unneeded data
  - remove HTML tags
  - remove other unneeded parts
- Deal with encoding issues
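
The cleaning steps above can be sketched with the standard library alone. This is a minimal illustration, not a full scraper cleanup: the names `clean_html` and `_TextExtractor` are hypothetical, the input is assumed to be UTF-8 unless another encoding is passed in, and undecodable bytes are replaced rather than raising an error.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes and skips the contents of <script> and <style>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)  # turns &amp; into & etc.
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Join the text nodes with spaces and collapse runs of whitespace.
        # Caveat: a word broken up by inline tags is split at the tag.
        return " ".join(" ".join(self._chunks).split())


def clean_html(raw, encoding="utf-8"):
    """Return the visible text of an HTML string or byte string."""
    if isinstance(raw, bytes):
        # errors="replace" swaps undecodable bytes for U+FFFD instead of failing.
        raw = raw.decode(encoding, errors="replace")
    parser = _TextExtractor()
    parser.feed(raw)
    parser.close()
    return parser.text()
```

For example, `clean_html(b"<p>Caf\xc3\xa9 &amp; <b>tea</b></p>")` decodes the bytes, drops the tags, and resolves the `&amp;` entity.
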
- Tokenize document(s)
  - document(s) --> sentences
  - sentences --> words
- part-of-speech (POS) tag words (via NLTK)
- convert words to lowercase
- remove punctuation
- spellcheck (http://pythonhosted.org/pyenchant/)
- remove stopwords (http://stackoverflow.com/questions/19130512/stopword-removal-with-nltk)
- lemmatize/stem words (http://www.nltk.org/api/nltk.stem.html)
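
A rough, standard-library-only sketch of part of this pipeline (sentence splitting, word splitting, lowercasing, punctuation removal, stopword removal) might look like the following. The function names and the tiny stopword set are assumptions made for illustration. POS tagging, spellchecking, and lemmatizing are left out here; NLTK's `sent_tokenize`, `word_tokenize`, `pos_tag`, `stopwords` corpus, and `WordNetLemmatizer` cover those steps far more robustly.

```python
import re
import string

# Tiny illustrative stopword list (an assumption); NLTK ships a fuller one.
STOPWORDS = {"a", "an", "and", "the", "is", "are", "of", "to", "in", "on", "it"}


def split_sentences(text):
    """Naive split after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def split_words(sentence):
    """Split on whitespace; punctuation stays attached until normalize()."""
    return sentence.split()


def normalize(words):
    """Lowercase, strip ASCII punctuation, drop empty tokens and stopwords."""
    table = str.maketrans("", "", string.punctuation)
    cleaned = (w.lower().translate(table) for w in words)
    return [w for w in cleaned if w and w not in STOPWORDS]


def preprocess(text):
    """document --> sentences --> normalized word lists."""
    return [normalize(split_words(s)) for s in split_sentences(text)]
```

The naive sentence splitter breaks on abbreviations such as "Dr. Smith", which is one reason to prefer a trained tokenizer for real data.
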
3. Bag-of-words (https://en.wikipedia.org/wiki/Bag-of-words_model)
- scikit-learn: limitations of the bag-of-words representation (http://scikit-learn.org/stable/modules/feature_extraction.html#limitations-of-the-bag-of-words-representation)
- scikit-learn: TfidfVectorizer (http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html)
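
To make the idea concrete, here is a small pure-Python sketch of a bag-of-words count matrix and a textbook TF-IDF weighting (raw term count times `log(N / df)`). The function names are hypothetical. scikit-learn's `CountVectorizer` and `TfidfVectorizer` do the same job at scale, but `TfidfVectorizer` uses a smoothed IDF and normalizes each row by default, so its numbers will differ from this sketch.

```python
import math
from collections import Counter


def bag_of_words(docs):
    """Build a sorted vocabulary and a document-by-term count matrix.

    `docs` is a list of token lists, one list per document.
    """
    vocab = sorted({w for doc in docs for w in doc})
    counts = [Counter(doc) for doc in docs]
    matrix = [[c[w] for w in vocab] for c in counts]
    return vocab, matrix


def tf_idf(vocab, matrix):
    """Weight raw counts by log(N / df), the unsmoothed textbook variant."""
    n_docs = len(matrix)
    # Document frequency: number of documents containing each term.
    df = [sum(1 for row in matrix if row[j] > 0) for j in range(len(vocab))]
    idf = [math.log(n_docs / d) for d in df]
    return [[row[j] * idf[j] for j in range(len(vocab))] for row in matrix]
```

With this weighting, a term that occurs in every document gets weight zero, since it does not help tell the documents apart.
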
5. Training models (http://scikit-learn.org/stable/supervised_learning.html)
- Classification
- Sentiment analysis
- Topic Extraction
- ...
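
As one possible illustration of the classification and sentiment analysis items, the sketch below trains a multinomial naive Bayes classifier with add-one smoothing on token lists. It is a teaching example in plain Python with hypothetical function names, not the approach the scikit-learn guide prescribes; in practice `sklearn.naive_bayes.MultinomialNB` on top of a vectorizer is the usual choice.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(docs, labels):
    """Count documents per class and words per class.

    `docs` is a list of token lists; `labels` holds one label per document.
    """
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for tokens, label in zip(docs, labels):
        word_counts[label].update(tokens)
    vocab = {w for counts in word_counts.values() for w in counts}
    return class_counts, word_counts, vocab


def predict(model, tokens):
    """Return the label with the highest log-posterior for `tokens`."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        score = math.log(n_docs / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for w in tokens:
            if w in vocab:  # words never seen in training are ignored
                # Add-one (Laplace) smoothing avoids log(0).
                p = (word_counts[label][w] + 1) / (total_words + len(vocab))
                score += math.log(p)
        if score > best_score:
            best_label, best_score = label, score
    return best_label
```

A toy usage: train on a handful of labelled token lists such as `["great", "film"]` tagged `"pos"` and `["terrible", "film"]` tagged `"neg"`, then call `predict(model, ["great"])`.
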
- Matplotlib (http://matplotlib.org/)
- Tag cloud: develop your own or find one on GitHub
- ...
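
A simple starting point for visualization is a bar chart of the most frequent words. The sketch below is one way to do it, with hypothetical helper names and output file name; Matplotlib is imported inside the plotting function so that the counting helper still works where Matplotlib is not installed.

```python
from collections import Counter


def top_words(tokens, n=10):
    """Return the n most frequent tokens as (word, count) pairs."""
    return Counter(tokens).most_common(n)


def plot_top_words(pairs, path="top_words.png"):
    """Save a horizontal bar chart of (word, count) pairs to `path`."""
    # Imported here so top_words() works without Matplotlib installed.
    import matplotlib
    matplotlib.use("Agg")  # render to a file, no display needed
    import matplotlib.pyplot as plt

    words = [w for w, _ in pairs]
    counts = [c for _, c in pairs]
    fig, ax = plt.subplots()
    ax.barh(words, counts)
    ax.invert_yaxis()  # most frequent word at the top
    ax.set_xlabel("count")
    fig.tight_layout()
    fig.savefig(path)
    plt.close(fig)
```

Calling `plot_top_words(top_words(tokens))` on a cleaned token list writes the chart to `top_words.png` in the working directory.
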
- Regular expressions documentation in Python 3 (https://docs.python.org/3/library/re.html)
- https://stanford.edu/~rjweiss/public_html/IRiSS2013/text2/notebooks/cleaningtext.html
- https://www.analyticsvidhya.com/blog/2014/11/text-data-cleaning-steps-python/
- http://ieva.rocks/2016/08/07/cleaning-text-for-nlp/
- https://chrisalbon.com/python/cleaning_text.html