A repo to explore article vectorisation with a scikit-learn API.
The repo was built for the talk "Creating and assessing media article embeddings", given to the Sofia Data Science Society.
The data used is a Kaggle dataset containing Medium articles. The dataset can be downloaded directly from Kaggle or via the Kaggle API, for which you need a Kaggle username and key, for example supplied as environment variables:
KAGGLE_USERNAME=***************
KAGGLE_KEY=******************
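As a rough illustration of the API route, the sketch below checks that both credentials are set and then downloads a dataset with the official kaggle Python package. The helper names (kaggle_credentials_present, download_dataset), the target directory, and the dataset slug argument are assumptions made for this example; the README does not name the exact dataset slug, so it is left as a parameter.

```python
import os


def kaggle_credentials_present(env=None):
    """Return True if both Kaggle credential variables are set and non-empty."""
    env = os.environ if env is None else env
    return all(env.get(name) for name in ("KAGGLE_USERNAME", "KAGGLE_KEY"))


def download_dataset(dataset_slug, target_dir="data"):
    """Download and unzip a Kaggle dataset (hypothetical helper, not part of the repo).

    dataset_slug has the form "owner/dataset-name"; the actual slug of the
    Medium articles dataset is not stated here and must be supplied by you.
    """
    if not kaggle_credentials_present():
        raise RuntimeError("Set KAGGLE_USERNAME and KAGGLE_KEY before downloading.")
    # Imported lazily: the kaggle package tries to authenticate on import.
    from kaggle.api.kaggle_api_extended import KaggleApi

    api = KaggleApi()
    api.authenticate()
    api.dataset_download_files(dataset_slug, path=target_dir, unzip=True)
```

Calling download_dataset("owner/dataset-name") with the real slug would place the unzipped files under data/; downloading manually from the Kaggle website works just as well.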
- Vectorisation with the libraries' native APIs - notebooks/Article_Vectorisation_Exploration.ipynb
- Vectorisation with the scikit-learn API (using the articlevectorizer package) - notebooks/Article_Vectorisation_sklearn_api.ipynb
- Showcase bulk project by Vincent Warmerdam by running
python -m bulk text data/titles_and_vectors_for_bulk.csv
- bulk requires a CSV with the text data in a text column and columns x and y being the result of a dimensionality reduction technique. For this demo, UMAP was used.
- A small Streamlit app to showcase the BERTopic project - run
streamlit run streamlit_demo/bertopic_demo.py
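To make the CSV layout that bulk expects more concrete, here is a minimal sketch that writes a file with text, x and y columns using only the standard library. It assumes the 2D coordinates have already been computed by a dimensionality reduction step (UMAP in this demo); the function name write_bulk_csv and the sample values are illustrative and not taken from the repo.

```python
import csv


def write_bulk_csv(path, texts, coords):
    """Write one row per text with its 2D coordinates in columns x and y."""
    if len(texts) != len(coords):
        raise ValueError("texts and coords must have the same length")
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        # Header row: the column names bulk looks for.
        writer.writerow(["text", "x", "y"])
        for text, (x, y) in zip(texts, coords):
            writer.writerow([text, float(x), float(y)])
```

For instance, write_bulk_csv("my_titles.csv", ["First title", "Second title"], [(0.1, 0.2), (1.0, -0.5)]) produces a three-line file (header plus two rows) that can be passed to python -m bulk text in place of the demo CSV.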
Requirements can be installed via
pip install -r requirements.txt