article-vectorisation-eda

A repo to explore article vectorisation with a scikit-learn API.

The repo was built for the talk "Creating and assessing media article embeddings", given at the Sofia Data Science Society.

Data

The data is a Kaggle dataset of Medium articles. It can be downloaded directly from Kaggle or via the Kaggle API (which requires a Kaggle username and API key):

KAGGLE_USERNAME=***************

KAGGLE_KEY=******************
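A minimal sketch of the API route, assuming the two variables above are set as environment variables and the official kaggle Python package is installed. The helper names and the dataset slug argument are hypothetical placeholders for illustration; the actual dataset identifier is not given here.

```python
import os

REQUIRED_VARS = ("KAGGLE_USERNAME", "KAGGLE_KEY")


def missing_kaggle_credentials(env=None):
    """Return the names of required Kaggle variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


def download_dataset(dataset_slug, target_dir="data"):
    """Download and unzip a Kaggle dataset into target_dir.

    dataset_slug is a placeholder of the form "<owner>/<dataset-name>".
    """
    missing = missing_kaggle_credentials()
    if missing:
        raise RuntimeError("Missing Kaggle credentials: " + ", ".join(missing))

    # Imported lazily: the kaggle package may try to authenticate on import,
    # so the credentials are checked first.
    from kaggle.api.kaggle_api_extended import KaggleApi

    api = KaggleApi()
    api.authenticate()
    api.dataset_download_files(dataset_slug, path=target_dir, unzip=True)
```

Calling download_dataset with the dataset's slug would place the unzipped files in the data folder; checking the credentials up front gives a clearer error than a failed authentication.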

Showcases

Vectorisation with the libraries' native APIs - notebooks/Article_Vectorisation_Exploration.ipynb
Vectorisation with the scikit-learn API (using the articlevectorizer package) - notebooks/Article_Vectorisation_sklearn_api.ipynb
A showcase of the bulk project by Vincent Warmerdam - run python -m bulk text data/titles_and_vectors_for_bulk.csv
- bulk requires a CSV with the text data in a column named text, plus columns x and y holding the output of a dimensionality reduction technique. For this demo, UMAP was used.
A small Streamlit app to showcase the BERTopic project - run streamlit run streamlit_demo/bertopic_demo.py
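The CSV layout that bulk expects (a text column plus x and y) can be sketched as below. This is only an illustration using the standard library: write_bulk_csv is a hypothetical helper, not part of this repo or of bulk, and the 2-D coordinates are assumed to have been computed beforehand by a dimensionality reduction step such as UMAP.

```python
import csv


def write_bulk_csv(path, texts, coords):
    """Write texts and their 2-D coordinates to a CSV with text, x, y columns.

    texts  -- list of strings (e.g. article titles)
    coords -- list of (x, y) pairs, one per text, from a 2-D projection
    """
    if len(texts) != len(coords):
        raise ValueError("texts and coords must have the same length")

    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["text", "x", "y"])
        writer.writeheader()
        for text, (x, y) in zip(texts, coords):
            writer.writerow({"text": text, "x": float(x), "y": float(y)})
    return path
```

A file written this way could then be passed to python -m bulk text in place of the demo CSV.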

Slides for the talk:

Slides

Requirements

Requirements can be installed via

pip install -r requirements.txt

krumeto/article-vectorisation-eda

Folders and files

Latest commit

History

Repository files navigation

article-vectorisation-eda

Data

Showcases

Slides for the talk:

Requirements

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Languages

Packages