Aside from the NewsAPI API key, all relevant data is included in this repository. The dataset is frozen. Running `make dataset` against live data would update the current dataset in `datasets/data.json`; however, the dataset was frozen to allow for consistent experiments.
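For reference, the refresh step is just the Makefile target below; it overwrites the frozen file and requires a working NewsAPI key, so it is not needed when reproducing the experiments:

```sh
# Re-fetch live data and overwrite datasets/data.json
# (requires a NewsAPI API key; not needed to reproduce the frozen experiments)
make dataset
```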
All preprocessors are stored in `featurizers/`, and the preprocessed data in `featurizers/featurized_data/`. To run a given featurizer, run `make featurize/<featurizer name>`, or `make featurize/all` to run all preprocessors.
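As copy-pasteable commands (substitute `<featurizer name>` with one of the featurizers in `featurizers/`):

```sh
# Run a single preprocessor; output is written to featurizers/featurized_data/
make featurize/<featurizer name>

# Run every preprocessor
make featurize/all
```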
Due to the way our infrastructure was designed, models are not independent of preprocessors: a given model file in `models/` combines a clustering model, its parameters, and which preprocessed data it ingests. Running any `models/<model name>.py` trains and saves the model into `models/models/`.
Note that the `biterm` and `tkm` models may not be runnable because they require specific installation and setup from https://github.com/markoarnauto/biterm and https://github.com/JohnTailor/tkm, respectively (see the Miscellaneous section for more info). If using the Makefile, run `make model/<model name>` to train a given model or `make model/all` to train all 10-cluster K-means models.
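In short, training looks like this (with `<model name>` standing in for any model file in `models/`):

```sh
# Train one model directly; the trained model is saved into models/models/
python models/<model name>.py

# Equivalent Makefile targets
make model/<model name>   # train a single model
make model/all            # train all 10-cluster K-means models
```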
First, for quantitative analysis of the clustering models, run `python analyzers/cluster_score.py <model name>` (or, to run it across K-means models with different cluster numbers, run `make clusterscore/clusternums`). Then, for qualitative analysis, run `python analyzers/visualize.py <model name>` (or, again, `make visualize/all` for 10-cluster K-means visualization). Output scatterplots use the first three PCA dimensions and are stored in `analyzers/graphs/`.
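The analysis commands, summarized:

```sh
# Quantitative clustering scores
python analyzers/cluster_score.py <model name>
make clusterscore/clusternums   # scores across K-means models with different cluster counts

# Qualitative PCA scatterplots (first three PCA dimensions), saved to analyzers/graphs/
python analyzers/visualize.py <model name>
make visualize/all              # 10-cluster K-means visualizations
```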
Then, to analyze the results of the keyword extractors, cluster keywords can be examined by running `python analyzers/group.py <model name> <extractor name>` (or, for 10-cluster K-means, `make group/all extractor=<extractor name>`). The cluster keywords are stored in `groupings/`. The effectiveness of the cluster keywords can be quantitatively evaluated using `python analyzers/regression.py <model name> <extractor name>` (or, again, `make analyze/exp10` for 10-cluster K-means).
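As concrete commands:

```sh
# Extract cluster keywords (written to groupings/)
python analyzers/group.py <model name> <extractor name>
make group/all extractor=<extractor name>   # 10-cluster K-means

# Evaluate keyword effectiveness quantitatively
python analyzers/regression.py <model name> <extractor name>
make analyze/exp10                          # 10-cluster K-means
```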
After running `make analyze/exp10`, the difference values were manually copied into `manual/accuracy_diffs.json`, and `manual/graph_acc_diffs.py` was run to produce `manual/accuracy_differences.png`.
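For completeness, the manual workflow amounts to the following sketch; it assumes `manual/graph_acc_diffs.py` takes no command-line arguments, which may not match the actual script:

```sh
make analyze/exp10                 # produces the difference values
# ...manually copy the difference values into manual/accuracy_diffs.json...
python manual/graph_acc_diffs.py   # assumed invocation; writes manual/accuracy_differences.png
```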
Both repositories mentioned earlier were `git clone`d into `vendor/`, and `pip install .` was run from within each repository's home directory. A Cython environment was used for installing and training both the `biterm` and `tkm` models.
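This setup corresponds roughly to the commands below (a sketch; it assumes `vendor/` already exists and that Cython is available in the active environment):

```sh
# Vendor and install the third-party model implementations
cd vendor/
git clone https://github.com/markoarnauto/biterm
git clone https://github.com/JohnTailor/tkm
(cd biterm && pip install .)
(cd tkm && pip install .)
```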