sdm_kmeans

Scientific Data Management, Assignment 1: KMeans Clustering

Notes for Codebase

First run python setup.py install, then run pip install -r requirements.txt to install the dependencies.

Recommended style guide: PEP 8 (https://www.python.org/dev/peps/pep-0008/)

Way of approaching tasks

Everyone writes someone else's test cases (Lorenz: Good idea, but the project is too small IMO)
Everyone hardcodes the expected outputs for their domain
Everyone takes the inputs and develops an output, testing against the test cases
Split the work into update and init algorithms and work on those
Train the models and tweak the algorithm
Test and improve
Run against the test set
Upload
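The "hardcode expected outputs, then develop against them" step above could look roughly like the sketch below. The helper assign_to_nearest is hypothetical and is not taken from this codebase; the inputs and expected labels are made up for illustration.

```python
import math


def assign_to_nearest(points, centroids):
    """Return, for each point, the index of its nearest centroid.

    Hypothetical helper for illustration, not a function from this repo.
    """
    labels = []
    for p in points:
        distances = [math.dist(p, c) for c in centroids]
        labels.append(distances.index(min(distances)))
    return labels


def test_assign_to_nearest():
    # Inputs and expected output are hardcoded by hand before the
    # implementation is tuned, as proposed in the list above.
    points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (9.0, 10.0)]
    centroids = [(0.0, 0.5), (9.5, 10.0)]
    expected = [0, 0, 1, 1]
    assert assign_to_nearest(points, centroids) == expected


test_assign_to_nearest()
```

A test runner such as pytest would pick up test_assign_to_nearest automatically; the explicit call at the end is only there so the sketch runs as a plain script.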

TEAM GOAL PRIORITIES

Implement basic test cases
Implement a fully working script with a defined K, random initialisation and Lloyd's algorithm
Implement the MacQueen update and the furthest-point technique
Implement one other initialisation technique: pre-clustered sample
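For reference, here is a minimal, dependency-free sketch of the second and third priorities: Lloyd's algorithm with random initialisation, and one common formulation of the MacQueen update (a running-mean update after every single assignment). The function names, the seed parameter and the handling of empty clusters are assumptions made for this sketch and do not describe the code in kmeans.py.

```python
import math
import random


def mean(points):
    """Component-wise mean of a non-empty list of equal-length tuples."""
    n = len(points)
    return tuple(sum(coord) / n for coord in zip(*points))


def nearest(point, centroids):
    """Index of the centroid closest to `point` (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda j: math.dist(point, centroids[j]))


def lloyd_kmeans(points, k, max_iter=100, seed=0):
    """Lloyd's algorithm; starting centroids are k data points sampled without replacement."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Assignment step: label every point with its nearest centroid.
        new_labels = [nearest(p, centroids) for p in points]
        if new_labels == labels:  # no point changed cluster: converged
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # assumption: an empty cluster keeps its old centroid
                centroids[j] = mean(members)
    return centroids, labels


def macqueen_pass(points, centroids):
    """One pass of a MacQueen-style update: move the winning centroid after each assignment."""
    centroids = [tuple(c) for c in centroids]
    counts = [1] * len(centroids)  # each starting centroid counts as one observation
    labels = []
    for p in points:
        j = nearest(p, centroids)
        counts[j] += 1
        # Running mean: c <- c + (p - c) / n
        centroids[j] = tuple(c + (x - c) / counts[j] for c, x in zip(centroids[j], p))
        labels.append(j)
    return centroids, labels


points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centroids, labels = lloyd_kmeans(points, k=2)
print(centroids, labels)
```

The practical difference is when the centroids move: Lloyd's algorithm recomputes them once per full pass over the data, while the MacQueen-style update shifts the winning centroid immediately, so the result of a pass depends on the order of the points.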

Team TODO List

STRATEGY PATTERN TODO:

Create 2 concrete classes for update_strategy (1 hr)
Create 3 concrete classes for init_strategy (1 hr) (Lorenz: RandomInit and FarthestPointInit already implemented, 1 more to go)
Update kmeans.py (context class) to use these strategies (1 hr)
Decide which 2 initialisation strategies we use (2 hrs) (Already decided: FarthestPointInit and PreClusteredSampleInit)
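A rough sketch of how the strategy pattern from this list might fit together is shown below. The class names RandomInit and FarthestPointInit come from the notes above; everything else (the base-class names, LloydUpdate, the method names initialise, update and fit, and starting FarthestPointInit from the first point) is an assumption for illustration, not the actual interface in init_strategies.py, update_strategies.py or kmeans.py.

```python
import math
import random
from abc import ABC, abstractmethod


class InitStrategy(ABC):
    """Interface for choosing the k starting centroids (hypothetical name)."""

    @abstractmethod
    def initialise(self, points, k):
        ...


class UpdateStrategy(ABC):
    """Interface for one refinement step of the centroids (hypothetical name)."""

    @abstractmethod
    def update(self, points, centroids):
        ...


class RandomInit(InitStrategy):
    def __init__(self, seed=0):
        self.rng = random.Random(seed)

    def initialise(self, points, k):
        # Pick k data points without replacement.
        return self.rng.sample(points, k)


class FarthestPointInit(InitStrategy):
    def initialise(self, points, k):
        centroids = [points[0]]  # assumption: start from the first point
        while len(centroids) < k:
            # Add the point whose distance to its closest chosen centroid is largest.
            centroids.append(
                max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
            )
        return centroids


class LloydUpdate(UpdateStrategy):
    def update(self, points, centroids):
        """One Lloyd step: assign all points, then recompute every centroid."""
        labels = [
            min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        new_centroids = []
        for j, old in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centroids.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centroids.append(old)  # empty cluster keeps its centroid
        return new_centroids, labels


class KMeans:
    """Context class: delegates to whichever strategies it is given."""

    def __init__(self, k, init_strategy, update_strategy, max_iter=100):
        self.k = k
        self.init_strategy = init_strategy
        self.update_strategy = update_strategy
        self.max_iter = max_iter

    def fit(self, points):
        centroids = self.init_strategy.initialise(points, self.k)
        labels = None
        for _ in range(self.max_iter):
            centroids, new_labels = self.update_strategy.update(points, centroids)
            if new_labels == labels:  # assignments stable: stop
                break
            labels = new_labels
        return centroids, labels


points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
model = KMeans(k=2, init_strategy=FarthestPointInit(), update_strategy=LloydUpdate())
centroids, labels = model.fit(points)
print(centroids, labels)
```

With this layout, swapping RandomInit for FarthestPointInit, or adding a third init strategy, only changes the object passed to the context class; the fit loop itself stays untouched.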

KMEANS TODO

Write test cases (ALL) (1 hr each). Divide up the test cases. Write outputs for functions (ALL)
Import data from txt and csv filetype into a pandas dataframe (20 mins) (Lorenz: Done)
Clean data if necessary? (Unnecessary)
Display a summary of the data (20 mins) AND, if k_clusters isn't defined, find the optimal number for K (might not be necessary?) (2 hrs)
Split the data into training and test sets, potentially creating a case for time series, though that can probably be left for now. (20 mins)
Implement each initialisation strategy (3 hrs each) (Lorenz: RandomInit and FarthestPointInit already implemented, 1 more to go. Workload estimate more like 6 hrs each IMO)
Implement each update strategy (3 hrs each)
Print our result and a visual representation of it. (30 mins)
Improve our results (~)
Submit! (30 mins)
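The import, summary and split items above could be sketched with pandas as follows. The in-memory sample, the tab separator, the column names and the 25% test fraction are all assumptions made for this example; the real data files and their layout are not described here.

```python
import io

import pandas as pd

# Hypothetical stand-in for a tab-separated text file; in practice a file
# path would be passed to read_csv instead of this in-memory buffer.
raw = io.StringIO(
    "74\t85\t123\t1\n"
    "73\t84\t122\t1\n"
    "202\t211\t169\t2\n"
    "200\t209\t167\t2\n"
)

# Assumed layout: three numeric features followed by a class label.
df = pd.read_csv(raw, sep="\t", header=None, names=["x1", "x2", "x3", "label"])

# Summary of the data: count, mean, std, min, quartiles and max per column.
print(df.describe())

# Shuffle once with a fixed seed, then hold out 25% of the rows as a test set.
shuffled = df.sample(frac=1.0, random_state=0)
n_test = int(len(shuffled) * 0.25)
test_df = shuffled.iloc[:n_test]
train_df = shuffled.iloc[n_test:]
print(len(train_df), len(test_df))
```

A random shuffle like this would not be appropriate for the time-series case mentioned in the list, where the split would normally respect the order of the rows.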

TO FIND OUT

Will K always be given, or do we have to find it? Answer: it's labelled data, skin or no skin.
Is speed important? Should we be concerned with parallelisation?
Points in space or points of data? (I think points of data)
Does the data need to be cleaned?

samhiggs/kmeans-algorithm
