Scientific Data Management Assignment 1 KMeans Clustering
First run python setup.py install, then run pip install -r requirements.txt to install the dependencies.

Recommended style guide: https://www.python.org/dev/peps/pep-0008/
- Everyone writes someone else's test cases (Lorenz: Good idea but project too small IMO)
- Everyone hardcodes expected outputs of their domain
- Everyone takes the inputs and works on developing an output, testing against the test cases
- Split the work into update and init algorithms and we work on those
- Train the models and tweak algorithm
- Test and improve
- Run against test set
- Upload
- Implement basic test cases
- Implement a fully working script with defined K, random initialisation and Lloyd's algorithm
- Implement MacQueen update and farthest point technique
- Implement 1 other pre-clustered sample initialisation technique
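
As a starting point for the "fully working script" milestone above, here is a minimal sketch of Lloyd's algorithm with random initialisation. It is pure standard-library Python and only illustrative: the names lloyd_kmeans, nearest and squared_distance, and the max_iter and seed parameters, are assumptions and not part of the project code.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(point, centroids):
    """Index of the centroid closest to `point` (ties go to the lowest index)."""
    return min(range(len(centroids)),
               key=lambda i: squared_distance(point, centroids[i]))


def lloyd_kmeans(points, k, max_iter=100, seed=0):
    """Lloyd's algorithm: assign every point, then recompute every centroid."""
    rng = random.Random(seed)
    # Random initialisation: use k distinct data points as starting centroids.
    centroids = [tuple(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(max_iter):
        new_labels = [nearest(p, centroids) for p in points]
        if new_labels == labels:
            break  # assignments stopped changing, so we have converged
        labels = new_labels
        for i in range(k):
            members = [p for p, lab in zip(points, labels) if lab == i]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[i] = tuple(sum(c) / len(members)
                                     for c in zip(*members))
    return centroids, labels


if __name__ == "__main__":
    data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
    print(lloyd_kmeans(data, k=2))
```

The batch update (all assignments first, then all centroids) is what distinguishes Lloyd's algorithm from the MacQueen update listed in the same milestone.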
STRATEGY PATTERN TODO:
- Create 2 concrete classes for the update_strategy (1 hr)
- Create 3 concrete classes for init_strategy (1 hr) (Lorenz: RandomInit and FarthestPointInit already implemented, 1 more to go)
- Update kmeans.py (context class) to implement this (1 hr)
- Decide on which 2 initialisation strategies we use (2 hrs) (Already decided: FarthestPointInit and PreClusteredSampleInit)
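
A rough sketch of how the strategy pattern from the list above could fit together. RandomInit, FarthestPointInit, init_strategy and update_strategy come from these notes; everything else (the InitStrategy and UpdateStrategy base classes, LloydUpdate, the initialise/update/fit method names, sq_dist) is assumed for illustration and may not match the real kmeans.py.

```python
import random
from abc import ABC, abstractmethod


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


class InitStrategy(ABC):
    @abstractmethod
    def initialise(self, points, k):
        """Return k starting centroids."""


class RandomInit(InitStrategy):
    def __init__(self, seed=0):
        self.rng = random.Random(seed)

    def initialise(self, points, k):
        return [tuple(p) for p in self.rng.sample(points, k)]


class FarthestPointInit(InitStrategy):
    def initialise(self, points, k):
        # Start from the first point (an arbitrary, deterministic choice),
        # then keep adding the point farthest from its nearest centroid.
        centroids = [tuple(points[0])]
        while len(centroids) < k:
            farthest = max(
                points,
                key=lambda p: min(sq_dist(p, c) for c in centroids))
            centroids.append(tuple(farthest))
        return centroids


class UpdateStrategy(ABC):
    @abstractmethod
    def update(self, points, centroids):
        """Return (new_centroids, labels) after one pass over the data."""


class LloydUpdate(UpdateStrategy):
    def update(self, points, centroids):
        labels = [min(range(len(centroids)),
                      key=lambda i: sq_dist(p, centroids[i]))
                  for p in points]
        new_centroids = []
        for i, old in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == i]
            if members:
                new_centroids.append(
                    tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centroids.append(old)  # empty cluster: keep old centroid
        return new_centroids, labels


class KMeans:
    """Context class: delegates to whichever strategies it is given."""

    def __init__(self, k, init_strategy, update_strategy, max_iter=100):
        self.k = k
        self.init_strategy = init_strategy
        self.update_strategy = update_strategy
        self.max_iter = max_iter

    def fit(self, points):
        centroids = self.init_strategy.initialise(points, self.k)
        labels = None
        for _ in range(self.max_iter):
            centroids, new_labels = self.update_strategy.update(
                points, centroids)
            if new_labels == labels:
                break  # no assignment changed in this pass
            labels = new_labels
        return centroids, labels
```

Usage would look like KMeans(2, FarthestPointInit(), LloydUpdate()).fit(points). Swapping in another initialisation or update technique then only means passing a different object, with no change to the context class.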
KMEANS TODO
- Write test cases (ALL) (1 hr each). Divide up the test cases. Write outputs for functions (ALL)
- Import data from txt and csv filetype into a pandas dataframe (20 mins) (Lorenz: Done)
- Clean data if necessary? (Unnecessary)
- Display a summary of the data (20 mins) and, if k_clusters isn't defined, find the optimal number for K (might not be necessary?) (2 hrs)
- Split data into training and test sets, potentially creating a case for time series, but we can probably leave that for now. (20 mins)
- Implement each initialisation strategy (3 hrs each) (Lorenz: RandomInit and FarthestPointInit already implemented, 1 more to go. Workload estimate more like 6 hrs each IMO)
- Implement each update strategy (3 hrs each)
- Print our result and a visual representation of it. (30 mins)
- Improve our results (~)
- Submit! (30 mins)
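
For the "implement each update strategy" item, a minimal sketch of a MacQueen-style online update, assuming the common variant in which a centroid is adjusted as a running mean each time a single point is (re)assigned. The function name macqueen_kmeans and the max_passes parameter are made up for illustration, and the variant required by the assignment may differ. The starting centroids only steer the first assignments; they carry no weight in the running means.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def macqueen_kmeans(points, centroids, max_passes=10):
    """Online k-means: centroids move after every single (re)assignment."""
    centroids = [list(c) for c in centroids]
    counts = [0] * len(centroids)   # number of points currently in a cluster
    labels = [None] * len(points)
    for _ in range(max_passes):
        moved = False
        for idx, p in enumerate(points):
            new = min(range(len(centroids)),
                      key=lambda i: sq_dist(p, centroids[i]))
            old = labels[idx]
            if new == old:
                continue
            moved = True
            if old is not None:
                # Take p out of the running mean of the cluster it leaves.
                counts[old] -= 1
                n = counts[old]
                if n > 0:
                    centroids[old] = [(c * (n + 1) - x) / n
                                      for c, x in zip(centroids[old], p)]
            # Add p to the running mean of the cluster it joins.
            counts[new] += 1
            n = counts[new]
            centroids[new] = [c + (x - c) / n
                              for c, x in zip(centroids[new], p)]
            labels[idx] = new
        if not moved:
            break  # a full pass without any reassignment
    return [tuple(c) for c in centroids], labels
```

Unlike the batch update, the result can depend on the order in which points are visited, which is worth keeping in mind when hardcoding expected outputs for the test cases.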
- Will K always be given or do we have to find it out? Answer: it's labeled data, skin or no skin
- Is speed important? Should we be concerned with parallelisation?
- Points in space or points of data? (I think points of data)
- Does the data need to be cleaned?