Skip to content
This repository was archived by the owner on May 21, 2024. It is now read-only.

LeskoIam/k-means_python_module_dev

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

46 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Simple 2D K-means Python module implemented in pure C

Current version: 1.0.1

Intro

What this is

It's an example on how to create and build extension module for Python 2.7 in pure C. It demonstrates how to parse input and transform it to C data types then reversing the process for returning back to Python. Example is not the most trivial one so I hope it explains a little bit more then just how to pass a single variable and print it out from C.

What this is NOT

This is not meant to be a tutorial and is not written as such. It also does not talk about K-means algorithm but uses it as an example.

Backstory

For fun and games I'm gathering geolocation data for taxis in one of Europe's capital cities. In a year or so I am already at 68 million rows of raw data which boils down to 1.8 million distinct events. Every one of this events has latitude and longitude information (and some other info) attached.

I wanted to do some clustering of the data to see where taxis pick up customers most of the time. Looking at different algorithms I decided for K-means. It's one of the simplest and easiest to understand.

First I tried with Python. I found a pure Python implementation of K-means algorithm (original here) on the web. But it soon became obvious that it gets really slow really fast when amount of data supplied to it increases (70 seconds for 13000 points in 50 clusters). Enter C with promises of speedups.

I found C implementations of K-means but could not make heads or tails of it. Decision was made to implement it myself. I do have some prior experience with C but haven't done anything with it for some time, so it was a great re-learning experience especially in the end when I needed to wrap everything for calling from and returning to Python.

C source code

Everything dealing with Python is in ckmeans.c. All the code is commented and explained in the file.

Code related to K-means algorithm is in lib folder. Split between k_means.c and utils.c.

Top level look at workings of the module

C module receives two lists and an integer from Python .k_means([lat1, lat2, ...], [lon1, lon2, ...], K) and it returns list of dictionaries [{"center": (lat, lon), "num_points": N}].

How to compile to .pyd file on Windows (Windows10):

  1. Clone repo

  2. Install MinGW

  3. Add MinGW to PATH

  4. Execute make.bat to compile.

    • Run make.bat for 'production' compiling
    • Use -D flag to compile with debugging enabled (make.bat -D)

    If compiling is successful you will find ckmeans.pyd file in build\ directory. pyd is like .dll file but you can import it as any other python file.

  5. import into Python and use

import ckmeans
import random
import time

N = 10000
K = 50

lat = [random.randint(1000, 2000)/100. for _ in xrange(N)]
lon = [random.randint(4000, 5000)/100. for _ in xrange(N)]

s_time = time.time()
ans = ckmeans.k_means(lat, lon, K)

print "Time to cluster {0} points in to {1} clusters was: {2} seconds".format(N, K, (time.time() - s_time))

Resources used

TODO

  • write setup.py
  • return points belonging to clusters
  • clean up k_means and utils files
  • some .h file refactoring, probably
  • decide if cuttoff distance for K-means should be supplied by user?

Technical details

  • Windows10 64 Professional
  • Python 2.7.13
  • gcc (MinGW.org GCC-6.3.0-1) 6.3.0
  • CLion 2017.2 as an editor