Current version: 1.0.1
This is an example of how to create and build an extension module for Python 2.7 in pure C. It demonstrates how to parse input from Python and transform it into C data types, then how to reverse the process when returning results to Python. The example is not the most trivial one, so I hope it explains a little more than just how to pass a single variable and print it out from C.
This is not meant to be a tutorial and is not written as such. It also does not explain the K-means algorithm; it only uses it as an example.
For fun and games I'm gathering geolocation data for taxis in one of Europe's capital cities. After a year or so I already have 68 million rows of raw data, which boil down to 1.8 million distinct events. Every one of these events has latitude and longitude information (and some other info) attached.
I wanted to cluster the data to see where taxis pick up customers most of the time. After looking at different algorithms I decided on K-means. It's one of the simplest and easiest to understand.
First I tried Python. I found a pure Python implementation of the K-means algorithm (original here) on the web. But it soon became obvious that it gets really slow really fast as the amount of data supplied to it increases (70 seconds for 13000 points in 50 clusters). Enter C, with its promise of speedups.
I found C implementations of K-means but could not make heads or tails of them, so I decided to implement it myself. I do have some prior experience with C but hadn't done anything with it for some time, so it was a great re-learning experience, especially at the end when I needed to wrap everything for calling from and returning to Python.
Everything dealing with Python is in ckmeans.c. All the code is commented and explained in the file.
Code related to the K-means algorithm is in the lib folder, split between k_means.c and utils.c.
The C module receives two lists and an integer from Python:
ckmeans.k_means([lat1, lat2, ...], [lon1, lon2, ...], K)
and it returns a list of dictionaries:
[{"center": (lat, lon), "num_points": N}]
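To make the input and output shapes concrete, here is a minimal pure-Python sketch that mimics the interface described above. It is not the code from ckmeans.c or the lib folder: the name k_means_reference, the max_iter and seed parameters, the random choice of initial centers and the stopping rule are all assumptions made for illustration. It runs on both Python 2.7 and Python 3.

```python
import random


def k_means_reference(lat, lon, k, max_iter=100, seed=0):
    """Hypothetical stand-in that mimics the documented k_means() interface."""
    if len(lat) != len(lon):
        raise ValueError("lat and lon must have the same length")
    points = list(zip(lat, lon))
    if not 0 < k <= len(points):
        raise ValueError("k must be between 1 and the number of points")

    # Assumption: initial centers are k randomly chosen input points.
    rng = random.Random(seed)
    centers = rng.sample(points, k)

    for _ in range(max_iter):
        # Assignment step: attach every point to its nearest center.
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(
                range(k),
                key=lambda i: (p[0] - centers[i][0]) ** 2
                + (p[1] - centers[i][1]) ** 2,
            )
            groups[nearest].append(p)

        # Update step: move each center to the mean of its points.
        new_centers = []
        for i, group in enumerate(groups):
            if group:
                n = float(len(group))
                new_centers.append(
                    (sum(p[0] for p in group) / n, sum(p[1] for p in group) / n)
                )
            else:
                # An empty cluster keeps its previous center.
                new_centers.append(centers[i])

        # Assumption: stop as soon as no center moves.
        if new_centers == centers:
            break
        centers = new_centers

    # Same shape as the documented return value.
    return [
        {"center": c, "num_points": len(g)} for c, g in zip(centers, groups)
    ]
```

Calling k_means_reference([0.0, 0.1, 10.0], [0.0, 0.1, 10.0], 2) returns a list of two dictionaries, each with a "center" tuple and a "num_points" count, which is the structure the C module builds before handing it back to Python.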
- Clone repo
- Install MinGW
- Add MinGW to PATH
- Execute make.bat to compile.
  - Run make.bat for 'production' compiling
  - Use -D flag to compile with debugging enabled (make.bat -D)

  If compiling is successful you will find the ckmeans.pyd file in the build\ directory. A .pyd is like a .dll file, but you can import it like any other Python file.
- Import into Python and use
import ckmeans
import random
import time
N = 10000
K = 50
lat = [random.randint(1000, 2000)/100. for _ in xrange(N)]
lon = [random.randint(4000, 5000)/100. for _ in xrange(N)]
s_time = time.time()
ans = ckmeans.k_means(lat, lon, K)
print "Time to cluster {0} points into {1} clusters was: {2} seconds".format(N, K, (time.time() - s_time))
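Once the call returns, the result is an ordinary list of dictionaries, so it can be processed with plain Python. The sketch below ranks clusters by size. The helper name busiest_clusters and the sample values are made up for illustration and only follow the documented return shape; in real use you would pass ans from the snippet above instead of sample. It runs on both Python 2.7 and Python 3.

```python
def busiest_clusters(clusters, top_n=3):
    """Return the top_n clusters with the most points, largest first."""
    return sorted(clusters, key=lambda c: c["num_points"], reverse=True)[:top_n]


# Made-up sample data in the documented shape:
# [{"center": (lat, lon), "num_points": N}]
sample = [
    {"center": (15.20, 45.10), "num_points": 120},
    {"center": (12.75, 48.30), "num_points": 480},
    {"center": (18.40, 41.95), "num_points": 75},
]

for cluster in busiest_clusters(sample, top_n=2):
    lat, lon = cluster["center"]
    print("{0:.2f}, {1:.2f}: {2} points".format(lat, lon, cluster["num_points"]))
```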
- General C resources
- DevDocs - General C documentation
- Using malloc for allocation of multi-dimensional arrays with different row lengths - StackOverflow question with great answer
- C: Correctly freeing memory of a multi-dimensional array - StackOverflow question with great answer
- #ifdef DEBUG print statements in C
- Programming in C (fourth edition) by Stephen G. Kochan (ISBN: 978-0-321-77641-9)
- Python extensions resources
- Custom Python Part 1: Extensions - great examples on how to dynamically build Python objects
- Python C Extensions - forum question with good answer on how to dereference used Python objects
- The Py_BuildValue() Function - general documentation on Py_BuildValue()
- Parsing arguments and building values - general documentation on Py_BuildValue()
- Dictionary Objects - dictionary object documentation
- Using PyArg_ParseTuple with a list - example on how to pass more than a string from Python to C
- Using Py_BuildValue() to create a list of tuples in C - StackOverflow question with great answer and links
- Python Programming/Extending with C - basic example of C extension
- Writing Python/C extensions by hand - another, slightly more advanced example of a C extension
- Beginning Python by Peter Norton, Alex Samuel, David Aitel, ... (ISBN: 0-7645-9654-3)
- write setup.py
- return points belonging to clusters
- clean up k_means and utils files
- some .h file refactoring, probably
- decide whether the cutoff distance for K-means should be supplied by the user
- Windows10 64 Professional
- Python 2.7.13
- gcc (MinGW.org GCC-6.3.0-1) 6.3.0
- CLion 2017.2 as an editor