SKMeans #132
| @@ -0,0 +1,108 @@ | ||
| :digest: K-Means with Spherical Distances | ||
| :species: data | ||
| :sc-categories: FluidManipulation | ||
| :sc-related: Classes/FluidDataSet, Classes/FluidLabelSet, Classes/FluidKNNClassifier, Classes/FluidKNNRegressor, Classes/FluidKMeans | ||
| :see-also: KMeans, KNNClassifier, MLPClassifier, DataSet | ||
| :description: | ||
|
|
||
| Uses the K-means algorithm with cosine similarity to learn clusters and features from a :fluid-obj:`DataSet`. | ||
|
|
||
| :discussion: | ||
|
|
||
| :fluid-obj:`SKMeans` is an implementation of :fluid-obj:`KMeans` based on cosine distances instead of Euclidean ones, measuring the angles between normalised vectors. It is generally used to learn features from a :fluid-obj:`DataSet`. See this reference for a more technical explanation: https://machinelearningcatalogue.com/algorithm/alg_spherical-k-means.html | ||
|
|
||
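For intuition, the algorithm itself is small enough to sketch outside SuperCollider. The following is a minimal, hypothetical NumPy sketch of spherical k-means (not the FluCoMa implementation): every point is projected onto the unit hypersphere, assignment picks the centroid with the highest cosine similarity, and updated centroids are re-normalised back onto the sphere.

```python
import numpy as np

def spherical_kmeans(points, k, max_iter=100):
    """Toy spherical k-means sketch (illustrative only, not FluCoMa's code).

    Returns (labels, means): a cluster index per point, and the
    unit-normalised centroid of each cluster."""
    # Project every point onto the unit hypersphere
    X = points / np.linalg.norm(points, axis=1, keepdims=True)
    # Seed centroids from the first k points (simplistic; real code randomises)
    means = X[:k].copy()
    for _ in range(max_iter):
        # Assign each point to the centroid with the highest cosine similarity
        labels = np.argmax(X @ means.T, axis=1)
        # Recompute each centroid as the mean of its members...
        new_means = np.vstack([
            X[labels == j].mean(axis=0) if np.any(labels == j) else means[j]
            for j in range(k)
        ])
        # ...and re-normalise it back onto the sphere
        new_means /= np.linalg.norm(new_means, axis=1, keepdims=True)
        if np.allclose(new_means, means):  # converged before max_iter
            break
        means = new_means
    return labels, means
```

The only differences from plain k-means are the normalisation steps and the use of cosine similarity rather than Euclidean distance for assignment.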
| :control numClusters: | ||
|
|
||
| The number of clusters to classify data into. | ||
|
||
|
|
||
| :control maxIter: | ||
|
|
||
| The maximum number of iterations the algorithm will use whilst fitting. | ||
|
|
||
| :message fit: | ||
|
|
||
| :arg dataSet: A :fluid-obj:`DataSet` of data points. | ||
|
|
||
| :arg action: A function to run when fitting is complete, taking as its argument an array with the number of data points for each cluster. | ||
|
|
||
| Identify ``numClusters`` clusters in a :fluid-obj:`DataSet`. It will optimise until no improvement is possible, or up to ``maxIter``, whichever comes first. Subsequent calls will continue training from the stopping point with the same conditions. | ||
|
|
||
| :message predict: | ||
|
|
||
| :arg dataSet: A :fluid-obj:`DataSet` containing the data to predict. | ||
|
|
||
| :arg labelSet: A :fluid-obj:`LabelSet` to retrieve the predicted clusters. | ||
|
|
||
| :arg action: A function to run when the server responds. | ||
|
|
||
| Given a trained object, return the cluster ID for each data point in a :fluid-obj:`DataSet` to a :fluid-obj:`LabelSet`. | ||
|
|
||
| :message fitPredict: | ||
|
|
||
| :arg dataSet: A :fluid-obj:`DataSet` containing the data to fit and predict. | ||
|
|
||
| :arg labelSet: A :fluid-obj:`LabelSet` to retrieve the predicted clusters. | ||
|
|
||
| :arg action: A function to run when the server responds. | ||
|
|
||
| Run :fluid-obj:`SKMeans#*fit` and :fluid-obj:`SKMeans#*predict` in a single pass, i.e. train the model on the incoming :fluid-obj:`DataSet` and then return the learned clustering to the passed :fluid-obj:`LabelSet`. | ||
|
|
||
| :message predictPoint: | ||
|
|
||
| :arg buffer: A |buffer| containing a data point. | ||
|
|
||
| :arg action: A function to run when the server responds, taking the ID of the cluster as its argument. | ||
|
|
||
| Given a trained object, return the cluster ID for a data point in a |buffer|. | ||
|
|
||
| :message transform: | ||
|
|
||
| :arg srcDataSet: A :fluid-obj:`DataSet` containing the data to transform. | ||
|
|
||
| :arg dstDataSet: A :fluid-obj:`DataSet` to contain the new cluster-distance space. | ||
|
|
||
| :arg action: A function to run when the server responds. | ||
|
|
||
| Given a trained object, return for each item of a provided :fluid-obj:`DataSet` its distance to each cluster as an array, often referred to as the cluster-distance space. | ||
|
||
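As a sketch of what the transform computes: each point maps to an array of cosine distances, one per cluster centroid. The snippet below is a hypothetical NumPy illustration assuming the distance is defined as ``1 - cosine similarity``; the exact measure FluidSKMeans reports may differ.

```python
import numpy as np

def transform_points(points, means):
    """Map each point to its cosine distance (1 - cosine similarity)
    to every cluster centroid: the 'cluster-distance space'.
    Illustrative assumption, not FluCoMa's exact distance definition."""
    # Normalise both points and centroids onto the unit hypersphere
    X = points / np.linalg.norm(points, axis=1, keepdims=True)
    M = means / np.linalg.norm(means, axis=1, keepdims=True)
    # Dot products of unit vectors are cosine similarities
    return 1.0 - X @ M.T   # shape: (n_points, n_clusters)
```

Note that the output has one column per cluster, regardless of the dimensionality of the input points: with 4 clusters, every point gains 4 distance dimensions.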
|
|
||
| :message fitTransform: | ||
|
|
||
| :arg srcDataSet: A :fluid-obj:`DataSet` containing the data to fit and transform. | ||
|
|
||
| :arg dstDataSet: A :fluid-obj:`DataSet` to contain the new cluster-distance space. | ||
|
|
||
| :arg action: A function to run when the server responds. | ||
|
|
||
| Run :fluid-obj:`SKMeans#*fit` and :fluid-obj:`SKMeans#*transform` in a single pass, i.e. train the model on the incoming :fluid-obj:`DataSet` and then return its cluster-distance space in the destination :fluid-obj:`DataSet`. | ||
|
||
|
|
||
| :message transformPoint: | ||
|
|
||
| :arg sourceBuffer: A |buffer| containing a data point. | ||
|
|
||
| :arg targetBuffer: A |buffer| to write the distances to all the cluster centroids into. | ||
|
|
||
| :arg action: A function to run when complete. | ||
|
|
||
| Given a trained object, return the distance of the provided point to each cluster centroid. Both input and output are handled as |buffer| objects. | ||
|
||
|
|
||
| :message getMeans: | ||
|
|
||
| :arg dataSet: A :fluid-obj:`DataSet` to retrieve the cluster means into, one point per cluster. | ||
|
|
||
| :arg action: A function to run when complete. | ||
|
|
||
| Given a trained object, retrieve the means (centroids) of each cluster as a :fluid-obj:`DataSet` | ||
|
|
||
| :message setMeans: | ||
|
|
||
| :arg dataSet: A :fluid-obj:`DataSet` containing the cluster means, one point per cluster. | ||
|
|
||
| :arg action: A function to run when complete. | ||
|
|
||
| Overwrite the means (centroids) of each cluster and declare the object trained. | ||
|
|
||
| :message clear: | ||
|
|
||
| :arg action: A function to run when complete. | ||
|
|
||
| Reset the object to an untrained state. | ||
| @@ -0,0 +1,140 @@ | ||
| code:: | ||
|
|
||
| ( | ||
| //Make some clumped 2D points and place into a DataSet | ||
| ~points = (4.collect{ | ||
| 64.collect{(1.sum3rand) + [1,-1].choose}.clump(2) | ||
| }).flatten(1) * 0.5; | ||
| fork{ | ||
| ~dataSet = FluidDataSet(s); | ||
| d = Dictionary.with( | ||
| *[\cols -> 2,\data -> Dictionary.newFrom( | ||
| ~points.collect{|x, i| [i, x]}.flatten)]); | ||
| s.sync; | ||
| ~dataSet.load(d, {~dataSet.print}); | ||
| } | ||
| ) | ||
|
|
||
|
|
||
| // Create an SKMeans instance and a LabelSet for the cluster labels in the server | ||
| ~clusters = FluidLabelSet(s); | ||
| ~skmeans = FluidSKMeans(s); | ||
|
|
||
| // Fit into 4 clusters | ||
| ( | ||
| ~skmeans.fitPredict(~dataSet,~clusters,action: {|c| | ||
| "Fitted.\n # Points in each cluster:".postln; | ||
| c.do{|x,i| | ||
| ("Cluster" + i + "->" + x.asInteger + "points").postln; | ||
| } | ||
| }); | ||
| ) | ||
|
|
||
| // Cols of SKMeans should match DataSet, size is the number of clusters | ||
|
|
||
| ~skmeans.cols; | ||
| ~skmeans.size; | ||
| ~skmeans.dump; | ||
|
|
||
| // Retrieve labels of clustered points by sorting the IDs | ||
| ~clusters.dump{|x|~assignments = x.at("data").atAll(x.at("data").keys.asArray.sort{|a,b|a.asInteger < b.asInteger}).flatten.postln;} | ||
|
|
||
| //Visualise: we're hoping to see colours neatly mapped to quadrants... | ||
| ( | ||
| d = ((~points + 1) * 0.5).flatten(1).unlace; | ||
| w = Window("scatter", Rect(128, 64, 200, 200)); | ||
| ~colours = [Color.blue,Color.red,Color.green,Color.magenta]; | ||
| w.drawFunc = { | ||
| Pen.use { | ||
| d[0].size.do{|i| | ||
| var x = (d[0][i]*200); | ||
| var y = (d[1][i]*200); | ||
| var r = Rect(x,y,5,5); | ||
| Pen.fillColor = ~colours[~assignments[i].asInteger]; | ||
| Pen.fillOval(r); | ||
| } | ||
| } | ||
| }; | ||
| w.refresh; | ||
| w.front; | ||
| ) | ||
|
|
||
| // single-point prediction on an arbitrary value | ||
| ~inbuf = Buffer.loadCollection(s,0.5.dup); | ||
| ~skmeans.predictPoint(~inbuf,{|x|x.postln;}); | ||
| :: | ||
|
|
||
| subsection:: Accessing the means | ||
|
|
||
| We can get and set the mean of each cluster, i.e. its centroid. | ||
|
|
||
| code:: | ||
| // with the dataset and skmeans generated and trained in the code above | ||
| ~centroids = FluidDataSet(s); | ||
| ~skmeans.getMeans(~centroids, {~centroids.print}); | ||
|
|
||
| // We can also set them to arbitrary values to seed the process | ||
| ~centroids.load(Dictionary.newFrom([\cols, 2, \data, Dictionary.newFrom([\0, [0.5,0.5], \1, [-0.5,0.5], \2, [0.5,-0.5], \3, [-0.5,-0.5]])])); | ||
| ~centroids.print | ||
| ~skmeans.setMeans(~centroids, {~skmeans.predict(~dataSet,~clusters,{~clusters.dump{|x|var count = 0.dup(4); x["data"].keysValuesDo{|k,v|count[v[0].asInteger] = count[v[0].asInteger] + 1;};count.postln}})}); | ||
|
|
||
| // We can further fit from the seeded means | ||
| ~skmeans.fit(~dataSet) | ||
| // then retrieve the improved means | ||
| ~skmeans.getMeans(~centroids, {~centroids.print}); | ||
| // subtle in this case, but each centroid stays in the quadrant where we seeded it | ||
| :: | ||
|
|
||
| subsection:: Cluster-distance Space | ||
|
|
||
| We can get the spherical distance of a given point to each cluster. SKMeans differs from KMeans in that it uses the angular (cosine) distance between vectors. The result is often referred to as the cluster-distance space, since it gives each point a new set of dimensions: one distance per cluster. | ||
|
|
||
| code:: | ||
| // with the dataset and skmeans generated and trained in the code above | ||
| b = Buffer.sendCollection(s,[0.5,0.5]) | ||
| c = Buffer(s) | ||
|
|
||
| // get the distance of our given point (b) to each cluster, thus giving us 4 dimensions in our cluster-distance space | ||
| ~skmeans.transformPoint(b,c,{|x|x.query;x.getn(0,x.numFrames,{|y|y.postln})}) | ||
|
|
||
| // we can also transform a full dataset | ||
| ~srcDS = FluidDataSet(s) | ||
| ~cdspace = FluidDataSet(s) | ||
| // make a new dataset with 4 points | ||
| ~srcDS.load(Dictionary.newFrom([\cols, 2, \data, Dictionary.newFrom([\pp, [0.5,0.5], \np, [-0.5,0.5], \pn, [0.5,-0.5], \nn, [-0.5,-0.5]])])); | ||
| ~skmeans.transform(~srcDS, ~cdspace, {~cdspace.print}) | ||
| :: | ||
|
|
||
| subsection:: Queries in a Synth | ||
|
|
||
| This is the equivalent of predictPoint, but wholly on the server. | ||
|
|
||
| code:: | ||
| ( | ||
| { | ||
| var trig = Impulse.kr(5); | ||
| var point = WhiteNoise.kr(1.dup); | ||
| var inputPoint = LocalBuf(2); | ||
| var outputPoint = LocalBuf(1); | ||
| Poll.kr(trig, point, [\pointX,\pointY]); | ||
| point.collect{ |p,i| BufWr.kr([p],inputPoint,i)}; | ||
| ~skmeans.kr(trig,inputPoint,outputPoint); | ||
| Poll.kr(trig,BufRd.kr(1,outputPoint,0,interpolation:0),\cluster); | ||
| }.play; | ||
| ) | ||
|
|
||
| // to sonify the output, here are random values alternating quadrant, generated more quickly as the cursor moves rightwards | ||
| ( | ||
| { | ||
| var trig = Impulse.kr(MouseX.kr(0,1).exprange(0.5,ControlRate.ir / 2)); | ||
| var step = Stepper.kr(trig,max:3); | ||
| var point = TRand.kr(-0.1, [0.1, 0.1], trig) + [step.mod(2).linlin(0,1,-0.6,0.6),step.div(2).linlin(0,1,-0.6,0.6)] ; | ||
| var inputPoint = LocalBuf(2); | ||
| var outputPoint = LocalBuf(1); | ||
| point.collect{|p,i| BufWr.kr([p],inputPoint,i)}; | ||
| ~skmeans.kr(trig,inputPoint,outputPoint); | ||
| SinOsc.ar((BufRd.kr(1,outputPoint,0,interpolation:0) + 69).midicps,mul: 0.1); | ||
| }.play; | ||
| ) | ||
|
|
||
| :: |
--> based on
However, just going straight in with cosine vs Euclidean distance isn't terribly informative, IMO, because it doesn't tell us what the algorithm is trying to do. When we use Euclidean distance, it tries to partition the data into k blobs of equal variance. When we use cosine distance it tries to partition the data into k segments of equal angular separation (I think).
"One common application of spherical k means is to try and learn features directly from input data without supervision (see https://www-cs.stanford.edu/~acoates/papers/coatesng_nntot2012.pdf)"