Merged
Changes from 4 commits
4 changes: 2 additions & 2 deletions doc/KMeans.rst
@@ -1,8 +1,8 @@
:digest: Cluster data points with K-Means
:species: data
:sc-categories: FluidManipulation
:sc-related: Classes/FluidDataSet, Classes/FluidLabelSet, Classes/FluidKNNClassifier, Classes/FluidKNNRegressor
:see-also: KNNClassifier, MLPClassifier, DataSet
:sc-related: Classes/FluidDataSet, Classes/FluidLabelSet, Classes/FluidKNNClassifier, Classes/FluidKNNRegressor, Classes/FluidSKMeans
:see-also: SKMeans, KNNClassifier, MLPClassifier, DataSet
:description:

Uses the K-means algorithm to learn clusters from a :fluid-obj:`DataSet`.
108 changes: 108 additions & 0 deletions doc/SKMeans.rst
@@ -0,0 +1,108 @@
:digest: K-Means with Spherical Distances
:species: data
:sc-categories: FluidManipulation
:sc-related: Classes/FluidDataSet, Classes/FluidLabelSet, Classes/FluidKNNClassifier, Classes/FluidKNNRegressor, Classes/FluidKMeans
:see-also: KMeans, KNNClassifier, MLPClassifier, DataSet
:description:

Uses the K-means algorithm with cosine similarity to learn clusters and features from a :fluid-obj:`DataSet`.

:discussion:

:fluid-obj:`SKMeans` is an implementation of :fluid-obj:`KMeans` based of cosine distances instead of Euclidean ones, measuring the angles between the normalised vectors. It is generally used to learn features from a :fluid-obj:`DataSet`. See this reference for a more technical explanation: https://machinelearningcatalogue.com/algorithm/alg_spherical-k-means.html
Review comment (Member):

based of

--> based on

However, just going straight in with cosine vs Euclidean distance isn't terribly informative, IMO, because it doesn't tell us what the algorithm is trying to do. When we use Euclidean distance, it tries to partition the data into k blobs of equal variance. When we use cosine distance it tries to partition the data into k segments of equal angular separation (I think).
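
To make the difference concrete, here is a minimal Python sketch. It is not FluCoMa code: the centroids, the point and the helper names are made up for illustration. It only shows that the same point can be assigned to a different centroid depending on which distance is used.

```python
import math

def euclidean_distance(a, b):
    # Straight-line distance between two points
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_distance(a, b):
    # 1 - cosine of the angle between the vectors; vector length is ignored
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def nearest(point, centroids, distance):
    # Index of the centroid with the smallest distance to the point
    return min(range(len(centroids)), key=lambda i: distance(point, centroids[i]))

# Hypothetical centroids and point, chosen only for illustration
centroids = [[1.0, 0.0], [3.0, 3.0]]
point = [0.9, 0.8]

print(nearest(point, centroids, euclidean_distance))  # 0: closest in position
print(nearest(point, centroids, cosine_distance))     # 1: closest in direction
```

The point sits next to the first centroid but points in almost the same direction as the second, which is the case where the two algorithms disagree.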

It is generally used...

"One common application of spherical k means is to try and learn features directly from input data without supervision (see https://www-cs.stanford.edu/~acoates/papers/coatesng_nntot2012.pdf)"


:control numClusters:

The number of clusters to classify data into.
Review comment (Member):

'classify' is confusing with the supervised case, perhaps. 'partition'?


:control maxIter:

The maximum number of iterations the algorithm will use whilst fitting.

:message fit:

:arg dataSet: A :fluid-obj:`DataSet` of data points.

:arg action: A function to run when fitting is complete, taking as its argument an array with the number of data points for each cluster.

Identify ``numClusters`` clusters in a :fluid-obj:`DataSet`. It will optimise until no improvement is possible, or for up to ``maxIter`` iterations, whichever comes first. Subsequent calls will continue training from the stopping point with the same conditions.

:message predict:

:arg dataSet: A :fluid-obj:`DataSet` containing the data to predict.

:arg labelSet: A :fluid-obj:`LabelSet` to retrieve the predicted clusters.

:arg action: A function to run when the server responds.

Given a trained object, return the cluster ID for each data point in a :fluid-obj:`DataSet` to a :fluid-obj:`LabelSet`.

:message fitPredict:

:arg dataSet: A :fluid-obj:`DataSet` containing the data to fit and predict.

:arg labelSet: A :fluid-obj:`LabelSet` to retrieve the predicted clusters.

   :arg action: A function to run when the server responds.

   Run :fluid-obj:`SKMeans#*fit` and :fluid-obj:`SKMeans#*predict` in a single pass: i.e. train the model on the incoming :fluid-obj:`DataSet` and then return the learned clustering to the passed :fluid-obj:`LabelSet`.

:message predictPoint:

:arg buffer: A |buffer| containing a data point.

:arg action: A function to run when the server responds, taking the ID of the cluster as its argument.

   Given a trained object, return the cluster ID for a data point in a |buffer|.

:message transform:

:arg srcDataSet: A :fluid-obj:`DataSet` containing the data to transform.

:arg dstDataSet: A :fluid-obj:`DataSet` to contain the new cluster-distance space.

:arg action: A function to run when the server responds.

   Given a trained object, return for each item of a provided :fluid-obj:`DataSet` its distance to each cluster as an array, often referred to as the cluster-distance space.
Review comment (Member):

For this object, in distinction to plain k-Means, the output vector is further 'encoded' in the way described in the Coates and Ng paper above. This depends on the parameter alpha, which seems to be missing from this doc.
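
For reference, a minimal Python sketch of that kind of encoding. It assumes the soft-threshold encoder from the Coates and Ng paper, max(0, d·x - alpha), applied to unit-length vectors. The function names, the normalisation step and the alpha value are illustrative assumptions and are not taken from the SKMeans implementation.

```python
import math

def normalise(v):
    # Scale a vector to unit length
    length = math.sqrt(sum(x * x for x in v))
    return [x / length for x in v]

def encode(point, centroids, alpha):
    # One output value per centroid: the cosine similarity to that centroid,
    # reduced by alpha and clipped at zero (soft threshold)
    unit_point = normalise(point)
    encoded = []
    for centroid in centroids:
        similarity = sum(p * c for p, c in zip(unit_point, normalise(centroid)))
        encoded.append(max(0.0, similarity - alpha))
    return encoded

# Hypothetical centroids, point and alpha, chosen only for illustration
centroids = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
print(encode([2.0, 0.0], centroids, alpha=0.25))  # [0.75, 0.0, 0.0]
```

With an encoder like this, centroids that are not similar enough to the point contribute exactly zero, so the output is a sparse feature vector rather than a list of raw distances.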

Review comment (Member):

Also, don't know whether 'cluster-distance space' is adding much value


:message fitTransform:

:arg srcDataSet: A :fluid-obj:`DataSet` containing the data to fit and transform.

:arg dstDataSet: A :fluid-obj:`DataSet` to contain the new cluster-distance space.

   :arg action: A function to run when the server responds.

Run :fluid-obj:`KMeans#*fit` and :fluid-obj:`KMeans#*transform` in a single pass: i.e. train the model on the incoming :fluid-obj:`DataSet` and then return its cluster-distance space in the destination :fluid-obj:`DataSet`
Review comment (Member):

KMeans -> SKMeans


:message transformPoint:

:arg sourceBuffer: A |buffer| containing a data point.

   :arg targetBuffer: A |buffer| in which to write the distances to all the cluster centroids.

:arg action: A function to run when complete.

   Given a trained object, return the distance of the provided point to each cluster centroid. Both points are handled as |buffer|.
Review comment (Member):

Again, we're encoding here, not just returning straight up distances


:message getMeans:

   :arg dataSet: A :fluid-obj:`DataSet` of clusters with a mean per column.

   :arg action: A function to run when complete.

   Given a trained object, retrieve the means (centroids) of each cluster as a :fluid-obj:`DataSet`.

:message setMeans:

   :arg dataSet: A :fluid-obj:`DataSet` of clusters with a mean per column.

   :arg action: A function to run when complete.

   Overwrite the means (centroids) of each cluster, and declare the object trained.

:message clear:

:arg action: A function to run when complete.

Reset the object status to not fitted and untrained.
140 changes: 140 additions & 0 deletions example-code/sc/SKMeans.scd
@@ -0,0 +1,140 @@
code::

(
//Make some clumped 2D points and place into a DataSet
~points = (4.collect{
    64.collect{(1.sum3rand) + [1,-1].choose}.clump(2)
}).flatten(1) * 0.5;
fork{
    ~dataSet = FluidDataSet(s);
    d = Dictionary.with(
        *[\cols -> 2,\data -> Dictionary.newFrom(
            ~points.collect{|x, i| [i, x]}.flatten)]);
    s.sync;
    ~dataSet.load(d, {~dataSet.print});
}
)


// Create an SKMeans instance and a LabelSet for the cluster labels in the server
~clusters = FluidLabelSet(s);
~skmeans = FluidSKMeans(s);

// Fit into 4 clusters
(
~skmeans.fitPredict(~dataSet,~clusters,action: {|c|
    "Fitted.\n # Points in each cluster:".postln;
    c.do{|x,i|
        ("Cluster" + i + "->" + x.asInteger + "points").postln;
    }
});
)

// Cols of SKMeans should match DataSet, size is the number of clusters

~skmeans.cols;
~skmeans.size;
~skmeans.dump;

// Retrieve labels of clustered points by sorting the IDs
~clusters.dump{|x|~assignments = x.at("data").atAll(x.at("data").keys.asArray.sort{|a,b|a.asInteger < b.asInteger}).flatten.postln;}

//Visualise: we're hoping to see colours neatly mapped to quadrants...
(
d = ((~points + 1) * 0.5).flatten(1).unlace;
w = Window("scatter", Rect(128, 64, 200, 200));
~colours = [Color.blue,Color.red,Color.green,Color.magenta];
w.drawFunc = {
    Pen.use {
        d[0].size.do{|i|
            var x = (d[0][i]*200);
            var y = (d[1][i]*200);
            var r = Rect(x,y,5,5);
            Pen.fillColor = ~colours[~assignments[i].asInteger];
            Pen.fillOval(r);
        }
    }
};
w.refresh;
w.front;
)

// single point prediction on an arbitrary value
~inbuf = Buffer.loadCollection(s,0.5.dup);
~skmeans.predictPoint(~inbuf,{|x|x.postln;});
::

subsection:: Accessing the means

We can get and set the mean (centroid) of each cluster.

code::
// with the dataset and skmeans generated and trained in the code above
~centroids = FluidDataSet(s);
~skmeans.getMeans(~centroids, {~centroids.print});

// We can also set them to arbitrary values to seed the process
~centroids.load(Dictionary.newFrom([\cols, 2, \data, Dictionary.newFrom([\0, [0.5,0.5], \1, [-0.5,0.5], \2, [0.5,-0.5], \3, [-0.5,-0.5]])]));
~centroids.print
~skmeans.setMeans(~centroids, {~skmeans.predict(~dataSet,~clusters,{~clusters.dump{|x|var count = 0.dup(4); x["data"].keysValuesDo{|k,v|count[v[0].asInteger] = count[v[0].asInteger] + 1;};count.postln}})});

// We can further fit from the seeded means
~skmeans.fit(~dataSet)
// then retrieve the improved means
~skmeans.getMeans(~centroids, {~centroids.print});
//subtle in this case but still.. each quadrant is where we seeded it.
::

subsection:: Cluster-distance Space

We can get the spherical distance of a given point to each cluster. SKMeans differs from KMeans in that it uses the angular (cosine) distance between vectors. The result is often referred to as the cluster-distance space, as it creates new dimensions for each given point: one distance per cluster.

code::
// with the dataset and skmeans generated and trained in the code above
b = Buffer.sendCollection(s,[0.5,0.5])
c = Buffer(s)

// get the distance of our given point (b) to each cluster, thus giving us 4 dimensions in our cluster-distance space
~skmeans.transformPoint(b,c,{|x|x.query;x.getn(0,x.numFrames,{|y|y.postln})})

// we can also transform a full dataset
~srcDS = FluidDataSet(s)
~cdspace = FluidDataSet(s)
// make a new dataset with 4 points
~srcDS.load(Dictionary.newFrom([\cols, 2, \data, Dictionary.newFrom([\pp, [0.5,0.5], \np, [-0.5,0.5], \pn, [0.5,-0.5], \nn, [-0.5,-0.5]])]));
~skmeans.transform(~srcDS, ~cdspace, {~cdspace.print})
::

subsection:: Queries in a Synth

This is the equivalent of predictPoint, but wholly on the server.

code::
(
{
    var trig = Impulse.kr(5);
    var point = WhiteNoise.kr(1.dup);
    var inputPoint = LocalBuf(2);
    var outputPoint = LocalBuf(1);
    Poll.kr(trig, point, [\pointX,\pointY]);
    point.collect{ |p,i| BufWr.kr([p],inputPoint,i)};
    ~skmeans.kr(trig,inputPoint,outputPoint);
    Poll.kr(trig,BufRd.kr(1,outputPoint,0,interpolation:0),\cluster);
}.play;
)

// to sonify the output, here are random values alternating quadrant, generated more quickly as the cursor moves rightwards
(
{
    var trig = Impulse.kr(MouseX.kr(0,1).exprange(0.5,ControlRate.ir / 2));
    var step = Stepper.kr(trig,max:3);
    var point = TRand.kr(-0.1, [0.1, 0.1], trig) + [step.mod(2).linlin(0,1,-0.6,0.6),step.div(2).linlin(0,1,-0.6,0.6)] ;
    var inputPoint = LocalBuf(2);
    var outputPoint = LocalBuf(1);
    point.collect{|p,i| BufWr.kr([p],inputPoint,i)};
    ~skmeans.kr(trig,inputPoint,outputPoint);
    SinOsc.ar((BufRd.kr(1,outputPoint,0,interpolation:0) + 69).midicps,mul: 0.1);
}.play;
)

::