diff --git a/doc/KMeans.rst b/doc/KMeans.rst
index 32cef0d..eab47d5 100644
--- a/doc/KMeans.rst
+++ b/doc/KMeans.rst
@@ -1,8 +1,8 @@
 :digest: Cluster data points with K-Means
 :species: data
 :sc-categories: FluidManipulation
-:sc-related: Classes/FluidDataSet, Classes/FluidLabelSet, Classes/FluidKNNClassifier, Classes/FluidKNNRegressor
-:see-also: KNNClassifier, MLPClassifier, DataSet
+:sc-related: Classes/FluidDataSet, Classes/FluidLabelSet, Classes/FluidKNNClassifier, Classes/FluidKNNRegressor, Classes/FluidSKMeans
+:see-also: SKMeans, KNNClassifier, MLPClassifier, DataSet
 :description:
 
    Uses the K-means algorithm to learn clusters from a :fluid-obj:`DataSet`.
@@ -63,7 +63,7 @@
 
    :arg action: A function to run when the server responds.
 
-   Given a trained object, return for each item of a provided :fluid-obj:`DataSet` its distance to each cluster as an array, often reffered to as the cluster-distance space.
+   Given a trained object, return for each item of a provided :fluid-obj:`DataSet` its distance to each cluster as an array, often referred to as the cluster-distance space.
 
 :message fitTransform:
 
@@ -87,7 +87,7 @@
 
 :message getMeans:
 
-   :arg dataSet: A :fluid-obj:`DataSet` of clusers with a mean per column.
+   :arg dataSet: A :fluid-obj:`DataSet` of clusters with a mean per column.
 
    :arg action: A function to run when complete.
 
@@ -95,7 +95,7 @@
 
 :message setMeans:
 
-   :arg dataSet: A :fluid-obj:`DataSet` of clusers with a mean per column.
+   :arg dataSet: A :fluid-obj:`DataSet` of clusters with a mean per column.
 
    :arg action: A function to run when complete.
 
diff --git a/doc/SKMeans.rst b/doc/SKMeans.rst
new file mode 100644
index 0000000..949764a
--- /dev/null
+++ b/doc/SKMeans.rst
@@ -0,0 +1,113 @@
+:digest: K-Means with Spherical Distances
+:species: data
+:sc-categories: FluidManipulation
+:sc-related: Classes/FluidDataSet, Classes/FluidLabelSet, Classes/FluidKNNClassifier, Classes/FluidKNNRegressor, Classes/FluidKMeans
+:see-also: KMeans, KNNClassifier, MLPClassifier, DataSet
+:description:
+
+   Uses the K-means algorithm with cosine similarity to learn clusters and features from a :fluid-obj:`DataSet`.
+
+:discussion:
+
+   :fluid-obj:`SKMeans` is an implementation of KMeans based on cosine distances instead of Euclidean ones, measuring the angles between the normalised vectors.
+   One common application of spherical KMeans is to learn features directly from input data (via a :fluid-obj:`DataSet`) without supervision. For a more technical explanation, see https://machinelearningcatalogue.com/algorithm/alg_spherical-k-means.html and, for feature extraction, https://www-cs.stanford.edu/~acoates/papers/coatesng_nntot2012.pdf
+
+:control numClusters:
+
+   The number of clusters to partition data into.
+
+:control encodingThreshold:
+
+   The encoding threshold (aka the alpha parameter). When used for feature learning, this can be used to produce sparser output features by setting the least active output dimensions to 0.
+
+:control maxIter:
+
+   The maximum number of iterations the algorithm will use whilst fitting.
+
+:message fit:
+
+   :arg dataSet: A :fluid-obj:`DataSet` of data points.
+
+   :arg action: A function to run when fitting is complete, taking as its argument an array with the number of data points for each cluster.
+
+   Identify ``numClusters`` clusters in a :fluid-obj:`DataSet`. It will optimise until no improvement is possible, or up to ``maxIter``, whichever comes first. Subsequent calls will continue training from the stopping point with the same conditions.
+
+:message predict:
+
+   :arg dataSet: A :fluid-obj:`DataSet` containing the data to predict.
+
+   :arg labelSet: A :fluid-obj:`LabelSet` to retrieve the predicted clusters.
+
+   :arg action: A function to run when the server responds.
+
+   Given a trained object, return the cluster ID for each data point in a :fluid-obj:`DataSet` to a :fluid-obj:`LabelSet`.
+
+:message fitPredict:
+
+   :arg dataSet: A :fluid-obj:`DataSet` containing the data to fit and predict.
+
+   :arg labelSet: A :fluid-obj:`LabelSet` to retrieve the predicted clusters.
+
+   :arg action: A function to run when the server responds.
+
+   Run :fluid-obj:`SKMeans#*fit` and :fluid-obj:`SKMeans#*predict` in a single pass: i.e. train the model on the incoming :fluid-obj:`DataSet` and then return the learned clustering to the passed :fluid-obj:`LabelSet`.
+
+:message predictPoint:
+
+   :arg buffer: A |buffer| containing a data point.
+
+   :arg action: A function to run when the server responds, taking the ID of the cluster as its argument.
+
+   Given a trained object, return the cluster ID for a data point in a |buffer|.
+
+:message encode:
+
+   :arg srcDataSet: A :fluid-obj:`DataSet` containing the data to encode.
+
+   :arg dstDataSet: A :fluid-obj:`DataSet` to contain the new cluster-activation space.
+
+   :arg action: A function to run when the server responds.
+
+   Given a trained object, return for each item of a provided :fluid-obj:`DataSet` its encoded activations to each cluster as an array, often referred to as the cluster-activation space.
+
+:message fitEncode:
+
+   :arg srcDataSet: A :fluid-obj:`DataSet` containing the data to fit and encode.
+
+   :arg dstDataSet: A :fluid-obj:`DataSet` to contain the new cluster-activation space.
+
+   :arg action: A function to run when the server responds.
+
+   Run :fluid-obj:`SKMeans#*fit` and :fluid-obj:`SKMeans#*encode` in a single pass: i.e. train the model on the incoming :fluid-obj:`DataSet` and then return its encoded cluster-activation space in the destination :fluid-obj:`DataSet`.
+
+:message encodePoint:
+
+   :arg sourceBuffer: A |buffer| containing a data point.
+
+   :arg targetBuffer: A |buffer| to write the activations for all the cluster centroids into.
+
+   :arg action: A function to run when complete.
+
+   Given a trained object, return the encoded activation of the provided point to each cluster centroid. Both points are handled as |buffer|.
+
+:message getMeans:
+
+   :arg dataSet: A :fluid-obj:`DataSet` of clusters with a mean per column.
+
+   :arg action: A function to run when complete.
+
+   Given a trained object, retrieve the means (centroids) of each cluster as a :fluid-obj:`DataSet`.
+
+:message setMeans:
+
+   :arg dataSet: A :fluid-obj:`DataSet` of clusters with a mean per column.
+
+   :arg action: A function to run when complete.
+
+   Overwrite the means (centroids) of each cluster, and declare the object trained.
+
+:message clear:
+
+   :arg action: A function to run when complete.
+
+   Reset the object status to not fitted and untrained.
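Note on ``encodingThreshold``: the plain sclang sketch below shows one way the thresholded encoding documented in doc/SKMeans.rst could behave. It assumes the soft-threshold scheme from the Coates and Ng paper linked in the discussion, where each activation is max(0, cosine similarity - alpha). The diff does not confirm that FluidSKMeans computes its encoding exactly this way. ``~unit``, ``~encode``, the default alpha and the centroid values are hypothetical, chosen only for illustration.

```supercollider
(
// Illustrative only: plain sclang, no server or FluCoMa objects involved.
// ~unit and ~encode are hypothetical helpers, not part of the FluidSKMeans API.

// Scale a vector to unit length (L2 norm).
~unit = {|v| v / v.squared.sum.sqrt };

// Assumed encoding: cosine similarity to each centroid, minus alpha, floored at 0.
~encode = {|point, means, alpha = 0.25|
    var p = ~unit.(point);
    means.collect{|m| ((~unit.(m) * p).sum - alpha).max(0) };
};

// Made-up centroids, one per quadrant.
~means = [[0.5, 0.5], [-0.5, 0.5], [0.5, -0.5], [-0.5, -0.5]];

// A point near the first centroid: only the first output dimension stays above 0.
~encode.([0.4, 0.6], ~means).postln;
)
```

Under this assumption, raising alpha zeroes more of the weaker activations, which matches the "sparser output features" wording in the control description.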
diff --git a/example-code/sc/SKMeans.scd b/example-code/sc/SKMeans.scd
new file mode 100644
index 0000000..777dd08
--- /dev/null
+++ b/example-code/sc/SKMeans.scd
@@ -0,0 +1,140 @@
+code::
+
+(
+//Make some clumped 2D points and place into a DataSet
+~points = (4.collect{
+        64.collect{(1.sum3rand) + [1,-1].choose}.clump(2)
+    }).flatten(1) * 0.5;
+fork{
+    ~dataSet = FluidDataSet(s);
+    d = Dictionary.with(
+        *[\cols -> 2,\data -> Dictionary.newFrom(
+            ~points.collect{|x, i| [i, x]}.flatten)]);
+    s.sync;
+    ~dataSet.load(d, {~dataSet.print});
+}
+)
+
+
+// Create an SKMeans instance and a LabelSet for the cluster labels in the server
+~clusters = FluidLabelSet(s);
+~skmeans = FluidSKMeans(s);
+
+// Fit into 4 clusters
+(
+~skmeans.fitPredict(~dataSet,~clusters,action: {|c|
+    "Fitted.\n # Points in each cluster:".postln;
+    c.do{|x,i|
+        ("Cluster" + i + "->" + x.asInteger + "points").postln;
+    }
+});
+)
+
+// Cols of SKMeans should match DataSet, size is the number of clusters
+
+~skmeans.cols;
+~skmeans.size;
+~skmeans.dump;
+
+// Retrieve labels of clustered points by sorting the IDs
+~clusters.dump{|x|~assignments = x.at("data").atAll(x.at("data").keys.asArray.sort{|a,b|a.asInteger < b.asInteger}).flatten.postln;}
+
+//Visualise: we're hoping to see colours neatly mapped to quadrants...
+(
+d = ((~points + 1) * 0.5).flatten(1).unlace;
+w = Window("scatter", Rect(128, 64, 200, 200));
+~colours = [Color.blue,Color.red,Color.green,Color.magenta];
+w.drawFunc = {
+    Pen.use {
+        d[0].size.do{|i|
+            var x = (d[0][i]*200);
+            var y = (d[1][i]*200);
+            var r = Rect(x,y,5,5);
+            Pen.fillColor = ~colours[~assignments[i].asInteger];
+            Pen.fillOval(r);
+        }
+    }
+};
+w.refresh;
+w.front;
+)
+
+// single point query on arbitrary value
+~inbuf = Buffer.loadCollection(s,0.5.dup);
+~skmeans.predictPoint(~inbuf,{|x|x.postln;});
+::
+
+subsection:: Accessing the means
+
+We can get and set the mean (centroid) of each cluster.
+
+code::
+// with the dataset and skmeans generated and trained in the code above
+~centroids = FluidDataSet(s);
+~skmeans.getMeans(~centroids, {~centroids.print});
+
+// We can also set them to arbitrary values to seed the process
+~centroids.load(Dictionary.newFrom([\cols, 2, \data, Dictionary.newFrom([\0, [0.5,0.5], \1, [-0.5,0.5], \2, [0.5,-0.5], \3, [-0.5,-0.5]])]));
+~centroids.print
+~skmeans.setMeans(~centroids, {~skmeans.predict(~dataSet,~clusters,{~clusters.dump{|x|var count = 0.dup(4); x["data"].keysValuesDo{|k,v|count[v[0].asInteger] = count[v[0].asInteger] + 1;};count.postln}})});
+
+// We can further fit from the seeded means
+~skmeans.fit(~dataSet)
+// then retrieve the improved means
+~skmeans.getMeans(~centroids, {~centroids.print});
+//subtle in this case but still.. each quadrant is where we seeded it.
+::
+
+subsection:: Cluster-distance Space
+
+We can get the spherical distance of a given point to each cluster. SKMeans differs from KMeans in that it uses the angular (cosine) distance between vectors. This is often referred to as the cluster-distance space as it creates new dimensions for each given point, one distance per cluster.
+
+code::
+// with the dataset and skmeans generated and trained in the code above
+b = Buffer.sendCollection(s,[0.5,0.5])
+c = Buffer(s)
+
+// get the distance of our given point (b) to each cluster, thus giving us 4 dimensions in our cluster-distance space
+~skmeans.encodePoint(b,c,{|x|x.query;x.getn(0,x.numFrames,{|y|y.postln})})
+
+// we can also encode a full dataset
+~srcDS = FluidDataSet(s)
+~cdspace = FluidDataSet(s)
+// make a new dataset with 4 points
+~srcDS.load(Dictionary.newFrom([\cols, 2, \data, Dictionary.newFrom([\pp, [0.5,0.5], \np, [-0.5,0.5], \pn, [0.5,-0.5], \nn, [-0.5,-0.5]])]));
+~skmeans.encode(~srcDS, ~cdspace, {~cdspace.print})
+::
+
+subsection:: Queries in a Synth
+
+This is the equivalent of predictPoint, but wholly on the server
+
+code::
+(
+{
+    var trig = Impulse.kr(5);
+    var point = WhiteNoise.kr(1.dup);
+    var inputPoint = LocalBuf(2);
+    var outputPoint = LocalBuf(1);
+    Poll.kr(trig, point, [\pointX,\pointY]);
+    point.collect{ |p,i| BufWr.kr([p],inputPoint,i)};
+    ~skmeans.kr(trig,inputPoint,outputPoint);
+    Poll.kr(trig,BufRd.kr(1,outputPoint,0,interpolation:0),\cluster);
+}.play;
+)
+
+// to sonify the output, here are random values alternating quadrant, generated more quickly as the cursor moves rightwards
+(
+{
+    var trig = Impulse.kr(MouseX.kr(0,1).exprange(0.5,ControlRate.ir / 2));
+    var step = Stepper.kr(trig,max:3);
+    var point = TRand.kr(-0.1, [0.1, 0.1], trig) + [step.mod(2).linlin(0,1,-0.6,0.6),step.div(2).linlin(0,1,-0.6,0.6)] ;
+    var inputPoint = LocalBuf(2);
+    var outputPoint = LocalBuf(1);
+    point.collect{|p,i| BufWr.kr([p],inputPoint,i)};
+    ~skmeans.kr(trig,inputPoint,outputPoint);
+    SinOsc.ar((BufRd.kr(1,outputPoint,0,interpolation:0) + 69).midicps,mul: 0.1);
+}.play;
+)
+
+::
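Note on the "angular distance (cosine)" wording in the help file: the difference from Euclidean assignment can be seen without any FluCoMa objects. The sketch below is plain sclang run in the interpreter. The two centroids, the query point and the ``~unit`` helper are made up for illustration, and the code says nothing about how FluidSKMeans is implemented internally.

```supercollider
(
// Illustrative only: hypothetical data and helper, no server needed.

// Scale a vector to unit length (L2 norm).
~unit = {|v| v / v.squared.sum.sqrt };

// Two made-up centroids and one query point.
~means = [[1, 0.2], [3, 3]];
~point = [0.6, 0.5];

// Euclidean rule: index of the centroid at the smallest straight-line distance.
~euclid = ~means.collect{|m| (m - ~point).squared.sum.sqrt }.minIndex;

// Cosine rule: index of the centroid whose direction is most similar,
// i.e. the largest dot product between unit vectors.
~cosine = ~means.collect{|m| (~unit.(m) * ~unit.(~point)).sum }.maxIndex;

[~euclid, ~cosine].postln;
)
```

Here the Euclidean rule picks centroid 0, which is closest in position, while the cosine rule picks centroid 1, which points in nearly the same direction as the query even though it is much further from the origin. Scaling the query point changes the first result but not the second.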