flucoma · tremblap · Jun 14, 2022 · Jun 4, 2022 · Jun 10, 2022 · Jun 10, 2022
diff --git a/doc/KMeans.rst b/doc/KMeans.rst
@@ -1,8 +1,8 @@
 :digest: Cluster data points with K-Means
 :species: data
 :sc-categories: FluidManipulation
-:sc-related: Classes/FluidDataSet, Classes/FluidLabelSet, Classes/FluidKNNClassifier, Classes/FluidKNNRegressor
-:see-also: KNNClassifier, MLPClassifier, DataSet
+:sc-related: Classes/FluidDataSet, Classes/FluidLabelSet, Classes/FluidKNNClassifier, Classes/FluidKNNRegressor, Classes/FluidSKMeans
+:see-also: SKMeans, KNNClassifier, MLPClassifier, DataSet
 :description: 
 
    Uses the K-means algorithm to learn clusters from a :fluid-obj:`DataSet`.
@@ -63,7 +63,7 @@
 
    :arg action: A function to run when the server responds.
 
-   Given a trained object, return for each item of a provided :fluid-obj:`DataSet` its distance to each cluster as an array, often reffered to as the cluster-distance space.
+   Given a trained object, return for each item of a provided :fluid-obj:`DataSet` its distance to each cluster as an array, often referred to as the cluster-distance space.
 
 :message fitTransform:
 
@@ -87,15 +87,15 @@
 
 :message getMeans:
 
-   :arg dataSet: A :fluid-obj:`DataSet` of clusers with a mean per column.
+   :arg dataSet: A :fluid-obj:`DataSet` of clusters with a mean per column.
 
    :arg action: A function to run when complete.
 
    Given a trained object, retrieve the means (centroids) of each cluster as a :fluid-obj:`DataSet`
 
 :message setMeans:
 
-   :arg dataSet: A :fluid-obj:`DataSet` of clusers with a mean per column.
+   :arg dataSet: A :fluid-obj:`DataSet` of clusters with a mean per column.
 
    :arg action: A function to run when complete.
 

diff --git a/doc/SKMeans.rst b/doc/SKMeans.rst
@@ -0,0 +1,113 @@
+:digest: K-Means with Spherical Distances
+:species: data
+:sc-categories: FluidManipulation
+:sc-related: Classes/FluidDataSet, Classes/FluidLabelSet, Classes/FluidKNNClassifier, Classes/FluidKNNRegressor, Classes/FluidKMeans
+:see-also: KMeans, KNNClassifier, MLPClassifier, DataSet
+:description: 
+
+   Uses the K-means algorithm with cosine similarity to learn clusters and features from a :fluid-obj:`DataSet`.
+
+:discussion:
+
+   :fluid-obj:`SKMeans` is an implementation of KMeans based on cosine distances instead of Euclidean ones, measuring the angles between the normalised vectors.
+   One common application of spherical KMeans is to learn features directly from input data (via a :fluid-obj:`DataSet`) without supervision. For a more technical explanation of the algorithm, see https://machinelearningcatalogue.com/algorithm/alg_spherical-k-means.html and, for its use in feature extraction, https://www-cs.stanford.edu/~acoates/papers/coatesng_nntot2012.pdf.
+
+:control numClusters:
+
+   The number of clusters to partition data into.
+
+:control encodingThreshold:
+
+   The encoding threshold (aka the alpha parameter). When used for feature learning, this can be used to produce sparser output features by setting the least active output dimensions to 0.
+
+:control maxIter:
+
+   The maximum number of iterations the algorithm will use whilst fitting.
+
+:message fit:
+
+   :arg dataSet: A :fluid-obj:`DataSet` of data points.
+
+   :arg action: A function to run when fitting is complete, taking as its argument an array with the number of data points for each cluster.
+
+   Identify ``numClusters`` clusters in a :fluid-obj:`DataSet`. It will optimise until no improvement is possible, or up to ``maxIter``, whichever comes first. Subsequent calls will continue training from the stopping point with the same conditions.
+
+:message predict:
+
+   :arg dataSet: A :fluid-obj:`DataSet` containing the data to predict.
+
+   :arg labelSet: A :fluid-obj:`LabelSet` to retrieve the predicted clusters.
+
+   :arg action: A function to run when the server responds.
+
+   Given a trained object, return the cluster ID for each data point in a :fluid-obj:`DataSet` to a :fluid-obj:`LabelSet`.
+
+:message fitPredict:
+
+   :arg dataSet: A :fluid-obj:`DataSet` containing the data to fit and predict.
+
+   :arg labelSet: A :fluid-obj:`LabelSet` to retrieve the predicted clusters.
+
+   :arg action: A function to run when the server responds.
+
+   Run :fluid-obj:`SKMeans#*fit` and :fluid-obj:`SKMeans#*predict` in a single pass: i.e. train the model on the incoming :fluid-obj:`DataSet` and then return the learned clustering to the passed :fluid-obj:`LabelSet`.
+
+:message predictPoint:
+
+   :arg buffer: A |buffer| containing a data point.
+
+   :arg action: A function to run when the server responds, taking the ID of the cluster as its argument.
+
+   Given a trained object, return the cluster ID for a data point in a |buffer|.
+
+:message encode:
+
+   :arg srcDataSet: A :fluid-obj:`DataSet` containing the data to encode.
+
+   :arg dstDataSet: A :fluid-obj:`DataSet` to contain the new cluster-activation space.
+
+   :arg action: A function to run when the server responds.
+
+   Given a trained object, return for each item of a provided :fluid-obj:`DataSet` its encoded activations to each cluster as an array, often referred to as the cluster-activation space.
+
+:message fitEncode:
+
+   :arg srcDataSet: A :fluid-obj:`DataSet` containing the data to fit and encode.
+
+   :arg dstDataSet: A :fluid-obj:`DataSet` to contain the new cluster-activation space.
+
+   :arg action: A function to run when the server responds.
+
+   Run :fluid-obj:`SKMeans#*fit` and :fluid-obj:`SKMeans#*encode` in a single pass: i.e. train the model on the incoming :fluid-obj:`DataSet` and then return its encoded cluster-activation space in the destination :fluid-obj:`DataSet`.
+
+:message encodePoint:
+
+   :arg sourceBuffer: A |buffer| containing a data point.
+
+   :arg targetBuffer: A |buffer| to write the activations for all the cluster centroids into.
+
+   :arg action: A function to run when complete.
+
+   Given a trained object, return the encoded activation of the provided point to each cluster centroid. Both the input point and the output activations are handled as |buffer|.
+
+:message getMeans:
+
+   :arg dataSet: A :fluid-obj:`DataSet` of clusters with a mean per column.
+
+   :arg action: A function to run when complete.
+
+   Given a trained object, retrieve the means (centroids) of each cluster as a :fluid-obj:`DataSet`.
+
+:message setMeans:
+
+   :arg dataSet: A :fluid-obj:`DataSet` of clusters with a mean per column.
+
+   :arg action: A function to run when complete.
+
+   Overwrite the means (centroids) of each cluster, and declare the object trained.
+
+:message clear:
+
+   :arg action: A function to run when complete.
+
+   Reset the object status to not fitted and untrained.
diff --git a/example-code/sc/SKMeans.scd b/example-code/sc/SKMeans.scd
@@ -0,0 +1,140 @@
+code::
+
+(
+//Make some clumped 2D points and place into a DataSet
+~points = (4.collect{
+		       64.collect{(1.sum3rand) + [1,-1].choose}.clump(2)
+	       }).flatten(1) * 0.5;
+fork{
+    ~dataSet =  FluidDataSet(s);
+    d = Dictionary.with(
+        *[\cols -> 2,\data -> Dictionary.newFrom(
+			~points.collect{|x, i| [i, x]}.flatten)]);
+    s.sync;
+    ~dataSet.load(d, {~dataSet.print});
+}
+)
+
+
+// Create an SKMeans instance and a LabelSet for the cluster labels in the server
+~clusters = FluidLabelSet(s);
+~skmeans = FluidSKMeans(s);
+
+// Fit into 4 clusters
+(
+~skmeans.fitPredict(~dataSet,~clusters,action: {|c|
+		"Fitted.\n # Points in each cluster:".postln;
+		c.do{|x,i|
+			("Cluster" + i + "->" + x.asInteger + "points").postln;
+		}
+	});
+)
+
+// Cols of SKMeans should match DataSet, size is the number of clusters
+
+~skmeans.cols;
+~skmeans.size;
+~skmeans.dump;
+
+// Retrieve labels of clustered points by sorting the IDs
+~clusters.dump{|x|~assignments = x.at("data").atAll(x.at("data").keys.asArray.sort{|a,b|a.asInteger < b.asInteger}).flatten.postln;}
+
+//Visualise: we're hoping to see colours neatly mapped to quadrants...
+(
+d = ((~points + 1) * 0.5).flatten(1).unlace;
+w = Window("scatter", Rect(128, 64, 200, 200));
+~colours = [Color.blue,Color.red,Color.green,Color.magenta];
+w.drawFunc = {
+	Pen.use {
+		d[0].size.do{|i|
+			var x = (d[0][i]*200);
+			var y = (d[1][i]*200);
+			var r = Rect(x,y,5,5);
+			Pen.fillColor = ~colours[~assignments[i].asInteger];
+			Pen.fillOval(r);
+		}
+	}
+};
+w.refresh;
+w.front;
+)
+
+// single point query on arbitrary value
+~inbuf = Buffer.loadCollection(s,0.5.dup);
+~skmeans.predictPoint(~inbuf,{|x|x.postln;});
+::
+
+subsection:: Accessing the means
+
+We can get and set the mean (centroid) of each cluster.
+
+code::
+// with the dataset and skmeans generated and trained in the code above
+~centroids = FluidDataSet(s);
+~skmeans.getMeans(~centroids, {~centroids.print});
+
+// We can also set them to arbitrary values to seed the process
+~centroids.load(Dictionary.newFrom([\cols, 2, \data, Dictionary.newFrom([\0, [0.5,0.5], \1, [-0.5,0.5], \2, [0.5,-0.5], \3, [-0.5,-0.5]])]));
+~centroids.print
+~skmeans.setMeans(~centroids, {~skmeans.predict(~dataSet,~clusters,{~clusters.dump{|x|var count = 0.dup(4); x["data"].keysValuesDo{|k,v|count[v[0].asInteger] = count[v[0].asInteger] + 1;};count.postln}})});
+
+// We can further fit from the seeded means
+~skmeans.fit(~dataSet)
+// then retrieve the improved means
+~skmeans.getMeans(~centroids, {~centroids.print});
+// The change is subtle in this case, but each quadrant stays where we seeded it.
+::
+
+subsection:: Cluster-distance Space
+
+We can get the spherical distance of a given point to each cluster. SKMeans differs from KMeans in that it uses the angular (cosine) distance between vectors rather than the Euclidean one. The result is often referred to as the cluster-distance space, as it creates new dimensions for each given point, one distance per cluster.
+
+code::
+// with the dataset and skmeans generated and trained in the code above
+b = Buffer.sendCollection(s,[0.5,0.5])
+c = Buffer(s)
+
+// get the distance of our given point (b) to each cluster, thus giving us 4 dimensions in our cluster-distance space
+~skmeans.encodePoint(b,c,{|x|x.query;x.getn(0,x.numFrames,{|y|y.postln})})
+
+// we can also encode a full dataset
+~srcDS = FluidDataSet(s)
+~cdspace = FluidDataSet(s)
+// make a new dataset with 4 points
+~srcDS.load(Dictionary.newFrom([\cols, 2, \data, Dictionary.newFrom([\pp, [0.5,0.5], \np, [-0.5,0.5], \pn, [0.5,-0.5], \nn, [-0.5,-0.5]])]));
+~skmeans.encode(~srcDS, ~cdspace, {~cdspace.print})
+::
+
+subsection:: Queries in a Synth
+
+This is the equivalent of predictPoint, but runs wholly on the server.
+
+code::
+(
+{
+    var trig = Impulse.kr(5);
+    var point = WhiteNoise.kr(1.dup);
+    var inputPoint = LocalBuf(2);
+    var outputPoint = LocalBuf(1);
+    Poll.kr(trig, point, [\pointX,\pointY]);
+    point.collect{ |p,i| BufWr.kr([p],inputPoint,i)};
+    ~skmeans.kr(trig,inputPoint,outputPoint);
+    Poll.kr(trig,BufRd.kr(1,outputPoint,0,interpolation:0),\cluster);
+}.play;
+)
+
+// to sonify the output, here are random values alternating quadrant, generated more quickly as the cursor moves rightwards
+(
+{
+	var trig = Impulse.kr(MouseX.kr(0,1).exprange(0.5,ControlRate.ir / 2));
+	var step = Stepper.kr(trig,max:3);
+	var point = TRand.kr(-0.1, [0.1, 0.1], trig) + [step.mod(2).linlin(0,1,-0.6,0.6),step.div(2).linlin(0,1,-0.6,0.6)] ;
+    var inputPoint = LocalBuf(2);
+    var outputPoint = LocalBuf(1);
+	point.collect{|p,i| BufWr.kr([p],inputPoint,i)};
+    ~skmeans.kr(trig,inputPoint,outputPoint);
+    SinOsc.ar((BufRd.kr(1,outputPoint,0,interpolation:0) + 69).midicps,mul: 0.1);
+}.play;
+)
+
+::
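
As a reading aid for the behaviour documented in SKMeans.rst, the cosine-based cluster assignment and the thresholded encoding can be sketched in plain Python. This is a minimal sketch and not the FluCoMa implementation: the function names, the default threshold of 0.25, and the soft-threshold rule `max(0, similarity - threshold)` (taken from the Coates and Ng paper linked in the discussion) are all assumptions made for illustration.

```python
import math

def normalise(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)

def cosine_similarities(point, centroids):
    """Cosine similarity between one point and every centroid."""
    p = normalise(point)
    return [sum(a * b for a, b in zip(p, normalise(c))) for c in centroids]

def predict_point(point, centroids):
    """Index of the centroid with the smallest angle to the point."""
    sims = cosine_similarities(point, centroids)
    return max(range(len(sims)), key=sims.__getitem__)

def encode_point(point, centroids, threshold=0.25):
    """Hypothetical encoding: similarities below the threshold become 0.

    Assumption: soft thresholding as in Coates and Ng, standing in for
    whatever `encodingThreshold` does internally.
    """
    sims = cosine_similarities(point, centroids)
    return [max(0.0, s - threshold) for s in sims]

# The four seed centroids used in the SuperCollider example above.
centroids = [[0.5, 0.5], [-0.5, 0.5], [0.5, -0.5], [-0.5, -0.5]]
cluster = predict_point([0.2, 0.9], centroids)
activations = encode_point([0.5, 0.5], centroids)
```

Because only the angle matters, scaling a point (for example `[0.2, 0.9]` versus `[2, 9]`) does not change its assigned cluster in this sketch, which is the main practical difference from Euclidean KMeans.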