## Clustering

- Canopy Clustering -
**single machine / MapReduce**(deprecated, will be removed once Streaming k-Means is stable enough) - k-Means Clustering -
**single machine / MapReduce** - Fuzzy k-Means -
**single machine / MapReduce** - Streaming k-Means -
**single machine / MapReduce**

**Canopy Clustering:**

Canopy Clustering is a very simple, fast and surprisingly accurate method for grouping objects into clusters. All objects are represented as a point in a multidimensional feature space. The algorithm uses a fast approximate distance metric and two distance thresholds T1 > T2 for processing. The basic algorithm is to begin with a set of points and remove one at random. Create a Canopy containing this point and iterate through the remainder of the point set. At each point, if its distance from the first point is < T1, then add the point to the cluster. If, in addition, the distance is < T2, then remove the point from the set. This way points that are very close to the original will avoid all further processing. The algorithm loops until the initial set is empty, accumulating a set of Canopies, each containing one or more points. A given point may occur in more than one Canopy.

Canopy Clustering is often used as an initial step in more rigorous clustering techniques, such as K-Means Clustering . By starting with an initial clustering the number of more expensive distance measurements can be significantly reduced by ignoring points outside of the initial canopies.

WARNING: Canopy is deprecated in the latest release and will be removed once streaming k-means becomes stable enough

**K-Means Clustering:**

k-Means is a simple but well-known algorithm for grouping objects, clustering. All objects need to be represented as a set of numerical features. In addition, the user has to specify the number of groups (referred to as k) she wishes to identify.

Each object can be thought of as being represented by some feature vector in an n dimensional space, n being the number of all features used to describe the objects to cluster. The algorithm then randomly chooses k points in that vector space, these point serve as the initial centers of the clusters. Afterwards all objects are each assigned to the center they are closest to. Usually the distance measure is chosen by the user and determined by the learning task.

After that, for each cluster a new center is computed by averaging the feature vectors of all objects assigned to it. The process of assigning objects and recomputing centers is repeated until the process converges. The algorithm can be proven to converge after a finite number of iterations.

Several tweaks concerning distance measure, initial center choice and computation of new average centers have been explored, as well as the estimation of the number of clusters k. Yet the main principle always remains the same.

**Fuzzy k-Means: **

Fuzzy K-Means (also called Fuzzy C-Means) is an extension of K-Means , the popular simple clustering technique. While K-Means discovers hard clusters (a point belong to only one cluster), Fuzzy K-Means is a more statistically formalized method and discovers soft clusters where a particular point can belong to more than one cluster with certain probability.

**StreamingKMeans**

The StreamingKMeans algorithm is a variant of Algorithm 1 from Shindler et al and consists of two steps:

Streaming step

BallKMeans step.

The streaming step is a randomized algorithm that makes one pass through the data and produces as many centroids as it determines is optimal. This step can be viewed as a preparatory dimensionality reduction. If the size of the data stream is n and the expected number of clusters is k, the streaming step will produce roughly k*log(n) clusters that will be passed on to the BallKMeans step which will further reduce the number of clusters down to k.