K-means clustering

One of the most popular clustering techniques.

Given an integer K, K-means partitions the data set into K non overlapping clusters. It does so by positioning K "centroïds", or "prototypes" in densely populated regions of the data space. Each observation is then assigned to the closest centroid ("Minimum distance rule"). A cluster therefore contains all observations that are all closer to a given centroid than to any of the other centroids (lower image of the illustration).

 

 

The centroids are positioned by an iterative procedure that progressively moves them to their final, stable position.

-----

K-means is very popular because :

    * It is conceptually simple.

    * It is computationally fast and memory efficient.

But is also suffers from several drawbacks :

    * The user has to choose the value of K, the number of clusters (whereas Hierarchical Clustering allows some flexibility in the determination of the number of clusters). Although  for 2D data this choice can easily be made by visual inspection, it is not so for higher dimension data, and there are usually no clues as to what number of clusters migh be appropriate. Choosing an inappropriate number of clusters will lead to a meaningless typology.

    * For a given K, clusters will usually depend heavily on the initial configuration of the set of centroids, thus making interpretation of the clusters rather uncertain.

    * K-means is an objective technique, meaning that it minimizes a quantitative criterion. It is therefore an optimization technique. As is usually the case in optimization, K-means will stop when it cannot make the value of the criterion decrease anymore, but it is quite possible that there exist other configurations of the centroids that would yield even lower values of the criterion. In the vocabulary of optimization, K-means reaches a local minimum, but cannot guarantee to reach the global minimum (lowest possible value) of the criterion.

 _______________________________________________________

 

 

Tutorial 1

 

In this Tutorial, we go over the fundamentals of K-means. We will see that despite its heuristic flavor, K-means is an objective technique, in that it minimizes a quantitative criterion.

We also address the less well known incremental K-means, that builds clusters one observation at a time.

 

 

BASICS OF K-MEANS

Basic K-means

Data type

Data standardization

The K-means algorithm

Classes and Voronoi tesselation

The sum-of-squares criterion

For a single cluster

For the complete data set

"Within-class" and "Between-class" dispersions

Limitations of the sum-of-squares criterion

Quadratic nature

Dependance on the number of clusters

Convergence of K-means

Incremental K-means

Why is incremental K-means useful ?

The algorithm

Order of presentation of the observations

Convergence of incremental K-means

TUTORIAL

 _________________________________________________

 

Tutorial 2

 

In this second Tutorial, we address the important question of the initialization of K-means (number of clusters and initial positions of the centroids). There is no theory to settle the issue, and the analyst has to resort to a series of heuristics that will most likely provide satisfactory (if not optimal) answers after some trial and errors.

Clusters are easily obtained, but because so many different solutions can be reached depending on the initialization, it is mandatory to validate the final result. Clusters should be, as much as possible, compact, well separated, and interpretable, possibly with the help of some additional variables. Many interpretation aids are available, and we here describe the most popular of these aids.

 

 

INITIALIZATION OF K-MEANS

INTERPRETATION AIDS

 Initialization of K-means

Initializing the centroids

Random initialization

Multiple initilizations

Dead centroids

Computation time

Expert knowledge

"Uniform" distribution of initial prototypes

Maximizing initial distances

MaxMin initialization

"Anomalous pattern" initialisation

Initialization by regular intervals

Choosing the number of clusters

Trying several values of K

Preliminary hierarchical clustering

Interpretation aids

Within cluster

Cluster populations

Distances to centroid

Scatterplots

PCA on cluster

Between clusters

Pairwise distances

F statistic

Scatterplots

Extra variables

Nominal variables

Continuous variables

TUTORIAL

 

____________________________________

 

Related readings :

Clustering

Hierarchical clustering

Download this Glossary

 

Want to contribute to this site ?