K-means clustering
One of the most popular clustering techniques.
Given an integer K, K-means partitions the data set into K non overlapping clusters. It does so by positioning K "centroïds", or "prototypes" in densely populated regions of the data space. Each observation is then assigned to the closest centroid ("Minimum distance rule"). A cluster therefore contains all observations that are all closer to a given centroid than to any of the other centroids (lower image of the illustration).
The centroids are positioned by an iterative procedure that progressively moves them to their final, stable position.
-----
K-means is very popular because :
* It is conceptually simple.
* It is computationally fast and memory efficient.
But is also suffers from several drawbacks :
* The user has to choose the value of K, the number of clusters (whereas Hierarchical Clustering allows some flexibility in the determination of the number of clusters). Although for 2D data this choice can easily be made by visual inspection, it is not so for higher dimension data, and there are usually no clues as to what number of clusters migh be appropriate. Choosing an inappropriate number of clusters will lead to a meaningless typology.
* For a given K, clusters will usually depend heavily on the initial configuration of the set of centroids, thus making interpretation of the clusters rather uncertain.
* K-means is an objective technique, meaning that it minimizes a quantitative criterion. It is therefore an optimization technique. As is usually the case in optimization, K-means will stop when it cannot make the value of the criterion decrease anymore, but it is quite possible that there exist other configurations of the centroids that would yield even lower values of the criterion. In the vocabulary of optimization, K-means reaches a local minimum, but cannot guarantee to reach the global minimum (lowest possible value) of the criterion.
_______________________________________________________
|
Tutorial 1 |
In this Tutorial, we go over the fundamentals of K-means. We will see that despite its heuristic flavor, K-means is an objective technique, in that it minimizes a quantitative criterion.
We also address the less well known incremental K-means, that builds clusters one observation at a time.
BASICS OF K-MEANS
|
Basic K-means Data type Data standardization The K-means algorithm Classes and Voronoi tesselation The sum-of-squares criterion For a single cluster For the complete data set "Within-class" and "Between-class" dispersions Limitations of the sum-of-squares criterion Quadratic nature Dependance on the number of clusters Convergence of K-means Incremental K-means Why is incremental K-means useful ? The algorithm Order of presentation of the observations Convergence of incremental K-means |
||
|
TUTORIAL |
||
_________________________________________________
|
Tutorial 2 |
In this second Tutorial, we address the important question of the initialization of K-means (number of clusters and initial positions of the centroids). There is no theory to settle the issue, and the analyst has to resort to a series of heuristics that will most likely provide satisfactory (if not optimal) answers after some trial and errors.
Clusters are easily obtained, but because so many different solutions can be reached depending on the initialization, it is mandatory to validate the final result. Clusters should be, as much as possible, compact, well separated, and interpretable, possibly with the help of some additional variables. Many interpretation aids are available, and we here describe the most popular of these aids.
INITIALIZATION OF K-MEANS
INTERPRETATION AIDS
|
Initialization of K-means Initializing the centroids Random initialization Multiple initilizations Dead centroids Computation time Expert knowledge "Uniform" distribution of initial prototypes Maximizing initial distances MaxMin initialization "Anomalous pattern" initialisation Initialization by regular intervals Choosing the number of clusters Trying several values of K Preliminary hierarchical clustering Interpretation aids Within cluster Cluster populations Distances to centroid Scatterplots PCA on cluster Between clusters Pairwise distances F statistic Scatterplots Extra variables Nominal variables Continuous variables |
||
|
TUTORIAL |
||
____________________________________
Related readings :
|
Want to contribute to this site ? |