A very popular clustering paradigm that covers several different techniques.
Given a data set, a hierarchy is a set of subsets of the data (clusters) such that:
* The complete data set is a cluster.
* Every single observation is a cluster (singleton).
* Given two clusters in the hierarchy, either they have no observation in common, or one is included in the other (no partial overlapping).
For practical reasons, it is also common to require that any cluster (except the singletons) be partitioned into exactly two clusters of the hierarchy.
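The three defining properties above can be checked mechanically. Here is a minimal sketch in plain Python, on a made-up four-observation data set (names and data are illustrative only):

```python
# A tiny hierarchy on the data set {0, 1, 2, 3}, represented as a list of
# frozensets (clusters). Purely illustrative.
hierarchy = [
    frozenset({0, 1, 2, 3}),               # the root: the complete data set
    frozenset({0, 1}), frozenset({2, 3}),  # two intermediate clusters
    frozenset({0}), frozenset({1}),        # singletons (leaves)
    frozenset({2}), frozenset({3}),
]

def is_valid_hierarchy(clusters, data):
    """Check the three defining properties of a hierarchy."""
    full = frozenset(data)
    has_root = full in clusters
    has_singletons = all(frozenset({x}) in clusters for x in data)
    # Any two clusters are either disjoint or nested (no partial overlap).
    no_overlap = all(
        a.isdisjoint(b) or a <= b or b <= a
        for a in clusters for b in clusters
    )
    return has_root and has_singletons and no_overlap

print(is_valid_hierarchy(hierarchy, {0, 1, 2, 3}))  # True
```

Note that the pair {0, 1} and {1, 2}, for instance, would violate the no-overlap property: the two sets share observation 1 without one containing the other.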
Such a structure can be graphically represented by a dendrogram (or tree).
* The cluster that contains the complete data set is called the "Root" of the tree.
* Every cluster in the tree is called a "Node".
* Every cluster (except the singletons) usually has two "Children".
* A line connecting a cluster to one of its children is a "Branch".
* At the bottom of the tree, singletons (single observation clusters) are called "Leaves".
Note that paths connecting a cluster to the subsequent leaves may have different numbers of branches.
What the analyst is interested in is not the hierarchy itself, but a typology, that is, a partition of the data set into clusters that are:
* Well separated,
* And that can be interpreted.
A hierarchy provides a convenient way to construct many typologies: any upper section of the tree defines a series of non-overlapping terminal clusters that partition the data set (red clusters in the lower image of the above illustration).
Using hierarchies to build typologies raises two problems:
1) Building the hierarchy
Given a cluster, what is the most appropriate way to split it into two children? Alternatively, one may ask: "How should we select two clusters to merge into a single parent cluster?" These two questions give rise to Descending and Ascending Hierarchies, respectively.
2) Selecting a typology in the hierarchy
Given a hierarchy, which upper section of the tree should we retain as the final typology?
Ascending (or "Agglomerative") Hierarchical Clustering operates by successively merging pairs of already existing clusters. The next pair of clusters to be merged is chosen as the pair exhibiting the smallest "distance" between the two clusters of the pair. The whole issue is therefore to find an appropriate definition of the "distance" between two clusters. There are many ways to do so, the most popular being the Ward distance, which we describe below.
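To make "distance between two clusters" concrete before turning to the Ward distance, here is a minimal sketch (plain Python, toy data) of two of the classical definitions listed later in this Tutorial, single linkage and complete linkage:

```python
import math

def euclidean(p, q):
    return math.dist(p, q)  # ordinary Euclidean distance (Python 3.8+)

def single_linkage(A, B):
    """Distance between the two closest observations, one in each cluster."""
    return min(euclidean(a, b) for a in A for b in B)

def complete_linkage(A, B):
    """Distance between the two farthest observations, one in each cluster."""
    return max(euclidean(a, b) for a in A for b in B)

# Two small clusters on a line (illustrative data)
A = [(0.0, 0.0), (1.0, 0.0)]
B = [(4.0, 0.0), (6.0, 0.0)]
print(single_linkage(A, B))    # 3.0  (from (1,0) to (4,0))
print(complete_linkage(A, B))  # 6.0  (from (0,0) to (6,0))
```

Single linkage looks only at the closest pair of observations across the two clusters, while complete linkage looks at the farthest; the same two clusters can therefore be "close" under one definition and "far" under the other.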
So Ascending Hierarchical Clustering first considers every observation as a 1-observation cluster (singleton), and the most common way to define the distance between two observations is simply the ordinary Euclidean distance. The first step is therefore to merge the two closest observations into a 2-observation cluster.
From there on, the Ward distance takes over (it coincides with the Euclidean distance for singletons). At each step, the two clusters whose Ward distance is smallest are merged. The process stops when the last two remaining clusters are finally merged into the complete data set.
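The merging loop can be sketched in a few lines of plain Python. This is an illustrative, quadratic-time implementation, not an optimized one, and it uses the "increase in within-cluster variance" form of the Ward distance, which for singletons is a monotone function of the Euclidean distance and therefore selects the same pairs:

```python
def ward_distance(A, B):
    """Ward distance between clusters A and B (lists of points): the increase
    in within-cluster variance caused by merging them, equal to
    (|A||B| / (|A|+|B|)) * squared distance between the two centroids."""
    nA, nB = len(A), len(B)
    dim = len(A[0])
    cA = [sum(p[i] for p in A) / nA for i in range(dim)]
    cB = [sum(p[i] for p in B) / nB for i in range(dim)]
    return (nA * nB) / (nA + nB) * sum((a - b) ** 2 for a, b in zip(cA, cB))

def ascending_clustering(points):
    """Merge the two closest clusters (Ward) until one cluster remains.
    Returns the list of merges as (cluster_1, cluster_2, ward_distance)."""
    clusters = [[p] for p in points]   # start from singletons
    merges = []
    while len(clusters) > 1:
        # find the pair of clusters with the smallest Ward distance
        i, j = min(
            ((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
            key=lambda ij: ward_distance(clusters[ij[0]], clusters[ij[1]]),
        )
        d = ward_distance(clusters[i], clusters[j])
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges

# Toy 1-D data set: two tight pairs, far apart
points = [(0.0,), (1.0,), (5.0,), (6.0,)]
for a, b, d in ascending_clustering(points):
    print(a, b, round(d, 2))
```

On this toy data set the two tight pairs are merged first, at small Ward distances (0.5 each), and the final merge happens at a much larger distance (25.0); this kind of jump is precisely what the typology-selection step exploits.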
We describe the most widely used rule for drawing the corresponding tree. This rule is such that the typologies most likely to be significant are those obtained by simply drawing a horizontal line across the tree and retaining the terminal clusters just above the line. Adjusting the height of the line is a convenient way to adjust the number of clusters (granularity) of the retained typology.
Note that this is a major difference from K-means, which requires that the number of clusters be specified in advance.
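The horizontal-line rule can be sketched numerically: each merge drawn below the line reduces the cluster count by one, so cutting the tree at height h leaves one terminal cluster per branch crossing the line. A minimal sketch with made-up merge distances (illustrative numbers only):

```python
# Suppose an ascending procedure on 6 observations produced 5 merges at
# these Ward distances (illustrative numbers):
merge_distances = [0.4, 0.5, 0.9, 7.2, 9.8]
n_leaves = 6

def n_clusters_at_height(h):
    """Number of terminal clusters obtained by cutting the tree at height h:
    each merge below the line reduces the cluster count by one."""
    return n_leaves - sum(1 for d in merge_distances if d <= h)

print(n_clusters_at_height(1.0))   # 3: the line sits above the three small merges
print(n_clusters_at_height(10.0))  # 1: the line sits above everything
```

Here the large gap between the merge at 0.9 and the merge at 7.2 suggests cutting in between, which yields a 3-cluster typology.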
Descending (or "Splitting") Hierarchical Clustering operates the other way around. It starts with the complete data set considered as a mega-cluster, and splits it into two child-clusters. If the Ward distance is chosen, the split is made so that the Ward distance between the two child-clusters is as large as possible, so as to create two well separated clusters. The process is iterated recursively until there is nothing but singletons left.
The same tree-like representation and final typology selection procedures that are used for Ascending Hierarchical Clustering may also be used for the "Descending" approach.
Ascending and Descending Hierarchical Clustering may seem at first to be mirror images of each other. But it is not so, for two reasons:
* It is not absolutely clear that being "close" according to Ward's criterion can be interpreted in terms of cluster geometry,
* Whereas partitioning a cluster into two children with a large Ward distance does mean that these two child-clusters are compact and well separated.
Indeed, it can be argued that the Descending procedure is much closer to the way the human eye does clustering than the Ascending procedure.
In practice, the difficulties associated with the Descending approach (in particular, the computational cost of finding the best split of a cluster) make the Ascending approach by far the most widely used of the two.
It is hard to give up the Descending approach entirely, though, because of its usually good results and interpretability. One way to get around the computation time problem is to keep the general Descending scheme, but to decide splits on the basis of the values of the observations on a single variable, instead of on all the variables as the Ward distance requires. So, for each cluster, the variables are reviewed one by one; for each variable, one considers the splits defined by a threshold value for that variable, and identifies the optimal threshold according to some criterion. One finally retains the variable and threshold that provide the best split. Many splitting criteria are available.
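A minimal sketch of such a single-variable split (plain Python, illustrative data; the criterion used here, minimizing the within-group sum of squared deviations summed over all variables, is just one of the many possible choices):

```python
def best_split(rows):
    """Return (cost, variable index, threshold) for the best binary split of
    a cluster on a single variable, under a within-group variance criterion."""
    def sse(values):
        # sum of squared deviations from the mean
        if not values:
            return 0.0
        m = sum(values) / len(values)
        return sum((v - m) ** 2 for v in values)

    best = None
    n_vars = len(rows[0])
    for j in range(n_vars):
        # candidate thresholds: midpoints between consecutive distinct values
        vals = sorted(set(r[j] for r in rows))
        for lo, hi in zip(vals, vals[1:]):
            t = (lo + hi) / 2
            left = [r for r in rows if r[j] <= t]
            right = [r for r in rows if r[j] > t]
            # score the split by the spread left inside each child cluster,
            # summed over all variables
            cost = sum(sse([r[k] for r in left]) + sse([r[k] for r in right])
                       for k in range(n_vars))
            if best is None or cost < best[0]:
                best = (cost, j, t)
    return best

# Two well-separated groups of observations (illustrative data)
rows = [(0.0, 5.0), (1.0, 5.5), (8.0, 0.0), (9.0, 0.5)]
cost, var, threshold = best_split(rows)
print(var, threshold)  # prints: 0 4.5
```

On this toy data set either variable would separate the two groups equally well; the procedure retains the first one it encounters, splitting variable 0 at the threshold 4.5.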
This technique is known as Segmentation Trees.
This general scheme is very reminiscent of Decision Trees, the difference being that:
* Decision Trees try to minimize a misclassification rate according to the modalities of a nominal variable (supervised classification),
* Whereas Segmentation Trees do not refer to any variable in particular, and only try to identify good clusters (unsupervised classification).
In this Tutorial, we describe the best known of the Hierarchical Clustering techniques: Ascending (or Agglomerative) Hierarchical Clustering. It is conceptually simple, and usually leads to satisfactory results.
Yet, as is often the case with heuristics, software packages leave many decisions to the user (distances between observations and between clusters, granularity of the typology), and it is therefore important to understand what lies behind all these options.
ASCENDING HIERARCHICAL CLUSTERING
Distances between observations
The Euclidean distance
The Manhattan (or "City Block") distance
Distance between clusters
The "single linkage" distance
The "complete linkage" distance
The "average linkage" distance
The centroids distance
The "average group linkage" distance
The Ward distance
The agglomerative algorithm
Presentation of the hierarchy
The agglomeration schedule
Selecting a typology
Granularity is adjustable
Selection by dendrogram
Selection by distance plot
Related readings: