Hierarchical clustering

A very popular clustering paradigm that covers several different techniques.

# A hierarchy of clusters

Given a data set, a hierarchy is a set of subsets of the data (clusters) such that:

* The complete data set is a cluster.

* Every single observation is a cluster (singleton).

* Given two clusters in the hierarchy, either they have no observation in common, or one is included in the other (no overlapping).

For practical reasons, it is also common to require that every cluster (except the singletons) be partitioned into exactly two clusters of the hierarchy.
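These conditions can be checked mechanically. Below is a minimal Python sketch (the function name `is_hierarchy` is illustrative, not a standard API) that verifies them on a toy 4-observation hierarchy, with clusters represented as frozensets of observation indices:

```python
def is_hierarchy(clusters, n):
    """Check the hierarchy conditions over observation indices 0..n-1."""
    full = frozenset(range(n))
    # 1) The complete data set is a cluster.
    if full not in clusters:
        return False
    # 2) Every single observation is a cluster (singleton).
    if any(frozenset([i]) not in clusters for i in range(n)):
        return False
    # 3) Any two clusters are either disjoint or nested (no overlapping).
    for a in clusters:
        for b in clusters:
            if a & b and not (a <= b or b <= a):
                return False
    return True

# A valid binary hierarchy over 4 observations:
clusters = {frozenset([0]), frozenset([1]), frozenset([2]), frozenset([3]),
            frozenset([0, 1]), frozenset([2, 3]), frozenset([0, 1, 2, 3])}
print(is_hierarchy(clusters, 4))
```

Adding an overlapping cluster such as {1, 2} to this set would violate the third condition.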

-----

Such a structure can be graphically represented by a dendrogram (or tree).

* The cluster that contains the complete data set is called the "Root" of the tree.

* Every cluster in the tree is called a "Node".

* Every cluster (except the singletons) usually has two "Children".

* A line connecting a cluster to one of its children is a "Branch".

* At the bottom of the tree, singletons (single-observation clusters) are called "Leaves".

Note that the paths connecting a cluster to its leaves may have different numbers of branches.

# Hierarchy and Typology

What the analyst is interested in is not the hierarchy itself, but a typology, that is, a partition of the data set into clusters that are:

* Compact,

* Well separated,

* And that can be interpreted.

A hierarchy provides a convenient way to construct many typologies: any upper section of the tree defines a series of non-overlapping terminal clusters that partition the data set (red clusters in the lower image of the above illustration).

-----

Using hierarchies to build typologies raises two problems:

## 1) Building the hierarchy

Given a cluster, what is the most appropriate way to split it into two children? Alternatively, one may ask: "How should we select two clusters to merge into a single parent cluster?" These two questions give rise to Descending and Ascending Hierarchies, respectively.

## 2) Selecting a typology in the hierarchy

Given a hierarchy, which upper section of the tree should we retain as the final typology ?

# Ascending Hierarchical Clustering

Ascending (or "Agglomerative") Hierarchical Clustering operates by successively merging pairs of already existing clusters. The next pair of clusters to be merged is the pair exhibiting the smallest "distance" between its two clusters. The whole issue, then, is to find an appropriate definition of the "distance" between two clusters. There are many ways to do that, the most popular being the Ward distance, which we describe below.

Ascending Hierarchical Clustering first considers every observation as a 1-observation cluster (singleton), and the most common way to define the distance between two observations is simply the ordinary Euclidean distance. The first step is therefore to merge the two closest observations into a 2-observation cluster.

From there on, the Ward distance takes over (for singletons it is a monotone function of the Euclidean distance, so it selects the same pairs). At each step, the two clusters whose Ward distance is smallest are merged. The process stops when the remaining two clusters are finally merged into the complete data set.

-----

We now describe the most widely used rule for drawing the corresponding tree. Under this rule, the typologies most likely to be significant are those obtained by simply drawing a horizontal line across the tree and retaining the terminal clusters just above the line. Adjusting the height of the line is a convenient way to adjust the number of clusters (the granularity) of the retained typology.
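With scipy (assuming `scipy` is available), the horizontal-line rule corresponds to `fcluster` with `criterion="distance"`: raising or lowering the threshold `t` changes the granularity of the retained typology.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Three visual groups: two tight pairs and one isolated observation.
X = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.2, 5.1], [10.0, 0.0]])
Z = linkage(X, method="ward")   # the agglomeration schedule

# Cutting the dendrogram with a horizontal line at height t: every
# terminal cluster just above the line becomes one class of the typology.
labels_coarse = fcluster(Z, t=9.0, criterion="distance")  # higher line, fewer clusters
labels_fine = fcluster(Z, t=1.0, criterion="distance")    # lower line, more clusters
print(len(set(labels_coarse)), len(set(labels_fine)))
```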

Note that this is a major difference from K-means, which demands that the number of clusters be specified in advance.

# Descending Hierarchical Clustering

Descending (or "Splitting") Hierarchical Clustering operates the other way around. It starts with the complete data set considered as a mega-cluster, and splits it into two child-clusters. If the Ward distance is chosen, the split is made so that the Ward distance between the two child-clusters is as large as possible, so as to create two well separated clusters. The process is iterated recursively until there is nothing but singletons left.

The same tree-like representation and final typology selection procedures that are used for Ascending Hierarchical Clustering may also be used for the "Descending" approach.

# Ascending or Descending ?

Ascending and Descending Hierarchical Clustering seem at first to be mirror images of each other. But this is not so, for two reasons:

1. The Ascending procedure requires that a table of pairwise distances between clusters be updated after each merge. Choosing the next merge amounts to finding the smallest distance in this table, a long but manageable task.
Things are quite different for the Descending procedure. At each step, a cluster is scrutinized for splitting. If we adhere to the strict splitting criterion, we have to consider all possible 2-splits of the cluster and retain the best one, a procedure that becomes unmanageably long even for modest-size clusters. So we have to give up finding the best split, use shortcuts, and be satisfied with identifying just a "good" split.
2. Yet, "Descending" is quite attractive because it usually provides typologies that are easier to interpret than those produced by "Ascending". The reason is that:

* It's not absolutely clear that being "close" according to Ward's criterion can be interpreted in terms of cluster geometry.

* Whereas partitioning a cluster into two clusters with a large Ward distance means that these two child-clusters are compact and well separated.

Indeed, it can be argued that the Descending procedure is much closer to the way the human eye does clustering than the Ascending procedure.
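One common shortcut for finding a "good" (rather than the best) 2-split is a 2-means pass over the cluster. This is an assumption on our part, since the text does not prescribe a particular shortcut; a minimal numpy sketch:

```python
import numpy as np

def bisect(X, n_iter=20, seed=0):
    """Heuristic 2-split of a cluster: a plain 2-means pass, i.e. a
    'good' split found without enumerating all possible 2-splits."""
    rng = np.random.default_rng(seed)
    # Start from two observations chosen at random as centers.
    centers = X[rng.choice(len(X), size=2, replace=False)]
    for _ in range(n_iter):
        # Assign each observation to the nearest center...
        labels = np.argmin(((X[:, None] - centers) ** 2).sum(-1), axis=1)
        # ...then move each center to the mean of its assignees.
        for k in (0, 1):
            if np.any(labels == k):
                centers[k] = X[labels == k].mean(axis=0)
    return labels

# Two well-separated groups: the split recovers them.
X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
print(bisect(X))
```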

-----

In practice, the difficulties associated with the Descending approach make the Ascending approach by far the most widely used of the two.

# Segmentation trees

It is hard to give up the Descending approach, though, because of its usually good results and interpretability. Another way to get around the computation-time problem is to keep the general Descending scheme, but to decide splits on the basis of the values of the observations on a single variable, instead of all the variables as the Ward distance demands. So, for each cluster, the variables are reviewed one by one; for each variable, one considers the splits defined by a threshold value of that variable, and identifies the optimal threshold according to some criterion. One finally retains the variable and threshold that provide the best split. Many splitting criteria are available.

This technique is known as Segmentation Trees.
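Below is a minimal numpy sketch of one such split, using the total within-group sum of squares as the criterion (one possible choice among the many criteria mentioned above); the function name `best_split` is illustrative:

```python
import numpy as np

def best_split(X):
    """Review the variables one by one; for each candidate threshold
    (midpoints between consecutive observed values), score the split by
    the total within-group sum of squares. Return (variable, threshold)."""
    best = None
    for j in range(X.shape[1]):
        values = np.unique(X[:, j])
        for t in (values[:-1] + values[1:]) / 2:
            left, right = X[X[:, j] <= t], X[X[:, j] > t]
            score = (((left - left.mean(axis=0)) ** 2).sum()
                     + ((right - right.mean(axis=0)) ** 2).sum())
            if best is None or score < best[0]:
                best = (score, j, t)
    return best[1], best[2]

# The groups differ on variable 0 only, so the best split thresholds it.
X = np.array([[0.0, 1.0], [0.2, 0.9], [5.0, 1.1], [5.2, 1.0]])
print(best_split(X))
```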

-----

This general scheme is very reminiscent of Decision Trees, the difference being that:

* Decision Trees try to minimize a misclassification rate according to the modalities of a nominal variable (supervised classification).

* Whereas Segmentation Trees do not refer to any particular variable, and only try to identify good clusters (unsupervised classification).

__________________________________________________

# Tutorial

In this Tutorial, we describe the best known of the Hierarchical Clustering techniques: Ascending (or Agglomerative) Hierarchical Clustering. It is conceptually simple, and usually leads to satisfactory results.
Yet, as is often the case with heuristics, software packages leave many decisions to the user (distances between observations and between clusters, granularity of the typology), and it is therefore important to understand what lies behind all the options.

ASCENDING HIERARCHICAL CLUSTERING

* The data
* Distances between observations
    * The Euclidean distance
    * The Manhattan (or "City Block") distance
    * The Chebychev distance
    * The Power distance
* Distances between clusters
    * The "single linkage" distance
    * The "complete linkage" distance
    * The "average linkage" distance (unweighted, weighted)
    * The centroids distance (unweighted, weighted)
    * The "average group linkage" distance
    * The Ward distance
* The agglomerative algorithm
* Presentation of the hierarchy
    * The agglomeration schedule
    * Dendrogram
* Selecting a typology
    * Granularity is adjustable
    * Selection by dendrogram
    * Selection by distance plot
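The observation-level distances listed in the tutorial outline can be sketched as follows (the power distance shown uses one common parameterization, with exponents `p` and `r`; these are illustrative definitions, not the tutorial's own code):

```python
import numpy as np

def euclidean(x, y):
    """Ordinary straight-line distance."""
    return np.sqrt(((x - y) ** 2).sum())

def manhattan(x, y):
    """"City Block": sum of absolute coordinate differences."""
    return np.abs(x - y).sum()

def chebychev(x, y):
    """Largest absolute coordinate difference."""
    return np.abs(x - y).max()

def power(x, y, p=3, r=3):
    """Power distance: (sum |x_i - y_i|^p)^(1/r); generalizes the above."""
    return (np.abs(x - y) ** p).sum() ** (1 / r)

x, y = np.array([0.0, 0.0]), np.array([3.0, 4.0])
print(euclidean(x, y), manhattan(x, y), chebychev(x, y))
```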
