Cluster analysis
Also called "clustering", "segmentation" or "unsupervised classification".
-----
The data set displayed below is clearly made of three "clusters", that is, groups of densely packed observations.
This is an important piece of information, because it suggests that the overall population is in fact a mixture of three different populations, each with distinct characteristics.
The ultimate goal of unsupervised classification ("cluster analysis", "clustering" or "segmentation") is to identify these three populations (see lower image in the above illustration), and therefore, be able to assign to each observation a "class label" that tells which class the observation belongs to. In addition, one wishes to be able to assign such a class label to any new observation that will show up at a later date.
This situation is reminiscent of that encountered in supervised classification. Then, the available observations have a class label from the start, and the goal is to assign (usually, in a probabilistic way) a class label to any new observation. But the problem is now more difficult, because the available observations are not identified as belonging to one of the assumed populations : this information will have to be infered from the spatial distribution of the observations.
The expression "unsupervised" refers to the fact that no "supervisor" will tell us which population an observation from the data belongs to.
The lack of class labels is a severe handicap that can only partially be overcome. A careful analysis of the distribution of the data is the only way we can "guess" where the classes are. The two main difficulties that cluster analysis will meet are as follows :
Facing such a loosely defined problem, it is not surprising to discover a large number of techniques, many of them with a strong heuristic flavor. It is convenient to distinguish between two large families of techniques : "Partitionning" and "Hierarchical clustering".
The general idea is to partition the space of observations into a certain number of disjoint regions, and to assign a different class label to each of these regions. All the observations inside one region will be considered as belonging to the same class.
Each class will be represented by one (or sometimes, several) prorotype, a virtual observation that is meant to be particularly representative of the population of the class. Most often, the prototype of a class will be the barycenter of the observations in the class.
These prototypes will be first iteratively driven towards high density areas, and observations will then be assigned to classes on the basis of a proximity criterion to the different prototypes.
-----
There are many partitioning techniques for cluster analysis, the most popular being K-means and its variants. Note that a very special type of Neural Networks, known as Kohonen Maps (or "SOM") may be perceived as a partitioning technique that attempts to give a 2-D representation of the classes that preserves their relative positions.
It may seem irrealistic to assume that the true, underlying classes each occupy only a bounded region of space, and don't overlap. For example, Discriminant Analysis is considering multinormal classes that necessarily overlap. It is therefore natural to consider the possibility that classes do overlap.
This approach assumes that classes are indeed multinormal. For a given assumed number K of classes, the data set is considered as a sample drawn from K multinormal distributions, each one with a a priori probability. One then seeks the means and covariance matrices of these distributions, as well as the a priori probabilities, so as to maximize the likelihood of the data. This is done by using the Expectation Maximization (EM) algorithm.
In mixture models, classes overlap but each observation is considered as belonging to one class only (although one can only determine the probabilities for the observation to belong to any of the classes). Fuzzy clustering, on the other hand, considers that an observation can belong simultaneously to several classes to various degrees. This attitude reflects the fact that in many realistic situations, classes are defined in such a way that class membership is not exclusive, but is a matter of degree (think of the classes "Hot" and "Very hot").
Fuzzy clustering then assigns to each observation a "degree of membership" to each of the classes in a way that is consistent with the distribution of the data. The degrees of membership assigned to an observation sum up to 1.
Hierarchical clustering is an important family of techniques that generate a series of nested partitions, from the trivial partition with a single class (containing the complete data set) to the trivial partition where each observation is a cluster. In-between these two extremes are many candidate partitions that the analyst will have to choose from.
Most classes met in real life are unimodal : from a dense central kernel, the density decreases continuously when moving out in any direction. Many cluster analysis techniques rely on this image and pay considerable attention to subsets of observations with short distances between them (high density regions). They do that in several ways :
* By using distances between observations for constructing classes (hierarchical clustering techniques).
* By acknowledging the fact that subpopulations with many observations but a low inertia around their barycenter correspond to high density regions (K-means).
* By doing density estimation, either parametric (mixture models) of non parametric (methods based on K-nearest neighbors density estimation).
A clustering algorithm will always generate classes. But do these classes reflect the true structure of the data ? Does each class indeed contain one well defined cluster and its immediate neighborhood, as the analyst expects ?
There are two reasons why a typology might be considered unsatisfying :
It is therefore necessary to be able to quantify the quality of a partition, that is, to be able to appreciate whether classes do correspond to clusters.
-----
There are many such criteria of quality of a partition, which means that none of them is fully satisfactory. Let's mention :
Theories behind Cluster Analysis techniques are usually simple, leading to the idea that unsupervised classification is an easy business. It is not so.
Most cluster analysis techniques rely on metric concepts (distance, density) and therefore need to be fed with quantitative variables. Yet, data is seldom described exclusively by native quantitative variables. It is therefore necessary to first code the nominal variables into quantitative variables.
This can be done :
* Either by coding the modalities of the nominal variables with dummy indicator variables.
* Or by submitting the variables to a Multiple Correspondance Analysis, and then replacing the original nominal variables by the factors derived from the analysis.
-----
Note that it is possible, though, to define a numerical "degree of dissimilarity" between observations described by nominal variables. In this approach, the numerical coding of the nominal variables is superfluous.
The above illustration is misleading, because it suggests that data always contains clearly identifiable clusters. More often than not, there is no a priori reason to believe that such is the case and yet, algorithms will always produce classes even if fed with nearly uniform, structureless data. But the outcome of a cluster analysis (that is, a "typology") will be meaningful only if the discovered classes do correspond to actual clusters in the data.
Assessing the "clumpiness" of the data is difficult. It is based on the idea that if data contains well defined clusters, then :
* Classes will be stable.
* Only a certain number of classes will provide a good typology.
More precisely :
Several typologies are built, each time leaving a randomly closen subset of the data aside. If the clusters are well defined, this small size reduction of the design set should not affect the final classes much, and the typologies will look very similar.
On the other hand, if the data is more of less uniformly scattered, the various sets classes thus obtained will be very different (think of points uniformly distributed along a straight line : removing a few points at random will each time lead to very different "classes").
If there are well defined clusters in the data, a cluster analysis conducted with the appropriate number of clusters may lead to a particularly convincing typology (see below).
On the other hand, with a more or less structureless data set, all typologies will be equally poor, irrespective of the number of classes.
Many cluster analysis techniques (and more particularly partitionning techniques) tend to yield convex classes. There is no strong reason why real underlying classes should be convex and then, whatever the efforts of the analyst, the final typology will not reflect the true data structure, even in the case of well separated clusters.
In the above illustration, the dotted line is the boundary one would like to identify between the two natural clusters. But partitioning the data set into two convex sets will yield a result like what is shown in the lower image of the illustration, which does not correspond to the true structure of the data.
-----
There exist advanced clustering techniques that allow more flexible shapes for the final classes (CURE, ROCK, BIRCH...), but they are more complex to use than the most popular techniques.
All cluster analysis techniques allow the analyst to choose the number of classes in the final typology (height of the cutoff line for hierarchical clustering, number of prototypes for K-means...).
This flexibility is a double-edged sword. If the population actually contains well defined clusters, but if the chosen number of classes is inappropriate, then the final typology will :
* Either wrongfully merge well separated clusters (too few classes, illustration below),
* Or else wrongfully split homogenous clusters into several classes (too many classes, lower image of the following illustration).
In the first case, the typology is said to be "too coarse", and to be "too fine-grained" in the second case.
-----
Determining the appropriate number of classes to be retained is a mandatory phase of cluster analysis, but it is time consuming and the results are often ambiguous.
There is no formula that calculates the number of classes from the data. Rather, this number is to be found by trial and error. A series of partitions are created with an increasing number of classes, and for each new partition, the value of a quality criterion (see above) is calculated. The retained number of classes is that of the partition with the best value of the criterion.
--
One may also resort to visual aids. Suppose that a quantity is known to always increase with the number of classes, and then level-off when there are many classes. When its value is plotted against the number of classes, the plot will sometimes show a sudden "elbow" beyond which the evolution of the quantity is negligeable. One then will retain as the appropriate number of classes the number at which this discontinuity occurs.
On example of such a visual aid is the distance plot in hierarchical clustering.
Once a partition of the data has been finalized, one should then characterize the classes in the partition. Beyond the elementary characterizations (list of observations in each class, coordinates of the barycenters, univariate analysis...), there exist methods of interpretation that yield more information about the typology, but are also more difficult to implement. Let's mention :
It is possible to perform unsupervised classification not on observations, but on variables instead. The dissimilarity between two variables (akin the the distance between two observations) is generally defined from their correlation coefficient. Strongly correlated variables are then considered to be "close" to each other.
Unsupervised classification on variables is useful when there are many variables, and the risk of collinearity is high. After classification, all the variables in one class will be replaced by a single synthetic variable that best represents the variables of the class.
____________________________________________________________
Related readings:
|
Want to contribute to this site ? |