Discriminant Analysis

Strictly speaking, Discriminant Analysis is synonymous with supervised Classification. The term therefore covers many different techniques, such as Logistic Regression, Decision Trees or Neural Networks.

The two "Discriminant Analysis"

But common use of the expression refers more specifically to the following two approaches to classification.

1) A descriptive approach (Discriminant Factor Analysis)

We suggest that you first read the entry on Fisher's linear discriminant.

The goal of Discriminant Factor Analysis is to identify straight lines in the space of attributes such that the class projections on these lines are particularly well separated. These directions will be identified as particular linear combinations of the original variables, and are called "Discriminant Axes" or "Discriminant Factors".

In the top image of this illustration, neither x1 nor x2 is individually of much use for separating class C1 from class C2, because the projections of these two classes on the axes severely overlap.

But the oblique line whose equation is x1 = x2 (bottom image of the illustration) is clearly a perfect discriminant factor, because the class projections on this line are completely separated.

When there are k multinormal classes in p-dimensional space (p > k), we'll show that Discriminant Factor Analysis (DFA) identifies the optimal set of (k - 1) orthogonal projection axes which best separate the classes : these axes are the (k - 1) discriminant factors. Adding more factors does not help : in fact, we'll show that DFA can be interpreted as a Principal Component Analysis (PCA) of the set of the k class barycenters weighted by the corresponding class populations. If you're familiar with PCA, you'll realize that the projections of the class barycenters on the (k - 1)-dimensional subspace spanned by the discriminant factors are "as separated" as possible in the PCA sense, and that the projections of these barycenters on any straight line orthogonal to this projection subspace are superimposed (no separation at all).
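As a concrete sketch of this construction (illustrative code of ours, not taken from the text) : the discriminant factors can be computed as the leading eigenvectors of S_w⁻¹S_b, where S_w is the within-class scatter matrix and S_b the between-class scatter matrix of the population-weighted class barycenters. With k = 3 synthetic multinormal classes in p = 4 dimensions, only k - 1 = 2 eigenvalues come out non-zero :

```python
import numpy as np

rng = np.random.default_rng(0)

# Three synthetic 4-dimensional multinormal classes (k = 3, p = 4).
means = np.array([[0., 0, 0, 0], [4., 4, 0, 0], [0., 4, 0, 0]])
X = np.vstack([m + rng.normal(size=(100, 4)) for m in means])
y = np.repeat([0, 1, 2], 100)

# Within-class scatter S_w, and between-class scatter S_b built from the
# class barycenters weighted by the class populations (as in the text).
overall = X.mean(axis=0)
S_w = sum(np.cov(X[y == c].T, bias=True) * (y == c).sum() for c in range(3))
S_b = sum((y == c).sum() * np.outer(X[y == c].mean(0) - overall,
                                    X[y == c].mean(0) - overall)
          for c in range(3))

# Discriminant axes = eigenvectors of S_w^{-1} S_b, sorted by decreasing
# eigenvalue. Since the k = 3 barycenters span only a 2-dimensional subspace,
# the last p - (k - 1) = 2 eigenvalues are (numerically) zero.
eigvals, eigvecs = np.linalg.eig(np.linalg.inv(S_w) @ S_b)
order = np.argsort(eigvals.real)[::-1]
eigvals = eigvals.real[order]
axes = eigvecs.real[:, order[:2]]      # the two discriminant factors
```

Projecting the data on `axes` then gives the "as separated as possible" view described above.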

Visual examination of the class projections on the various planes defined by pairs of discriminant factors provides as much intuitive information as possible about the data structure (mostly shapes and relative positions of the classes).

2) A predictive approach : Predictive Discriminant Analysis

Predictive Discriminant Analysis is both the most fundamental and the most popular classification technique.

* Normality assumption

Discriminant Analysis is a parametric technique, as it assumes a particular functional form for the class densities. More specifically, it assumes that the classes are multinormal. This strong assumption allows deriving both :

* Explicit and optimal rules for assigning new observations to one class or another.

* Accurate estimates of the a posteriori probabilities for every class. In other words, given a new observation, DA calculates for every i a good estimate of the probability Pi that this new observation belongs to class Ci.

In the general case, the discriminant functions of DA are quadratic, but additional simplifying assumptions allow considering only linear discriminant functions, which then lead to piecewise linear decision boundaries ("Linear Discriminant Analysis"). The good news is that this simplified DA sometimes performs better than the full version, for reasons that we outline below.
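The contrast between the two variants can be seen with scikit-learn's implementations (an illustration of ours, assuming scikit-learn is available) : `QuadraticDiscriminantAnalysis` estimates one covariance matrix per class and produces quadratic boundaries, while `LinearDiscriminantAnalysis` pools the covariances and produces linear boundaries; both return a posteriori class probabilities :

```python
import numpy as np
from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                           QuadraticDiscriminantAnalysis)

rng = np.random.default_rng(1)
# Two Gaussian classes with *different* covariance matrices, so the
# theoretically optimal boundary is quadratic.
X0 = rng.multivariate_normal([0, 0], [[1, 0], [0, 1]], 200)
X1 = rng.multivariate_normal([2, 2], [[3, 0], [0, 0.3]], 200)
X = np.vstack([X0, X1])
y = np.repeat([0, 1], 200)

lda = LinearDiscriminantAnalysis().fit(X, y)     # pooled covariance: linear
qda = QuadraticDiscriminantAnalysis().fit(X, y)  # one covariance per class

# A posteriori probabilities for a new observation near the class-0 center.
p = qda.predict_proba([[0.0, 0.0]])[0]
```

The point (0, 0) sits at the center of class C0, so `p` is close to (1, 0).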

* Class densities estimation

Discriminant Analysis is an indirect technique. It operates by :

- First estimating the means and covariance matrices of each class (which is enough for completely specifying a multinormal distribution),

- And, in a second phase, by calling on Bayes's theorem for estimating the a posteriori probabilities of the various classes, which then leads to determining the decision regions.
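The two phases can be written out in a few lines (a minimal sketch of ours, assuming two classes with equal a priori probabilities; all names are illustrative) :

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(2)
X0 = rng.multivariate_normal([0, 0], np.eye(2), 150)
X1 = rng.multivariate_normal([3, 3], np.eye(2), 150)

# Phase 1 : estimate the mean and covariance matrix of each class,
# which completely specifies each multinormal class density.
params = [(Xi.mean(axis=0), np.cov(Xi.T)) for Xi in (X0, X1)]
priors = np.array([150, 150]) / 300.0          # a priori class probabilities

# Phase 2 : Bayes' theorem turns class densities into a posteriori
# class probabilities for a new observation x.
def posterior(x):
    likes = np.array([multivariate_normal(m, S).pdf(x) for m, S in params])
    joint = priors * likes
    return joint / joint.sum()
```

For a point near the first class center, e.g. `posterior([0.2, -0.1])`, the result is close to (1, 0), and the decision regions follow by assigning each x to the class with the highest a posteriori probability.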

This approach is to be contrasted with that of, say, Logistic Regression, which estimates class probabilities directly, that is, without the intermediate step of class density estimation.

Strengths and weaknesses of Discriminant Analysis

The simplest (and usually very poor) parametric classification technique consists in assigning an observation to the class whose barycenter is closest to the observation, in the Euclidean sense of the term. This is a "first order" technique, in that it considers only the barycenters of the classes, whose coordinates are linear (first degree) functions of the observations, and does not take their spreads in the various directions of space into consideration.
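This "nearest barycenter" rule fits in a few lines (an illustrative sketch of ours) : the class barycenters are estimated from the data, and nothing else about the class shapes is used.

```python
import numpy as np

rng = np.random.default_rng(3)
# Two synthetic classes; only their barycenters will matter to the rule.
X0 = rng.multivariate_normal([0, 0], np.eye(2), 200)
X1 = rng.multivariate_normal([4, 0], np.eye(2), 200)
barycenters = np.array([X0.mean(axis=0), X1.mean(axis=0)])

def nearest_barycenter(x):
    # Assign x to the class whose barycenter is closest in the
    # Euclidean sense; the class spreads are ignored entirely.
    return int(np.argmin(np.linalg.norm(barycenters - np.asarray(x), axis=1)))
```

The rule works here only because the classes happen to be spherical with equal spread; the next paragraphs explain what it misses otherwise.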

The next step is to take the "second order information" about classes into account, that is, to consider their covariance matrices in addition to their barycenters : the coefficients of a covariance matrix are quadratic (second degree) functions of the observations. This is exactly what Discriminant Analysis does. In fact, rather than emphasizing the "class normality" assumption, it is equally justified to say that Discriminant Analysis is the simplest "second order classification technique", with then no reference to a particular functional form of the class densities. But because a multinormal density is fully determined by its mean and its covariance matrix, Discriminant Analysis is the ideal classification technique for multinormal classes.

The beauty of it is that it will turn out that Discriminant Analysis may still be interpreted as a "nearest barycenter" technique (first order), provided that the usual Euclidean distance be replaced by another "distance", the Mahalanobis distance, which is defined from second order information about the classes.
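A small worked example (ours, with a covariance matrix assumed known for simplicity) shows how the two "distances" can disagree : with classes strongly stretched along x1, a point can be Euclidean-closer to the wrong barycenter yet Mahalanobis-closer to the right one.

```python
import numpy as np

# Two class barycenters sharing a covariance matrix stretched along x1.
mu0, mu1 = np.array([0., 0.]), np.array([3., 1.])
S = np.array([[9., 0.], [0., 0.25]])
S_inv = np.linalg.inv(S)

def mahalanobis(x, mu):
    # "Distance" built from second order information : directions of
    # large variance are discounted, directions of small variance amplified.
    d = x - mu
    return float(np.sqrt(d @ S_inv @ d))

x = np.array([3., 0.])
# Euclidean : x is much closer to mu1 (distance 1 versus 3)...
d_euc = [np.linalg.norm(x - mu0), np.linalg.norm(x - mu1)]
# ...but Mahalanobis assigns x to class 0 (distance 1 versus 2), because
# the offset from mu0 lies along the high-variance x1 direction.
d_mah = [mahalanobis(x, mu0), mahalanobis(x, mu1)]
```

The "nearest barycenter in the Mahalanobis sense" rule is thus first order in form but second order in substance.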

So the strength of Discriminant Analysis is that it is an optimal yet simple, complete and accurate classification technique whenever the basic "second order" assumptions accurately describe the setting (multinormal classes).

Of course, this is also where its weakness is. Classes are never perfectly multinormal : to what extent do the performances of Discriminant Analysis degrade when the class densities depart from normality ? Theory has no definite answer to this question, but one of the reasons for the popularity of Discriminant Analysis is that decades of intensive use show that DA is reasonably robust with respect to departure from the standard assumptions.

Another weakness of the full Discriminant Analysis is that as many covariance matrices have to be estimated as there are classes. This easily leads to models containing tens or even hundreds of parameters, a rather large number in view of the usually limited amount of design data (see "bias-variance tradeoff"). Consequently, the full DA tends to be unstable (strong dependence on the design data), whereas the "restricted" versions, with fewer parameters, tend, on the contrary, to be more stable, but at the expense of a strong bias. Modern versions of Discriminant Analysis incorporate regularization mechanisms, whose principles are similar to that of Ridge Regression, and which allow fine tuning of the balance between bias and variance of a DA model.
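One concrete regularized variant (an illustration of ours, assuming scikit-learn; not necessarily the specific mechanism the text alludes to) is Linear Discriminant Analysis with shrinkage, which shrinks the pooled covariance estimate toward a multiple of the identity, trading a little bias for much less variance :

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(4)
# Few observations in a high-dimensional space : the 40x40 covariance
# matrix must be estimated from only 50 points, an unstable situation.
p = 40
X = np.vstack([rng.normal(0.0, 1, (25, p)), rng.normal(0.5, 1, (25, p))])
y = np.repeat([0, 1], 25)

# Plain LDA versus LDA with Ledoit-Wolf shrinkage of the covariance
# estimate -- the same bias/variance dial as in Ridge Regression.
plain = LinearDiscriminantAnalysis(solver='lsqr').fit(X, y)
shrunk = LinearDiscriminantAnalysis(solver='lsqr', shrinkage='auto').fit(X, y)
```

On fresh test data, the shrunk model is typically the more stable of the two.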

Extensions of Discriminant Analysis

Yet, there comes a point where departure of the class densities from normality is so large that Discriminant Analysis becomes ineffective. The analyst then has to turn to more sophisticated (but also more complex) techniques, of which we mention only a few.

Kernel Discriminant Analysis

A quite general paradigm for getting around the limitations of a low-order modeling technique is to first project the design data into a high-dimensional space by an appropriate non-linear mapping, then use the traditional low-order technique in this high-dimensional space, and finally bring the data, together with the decision regions thus found, back into the original data space. When applied to Discriminant Analysis, this approach is called "Kernel Discriminant Analysis". Class boundaries can then be smooth surfaces of any shape, rather than just quadratic surfaces.
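True Kernel Discriminant Analysis performs the mapping implicitly through a kernel; as a sketch of the underlying idea only (ours, assuming scikit-learn), one can make the mapping explicit with degree-2 polynomial features. Two concentric classes that no straight line can separate become linearly separable in the enlarged space :

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(5)
# Class 0 : inner disc; class 1 : surrounding ring.
r = np.concatenate([rng.uniform(0, 1, 200), rng.uniform(2, 3, 200)])
t = rng.uniform(0, 2 * np.pi, 400)
X = np.column_stack([r * np.cos(t), r * np.sin(t)])
y = np.repeat([0, 1], 200)

# Linear DA in the original space fails on this geometry...
linear_acc = LinearDiscriminantAnalysis().fit(X, y).score(X, y)

# ...but after the explicit map (x1, x2) -> (x1, x2, x1^2, x1*x2, x2^2),
# the feature x1^2 + x2^2 makes the classes linearly separable.
Phi = PolynomialFeatures(degree=2, include_bias=False).fit_transform(X)
mapped_acc = LinearDiscriminantAnalysis().fit(Phi, y).score(Phi, y)
```

The kernel trick lets KDA do the same without ever computing the high-dimensional coordinates explicitly.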

Theory and calculations then become pretty heavy, and KDA still has to find its place in the daily modeling tool-box of the analyst.

Logistic Regression

A more popular solution is to turn to Logistic Regression. We'll see that in Discriminant Analysis, a certain function of the a posteriori class probabilities, called the logit, turns out to be a linear function of the data attributes. The normality assumption is nowhere to be found in the definition of the logit (although its coefficients depend on the result of the previous estimation of the class covariance matrices), and DA can calculate the a posteriori probabilities only on the basis of the fact that the logit is linear in x.

It is then tempting to do away with the oversimplifying normality assumption of Discriminant Analysis, keep only the assumption about the linearity of the logit (whose coefficients will then have to be calculated by some other method) and see what comes out of this weaker assumption.

The result is the powerful Logistic Regression. Linearity of the logit does not enforce a particular functional form on the class densities, but rather a certain relationship between the densities of the different classes. Of course, multinormal densities satisfy this relationship, but only as a special case : linearity of the logit (implicitly) defines a much larger family of probability densities, and Logistic Regression is therefore more general than Discriminant Analysis.
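That the logit is exactly linear for equal-covariance multinormal classes can be verified numerically (a check of ours) : the quadratic terms of the two log-densities cancel, leaving log(f1(x)/f0(x)) = w·x + b with w = S⁻¹(μ1 - μ0) and b = -½(μ1ᵀS⁻¹μ1 - μ0ᵀS⁻¹μ0) :

```python
import numpy as np
from scipy.stats import multivariate_normal

# Two Gaussian class densities sharing the same covariance matrix S.
S = np.array([[2.0, 0.5], [0.5, 1.0]])
mu0, mu1 = np.array([0., 0.]), np.array([2., 1.])
f0 = multivariate_normal(mu0, S)
f1 = multivariate_normal(mu1, S)

# Closed-form coefficients of the (linear) logit.
S_inv = np.linalg.inv(S)
w = S_inv @ (mu1 - mu0)
b = -0.5 * (mu1 @ S_inv @ mu1 - mu0 @ S_inv @ mu0)

# At any point x, the log density ratio equals the linear form w.x + b.
x = np.array([0.7, -1.3])
logit = f1.logpdf(x) - f0.logpdf(x)
linear_form = w @ x + b
```

Logistic Regression keeps only this linear form and estimates w and b directly, without ever estimating S, μ0 or μ1.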

Models that do not specify class densities, but only a functional relationship between class densities are called semi-parametric models.

------

Yet, everything has a price, and Logistic Regression is more complex, requires more computing power and is sometimes more unstable than Discriminant Analysis. So the rule may be : start with DA, and turn to Logistic Regression only if DA cannot deliver satisfactory results.

Neural Networks

Neural Networks may be regarded as classification techniques that take an additional step towards generalization by removing all constraints on class densities. The most popular Neural Network, the Multilayer Perceptron (MLP), may in fact be perceived as a generalization of Logistic Regression, and on difficult problems, some people might regard the MLP as the most powerful classifier available.

Yet, again, gaining in flexibility means losing somewhere else :

* By their very architecture (which, in the case of the MLP, extends that of Logistic Regression), Neural Networks contain more parameters to estimate than Logistic Regression, and therefore require larger data sets for reaching a comparable level of accuracy and/or stability (see again the bias-variance tradeoff).

* "Training" (model fitting) may be very long and uncertain (many local minima).

* The probability distribution of the parameters of a MLP is of course unknown, which precludes designing confidence intervals or tests on the calculated values of these parameters.

Conclusion

Discriminant Analysis has been the workhorse of classification for many years, and will remain so for many more years. It is simple, very well understood from a theoretical standpoint, and reasonably effective on most ordinary problems : solving a new classification problem should start by using Discriminant Analysis. Only when it becomes clear that DA is not powerful enough for solving the problem at hand satisfactorily should the analyst envision moving on to more sophisticated techniques like Logistic Regression or Neural Networks.

_______________________________________________________________________________

 Tutorial 1

Owing to the complexity of the subject, the first Tutorial is just an overview of Discriminant Analysis, which expounds the basic principles without going into the mathematics.

OVERVIEW OF DISCRIMINANT ANALYSIS

- Discriminant Factor Analysis (DFA)
- Building a classifier
- "Geometrical" Classification Functions
- "Probabilistic" Classification Functions
- Costs
- Assumption on classes
- Choosing the "best" set of predictors
- TUTORIAL

____________________________________________

 Tutorial 2

In this Tutorial, we examine the "descriptive part" of Discriminant Analysis. It identifies directions in the space of observations on which the classes are particularly well separated. Just as in Principal Components Analysis, data may then be projected on planes defined by pairs of such directions for visual interpretation.

DISCRIMINANT FACTOR ANALYSIS

- Which variables are useful for discriminating between classes ?
- Geometric criterion
- Statistical criterion
- Creating Discriminant Functions
- The simplest case
- What when class populations are not identical ?
- Discriminant Functions as Principal Components
- How many Discriminant Functions ?
- What when classes are not spherical ?
- The Mahalanobis distance
- Re-defining the "distance"
- What if classes have different covariance matrices ?
- The complete DFA scheme
- Is the model trustworthy ?
- Theory
- Validation techniques
- Linear or quadratic ?
- Predictor selection
- Why select predictors ?
- Selecting predictors in practice
- "Global quality" criterion, Wilks' Lambda
- "F-to-enter"
- "F-to-remove"
- The search strategy
- Categorical variables
- Class indicators
- Multiple Correspondence Analysis
- TUTORIAL

_____________________________________________________________

 Tutorial 3

We now want to build a real classifier, that is, a predictive model whose input will be the attributes of the observations (or "predictors"), and whose output will be :

* Either a "hard decision", that is a class label.

* Or, better, a probability for each class that the observation belongs to the class.

We'll do that with the concept of "Discriminant Functions", which we first describe in general terms.

BUILDING THE DISCRIMINANT ANALYSIS CLASSIFIER

- The general concept of "Discriminant Function"
- The geometric discriminant functions
- The Mahalanobis distance criterion
- Two classes only : Multiple Linear Regression
- Numerical coding of the classes
- Is the result the same as that of DA ?
- A slight improvement over standard DA
- MLR : a caveat
- Logistic Regression
- The probabilistic discriminant functions
- Bayes' theorem
- Linear and quadratic discriminant functions
- Class boundaries
- Actually building the classifier
- The parameters of the model
- Performance of the model on the sample
- The classification matrix
- Costs
- Is the model good ?
- Is the model trustworthy ?
- Theory
- Validation techniques
- Linear or quadratic ?
- Predictor selection
- Why select predictors ?
- Selecting predictors in practice
- "Global quality" criterion, Wilks' Lambda
- "F-to-enter"
- "F-to-remove"
- The search strategy
- Categorical variables
- Class indicators
- Multiple Correspondence Analysis
- TUTORIAL

______________________________________________________

 Tutorial 4

We give here some mathematical complements to substantiate the results of the previous Tutorials.

COMPLEMENTS ON DISCRIMINANT ANALYSIS

- The concept of "metrics", the Mahalanobis distance
- Decomposition of the total covariance matrix
- The Mahalanobis distance criterion
- The exact probabilistic discriminant functions
- Unequal covariance matrices
- Equal covariance matrices
- Score
- Variance of a projection on a line
- Discriminant Analysis "at large"
- TUTORIAL

_________________________________________________________

Related entries : Classification, Covariance matrix, Principal Components Analysis, Inertia, Fisher's linear discriminant