Discriminant Analysis

Strictly speaking, Discriminant Analysis is synonymous with supervised Classification. The term therefore refers to many different techniques such as Logistic Regression, Decision Trees or Neural networks.

The two "Discriminant Analysis"

But common use of the expression refers more specifically to the following two approaches to classification.

1) A descriptive approach (Discriminant Factor Analysis)

We suggest that you first read the entry on Fisher's linear discriminant.

The goal of Discriminant Factor Analysis is to identify straight lines in the space of attributes such that class projections on these lines are particularly well separated. These directions will be identified as particular linear combinations of the original variables, and are called or "Discriminant Axes" or "Discriminant Factors".

In the top image of this illustration, neither x1 nor x2 is individually of much use for separating class C1  from class C2 , because the projections of these two classes on the axes severely overlap.

 

 

But the oblique line whose equation is  x1 =  x2 (bottom image of the illustration) is clearly a perfectly discriminant factor because class projections on this line are completely separated.

When there are k multinormal classes in p-dimensional space (p > k), we'll show that Discriminant Factor Analysis (DFA) identifies the optimal set of (k - 1) orthogonal projection axes which best separate the classes : these axes are the (k - 1) discriminant factors. Adding more factors does not help : in fact, we'll show that DFA can be interpreted as a Principal Component Analysis (PCA) of the set of the k class barycenters ponderated by the corresponding class populations. If you're familiar with PCA, you'll realize that the projections of the class barycenters on the (k - 1)-dimensional subspace spanned by the discriminant factors are "as separated" as possible in the PCA sense, and that the projections of these barycenters on any straight line orthogonal to this projection subspace are superimposed (no separation at all).

Visual examination of the class projections on the various planes defined by pairs of discriminant factors provides as much intuitive information as possible about the data structure (mostly shapes and relative positions of the classes).

2) A predictive approach : Predictive Discriminant Analysis

Predictive Discriminant Analysis is both the most fundamental and the most popular classification technique.

 

            * Normality assumption

Discriminant Analysis is a parametric technique, as it assumes a particular functional form for the class densities. More specifically, it assumes that the classes are multinormal. This strong assumption allows deriving both :

    * Explicit and optimal rules for assigning new observations to one class or another.

    * Accurate estimates of the a posteriori probabilities for every class. In other words, given a new observation, DA calculates for every i a good estimate of the probability Pi for this new observation to belong to class Ci.

In the general case, the discriminant functions of DA are quadratic, but additional simplifying assumptions allow considering only linear discriminant functions, that then lead to piecewise linear decision boundaries ("Linear Discriminant Analysis"). The good news is that this simplified DA sometimes performs better than the full version for reasons that we outline below.

 

            * Class densities estimation

Discriminant Analysis is an indirect technique. It operates by :

    - First estimating the means and covariance matrices of each class (which is enough for completely specifying a multinormal distribution),

    - And, in a second phase, by calling on Bayes's theorem for estimating the a posteriori probabilities of the various classes, which then leads to determining the decision regions.

This approach is to be constrasted with that of, say, Logistic Regression which estimates class probabilities directly, that is without the intermediate step of class density estimation.

Strengths and weaknesses of Discriminant Analysis

The simplest (and usually very poor) parametric classification technique consists in assigning an observation to the class whose barycenter is closest to the observation, in the Euclidian sense of the term. This is a "first order" technique, in that in considers only the barycenters of the classes, whose coordinates are linear (first degree) functions of the observations, and does not take their spreads in the various directions of space into consideration.

The next step is to take the "second order information" about classes into account, that is, to consider their covariance matrices in addition to their barycenters : the coefficients of a covariance matrix are quadratic (second degree) functions or the observations . This is exactly what Discriminant Analysis does. In fact, rather than emphasizing the "class normality" assumption, it is equally justified to say that Discriminant Analysis is the simplest "second order classification technique", with then no reference to a particular functional form of the class densities. But because a multinormal density is fully determined by its mean and its covariance matrix, Discriminant Analysis is the ideal classification technique for multinormal classes.

The beauty of it is that it will turn out that Discriminant Analysis may still be interpreted as a "nearest barycenter" technique (first order), provided that the usual Euclidian distance be replaced by another "distance", the Mahalanobis distance, which is defined from second order information about the classes.

So the strength of Discriminant Analysis is that it is an optimal yet simple, complete and accurate classification technique whenever the basic "second order" assumptions accurately describe the setting (multinormal classes).

 

Of course, this is also where its weakness is. Classes are never perfectly multinormal : to what extent do the performances of Discriminant Analysis degrade when the class densities depart form normality ? Theory has no definite answer to this question, but one of the reasons for the popularity of Discriminant Analysis is that decades of intensive use show that DA is reasonably robust with respect to departure from the standard assumptions.

Another weakness of the full Discriminant Analysis is that as many covariance matrices have to be estimated as there are classes. This easily leads to models containing tens or even hundreds of parameters, a rather large number in view of the usually limited amount of design data (see "bias-variance tradeoff"). Consequently, the full DA tends to be unstable (strong dependence of design data), whereas the "restricted" versions, with fewer parameters, tend, to the contrary, to be more stable but at the expense of a strong bias. Modern versions of Discriminant Analysis incorporate regularization mechanisms, whose principles are similar to that of Ridge Regression, and which allow fine tuning of the balance between bias and variance of a DA model.

Extensions of Discriminant Analysis

Yet, there comes a point where departure of the class densities from normality is so large that Discriminant Analysis becomes ineffective. The analyst then has to turn to more sophisticated (but also more complex) techniques, of which we mention only a few.

Kernel Discriminant Analysis

A quite general paradigm for getting around the limitations of a low-order modeling technique is to first project the design data into a high-dimensional space by an appropriate non linear projection, then use the traditional low-order technique in this high-dimensional space for finally bringing the data together with the decision regions thus found back into the original data space. When applied to Discriminant Analysis, this approach is called "Kernel Discriminant Analysis". Class boundaries can this way be smooth surfaces of any shape, rather than just quadratic surfaces.

Theory and calculations then become pretty heavy, and KDA still has to find its place in the daily modeling tool-box of the analyst.

Logistic Regression

A more popular solution is to turn to Logistic Regression. We'll see that in Discriminant Analysis, a certain function of the a posteriori class probabilities, called the logit, turns out to be a linear function of the data attributes. The normality assumption is nowhere to be found in the definition of the logit (although its coefficients depend on the result of the previous estimation of the class covariance matrices), and DA can calculate the a posteriori probabilities only on the basis of the fact that the logit is linear in x.

It is then tempting to do away with the oversimplifying normality assumption of Discriminant Analysis, keep only the assumption about the linearity of the logit (whose coefficients will then have to be calculated by some other method) and see what comes out of this weaker assumption.

The result is the powerful Logistic Regression. Linearity of the logit does not enforce a particular functional form on the class densities, but rather a certain relationship between the densities of the different classes. Of course, multinormal densities satisfy this relationship, but only as a special case : linearity of the logit (implicitely) defines a much larger family of probabilities densities, and Logistic Regression is therefore more general than Discriminant Analysis.

Models that do not specify class densities, but only a functional relationship between class densities are called semi-parametric models.

------

Yet, everything has a price, and Logistic Regression is more complex, requires more computing power and is sometimes more unstable than Discriminant Analysis. So the rule may be : start with DA, and turn to Logistic Regression only if DA cannot deliver statisfactory results.

Neural Networks

Neural Networks may be regarded as classification techniques that take an additional step towards generalization by removing all constraints on class densities. The most popular Neural Network, the Multilayer Perceptron (MLP), may in fact be perceived as a generalization of Logistic Regression, and on difficult problems, some people might regard the MLP as the most powerful classifier available.

Yet, again, gaining in flexibility means losing somewhere else :

    * By their very architecture (which, in the case of the MLP, extends that of Logistic Regression), Neural Networks contain more parameters to estimate than Logistic Regression, and require therefore larger data sets for reaching a comparable level of accuracy and/or stability (see again the bias-variance tradeoff).

    * "Training" (model fitting) may be very long and uncertain (many local minima).

    * The probability distribution of the parameters of a MLP is of course unknown, which precludes designing confidence intervals or tests on the calculated values of these parameters.

Conclusion

Discriminant Analysis has been the workhorse of classification for many years, and will remain so for many more years. It is simple, very well understood from a theoretical standpoint, and reasonably effective on most ordinary problems : solving a new classification problem should start by using Discriminant Analysis. Only when when it becomes clear that DA is not powerful enough for solving the problem at hand satisfactorily should the analyst envision moving on to more sophisticated techniques like Logistic Regression of Neural Networks.

_______________________________________________________________________________

 

 

 

 

Tutorial 1

 

Owing to the complexity of the subject, the first Tutorial is just an overview of Discriminant Analysis which expounds the basic principales without going into the mathematics.

 

OVERVIEW OF DISCRIMINANT ANALYSIS

Discriminant Factor Analysis (DFA)

Building a classifier

"Geometrical" Classification Functions

"Probabilistic" Classification Functions

Costs

Assumption on classes

Choosing the "best" set of predictors

TUTORIAL

____________________________________________

 

 

Tutorial 2

 

 In this Tutorial, we examine the "descriptive part" of Discriminant Analysis. It identifies directions in the space of observations on which the classes are particularly well separated. Just as in Principal Components Analysis, data may then be projected on planes defined by pairs of such directions for visual interpretation.

 

 

DISCRIMINANT FACTOR ANALYSIS

Which variables are useful for discriminating between classes ?

Geometric criterion

Statistical criterion

Creating Discriminant Functions

The simplest case

What when class populations are not identical ?

Discriminant Functions as Principal Components

How many Discriminant Functions ?

What when classes are not spherical ?

The Mahalanobis distance

Re-defining the "distance"

What if classes have different covariance matrices ?

The complete DFA scheme

Is the model trustworthy ?

Theory

Validation techniques

Linear or quadratic ?

Predictor selection

Why select predictors ?

Selecting predictors in practice

"Global quality" criterion, Wilks' Lambda

"F-to-enter"

"F-to-remove"

The search strategy

Categorical variables

Class indicators

Multiple Correspondance Analysis

TUTORIAL

_____________________________________________________________

 

 

Tutorial 3

 

 We now want to build a real classifier, that is a predictive model whose input will be the attributes of the observations (or "predictors"), and whose output will be :

    * Either a "hard decision", that is a class label.

    * Or, better, a probability for each class that the observation belongs to the class.

 

We'll do that with the concept of "Discriminant Functions", that we first describe in general terms.

 

 

BUILDING THE DISCRIMINANT ANALYSIS CLASSIFIER

The general concept of "Discriminant Function"

The geometric discriminant functions

The Mahalanobis distance criterion

Two classes only : Multiple Linear Regression

Numerical coding of the classes

Is the result the same as that of DA ?

A slight improvement over standard DA

MLR : a caveat

Logistic Regression

The probabilistic discriminant functions

The Bayes' theorem

Linear and quadratic discriminant functions

Class boundaries

Actually building the classifier

The parameters of the model

Performance of the model on the sample

The classification matrix

Costs

Is the model good ?

Is the model trustworthy ?

Theory

Validation techniques

Linear or quadratic ?

Predictor selection

Why select predictors ?

Selecting predictors in practice

"Global quality" criterion, Wilks' Lambda

"F-to-enter"

"F-to-remove"

The search strategy

Categorical variables

Class indicators

Multiple Correspondance Analysis

TUTORIAL

______________________________________________________

 

 

Tutorial 4

 

We give here some mathematical complements to substantiate the results of the previous Tutorials.

 

COMPLEMENTS ON DISCRIMINANT ANALYSIS

The concept of "metrics", the Mahalanobis distance

Decomposition of the total covariance matrix

The Mahalanobis distance criterion

The exact probabilistic discriminant functions

Unequal covariance matrices

Equal covariance matrices

Score

Variance of a projection on a line

Discriminant Analysis "at large"

TUTORIAL

 

_________________________________________________________

 

Related readings:

Classification

Covariance matrix

Principal Components Analysis

Inertia

Fisher's linear discriminant

Download this Glossary