Classification  (Supervised)

Observations often come in natural groups, or "classes". For example :

 

In a data table, the class to which the observation belongs appears as a nominal variable, whose modalities are the names of the class. An observation in the table is said to be labeled by the name of the class to which it belongs. For example, this is a (grossly simplified) fragment of the historical customer data base of Consumer Loan company. Each customer is described by a series of attributes (or features, or variables). The last column tells whether the customer payed back his or her loan without incident. It is the class label column, and we have here only two classes (a quite common situation) :

 

 

Gender

Revenue

Mar.Status

Loan ($)

OK ?

M

50K

Single

5K

Y

F

45K

Married

7K

Y

M

63K

Divorced

4K

N

...

...

...

...

...

 

Later, new observations will be collected, whose class labels are missing. For example, a new applicant for a loan will fill out forms that will translate into :

 

Gender

Revenue

Mar.Status

Loan

OK ?

M

52K

Married

6K

???

 

 

Classification is the action of predicting to which class the new, unlabeled observation actually belongs or, equivalently, what is the value of the (unobserved) modality of the class variable. In the above example, classification addresses the following question : "Should we grant the loan to this new applicant, will the loan be paid back without incident ?".

The values of the observed attributes of this new observation will be plugged into a classification model, or classifier, that will deliver a prediction of the class membership of the observation. This classifier is a set of equations or logical rules built from the available labeled data (the design or learning set).

 

So a classifier is a predictive model. Like all predictive models, it serves two different purposes:

  1. Perform the actual predictions. The set of attributes retained to build the classifier (the predictors) implicitely contains the information needed to reconstruct the variable to be predicted (here, the class label). The classifier then extracts from this mass of (mostly irrelevant) information this particular fraction that pertains to class label.
  2. Shed some light on the structure of the data, and more particularly on which predictors are particularly important for a successful prediction. In the above example, the loan company would be particularly interested to identify a small set of attributes (among the hundreds of possible measurable attributes) that are particularly effective in predicting the outcome of the pay-back process.

---------------------

 

In addition to the general difficulties of Data Modeling (), Classification encounters at least two difficulties of its own:

  1. Classes usually overlap.
  2. Classes may have arbitrarily complex shapes in the space of predictors.

 

Lets get over these two points in more detail.

Class overlap and probabilistic classification

Classes (not just the available data) usually overlap. What this means is that two observations (either already labeled or not labeled yet) with identical values of their predictors might very well belong to different classes. For example, two customers may have identical measured profiles (gender, age, revenue, annual milage, budget etc...), yet one has bought car model A, while the other one has bought car model B.

 

 

 

Because of class overlap, a classifier delivering only hard verdicts ("The new observation belongs to class Ci") is prone to errors, however expertly crafted. As a matter of fact, all popular classifiers deliver a more careful verdict, expressed in terms of (estimated) probabilities. Given a new, unlabled observation x, the verdict delivered by the classifier will read :

The (estimated) probability for the new observation x to belong to class Ci is P(i), i = 1, 2, ..., n   with n the number of classes.

What this means is that, on the average, P(i).100% of the new observations with x for their attribute vector actually belong to class Ci.

So, classification appears as an intrinsically probabilistic business. The numbers P(i) are called the a posteriori probabilities, or posterior probabilities of the various classes for observation x. In a technical context, they are denoted by P{Ci | x} and read : "Probability for an observation to belong to class Ci knowing that its attributes are described by the vector x".

P{Ci | x} is called the probability of Ci conditionally to x.

Decision boundaries and discriminant functions

Decision regions and decision boundaries

Despite the uncertainty originating from the probabilistic nature of classification, the analyst will ultimately have to make a decision, and assign a unique class label to any new observation ("The loan is granted" or "The loan is denied"). This means that the prediction model assigns a class label to every point of the feature space. All points with the same class label constitute a decision region.

 

 

In this illustration, all observations above the blue line will be assigned to class C1, while all observations below this line will be assigned to class C2.

So, for the classifier to perform correctly (we'll come back to this notion later), decision boundaries should be able to take complex shapes.

The boundary between two decision  regions is a decision boundary. Its shape is determined by both :

Discriminant functions

Classifiers come in a great diversity of techniques and algorithms. Yet, the ways the decision regions and decision boundaries are determined share a common pattern :

The gi(x) are called discriminant functions. There are as many discriminant functions as there are classes, and each class has its own discriminant function. Decision boundaries are defined by the system of equations gi(x) = gj(x)  (i ≠ j).  


When there are only two classes C1 and C2,   it is customary to have only one discriminant function g(x) taylored so that :
   * x is assigned to C1 if g(x) > 0,
   * x is assigned to C2 otherwise.

This approach allows a great flexibility :

    * In the way the analytic forms of the discriminant functions gi(x) are chosen,

    * And in the way the parameters of the discriminant functions are adjusted.

 

Discriminant functions are often chosen to be linear (in x), thus generating piecewise-linear decision boundaries and convex decision regions. They may occasionally be chosen quadratic or even assume more complex analytical forms so as to make the decision boundaries more flexible.

Parameters of the discriminant functions are adjusted so as to minimize (or maximize) some criterion :

        * A heuristically defined quantity, like the criterion of Fisher's linear discriminant.

        * An additional variable may be created, that has a fixed value for all observations in one class. The discriminant function may be required to predict these numbers in a way that minimizes the quadratic error on the prediction ("Least mean squared error" procedures).

        * Logistic Regression and Neural Networks maximize the likelihood of the sample. This approach bears some similarities with the "regression on indicators" approach in that classes are recoded from "modalities" to "numbers", and some sort of regression is then performed on the classes' numerical codes.

The Bayesian Decision framework

Defining a performance criterion for a classifier relies on precisely what is expected from the classification process. This depends not on the mechanism of data modeling, but rather on the goal that the analyst is trying to reach. This goal is usually expressed in terms of minimizing some quantity related to the classification process. In this Glossary, we will address only the simplest of these goals :

Minimize the number of class assignment errors on new, unlabeled observations.

The Bayesian Decision Theory establishes that the strategy which consists in assigning any new observation to the class with the largest posterior probability is optimal : no other strategy can lead to a a lower classification error rate.

This theory also addresses a variant of this goal that is of great practical interest. A class assignment error is costly, but not all errors are equally costly : some may be rather innocuous, while others may have dramatic consequences. So the goal of the analyst is usually to minimize the average cost of misclassification, rather than just the raw misclassification rate. The Bayesian Decision theory also solves this question.

Consequently, whenever the goal is simply to minimize the number of classification errors, it is advisable to have the values of the discriminant functions gi(x) as close as possible to the true (but unknown) posterior probabilities P{Ci | x}. Once these probabilities have been estimated by the classifier, a new observation is assigned to the class with the largest estimated posterior probability.


So, in fact, a classifier does not assign observations to classes, it only calculates estimates of posterior probabilities. The classification proper is made on the basis of rules cooked up by the analyst, and that use these estimates.

Estimating posterior probabilities

So, at least in theory, the Bayesian Decision framework completely solves the problem of classification, assuming that the P(i) can be calculated, or at least estimated. There are two main approaches to estimating posterior probabilities :

Indirect methods

The first approach is to resort to Bayes' theorem, that states that the P(i)s can be estimated provided that :

    * The class prior probabilities are known (or at least can be estimated),

    * The probability density within each class can be estimated (these probability densities are refered to as the class conditional probability densities). This estimation is the goal of the set of techniques known as probability density estimation.

        - If a particular analytical form of the class conditional probability densities is postulated, then estimating the posterior probabilities amounts to estimating the values of a handful of parameters (see here). This is the approach implemented in Discriminant Analysis, which assumes that classes are all multinormal : estimating the class conditional densities amounts then to estimating the coefficients of the covariance matrices of the various classes.

        - When no simple candidate density seems justified, on may resort to non parameteric estimation of the class conditional densities, for example with k-Nearest Neighbors or Decision Trees.

These techniques are conceptually simple, easy to implement, and their results are easily interpreted. Yet, their performances as classifiers are often disappointing because of their large variance (see the "bias-variance tradeoff"). In addition, their parameters (even so-called "non parametric" techniques have parameters!) do not follow any identifiable probability distribution, and it is therefore impossible to build confidence intervals or tests around their values.

Direct methods

Calculating reasonably accurate estimates of the probabilities densities may prove a formidable task in the most general situation. But although Bayes' theorem tells us that posterior probabilities can be estimated that way, it does not say that it is the only way to do it. In fact, it can easily be shown that the outcome of a perfect regression on class indicator variables provides good estimates of the P(i)s. So many methods by-pass the conditional probability density estimation stage, and estimate posterior probabilities directly by performing some kind of regression on the class indicators instead. This is how Neural Networks proceed. Logistic Regression may also be perceived as a regression on class indicators.

 

_______________________________________________________

 

Classifications techniques were born out of necessity and out of the imagination of many practitioners in many different fields. It is common to see Classification techniques described individually without much reference to other (competing) techniques or how this particular technique fits in the more general context of Classification. As this Glossary develops, we will attempt to keep a unifying view of Classification by trying to insert techniques in the general framework outlined in this page.

____________________________________________________________

 

Related readings :

Discriminant Analysis

Decision Trees

Logistic Regression

Neural Networks

 Cluster analysis

Download this Glossary