Classification

Observations often come in natural groups, or "classes". For example :

 

In a data table, the class to which the observation belongs appears as a nominal variable, whose modalities are the names of the class. An observation in the table is said to be labeled by the name of the class to which it belongs. For example, this is a (grossly simplified) fragment of the historical customer data base of Consumer Loan company. Each customer is described by a series of attributes (or features, or variables). The last column tells whether the customer payed back his or her loan without incident. It is the class label column, and we have here only two classes (a quite common situation) :

 

 

Gender

Revenue

Mar.Status

Loan ($)

OK ?

M

50K

Single

5K

Y

F

45K

Married

7K

Y

M

63K

Divorced

4K

N

...

...

...

...

...

 

Later, new observations will be collected, whose class labels are missing. For example, a new applicant for a loan will fill out forms that will translate into :

 

Gender

Revenue

Mar.Status

Loan

OK ?

M

52K

Married

6K

???

 

 

 

Classification is the action of predicting to which class the new, unlabeled observation actually belongs or, equivalently, what is the value of the (unobserved) modality of the class variable. In the above example, classification addresses the following question : "Should we grant the loan to this new applicant, will the loan be paid back without incident ?".

The values of the observed attributes of this new observation will be plugged into a classification model, or classifier, that will deliver a prediction of the class membership of the observation. This classifier is a set of equations or logical rules built from the available labeled data (the design or learning set).

 

So a classifier is a predictive model. Like all predictive models, it plays two roles :

  1. Perform the actual prediction. For this purpose, it considers that the set of attributes retained to build the classifier (the predictors) implicitely contains the information needed to reconstruct the variable to be predicted (here, the class label). It then extracts from this mass of (mostly irrelevant) information this particular fraction that pertains to class label.
  2. Shed some light on the structure of the data, and more particularly to which predictors are particularly important for a successful prediction. In the above example, the loan company would be particularly interested to identify a small set of attributes (among the hundreds of possible measurable attributes) that are particularly effective in predicting the outcome of the pay-back process.

---------------------

 

In addition to the general difficulties of Data Modeling (), Classification encounters at least two difficulties of its own:

  1. Classes usually overlap.
  2. Classes may have arbitrarily complex shapes in the space of predictors.

 

Lets get over these two points in more details.

 

Class overlap

        Classes (not just the available data) usually overlap. What this means is that two observations (either already labeled or not labeled yet) with identical values of their predictors might very well belong to different classes. For example, two customers may have identical measured profiles (gender, age, revenue, annual milage etc...), but one has bought car model A, while the other one has bought car model B.

 

 

 

Because of class overlap, a classifier delivering only hard verdicts ("New observation belongs to class Ci") is prone to errors, however expertly crafted. As a matter of fact, all popular classifiers deliver a more careful verdict, expressed in terms of (estimated) probabilities. Given a new, unlabled observation x, the verdict delivered by the classifier will read :

    * The (estimated) probability for the new observation x to belong to class Ci is P(i), i = 1, 2, ..., n   with n the number of classes.

 

So, classification appears as an intrinsically probabilistic business. The numbers P(i) are called the a posteriori, or posterior probabilities of the various classes for observation x. In a technical context, they are denoted by P{Ci , x} and read : "Probability for an observation to belong to class Ci knowing that its attributes are described by the vector x".

Decision boundaries

        Despite the uncertainty originating from the probabilistic nature of classification, the analyst will ultimately have to make a decision, and assign a unique class label to any new observation. This means that a class label is attached to any point of the feature space. All points with the same label constitute a decision region. The boundary between two decision  regions is a decision boundary. Its shape is determined by both :

 

 

In this illustration, all observations above the blue line will be assigned to class C1, while all observations below this line will be assigned to class C2.

So, for the classifier to perform correctly (we'll come back to this notion later), decision boundaries should be able to take on complex shapes.

---------------------

The Bayesian Decision framework

        Once the posterior probabilities P{Ci , x} are estimated by the classifier, it is a quite natural move to assign the new observation to the class with the largest probability.


So, in fact, a classifier does not assign observations to classes, it only calculates posterior probabilities. The assignment proper is made on the basis of rules cooked up by the analyst, and that use these probabilities.

This move is justified by the following result.
Defining a performance criterion for a classifier relies on precisely what is expected from the classification process. This depends not on the mechanics of data modeling, but rather on the goal that the analyst is trying to reach. This goal is usually expressed in terms of minimizing some quantity related to the classification process. In this Glossary, we will address only the simplest of these goals :

 

The Bayesian Decision theory establishes that the strategy that assigns a new observation to the class with the largest posterior probability is optimal, in that it generates a smaller number of classification errors than any other strategy.

This theory also addresses a variant of this goal that is of great practical interest. A class assignment error is costly, but not all errors are equally costly : some may be rather innocuous, while others may have dramatic consequences. So the goal of the analyst is usually to minimize the average cost of misclassification, rather than just the raw misclassification rate. The Bayesian Decision theory also solves this question.

Estimating the posterior probabilities

        So, in theory, the Bayesian Decision framework completely solves the problem of classification, assuming that the P(i) can be calculated, or at least estimated. There are two main approaches to estimating posterior probabilities :

Indirect methods

            The first approach is to resort to Bayes' theorem, that states that the P(i)s can be estimated provided that :

Direct methods

            Calculating reasonably accurate estimates of the probabilities densities may prove a formidable task in the most general situation. But although Bayes' theorem tells us that posterior probabilities can be estimated that way, it does not say that it is the only way to do so. In fact, it can easily be shown that the outcome of a perfect regression on class indicator variables provides good estimates of the P(i)s. So many methods by-pass the conditional probability density estimation stage, and estimate posterior probabilities directly by performing some kind of regression on the class indicators instead. This is how Neural Networks proceed. Logistic Regression may also be perceived as a regression on class indicators.

Discriminant functions

        We are now convinced that estimating posterior probabilities P{Ci , x}is a sound strategy for most classification tasks. Yet, even this limited goal may prove very difficult to attain in practice, and the pure Bayesian Decision framework may be somewhat relaxed to make problems more tractable.

The simplest way of doing that is to consider estimation of the P(i)s as the ultimate solution, but be satisfied with the following more modest goal :

 If properly designed, the gi(x) will not be too different from true posterior probabilities P{Ci , x}, and our classifier will still deliver satisfactory results. The gi(.)s are called discriminant functions. There are as many discriminant functions as there are classes, as each class has its own discriminant function. Decision boundaries are defined by all the equations   gi(x) = gj(x)  (i  j).  


When there are only two classes C1 and C2,   it is customary to have only one discriminant function g(x) taylored so that :
   * x is assigned to C1 if g(x) > 0,
   * x is assigned to C2 otherwise.

The advantage of this approach is that there is now a great flexibility for choosing :

 

so as to lead to tractable calculations and a more straightforwardly interpretable model. The price to pay for this improvement is usually only a quite tolerable loss in classification accuracy.

 

Discriminant functions are often chosen to be linear (in x), thus generating piecewise-linear decision boundaries and convex decision regions. They may occasionally be chosen quadratic or even assume more complex analytical forms so as to make the decision boundaries more flexible.

 

Discriminant functions come with parameters whose values are adjusted on the data so as to minimize some criterion :

 

Trading accuracy for simplicity and interpretability is pushed to the extreme in Decision Trees. There, discriminant functions are built recursively one variable at a time. As a result, Decision Trees, despite their relative lack of accuracy, are extremely popular because of their speed, simplicity and easy interpretation.

_______________________________________________________

 

Classifications techniques were born out of necessity and out of the imagination of many practitioners in many different fields. It is common to see Classification techniques described individually without much reference to other (competing) techniques or how this particular technique fits in the more general context of Classification. As this Glossary develops, we will attempt to keep a unifying view of Classification by trying to insert techniques in the general framework outlined in this page.

____________________________________________________________

 

Related readings :

 Cluster analysis

Download this Glossary

 

Want to contribute to this site ?