Classification
Observations often come in natural groups, or "classes". For example :
In a data table, the class to which the observation belongs appears as a nominal variable, whose modalities are the names of the class. An observation in the table is said to be labeled by the name of the class to which it belongs. For example, this is a (grossly simplified) fragment of the historical customer data base of Consumer Loan company. Each customer is described by a series of attributes (or features, or variables). The last column tells whether the customer payed back his or her loan without incident. It is the class label column, and we have here only two classes (a quite common situation) :
|
Gender |
Revenue |
Mar.Status |
Loan ($) |
OK ? |
|
M |
50K |
Single |
5K |
Y |
|
F |
45K |
Married |
7K |
Y |
|
M |
63K |
Divorced |
4K |
N |
|
... |
... |
... |
... |
... |
Later, new observations will be collected, whose class labels are missing. For example, a new applicant for a loan will fill out forms that will translate into :
|
Gender |
Revenue |
Mar.Status |
Loan |
OK ? |
|
M |
52K |
Married |
6K |
??? |
Classification is the action of predicting to which class the new, unlabeled observation actually belongs or, equivalently, what is the value of the (unobserved) modality of the class variable. In the above example, classification addresses the following question : "Should we grant the loan to this new applicant, will the loan be paid back without incident ?".
The values of the observed attributes of this new observation will be plugged into a classification model, or classifier, that will deliver a prediction of the class membership of the observation. This classifier is a set of equations or logical rules built from the available labeled data (the design or learning set).
So a classifier is a predictive model. Like all predictive models, it plays two roles :
---------------------
In addition to the general difficulties of Data Modeling
(
), Classification encounters at least two difficulties of its own:
Lets get over these two points in more details.
Classes (not just the available data) usually overlap. What this means is that two observations (either already labeled or not labeled yet) with identical values of their predictors might very well belong to different classes. For example, two customers may have identical measured profiles (gender, age, revenue, annual milage etc...), but one has bought car model A, while the other one has bought car model B.

Because of class overlap, a classifier delivering only hard verdicts ("New observation belongs to class Ci") is prone to errors, however expertly crafted. As a matter of fact, all popular classifiers deliver a more careful verdict, expressed in terms of (estimated) probabilities. Given a new, unlabled observation x, the verdict delivered by the classifier will read :
* The (estimated) probability for the new observation x to belong to class Ci is P(i), i = 1, 2, ..., n with n the number of classes.
So, classification appears as an intrinsically probabilistic business. The numbers P(i) are called the a posteriori, or posterior probabilities of the various classes for observation x. In a technical context, they are denoted by P{Ci , x} and read : "Probability for an observation to belong to class Ci knowing that its attributes are described by the vector x".
Despite the uncertainty originating from the probabilistic nature of classification, the analyst will ultimately have to make a decision, and assign a unique class label to any new observation. This means that a class label is attached to any point of the feature space. All points with the same label constitute a decision region. The boundary between two decision regions is a decision boundary. Its shape is determined by both :

In this illustration, all observations above the blue line will be assigned to class C1, while all observations below this line will be assigned to class C2.
So, for the classifier to perform correctly (we'll come back to this notion later), decision boundaries should be able to take on complex shapes.
---------------------
Once the posterior probabilities P{Ci , x} are estimated by the classifier, it is a quite natural move to assign the new observation to the class with the largest probability.
So, in fact, a classifier does not assign observations to
classes, it only calculates posterior probabilities. The assignment proper is
made on the basis of rules cooked up by the analyst, and that use these
probabilities.
This move is justified by the following result.
Defining a performance criterion
for a classifier relies on precisely what is expected from the classification
process. This depends not on the mechanics of data modeling, but rather on
the goal that the analyst is trying to reach. This goal is usually expressed
in terms of minimizing some quantity related to the classification process.
In this Glossary, we will address only the simplest of these goals :
The Bayesian Decision theory establishes that the strategy that assigns a new observation to the class with the largest posterior probability is optimal, in that it generates a smaller number of classification errors than any other strategy.
This theory also addresses a variant of this goal that is of great practical interest. A class assignment error is costly, but not all errors are equally costly : some may be rather innocuous, while others may have dramatic consequences. So the goal of the analyst is usually to minimize the average cost of misclassification, rather than just the raw misclassification rate. The Bayesian Decision theory also solves this question.
So, in theory, the Bayesian Decision framework completely solves the problem of classification, assuming that the P(i) can be calculated, or at least estimated. There are two main approaches to estimating posterior probabilities :
The first approach is to resort to Bayes' theorem, that states that the P(i)s can be estimated provided that :
Calculating reasonably accurate estimates of the probabilities densities may prove a formidable task in the most general situation. But although Bayes' theorem tells us that posterior probabilities can be estimated that way, it does not say that it is the only way to do so. In fact, it can easily be shown that the outcome of a perfect regression on class indicator variables provides good estimates of the P(i)s. So many methods by-pass the conditional probability density estimation stage, and estimate posterior probabilities directly by performing some kind of regression on the class indicators instead. This is how Neural Networks proceed. Logistic Regression may also be perceived as a regression on class indicators.
We are now convinced that estimating posterior probabilities P{Ci , x}is a sound strategy for most classification tasks. Yet, even this limited goal may prove very difficult to attain in practice, and the pure Bayesian Decision framework may be somewhat relaxed to make problems more tractable.
The simplest way of doing that is to consider estimation of the P(i)s as the ultimate solution, but be satisfied with the following more modest goal :
If properly designed, the gi(x) will
not be too different from true posterior probabilities P{Ci
, x}, and our classifier will
still deliver satisfactory results. The gi(.)s are called discriminant functions.
There are as many discriminant functions as there are classes, as each class
has its own discriminant function. Decision boundaries are defined by all the
equations gi(x) = gj(x) (i
j).
When there are only two classes C1
and C2, it is customary to have only one
discriminant function g(x) taylored so that :
*
x is assigned to C1 if g(x) >
0,
* x is assigned to C2
otherwise.
The advantage of this approach is that there is now a great flexibility for choosing :
so as to lead to tractable calculations and a more straightforwardly interpretable model. The price to pay for this improvement is usually only a quite tolerable loss in classification accuracy.
Discriminant functions are often chosen to be linear (in x), thus generating piecewise-linear decision boundaries and convex decision regions. They may occasionally be chosen quadratic or even assume more complex analytical forms so as to make the decision boundaries more flexible.
Discriminant functions come with parameters whose values are adjusted on the data so as to minimize some criterion :
Trading accuracy for simplicity and interpretability is pushed to the extreme in Decision Trees. There, discriminant functions are built recursively one variable at a time. As a result, Decision Trees, despite their relative lack of accuracy, are extremely popular because of their speed, simplicity and easy interpretation.
_______________________________________________________
Classifications techniques were born out of necessity and out of the imagination of many practitioners in many different fields. It is common to see Classification techniques described individually without much reference to other (competing) techniques or how this particular technique fits in the more general context of Classification. As this Glossary develops, we will attempt to keep a unifying view of Classification by trying to insert techniques in the general framework outlined in this page.
____________________________________________________________
Related readings :
|
Want to contribute to this site ? |