Posterior (probabilities)

Now suppose you have some kind of model that predicts whether a company will post a profit or a loss next quarter. For instance, it may be based on a formula taking into account various accounting ratios. You plug the numbers for Acme Inc. into the model, and you get the following answer: the probability of a profit next quarter is 60%. We say that the posterior (or "a posteriori") probability that Acme Inc. will post a profit, according to this particular model, is 60%. Thus, a posterior probability is not attached just to a class, but also to an observation and a model. If the model is replaced by another one, the posterior probability will change too.

The ultimate goal of classification is to estimate posterior probabilities that are conducive to the lowest possible rate of misclassification by the model (or rather, to the lowest misclassification cost).
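As a minimal sketch, a posterior like the 60% figure above can be obtained by combining a prior with model likelihoods via Bayes' rule. The priors below match the glossary's running example, but the likelihoods of the observed accounting ratios under each class are invented for illustration:

```python
# Hypothetical sketch: a posterior probability from Bayes' rule.
# The likelihoods are made up for illustration, not taken from any real model.
prior_profit = 0.8                 # P(Profit), the prior of the "Profit" class
prior_loss = 0.2                   # P(Loss)
lik_profit = 0.3                   # assumed P(observation | Profit)
lik_loss = 0.8                     # assumed P(observation | Loss)

evidence = lik_profit * prior_profit + lik_loss * prior_loss
posterior_profit = lik_profit * prior_profit / evidence
print(round(posterior_profit, 2))  # 0.6
```

Swapping in a different model (different likelihoods) changes the posterior, exactly as described above.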

Prior (probabilities)

Most companies, say 80%, make a profit over a quarter. Say somebody asks you: "Consider company Acme Inc. Will it make a profit next quarter?" You know nothing about the company. Your best answer is "Yes, it will make a profit", because by saying so you have a 0.8 probability of being right. Within the context of classifying companies into the two classes "Profit" and "Loss", one says that the prior (or "a priori") probability of the "Profit" class is 0.8.

So the prior probability of a class is just the ratio of the population of the class to the total population. At this point, you might want to take a look at posterior probabilities.
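This ratio is straightforward to compute; a minimal sketch, with an 80/20 split mirroring the "Profit"/"Loss" example above:

```python
# Minimal sketch: a class prior is the class's share of the total population.
# The sample below is hypothetical.
labels = ["Profit"] * 80 + ["Loss"] * 20
prior_profit = labels.count("Profit") / len(labels)
print(prior_profit)  # 0.8
```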

For a classification model to reach the lowest possible misclassification rate, classes should be represented in the sample in proportion to their priors. For example, if class C1 is really twice as populous as class C2, then there should be twice as many observations labeled "C1" in the sample as observations labeled "C2". Yet, there are two situations in which the practitioner will deliberately violate this rule:

1) Small classes

Imagine a 2-class classification problem where C1 is really twenty times as populous as C2. There is a trivial classifier that performs quite well, expressed by the following rule:

"Assign any new observation to class C1"

Its misclassification rate is close to 5%, which seems quite remarkable. Yet the model is obviously useless: it has a 0% misclassification rate on C1, and a 100% misclassification rate on C2.
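The arithmetic behind that "close to 5%" figure can be checked directly; a sketch on a hypothetical sample where C1 is twenty times as populous as C2:

```python
# Sketch of the trivial classifier "assign everything to C1"
# on a hypothetical 20-to-1 sample.
true_labels = ["C1"] * 2000 + ["C2"] * 100
predictions = ["C1"] * len(true_labels)

errors = sum(t != p for t, p in zip(true_labels, predictions))
global_rate = errors / len(true_labels)
print(round(global_rate, 3))  # 0.048, i.e. close to 5%
```

Every error comes from C2: the rate on C1 is 0%, and the rate on C2 is 100%.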

Now you work up a genuine classification model to bring the global misclassification rate down. Say, after some hard work, you obtain a global 4% misclassification rate (top illustration). Chances are that the misclassification rate on the big C1 will still be very low (say 1%), while the misclassification rate on the small C2 will remain high (say 60%): minimizing a global misclassification rate favors big classes in terms of per-class misclassification percentage. Besides, a little noise in the data will strongly affect the C2 misclassification rate (while leaving the C1 misclassification rate essentially unchanged), which is then both high and unreliable.

Although there is nothing wrong with this situation from the theoretician's viewpoint, the practitioner may feel otherwise. He has no problem identifying new C1 observations, but even the best model is very ineffective at recognizing new C2 observations.

So, in practice, it is common to artificially increase the small class priors, and incorporate this change into the model building algorithm. There are several ways of doing that, and we won't elaborate on this in this glossary.

As a consequence of this voluntary bias:

1) The misclassification rate on large classes increases somewhat, but remains within acceptable values,

2) The misclassification rate of small classes decreases to the point where the model becomes effective again on these small classes.

The price to pay for this little trick is that the global misclassification rate is not quite as low as it could be, but at least the model is now effective at discriminating between all classes, irrespective of their priors (bottom illustration).
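One way to picture the effect of boosting a small-class prior is a simple Bayes decision rule on two 1-D Gaussian classes. All parameters below (class means, the test point) are illustrative, not taken from the text:

```python
import math

def gauss_pdf(x, mu, sigma):
    # Density of a normal distribution with mean mu and standard deviation sigma
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def classify(x, prior_c2):
    # Pick the class with the larger prior-weighted likelihood
    score_c1 = (1 - prior_c2) * gauss_pdf(x, 0.0, 1.0)  # big class C1
    score_c2 = prior_c2 * gauss_pdf(x, 2.0, 1.0)        # small class C2
    return "C2" if score_c2 > score_c1 else "C1"

x = 1.5  # an observation lying between the two class means
print(classify(x, prior_c2=1 / 21))  # C1: under the true prior, C1 wins here
print(classify(x, prior_c2=0.5))     # C2: the boosted prior lets C2 claim the point
```

Boosting the C2 prior moves the decision boundary toward C1: some C1 observations are now misclassified, but C2 observations finally get recognized.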

2) Misclassification costs

Why bother identifying observations belonging to small classes anyway? Because missing a "small class" observation may have disastrous consequences. The traditional example is that of a serious health condition. Fortunately, at any given time, only relatively few people are walking around with, say, an undetected lung cancer. Yet the central preoccupation of a radiologist scrutinizing many lung X-rays is to miss as few genuine cancer images as possible. Erroneously assigning a "Healthy" image to the "Cancer" class is of no great consequence, while the reverse is some sort of death sentence.

There are many business issues that are less dramatic, but that still require taking misclassification costs into account. One easy way of doing that is to artificially increase the small class priors in proportions dictated by the various misclassification costs.
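The radiologist's reasoning can be sketched as choosing the class that minimizes expected cost. The posterior and the cost figures below are hypothetical:

```python
# Hedged sketch: deciding by minimizing expected misclassification cost.
# All numbers are invented for illustration.
posterior_cancer = 0.1     # assumed model output for one X-ray image
cost_false_alarm = 1.0     # cost of calling a healthy image "Cancer"
cost_miss = 100.0          # cost of calling a cancer image "Healthy"

expected_cost_cancer = (1 - posterior_cancer) * cost_false_alarm
expected_cost_healthy = posterior_cancer * cost_miss
decision = "Cancer" if expected_cost_cancer < expected_cost_healthy else "Healthy"
print(decision)  # Cancer: even a 10% posterior triggers the alarm at these costs
```

With symmetric costs the same posterior would yield "Healthy"; the asymmetry is what makes the rare class worth flagging.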

Probability density function

Let X be a random numerical variable. For any value x0, the probability for a new observation to fall between x0 and x0 + dx is proportional to dx, and can therefore be written:

P(x0 < X < x0 + dx) = p(x0)·dx

By definition, p(x) is the probability density function (or pdf) of the variable.

Any pdf:

* is non-negative for every value x0,

* has an integral from −∞ to +∞ equal to 1 (the area under the curve).

Note that there is no upper limit to the values a pdf can take, as long as its integral is 1. A pdf may even be "infinite" at some points, as is the case, for instance, of the χ² distribution with 1 degree of freedom, or more generally of the Gamma distribution with α < 1.
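The unit-area property can be checked numerically; a sketch using the standard normal density, with the interval [-10, 10] standing in for the full real line:

```python
# Sketch: numerically checking that a pdf has unit area (standard normal).
import math

def normal_pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2 * math.pi)

# Midpoint rule on [-10, 10]; the tails beyond are negligible
n, a, b = 100_000, -10.0, 10.0
dx = (b - a) / n
area = sum(normal_pdf(a + (i + 0.5) * dx) for i in range(n)) * dx
print(round(area, 6))  # 1.0
```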

The pdf is closely related to the distribution function:

* The value of the distribution function in x0 is the integral (area under the curve) of the pdf from −∞ to x0 (top illustration).

* Conversely, the pdf is the derivative (slope) of the distribution function, wherever this derivative is defined (bottom illustration).
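The derivative relationship can be verified numerically; a sketch using the exponential distribution with rate 1, whose distribution function is 1 − exp(−x) for x ≥ 0:

```python
# Sketch: the pdf as the slope of the distribution function
# (exponential distribution with rate 1).
import math

def cdf(x):
    return 1.0 - math.exp(-x)

def pdf(x):
    return math.exp(-x)

x0, h = 1.0, 1e-6
slope = (cdf(x0 + h) - cdf(x0 - h)) / (2 * h)  # central finite difference
print(abs(slope - pdf(x0)) < 1e-9)             # True: the slope matches the pdf
```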

The next figure is another way of illustrating the relationship between the Probability Density Function and the Distribution function.

The "Book of Animations" on your computer

Click anywhere in the "Density" frame (including the green and yellow areas). You'll quickly learn to "carve out" a tailor-made probability density function.

Use the slider to study the relationship between the Density function and the Distribution function. For example, build a density function with two humps separated by a depression, and observe that the distribution function is nearly constant (derivative is 0) in the region where the density is low.

Probability mass function

For discrete random variables, the probability mass function is the equivalent of the probability density function for continuous variables. To each value xi that the discrete variable X can take is attached the probability P{X = xi}. Similarly to the continuous case:

P{A ≤ X ≤ B} = Σi P{X = xi}

the summation being over all indices i such that A ≤ xi ≤ B.

Whereas a probability density can take any positive value, a probability mass function can only take values less than or equal to 1, since its values are genuine probabilities.

We obviously have:

Σi P{X = xi} = 1
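This normalization can be checked on a concrete distribution; a sketch using the binomial distribution with n = 10 trials and success probability p = 0.3:

```python
# Sketch: a probability mass function sums to 1 (binomial, n=10, p=0.3).
from math import comb

n, p = 10, 0.3
pmf = [comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)]
print(round(sum(pmf), 10))  # 1.0
```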

________________

Note: distribution functions do not distinguish between discrete and continuous variables, and are defined the same way in both cases. The Central Limit Theorem is expressed in terms of distribution functions, not in terms of probability densities or probability mass functions. This is why it can be applied to continuous and discrete variables alike (see for example the case of the Binomial Distribution).

__________________________________________________