Associations (analysis)
Let us recall that Associations Analysis is mostly used to detect and analyse cross-selling opportunities (see here).
Associations Analysis is certainly (at least, in its principles) one of the simplest models in Data Mining : its core engine does nothing but count records. Associations Analysis detects synergies between products, and does so by counting :
* Records containing product A but not product B,
* Records containing product B but not product A,
* Records containing both product A and product B.
Now, if :
* p(A) is the proportion of customers who bought A,
* p(B) is the proportion of customers who bought B,
and if we assume that there is no reason why a customer who bought A should have a higher propensity than average to also buy B (or vice versa), then we would expect the proportion of customers who bought both A and B to be p(A).p(B).
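If, upon counting, it appears that this proportion is clearly higher than p(A).p(B), then A and B sell together more often than independence would predict, which suggests a synergy between the two products. If it is clearly lower, then A and B sell together less often than expected.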
The above presentation puts products A and B on the same footing. In practice, Associations are usually meant to be rather "one way", and materialize as rules that read something like :
"15% of customers who bought A also bought B"
which is supposed to describe a causal relationship between "Buying A" and "Buying B".
As is the case with correlation, data by itself can't tell if causation (assuming that it is there) goes this way or the other way. Therefore, it may very well be worth examining the "other way" rule, that could read :
"21% of customers who bought B also bought A"
Associations Analysis runs into the same kind of problems as those commonly encountered in Correlation Analysis for numerical variables. In particular, it is well known that overlooking the effects of partial correlation can lead to completely erroneous interpretations of data. Unfortunately, taking these issues into account implies considering Associations between 3 or more products. The "wall of combinatorial explosion" usually forbids pushing the analysis further.
It is customary to associate two numbers with each rule of association :
1) The level of confirmation of the rule
It is the proportion of all records containing both A and B (keeping in mind that many records may contain neither A nor B). A high level of confirmation means that the issue of a possible association between A and B is of general interest, but it does not yet mean that a synergy between A and B has indeed been detected. The level of confirmation is symmetrical in A and B.
2) The level of confidence in the rule
It is the proportion of records containing both A and B among all those containing A (whether associated with B or not). A high level of confidence, as the name implies, means that the rule is deemed real, and not just caused by chance. The level of confidence is not symmetrical in A and B. In the above example :
"15% of customers who bought A also bought B"
the level of confidence of the rule is 15%.
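Caution: this level of confidence, by itself, does not tell whether the association is stronger or weaker than expected. A conclusion can be drawn only by comparison with what independence would predict: the level of confidence is compared with p(B) or, equivalently, the proportion of records containing both A and B is compared with the expected proportion p(A).p(B).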
Because Associations Analysis can be expressed in terms of very basic probability theory, it is to be expected that software will soon address the issue of assessing the significance of the departure of the observed level of confidence from the level expected if A and B were independent.
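The counting described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `association_stats`, the dictionary keys, and the ten toy records are all hypothetical and do not come from any particular software package.

```python
def association_stats(records, a, b):
    """Count-based statistics for the one-way rule 'a -> b'.

    `records` is a list of sets of product names (hypothetical toy format).
    """
    n = len(records)
    n_a = sum(1 for r in records if a in r)
    n_b = sum(1 for r in records if b in r)
    n_ab = sum(1 for r in records if a in r and b in r)
    if n == 0 or n_a == 0:
        raise ValueError("the rule needs at least one record containing a")
    p_a, p_b = n_a / n, n_b / n
    return {
        "confirmation": n_ab / n,      # proportion of all records with both a and b
        "confidence": n_ab / n_a,      # proportion of records with b among those with a
        "expected_both": p_a * p_b,    # p(A).p(B), expected under independence
        "expected_confidence": p_b,    # confidence expected under independence
    }


# Made-up records, for illustration only.
records = [
    {"A", "B"}, {"A"}, {"C"}, {"A", "B"}, {"C"},
    {"A", "C"}, {"B", "C"}, {"A", "B", "C"}, {"C"}, {"A"},
]

stats = association_stats(records, "A", "B")      # rule "A -> B"
reverse = association_stats(records, "B", "A")    # the "other way" rule "B -> A"
for key, value in stats.items():
    print(key, round(value, 3))
print("reverse confidence", round(reverse["confidence"], 3))
```

In this toy data the level of confirmation is the same for both rules, while the two levels of confidence differ, which is the asymmetry discussed above.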
1) The average of N real numbers {x1, x2, ..., xN} is, by definition :
$$\bar{x} = \frac{1}{N} \sum_{i=1}^{N} x_i$$
This definition is not related to probability theory or statistics.
The following interactive animation illustrates the concept of "average of a set of numbers".
* Drag points with your mouse, and observe the influence on the average.
* Add more points and observe that the average becomes less sensitive to the position of any point you drag.
* Estimate the width of the excursion of the average when you drag a point (any point) from one end of the scene to the other. Do that again with another point. How does the width of the excursion of the average change ? Can you explain that ? Now try again with any other initial configuration (keeping the number of points constant). What is your conclusion ?
* Go one step further, and write out explicitly the (very simple) formula that gives the width of this excursion.
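The dragging experiment can also be approximated numerically. The sketch below is a hypothetical stand-in for the animation: the helper `excursion_width`, the point values, and the scene limits 0 and 1 are all assumptions chosen for illustration.

```python
def average(xs):
    """Plain arithmetic average of a list of numbers."""
    return sum(xs) / len(xs)


def excursion_width(points, index, low, high):
    """Change in the average when point `index` moves from `low` to `high`."""
    at_low = points[:index] + [low] + points[index + 1:]
    at_high = points[:index] + [high] + points[index + 1:]
    return average(at_high) - average(at_low)


points = [0.2, 0.5, 0.9, 0.4]   # arbitrary initial configuration
widths = [excursion_width(points, i, 0.0, 1.0) for i in range(len(points))]
print(widths)

# Same experiment with more points: the average becomes less sensitive.
more_points = points + [0.1, 0.7, 0.3, 0.8]
print(excursion_width(more_points, 0, 0.0, 1.0))
```

Running it with different configurations and different numbers of points is one way to check a conjectured formula for the width.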
2) Sample average
Suppose now that these N numbers have been drawn from a given, but unknown, probability distribution. The question is "What is the mean "µ" of this distribution ?".
The sample average is also commonly called the "sample mean". Throughout this site, we will try to keep using "average" for a finite sample, and reserve the word "mean" for the distribution, or the complete population.
Of course, you cannot give an exact answer from a finite sample. But if you are pressed to give an estimate of this mean, you probably won't think twice before suggesting $\bar{x}$, the sample average, as your best guess. You have just used $\bar{x}$ as an estimator of the mean µ.
Your guess is indeed appropriate. But why ? Why is the sample average a good estimator of the distribution mean ?
The reason is as follows. Each new sample yields a different value for $\bar{x}$, so the sample average $\bar{x}$ is a random variable. Although the exact distribution of $\bar{x}$ is usually unknown (see below), it can be shown that the average of $\bar{x}$ over a very large number of samples (that is, its mean) is just µ, the mean of the distribution.
In more technical terms, one says that the expectation of the sample average is just the distribution mean, which we denote :
$$E(\bar{x}) = \mu$$
Note that this is true whatever the distribution (as long as it has a mean, which most common distributions do).
Any estimator with this property is called an unbiased estimator. "Unbiasedness" is of course a very valuable property for an estimator.
____________________________________________________________
The following animation illustrates the fact that $\bar{x}$, the sample average, is an unbiased estimator of the mean µ.
Upper frame
* The green rectangle is a uniform distribution.
* A sample from this distribution is shown (red ticks), together with the sample average (red line). You may change the sample size with the "Nb. Points" buttons. Each time you click on one of these buttons, another sample is drawn.
To try another distribution, click several times anywhere in the upper frame (even in the green area). The blue line hanging from the top of the frame is the mean of your distribution.
Lower axis
A blue tick, sitting on a blue block, is showing the average of the sample averages of the samples that have already been drawn. Because only one sample has been drawn so far, it is facing the red sample average line of the upper frame.
Animation
Click on "Go". Samples are repeatedly drawn from your distribution, and the position of the blue tick is constantly updated to be the average of the sample averages of the samples that have already been drawn.
After a while, the blue tick will line up with the blue line denoting the mean of your density. This will happen:
* whatever the shape of the density,
* and whatever the sample size.
The convergence is fairly slow, though, and it may take several thousand iterations to get the tick convincingly stable.
So the sample average is indeed an unbiased estimator of the mean.
You may try several sample sizes while retaining the same density. Just click on "Pause" while the animation is running, change the sample size, and click on "Go" to start a new simulation.
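The same experiment can be reproduced outside the animation with a short simulation. This is a minimal sketch, not the code behind the animation: the exponential distribution (with mean 0.5), the seed, the sample sizes, and the number of repetitions are arbitrary assumptions.

```python
import random


def average_of_sample_averages(draw, sample_size, n_samples):
    """Draw `n_samples` samples and return the average of their sample averages."""
    total = 0.0
    for _ in range(n_samples):
        sample = [draw() for _ in range(sample_size)]
        total += sum(sample) / sample_size
    return total / n_samples


rng = random.Random(0)   # fixed seed so that the run is repeatable


def draw():
    # A deliberately skewed distribution: exponential with rate 2, hence mean 0.5.
    return rng.expovariate(2.0)


# Whatever the sample size, the average of the sample averages settles near 0.5.
results = {n: average_of_sample_averages(draw, n, 20000) for n in (2, 10, 50)}
for n, value in results.items():
    print(n, round(value, 4))
```

Replacing `draw` with any other distribution that has a mean should show the same behavior, with the printed values settling near that mean.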
_________________________________________________________
The distribution of the sample average of a given distribution can, at least in principle, be calculated by referring to the general results about:
* The distribution of the sum of independent random variables (see here), that permits calculating the distribution of the sum of the abscissas of the observations,
* Followed by a simple change of scale 1/n to finally obtain the distribution of the sample average (see here).
In general, the result cannot be expressed in a simple analytic form, except in some particular cases like:
* The normal distribution (see here), whose sample average is also normally distributed,
* The Cauchy distribution (see here), whose sample average is also Cauchy distributed in a way that does not depend on the sample size.
But whenever a distribution has a mean µ and a variance s², the sample mean:
* Has µ for its expected value (unbiased estimation),
* Has a distribution with variance s²/n, where n is the sample size.
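Both properties follow from a short calculation. The derivation below is a sketch that assumes the n observations x_1, ..., x_n are independent draws from the same distribution, and writes $\bar{x}$ for their sample average; independence is what allows the variance of the sum to be written as the sum of the variances.

```latex
\begin{align*}
E(\bar{x}) &= E\Big(\frac{1}{n}\sum_{i=1}^{n} x_i\Big)
            = \frac{1}{n}\sum_{i=1}^{n} E(x_i)
            = \frac{1}{n}\, n\mu = \mu \\
\mathrm{Var}(\bar{x}) &= \frac{1}{n^2}\,\mathrm{Var}\Big(\sum_{i=1}^{n} x_i\Big)
            = \frac{1}{n^2}\sum_{i=1}^{n} \mathrm{Var}(x_i)
            = \frac{n s^2}{n^2} = \frac{s^2}{n}
\end{align*}
```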
Most common distributions have a mean and a variance. Yet, some quite respectable distributions don't. The reason is always the same: the density p(x) goes to 0 at infinity
* fast enough for its integral to be finite (equal to 1),
* but not fast enough to prevent $\int |x| \, p(x) \, dx$ from diverging (being infinite), thus preventing the distribution from having a mean.
Higher order moments then do not exist either.
The most classical example of a mean-less distribution is the rogue Cauchy distribution. Two other examples are Fisher's distributions $F_{n,2}$ and $F_{n,1}$.
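A small simulation makes the consequence of a missing mean visible: for the Cauchy distribution, averaging more observations does not tighten the sample average. The sketch below is illustrative only. The helper names, the seed, the sample sizes, and the use of the interquartile range as a measure of spread are assumptions, and standard Cauchy values are generated by the usual inverse-CDF transform of a uniform number.

```python
import math
import random


def cauchy(rng):
    """One draw from the standard Cauchy distribution (inverse-CDF method)."""
    return math.tan(math.pi * (rng.random() - 0.5))


def iqr(values):
    """Rough interquartile range, used here as a spread measure that always exists."""
    s = sorted(values)
    return s[(3 * len(s)) // 4] - s[len(s) // 4]


def spread_of_sample_averages(draw, sample_size, n_samples):
    """Spread (IQR) of the sample averages over `n_samples` repeated samples."""
    averages = [
        sum(draw() for _ in range(sample_size)) / sample_size
        for _ in range(n_samples)
    ]
    return iqr(averages)


rng = random.Random(1)   # fixed seed so that the run is repeatable
spreads = {}
for n in (10, 400):
    spreads[n] = {
        "cauchy": spread_of_sample_averages(lambda: cauchy(rng), n, 2000),
        "normal": spread_of_sample_averages(lambda: rng.gauss(0.0, 1.0), n, 2000),
    }
    print(n, spreads[n])
```

For the normal distribution the spread shrinks as the sample size grows, whereas for the Cauchy distribution it stays of the same order for both sample sizes.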
The barycenter of a set of points is just its center of gravity. Points often carry "weights", and different points carry different weights (this is the case, for example, in Correspondence Analysis).
The illustration below demonstrates the concept of barycenter.
* Drag points around with your mouse.
* To change the weight of a point, click on it, change its weight, and click on it again, or start dragging it.
Notice how heavy points influence the position and motion of the barycenter much more than light points do.
The fundamental property of the barycenter is this : the barycenter of the projections of the points on x1 is just the projection of the barycenter on the same axis. The same holds true for x2 or for any other straight line. In short, the barycenters of the projections are the projections of the barycenter.
This is a direct consequence of the linear nature of the formula that gives the coordinates of the barycenter :
$$b_1 = \frac{\sum_i w_i \, x_{i1}}{\sum_i w_i}$$
where :
* b1 is the coordinate of the barycenter on axis 1,
* wi is the weight carried by point i,
* xi1 is the coordinate of point i on axis 1.
with a similar formula for axis 2.
In other words, the coordinates of the barycenter are the weighted averages of the coordinates of the points.
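The weighted-average formula and the projection property can be checked on a small example. This is a minimal sketch: the function names, the three points, their weights, and the projection direction are hypothetical values chosen for illustration.

```python
def barycenter(points, weights):
    """Weighted average of the points, coordinate by coordinate."""
    total = sum(weights)
    dim = len(points[0])
    return tuple(
        sum(w * p[k] for p, w in zip(points, weights)) / total
        for k in range(dim)
    )


def project(point, direction):
    """Signed coordinate of `point` along the unit vector `direction`."""
    return sum(x * d for x, d in zip(point, direction))


points = [(0.0, 0.0), (4.0, 0.0), (0.0, 2.0)]
weights = [1.0, 1.0, 2.0]
b = barycenter(points, weights)

# Project onto an arbitrary straight line through the origin (unit direction).
direction = (0.6, 0.8)
projection_of_barycenter = project(b, direction)
barycenter_of_projections = barycenter(
    [(project(p, direction),) for p in points], weights
)[0]

print(b)
print(projection_of_barycenter, barycenter_of_projections)
```

The last two printed numbers agree up to rounding, which is the property stated above: the barycenter of the projections is the projection of the barycenter.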
See also Inertia.
The term "bias" has at least four different meanings :
1) Bias of data.
2) Bias of an estimator.
3) Bias of a model.
4) Bias of a neuron.
In all cases, "bias" means a systematic departure of a quantity from a reference quantity.
_______________________________________________
1) Bias of data
Data is said to be biased if it was collected in conditions that depart appreciably from normal conditions. Simple examples of biased data occur when measuring physical quantities. For example, suppose a scale has not been properly adjusted, and that it reads "1 ounce" when there is nothing on the pan. Then, all further readings will be 1 ounce higher than the actual weight of any object put in the pan. This may not be a problem if we always use the same scale, but discrepancies will show up when we try to compare measurements made with this particular scale with measurements made with another, properly adjusted scale.
Bias of data is a very common problem in Data Mining, and it is usually difficult to detect because of the many potential causes. For instance, sales figures from various stores of a retail chain may be hard to compare because of differences in :
* Store size.
* Sociodemographic environment.
* Local climate.
* Presence of stores from competing chains.
and many other possible factors. In particular, procedures used to collect perfectly sound data may be biased, thereby introducing a type of bias that is particularly pernicious and hard to detect and eradicate.
Bias of data is a nuisance because it makes models built on a certain set of data unusable on another set of similar data. Worse, if the bias is overlooked, the models may be used and provide biased results (that is, wrong results).
On the other hand, Data Mining may be used to detect and interpret biases in data from different sources. The following figure illustrates the case of two agencies of the same bank, showing different saving patterns of their clients. Building a model on agency A, and using it on data of agency B for predictive purposes would lead to wrong results. But Data Mining can detect and help interpret this behavioral difference.
____________________________________________
2) Bias of an estimator
Please see here.
3) Bias of a model
Please see here.
4) Bias of a neuron (in a Multilayer Perceptron)
Among the various parameters ("weights") of a standard neuron of a Multilayer Perceptron (MLP), one plays a particular role and is called the bias (or also the "threshold").
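As an illustration of this particular role, here is a minimal sketch of a single neuron, assuming the usual convention in which the bias is added to the weighted sum of the inputs before an activation function is applied. The function name `neuron_output` and the choice of a logistic (sigmoid) activation are assumptions made for this example, not something specified above.

```python
import math


def neuron_output(inputs, weights, bias):
    """Output of one neuron: sigmoid(weighted sum of the inputs + bias)."""
    activation = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1.0 / (1.0 + math.exp(-activation))


weights = [1.0, 1.0]

# The bias shifts the point at which the neuron "switches on":
# with bias = -2, the inputs must sum to 2 before the output reaches 0.5.
print(neuron_output([0.0, 0.0], weights, bias=0.0))
print(neuron_output([1.0, 1.0], weights, bias=-2.0))
print(neuron_output([0.0, 0.0], weights, bias=-2.0))
```

Unlike the other weights, the bias is not multiplied by any input, which is why it behaves like a threshold on the weighted sum.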
Binary (Variable)
A variable is said to be binary if it can take only two values, often coded as "0" and "1".
Typical examples of binary variables are :
* "Gender", that takes the values "Male" and "Female".
* "Owns_House", that takes the values "Yes" or "No".
A binary variable can therefore be considered as a special case (the simplest one) of a categorical variable. But it can also be considered as a special case of a numerical variable; this interpretation is often used in presentations of Logistic Regression.
See also : Numerical, Categorical, Ordinal.