A sample is a fraction of a population (of observations) with three characteristic features:
1) It is much smaller than the whole population it is extracted from (which is usually considered essentially infinite).
2) Yet, it is hoped that its distribution is a small-scale but fair representation of the distribution of the whole population.
3) You have it in your database.
The sample is the data that will fuel model construction.
Hopefully, the information extracted from the sample by the model will apply
just as well to the population at large.
In practice, things are a bit more complicated than that.
1) Raw data is rarely usable without a considerable amount of auditing and preprocessing (deletion of observations or variables with too much missing information, elimination of outliers, variable selection and recoding, etc.).
2) Careful attention must be paid to the issue of sample bias: "Is the sample really representative of the population that we wish to model?" Possible causes of bias are numerous, hard to detect, and often impossible to correct.
3) Because the sample is not the population, but only a small fraction of it, any conclusion drawn from a model built on the sample is corrupted with uncertainty. The smaller the sample, the greater the uncertainty. One of the central issues of practical Data Mining is to assess this level of uncertainty, so that the ensuing decision making process can be weighted by a known risk level. Evaluating the uncertainties that spoil a model is called validating the model.
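As an illustration of assessing this uncertainty, one common technique (not specific to this glossary) is the bootstrap: resampling the sample with replacement many times to gauge how much a statistic would vary. The Python sketch below uses made-up numbers and a hypothetical helper `bootstrap_ci`:

```python
import random
import statistics

random.seed(0)

# A small, made-up sample drawn from some unknown population.
sample = [5.1, 4.8, 6.2, 5.5, 4.9, 5.7, 6.0, 5.3, 5.8, 4.6]

def bootstrap_ci(data, stat=statistics.mean, n_boot=5000, alpha=0.05):
    """Percentile-bootstrap confidence interval for a sample statistic."""
    boots = sorted(
        stat([random.choice(data) for _ in data])  # resample with replacement
        for _ in range(n_boot)
    )
    lo = boots[int((alpha / 2) * n_boot)]
    hi = boots[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

low, high = bootstrap_ci(sample)
print(f"mean = {statistics.mean(sample):.2f}, 95% CI = [{low:.2f}, {high:.2f}]")
```

The width of the interval is exactly the "known risk level" mentioned above: a larger sample would yield a narrower interval.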
In Statistics, the term "Sample" has a somewhat more restricted meaning. It is still a group of observations drawn from a population, but usually only one variable is considered. For example, if this variable is numerical, then an "n-observation sample" is just a collection of n numbers.
The sample is usually the only information about the underlying distribution at the statistician's disposal. The sample will then "feed" the tests to which this distribution will be submitted.
Tests often take several samples into account, for example when one needs to decide whether they were drawn from identical or from different distributions. It is then necessary to make a distinction between "Independent samples" and "Matched samples". These important concepts are detailed here.
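The practical consequence of that distinction can be illustrated numerically: with matched samples, the analysis rests on the within-pair differences, whose variability is usually far smaller than what one gets by treating the two groups as independent. A small Python sketch with made-up before/after measurements on the same six subjects:

```python
import math
import statistics

# Hypothetical matched data: the same six subjects measured twice,
# e.g. before and after a treatment (values are made up).
before = [12.0, 15.5, 11.2, 14.8, 13.1, 16.0]
after  = [11.1, 14.2, 10.8, 13.5, 12.0, 15.1]
n = len(before)

# Matched-samples analysis: work on the within-pair differences.
diffs = [b - a for b, a in zip(before, after)]
se_matched = statistics.stdev(diffs) / math.sqrt(n)

# If the samples were (wrongly) treated as independent, the standard error
# of the difference of means would be much larger, because the strong
# subject-to-subject correlation is ignored.
se_indep = math.sqrt(statistics.variance(before) / n
                     + statistics.variance(after) / n)

print(f"SE, matched analysis:     {se_matched:.3f}")
print(f"SE, independent analysis: {se_indep:.3f}")
```

Using the wrong analysis for the sampling scheme at hand therefore wastes (or overstates) information, which is why the two cases call for different tests.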
Another name for "cluster analysis", "clustering", or "unsupervised classification".
Separability (of classes)
1) Two classes C1 and C2 are said to be separable if they do not overlap. In such a situation, there can be a clear-cut boundary between the two classes, with all the points of C1 on one side of the boundary and all the points of C2 on the other side. This very academic notion would allow the construction of a perfect, non-probabilistic classifier, but it is hardly ever met in practice (see "Classification").
2) Several classes are linearly separable if each class can be separated from all the other classes by a linear boundary (a line in the plane, a plane in space, etc.). This situation is even less realistic than simple separability.
Then why worry about separability if the concept is so unrealistic? Because some weaker form of separability may occur in practice. It may be that one of the classes has a linear (or nearly linear) best boundary between itself and the set of the other classes. We say "best" instead of "perfect" to stay realistic. Call C1 the class that is nearly linearly separated from all the others. Classification may then be accomplished in two steps:
1) First, build a linear classifier that separates C1 from the rest of the classes. This classifier will tell you whether an observation belongs to C1 or not.
2) Then build a second classifier that discriminates among the classes other than C1. If the observation does not belong to C1, this second classifier will tell you which class it belongs to.
Why bother with two classifiers instead of one? For two reasons:
1) The first classifier is linear (in the variables), and therefore simple and insightful.
2) Suppose there were N classes to start with. The second classifier now has to discriminate between only N - 1 classes, a much simpler task than discriminating between N classes.
Note also that the same approach may be envisioned for the remaining N-1 classes.
This "Divide and conquer" strategy may be attempted even if no class is linearly separable from the others. As a matter of fact, quite a few approaches replace one large classification problem with several smaller ones. With a bit of mathematics, it is even possible to keep the global classifier probabilistic, and even to obtain better results than with a single classifier.
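The two-step scheme described above can be sketched in Python with made-up data and deliberately simple stand-ins for the two classifiers: a hand-fixed linear rule for step 1 (in practice it would be fitted, e.g. by a perceptron or logistic regression), and a nearest-centroid rule for step 2:

```python
# Hypothetical training data: class name -> list of 2-D points.
data = {
    "C1": [(8, 8), (9, 7), (8, 9)],   # C1 sits well apart from the others
    "C2": [(0, 0), (1, 0), (0, 1)],
    "C3": [(0, 4), (1, 4), (0, 5)],
}

# Step 1: a linear boundary x + y = 10 separates C1 from the rest.
def is_C1(x):
    return x[0] + x[1] > 10

# Step 2: nearest-centroid rule among the remaining classes.
def centroid(pts):
    return tuple(sum(p[i] for p in pts) / len(pts)
                 for i in range(len(pts[0])))

rest_centroids = {c: centroid(pts) for c, pts in data.items() if c != "C1"}

def classify(x):
    if is_C1(x):                      # first classifier: C1 or not C1?
        return "C1"
    # second classifier: which of the remaining N-1 classes?
    return min(rest_centroids,
               key=lambda c: sum((a - b) ** 2
                                 for a, b in zip(x, rest_centroids[c])))

print(classify((9, 8)))   # C1
print(classify((1, 1)))   # C2
print(classify((0, 4)))   # C3
```

Applying the same idea recursively to the remaining classes yields a cascade of small classifiers in place of one large one.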