Metrics
Rule that defines how the distance between
two "points" (e.g., two observations) should be calculated from the values of their attributes.
Many modeling techniques use the concept of "distance" between observations. For example :
* Hierarchical clustering first merges the two observations with the smallest "distance" among all the pairs of observations (that is, in fact, the two observations that are the most similar to each other)..
* PCA projects observations on a plane chosen so as to minimize the errors on the mutual "distances" of observations caused by the projection.
Imagine that all your variables are numerical
(they may not all have been numerical initially, but may have been coded as
new, numerical variables). Now, how do you define the "distance"
between two observations ? Out of school habits, you'll probably think that
"Distance" is just short for "Euclidian distance". But you
may also remember that Euclidan distance is just one out of many possible
ways of measuring "how close" (or similar) observations are from
each other. In fact, it turns out that the Euclidian distance is not always
a good measure of the similarity (or rather, the dissimilarity) between observations
when building a model of data.
We can't go into the details here, but you'll
often see mentioned :
* The Mahalanobis distance between an observation and the center of a multivariate normal distribution (see tutorial on Discriminant Analysis).
* The Chi-2 distance between
rows of a Correspondence Analysis.
|
Want to contribute to this site ? |