Metrics

A rule that defines how the distance between two "points" (e.g., two observations) is calculated from the values of their attributes.
 

Many modeling techniques use the concept of "distance" between observations. For example:

    * Hierarchical clustering first merges the two observations with the smallest "distance" among all pairs of observations (that is, the two observations that are most similar to each other).

    * PCA projects observations onto a plane chosen so as to minimize the errors that the projection introduces in the mutual "distances" between observations.
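The first merge step of hierarchical clustering can be sketched in a few lines. This is only an illustration, not a full clustering algorithm: the data matrix `X` and the helper `closest_pair` are hypothetical, and the Euclidean distance is assumed here simply because it is the most familiar choice.

```python
from itertools import combinations

import numpy as np

# Hypothetical data: one observation per row, two numerical attributes.
X = np.array([
    [0.0, 0.0],
    [5.0, 5.0],
    [0.5, 0.2],
    [9.0, 1.0],
])

def closest_pair(X):
    """Return the indices (i, j) of the two rows of X with the smallest
    Euclidean distance (assumed metric for this sketch)."""
    pairs = combinations(range(len(X)), 2)
    return min(pairs, key=lambda p: np.linalg.norm(X[p[0]] - X[p[1]]))

# The pair that a hierarchical clustering would merge first.
first_merge = closest_pair(X)
print(first_merge)
```

With a different metric, a different pair could come out as the closest one, which is why the choice of metric matters.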

 

Imagine that all your variables are numerical (they may not all have been numerical initially, but may have been recoded as new, numerical variables). How, then, do you define the "distance" between two observations? Out of habit from school, you will probably assume that "distance" is just short for "Euclidean distance". But you may also remember that the Euclidean distance is just one of many possible ways of measuring how close (or similar) observations are to each other. In fact, the Euclidean distance is not always a good measure of the similarity (or rather, the dissimilarity) between observations when building a model of data; for example, its value depends on the units in which each variable is expressed.
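One reason the Euclidean distance can mislead is its sensitivity to units. The sketch below uses three made-up observations (height in metres, income in euros); the names `obs_euros`, `obs_thousands` and `nearest_to_x` are hypothetical and exist only for this illustration.

```python
import numpy as np

def euclidean(a, b):
    """Euclidean distance between two 1-D arrays of equal length."""
    return float(np.sqrt(np.sum((a - b) ** 2)))

# Hypothetical observations: (height in metres, income in euros).
obs_euros = {
    "x": np.array([1.60, 30000.0]),
    "y": np.array([1.90, 30100.0]),
    "z": np.array([1.61, 30200.0]),
}

def nearest_to_x(obs):
    """Name of the observation (other than x) that is closest to x."""
    others = [name for name in obs if name != "x"]
    return min(others, key=lambda name: euclidean(obs["x"], obs[name]))

# Same observations, with income expressed in thousands of euros.
scale = np.array([1.0, 0.001])
obs_thousands = {name: values * scale for name, values in obs_euros.items()}

nearest_in_euros = nearest_to_x(obs_euros)
nearest_in_thousands = nearest_to_x(obs_thousands)
print(nearest_in_euros, nearest_in_thousands)
```

With income in euros, y is the nearest neighbor of x; with income in thousands of euros, z is. Nothing about the observations has changed, only the unit of one variable.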

We can't go into the details here, but you will often see the following mentioned:

    * The Mahalanobis distance between an observation and the center of a multivariate normal distribution (see tutorial on Discriminant Analysis).

    * The Chi-square distance between rows of the table analyzed in a Correspondence Analysis.
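For readers who want something concrete, here is a minimal sketch of both distances using their usual textbook definitions. It is not taken from the tutorials referred to above; the function names and the small contingency table are hypothetical. The Mahalanobis distance is computed as the square root of (x - m)' S^-1 (x - m), and the Chi-square distance compares two row profiles, weighting each column by the inverse of its overall proportion.

```python
import numpy as np

def mahalanobis(x, mean, cov):
    """Mahalanobis distance between observation x and the center `mean`
    of a distribution with covariance matrix `cov`."""
    diff = np.asarray(x, dtype=float) - np.asarray(mean, dtype=float)
    # Solve cov @ v = diff rather than inverting cov explicitly.
    return float(np.sqrt(diff @ np.linalg.solve(cov, diff)))

def chi2_distance(table, i, k):
    """Chi-square distance between rows i and k of a contingency table."""
    table = np.asarray(table, dtype=float)
    col_mass = table.sum(axis=0) / table.sum()  # overall column proportions
    profile_i = table[i] / table[i].sum()       # row profiles
    profile_k = table[k] / table[k].sum()
    return float(np.sqrt(np.sum((profile_i - profile_k) ** 2 / col_mass)))

# Hypothetical covariance matrix with unequal variances.
cov = np.array([[9.0, 0.0], [0.0, 16.0]])
d_maha = mahalanobis([3.0, 4.0], [0.0, 0.0], cov)

# Hypothetical contingency table; rows 0 and 1 are proportional.
table = [[10, 20], [20, 40], [30, 10]]
d_same_profile = chi2_distance(table, 0, 1)
d_diff_profile = chi2_distance(table, 0, 2)
```

Two properties are worth noticing: with an identity covariance matrix the Mahalanobis distance reduces to the Euclidean distance, and two rows with proportional counts are at Chi-square distance zero, whatever their totals.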
  


 
