TRAINING COURSE "CLASSIFICATION in Data Mining"

* Should a loan be granted to this applicant?

* Which product is this customer going to buy?

* Does this new customer have a high lifetime value?

* Which of my customers are likely to leave my company for the competition?

These questions, and many others, require assigning an individual to a category (or "class"). It is even better if this assignment can be weighted by a probability, and if the attributes most relevant to the assignment can be identified. In other words, although they seem different, all these questions belong to the general problem of classification.

Data Mining has a large number of classification techniques at its disposal. They differ widely in performance and in operational characteristics. This one-day training course (see outline below) reviews the most important classification techniques available in most Data Mining software.

Outline of the course

The general problem of classification

The geometrical approach

Classification functions
Class boundaries

The probabilistic approach

Bayes' Theorem and Bayesian decision making

Direct and indirect probabilistic classification

Factorial Discriminant Analysis

The two-class example: Fisher's criterion

Generalization: the concept of discriminant direction

Discriminant projections

Connections between FDA and PCA

How to build a geometric classifier

The general idea of distance to the class barycenter

A special case: normally distributed classes

The appropriate "distance": the Mahalanobis distance

How to build a probabilistic classifier

Direct probabilistic models by regression on class indicators

Multiple Linear Regression on class indicators

Logistic Regression

Supervised Neural networks

Other direct models

K-Nearest Neighbors classification

Decision Trees

Indirect classification models

Bayes' Theorem

K-Nearest Neighbors density estimation

Kernel density estimation

"Mixture of Gaussians" density estimation

Classifier performance estimation (validation)

Confusion matrix and ROC curves

How to overestimate the true performance of a classifier

The various error criteria for normal classes

Re-sampling methods

Simple and multiple validation sets

Cross-validation, "Leave-One-Out"

Bootstrap

How to choose the independent variables

Why restrict the number of independent variables?

Stepwise techniques

Forward

Backward

Stepwise