TRAINING COURSE "CLASSIFICATION"
* Should a loan be granted to this applicant?
* Which product is this customer going to buy?
* Does this new customer have a high lifetime value?
* Which of my customers are likely to leave my company for the competition?
These questions, and many others, require assigning an individual to a category (or "class"). It is even better if this assignment can be weighted by a probability, and if the attributes most relevant to the assignment can be identified. In other words, although they seem different, all these questions belong to the general problem of classification.
Data Mining has a large number of classification techniques at its disposal. They differ widely in their performance and their operational characteristics. This one-day training course (see outline below) reviews the most important classification techniques available in most Data Mining software.
Outline of the course
The general problem of classification
The geometrical approach
Classification functions. Class boundaries
The probabilistic approach
Bayes' Theorem and Bayesian decision making
Direct and indirect probabilistic classification
Factorial Discriminant Analysis
The two-class example: Fisher's criterion
Generalization: the concept of discriminant direction
Discriminant projections
Connections between FDA and PCA
How to build a geometric classifier
The general idea of distance to the class barycenter
A special case: normally distributed classes
The appropriate "distance": the Mahalanobis distance
Linear and quadratic classification rules
How to build a probabilistic classifier
Direct probabilistic models by regression on class indicators
Linear Multiple Regression on class indicators
Logistic Regression
Supervised Neural networks
Other direct models
K-Nearest Neighbors classification
Decision Trees
Indirect classification models
Bayes' Theorem
K Nearest Neighbors density estimation
Kernel density estimation
"Mixture of Gaussians" density estimation
Classifier performance estimation (validation)
Confusion matrix and ROC curves
How to overestimate the true performance of a classifier
The various error criteria for normal classes
Re-sampling methods
Simple and multiple validation sets
Cross validation, "Leave-One-Out"
Bootstrap
How to choose the independent variables
Why restrict the number of independent variables?
Stepwise techniques
Forward
Backward
Stepwise