Anomaly Detection Algorithms
Anomaly detection is a class of semi-supervised (close to unsupervised) learning algorithms widely used in manufacturing, data centres and fraud detection. It is normally used when we have an imbalanced classification problem with, say, about 20 anomalous examples (y=1) and 10,000 normal examples (y=0). An example would be identifying faulty aircraft engines based on a large number of parameters, where anomalous data might not be available or, if it is available, makes up less than 0.1% of the examples.
Algorithm:
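Suppose there are m training examples $\{x^{(1)}, x^{(2)}, \dots, x^{(m)}\}$
Problem Statement: Is $x_{test}$ anomalous?
Approach:
Suppose $x_1, x_2, \dots, x_n$ are the features of the training examples
Model p(x) from the data; $p(x) = p(x_1; \mu_1, \sigma_1^2)\, p(x_2; \mu_2, \sigma_2^2) \cdots p(x_n; \mu_n, \sigma_n^2)$,
or
$p(x) = \prod_{j=1}^{n} p(x_j; \mu_j, \sigma_j^2)$
Identify unusual/anomalous examples by checking if $p(x) < \epsilon$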
Gaussian Distribution
The model above assumes that each feature $x_j$ follows a Gaussian distribution with mean $\mu_j$ and variance $\sigma_j^2$. If a feature is distributed differently, apply a transformation (for example, a logarithm or a power) to bring it closer to a normal distribution.
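As a minimal sketch of such a transformation, the snippet below applies a logarithm to a right-skewed feature and compares the sample skewness before and after. The feature values, the `skewness` helper, and the small offset added before taking the log are all assumptions made for illustration.

```python
import math
import random

def skewness(values):
    """Sample skewness (third standardized moment); close to 0 for symmetric data."""
    m = len(values)
    mean = sum(values) / m
    var = sum((v - mean) ** 2 for v in values) / m
    return sum((v - mean) ** 3 for v in values) / m / var ** 1.5

random.seed(0)
# Hypothetical right-skewed feature drawn from a log-normal distribution.
raw = [random.lognormvariate(0.0, 1.0) for _ in range(5000)]
# log(x + c) with a small constant c; the value of c is an assumption.
transformed = [math.log(v + 1e-9) for v in raw]

print(f"skewness before: {skewness(raw):.2f}")
print(f"skewness after:  {skewness(transformed):.2f}")
```

Plotting a histogram of each feature before and after the transformation is a simple way to check whether the Gaussian assumption is reasonable.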
Normal Distribution
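$x \sim \mathcal{N}(\mu, \sigma^2)$
$p(x; \mu, \sigma^2) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$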
Probability Density Function
Cumulative Distribution Function
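For reference, both functions can be evaluated with the standard library alone. This is a sketch: the names `gaussian_pdf` and `gaussian_cdf` are illustrative, and the second parameter is taken to be the variance $\sigma^2$ rather than the standard deviation.

```python
import math

def gaussian_pdf(x, mu, sigma2):
    """Density of N(mu, sigma2) at x, where sigma2 is the variance."""
    return math.exp(-((x - mu) ** 2) / (2.0 * sigma2)) / math.sqrt(2.0 * math.pi * sigma2)

def gaussian_cdf(x, mu, sigma2):
    """P(X <= x) for X ~ N(mu, sigma2), written in terms of the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / math.sqrt(2.0 * sigma2)))

print(gaussian_pdf(0.0, 0.0, 1.0))  # peak of the standard normal density
print(gaussian_cdf(0.0, 0.0, 1.0))  # half of the mass lies below the mean
```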
Parameter Estimation:
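$\mu_j = \frac{1}{m} \sum_{i=1}^{m} x_j^{(i)}$
$\sigma_j^2 = \frac{1}{m} \sum_{i=1}^{m} \left(x_j^{(i)} - \mu_j\right)^2$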
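A minimal sketch of the estimation step: compute the mean of each feature over the m training examples, then the variance using the 1/m (maximum-likelihood) normaliser. The function name `fit_gaussian_params` and the toy data are assumptions for illustration.

```python
def fit_gaussian_params(X):
    """X is a list of m examples, each a list of n feature values.

    Returns (mu, sigma2): per-feature means and per-feature variances (1/m form).
    """
    m = len(X)
    n = len(X[0])
    mu = [sum(row[j] for row in X) / m for j in range(n)]
    sigma2 = [sum((row[j] - mu[j]) ** 2 for row in X) / m for j in range(n)]
    return mu, sigma2

# Hypothetical training set with m = 3 examples and n = 2 features.
X_train = [[1.0, 10.0], [2.0, 12.0], [3.0, 14.0]]
mu, sigma2 = fit_gaussian_params(X_train)
print(mu)
print(sigma2)
```

Some texts divide by m - 1 instead of m; with large training sets the difference is negligible.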
Anomaly Detection Algorithm
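- Choose features $x_j$ that you think might be indicative of anomalous examples
- Fit parameters $\mu_1, \dots, \mu_n, \sigma_1^2, \dots, \sigma_n^2$ using the formulae under Parameter Estimation
- Given a new example $x$, compute $p(x)$ as:
$p(x) = \prod_{j=1}^{n} p(x_j; \mu_j, \sigma_j^2) = \prod_{j=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma_j} \exp\left(-\frac{(x_j - \mu_j)^2}{2\sigma_j^2}\right)$
Anomaly if $p(x) < \epsilon$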
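The steps in this section can be sketched end to end as follows. This is an illustration under stated assumptions rather than a reference implementation: the function names, the toy training data and the threshold `epsilon = 1e-3` are all hypothetical.

```python
import math

def fit(X):
    """Per-feature mean and variance (1/m form) from a list of examples."""
    m, n = len(X), len(X[0])
    mu = [sum(row[j] for row in X) / m for j in range(n)]
    sigma2 = [sum((row[j] - mu[j]) ** 2 for row in X) / m for j in range(n)]
    return mu, sigma2

def p(x, mu, sigma2):
    """Product of the per-feature Gaussian densities at x."""
    prob = 1.0
    for xj, mj, s2 in zip(x, mu, sigma2):
        prob *= math.exp(-((xj - mj) ** 2) / (2.0 * s2)) / math.sqrt(2.0 * math.pi * s2)
    return prob

def is_anomaly(x, mu, sigma2, epsilon):
    """Flag x as anomalous when its modelled density falls below epsilon."""
    return p(x, mu, sigma2) < epsilon

# Hypothetical training data: two features measured on normal examples.
X_train = [[10.0, 5.0], [10.2, 5.1], [9.8, 4.9], [10.1, 5.0], [9.9, 5.0]]
mu, sigma2 = fit(X_train)
epsilon = 1e-3  # assumed threshold; in practice it is tuned on labelled data

print(is_anomaly([10.05, 5.0], mu, sigma2, epsilon))  # close to the training data
print(is_anomaly([12.0, 5.0], mu, sigma2, epsilon))   # far from the training data
```

With many features the product of densities can underflow, so summing log-densities and comparing against $\log \epsilon$ is a common alternative.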
Example - Dividing data into Train, CV and Test Set
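One possible split is sketched below, using the counts from the introduction (10,000 normal and 20 anomalous examples). The 60/20/20 proportions, and the choice to keep every anomaly out of the training set and divide them evenly between the CV and test sets, are assumptions made for illustration. The integer ids stand in for real feature vectors.

```python
import random

def split_for_anomaly_detection(normal, anomalous, seed=0):
    """Return (train, cv, test) lists of (example, label) pairs.

    Assumed scheme: 60% of normal examples for training, 20% each for CV and
    test; anomalies (label 1) appear only in CV and test, split evenly.
    """
    rng = random.Random(seed)
    normal, anomalous = normal[:], anomalous[:]  # copy so inputs stay unchanged
    rng.shuffle(normal)
    rng.shuffle(anomalous)
    n_train = len(normal) * 6 // 10
    n_cv = len(normal) * 2 // 10
    half = len(anomalous) // 2
    train = [(x, 0) for x in normal[:n_train]]
    cv = [(x, 0) for x in normal[n_train:n_train + n_cv]]
    cv += [(x, 1) for x in anomalous[:half]]
    test = [(x, 0) for x in normal[n_train + n_cv:]]
    test += [(x, 1) for x in anomalous[half:]]
    return train, cv, test

normal = list(range(10000))            # placeholder ids for normal examples
anomalous = list(range(10000, 10020))  # placeholder ids for anomalous examples
train, cv, test = split_for_anomaly_detection(normal, anomalous)
print(len(train), len(cv), len(test))
```

Under this scheme the parameters are fitted on the training set only, and the labelled CV set can be used to choose $\epsilon$.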
Anomaly Detection vs Supervised Learning