Anomaly Detection Algorithms
Anomaly detection is a class of semi-supervised (close to unsupervised) learning algorithms widely used in manufacturing, data centres, fraud detection and, as the name implies, spotting anomalies. It is normally used when we have an imbalanced classification problem with, say, about 20 examples with y=1 (anomalous) and 10,000 with y=0 (normal). An example would be identifying faulty aircraft engines based on a large number of parameters, where anomalous data might not be available or, if it is available, makes up less than 0.1% of the data.
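Suppose there are m training examples $x^{(1)}, x^{(2)}, \dots, x^{(m)}$
Problem Statement : Is a new example $x_{test}$ anomalous?
Suppose $x_1, x_2, \dots, x_n$ are the features of the training examples
Model p(x) from the data; p(x) = $\prod_{j=1}^{n} p(x_j; \mu_j, \sigma_j^2)$
Identify unusual/anomalous examples by checking if p(x)<$\epsilon$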
An assumption of the above model is that each feature $x_j$ follows a Gaussian distribution with mean $\mu_j$ and variance $\sigma_j^2$. If a feature is distributed differently, apply a transformation (for example a logarithm or a root) to bring it closer to a normal distribution.
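As a rough illustration of the transformation step, the sketch below generates a synthetic right-skewed feature and compares its skewness before and after a log and a square-root transform. The data, the `skewness` helper and the choice of transforms are assumptions made for this example, not something prescribed by the course; in practice you would inspect a histogram of each real feature and pick the transform that looks most bell-shaped.

```python
import math
import random
import statistics


def skewness(values):
    """Sample skewness: close to 0 for symmetric (e.g. Gaussian) data."""
    mean = statistics.fmean(values)
    m2 = statistics.fmean([(v - mean) ** 2 for v in values])
    m3 = statistics.fmean([(v - mean) ** 3 for v in values])
    return m3 / m2 ** 1.5


rng = random.Random(0)
# Synthetic, strictly positive, right-skewed feature (illustration only).
raw = [rng.lognormvariate(0.0, 1.0) for _ in range(5000)]

# Two candidate transforms for a right-skewed, positive feature.
log_feature = [math.log(v) for v in raw]
sqrt_feature = [math.sqrt(v) for v in raw]

print("skewness raw :", round(skewness(raw), 2))
print("skewness sqrt:", round(skewness(sqrt_feature), 2))
print("skewness log :", round(skewness(log_feature), 2))
```

Because this synthetic feature is log-normal by construction, the log transform brings the skewness close to zero; real features will rarely behave this cleanly.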
Probability Density Function
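For reference, the standard density of a univariate Gaussian with mean $\mu$ and variance $\sigma^2$ (the textbook form, stated here because the original formula is not shown) is:

$$
p(x; \mu, \sigma^2) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)
$$

The density peaks at $x = \mu$ and falls off symmetrically on either side, so values far from the mean get a small $p(x)$.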
Cumulative Distribution Function
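The cumulative distribution function gives the probability that a Gaussian variable is at most $x$. In its standard form it can be written with the error function as $F(x) = \frac{1}{2}\left[1 + \operatorname{erf}\left(\frac{x - \mu}{\sigma\sqrt{2}}\right)\right]$. A minimal sketch using Python's `math.erf` follows; the function name `gaussian_cdf` is just a label chosen for this example.

```python
import math


def gaussian_cdf(x, mu=0.0, sigma=1.0):
    """P(X <= x) for X ~ N(mu, sigma^2), computed via the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


# Probability mass within one standard deviation of the mean.
within_one_sigma = gaussian_cdf(1.0) - gaussian_cdf(-1.0)
print(round(within_one_sigma, 4))
```

Differences of CDF values like this one show how much probability mass lies in an interval, which helps when judging how rare a feature value is.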
Anomaly Detection Algorithm
- Choose features $x_j$ that you think might be indicative of anomalous examples
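- Fit parameters $\mu_1, \dots, \mu_n, \sigma_1^2, \dots, \sigma_n^2$ using the formulae $\mu_j = \frac{1}{m}\sum_{i=1}^{m} x_j^{(i)}$ and $\sigma_j^2 = \frac{1}{m}\sum_{i=1}^{m} \left(x_j^{(i)} - \mu_j\right)^2$
- Given a new example $x$, compute $p(x)$ as: $p(x) = \prod_{j=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma_j} \exp\left(-\frac{(x_j - \mu_j)^2}{2\sigma_j^2}\right)$
Anomaly if p(x)<$\epsilon$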
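The steps above can be sketched in a few lines of Python. This is a minimal illustration, assuming the usual per-feature estimates (mean and variance with $\frac{1}{m}$ normalisation) and a product of univariate Gaussian densities. The function names, the tiny training set and the threshold value are all made up for the example.

```python
import math


def fit_gaussian_params(X):
    """Per-feature mean and variance (1/m normalisation) from the rows of X."""
    m = len(X)
    n = len(X[0])
    mu = [sum(row[j] for row in X) / m for j in range(n)]
    sigma2 = [sum((row[j] - mu[j]) ** 2 for row in X) / m for j in range(n)]
    return mu, sigma2


def density(x, mu, sigma2):
    """p(x) as the product of univariate Gaussian densities, one per feature."""
    p = 1.0
    for xj, mj, s2j in zip(x, mu, sigma2):
        p *= math.exp(-((xj - mj) ** 2) / (2.0 * s2j)) / math.sqrt(2.0 * math.pi * s2j)
    return p


def is_anomaly(x, mu, sigma2, epsilon):
    """Flag x as anomalous when its density falls below the threshold."""
    return density(x, mu, sigma2) < epsilon


# Tiny made-up training set with two features (illustration only).
X_train = [[1.0, 10.0], [1.2, 9.5], [0.9, 10.4], [1.1, 10.1], [0.8, 10.0]]
mu, sigma2 = fit_gaussian_params(X_train)
epsilon = 1e-3  # hypothetical threshold; normally tuned on a validation set

print(is_anomaly([1.0, 10.0], mu, sigma2, epsilon))  # point at the feature means
print(is_anomaly([3.0, 14.0], mu, sigma2, epsilon))  # point far from the data
```

With many features the product of densities can underflow, so a practical version would likely sum log-densities and compare against $\log \epsilon$ instead.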
Example - Dividing data into Train, CV and Test Set
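One way to divide such a data set, assuming the 10,000 normal and 20 anomalous examples mentioned earlier, is to train on normal examples only and share the few anomalies between the cross-validation and test sets. The 60/20/20 proportions, the helper name `split_for_anomaly_detection` and the integer IDs standing in for real examples are assumptions made for this sketch.

```python
import random


def split_for_anomaly_detection(normal, anomalous, seed=0):
    """Split normal examples 60/20/20; anomalies go only to CV and test."""
    rng = random.Random(seed)
    normal = list(normal)
    anomalous = list(anomalous)
    rng.shuffle(normal)
    rng.shuffle(anomalous)

    n_train = len(normal) * 6 // 10
    n_cv = len(normal) * 2 // 10
    half = len(anomalous) // 2

    # Each item is (example, label) with label 0 = normal, 1 = anomalous.
    train = [(x, 0) for x in normal[:n_train]]
    cv = [(x, 0) for x in normal[n_train:n_train + n_cv]]
    cv += [(x, 1) for x in anomalous[:half]]
    test = [(x, 0) for x in normal[n_train + n_cv:]]
    test += [(x, 1) for x in anomalous[half:]]
    return train, cv, test


# Integer IDs used as placeholders for real examples.
normal_ids = range(10_000)
anomalous_ids = range(10_000, 10_020)
train, cv, test = split_for_anomaly_detection(normal_ids, anomalous_ids)
print(len(train), len(cv), len(test))
```

The training set is used to fit $\mu_j$ and $\sigma_j^2$, while the labelled cross-validation set can be used to choose $\epsilon$. With classes this skewed, a metric such as precision/recall or F1 is more informative than plain accuracy.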
Anomaly Detection vs Supervised Learning
The source material is from Andrew Ng’s awesome course on Coursera. The content of the videos has been written up in text form so that anyone who wishes to revise a certain topic can do so without going through the entire video lectures.