# Anomaly Detection Algorithms

Anomaly detection is a class of semi-supervised (close to unsupervised) learning algorithms widely used in manufacturing, data centres and fraud detection. It is normally used for heavily imbalanced classification problems with, say, about 20 anomalous examples (y=1) against 10,000 normal ones (y=0). An example would be identifying faulty aircraft engines based on a large number of parameters, where anomalous data might not be available at all or, if it is available, will make up less than 0.1% of the data.

### Algorithm:

Suppose there are m training examples $x^{(1)},x^{(2)},x^{(3)},\dots,x^{(m)}$.

Problem Statement: Is $x_{test}$ anomalous?

Approach:
Suppose $x_j$, for $j = 1, \dots, n$, are the features of each training example.

Model p(x) from the data: p(x) = $p(x_1;\mu_1, \sigma_1^2) \cdot p(x_2;\mu_2, \sigma_2^2) \cdot p(x_3;\mu_3, \sigma_3^2) \cdots p(x_n;\mu_n, \sigma_n^2)$,
or
p(x) = $\prod_{j=1}^n p(x_j;\mu_j, \sigma_j^2)$

Identify unusual/anomalous examples by checking if p(x) < $\epsilon$

### Gaussian Distribution

The model above assumes that each feature $x_j$ follows a Gaussian distribution with mean $\mu_j$ and variance $\sigma_j^2$. If a feature is distributed differently, apply a transformation to bring it closer to a normal distribution.
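As a rough illustration of such a transformation, the sketch below applies a log transform to a right-skewed feature and compares the skewness before and after. The log transform is only one common choice, not something the course material prescribes here; the `offset` parameter, the `skewness` helper and the sample values are all hypothetical.

```python
import math

def log_transform(values, offset=1.0):
    """Apply log(x + offset) to each value (offset is an assumed constant)."""
    return [math.log(v + offset) for v in values]

def skewness(values):
    """Third standardized moment; close to 0 for symmetric data."""
    m = len(values)
    mean = sum(values) / m
    var = sum((v - mean) ** 2 for v in values) / m
    return (sum((v - mean) ** 3 for v in values) / m) / var ** 1.5

# Made-up, strongly right-skewed feature values
raw = [0.1, 0.2, 0.3, 0.5, 0.8, 1.5, 3.0, 9.0, 40.0]
transformed = log_transform(raw)

print("skewness before:", skewness(raw))
print("skewness after: ", skewness(transformed))
```

A lower skewness after the transform suggests the feature is closer to the bell shape the model assumes; plotting a histogram is the usual way to check this by eye.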

Normal Distribution

$X \sim N(\mu, \sigma^2)$

$p(x;\mu, \sigma^2) = \frac{1}{\sqrt{2\pi \sigma^2}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}$
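The density formula can be written directly as a small function. This is a minimal sketch; the function name `gaussian_pdf` and the argument name `sigma2` (for $\sigma^2$) are chosen here for illustration.

```python
import math

def gaussian_pdf(x, mu, sigma2):
    """Density of N(mu, sigma2) evaluated at x; sigma2 is the variance."""
    coeff = 1.0 / math.sqrt(2.0 * math.pi * sigma2)
    return coeff * math.exp(-((x - mu) ** 2) / (2.0 * sigma2))

# The density peaks at the mean and is symmetric around it
print(gaussian_pdf(0.0, 0.0, 1.0))
print(gaussian_pdf(-1.0, 0.0, 1.0), gaussian_pdf(1.0, 0.0, 1.0))
```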

[Figures: Probability Density Function, Cumulative Distribution Function]

Parameter Estimation:

$\mu_j = \frac1m\sum_{i=1}^m x_j^{(i)}$

$\sigma_j^2 = \frac1m\sum_{i=1}^m (x_j^{(i)}-\mu_j)^2$
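A minimal sketch of the parameter estimation, computing a mean and a variance for each feature with the $\frac1m$ normalisation used here. The function name `fit_gaussian_params` and the tiny data set are made up for illustration.

```python
def fit_gaussian_params(X):
    """X is a list of m examples, each a list of n feature values.

    Returns (mu, sigma2): per-feature means and variances (divided by m).
    """
    m = len(X)
    n = len(X[0])
    mu = [sum(row[j] for row in X) / m for j in range(n)]
    sigma2 = [sum((row[j] - mu[j]) ** 2 for row in X) / m for j in range(n)]
    return mu, sigma2

# Hypothetical training set: 3 examples, 2 features
X_train = [[1.0, 10.0], [2.0, 12.0], [3.0, 14.0]]
mu, sigma2 = fit_gaussian_params(X_train)
print(mu, sigma2)
```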

### Anomaly Detection Algorithm

1. Choose features $x_j$ that you think might be indicative of anomalous examples
2. Fit parameters $\mu_1, \mu_2, \dots, \mu_n, \sigma_1^2, \sigma_2^2, \dots, \sigma_n^2$ using the parameter estimation formulae
3. Given new example $x$, compute $p(x)$ as:

$p(x) = \prod_{j=1}^n p(x_j;\mu_j, \sigma_j^2) = \prod_{j=1}^n \frac1{\sqrt{2\pi \sigma_j^2}}e^{-\frac{(x_j-\mu_j)^2}{2\sigma_j^2}}$

Anomaly if p(x) < $\epsilon$
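The three steps can be sketched end to end as follows. This is only an illustration under stated assumptions: the function names, the training data and the threshold `epsilon = 1e-3` are all hypothetical, and in practice $\epsilon$ would need to be tuned rather than fixed by hand.

```python
import math

def fit(X):
    """Step 2: estimate per-feature mean and variance from training data."""
    m, n = len(X), len(X[0])
    mu = [sum(row[j] for row in X) / m for j in range(n)]
    sigma2 = [sum((row[j] - mu[j]) ** 2 for row in X) / m for j in range(n)]
    return mu, sigma2

def p(x, mu, sigma2):
    """Step 3: product of the per-feature Gaussian densities."""
    prob = 1.0
    for xj, mj, s2j in zip(x, mu, sigma2):
        prob *= math.exp(-((xj - mj) ** 2) / (2.0 * s2j)) / math.sqrt(2.0 * math.pi * s2j)
    return prob

def is_anomaly(x, mu, sigma2, epsilon):
    """Flag x as anomalous when p(x) falls below the threshold epsilon."""
    return p(x, mu, sigma2) < epsilon

# Step 1: two made-up features per example (all values are illustrative)
X_train = [[5.0, 100.0], [5.2, 98.0], [4.8, 102.0], [5.1, 101.0], [4.9, 99.0]]
mu, sigma2 = fit(X_train)
epsilon = 1e-3  # assumed threshold, not taken from the source

print(is_anomaly([5.0, 100.0], mu, sigma2, epsilon))  # example near the training data
print(is_anomaly([6.0, 90.0], mu, sigma2, epsilon))   # example far from the training data
```

With many features the product can underflow to zero, so a practical version might sum log-densities and compare against $\log \epsilon$ instead.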

Example - Dividing data into Train, CV and Test Set

Anomaly Detection vs Supervised Learning

Source material from Andrew Ng’s awesome course on Coursera. The material in the videos has been written up in text form so that anyone who wishes to revise a certain topic can do so without going through the entire video lectures.

Written on March 24, 2018