# Anomaly Detection Algorithms

Anomaly detection is a class of semi-supervised (close to unsupervised) learning algorithms widely used in manufacturing, data centres and fraud detection. It is typically used for heavily imbalanced classification problems where, say, there are about 20 anomalous examples (y=1) and 10,000 normal ones (y=0). An example is identifying faulty aircraft engines from a large number of parameters, where anomalous data might not be available or, if it is available, makes up less than 0.1% of the examples.

### Algorithm:

Suppose there are $m$ training examples $x^{(1)}, x^{(2)}, x^{(3)}, \ldots, x^{(m)}$.

Problem Statement: Is $x_{test}$ anomalous?

Approach:
Suppose $x_j$, for $j = 1, \ldots, n$, are the features of each training example.

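Model $p(x)$ from the data, treating the features as independent so that the joint density is a product of per-feature densities:

$p(x) = p(x_1;\mu_1, \sigma_1^2) \cdot p(x_2;\mu_2, \sigma_2^2) \cdot p(x_3;\mu_3, \sigma_3^2) \cdots p(x_n;\mu_n, \sigma_n^2)$

or, more compactly,

$p(x) = \prod_{j=1}^n p(x_j;\mu_j, \sigma_j^2)$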

Identify unusual/anomalous examples by checking whether $p(x) < \epsilon$, where $\epsilon$ is a chosen threshold.

### Gaussian Distribution

The model above assumes that each feature $x_j$ follows a Gaussian distribution with mean $\mu_j$ and variance $\sigma_j^2$. If a feature is distributed differently, apply a transformation to bring it closer to a normal distribution.
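As a rough illustration of such a transformation, the sketch below applies a log transform to a right-skewed feature and compares the skewness before and after. The choice of `log1p`, the helper names `skewness` and `log_transform`, and the sample values are all assumptions made for this example. Other transformations (square root, other powers) may suit other features better.

```python
import math


def skewness(values):
    """Population skewness: third central moment divided by std dev cubed."""
    m = len(values)
    mean = sum(values) / m
    var = sum((v - mean) ** 2 for v in values) / m
    third = sum((v - mean) ** 3 for v in values) / m
    return third / var ** 1.5


def log_transform(values):
    """Apply log(1 + v) to each value; assumes every v is greater than -1."""
    return [math.log1p(v) for v in values]


# Hypothetical right-skewed feature values, for illustration only.
raw = [1, 2, 2, 3, 3, 4, 50, 120]
transformed = log_transform(raw)

# The transformed feature should be noticeably less skewed than the raw one.
print(skewness(raw), skewness(transformed))
```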

Normal Distribution

$X \sim N(\mu, \sigma^2)$

$p(x;\mu, \sigma^2) = \frac{1}{\sqrt{2\pi \sigma^2}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}$
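The density formula translates directly into code. This is a minimal sketch using only the standard library. The function name `gaussian_pdf` is an assumption, and the third argument is the variance $\sigma^2$, not the standard deviation.

```python
import math


def gaussian_pdf(x: float, mu: float, sigma2: float) -> float:
    """Density of N(mu, sigma2) evaluated at x; sigma2 is the variance."""
    if sigma2 <= 0:
        raise ValueError("variance must be positive")
    coeff = 1.0 / math.sqrt(2.0 * math.pi * sigma2)
    return coeff * math.exp(-((x - mu) ** 2) / (2.0 * sigma2))


# The density peaks at the mean and falls off symmetrically on both sides.
print(gaussian_pdf(0.0, 0.0, 1.0))
print(gaussian_pdf(3.0, 0.0, 1.0))
```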

Probability Density Function

Cumulative Distribution Function

Parameter Estimation:

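For each feature $j$, estimated over the $m$ training examples:

$\mu_j = \frac1m\sum_{i=1}^m x_j^{(i)}$

$\sigma_j^2 = \frac1m\sum_{i=1}^m (x_j^{(i)}-\mu_j)^2$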
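A minimal sketch of the estimation step, assuming the training set is a list of equal-length feature rows. The mean of each feature is its ordinary sample average, and the variance divides by $m$, matching the formula above. The name `estimate_gaussian` and the toy data are assumptions for illustration.

```python
def estimate_gaussian(X):
    """Return per-feature means and variances (dividing by m) for rows in X."""
    m = len(X)
    n = len(X[0])
    mu = [sum(row[j] for row in X) / m for j in range(n)]
    sigma2 = [sum((row[j] - mu[j]) ** 2 for row in X) / m for j in range(n)]
    return mu, sigma2


# Toy data: two examples with two features each (illustrative only).
X_toy = [[1.0, 2.0], [3.0, 6.0]]
mu_toy, sigma2_toy = estimate_gaussian(X_toy)
print(mu_toy, sigma2_toy)
```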

### Anomaly Detection Algorithm

1. Choose features $x_j$ that you think might be indicative of anomalous examples
2. Fit parameters $\mu_1, \mu_2, \ldots, \mu_n, \sigma_1^2, \sigma_2^2, \ldots, \sigma_n^2$ using the parameter estimation formulae
3. Given new example $x$, compute $p(x)$ as:

$p(x) = \prod_{j=1}^n p(x_j;\mu_j, \sigma_j^2) = \prod_{j=1}^n \frac1{\sqrt{2\pi \sigma_j^2}}e^{-\frac{(x_j-\mu_j)^2}{2\sigma_j^2}}$

Flag $x$ as an anomaly if $p(x) < \epsilon$.
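The steps above can be sketched end to end as follows. This is a minimal pure-Python illustration, not a tuned implementation. The function names, the toy training data and the threshold `epsilon = 1e-3` are assumptions chosen for the example.

```python
import math


def fit_gaussian(X):
    """Step 2: per-feature mean and variance (dividing by m)."""
    m, n = len(X), len(X[0])
    mu = [sum(row[j] for row in X) / m for j in range(n)]
    sigma2 = [sum((row[j] - mu[j]) ** 2 for row in X) / m for j in range(n)]
    return mu, sigma2


def density(x, mu, sigma2):
    """Step 3: product of univariate Gaussian densities over all features."""
    p = 1.0
    for xj, mj, s2j in zip(x, mu, sigma2):
        p *= math.exp(-((xj - mj) ** 2) / (2.0 * s2j)) / math.sqrt(2.0 * math.pi * s2j)
    return p


def is_anomaly(x, mu, sigma2, epsilon):
    """Flag x as anomalous when its density falls below the threshold."""
    return density(x, mu, sigma2) < epsilon


# Hypothetical training data: two features per example, all assumed normal.
train = [[10.0, 5.0], [10.2, 5.1], [9.8, 4.9], [10.1, 5.0], [9.9, 5.0]]
mu, sigma2 = fit_gaussian(train)
epsilon = 1e-3  # assumed threshold; in practice it has to be chosen for the data

print(is_anomaly([10.0, 5.0], mu, sigma2, epsilon))  # point near the training data
print(is_anomaly([14.0, 7.0], mu, sigma2, epsilon))  # point far from the training data
```

With many features the product of densities can underflow to zero, so summing log-densities and comparing against $\log \epsilon$ is a common alternative.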

Example - Dividing data into Train, CV and Test Set
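One possible way to divide the data is sketched below. The proportions are not given here, so the 60/20/20 split of normal examples and the even split of anomalies between the CV and test sets are assumptions, as are the function name and the placeholder data. The training set contains only normal examples because the model $p(x)$ is fitted on data assumed to be non-anomalous.

```python
import random


def split_for_anomaly_detection(normal, anomalous, seed=0):
    """Split into an unlabeled train set and labeled CV/test sets.

    Assumed proportions: 60% of normal examples for training, 20% each for
    CV and test; anomalies are divided evenly between CV and test.
    """
    rng = random.Random(seed)
    normal, anomalous = list(normal), list(anomalous)
    rng.shuffle(normal)
    rng.shuffle(anomalous)

    n_train = len(normal) * 6 // 10
    n_cv = len(normal) * 2 // 10
    a_cv = len(anomalous) // 2

    train = normal[:n_train]  # normal examples only, no labels needed
    cv = [(x, 0) for x in normal[n_train:n_train + n_cv]]
    cv += [(x, 1) for x in anomalous[:a_cv]]
    test = [(x, 0) for x in normal[n_train + n_cv:]]
    test += [(x, 1) for x in anomalous[a_cv:]]
    return train, cv, test


# Placeholder data sized like the 10,000 normal / 20 anomalous example.
normal_examples = [[float(i)] for i in range(10000)]
anomalous_examples = [[-float(i) - 1.0] for i in range(20)]
train_set, cv_set, test_set = split_for_anomaly_detection(normal_examples, anomalous_examples)
print(len(train_set), len(cv_set), len(test_set))
```

The labeled CV set can then be used to compare candidate values of $\epsilon$, keeping the test set for a final check.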

Anomaly Detection vs Supervised Learning

Written on March 24, 2018