Facial Recognition & Verification using Convolutional Neural Network

This post explains the algorithms behind Facial Recognition and Facial Verification, along with the associated loss functions and technical details. I will also be building code from scratch for Face Recognition using a CNN; that will be posted separately, as this post is mostly algorithms and mathematics.

Face Verification vs Face Recognition

Face Verification: Input image, Name/ID. Output whether the input image is that of the claimed person (1:1 matching, 99% accuracy will work)

Face Recognition: Has a database of $K$ persons. Get an input image. Output the ID if the image matches any of the $K$ persons, or "not recognized" otherwise (1:K matching, much harder to solve, over 99.9% accuracy is needed)

One Shot Learning

Learning from one example to recognize the person again

Approach 1: take the images of the employees in your dataset and run them through a CNN followed by a softmax output layer. This is not very accurate, and if a new person joins the team the softmax output needs one more unit, so the network has to be retrained.

Approach 2: learn a similarity function d(image1, image2) = degree of difference between images

At recognition time, if d(img1, img2) <= $\tau$, the images are of the same person;
if d(img1, img2) > $\tau$, the images are of different persons.
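The thresholding rule can be sketched in a few lines of Python. This is only an illustration: the names `verify` and `toy_d` and the threshold value are hypothetical, and `toy_d` is a stand-in for a learned function d(), treating an "image" as a single number.

```python
from typing import Callable, TypeVar

Image = TypeVar("Image")


def verify(d: Callable[[Image, Image], float],
           img1: Image, img2: Image, tau: float) -> bool:
    """Return True when d(img1, img2) <= tau, i.e. the images are judged the same."""
    return d(img1, img2) <= tau


# Stand-in for a learned d(): an "image" is just a number here (assumption).
def toy_d(a: float, b: float) -> float:
    return abs(a - b)


same = verify(toy_d, 1.0, 1.2, tau=0.5)       # small difference -> same person
different = verify(toy_d, 1.0, 3.0, tau=0.5)  # large difference -> different person
```

In practice $\tau$ is a hyperparameter that would be tuned on held-out image pairs.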

This is how a Face Verification task is addressed

To use it for Face Recognition, the same function is used to compute d() between the input image and every image in the database, one pair at a time. This solves the one shot learning problem, so long as you can learn the similarity function d()
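A minimal sketch of this 1:K lookup is given here, assuming one reference image per person. The function `recognize`, the toy database, and the choice to return the closest match under the threshold are assumptions made for illustration; `toy_d` again stands in for a learned d().

```python
from typing import Callable, Dict, Optional, TypeVar

Image = TypeVar("Image")


def recognize(d: Callable[[Image, Image], float], query: Image,
              database: Dict[str, Image], tau: float) -> Optional[str]:
    """Compare the query with every database image; return the closest ID
    if its distance is within tau, otherwise None ("not recognized")."""
    best_id, best_dist = None, float("inf")
    for person_id, reference in database.items():
        dist = d(query, reference)
        if dist < best_dist:
            best_id, best_dist = person_id, dist
    return best_id if best_dist <= tau else None


# Stand-in for a learned d(): an "image" is just a number here (assumption).
def toy_d(a: float, b: float) -> float:
    return abs(a - b)


database = {"alice": 1.0, "bob": 5.0}          # hypothetical employees
known = recognize(toy_d, 1.1, database, tau=0.5)
unknown = recognize(toy_d, 3.0, database, tau=0.5)
```

Adding a new person only means adding one entry to `database`; nothing has to be retrained.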

Training a Neural network to learn the function d()

Siamese Network

Pass an image $x^{(1)}$ through the layers of a CNN and take the vector produced by a layer deep in the network, just before the output layer: $f(x^{(1)})$.

Think of $f(x^{(1)})$ as an encoding of the input image $x^{(1)}$

For a different image, run the image through the same network to obtain the encoding $f(x^{(2)})$ of image $x^{(2)}$

To compare them, define the distance between the two images as the squared norm of the difference between their encodings

$d(x^{(1)},x^{(2)}) = || f(x^{(1)})-f(x^{(2)})||^2$
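As a small illustration, the distance can be computed directly from two encoding vectors. The function name `squared_distance` and the example encodings are made up; a real $f(x)$ would come out of the CNN.

```python
from typing import Sequence


def squared_distance(enc1: Sequence[float], enc2: Sequence[float]) -> float:
    """Squared Euclidean norm ||enc1 - enc2||^2 of two equal-length encodings."""
    if len(enc1) != len(enc2):
        raise ValueError("encodings must have the same length")
    return sum((a - b) ** 2 for a, b in zip(enc1, enc2))


# Hypothetical 3-dimensional encodings f(x1) and f(x2).
f_x1 = [1.0, 2.0, 3.0]
f_x2 = [1.0, 0.0, 3.0]
dist = squared_distance(f_x1, f_x2)  # (0)^2 + (2)^2 + (0)^2
```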

This idea of running the same CNN on two different images and comparing the encodings is called a Siamese Network.

It is used in the DeepFace paper

Goal of Learning:

• Parameters of NN define an encoding $f(x^{(i)})$
• Learn parameters so that:
If $x^{(i)}, x^{(j)}$ are the same person, $|| f(x^{(i)})-f(x^{(j)})||^2$ is small
If $x^{(i)}, x^{(j)}$ are different persons, $|| f(x^{(i)})-f(x^{(j)})||^2$ is large

Defining an Objective Function : Triplet Loss Function

One way to learn the parameters of the CNN, so that it produces good encodings of an image, is to apply Gradient Descent to the Triplet loss function.

To apply the triplet loss, you need to compare pairs of images.

An Anchor image is compared to a similar image (Positive) and a different image (Negative). The distance in the first case (Anchor-Positive) needs to be small while the distance in the second case (Anchor-Negative) must be large.

Triplet loss - you are looking at three images (Anchor, Positive, Negative - A,P,N)

Want: $|| f(A)-f(P)||^2 \leq || f(A)-f(N)||^2$, or $d(A,P) \leq d(A,N)$

or $|| f(A)-f(P)||^2 - || f(A)-f(N)||^2 \leq 0$

This inequality is trivially satisfied if the NN sets all the encodings to 0, or outputs the same encoding for every image. To prevent this trivial solution, require $|| f(A)-f(P)||^2 - || f(A)-f(N)||^2 \leq 0-\alpha$
or $|| f(A)-f(P)||^2 - || f(A)-f(N)||^2 +\alpha \leq 0$

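$\alpha$ is also known as the margin parameter

The margin also forces a clear gap between the two distances. If d(A,P) = 0.5 and d(A,N) = 0.51, the inequality without a margin holds, but the distances are almost equal, so this is not a good solution. $\alpha$ is a hyperparameter; with, for example, $\alpha$ = 0.2, d(A,N) would have to be at least 0.7 when d(A,P) = 0.5. The margin pushes d(A,P) and d(A,N) further apart.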

Defining the Triplet loss function - defined on triples of images

Given three images A,P,N

Loss $L(A,P,N) = max(|| f(A)-f(P)||^2 - || f(A)-f(N)||^2 +\alpha,0)$

Because of max(), the loss is 0 whenever the first argument is less than or equal to 0, so minimizing the loss pushes the first argument to 0 or below

Cost function $J = \sum^m_{i=1} L(A^{(i)},P^{(i)},N^{(i)})$
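The loss and the cost can be sketched directly from these two formulas. This is a plain-Python illustration on precomputed encodings, not a training loop; the function names, the default $\alpha$ = 0.2, and the example vectors are assumptions.

```python
from typing import Iterable, Sequence, Tuple

Encoding = Sequence[float]


def squared_distance(u: Encoding, v: Encoding) -> float:
    """||u - v||^2 for two equal-length encodings."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_loss(f_a: Encoding, f_p: Encoding, f_n: Encoding,
                 alpha: float = 0.2) -> float:
    """L(A,P,N) = max(||f(A)-f(P)||^2 - ||f(A)-f(N)||^2 + alpha, 0)."""
    return max(squared_distance(f_a, f_p) - squared_distance(f_a, f_n) + alpha, 0.0)


def cost(triplets: Iterable[Tuple[Encoding, Encoding, Encoding]],
         alpha: float = 0.2) -> float:
    """J = sum of L(A,P,N) over all training triplets."""
    return sum(triplet_loss(a, p, n, alpha) for a, p, n in triplets)


anchor = [0.0, 0.0]
# Easy triplet: negative is far away, so the margin is already satisfied.
easy = (anchor, [0.1, 0.0], [1.0, 0.0])   # 0.01 - 1.00 + 0.2 < 0
# Hard triplet: negative is almost as close as the positive.
hard = (anchor, [0.5, 0.0], [0.6, 0.0])   # 0.25 - 0.36 + 0.2 > 0

easy_loss = triplet_loss(*easy)
hard_loss = triplet_loss(*hard)
total = cost([easy, hard])
```

Only the hard triplet contributes to the cost here; the easy one is clipped to 0 by max().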

If there is a training set of 10000 pictures of 1k persons, you need a dataset with multiple pictures of the same person in order to form the (A,P) pairs - in this case you'll have on average 10 images per person.

How do you choose these triplets to form your training set?

Choosing the triplets A,P,N

During training, if A,P,N are chosen randomly, then d(A,P)+$\alpha$<=d(A,N) is easily satisfied, so the network learns very little from most triplets.

To construct the training set, choose triplets A,P,N which are hard to train on, i.e. where d(A,P) is close to d(A,N). This increases the computational efficiency of learning, because gradient descent has to do real work on every triplet

This makes d(A,P) go down and d(A,N) go up, so that d(A,P) is less than d(A,N) by at least the margin $\alpha$
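One simple way to look for hard triplets, sketched here under the assumption that encodings are already available, is to pick for a given anchor the negative with the smallest d(A,N) and keep the triplet only if it still violates the margin. The helper names and example vectors are hypothetical, and this is not necessarily the exact selection procedure used in the paper.

```python
from typing import List, Sequence

Encoding = Sequence[float]


def squared_distance(u: Encoding, v: Encoding) -> float:
    """||u - v||^2 for two equal-length encodings."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def hardest_negative(f_a: Encoding, negatives: List[Encoding]) -> Encoding:
    """Return the negative encoding closest to the anchor (smallest d(A,N))."""
    return min(negatives, key=lambda f_n: squared_distance(f_a, f_n))


def violates_margin(f_a: Encoding, f_p: Encoding, f_n: Encoding,
                    alpha: float = 0.2) -> bool:
    """True when d(A,P) + alpha > d(A,N), i.e. the triplet has non-zero loss."""
    return squared_distance(f_a, f_p) + alpha > squared_distance(f_a, f_n)


anchor = [0.0, 0.0]
positive = [0.5, 0.0]
negatives = [[2.0, 0.0], [0.6, 0.0], [3.0, 3.0]]  # encodings of other persons

hard_n = hardest_negative(anchor, negatives)
is_hard = violates_margin(anchor, positive, hard_n)           # 0.25 + 0.2 > 0.36
is_easy = not violates_margin(anchor, positive, [2.0, 0.0])   # 0.25 + 0.2 <= 4.0
```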

Paper: FaceNet

A lot of companies have already trained these CNNs on large datasets (more than 100 million images), and the parameters from that training can be used directly here.

Using a binary classification function to classify the images

A logistic regression can also be used: the encodings that the CNN produces for the two images are fed to a logistic regression unit, which outputs 1 if the images are of the same person and 0 if they are of different persons.

Source material is from Andrew Ng's awesome course on Coursera. The material in the videos has been written in text form so that anyone who wishes to revise a certain topic can do so without going through the entire video lectures.

Written on December 23, 2017