Bounding Box Predictions, Intersection Over Union and Non Max Suppression

Boundary Box Prediction

Making predictions for a boundary box more accurate

YOLO Algorithm: You Only Look Once!

Output for each grid cell:

$$y = \begin{bmatrix} P_c \\ b_x \\ b_y \\ b_h \\ b_w \\ C_1 \\ C_2 \\ C_3 \end{bmatrix}$$

Each grid cell has 8 output units. For an image divided into a 3x3 grid, the algorithm therefore outputs 8 units for each of the 9 grid cells.

Input: 100x100x3
Output: 3x3x8

The neural network outputs precise boundary boxes in this case, because $b_x, b_y, b_h, b_w$ are predicted directly as numbers.

In practice, a much finer grid, such as 19x19, is used in place of the 3x3 grid.

Assigning an object to a grid cell: an object is assigned to the grid cell that contains its midpoint.
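The midpoint rule can be sketched in a few lines of Python. This is only an illustration: the function name `assign_grid_cell`, the pixel coordinate convention (origin at the top-left, x to the right, y downward) and the clamping at the image border are assumptions, not something taken from the course.

```python
def assign_grid_cell(x_mid, y_mid, img_w, img_h, grid_size=3):
    """Return the (row, col) of the grid cell that contains the midpoint.

    Assumes pixel coordinates with the origin at the top-left corner.
    """
    cell_w = img_w / grid_size
    cell_h = img_h / grid_size
    # Clamp so that a midpoint lying exactly on the right/bottom edge
    # still falls into the last cell.
    col = min(int(x_mid // cell_w), grid_size - 1)
    row = min(int(y_mid // cell_h), grid_size - 1)
    return row, col


# Midpoint (70, 40) in a 100x100 image with a 3x3 grid
print(assign_grid_cell(70, 40, 100, 100))  # (1, 2): middle row, right column
```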

Advantage: Fast execution time, since the whole image is processed in a single forward pass of the network.

Encoding boundary boxes

$$y = \begin{bmatrix} 1 \\ b_x \\ b_y \\ b_h \\ b_w \\ 0 \\ 1 \\ 0 \end{bmatrix}$$

$b_x, b_y, b_h, b_w$ are relative to each grid cell.

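$b_x$ and $b_y$ lie between 0 and 1, since the midpoint falls inside the grid cell.
$b_h$ and $b_w$ can be greater than 1, since an object can be larger than a single grid cell.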
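As a rough sketch of this encoding, the function below turns a box given in pixels into the 8-unit vector for its grid cell. The name `encode_box`, the pixel convention (origin at the top-left) and the use of a zero-based `class_index` are assumptions made for the example.

```python
def encode_box(x_mid, y_mid, box_w, box_h, img_w, img_h, class_index,
               grid_size=3, num_classes=3):
    """Encode a pixel-space box as ((row, col), [P_c, b_x, b_y, b_h, b_w, C_1..C_n])."""
    cell_w = img_w / grid_size
    cell_h = img_h / grid_size
    # The grid cell that contains the midpoint owns the object.
    col = min(int(x_mid // cell_w), grid_size - 1)
    row = min(int(y_mid // cell_h), grid_size - 1)
    # Midpoint measured from the cell's top-left corner, in units of cell size.
    b_x = x_mid / cell_w - col
    b_y = y_mid / cell_h - row
    # Height and width as multiples of the cell size (may exceed 1).
    b_h = box_h / cell_h
    b_w = box_w / cell_w
    classes = [1 if i == class_index else 0 for i in range(num_classes)]
    return (row, col), [1, b_x, b_y, b_h, b_w] + classes


# A 20x60 pixel box centred in a 100x100 image, class C_2 (index 1)
cell, y = encode_box(50, 50, 20, 60, 100, 100, class_index=1)
print(cell, y)
```

Here the box is taller than one grid cell, so $b_h$ comes out at roughly 1.8 while $b_x$ and $b_y$ stay close to 0.5.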

**Intersection Over Union (IoU)**

IoU is a measure of overlap between two boundary boxes.

$IoU = \frac{\text{size of intersection}}{\text{size of union}}$

The higher the IoU, the more accurate the boundary box. A threshold of IoU >= 0.5 is commonly used to judge a predicted box as correct.
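A minimal sketch of the IoU computation, assuming each box is given by its corner coordinates `(x1, y1, x2, y2)` with `x1 < x2` and `y1 < y2`. That representation and the function name `iou` are choices made for this example.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap; zero when the boxes do not intersect.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


# Two 2x2 boxes overlapping in a 1x1 square: IoU = 1 / (4 + 4 - 1) = 1/7
print(iou((0, 0, 2, 2), (1, 1, 3, 3)))
```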

Non Max Suppression

Non-max suppression makes sure the algorithm detects each object only once.

Problem: Multiple grid cells might think they contain the midpoint of the same object, so the object is detected several times. Non-max suppression cleans this up.

Let $P_c$ be the probability associated with each detection.

NMS takes the box with the highest prediction probability and removes all other boxes that overlap it with IoU >= 0.5.

Steps in running the Non Max Suppression algorithm:

1. Run the YOLO algorithm
2. Discard all boxes with $P_c$ <= 0.6
3. For the remaining boxes, pick the box with the largest $P_c$ and output it as a prediction
4. Discard any remaining boxes with IoU >= 0.5 with the box output in the previous step
5. Repeat steps 3 and 4 until no boxes remain
6. Independently carry out NMS for each output class
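The steps above can be sketched as follows. This is an illustrative version, not the course's implementation: detections are assumed to be `(p_c, box)` pairs with boxes as `(x1, y1, x2, y2)` corners, and the names `non_max_suppression` and `nms_per_class` are made up for the example.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    intersection = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


def non_max_suppression(detections, score_threshold=0.6, iou_threshold=0.5):
    """Run NMS on one class. detections is a list of (p_c, box) pairs."""
    # Discard low-confidence boxes (P_c <= threshold).
    remaining = [d for d in detections if d[0] > score_threshold]
    remaining.sort(key=lambda d: d[0], reverse=True)
    kept = []
    while remaining:
        # Output the box with the largest P_c ...
        best = remaining.pop(0)
        kept.append(best)
        # ... and discard every box that overlaps it with IoU >= threshold.
        remaining = [d for d in remaining if iou(best[1], d[1]) < iou_threshold]
    return kept


def nms_per_class(detections_by_class, **kwargs):
    """Run NMS independently for each class (dict: class name -> detections)."""
    return {cls: non_max_suppression(dets, **kwargs)
            for cls, dets in detections_by_class.items()}


cars = [(0.9, (0, 0, 10, 10)), (0.8, (1, 1, 11, 11)),
        (0.7, (20, 20, 30, 30)), (0.5, (0, 0, 10, 10))]
print(non_max_suppression(cars))
```

In this example the 0.8 box overlaps the 0.9 box with IoU of about 0.68 and is removed, the 0.5 box is dropped by the $P_c$ threshold, and the 0.7 box survives because it does not overlap anything.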

Anchor Boxes

What if a grid cell wants to detect multiple objects? Solution: Create different anchor boxes, say one vertical (tall) for detecting pedestrians and another horizontal (wide) for detecting cars.

Earlier:

$$y = \begin{bmatrix} 1 \\ b_x \\ b_y \\ b_h \\ b_w \\ 0 \\ 1 \\ 0 \end{bmatrix}$$

With anchor boxes, each anchor box has its own 8 output units.

$$y = \begin{bmatrix} 1 \\ b_x \\ b_y \\ b_h \\ b_w \\ 0 \\ 1 \\ 0 \\ 1 \\ b_x \\ b_y \\ b_h \\ b_w \\ 1 \\ 0 \\ 0 \end{bmatrix}$$

The first anchor box detects a car ($C_2 = 1$) and the second anchor box detects a pedestrian ($C_1 = 1$), both in the same grid cell.

With two anchor boxes, the output for a 3x3 grid will be 3x3x16, in place of the earlier 3x3x8 when no anchor boxes were used.

If there is no car but only a pedestrian, the first anchor box has $P_c = 0$ and its remaining entries are don't-cares (shown as ?), so the output becomes:

$$y = \begin{bmatrix} 0 \\ ? \\ ? \\ ? \\ ? \\ ? \\ ? \\ ? \\ 1 \\ b_x \\ b_y \\ b_h \\ b_w \\ 1 \\ 0 \\ 0 \end{bmatrix}$$
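To make the layout concrete, here is a small sketch that builds the 16-unit vector for one grid cell from two anchor-box slots. The helper names `anchor_slot` and `cell_target`, and the use of `None` to stand for the "?" entries, are assumptions for illustration only.

```python
DONT_CARE = None  # hypothetical placeholder for the "?" entries


def anchor_slot(box=None, class_index=None, num_classes=3):
    """Return the 8 units for one anchor box.

    box is (b_x, b_y, b_h, b_w), or None when the anchor box has no object.
    """
    if box is None:
        # P_c = 0; the remaining 7 entries carry no information.
        return [0] + [DONT_CARE] * (4 + num_classes)
    classes = [1 if i == class_index else 0 for i in range(num_classes)]
    return [1] + list(box) + classes


def cell_target(slots):
    """Concatenate the per-anchor slots into one output vector for the cell."""
    y = []
    for slot in slots:
        y.extend(slot)
    return y


car = anchor_slot((0.5, 0.4, 0.6, 1.5), class_index=1)         # C_2 = 1
pedestrian = anchor_slot((0.5, 0.6, 1.8, 0.4), class_index=0)  # C_1 = 1

both = cell_target([car, pedestrian])
pedestrian_only = cell_target([anchor_slot(), pedestrian])
print(len(both))            # 16
print(pedestrian_only[:8])  # P_c = 0 followed by seven placeholders
```

Stacking one such vector per grid cell gives the 3x3x16 output mentioned above.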

Source material is from Andrew Ng’s awesome course on Coursera. The video material has been written up in text form so that anyone who wishes to revise a certain topic can do so without going through the entire video lectures.

Written on December 22, 2017