If you follow developments in object detection, you have probably heard of YOLO (You Only Look Once). YOLO is a neural network for real-time object detection, built on deep learning. The YOLO model processes images in real time at 45 frames per second. For comparison, you can take any classifier and turn it into an object detector with a sliding-window approach, in which the classifier is run at evenly spaced locations over the entire image. At each step, the classifier predicts what kind of object the current window contains.
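The sliding-window idea can be sketched in a few lines of Python. This is only an illustration: the 416-pixel image size, the 64-pixel window, the 32-pixel stride, and the `dummy_classify` function are assumptions made for the example, not details from any particular detector.

```python
from typing import Callable, Iterator, List, Tuple

Window = Tuple[int, int, int, int]  # (x, y, width, height) in pixels


def sliding_windows(img_w: int, img_h: int, win: int, stride: int) -> Iterator[Window]:
    """Yield square windows at evenly spaced positions over the image."""
    for y in range(0, img_h - win + 1, stride):
        for x in range(0, img_w - win + 1, stride):
            yield (x, y, win, win)


def detect(img_w: int, img_h: int, win: int, stride: int,
           classify: Callable[[Window], Tuple[str, float]]
           ) -> List[Tuple[Window, str, float]]:
    """Run the classifier once per window and collect (window, label, score)."""
    results = []
    for window in sliding_windows(img_w, img_h, win, stride):
        label, score = classify(window)
        results.append((window, label, score))
    return results


def dummy_classify(window: Window) -> Tuple[str, float]:
    """Hypothetical stand-in for a real image classifier."""
    x, y, _, _ = window
    return ("object", 0.9) if (x, y) == (64, 64) else ("background", 0.1)


detections = detect(416, 416, win=64, stride=32, classify=dummy_classify)
print(len(detections))  # one classifier call per window: 12 * 12 = 144
```

Even with a single window size, the classifier is called 144 times for one image; a real detector would repeat this at several scales, which is where the cost comes from.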
The sliding-window approach produces hundreds or thousands of predictions for a single image, but you keep only the ones the classifier is most certain about. Another approach is to use region proposal methods, which first generate candidate bounding boxes in an image and then run a classifier on these proposed boxes. After classification, post-processing is needed to eliminate duplicate detections and to rescore the boxes based on other objects in the scene. Both approaches are slow and hard to optimize. A more efficient approach is to first predict which parts of the image contain the relevant information and then run the classifier only on those parts.
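One common way to eliminate duplicate detections in post-processing is non-maximum suppression (NMS). The text above does not name a specific method, so treat the following as a minimal sketch of one standard choice; the corner-based box format `(x1, y1, x2, y2)` and the 0.5 overlap threshold are assumptions made for the example.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(detections, iou_threshold=0.5):
    """Keep the highest-scoring box among each group of heavily overlapping boxes.

    `detections` is a list of (box, score) pairs.
    """
    kept = []
    for box, score in sorted(detections, key=lambda d: d[1], reverse=True):
        # Keep a box only if it does not overlap too much with a better one.
        if all(iou(box, kept_box) <= iou_threshold for kept_box, _ in kept):
            kept.append((box, score))
    return kept


candidates = [
    ((10, 10, 110, 110), 0.9),    # best box for the first object
    ((12, 12, 112, 112), 0.8),    # near-duplicate of the first box
    ((200, 200, 260, 260), 0.7),  # a separate object
]
final = non_max_suppression(candidates)
print(final)
```

Here the second box overlaps the first almost completely, so it is dropped and two detections remain.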
YOLO reframes object detection as a single regression problem. As the name suggests, YOLO looks at the image just once. It divides the image into a grid of 13 by 13 cells, and each of these cells predicts 5 bounding boxes. A bounding box is a rectangle that encloses an object. For each bounding box, it also runs a recognition step in parallel to identify which object class the box belongs to. This recognition step outputs a probability distribution over all the possible classes.
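To make the grid concrete, the sketch below maps a pixel position to its grid cell and counts how many numbers each cell has to predict. The 416-pixel input size and the 20 object classes are assumptions chosen for the example; only the 13 by 13 grid and the 5 boxes per cell come from the description above.

```python
GRID = 13            # grid cells per side
BOXES_PER_CELL = 5   # bounding boxes predicted by each cell
IMAGE_SIZE = 416     # assumption: square input image, in pixels
NUM_CLASSES = 20     # assumption: number of object classes


def cell_for_point(x, y, image_size=IMAGE_SIZE, grid=GRID):
    """Return the (row, col) of the grid cell that contains pixel (x, y)."""
    cell_size = image_size / grid
    col = min(int(x // cell_size), grid - 1)  # clamp points on the far edge
    row = min(int(y // cell_size), grid - 1)
    return row, col


# Each box carries 4 coordinates, 1 confidence score, and the class probabilities.
values_per_box = 4 + 1 + NUM_CLASSES
values_per_cell = BOXES_PER_CELL * values_per_box

print(cell_for_point(200, 100))  # (3, 6) with 32-pixel cells
print(values_per_cell)           # 125
```

Under these assumptions the whole prediction for one image is a 13 by 13 grid with 125 numbers per cell.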
YOLO also outputs a “confidence score” that indicates how certain it is that the predicted bounding box actually encloses an object. The confidence score and the class prediction are combined into one final score, which gives the probability that the bounding box contains a specific type of object. Based on the final scores, YOLO then merges the boxes intelligently to form an optimal bounding box around each object.
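The combination of confidence score and class prediction can be sketched as a simple product. The multiplication rule, the softmax used to turn raw class outputs into a probability distribution, and the example numbers are assumptions made for illustration.

```python
import math


def softmax(logits):
    """Turn raw class outputs into a probability distribution."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def final_score(confidence, class_probs):
    """Return (best class index, confidence * probability of that class)."""
    best = max(range(len(class_probs)), key=lambda i: class_probs[i])
    return best, confidence * class_probs[best]


# A box that is 80% likely to contain an object, with three candidate classes.
class_id, score = final_score(0.8, [0.1, 0.7, 0.2])
print(class_id, round(score, 2))  # 1 0.56

probs = softmax([2.0, 1.0, 0.1])
print(round(sum(probs), 6))  # 1.0
```

A box therefore scores highly only when the network is confident both that it contains an object and about which class that object is.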
There are 13 × 13 = 169 grid cells and each cell predicts 5 bounding boxes, so there are 845 bounding boxes in total. Most of these boxes have very low confidence scores, so we keep only the boxes whose final score is above a threshold; the threshold depends on how accurate you want your detector to be. Of the 845 bounding boxes, only those that give the best results are kept. A single convolutional network simultaneously predicts multiple bounding boxes and the class probabilities for those boxes. Unlike the sliding-window and region-proposal approaches, which run a classifier many times per image, YOLO makes all 845 predictions at the same time, so the neural network runs just once. That is why YOLO is fast and can be optimized directly for detection performance.
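The box count and the threshold step can be sketched as follows. The scores are made-up placeholder values, and the flat ordering of the 845 boxes (cell by cell, then box by box) is an assumption made for the example.

```python
GRID = 13
BOXES_PER_CELL = 5
TOTAL_BOXES = GRID * GRID * BOXES_PER_CELL  # 169 cells * 5 boxes = 845


def keep_confident(scores, threshold):
    """Return (index, score) for every box whose final score is above the threshold."""
    return [(i, s) for i, s in enumerate(scores) if s > threshold]


def locate(index):
    """Map a flat box index back to (row, col, box number within the cell)."""
    cell, box = divmod(index, BOXES_PER_CELL)
    row, col = divmod(cell, GRID)
    return row, col, box


# Placeholder scores: almost every box has a very low final score.
scores = [0.02] * TOTAL_BOXES
scores[100] = 0.91
scores[400] = 0.64
scores[401] = 0.35

for threshold in (0.3, 0.6):
    kept = keep_confident(scores, threshold)
    print(threshold, [(locate(i), s) for i, s in kept])
```

Raising the threshold from 0.3 to 0.6 drops the weakest of the three candidate boxes, which shows the trade-off: a higher threshold gives fewer but more reliable detections.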