YOLOv3: A Huge Improvement

Anand Sonawane
10 min read · Mar 29, 2018


Speed/Accuracy tradeoff on the mAP at .5 IOU metric (Source: YOLOv3 paper)

YOLO (You Only Look Once), by Joseph Redmon, Santosh Divvala, Ross Girshick and Ali Farhadi, came up with a new approach to the object detection problem in 2016. Before YOLO, object detection models had to perform some type of region detection first, and then run classification on top of the detected ROIs (Regions of Interest). YOLO instead framed detection as a regression problem and performed detection as well as classification using a single neural network, trained end to end by optimizing the loss directly for detection performance. Many networks have since tried to solve object detection, but none was faster than YOLO. YOLO did have drawbacks, though, such as a lower mAP (mean Average Precision) and localization errors, which got solved in the next versions, YOLOv2 and YOLOv3.

YOLO V1 !

Paper : https://arxiv.org/abs/1506.02640

YOLOv1 introduced, for the first time, a unified object detection model: a single convolutional network that simultaneously predicts multiple bounding boxes and class probabilities for those boxes. It had several advantages over other models: it was really fast; it understood contextual information about the classes, since YOLO sees the entire image during training and at test time; and it generalized better than any other model of the time (it could detect the same objects even in a different domain, e.g. in artwork, despite being trained on natural images). While the advantages seemed lucrative, YOLO lagged behind in accuracy and made many localization errors, especially for small objects.

Architecture

YOLOv1 Architecture (Source YOLOv1 paper)

YOLO V1 uses a DarkNet network pretrained on the ImageNet-1000 classification dataset as its feature extractor. The pretrained network is modified for detection by adding 4 convolutional layers and 2 fully connected layers on top. This architecture is very simple compared with complex two-stage detectors like Faster R-CNN. Speaking about numbers, the combined network has 24 convolutional layers and 2 fully connected layers.

How does it work, in short?

The system divides the input image into an S × S grid. If the center of an object falls into a grid cell, that grid cell is responsible for detecting that object. Each grid cell predicts B bounding boxes and confidence scores for those boxes. These confidence scores reflect how confident the model is that the box contains an object, and also how accurate it thinks the predicted box is. Each bounding box consists of 5 predictions: x, y, w, h and confidence. Each grid cell also predicts C conditional class probabilities.

So, when you pass an image through YOLOv1, for each detected object it outputs the x, y coordinates of the object's center, w (the width of the object), h (the height of the object) and the class probability of the detected class, as sketched below.
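To make the output layout concrete, here is a minimal NumPy sketch of how a YOLOv1-style prediction tensor is organized and how one cell's predictions can be read out. S, B and C match the paper's Pascal VOC setup (S = 7, B = 2, C = 20); the tensor here is random, purely to show the indexing.

```python
import numpy as np

S, B, C = 7, 2, 20                       # grid size, boxes per cell, classes (Pascal VOC)
pred = np.random.rand(S, S, B * 5 + C)   # stand-in for the network output: 7 x 7 x 30

row, col = 3, 4                          # inspect one grid cell
cell = pred[row, col]

for b in range(B):
    x, y, w, h, conf = cell[b * 5 : b * 5 + 5]    # box b: center, size, confidence
    print(f"box {b}: x={x:.2f} y={y:.2f} w={w:.2f} h={h:.2f} conf={conf:.2f}")

class_probs = cell[B * 5 :]              # C conditional class probabilities
print("most likely class index:", class_probs.argmax())
```

At test time the paper multiplies each box confidence by the conditional class probabilities to get class-specific scores, but the indexing above is the core of the output format.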

Limitations

Error Analysis (Source : YOLOv1 paper)

As you can clearly see, YOLO makes more errors than Faster R-CNN overall, especially localization errors, while on the other hand it makes fewer background errors than Faster R-CNN.

Quoting the exact words from the paper,

1. Our model struggles with small objects that appear in groups, such as flocks of birds.

2. Since our model learns to predict bounding boxes from data, it struggles to generalize to objects in new or unusual aspect ratios or configurations.

3. Our main source of error is incorrect localizations.

The above limitations were resolved in the next YOLO versions.

YOLO V2 !

Paper : https://arxiv.org/abs/1612.08242

YOLOv2, titled YOLO9000: Better, Faster, Stronger, was published by Joseph Redmon and Ali Farhadi at the end of 2016 as an improvement over YOLOv1. Speaking about the improvements, YOLOv2 was now almost able to match the mAP reported by Faster R-CNN and SSD (Single Shot Detector), while staying just as fast. Let's take a look at the accuracy and speed tradeoff on the Pascal VOC dataset,

Accuracy and speed tradeoff on VOC 2007 (Source : YOLOv2 paper)

Changes to YOLOv1 that shaped a Better, Faster and Stronger YOLOv2:

YOLO to YOLOv2 Changes (Source : YOLOv2 paper)
  1. Batch Normalization : Adding batch normalization to all the convolutional layers improved the mAP by 2%. Batch normalization also helped regularize the model and thus reduced overfitting.
  2. High Resolution Classifier : Since AlexNet, most classifiers have operated on input images smaller than 256 × 256 (Inception operates at 299 × 299 and NASNet at 331 × 331, but NASNet did not exist in 2016). YOLOv2 increased the 224 × 224 input size to 448 × 448 while fine-tuning the DarkNet classifier on the ImageNet dataset. This increased the mAP by almost 4%.
  3. Anchor Boxes : YOLOv2 introduced anchor boxes, like the RPN in Faster R-CNN. This slightly reduced the overall accuracy but increased recall, which means the model has more room to improve: it can now predict far more bounding boxes per image (98 in YOLO versus more than a thousand in YOLOv2). Instead of hand-picking the priors, YOLOv2 uses k-means clustering on the training set's bounding boxes to define the dimensions of the anchor boxes (see the k-means sketch after this list).
  4. Fine-Grained Features : YOLOv2 predicts detections on a 13 × 13 feature map, finer than YOLOv1's 7 × 7 grid, and adds a passthrough layer that brings in higher-resolution 26 × 26 features from an earlier layer. This helps localize small objects while remaining effective for large ones.
  5. Multi-Scale Training : YOLO performed poorly when detecting objects at different input scales (e.g. a model trained on large images of an object would struggle to detect small images of the same object). So during YOLOv2's training, the network randomly chooses a new input dimension every few batches, with a minimum of 320 × 320 and a maximum of 608 × 608 (see the resizing sketch after this list).
  6. Darknet-19 : YOLOv2 uses a new classification model as its backbone feature extractor.

The Darknet-19 architecture has 19 convolutional layers and 5 max-pooling layers, with a softmax layer on top of the last convolutional layer for classification. Similar to VGG, it mostly uses 3 × 3 filters. Darknet-19 achieves a top-1 accuracy of 72.9% and a top-5 accuracy of 91.2% on the ImageNet dataset.
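Here is the k-means sketch promised in point 3. YOLOv2 clusters the (width, height) pairs of the training set's boxes, but with the distance d(box, centroid) = 1 − IOU(box, centroid) instead of Euclidean distance, so that large boxes don't dominate the clustering. This is a minimal NumPy rendition of that idea; the function names are mine, and the `boxes` array is fake data standing in for a real training set.

```python
import numpy as np

def iou_wh(boxes, centroids):
    """IOU between (w, h) pairs, as if all boxes shared one corner."""
    inter = (np.minimum(boxes[:, None, 0], centroids[None, :, 0]) *
             np.minimum(boxes[:, None, 1], centroids[None, :, 1]))
    union = (boxes[:, 0] * boxes[:, 1])[:, None] \
          + (centroids[:, 0] * centroids[:, 1])[None, :] - inter
    return inter / union

def kmeans_anchors(boxes, k=5, iters=50, seed=0):
    rng = np.random.default_rng(seed)
    centroids = boxes[rng.choice(len(boxes), size=k, replace=False)]
    for _ in range(iters):
        assign = (1.0 - iou_wh(boxes, centroids)).argmin(axis=1)  # d = 1 - IOU
        centroids = np.array([boxes[assign == i].mean(axis=0)
                              if (assign == i).any() else centroids[i]
                              for i in range(k)])
    return centroids

# Fake (width, height) pairs standing in for a real training set's boxes.
boxes = np.abs(np.random.default_rng(1).normal(0.3, 0.15, size=(1000, 2))) + 0.01
print(kmeans_anchors(boxes, k=5).round(3))
```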
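And the resizing schedule from point 5: because the network is fully convolutional, the input resolution can be changed on the fly. The paper picks a new size every 10 batches, a multiple of 32 between 320 and 608. A toy sketch of that schedule (the generator structure is mine):

```python
import random

SIZES = list(range(320, 608 + 1, 32))    # {320, 352, ..., 608}: all multiples of 32

def multi_scale_schedule(num_batches, every=10):
    """Yield the square input resolution to use for each training batch."""
    size = 416                            # starting resolution, re-picked as we go
    for batch in range(num_batches):
        if batch % every == 0:
            size = random.choice(SIZES)   # draw a new resolution every `every` batches
        yield size

print(list(multi_scale_schedule(30))[::10])   # the sizes used at batches 0, 10, 20
```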

So, which YOLOv1 problems got solved?

  1. Using Multi-Scale training, the model learned to generalize and detect objects with different aspect ratios or configurations.
  2. Fine-grained features helped improve the average precision for small objects, which is now on par with SSD300 but still lags behind the Faster R-CNN models.
  3. As the total mAP increased (from 63.4 to 78.6 on the Pascal VOC dataset), you can say the localization errors were reduced to some extent. The mAP at 0.5 IOU still lags behind the best Faster R-CNN on COCO: 45.3 for Faster R-CNN versus 44.0 for YOLOv2.

YOLO V3 !

(Not just an Incremental Improvement !)

Paper : https://pjreddie.com/media/files/papers/YOLOv3.pdf

2017 witnessed a real fight for the best object detection model, with RetinaNet (another one-stage detector), Faster R-CNN with FPN and a ResNeXt backbone, and Mask R-CNN with a ResNeXt backbone all competing, and RetinaNet with the ResNeXt backbone topping the charts with an mAP of 61 on the COCO dataset at 0.5 IOU. RetinaNet, being a one-stage detector, was also faster than the rest. With no new version of YOLO in 2017, 2018 brought that best RetinaNet and now YOLOv3! The paper, again by Joseph Redmon and Ali Farhadi, is titled YOLOv3: An Incremental Improvement. It brings the fast YOLO family on par with the best accuracies: YOLOv3 reaches an mAP of 57.9 on the COCO dataset at 0.5 IOU. For comparisons, just refer to the table below:

Mean Average Precision Comparisons 2018! (Source : YOLOv3 paper)

Now you can observe that 57.9 is on par with all the two-stage detectors. YOLOv3-608 (the best YOLO, with the highest-resolution input images) is still almost 4× faster than the best RetinaNet and 2× faster than the second-best RetinaNet. YOLOv3-320 has the same accuracy as RetinaNet with a ResNet-50 backbone while being about 4× faster. This makes YOLOv3 clearly very efficient for any general object detection use case.

What changed ? What are the so-called Incremental Improvements?

  1. Bounding Box Predictions : YOLOv3, just like YOLOv2, uses dimension clusters to generate its bounding box priors (anchor boxes). Since YOLOv3 is a single network, the objectness and classification losses are calculated separately, but from the same network. YOLOv3 predicts an objectness score for each bounding box using logistic regression: the score should be 1 for the prior that overlaps a ground-truth object more than any other prior. Unlike Faster R-CNN, YOLOv3 assigns only 1 bounding box prior to each ground-truth object. A prior that is not the best but still overlaps a ground truth by more than a threshold (0.5) is ignored and incurs no loss, and a prior not assigned to any object incurs only the objectness (detection) loss, not the coordinate or classification loss. This assignment rule is sketched in code after point 3 below.
  2. Class Predictions : YOLOv3 uses independent logistic classifiers for each class instead of a regular softmax layer. This turns classification into multi-label classification. What does that mean, and how does it add value? Take an example where a woman is shown in the picture and the model is trained on both person and woman: a softmax would divide the class probability between these 2 classes, with say 0.4 and 0.45 probabilities. Independent classifiers solve this issue by giving a yes-vs-no probability for each class: the probability that there is a woman in the picture might be 0.8, the probability that there is a person might be 0.9, and we can label the object as both person and woman (a numeric illustration follows after point 3 below).
  3. Predictions Across Scales : To support detection at varying scales, YOLOv3 predicts boxes at 3 different scales, then extracts features from each scale using a method similar to feature pyramid networks. What's the method? I will quote the paper for this,

We take the feature map from 2 layers previous and upsample it by 2×. We also take a feature map from earlier in the network and merge it with our upsampled features using concatenation. This method allows us to get more meaningful semantic information from the upsampled features and finer-grained information from the earlier feature map. We then add a few more convolutional layers to process this combined feature map, and eventually predict a similar tensor, although now twice the size. We perform the same design one more time to predict boxes for the final scale. Thus our predictions for the 3rd scale benefit from all the prior computation as well as fine-grained features from early on in the network.

Thus, YOLOv3 gains the ability to predict boxes at varying scales using the method above (a PyTorch sketch of the merge step follows below). The bounding box priors generated using dimension clusters are divided across the 3 scales, 3 priors per scale, for a total of 9 bounding box priors.
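To make point 1 concrete, here is a minimal NumPy sketch of the prior-assignment rule as I read it from the paper: the prior with the best overlap against a ground-truth box is the positive, non-best priors whose overlap still exceeds the 0.5 threshold are ignored entirely, and the rest are negatives that incur only objectness loss. The function name and the IOU values are made up for illustration.

```python
import numpy as np

def objectness_targets(iou, ignore_thresh=0.5):
    """iou: (num_priors,) overlap of each prior's prediction with one ground-truth box.
    Returns the objectness target per prior, plus a mask of priors that
    contribute to the objectness loss at all (ignored priors do not)."""
    target = np.zeros_like(iou)
    best = iou.argmax()
    target[best] = 1.0                    # best-overlapping prior is the positive
    contributes = iou <= ignore_thresh    # non-best, high-IOU priors are ignored
    contributes[best] = True              # ...but the positive always contributes
    return target, contributes

iou = np.array([0.10, 0.55, 0.80, 0.30])  # made-up overlaps for 4 priors
target, mask = objectness_targets(iou)
print(target)   # [0. 0. 1. 0.]               -> prior 2 is assigned to the object
print(mask)     # [ True False  True  True]   -> prior 1 (IOU 0.55) is ignored
```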
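For point 2, a numeric illustration of why independent logistic classifiers help with overlapping labels such as person and woman. The logits are invented; the point is that softmax forces the classes to compete for probability mass, while sigmoids score each class on its own:

```python
import numpy as np

logits = np.array([2.2, 1.4, -1.0])      # made-up scores for: person, woman, dog

softmax = np.exp(logits) / np.exp(logits).sum()
sigmoid = 1 / (1 + np.exp(-logits))

print("softmax :", softmax.round(2))     # ~[0.67 0.30 0.03], sums to 1: classes compete
print("sigmoid :", sigmoid.round(2))     # ~[0.90 0.80 0.27]: person AND woman both high
```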
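And for point 3, a minimal PyTorch sketch of the merge step the quote describes: upsample the coarse feature map by 2× and concatenate it with an earlier, higher-resolution map along the channel dimension. The layer shapes here are made up, not Darknet-53's actual ones.

```python
import torch
import torch.nn.functional as F

coarse = torch.randn(1, 256, 13, 13)      # deep, semantically strong feature map
earlier = torch.randn(1, 128, 26, 26)     # earlier, finer-grained feature map

up = F.interpolate(coarse, scale_factor=2, mode="nearest")   # 13x13 -> 26x26
merged = torch.cat([up, earlier], dim=1)                     # channel-wise concat
print(merged.shape)                       # torch.Size([1, 384, 26, 26])
```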

4. Feature Extractor : YOLOv2 used Darknet-19 as its backbone feature extractor; YOLOv3 uses a new network, Darknet-53! Darknet-53 has 53 convolutional layers, so it is deeper than YOLOv2's backbone, and it also has residual (shortcut) connections. It is more powerful than Darknet-19 and more efficient than ResNet-101 or ResNet-152.

ImageNet Results (Source : YOLOv3 paper)

As you can see, Darknet-53 is better than ResNet-101 and 1.5× faster, and it is about as accurate as ResNet-152 while being 2× faster.
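To make the residual (shortcut) connections concrete, here is a minimal PyTorch sketch of a Darknet-53-style residual block: a 1×1 convolution halves the channels, a 3×3 convolution restores them, each followed by batch normalization and leaky ReLU, and the block's input is added back at the end. The exact hyperparameters (e.g. the leaky slope of 0.1) follow the usual Darknet convention but are assumptions here, not taken from the paper.

```python
import torch
import torch.nn as nn

class DarknetResidual(nn.Module):
    """Darknet-53-style residual block: 1x1 conv halves the channels,
    3x3 conv restores them, then a shortcut adds the input back."""
    def __init__(self, channels):
        super().__init__()
        half = channels // 2
        self.block = nn.Sequential(
            nn.Conv2d(channels, half, kernel_size=1, bias=False),
            nn.BatchNorm2d(half),
            nn.LeakyReLU(0.1, inplace=True),
            nn.Conv2d(half, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.LeakyReLU(0.1, inplace=True),
        )

    def forward(self, x):
        return x + self.block(x)          # shortcut connection

x = torch.randn(1, 64, 52, 52)
print(DarknetResidual(64)(x).shape)       # torch.Size([1, 64, 52, 52])
```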

What improved ?

  1. The average precision for small objects improved; it is now better than that of Faster R-CNN, though RetinaNet is still better here.
  2. As mAP increased, localization errors decreased.
  3. Predictions at different scales or aspect ratios for the same object improved, thanks to the addition of the feature-pyramid-like method (they should have given it a name).
  4. And mAP increased significantly.

What can be improved (YOLOv4 expectations)?

  1. The average precision for medium and large objects can be improved, as YOLOv3 trails the best by about 5 percent on medium objects and 10 percent on large ones.
  2. The mAP score between 0.5 and 0.95 IOU can be increased.
  3. The implementations of Darknet-53 and YOLO are currently in C; maybe a Python implementation is next.

So what’s the conclusion ?

YOLOv3 is fast and on par in accuracy with the best two-stage detectors (at 0.5 IOU), and this makes it a very powerful object detection model. Applications of object detection in domains like media, retail, manufacturing and robotics need the models to be very fast (a small compromise on accuracy is okay), and YOLOv3 is also very accurate. This makes it the best model to choose for applications where speed is important, either because the products need to be real-time or because the data is just too big. Some other applications like security or autonomous driving require very high accuracy because of the sensitive nature of the domain (you don't want people dying, right?), so maybe we cannot use YOLOv3 there. Very good accuracy with the best speed makes YOLOv3 the go-to object detection model, at least for now (29th March 2018)!

References:

YOLOv1 : https://arxiv.org/abs/1506.02640

YOLOv2 : https://arxiv.org/abs/1612.08242

YOLOv3 : https://pjreddie.com/media/files/papers/YOLOv3.pdf

DarkNet : https://pjreddie.com/darknet/
