Fast R-CNN was introduced in April 2015. It was faster than R-CNN, but still not fast enough for real-time applications. Faster R-CNN was introduced by Shaoqing Ren et al. in June of the same year, and it is much faster than Fast R-CNN. The research paper can be found here.
2. First Look at Faster R-CNN
First, we will learn the high-level working of Faster R-CNN using Figure 1. Notice that most of the network works very much like Fast R-CNN (except the RPN). Keep an eye on Figure 1 while reading the following steps.
- The input image is first fed to a convolutional neural network, which extracts features and generates a feature map. This feature map is fed to the region proposal network (RPN).
- Anchors are generated for the input image (not shown in Figure 1). Anchors are fixed-size bounding boxes that try to cover objects of different sizes and aspect ratios. These anchors span the entire image to cover all the objects present in it. More details about anchors will be covered in the next section.
- In Faster R-CNN, the RPN is used to generate object proposals, also called region proposals. The RPN is itself a CNN. The output of its initial layers goes into two branches. The first branch performs objectness classification: it gives two probabilities indicating whether or not any object is present in a given anchor. The second branch is a bounding box regressor, which adjusts the boundaries of the anchors to fit the objects inside them (if any object is present at all).
- These object proposals generated by the RPN are then projected onto the feature map generated in the first step.
- RoI pooling is used to extract a fixed-length feature vector corresponding to each object proposal. This process is explained in full detail in my previous story on Fast R-CNN here.
- These feature vectors are given as input to fully connected layers, whose output then goes into two branches: one for multiclass classification and the other for bounding box regression. Remember that this bounding box regressor is different from the bounding box regressor in the RPN.
- The multiclass classification branch uses fully connected layers and a softmax layer to classify each object proposal into one of the object categories (classes). The image background is also treated as one of the categories.
- The bounding box regressor branch further refines the boundaries of the bounding boxes to fit the boundaries of the object present inside them.
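The steps above can be sketched as a single forward pass. This is a minimal illustration, not the paper's actual code; all the helper names (`backbone`, `rpn`, `roi_pool`, `head`) are hypothetical placeholders for the components described in the bullets:

```python
# Minimal sketch of the Faster R-CNN forward pass described above.
# All component names are hypothetical placeholders, not a real API.
def faster_rcnn_forward(image, backbone, rpn, roi_pool, head):
    feature_map = backbone(image)            # shared CNN extracts a feature map
    proposals = rpn(feature_map)             # RPN scores anchors and refines them
    rois = roi_pool(feature_map, proposals)  # fixed-length vector per proposal
    class_probs, box_deltas = head(rois)     # multiclass softmax + second regressor
    return class_probs, box_deltas
```

The point of the sketch is the data flow: the feature map is computed once and shared by both the RPN and the detection head.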
In a nutshell, for generating object proposals, Fast R-CNN uses selective search, whereas Faster R-CNN uses an RPN. Faster R-CNN can be thought of as a combination of an RPN and the Fast R-CNN detector.
This completes our high level understanding of Faster R-CNN. If you want to go deeper, follow along……
3. More insights into Faster R-CNN
A broad overview of Faster R-CNN is given by Figure 2.
To put everything in context …..
Faster R-CNN generates a large number of anchor boxes of different sizes and aspect ratios all over the image, to accommodate all the objects present in it. First, a pretrained CNN like ResNet-101 or VGG16 is used to generate a feature map for the input image. The RPN checks which of the anchors really contain objects and modifies their boundaries to fit the objects inside them. An object may be present in more than one overlapping anchor, so non-maximum suppression (NMS) is used to keep only the best-fitting anchor and reject the rest. These selected anchors (object proposals) are used to extract fixed-length feature vectors from the feature map using RoI pooling. These feature vectors are then used for multiclass classification, to identify the object present in the corresponding object proposal, and for bounding box regression, to refine the boundaries of the bounding box to precisely fit the object.
I hope this makes sense.
3.1 Anchors and Region Proposal Network
OK. So I will try to oversimplify it………
The most basic task CNNs can effectively perform is to classify images. All algorithms in the R-CNN family are very clever ways of extending this capacity of CNNs to detect objects in an image.
So the most logical thing to do is the following two steps.
- Extract the regions in an image that contain an object.
- Then use CNN on each of those regions to identify the object inside them, just like we do for basic classification task.
But how can we get the regions (anchors, in our case) in an image that contain objects? We could try all possible combinations of anchor height and width, but that would leave us with an insane number of anchors for one image. We can draw an infinite number of rectangles on an image without repetition, can't we? (By the way, before R-CNN, some methods actually did something similar to that! Can you believe it? I don't want to confuse you by giving more details, but it was called exhaustive search, and it was very, very slow. Hence selective search was proposed. But selective search was also not fast enough, and so …… stop!)
Instead, Faster R-CNN does it in a clever way…..
Consider an image of typical size 600 x 1000 pixels. We will select locations on this image with a spacing (also called stride) of 16 pixels. Thus we get locations in 1000/16 ≈ 62 columns and 600/16 ≈ 37 rows, as shown in Figure 3.
Figure 3 contains an image where we choose locations in a grid of 62 columns and 37 rows. Thus we have 37 x 62 = 2294 locations. At each location, we will consider 9 anchors with three sizes and three aspect ratios.
As shown in Figure 4, the anchors have three scales with box areas of 128x128 (shown in red), 256x256 (shown in orange) and 512x512 (shown in green). For every scale there are three anchors with aspect ratios of 1:1, 1:2 and 2:1.
In Figure 3 we have 2294 locations, and for every location we consider 9 anchor boxes centered at that location, as shown in Figure 4. Thus, in all, we will have 2294 x 9 = 20646 anchors.
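The grid-and-anchor arithmetic above can be sketched in a few lines. This is a minimal illustration of the counting, assuming a 600 x 1000 image and stride 16; the exact anchor coordinates in real implementations differ in small details:

```python
import numpy as np

def generate_anchors(img_h=600, img_w=1000, stride=16,
                     scales=(128, 256, 512), ratios=(1.0, 0.5, 2.0)):
    """Generate 9 anchors (3 scales x 3 aspect ratios) at every
    stride-spaced location on the image."""
    anchors = []
    for row in range(img_h // stride):        # 600 // 16 = 37 rows
        for col in range(img_w // stride):    # 1000 // 16 = 62 columns
            cx, cy = col * stride, row * stride
            for s in scales:
                for r in ratios:
                    # keep box area roughly s*s while varying aspect ratio
                    w = s * np.sqrt(r)
                    h = s / np.sqrt(r)
                    anchors.append([cx - w / 2, cy - h / 2,
                                    cx + w / 2, cy + h / 2])
    return np.array(anchors)

anchors = generate_anchors()
# 37 rows x 62 columns x 9 anchors per location = 20646 anchors
```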
Now a very commonsensical (it is a valid word according to Cambridge) question to ask is: what if there are objects in the image that do not fit exactly into any of those 20646 anchors? The answer is, firstly, the anchor sizes are chosen such that objects of different shapes and sizes fit approximately into one of those anchors, and secondly, we use a bounding box regressor twice in our network to refine those anchor boxes as per the object boundaries.
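That refinement is usually parameterized as four offsets (tx, ty, tw, th) that shift the anchor's center and scale its size, in the style used throughout the R-CNN family of papers. A minimal sketch of applying such offsets to one anchor:

```python
import math

def apply_box_deltas(anchor, deltas):
    """Apply regression offsets (tx, ty, tw, th) to an (x1, y1, x2, y2) anchor.

    tx, ty shift the center as a fraction of the anchor's width/height;
    tw, th scale the width/height through an exponential.
    """
    x1, y1, x2, y2 = anchor
    tx, ty, tw, th = deltas
    w, h = x2 - x1, y2 - y1
    cx, cy = x1 + w / 2, y1 + h / 2
    new_cx = cx + tx * w
    new_cy = cy + ty * h
    new_w = w * math.exp(tw)
    new_h = h * math.exp(th)
    return (new_cx - new_w / 2, new_cy - new_h / 2,
            new_cx + new_w / 2, new_cy + new_h / 2)
```

With all-zero deltas the anchor is returned unchanged; a nonzero tx slides the box horizontally by tx times its width.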
As you can easily imagine, for locations near the border of the image, some of the corresponding anchor boxes will extend outside the image. These are called cross-boundary anchors. When we remove the cross-boundary anchors from all 20646 anchors, we are left with around 6000 anchors.
Note that the cross-boundary anchors are removed only during training. During testing, the cross-boundary anchors are clipped to the image boundary.
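The two treatments of cross-boundary anchors can be sketched as follows (a simple illustration with boxes as `(x1, y1, x2, y2)` tuples):

```python
def filter_cross_boundary(anchors, img_h, img_w):
    """Training: keep only anchors that lie entirely inside the image."""
    return [a for a in anchors
            if a[0] >= 0 and a[1] >= 0 and a[2] <= img_w and a[3] <= img_h]

def clip_to_image(anchors, img_h, img_w):
    """Testing: clamp each anchor's coordinates to the image boundary."""
    return [(max(a[0], 0), max(a[1], 0), min(a[2], img_w), min(a[3], img_h))
            for a in anchors]
```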
The objectness classification branch of RPN, then calculates the objectness score of each anchor, which indicates its possibility of containing an object. The bounding box regressor refines the anchor to fit to the object.
A positive label is assigned to two kinds of anchors:
- The anchor(s) with the highest IoU with a ground-truth box.
- The anchors with an IoU greater than 0.7 with any ground-truth box.
We assign a negative label to a non-positive anchor if its IoU is lower than 0.3 for all ground-truth boxes.
The rest of the anchors do not contribute to training.
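The labeling rules above can be sketched directly (a minimal illustration using the 0.7/0.3 thresholds; real implementations batch this with array operations):

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def label_anchors(anchors, gt_boxes, pos_thresh=0.7, neg_thresh=0.3):
    """Label anchors: 1 = positive, 0 = negative, -1 = ignored in training."""
    ious = [[iou(a, g) for g in gt_boxes] for a in anchors]
    labels = []
    for row in ious:
        best = max(row)
        if best > pos_thresh:
            labels.append(1)          # high overlap with some ground truth
        elif best < neg_thresh:
            labels.append(0)          # low overlap with every ground truth
        else:
            labels.append(-1)         # in between: does not contribute
    # also mark the best-overlapping anchor for each ground truth as positive
    for j in range(len(gt_boxes)):
        best_i = max(range(len(anchors)), key=lambda i: ious[i][j])
        labels[best_i] = 1
    return labels
```

The second loop is what guarantees every ground-truth box gets at least one positive anchor, even if no anchor clears the 0.7 threshold.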
3.3 Non-maximum suppression (NMS)
Non-maximum suppression is a technique used to reduce the number of candidate boxes by eliminating boxes that overlap by an amount larger than a threshold.
Some of the anchors try to enclose the same object and have high overlap with each other. NMS keeps the proposal with the highest objectness score and rejects every overlapping proposal whose IoU with it is greater than a threshold (typically 0.7). This leaves us with around 2000 region proposals to be processed by the Fast R-CNN detector.
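A minimal greedy NMS sketch, assuming boxes as `(x1, y1, x2, y2)` tuples with one objectness score each:

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, iou_thresh=0.7):
    """Greedy NMS: repeatedly keep the highest-scoring box and drop
    every remaining box that overlaps it more than iou_thresh."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        i = order.pop(0)
        keep.append(i)
        order = [j for j in order if iou(boxes[i], boxes[j]) <= iou_thresh]
    return keep
```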
Figure 5 shows region proposals produced by the region proposal network. The objectness score of each region proposal is also shown.
(Figure 5 is taken from the original research paper. Though it labels every region with the name of an object, I don't think class names are available at this stage in the network. Please let me know your views in the comment section.)
3.4 RoI Pooling
RoI pooling works just like the way it used to work in Fast R-CNN.
The input images are resized such that the shorter side becomes 600 pixels. Faster R-CNN can accept an input image of any size, as long as the longer side does not exceed 1000 pixels when the shorter side is resized to 600 pixels.
The convolutional layers have no problem with this variable image size. But their output is of different size for different input sizes, and the fully connected layers that follow do not accept variable-size input. So it is necessary to bring the output of the convolutional layers to a fixed size before feeding it to the fully connected layers. This is done by RoI pooling.
Suppose we want to resize the output of the convolutional layers corresponding to a region proposal to H x W. Then we divide the corresponding feature map into a grid of H x W bins, each of size h/H x w/W, where h and w are the height and width of the feature map region corresponding to the region proposal. Then max pooling is applied in each bin to get a resized feature map of size H x W.
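A minimal sketch of that binning-and-max-pooling step on a single-channel region (real implementations pool every channel and handle proposal-to-feature-map projection as well):

```python
import numpy as np

def roi_pool(region, H=2, W=2):
    """Max-pool an (h, w) feature map region into a fixed (H, W) grid."""
    h, w = region.shape
    out = np.zeros((H, W))
    for i in range(H):
        for j in range(W):
            # bin boundaries; floor/ceil handles h, w not divisible by H, W
            y0, y1 = int(np.floor(i * h / H)), int(np.ceil((i + 1) * h / H))
            x0, x1 = int(np.floor(j * w / W)), int(np.ceil((j + 1) * w / W))
            out[i, j] = region[y0:y1, x0:x1].max()
    return out

region = np.arange(16).reshape(4, 4)   # a 4x4 region of the feature map
pooled = roi_pool(region, H=2, W=2)    # always 2x2, whatever the input size
```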
3.5 Sharing Features with RPN and Fast R-CNN
In Faster R-CNN, training is single-stage with a multi-task loss. The anchors are labeled as discussed in section 3.2. The loss function is a combination of classification loss and bounding box regression loss.
Faster R-CNN has two networks: the RPN and Fast R-CNN. If we train them independently, their layers will be modified in different ways. But we want Faster R-CNN to be a unified network with shared convolutional layers.
The research paper uses the approach of alternating training for all its experiments. In alternating training, the RPN is trained first and generates region proposals. Then Fast R-CNN is trained on those proposals. The network tuned by Fast R-CNN is then used to initialize the RPN, and so on. This achieves our goal of a unified network with shared features.
3.6 Final Multi-class Classification
RoI feature vectors obtained from RoI pooling are then passed through a sequence of fully connected layers. The output goes into two sibling branches: one for multiclass classification and the other containing a bounding box regressor, as can be seen in Figure 1.
The multiclass classification branch predicts the probabilities of the corresponding RoI belonging to each of the object classes or to the 'background' class. In short, this branch performs multiclass classification: it finds out which object is present in the corresponding region of interest.
The bounding box regressor branch refines the boundaries of bounding boxes to fit the boundaries of object present inside them.
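The final classification step amounts to a softmax over per-class scores, with background as one of the classes. A minimal illustration (the class names and scores below are made up for the example):

```python
import math

def softmax(scores):
    """Convert raw class scores to probabilities (numerically stable)."""
    m = max(scores)                     # subtract max to avoid overflow
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# hypothetical scores for the background class plus two object classes
classes = ["background", "person", "car"]
probs = softmax([0.5, 3.0, 1.0])
predicted = classes[probs.index(max(probs))]
```

Here the RoI is assigned the class with the highest probability; if that class is 'background', the proposal is discarded rather than reported as a detection.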
- The computation of proposals in Faster R-CNN is nearly cost-free compared to Fast R-CNN.
- Faster R-CNN is a unified object detection algorithm with near real time frame rates.
- With VGG16 as the backbone network, it takes just 198 ms for object detection, including all steps.