R-CNN is slow. With a VGG16 backbone, R-CNN takes 47 seconds to detect objects in a single image at test time, which makes it unsuitable for low-latency applications. So in April 2015 Ross Girshick, who was also one of the authors of R-CNN, single-handedly proposed a better algorithm called Fast R-CNN. The research paper can be found here. In this write-up, we will get down to the nitty-gritty of Fast R-CNN. But before that, I highly recommend reading my previous Medium story on R-CNN here, as I will use many terms and concepts from R-CNN and assume that you know them well.
CNNs are inherently capable of classifying images. It is helpful to look at R-CNN and Fast R-CNN as methods that extend this inherent capacity of CNNs to detecting objects in an image. Detecting objects means finding the class of each object along with its actual location in the image, so it involves both classification and localization.
Suppose we have three classes of objects: mountain, sea and building. A CNN can very easily tell us whether an image contains a mountain or not, or whether it contains a building or not. But if we have a sea, a mountain and a building all in the same image, then a CNN cannot directly be used to detect all three objects. In such a case, we can extract regions from the image that may contain an object and then apply the CNN to those regions separately to check whether any of them contains a sea, a mountain or a building. In a nutshell, that’s what R-CNN does. Those extracted regions (generally obtained using selective search), which were called region proposals in R-CNN, are called object proposals or regions of interest (RoIs) in Fast R-CNN. Fast R-CNN carries out the entire object detection process more intelligently, which gives better performance in terms of speed and memory requirements.
2. First Look at Fast R-CNN
Fast R-CNN first computes the feature map for the entire input image and then, from that feature map, extracts a feature vector for each RoI using RoI pooling. These feature vectors are then used for classification and localization. Keep an eye on Figure 1 while reading the following steps.
- The input image, along with a set of object proposals (also called regions of interest), is given as input to the convolutional neural network (CNN).
- The CNN, which contains several convolutional and max pooling layers, produces a feature map for the input image as shown in Figure 1. This way we get one feature map for the entire input image, and it is much smaller than the input image because of the strides in the network.
- The next stage is the RoI pooling layer. It is different from a regular max pooling layer, and its job is to extract fixed-length feature vectors from the feature map. Thus we get one feature vector for every region of interest, which can then be used for classification and localization.
- These RoI feature vectors then pass through a sequence of fully connected layers. The output goes into two sibling output layers (branches): one branch containing a softmax layer and the other containing a bounding box regressor, as can be seen in Figure 1.
- The softmax branch predicts the probabilities of the corresponding RoI belonging to each of the object classes or to the ‘background’ class. In short, this branch performs classification: it finds out which object is present in the corresponding region of interest.
- The bounding box regressor branch predicts the refined location of the object in terms of r, c, h and w, where (r, c) specifies the coordinates of the top-left corner of the bounding box for the corresponding object and (h, w) are the height and width of the bounding box. This stage is necessary because the RoIs given by selective search are not perfect and need refinement.
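The two sibling branches can be illustrated with a small NumPy sketch. This is only a toy illustration, not the author's implementation: the function names `softmax` and `refine_box` are mine, the scores are made-up numbers, and the offset parameterization (a shift of the box center scaled by the box size, plus a log-space change in height and width) is the one I am assuming from the R-CNN family of papers.

```python
import numpy as np

def softmax(scores):
    """Turn raw class scores (one row per RoI) into probabilities."""
    shifted = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def refine_box(roi, deltas):
    """Refine an RoI given as (r, c, h, w) with offsets (dx, dy, dw, dh).

    (r, c) is the top-left corner (row, column); (h, w) is height and width.
    Assumed parameterization: dx, dy shift the box center in units of the
    box width and height; dw, dh rescale width and height in log space.
    """
    r, c, h, w = roi
    dx, dy, dw, dh = deltas
    center_r = r + 0.5 * h
    center_c = c + 0.5 * w
    new_center_c = center_c + dx * w
    new_center_r = center_r + dy * h
    new_w = w * np.exp(dw)
    new_h = h * np.exp(dh)
    return (new_center_r - 0.5 * new_h, new_center_c - 0.5 * new_w, new_h, new_w)

# Hypothetical scores for one RoI over the classes used earlier in this story.
classes = ["background", "mountain", "sea", "building"]
scores = np.array([[0.2, 3.1, 0.5, 1.0]])
probs = softmax(scores)
predicted_class = classes[int(probs.argmax(axis=1)[0])]
```

With all offsets set to zero, `refine_box` returns the RoI unchanged, which is a handy sanity check.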
This completes our high level understanding of Fast R-CNN. If you want to go deeper, follow along……
3. More insights into Fast R-CNN
According to Wikipedia:
While the original R-CNN independently computed the neural network features on each of as many as two thousand regions of interest, Fast R-CNN runs the neural network once on the whole image. At the end of the network is a novel method called ROIPooling, which slices out each ROI from the network’s output tensor, reshapes it, and classifies it. As in the original R-CNN, the Fast R-CNN uses Selective Search to generate its region proposals.
This is exactly what we learned in the previous section.
3.1 Initializing Fast R-CNN from pretrained network
When a pretrained network like VGG16 is used to initialize Fast R-CNN, it is modified in the following three ways. Again, keep an eye on Figure 1 while reading the following points.
- The last max pooling layer is replaced by an RoI pooling layer.
- The last fully connected layer and softmax layer are replaced with the sibling branches of softmax and bounding box regressor.
- The network is modified to take two inputs: images and RoIs.
A few of VGG16's last layers are then fine-tuned for transfer learning.
3.2 How does RoI pooling work?
For Fast R-CNN, input images are resized so that the shorter side becomes 600 pixels. Fast R-CNN can accept input images of any size, provided that the longer side does not exceed 1000 pixels once the shorter side has been resized to 600 pixels. In the paper, this is enforced by capping the longer side at 1000 pixels while keeping the aspect ratio.
The convolutional layers do not have any problem with this variable image size. But the output of the convolutional layers has a different size for different input image sizes, and the fully connected layers that follow do not accept input of variable size. So it is necessary to bring the output of the convolutional layers to a fixed size before feeding it to the fully connected layers. This is done by RoI pooling.
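A minimal sketch of this resizing rule is given below. It assumes that when the longer side would exceed 1000 pixels, the scale factor is reduced so that the cap holds and the aspect ratio is kept. The function name `fast_rcnn_scale` and its parameter names are mine.

```python
def fast_rcnn_scale(height, width, target_short=600, max_long=1000):
    """Return the factor by which an image should be resized.

    The shorter side is scaled to `target_short` pixels, unless that would
    push the longer side past `max_long`, in which case the longer side is
    scaled to `max_long` instead.
    """
    short_side = min(height, width)
    long_side = max(height, width)
    scale = target_short / short_side
    if scale * long_side > max_long:
        scale = max_long / long_side
    return scale

# A 480 x 640 image: the shorter side reaches 600, the longer side becomes 800.
scale_a = fast_rcnn_scale(480, 640)
# A 300 x 900 image: scaling the shorter side to 600 would give 1800,
# so the longer side is capped at 1000 instead.
scale_b = fast_rcnn_scale(300, 900)
```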
Let's take an example.
Suppose we give an image and two object proposals as input to the CNN, and the CNN produces a feature map of size 21 x 16 as shown in Figure 2. There are two boxes (red and blue) in Figure 2, which correspond to the two object proposals. These boxes have different sizes, and they are much smaller than their corresponding object proposals because of stride. The aim of RoI pooling is to resize them to the same, smaller size. In our example we will resize both of them to a fixed size of 3 x 3, so here H = 3 and W = 3. How this is done is very interesting.
The first step is to divide the RED RoI into a grid of 3 x 3 = 9 bins. The RED RoI is of size 18 x 10 as shown in Figure 2. Because we want to resize it to H x W, we create bins of height 18/3 = 6 and width 10/3 = 3 (rounded down). Please note that 10 is not a multiple of 3, and hence the bins in the last column of the grid are wider. On these 9 bins we apply max pooling (take the maximum value from each bin) and we get a 3 x 3 resized feature map, as shown on the right.
For the BLUE RoI too, the first step is to divide it into a grid of 3 x 3 = 9 bins. The BLUE RoI is of size 16 x 13 as shown in Figure 2. Because we want to resize it to H x W, we create bins of height 16/3 = 5 and width 13/3 = 4 (both rounded down). Please note that 16 and 13 are not multiples of 3, and hence the bins in the last row and last column of the grid are taller and wider respectively. On these 9 bins we apply max pooling (take the maximum value from each bin) and we get a 3 x 3 resized feature map, as shown on the right.
This is how, in spite of having RoIs of variable size, we are always able to resize them to the same size. Here we have resized them to 3 x 3, but in the actual research paper, with VGG16 as the CNN, the author resizes them to 7 x 7 using RoI pooling. Thus H = W = 7. The resized RoIs are then flattened and fed to the fully connected layers, whose output goes to the two sibling layers as shown in Figure 1.
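The RED and BLUE example can be reproduced with a short NumPy sketch. It follows the binning described in this story (bin size rounded down, with the last row and column absorbing the remainder) rather than any particular library implementation. The function name `roi_max_pool`, the RoI positions and the feature map values are hypothetical, since Figure 2 only fixes the sizes.

```python
import numpy as np

def roi_max_pool(feature_map, top, left, height, width, out_h=3, out_w=3):
    """Max-pool one RoI of a 2-D feature map down to out_h x out_w.

    Assumes height >= out_h and width >= out_w.
    """
    roi = feature_map[top:top + height, left:left + width]
    bin_h = height // out_h   # e.g. 18 // 3 = 6 or 16 // 3 = 5
    bin_w = width // out_w    # e.g. 10 // 3 = 3 or 13 // 3 = 4
    pooled = np.empty((out_h, out_w), dtype=feature_map.dtype)
    for i in range(out_h):
        r0 = i * bin_h
        r1 = height if i == out_h - 1 else r0 + bin_h  # last row takes the rest
        for j in range(out_w):
            c0 = j * bin_w
            c1 = width if j == out_w - 1 else c0 + bin_w  # last column takes the rest
            pooled[i, j] = roi[r0:r1, c0:c1].max()
    return pooled

# A made-up 21 x 16 feature map (21 rows, 16 columns).
feature_map = np.arange(21 * 16).reshape(21, 16)

# RED RoI: 18 x 10, BLUE RoI: 16 x 13 (positions chosen arbitrarily).
red = roi_max_pool(feature_map, top=1, left=2, height=18, width=10)
blue = roi_max_pool(feature_map, top=0, left=3, height=16, width=13)
```

Both `red` and `blue` come out as 3 x 3 arrays even though the RoIs differ in size, which is exactly what the fully connected layers need.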
3.3 Are more proposals always better?
The author of Fast R-CNN carried out an experiment in which he varied the number of object proposals and checked the detection performance. The performance for object proposals generated using selective search is shown by the solid blue line in Figure 5. It is observed that the mAP actually goes down when the number of object proposals increases beyond a certain range.
3.4 Softmax vs. SVM
If you remember from my previous story on R-CNN, R-CNN uses a number of class-specific linear SVMs for classification. Why does Fast R-CNN use softmax?
Experiments were carried out using networks of small, medium and large size, referred to as S, M and L respectively. As shown in Figure 6, R-CNN used SVMs only. But Fast R-CNN, when tried with both SVMs and softmax, performed slightly better with softmax.
3.5 Comparing R-CNN and Fast R-CNN architectures
- For R-CNN, training is a multi-stage pipeline. It first trains the ConvNet, then the SVMs, and then the bounding box regressor. In Fast R-CNN, training is single-stage, using a multi-task loss. Hence it is faster.
- In R-CNN, all object proposals are fed separately to the CNN to get the fixed-length feature vectors. Fast R-CNN takes the entire image as input, generates its feature map using the CNN in one pass, and then extracts the feature vector for each region proposal from the feature map using RoI pooling. Hence it is faster. This is called sharing computation or sharing features.
- The features extracted from the CNN in R-CNN require hundreds of gigabytes of storage. Fast R-CNN does not need to store any features.
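The multi-task loss mentioned in the first point can be sketched in a few lines of NumPy. This follows the form given in the Fast R-CNN paper as I understand it: a log loss for classification plus a smooth L1 loss for the box offsets, where the box term is skipped for background RoIs. The function names, the convention that class index 0 means background, and the default weight `lam=1.0` are assumptions of this sketch.

```python
import numpy as np

def smooth_l1(x):
    """Smooth L1: quadratic near zero, linear for large errors."""
    x = np.abs(x)
    return np.where(x < 1.0, 0.5 * x ** 2, x - 0.5)

def multi_task_loss(class_probs, true_class, pred_deltas, target_deltas, lam=1.0):
    """Loss for a single RoI.

    class_probs   : softmax output over background + object classes
    true_class    : ground-truth class index (0 is assumed to be background)
    pred_deltas   : predicted box offsets for the true class, shape (4,)
    target_deltas : ground-truth box offsets, shape (4,)
    """
    cls_loss = -np.log(class_probs[true_class])
    if true_class == 0:
        # Background RoIs have no ground-truth box, so only classification counts.
        return cls_loss
    loc_loss = smooth_l1(pred_deltas - target_deltas).sum()
    return cls_loss + lam * loc_loss

# Made-up numbers for one foreground RoI.
probs = np.array([0.1, 0.7, 0.2])
loss = multi_task_loss(probs, 1, np.array([0.5, 0.0, 0.0, 2.0]), np.zeros(4))
```

Because both terms are added into one number, a single backward pass can update the classifier, the box regressor and the shared convolutional layers together, which is what makes single-stage training possible.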
3.6 Comparing R-CNN and Fast R-CNN results
- Fast R-CNN trains very deep VGG16 networks 9 times faster than R-CNN.
- R-CNN with VGG16 takes 47 seconds to detect objects in one test image, while Fast R-CNN takes only 0.3 seconds (excluding object proposal time).
- Fast R-CNN achieves an mAP of 66% on the PASCAL VOC 2012 dataset. The mAP of R-CNN on the same dataset is 62%.
3.7 Drawbacks of Fast R-CNN
The speed of Fast R-CNN is limited by the time selective search takes to generate object proposals.
As can be seen from Figure 7, there is a huge difference between the test times of Fast R-CNN with and without the time taken by selective search to generate region proposals.
Including the time taken by selective search, the test time is 2.3 seconds. Though this is a great improvement over R-CNN, Fast R-CNN is still not suitable for real-time applications.
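To put a number on that gap, here is a small back-of-the-envelope calculation using only the timings quoted in this story (47 s, 0.3 s and 2.3 s). The variable names are mine, and the results are approximate because the inputs are rounded.

```python
# Timings quoted in this story, in seconds per test image (VGG16).
RCNN_TIME = 47.0             # R-CNN
FAST_RCNN_NET_TIME = 0.3     # Fast R-CNN, excluding object proposal time
FAST_RCNN_TOTAL_TIME = 2.3   # Fast R-CNN, including selective search

# Time spent on selective search alone.
proposal_time = FAST_RCNN_TOTAL_TIME - FAST_RCNN_NET_TIME
# Fraction of the total test time that goes to generating proposals.
proposal_share = proposal_time / FAST_RCNN_TOTAL_TIME
# Speedup of the network itself over R-CNN, proposals excluded.
network_speedup = RCNN_TIME / FAST_RCNN_NET_TIME

print(f"Selective search: about {proposal_time:.1f} s per image")
print(f"Share of total test time: about {proposal_share:.0%}")
print(f"Network-only speedup over R-CNN: about {network_speedup:.0f}x")
```

In other words, once the network itself is fast, most of the remaining test time is spent generating proposals rather than detecting objects.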
This is the perfect time to start talking about Faster R-CNN, a state-of-the-art algorithm for object detection. As the name suggests, it's faster than Fast R-CNN.
Stay tuned …..