The CNN-based object detection techniques have improved dramatically over the last few years. The journey starts with R-CNN, followed by Fast R-CNN, Faster R-CNN, Mask R-CNN and YOLO. These are all very simple, intuitive and easy-to-implement algorithms. Personally, I have found them very fascinating. I intend to write Medium stories for all of them.
R-CNN is the first in the family. Understanding R-CNN is crucial, as it lays the foundation for understanding the subsequent algorithms.
Note — Before reading this story, please make sure that you have understood the basics of CNNs and a few other concepts like IoU, mAP, SVMs and linear regression.
1. What is Object Detection?
The most basic task CNNs can effectively perform is to classify images. R-CNN is a very clever way of extending this capacity of CNNs to detect objects in an image.
Let's consider three types of images, as shown in the figure above. If, using some technique, we can identify whether an image contains a square, a triangle or a circle, then this task is called image classification.

Now consider this image! Here we have all three objects in one image. If we can identify which objects are present in the image and also find their locations in it, then this task is called object detection. Thus it involves two tasks: object identification and object localization. Object localization is simply finding the location of an object in the image, so in the output we draw a bounding box around the object to show its location. Object identification then specifies which object is present in each bounding box.

A sample output of object detection is shown in the image above. Here we get a bounding box for each object, and each bounding box is labeled with the name of the object. It also carries a number, as shown in the image. This number is called the objectness score; it indicates how confident the model is about the prediction. For example, the first box reads square 0.98, meaning the model is 98% sure that this object is a square.
In Figure 2, all objects are in one image, hence we can't directly use a CNN for object detection. The most logical thing to do is to split the task into the following two steps.
- Extract the regions in an image that may contain an object.
- Then run a CNN on each of those regions to identify the object inside it, just like we do for the basic classification task.
This is what R-CNN does internally, and that's why it's called R-CNN: Regions with CNN features.

In Figure 2, the image contains only three classes: square, triangle and circle, and the rest of the image is simply blank. A real-world image will have tens or even hundreds of classes like trees, buildings, cars, dogs and so on. And apart from these fixed classes, it will also contain many other objects that we are not interested in detecting. How on earth does R-CNN handle all that complexity? Keep reading…
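The two steps above can be sketched in a few lines of Python. This is only an illustrative skeleton: `propose_regions` and `classify_region` are hypothetical stand-ins for selective search and the CNN+SVM stage described later, not the actual R-CNN implementation.

```python
# A minimal sketch of the two-step R-CNN idea.
# Both helper functions are toy stand-ins, purely for illustration.

def propose_regions(image):
    # Stand-in for selective search: return candidate boxes as (x, y, w, h).
    h, w = len(image), len(image[0])
    return [(0, 0, w // 2, h // 2), (w // 2, h // 2, w // 2, h // 2)]

def classify_region(image, box):
    # Stand-in for CNN features + SVM scoring: returns (label, score).
    return ("square", 0.98)

def detect(image):
    # Step 1: extract regions; step 2: classify each region.
    detections = []
    for box in propose_regions(image):
        label, score = classify_region(image, box)
        detections.append((box, label, score))
    return detections

image = [[0] * 8 for _ in range(8)]  # dummy 8x8 "image"
print(detect(image))
```

The real pipeline differs only in scale: ~2000 proposals per image, a deep network for features, and one SVM per class for scoring.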
2. First look at R-CNN
Keep an eye on Figure 4 while reading the following points.
- The regions that are extracted from an input image are called region proposals.
- R-CNN extracts only around 2000 region proposals from an input image. (Not all 2000 proposals will actually contain objects of our interest.)
- As the CNN can only accept inputs of a fixed size, all the extracted region proposals are first converted to a fixed size of 227 x 227 pixels. This is called image warping.
- These fixed-size region proposals are then fed to a large convolutional neural network consisting of five convolutional layers and two fully connected layers to get a fixed-length feature vector for each region proposal.
- These feature vectors are then passed to a set of linear SVMs to predict the actual objects in the corresponding region proposals. The number of SVMs is equal to the number of classes in the dataset plus 1.
- For example, if the dataset contains 200 classes, then 201 linear SVMs are used in the last stage. The 1 extra class is for the background: the background of the image is treated as its own class.
- Thus each SVM corresponds to one class, and each class-specific SVM predicts whether the region proposal contains the corresponding object or not.
- After that, a bounding box regressor uses a linear regression model to predict bounding boxes that are more accurate in terms of location.
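To make the warp → features → per-class SVM flow concrete, here is a toy sketch with NumPy. The pooling and the random weight matrices are hypothetical stand-ins for the learned network and SVMs; only the shapes (227 x 227 input, a fixed-length feature vector, 200 classes + 1 background = 201 scores) mirror the description above.

```python
# Toy sketch of the R-CNN head: warp -> fixed-length features -> 201 SVM scores.
# The weights here are random placeholders, not a trained model.
import numpy as np

def warp(region, size=227):
    # Stand-in for image warping: nearest-neighbour resize to size x size.
    h, w = region.shape[:2]
    rows = np.arange(size) * h // size
    cols = np.arange(size) * w // size
    return region[rows][:, cols]

def cnn_features(warped, dim=4096):
    # Stand-in for the five conv + two FC layers: crude pooling followed by
    # a fixed random projection to a fixed-length feature vector.
    pooled = warped[::32, ::32].reshape(-1)  # 8x8 = 64 values
    rng = np.random.default_rng(0)
    W = rng.standard_normal((pooled.size, dim))
    return pooled @ W

def svm_scores(features, num_classes=200):
    # One linear SVM per class plus one for background -> 201 scores.
    rng = np.random.default_rng(1)
    W = rng.standard_normal((features.size, num_classes + 1))
    return features @ W

region = np.random.default_rng(2).random((50, 80))  # a toy region proposal
warped = warp(region)
scores = svm_scores(cnn_features(warped))
print(warped.shape, scores.shape)  # (227, 227) (201,)
```

Whatever the original proposal's shape, every region ends up as a 227 x 227 input and a 201-dimensional score vector, one score per class plus background.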
The above steps are summarized in Figure 5 below.
If you have read and understood up to this point, you have just grasped the basic idea and intuition of R-CNN. Hallelujah !!!
If you are looking for a deeper understanding of R-CNN, keep reading !!!
3. More insights into R-CNN
According to Wikipedia ……
Given an input image, R-CNN begins by applying a mechanism called Selective Search to extract regions of interest (ROI), where each ROI is a rectangle that may represent the boundary of an object in image. Depending on the scenario, there may be as many as two thousand ROIs. After that, each ROI is fed through a neural network to produce output features. For each ROI’s output features, a collection of support-vector machine classifiers is used to determine what type of object (if any) is contained within the ROI.
This is exactly what we learned in previous section.
Ross Girshick et al. introduced R-CNN in November 2013. Object detection has come a long way since then. The more recent algorithms are based on foundations laid by R-CNN. Hence it is crucial for aspiring computer vision engineers to understand R-CNN.
3.1 How are region proposals generated?

As the very first step of R-CNN, we have to generate region proposals from an image, which can then be warped and fed to the CNN for further processing. R-CNN is agnostic to the region proposal method, which means we can use any method to generate region proposals. The authors of R-CNN used Selective Search. The corresponding research paper by J.R.R. Uijlings et al. gives the details of selective search.

The criteria used to differentiate a region proposal from the rest of the image are worth giving a thought.
- Image (c) in Figure 6 contains a chameleon whose color closely matches its surroundings. So we have to use a criterion of texture to differentiate and extract the chameleon from the rest of the image.
- Image (b) in Figure 6 contains two cats whose texture is the same. So we have to use a criterion of color.
- In image (d), the wheels of the car have a different color and texture from the car, and we may mistakenly declare them as separate regions. Images are intrinsically hierarchical, hence the wheels and the rest of the car should merge into one object.
- On the contrary, in image (a), the spoon is inside the bowl and the bowl stands on the table. Here we may want to declare the spoon and the bowl as separate objects instead of merging them into the table and declaring them all as one object.

The bottom line here is that there does not exist a single strategy for generating region proposals.
Selective search algorithm presents a variety of diversification strategies to deal with as many image conditions as possible.
The process of selective search can be explained in the following steps with the help of Figure 7.
- A set of small starting regions, which ideally do not span multiple objects, is created, as shown in the first column of Figure 7.
- Similarities between all adjacent regions are computed, and the two most similar regions are grouped together.
- Similarities are computed again and grouping is repeated, as shown in Figure 7.
As the grouping goes on, we are left with fewer and fewer boxes. This approach is called bottom-up grouping. The details of how the set of small starting regions is created can be found in the graph-based segmentation paper by Felzenszwalb and Huttenlocher, which selective search uses for its initial regions.
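The bottom-up grouping loop can be sketched as follows. This is a toy version: real selective search merges superpixels using colour, texture, size and fill similarities, whereas here "similarity" is just the negative difference of mean intensities, and regions are plain lists of pixel values. The region graph is assumed to be connected.

```python
# Toy bottom-up grouping: repeatedly merge the two most similar adjacent
# regions until one region remains, recording the hierarchy along the way.

def group(regions, adjacency):
    # regions: {id: list of pixel values}; adjacency: set of (id, id) pairs.
    proposals = [sorted(regions)]  # region ids present at each step
    while len(regions) > 1:
        def sim(pair):
            # Toy similarity: closer mean intensities = more similar.
            a, b = pair
            mean = lambda r: sum(regions[r]) / len(regions[r])
            return -abs(mean(a) - mean(b))
        a, b = max(adjacency, key=sim)
        regions[a] = regions[a] + regions.pop(b)  # merge b into a
        # Rewire the adjacency graph: b's neighbours become a's neighbours.
        adjacency = {(a if x == b else x, a if y == b else y)
                     for x, y in adjacency if {x, y} != {a, b}}
        adjacency = {p for p in adjacency if p[0] != p[1]}
        proposals.append(sorted(regions))
    return proposals

# Four starting regions in a row: two dark-ish (0, 1) and two bright (2, 3).
regions = {0: [10, 12], 1: [11], 2: [90, 95], 3: [92]}
adjacency = {(0, 1), (1, 2), (2, 3)}
steps = group(regions, adjacency)
print(steps)  # [[0, 1, 2, 3], [0, 2, 3], [0, 2], [0]]
```

Every intermediate grouping is kept, which is exactly why selective search yields proposals at all scales: small regions early on, whole objects after more merging.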
Three types of diversification strategies are used.
- Using a variety of color spaces with increasing degrees of invariance to light intensity, shadow/shading and highlights.
- Using different similarity measures, like color similarity and texture similarity, encouraging small regions to merge early, and measuring how well two regions fit into each other.
- Varying the complementary starting regions.

The roughly 2000 region proposals thus created are warped and fed to the CNN, as shown in Figure 4.
3.2 The need for a bounding box regressor

This is a refinement step. A bounding box given by selective search is represented as [x, y, w, h], where x, y are the coordinates of the top-left corner of the region proposal and w, h are its width and height. This bounding box is further refined by the bounding box regressor: a linear regression model is trained to predict a corrected window. This way the localization error is reduced and mAP is improved by 3 to 4 points.
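The R-CNN paper parameterises this refinement on box centres: the regressor outputs four offsets (dx, dy, dw, dh), the centre is shifted proportionally to the proposal's size, and the width and height are rescaled exponentially. A small sketch, converting from the [x, y, w, h] top-left form used above (the example offsets are made up):

```python
# Applying bounding box regression offsets, as parameterised in the R-CNN
# paper: shift the centre, rescale width/height exponentially.
import math

def refine(box, d):
    x, y, w, h = box            # top-left corner, width, height
    dx, dy, dw, dh = d          # regressor outputs for this proposal
    cx, cy = x + w / 2, y + h / 2        # centre of the proposal
    cx, cy = cx + w * dx, cy + h * dy    # shift the centre
    w, h = w * math.exp(dw), h * math.exp(dh)  # rescale width and height
    return (cx - w / 2, cy - h / 2, w, h)

# A proposal that is slightly off; small offsets nudge it into place.
print(refine((10.0, 20.0, 100.0, 50.0), (0.05, -0.02, 0.1, 0.0)))
```

The exponential form keeps the predicted width and height positive regardless of what the regressor outputs, and scaling the shift by w and h makes the offsets invariant to the proposal's size.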
3.3 Training and evaluation of R-CNN
R-CNN is trained on the ILSVRC2013 dataset. ILSVRC stands for ImageNet Large Scale Visual Recognition Challenge. The dataset is split into train (395,918 images), val (20,121 images) and test (40,152 images) sets.

Three types of training occur in R-CNN: 1) CNN fine-tuning, 2) SVM training, and 3) bounding box regressor training.
At test time, we again obtain about 2000 region proposals for every image, warp them, feed them to the CNN, and get their scores from the SVMs. R-CNN then uses non-maximum suppression (NMS) to reject redundant region proposals. An object may have been detected by more than one region proposal. NMS keeps only proposals whose SVM score is above a learned threshold, and rejects a region if its overlap (IoU) with another, higher-scoring region exceeds a threshold. This way we get the best region proposal for each object.
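Greedy NMS is short enough to write out in full. This is a standard textbook version (boxes as (x, y, w, h), a 0.5 IoU threshold chosen for illustration), not code from the R-CNN release:

```python
# Greedy non-maximum suppression: keep the highest-scoring box, drop every
# box that overlaps it by more than an IoU threshold, repeat.

def iou(a, b):
    # Boxes are (x, y, w, h); IoU = intersection area / union area.
    ax1, ay1, ax2, ay2 = a[0], a[1], a[0] + a[2], a[1] + a[3]
    bx1, by1, bx2, by2 = b[0], b[1], b[0] + b[2], b[1] + b[3]
    iw = max(0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union else 0.0

def nms(boxes, scores, iou_thresh=0.5):
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_thresh]
    return keep

# Two overlapping boxes on one object plus one distant box.
boxes = [(0, 0, 10, 10), (1, 1, 10, 10), (50, 50, 10, 10)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # [0, 2] -- the second box overlaps the first
```

The second box has IoU ≈ 0.68 with the first and a lower score, so it is suppressed; the distant third box survives, leaving one box per object.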
Figure 8 shows an mAP comparison with various other techniques that existed prior to R-CNN. Note that methods preceded by * in the figure use outside training data; BB stands for bounding box. As is clear from the figure, R-CNN outperformed all other techniques on the ILSVRC2013 dataset.
3.4 Drawbacks of R-CNN
- It takes more than 40 seconds to detect objects in a test image, which makes it unsuitable for real-time applications.
- The CNN has to run once for every region proposal; there is no sharing of computation across proposals.
This is my first story in the series on CNN-based object detection. I have tried to keep it comprehensive and yet simple. It's my understanding of R-CNN and may conflict with yours; in such a case, I will be happy to learn from you.