Note
This is my personal note from studying week 3 of the Convolutional Neural Networks course (object detection); the copyright belongs to deeplearning.ai.
01_object-localization
Hello and welcome back. This week you learn about object detection. This is one of the areas of computer vision that’s just exploding and is working so much better than just a couple of years ago. In order to build up to object detection, you first learn about object localization.
Let’s start by defining what that means. You’re already familiar with the image classification task, where an algorithm looks at this picture and might be responsible for saying this is a car. So that was classification. The problem you’ll learn to build a neural network to address later in this video is classification with localization. Which means not only do you have to label this as, say, a car, but the algorithm is also responsible for putting a bounding box, or drawing a red rectangle, around the position of the car in the image. So that’s called the classification with localization problem, where the term localization refers to figuring out where in the picture is the car you’ve detected. Later this week, you then learn about the detection problem, where now there might be multiple objects in the picture and you have to detect them all and localize them all. And if you’re doing this for an autonomous driving application, then you might need to detect not just other cars, but maybe also pedestrians and motorcycles and maybe even other objects. So you’ll see that later this week. So in the terminology we’ll use this week, the classification and the classification with localization problems usually have one object, usually one big object in the middle of the image that you’re trying to recognize, or recognize and localize. In contrast, in the detection problem there can be multiple objects, and in fact maybe even multiple objects of different categories within a single image. So the ideas you’ve learned about for image classification will be useful for classification with localization, and the ideas you learn for localization will then turn out to be useful for detection.
So let’s start by talking about classification with localization. You’re already familiar with the image classification problem, in which you might input a picture into a ConvNet with multiple layers, so that’s our ConvNet, and this results in a vector of features that is fed to maybe a softmax unit that outputs the predicted class. So if you are building a self-driving car, maybe your object categories are the following: you might have a pedestrian, or a car, or a motorcycle, or a background, which means none of the above. So if there’s no pedestrian, no car, no motorcycle, then you might output background. So these are your classes, and you have a softmax with four possible outputs. So this is the standard classification pipeline. How about if you want to localize the car in the image as well? To do that, you can change your neural network to have a few more output units that output a bounding box. So, in particular, you can have the neural network output four more numbers, and I’m going to call them $b_x$, $b_y$, $b_h$, and $b_w$. And these four numbers parameterize the bounding box of the detected object. So in these videos, I am going to use the notational convention that the upper left of the image is denoted as the coordinate (0,0), and the lower right is (1,1). So, specifying the bounding box, the red rectangle, requires specifying the midpoint, that’s the point $b_x$, $b_y$, as well as the height, $b_h$, and the width, $b_w$, of this bounding box. So now if your training set contains not just the object class label, which a neural network is trying to predict up here, but also these four additional numbers giving the bounding box, then you can use supervised learning to make your algorithm output not just a class label but also the four parameters telling you where the bounding box of the detected object is. So in this example the ideal $b_x$ might be about 0.5 because this is about halfway to the right of the image. $b_y$ might be about 0.7 since it’s maybe about 70% of the way down the image. $b_h$ might be about 0.3 because the height of this red rectangle is about 30% of the overall height of the image. And $b_w$ might be about 0.4, let’s say, because the width of the red box is about 0.4 of the overall width of the entire image. So let’s formalize this a bit more in terms of how we define the target label y for this as a supervised learning task.
So just as a reminder, these are our four classes, and the neural network now outputs those four numbers $b_x, b_y, b_h, b_w$ as well as a class label, or maybe probabilities of the class labels. So, let’s define the target label y as follows. It’s going to be a vector where the first component $p_c$ asks: is there an object? So, if the object is class 1, 2 or 3, $p_c$ will be equal to 1. And if it’s the background class, so if it’s none of the objects you’re trying to detect, then $p_c$ will be 0. You can think of $p_c$ as standing for the probability that there’s an object, the probability that one of the classes you’re trying to detect is there, so something other than the background class. Next, if there is an object, then you want it to output $b_x$, $b_y$, $b_h$ and $b_w$, the bounding box for the object you detected. And finally, if there is an object, so if $p_c$ is equal to 1, you want it to also output $c_1$, $c_2$ and $c_3$, which tell us whether it is class 1, class 2 or class 3, so whether it is a pedestrian, a car or a motorcycle. And remember, in the problem we’re addressing we assume that your image has only one object. So at most one of these objects appears in the picture, in this classification with localization problem.
So let’s go through a couple of examples. If this is a training set image, so if that is x, then y will have its first component $p_c$ equal to 1 because there is an object, and then $b_x$, $b_y$, $b_h$ and $b_w$ will specify the bounding box. So your labeled training set will need bounding boxes in the labels. And then finally this is a car, so it’s class 2. So $c_1$ will be 0 because it’s not a pedestrian, $c_2$ will be 1 because it is a car, and $c_3$ will be 0 since it is not a motorcycle. So among $c_1$, $c_2$ and $c_3$, at most one of them should be equal to 1. So that’s if there’s an object in the image.
What if there’s no object in the image? What if we have a training example where x is equal to that? In this case, $p_c$ would be equal to 0, and the rest of the elements of the vector will be don’t cares, so I’m going to write question marks in all of them. This is a don’t care, because if there is no object in this image, then you don’t care what bounding box the neural network outputs, or which of the three classes $c_1$, $c_2$, $c_3$ it thinks it is. So given a set of labeled training examples, this is how you construct x, the input image, as well as y, the class label, both for images where there is an object and for images where there is no object. And the set of these will then define your training set.
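To make the label format concrete, here is a minimal numpy sketch (my own illustration, not code from the course) that builds the eight-dimensional target vector y for the two cases above. The placeholder value used for the don’t-care entries is my own assumption, since in practice those entries are simply masked out of the loss.

```python
import numpy as np

DONT_CARE = 0.0  # placeholder; these entries are masked out of the loss anyway

def make_target(has_object, box=None, class_id=None, num_classes=3):
    """Build y = [pc, bx, by, bh, bw, c1, c2, c3] for one image."""
    y = np.full(1 + 4 + num_classes, DONT_CARE, dtype=np.float32)
    if has_object:
        y[0] = 1.0                      # pc = 1: there is an object
        y[1:5] = box                    # (bx, by, bh, bw), all as fractions of the image
        y[5 + class_id] = 1.0           # one-hot class: 0=pedestrian, 1=car, 2=motorcycle
    else:
        y[0] = 0.0                      # pc = 0: background; the rest are don't cares
    return y

# The car example from the text, roughly centered-right in the image:
print(make_target(True, box=[0.5, 0.7, 0.3, 0.4], class_id=1))
# An image with no object:
print(make_target(False))
```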
Finally, let’s describe the loss function you use to train the neural network. So the ground truth label is y and the neural network outputs some $\hat{y}$. What should the loss be? Well, if you’re using squared error, then the loss can be $(\hat{y}_1 - y_1)^2 + (\hat{y}_2 - y_2)^2 + \dots + (\hat{y}_8 - y_8)^2$. Notice that y here has eight components, so the loss is the sum of the squares of the differences of the elements. And that’s the loss if $y_1 = 1$, that’s the case where there is an object, since $y_1 = p_c$. So if $p_c = 1$, that is, if there is an object in the image, then the loss can be the sum of squares over all the different elements. The other case is if $y_1 = 0$, so that’s if $p_c = 0$. In that case the loss can be just $(\hat{y}_1 - y_1)^2$, because in that second case all of the rest of the components are don’t cares, and so all you care about is how accurately the neural network is outputting $p_c$. So just to recap, if $y_1 = 1$, that’s the first case, then you can use squared error to penalize squared deviation between the predicted and the actual output of all eight components. Whereas if $y_1 = 0$, then the second through eighth components are don’t cares, and all you care about is how accurately your neural network is estimating $y_1$, which is equal to $p_c$.
Just as a side comment for those of you who want to know all the details, I’ve used squared error just to simplify the description here. In practice you would probably use a log-likelihood loss for $c_1$, $c_2$, $c_3$ with a softmax output over those elements, usually something like squared error for the bounding box coordinates, and for $p_c$ you could use something like the logistic regression loss. Although even if you use squared error for everything, it will probably work okay.
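As a rough illustration of the loss just described, here is a hedged numpy sketch of the masked squared-error version: all eight components are penalized when $p_c = 1$, and only the first component when $p_c = 0$. The mixture of log loss for $p_c$, softmax cross-entropy for the classes, and squared error for the box mentioned above would replace the inner terms, but the masking logic stays the same.

```python
import numpy as np

def localization_loss(y_true, y_pred):
    """Squared-error loss for y = [pc, bx, by, bh, bw, c1, c2, c3]."""
    if y_true[0] == 1:                       # object present: penalize all 8 components
        return np.sum((y_pred - y_true) ** 2)
    else:                                    # background: only pc matters
        return (y_pred[0] - y_true[0]) ** 2

y_true = np.array([1, 0.5, 0.7, 0.3, 0.4, 0, 1, 0], dtype=np.float32)
y_pred = np.array([0.9, 0.45, 0.72, 0.28, 0.41, 0.1, 0.8, 0.1], dtype=np.float32)
print(localization_loss(y_true, y_pred))
```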
So that’s how you get a neural network to not just classify an object but also to localize it. The idea of having a neural network output a bunch of real numbers to tell you where things are in a picture turns out to be a very powerful idea. In the next video I want to share with you some other places where this idea of having a neural network output a set of real numbers, almost as a regression task, can be very powerful to use elsewhere in computer vision as well. So let’s go on to the next video.
02_landmark-detection
In the previous video, you saw how you can get a neural network to output four numbers, $b_x$, $b_y$, $b_h$, and $b_w$, to specify the bounding box of an object you want a neural network to localize. In more general cases, you can have a neural network just output the X and Y coordinates of important points in an image, sometimes called landmarks, that you want the neural network to recognize. Let me show you a few examples.
Let’s say you’re building a face recognition application and, for some reason, you want the algorithm to tell you where the corner of someone’s eye is. So that point has an X and Y coordinate, so you can just have the neural network’s final layer output two more numbers, which I’m going to call lx and ly, to tell you the coordinates of that corner of the person’s eye. Now, what if you want it to tell you all four corners of the eye, really of both eyes? So, if we call the points the first, second, third and fourth points going from left to right, then you could modify the neural network to output l1x, l1y for the first point and l2x, l2y for the second point and so on, so that the neural network can output the estimated position of all those four points of the person’s face. But what if you don’t want just those four points? What if you want to output this point, and this point, and this point, and this point along the eye? Maybe output some key points along the mouth, so you can extract the mouth shape and tell if the person is smiling or frowning, and maybe extract a few key points along the edges of the nose. You could define some number of points, for the sake of argument let’s say 64 points or 64 landmarks on the face, maybe even some points that help you define the edge of the face, that define the jaw line. By selecting a number of landmarks and generating a labeled training set that contains all of these landmarks, you can then have the neural network tell you where all the key positions or key landmarks on a face are. So what you do is you take this image, a person’s face, as input, have it go through a ConvNet that produces some set of features, maybe have it output 1 or 0, is there a face or not, and then have it also output l1x, l1y and so on down to l64x, l64y. And here I’m using l to stand for a landmark. So this example would have 129 output units, one for whether there is a face or not, and then, if you have 64 landmarks, that’s sixty-four times two, so 128 plus one output units, and this can tell you if there’s a face as well as where all the key landmarks on the face are. So, this is a basic building block for recognizing emotions from faces, and if you’ve played with Snapchat and other entertainment and AR (augmented reality) filters, the Snapchat filters can draw a crown on the face and add other special effects. Being able to detect these landmarks on the face is also a key building block for the computer graphics effects that warp the face or draw various special effects like putting a crown or a hat on the person. Of course, in order to train a network like this, you will need a labeled training set: a set of images as well as labels Y where someone will have had to go through and laboriously annotate all of these landmarks.
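To make the 129-output head concrete, here is a minimal numpy sketch (my own illustration, not course code) of how the target vector for one face image could be laid out: one is-there-a-face flag followed by 64 (x, y) landmark pairs.

```python
import numpy as np

def make_landmark_target(is_face, landmarks_xy=None, num_landmarks=64):
    """Target vector: [face_flag, l1x, l1y, ..., l64x, l64y] -> 1 + 2*64 = 129 values."""
    y = np.zeros(1 + 2 * num_landmarks, dtype=np.float32)
    if is_face:
        y[0] = 1.0
        y[1:] = np.asarray(landmarks_xy, dtype=np.float32).reshape(-1)  # (64, 2) -> 128
    return y

# 64 landmarks annotated as (x, y) fractions of image width/height
landmarks = np.random.rand(64, 2)                     # stand-in for a labeler's annotations
print(make_landmark_target(True, landmarks).shape)    # (129,)
```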
One last example: if you are interested in people pose detection, you could also define a few key positions like the midpoint of the chest, the left shoulder, the left elbow, the wrist, and so on, and have a neural network annotate key positions in the person’s pose as well. By having the neural network output all of those points, you could also have the neural network output the pose of the person. And of course, to do that you also need to specify these key landmarks, like maybe l1x and l1y is the midpoint of the chest, down to maybe l32x, l32y, if you use 32 coordinates to specify the pose of the person.
So, this idea of just adding a bunch of output units to output the X, Y coordinates of different landmarks you want to recognize might seem quite simple. To be clear, the identity of landmark one must be consistent across different images: maybe landmark one is always this corner of the eye, landmark two is always that corner of the eye, landmark three, landmark four, and so on. So, the labels have to be consistent across different images. But if you can hire labelers, or label yourself a big enough data set to do this, then a neural network can output all of these landmarks, which can be used to carry out other interesting effects, such as estimating the pose of the person, or maybe trying to recognize someone’s emotion from a picture, and so on. So that’s it for landmark detection. Next, let’s take these building blocks and use them to start building up towards object detection.
03_object-detection
You’ve learned about object localization as well as landmark detection. Now, let’s build up towards the object detection algorithm. In this video, you’ll learn how to use a ConvNet to perform object detection using something called the Sliding Windows Detection Algorithm.
Let’s say you want to build a car detection algorithm. Here’s what you can do. You can first create a labeled training set, so x and y, with closely cropped examples of cars. So, these images x are positive examples: there’s a car, here’s a car, here’s a car, and then there’s not a car, there’s not a car. And for our purposes in this training set, you can start off with closely cropped images of cars, meaning that x is pretty much only the car. So, you can take a picture and crop out and just cut out anything that’s not part of a car, so you end up with the car centered in pretty much the entire image. Given this labeled training set, you can then train a ConvNet that inputs an image, like one of these closely cropped images, and the job of the ConvNet is to output y, zero or one, is there a car or not. Once you’ve trained up this ConvNet, you can then use it in Sliding Windows Detection.
So the way you do that is, if you have a test image like this, you start by picking a certain window size, shown down there. And then you would input into this ConvNet a small rectangular region. So, take just this small red square, input that into the ConvNet, and have the ConvNet make a prediction. And presumably for that little region in the red square, it’ll say, no, that little red square does not contain a car. In the Sliding Windows Detection Algorithm, what you do is you then pass as input a second image, now bounded by this red square shifted a little bit over, and feed that to the ConvNet. So, you’re feeding just the region of the image in the red square into the ConvNet and running the ConvNet again. And then you do that with a third image and so on. And you keep going until you’ve slid the window across every position in the image. And I’m using a pretty large stride in this example just to make the animation go faster. But the idea is you basically go through every region of this size, pass lots of little cropped images into the ConvNet, and have it classify zero or one for each position with some stride.
Now, having done this once, sliding this window, as it’s called, through the image, you then repeat it, but now with a larger window. So, now you take a slightly larger region, resize that region to whatever input size the ConvNet is expecting, feed it to the ConvNet, and have it output zero or one. And then slide the window over again using some stride, and so on, and you run that throughout your entire image until you get to the end.
And then you might do it a third time using even larger windows and so on. And the hope is that if you do this, then so long as there’s a car somewhere in the image, there will be a window where, for example, if you pass this window into the ConvNet, the ConvNet will hopefully output one for that input region, so you detect that there is a car there. So this algorithm is called Sliding Windows Detection because you take these windows, these square boxes, and slide them across the entire image and classify every square region, with some stride, as containing a car or not.
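As a rough sketch of the naive procedure described above (before the convolutional trick of the next video), the loop below slides square windows of several sizes over an image with a fixed stride and calls a classifier on each crop. The `classify_crop` function and the window sizes are hypothetical stand-ins, not anything from the course; in practice each crop would be resized to the ConvNet’s input size before classification.

```python
import numpy as np

def classify_crop(crop):
    """Hypothetical stand-in for the trained car/not-car ConvNet (would resize + forward prop)."""
    return 0  # pretend nothing is a car

def sliding_windows(image, window_sizes=(64, 128, 192), stride=32):
    """Return (row, col, size) for every window the classifier labels as containing a car."""
    H, W = image.shape[:2]
    detections = []
    for size in window_sizes:                       # repeat with larger and larger windows
        for r in range(0, H - size + 1, stride):
            for c in range(0, W - size + 1, stride):
                crop = image[r:r + size, c:c + size]
                if classify_crop(crop) == 1:        # ConvNet says "car"
                    detections.append((r, c, size))
    return detections

print(sliding_windows(np.zeros((256, 256, 3))))     # [] with the dummy classifier
```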
Now, there’s a huge disadvantage of Sliding Windows Detection, which is the computational cost. Because you’re cropping out so many different square regions in the image and running each of them independently through a ConvNet. And if you use a very coarse stride, a very big stride, a very big step size, then that will reduce the number of windows you need to pass through the ConvNet, but that coarser granularity may hurt performance. Whereas if you use a very fine granularity, a very small stride, then the huge number of all these little regions you’re passing through the ConvNet means a very high computational cost. So, before the rise of neural networks, people used to use much simpler classifiers, like a simple linear classifier over hand-engineered features, in order to perform object detection. And in that era, because each classifier was relatively cheap to compute, it was just a linear function, Sliding Windows Detection ran okay. It was not a bad method. But with a ConvNet, running a single classification task is much more expensive, and sliding windows this way is infeasibly slow. And unless you use a very fine granularity or a very small stride, you also end up not being able to localize the objects that accurately within the image. Fortunately, however, this problem of computational cost has a pretty good solution. In particular, the Sliding Windows Object Detector can be implemented convolutionally, much more efficiently. Let’s see in the next video how you can do that.
04_convolutional-implementation-of-sliding-windows
In the last video, you learned about the sliding windows object detection algorithm using a convnet but we saw that it was too slow. In this video, you’ll learn how to implement that algorithm convolutionally. Let’s see what this means.
To build up towards the convolutional implementation of sliding windows, let’s first see how you can turn fully connected layers in a neural network into convolutional layers. We’ll do that first on this slide, and then on the next slide we’ll use the ideas from this slide to show you the convolutional implementation.
So let’s say that your object detection algorithm inputs 14 by 14 by 3 images. This is quite small, but just for illustrative purposes. And let’s say it then uses 5 by 5 filters, and let’s say it uses 16 of them, to map it from 14 by 14 by 3 to 10 by 10 by 16. It then does a 2 by 2 max pooling to reduce it to 5 by 5 by 16, then has a fully connected layer to connect to 400 units, then another fully connected layer, and then finally outputs Y using a softmax unit. In order to make the change we’ll need in a second, I’m going to change this picture a little bit and instead view Y as four numbers, corresponding to the class probabilities of the four classes that the softmax unit classifies amongst. And the four classes could be pedestrian, car, motorcycle, and background, or something else. Now, what I’d like to do is show how these layers can be turned into convolutional layers. So, the ConvNet will be drawn the same as before for the first few layers. And now, one way of implementing this next layer, this fully connected layer, is as a 5 by 5 filter, and let’s use 400 of these 5 by 5 filters. So if you take a 5 by 5 by 16 volume and convolve it with a 5 by 5 filter, remember, a 5 by 5 filter is implemented as 5 by 5 by 16 because our convention is that the filter looks across all 16 channels. So this 16 and this 16 must match, and so the output will be 1 by 1. And if you have 400 of these 5 by 5 by 16 filters, then the output dimension is going to be 1 by 1 by 400. So rather than viewing these 400 as just a set of nodes, we’re going to view this as a 1 by 1 by 400 volume. Mathematically, this is the same as a fully connected layer because each of these 400 nodes has a filter of dimension 5 by 5 by 16, so each of those 400 values is some arbitrary linear function of these 5 by 5 by 16 activations from the previous layer. Next, to implement the next layer, we’re going to use a 1 by 1 convolution. If you have 400 1 by 1 filters, then the next layer will again be 1 by 1 by 400, so that gives you this next fully connected layer. And then finally, we’re going to have another 1 by 1 filter, followed by a softmax activation, so as to give a 1 by 1 by 4 volume to take the place of these four numbers that the network was outputting. So this shows how you can take these fully connected layers and implement them using convolutional layers, so that these sets of units are instead implemented as 1 by 1 by 400 and 1 by 1 by 4 volumes.
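Here is a small Keras sketch of the fully convolutional version of this network, my own re-creation of the architecture in the slide (the ReLU activations are my assumption, since the lecture doesn’t specify them). Because the input height and width are left unspecified, the same model also accepts the larger test images used on the next slide, which previews the convolutional sliding windows idea.

```python
import numpy as np
import tensorflow as tf

# Fully convolutional version of: 14x14x3 -> conv 5x5x16 -> maxpool 2x2 -> "FC 400" -> "FC 400" -> softmax 4
model = tf.keras.Sequential([
    tf.keras.Input(shape=(None, None, 3)),                      # height/width left flexible on purpose
    tf.keras.layers.Conv2D(16, (5, 5), activation='relu'),      # 14x14x3 -> 10x10x16
    tf.keras.layers.MaxPooling2D((2, 2)),                       # -> 5x5x16
    tf.keras.layers.Conv2D(400, (5, 5), activation='relu'),     # "FC" layer as 400 filters of 5x5x16 -> 1x1x400
    tf.keras.layers.Conv2D(400, (1, 1), activation='relu'),     # second "FC" layer as a 1x1 convolution
    tf.keras.layers.Conv2D(4, (1, 1), activation='softmax'),    # 4 class scores per spatial position
])

print(model(np.zeros((1, 14, 14, 3), dtype=np.float32)).shape)  # (1, 1, 1, 4): one prediction
print(model(np.zeros((1, 16, 16, 3), dtype=np.float32)).shape)  # (1, 2, 2, 4): four windows at once
print(model(np.zeros((1, 28, 28, 3), dtype=np.float32)).shape)  # (1, 8, 8, 4): an 8x8 grid of windows
```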
Armed with this conversion, let’s see how you can have a convolutional implementation of sliding windows object detection. The presentation on this slide is based on the OverFeat paper, referenced at the bottom, by Pierre Sermanet, David Eigen, Xiang Zhang, Michael Mathieu, Robert Fergus and Yann Lecun. Let’s say that your sliding windows ConvNet inputs 14 by 14 by 3 images, and again, I’m just using small numbers like the 14 by 14 image in this slide mainly to make the numbers and illustrations simpler. So as before, you have a neural network as follows that eventually outputs a 1 by 1 by 4 volume, which is the output of your softmax. To simplify the drawing here, the 14 by 14 by 3 input is technically a volume, and so are the 10 by 10 by 16 and 5 by 5 by 16 layers, but I’m just going to draw the front face of these volumes. So instead of drawing the 1 by 1 by 400 volume, I’m just going to draw the 1 by 1 face of all of these, dropping the third dimension of these drawings, just for this slide. So let’s say that your ConvNet inputs 14 by 14 images, or 14 by 14 by 3 images, and your test set image is 16 by 16 by 3, so now I’ve added that yellow stripe to the border of this image. In the original sliding windows algorithm, you might want to input the blue region into a ConvNet and run that once to generate a classification, 0 or 1. Then, let’s say you use a stride of two pixels, you might slide the window to the right by two pixels to input this green rectangle into the ConvNet, run the whole ConvNet, and get another label, 0 or 1. Then you might input this orange region into the ConvNet and run it one more time to get another label, and then do it a fourth and final time with this lower right purple square. To run sliding windows on this 16 by 16 by 3 image, which is a pretty small image, you run this ConvNet four times in order to get four labels. But it turns out a lot of the computation done by these four ConvNets is highly duplicative. So what the convolutional implementation of sliding windows does is allow these four forward passes of the ConvNet to share a lot of computation. Specifically, here’s what you can do. You can take the ConvNet and just run it with the same parameters, the same 16 5 by 5 filters, and now you get a 12 by 12 by 16 output volume. Then do the max pool, same as before, and now you have a 6 by 6 by 16 volume. Run it through your same 400 5 by 5 filters to get now a 2 by 2 by 400 volume. So now, instead of a 1 by 1 by 400 volume, we have a 2 by 2 by 400 volume. Running it through the 1 by 1 filters gives you another 2 by 2 by 400 instead of 1 by 1 by 400. Do that one more time and now you’re left with a 2 by 2 by 4 output volume instead of 1 by 1 by 4. It turns out that this blue 1 by 1 by 4 subset gives you the result of running the ConvNet on the upper left hand 14 by 14 region of the image. The upper right 1 by 1 by 4 volume gives you the upper right result. The lower left gives you the result of running the ConvNet on the lower left 14 by 14 region. And the lower right 1 by 1 by 4 volume gives you the same result as running the ConvNet on the lower right 14 by 14 region. And if you step through all the steps of the calculation, looking at the green example, if you had cropped out just this region and passed it through the ConvNet on top, then the first layer’s activations would have been exactly this region.
The next layer’s activation after max pooling would have been exactly this region, and then the next layer and the next layer would have been as follows. So what this convolutional implementation does is, instead of forcing you to run forward propagation on four subsets of the input image independently, it combines all four into one forward pass and shares a lot of the computation in the regions of the image that are common to all four of the 14 by 14 patches we saw here.
Now let’s just go through a bigger example. Let’s say you now want to run sliding windows on a 28 by 28 by 3 image. It turns out that if you run forward prop the same way, then you end up with an 8 by 8 by 4 output. Running sliding windows with that 14 by 14 window on the upper left region corresponds to the output in the upper left hand corner. Then using a stride of two to shift the window over one position at a time, and so on across the eight positions, gives you this first row, and then as you go down the image as well, that gives you all of these 8 by 8 by 4 outputs. Because of the max pooling of two, this corresponds to running your neural network with a stride of two on the original image.
So just to recap, to implement sliding windows previously, what you do is you crop out a region, let’s say this is 14 by 14, and run that through your ConvNet, and do that for the next region over, then the next 14 by 14 region, then the next one, then the next one, and so on, until hopefully one of them recognizes the car. But now, instead of doing it sequentially, with the convolutional implementation that you saw in the previous slide, you can input the entire image, maybe 28 by 28, and convolutionally make all the predictions at the same time with one forward pass through this big ConvNet, and hopefully have it recognize the position of the car.
So that’s how you implement sliding windows convolutionally and it makes the whole thing much more efficient. Now, this algorithm still has one weakness, which is the position of the bounding boxes is not going to be too accurate. In the next video, let’s see how you can fix that problem.
05_bounding-box-predictions
In the last video, you learned how to use a convolutional implementation of sliding windows. That’s more computationally efficient, but it still has a problem of not quite outputting the most accurate bounding boxes. In this video, let’s see how you can get your bounding box predictions to be more accurate.
With sliding windows, you take these three sets of locations and run the classifier through them. And in this case, none of the boxes really matches up perfectly with the position of the car. So, maybe that box is the best match. And also, it looks like the perfect bounding box isn’t even quite square; it’s actually a slightly wider rectangle with a slightly horizontal aspect ratio. So, is there a way to get this algorithm to output more accurate bounding boxes? A good way to get more accurate bounding boxes is with the YOLO algorithm. YOLO stands for You Only Look Once, and it’s an algorithm due to Joseph Redmon, Santosh Divvala, Ross Girshick and Ali Farhadi.
Here’s what you do. Let’s say you have an input image at 100 by 100. You’re going to place down a grid on this image, and for the purposes of illustration, I’m going to use a 3 by 3 grid, although in an actual implementation you would use a finer one, like maybe a 19 by 19 grid. And the basic idea is you’re going to take the image classification and localization algorithm that you saw in the first video of this week and apply it to each of the nine grid cells of this image. To be more concrete, here’s how you define the labels you use for training. For each of the nine grid cells, you specify a label Y, where the label Y is the same eight-dimensional vector you saw previously. The first output $p_c$ is 0 or 1 depending on whether or not there’s an object in that grid cell, and then $b_x, b_y, b_h, b_w$ specify the bounding box if there is an object associated with that grid cell. And then $c_1, c_2, c_3$, if you’re trying to recognize three classes, not counting the background class. So if you’re trying to recognize pedestrians, cars and motorcycles, then $c_1, c_2, c_3$ would be the pedestrian, car and motorcycle classes. So in this image, we have nine grid cells, and you have a vector like this for each of the grid cells. So let’s start with the upper left grid cell, this one up here. For that one, there is no object. So, the label vector Y for the upper left grid cell would have $p_c$ equal to zero, and then don’t cares for the rest of these. The output label Y would be the same for this grid cell, and this grid cell, and all the grid cells with nothing, with no interesting object, in them. Now, how about this grid cell (the 5th grid cell)? To give a bit more detail, this image has two objects. And what the YOLO algorithm does is it takes the midpoint of each of the two objects and then assigns the object to the grid cell containing the midpoint. So the left car is assigned to this grid cell (the 4th grid cell), and the car on the right, whose midpoint is here, is assigned to this grid cell (the 6th grid cell). And so even though the central grid cell (the 5th grid cell) has some parts of both cars, we’ll pretend the central grid cell has no interesting object, so that for the central grid cell the class label Y also looks like the vector with no object: the first component $p_c$ is zero, and then the rest are don’t cares. Whereas for this cell, the cell that I have circled in green on the left, the target label Y would be as follows. There is an object, and then you write $b_x, b_y, b_h, b_w$ to specify the position of this bounding box. And then, let’s see, if class one was a pedestrian, then that is zero. Class two is a car, that’s one. Class three was a motorcycle, that’s zero. And then similarly, for the grid cell on the right, because that does have an object in it, it will also have some vector like this as the target label corresponding to that grid cell. So, for each of these nine grid cells, you end up with an eight-dimensional output vector. And because you have 3 by 3 grid cells, nine grid cells, the total volume of the output is going to be 3 by 3 by 8. So the target output is going to be 3 by 3 by 8 because you have 3 by 3 grid cells, and for each of the 3 by 3 grid cells, you have an eight-dimensional Y vector. So the target output volume is 3 by 3 by 8, where, for example, this 1 by 1 by 8 volume in the upper left corresponds to the target output vector for the upper left of the nine grid cells.
And so for each of the 3 by 3 positions, for each of these nine grid cells, there is a corresponding eight-dimensional target vector Y that you want the network to output, some of which could be don’t cares if there’s no object there. And that’s why the total target output, the output label for this image, is now itself a 3 by 3 by 8 volume. So now, to train your neural network, the input is 100 by 100 by 3, that’s the input image, and then you have a usual ConvNet with conv layers, max pool layers, and so on. You should choose the conv layers and the max pool layers so that this eventually maps to a 3 by 3 by 8 output volume. And so what you do is you have an input X, which is the input image like that, and you have these target labels Y which are 3 by 3 by 8, and you use backpropagation to train the neural network to map from any input X to this type of output volume Y.
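Here is a minimal numpy sketch (my own illustration) of building the 3 by 3 by 8 target volume for one training image: each object is assigned to the grid cell that contains its midpoint. For brevity the box is stored in whole-image fractions here; the grid-cell-relative encoding of $b_x, b_y, b_h, b_w$ discussed later in this video is sketched separately below.

```python
import numpy as np

def make_yolo_target(objects, grid=3, num_classes=3):
    """objects: list of (bx, by, bh, bw, class_id) with box values as fractions of the image.
    Returns a grid x grid x (5 + num_classes) target volume."""
    y = np.zeros((grid, grid, 5 + num_classes), dtype=np.float32)
    for bx, by, bh, bw, cls in objects:
        col = min(int(bx * grid), grid - 1)       # grid cell containing the midpoint
        row = min(int(by * grid), grid - 1)
        y[row, col, 0] = 1.0                      # pc = 1 for that cell
        y[row, col, 1:5] = [bx, by, bh, bw]       # box (image-relative here; see the encoding sketch below)
        y[row, col, 5 + cls] = 1.0                # one-hot class
    return y

# Two cars (class 1) whose midpoints fall in the left and right middle cells of a 3x3 grid:
y = make_yolo_target([(0.2, 0.6, 0.3, 0.4, 1), (0.75, 0.55, 0.35, 0.5, 1)])
print(y.shape)           # (3, 3, 8)
print(y[1, 0], y[1, 2])  # the two cells with objects
```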
So the advantage of this algorithm is that the neural network outputs precise bounding boxes as follows. At test time, what you do is you feed in an input image X and run forward prop until you get this output Y. And then for each of the nine outputs, for each of the 3 by 3 positions of the output, you can just read off 1 or 0: is there an object associated with that one of the nine positions? And if there is an object, what object it is, and where the bounding box for the object in that grid cell is. And so long as you don’t have more than one object in each grid cell, this algorithm should work okay. The problem of having multiple objects within a grid cell is something we’ll address later. Although we’ve used a relatively small 3 by 3 grid, in practice you might use a much finer grid, maybe 19 by 19, so you end up with 19 by 19 by 8. That makes your grid much finer and reduces the chance that there are multiple objects assigned to the same grid cell. And just as a reminder, the way you assign an object to a grid cell is you look at the midpoint of the object and then assign that object to whichever grid cell contains the midpoint. So each object, even if it spans multiple grid cells, is assigned only to one of the nine grid cells, or one of the 3 by 3, or one of the 19 by 19, grid cells. With a 19 by 19 grid, the chance of the midpoints of two objects appearing in the same grid cell is just a bit smaller.
So notice two things. First, this is a lot like the image classification and localization algorithm that we talked about in the first video of this week, in that it outputs the bounding box coordinates explicitly. And so this allows your network to output bounding boxes of any aspect ratio, as well as output much more precise coordinates that aren’t just dictated by the stride size of your sliding windows classifier. And second, this is a convolutional implementation: you’re not implementing this algorithm nine times on the 3 by 3 grid, or, if you’re using a 19 by 19 grid, 361 times, since 19 squared is 361. So, you’re not running the same algorithm 361 times or 19 squared times. Instead, this is one single convolutional implementation, where you use one ConvNet with a lot of shared computation between all the computations needed for all of your 3 by 3 or all of your 19 by 19 grid cells. So, this is a pretty efficient algorithm. And in fact, one nice thing about the YOLO algorithm, which accounts for its popularity, is that because this is a convolutional implementation, it actually runs very fast, so this works even for real-time object detection.
Now, before wrapping up, there’s one more detail I want to share with you, which is: how do you encode these bounding boxes $b_x, b_y, b_h, b_w$? Let’s discuss that on the next slide. So, given these two cars, remember, we have the 3 by 3 grid. Let’s take the example of the car on the right. So, in this grid cell there is an object, and so the target label y will have $p_c$ equal to one, and then $b_x, b_y, b_h, b_w$, and then 0 1 0. So, how do you specify the bounding box? In the YOLO algorithm, relative to this square, I’m going to take the convention that the upper left point here is (0,0) and this lower right point is (1,1). So to specify the position of that midpoint, that orange dot, $b_x$ might be, let’s say, about 0.4, maybe it’s about 0.4 of the way to the right, and then $b_y$ looks like maybe 0.3. And then the height of the bounding box is specified as a fraction of the overall height of the grid cell. So, the height of this red box is maybe 90% of that blue line, and so $b_h$ is 0.9, and the width is maybe one half of the overall width of the grid cell, so in that case $b_w$ would be, let’s say, 0.5. In other words, $b_x, b_y, b_h, b_w$ are specified relative to the grid cell. And so $b_x$ and $b_y$ have to be between 0 and 1, right? Because pretty much by definition that orange dot is within the bounds of the grid cell it’s assigned to. If it weren’t between 0 and 1, if it were outside the square, then it would have been assigned to a different grid cell. But $b_h$ and $b_w$ could be greater than one. In particular, if you have a car whose bounding box looks like that, then the height and width of the bounding box could be greater than one. So, there are multiple ways of specifying the bounding boxes, but this is one convention that’s quite reasonable. Although, if you read the YOLO research papers, the YOLO research line, there were other parameterizations that work even a little bit better, but I hope this gives one reasonable convention that should work okay. There are some more complicated parameterizations involving sigmoid functions to make sure these are between 0 and 1, and using an exponential parameterization to make sure that these are non-negative, since, like 0.9 and 0.5, they have to be greater than or equal to zero. There are some other, more advanced parameterizations that work a little bit better, but the one you saw here should work okay.
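Here is a small numpy sketch of my own, following the convention just described, that converts a box given in whole-image fractions into the grid-cell-relative $b_x, b_y, b_h, b_w$ encoding: the midpoint becomes a fraction of the cell it falls in (so it lies between 0 and 1), while the height and width become fractions of the cell and can exceed 1.

```python
def encode_box(mid_x, mid_y, height, width, grid=3):
    """All inputs are fractions of the whole image; returns (row, col, bx, by, bh, bw)
    with bx, by, bh, bw measured relative to the grid cell containing the midpoint."""
    col = min(int(mid_x * grid), grid - 1)
    row = min(int(mid_y * grid), grid - 1)
    bx = mid_x * grid - col        # in (0, 1): position of the midpoint inside its cell
    by = mid_y * grid - row
    bh = height * grid             # fraction of the cell height; can be > 1 for large objects
    bw = width * grid
    return row, col, bx, by, bh, bw

# A car whose box is centered at (0.8, 0.43) in image coordinates, 0.3 high and 0.17 wide,
# reproduces roughly the example values from the text: bx ~ 0.4, by ~ 0.3, bh ~ 0.9, bw ~ 0.5.
print(encode_box(0.8, 0.43, 0.3, 0.17))
```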
So, that’s it for the YOLO, or You Only Look Once, algorithm. In the next few videos I’ll show you a few other ideas that will help make this algorithm even better. In the meantime, if you want, you can take a look at the YOLO paper referenced at the bottom of these past couple of slides. Although, just one warning: the YOLO paper is one of the harder papers to read. I remember, when I was reading this paper for the first time, I had a really hard time figuring out what was going on, and I wound up asking a couple of my friends, very good researchers, to help me figure it out, and even they had a hard time understanding some of the details of the paper. So, if you look at the paper, it’s okay if you have a hard time figuring it out. I wish I could say it was uncommon, but it’s not that uncommon, sadly, for even senior researchers to read research papers and have a hard time figuring out the details, and to have to look at open source code, or contact the authors, or something else to figure out the details of these algorithms. But don’t let me stop you from taking a look at the paper yourself if you wish, though this is one of the harder ones. With that, though, you now understand the basics of the YOLO algorithm. Let’s go on to some additional pieces that will make this algorithm work even better.
06_intersection-over-union
So how do you tell if your object detection algorithm is working well? In this video, you’ll learn about a function called Intersection Over Union. We’ll use it both for evaluating your object detection algorithm, as well as, in the next video, for adding another component to your object detection algorithm to make it work even better. Let’s get started.
In the object detection task, you’re expected to localize the object as well. So if that’s the ground-truth bounding box, and your algorithm outputs this bounding box in purple, is this a good outcome or a bad one? What the intersection over union function, or IoU, does is compute the intersection over union of these two bounding boxes. So, the union of these two bounding boxes is this area, really the area that is contained in either bounding box, whereas the intersection is this smaller region here. So what the intersection over union does is compute the size of the intersection, that orange shaded area, divided by the size of the union, which is that green shaded area. And by convention, the localization task will judge that your answer is correct if the IoU is greater than 0.5. And if the predicted and the ground-truth bounding boxes overlapped perfectly, the IoU would be one, because the intersection would equal the union. But in general, so long as the IoU is greater than or equal to 0.5, then the answer looks okay, looks pretty decent. And by convention, very often 0.5 is used as a threshold to judge whether the predicted bounding box is correct or not. This is just a convention. If you want to be more stringent, you can judge an answer as correct only if the IoU is greater than or equal to 0.6 or some other number. But the higher the IoU, the more accurate the bounding box. And so, this is one way to map localization to accuracy, where you just count up the number of times an algorithm correctly detects and localizes an object, using a definition like this of whether or not the object is correctly localized. And again, 0.5 is just a human-chosen convention; there’s no particularly deep theoretical reason for it. You can also choose some other threshold like 0.6 if you want to be more stringent. I sometimes see people use more stringent criteria like 0.6 or maybe 0.7; I rarely see people drop the threshold below 0.5. So that’s what motivates the definition of IoU as a way to evaluate whether or not your object localization algorithm is accurate. But more generally, IoU is a measure of the overlap between two bounding boxes: if you have two boxes, you can compute the intersection, compute the union, and take the ratio of the two areas. And so this is also a way of measuring how similar two boxes are to each other. And we’ll see it used this way again in the next video when we talk about non-max suppression.
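Here is a straightforward implementation of IoU for two axis-aligned boxes given as (x1, y1, x2, y2) corners, a common representation (the grid-relative encoding from the previous video would first be converted to corners).

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2) corners."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)       # zero if the boxes don't overlap
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

print(iou((0, 0, 2, 2), (1, 1, 3, 3)))   # 1/7, about 0.143: a poor localization
print(iou((0, 0, 2, 2), (0, 0, 2, 2)))   # 1.0: perfect overlap
```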
So that’s it for IoU, or Intersection over Union. Not to be confused with the promissory note concept of an IOU, where if you lend someone money they write you a note that says, “Oh, I owe you this much money.” That’s a totally different concept; maybe it’s cool that these two things have a similar name. So now, armed with this definition of IoU, Intersection over Union, in the next video I want to discuss with you non-max suppression, which is a tool you can use to make the outputs of YOLO work even better. So let’s go on to the next video.
07_non-max-suppression
One of the problems of object detection as you’ve learned about it so far is that your algorithm may find multiple detections of the same objects. Rather than detecting an object just once, it might detect it multiple times. Non-max suppression is a way for you to make sure that your algorithm detects each object only once.
Let’s go through an example. Let’s say you want to detect pedestrians, cars, and motorcycles in this image. You might place a grid over this, and this is a 19 by 19 grid. Now, while technically this car has just one midpoint, so it should be assigned to just one grid cell, and the car on the left also has just one midpoint, so technically only one of those grid cells should predict that there is a car, in practice you’re running an object classification and localization algorithm for every one of these grid cells. So it’s quite possible that this grid cell might think that the center of a car is in it, and so might this one, and so might this one, and the same for the car on the left. For the test image you’ve seen before, not only that box, but maybe this box, and this box, and maybe others as well, will also think that they’ve found the car.
Let’s step through an example of how non-max suppression works. Because you’re running the image classification and localization algorithm on every grid cell, on 361 grid cells, it’s possible that many of them will raise their hand and say, “My $p_c$, my chance of thinking I have an object in it, is large,” rather than just two of the grid cells out of the 19 squared, or 361, thinking they have detected an object. So, when you run your algorithm, you might end up with multiple detections of each object. What non-max suppression does is clean up these detections, so you end up with just one detection per car, rather than multiple detections per car. Concretely, it first looks at the probabilities associated with each of these detections, the $p_c$ values (although, as you’ll learn in this week’s programming exercise, the score is actually $p_c$ times $c_1$, $c_2$, or $c_3$; for now, let’s just say $p_c$ is the probability of a detection). It first takes the largest one, which in this case is 0.9, and says, “That’s my most confident detection, so let’s highlight that and just say I found the car there.” Having done that, the non-max suppression part then looks at all of the remaining rectangles, and all the ones with a high overlap, with a high IoU, with the one you’ve just output get suppressed. So those two rectangles with the 0.6 and the 0.7 both overlap a lot with the light blue rectangle, so you’re going to suppress them and darken them to show that they are being suppressed. Next, you go through the remaining rectangles and find the one with the highest probability, the highest $p_c$, which in this case is this one with 0.8. So let’s commit to that and just say, “Oh, I’ve detected a car there.”
And then the non-max suppression part gets rid of any other ones with a high IoU. So now, every rectangle has been either highlighted or darkened. And if you just get rid of the darkened rectangles, you are left with just the highlighted ones, and these are your two final predictions. So, this is non-max suppression, and non-max means that you’re going to output your maximal probability classifications but suppress the close-by ones that are non-maximal. Hence the name, non-max suppression.
Let’s go through the details of the algorithm. First, on this 19 by 19 grid, you’re going to get a 19 by 19 by 8 output volume. Although, for this example, I’m going to simplify it and say that you’re only doing car detection. So, let me get rid of the $c_1$, $c_2$, $c_3$ and pretend, for this slide, that for each of the 19 by 19, so for each of the 361 (which is 19 squared) positions, you get an output prediction of the following: the chance there’s an object, and then the bounding box. And if you have only one object class, there’s no $c_1$, $c_2$, $c_3$ prediction. The details of what happens when you have multiple objects, I’ll leave to the programming exercise, which you’ll work on towards the end of this week. Now, to implement non-max suppression, the first thing you can do is discard all the boxes, discard all the predicted bounding boxes, with $p_c$ less than or equal to some threshold, let’s say 0.6. So we’re going to say that unless you think there’s at least a 0.6 chance there is an object there, let’s just get rid of it. This discards all of the low-probability output boxes. The way to think about this is that for each of the 361 positions, you output a bounding box together with a probability of that bounding box being a good one, and we’re just going to discard all the bounding boxes that were assigned a low probability. Next, while there are any remaining bounding boxes that you’ve not yet discarded or processed, you’re going to repeatedly pick the box with the highest probability, the highest $p_c$, and output that as a prediction. This is the process from the previous slide of taking one of the bounding boxes and making it lighter in color; you commit to outputting that as a prediction that there is a car there. Next, you discard any remaining box, any box that you have not output as a prediction and that was not previously discarded, with a high overlap, with a high IoU, with the box that you just output in the previous step. This second step in the while loop was when, on the previous slide, you would darken any remaining bounding box that had a high overlap with the bounding box that we just made lighter, that we just highlighted. And so you keep doing this while there are still any remaining boxes that you’ve not yet processed, until you’ve taken each of the boxes and either output it as a prediction, or discarded it as having too high an overlap, too high an IoU, with one of the boxes that you have just output as your predicted position for one of the detected objects.
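The single-class algorithm just described translates almost line for line into the sketch below (my own code, not from the course, with the corner-based IoU helper from the previous video redefined here so the snippet stands alone).

```python
def iou(a, b):
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    return inter / ((a[2]-a[0])*(a[3]-a[1]) + (b[2]-b[0])*(b[3]-b[1]) - inter)

def non_max_suppression(boxes, scores, pc_threshold=0.6, iou_threshold=0.5):
    """boxes: list of (x1, y1, x2, y2); scores: the corresponding pc values.
    Returns the indices of the boxes that survive."""
    # 1. Discard boxes with low pc.
    remaining = [i for i, s in enumerate(scores) if s > pc_threshold]
    kept = []
    # 2. While boxes remain: keep the highest-pc box, discard those overlapping it heavily.
    while remaining:
        best = max(remaining, key=lambda i: scores[i])
        kept.append(best)
        remaining = [i for i in remaining
                     if i != best and iou(boxes[i], boxes[best]) < iou_threshold]
    return kept

boxes  = [(10, 10, 60, 50), (12, 12, 62, 52), (100, 30, 160, 80), (102, 32, 158, 78)]
scores = [0.9, 0.7, 0.8, 0.6]
print(non_max_suppression(boxes, scores))   # [0, 2]: one detection per car
```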
I’ve described the algorithm using just a single object class on this slide. If you actually tried to detect three object classes, say pedestrians, cars, and motorcycles, then the output vector would have three additional components. And it turns out the right thing to do is to independently carry out non-max suppression three times, once on each of the output classes. But the details of that I’ll leave to this week’s programming exercise, where you get to implement non-max suppression yourself on multiple object classes.
So that’s it for non-max suppression, and if you implement the Object Detection algorithm we’ve described, you actually get pretty decent results. But before wrapping up our discussion of the YOLO algorithm, there’s just one last idea I want to share with you, which makes the algorithm work much better, which is the idea of using anchor boxes. Let’s go on to the next video.
08_anchor-boxes
One of the problems with object detection as you have seen it so far is that each of the grid cells can detect only one object. What if a grid cell wants to detect multiple objects? Here is what you can do. You can use the idea of anchor boxes. Let’s start with an example.
Let’s say you have an image like this. And for this example, I am going to continue to use a 3 by 3 grid. Notice that the midpoint of the pedestrian and the midpoint of the car are in almost the same place, and both of them fall into the same grid cell. So, for that grid cell, if Y outputs this vector where you are detecting three classes, pedestrians, cars and motorcycles, it won’t be able to output two detections, so it would have to pick one of the two detections to output. With the idea of anchor boxes, what you are going to do is pre-define two different shapes called anchor boxes, or anchor box shapes, and you will now be able to associate two predictions with the two anchor boxes. And in general, you might use more anchor boxes, maybe five or even more, but for this video I am just going to use two anchor boxes, just to make the description easier. So what you do is you define the class label to be, instead of this vector on the left, basically this vector repeated twice. So, you will have $p_c$, $b_x$, $b_y$, $b_h$, $b_w$, $c_1$, $c_2$, $c_3$, and these are the eight outputs associated with anchor box 1. And then you repeat that, $p_c$, $b_x$ and so on down to $c_1$, $c_2$, $c_3$, the other eight outputs associated with anchor box 2. So, because the shape of the pedestrian is more similar to the shape of anchor box 1 than anchor box 2, you can use these eight numbers to encode that $p_c$ is one, yes there is a pedestrian, use these to encode the bounding box around the pedestrian, and then use these to encode that that object is a pedestrian. And then, because the box around the car is more similar to the shape of anchor box 2 than anchor box 1, you can use these to encode that the second object here is the car, and have the bounding box and so on be all the parameters associated with the detected car.
So to summarize, previously, before you were using anchor boxes, you did the following: each object in each training set image was assigned to the grid cell that contains that object’s midpoint. And so the output Y was 3 by 3 by 8, because you have a 3 by 3 grid, and for each grid position we had that output vector, which is $p_c$, then the bounding box, and $c_1$, $c_2$, $c_3$. With anchor boxes, you now do the following. Each object is assigned to the same grid cell as before, the grid cell that contains the object’s midpoint, but it is assigned to the grid cell and the anchor box with the highest IoU with the object’s shape. So, if you have two anchor boxes and an object with this shape, what you do is take your two anchor boxes, maybe one anchor box is this shape, that’s anchor box 1, maybe anchor box 2 is this shape, and you see which of the two anchor boxes has a higher IoU with the drawn bounding box of the object. And whichever it is, that object then gets assigned not just to a grid cell but to a pair: it gets assigned to a (grid cell, anchor box) pair. And that’s how that object gets encoded in the target label. And so now, the output Y is going to be 3 by 3 by 16, because, as you saw on the previous slide, Y is now 16-dimensional. Or, if you want, you can also view this as 3 by 3 by 2 by 8, because there are now two anchor boxes and Y is eight-dimensional. And the dimension of Y being eight is because we have three object classes; if you had more classes, the dimension of Y would be even higher.
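A minimal sketch of my own of the assignment rule just described: for each labeled object, compare its box shape against the anchor shapes using IoU of width and height only (both boxes imagined to share a midpoint), and write it into the (grid cell, anchor) slot with the highest shape IoU. The particular anchor shapes are made-up illustrations.

```python
import numpy as np

def shape_iou(wh_a, wh_b):
    """IoU of two boxes that share a midpoint, so only (width, height) matter."""
    inter = min(wh_a[0], wh_b[0]) * min(wh_a[1], wh_b[1])
    return inter / (wh_a[0] * wh_a[1] + wh_b[0] * wh_b[1] - inter)

def make_anchor_target(objects, anchors, grid=3, num_classes=3):
    """objects: list of (bx, by, bh, bw, class_id), box values as fractions of the image.
    anchors: list of (height, width) anchor shapes. Returns grid x grid x anchors x (5 + classes)."""
    y = np.zeros((grid, grid, len(anchors), 5 + num_classes), dtype=np.float32)
    for bx, by, bh, bw, cls in objects:
        row = min(int(by * grid), grid - 1)
        col = min(int(bx * grid), grid - 1)
        a = int(np.argmax([shape_iou((bw, bh), (aw, ah)) for ah, aw in anchors]))
        y[row, col, a, 0] = 1.0
        y[row, col, a, 1:5] = [bx, by, bh, bw]
        y[row, col, a, 5 + cls] = 1.0
    return y

anchors = [(0.4, 0.15), (0.2, 0.45)]  # (height, width): tall/skinny vs. short/wide
# A pedestrian (class 0) and a car (class 1) sharing the lower-middle cell:
y = make_anchor_target([(0.5, 0.8, 0.35, 0.12, 0), (0.5, 0.85, 0.18, 0.4, 1)], anchors)
print(y.shape)       # (3, 3, 2, 8)
print(y[2, 1, 0])    # pedestrian written into the anchor-0 slot
print(y[2, 1, 1])    # car written into the anchor-1 slot
```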
So let’s go through a concrete example. For this grid cell, let’s specify what Y is. The pedestrian is more similar to the shape of anchor box 1, so for the pedestrian, we’re going to assign it to the top half of this vector. So yes, there is an object, there will be some bounding box associated with the pedestrian, and, if a pedestrian is class one, then $c_1$ is one, and then zero, zero. And then the shape of the car is more similar to anchor box 2, and so for the rest of this vector, $p_c$ will be one, then the bounding box associated with the car, and since the car is class 2, that’s zero, one, zero. So that’s the label Y for that lower middle grid cell that this arrow is pointing to. Now, what if this grid cell only had a car and no pedestrian? If it only had a car, then assuming that the shape of the bounding box around the car is still more similar to anchor box 2, the target label Y, if there were just a car there and the pedestrian had gone away, would still be the same for the anchor box 2 component. Remember that this is the part of the vector corresponding to anchor box 2. And for the part of the vector corresponding to anchor box 1, you just say there is no object there, so $p_c$ is zero, and then the rest of these will be don’t cares. Now, just some additional details. What if you have two anchor boxes but three objects in the same grid cell? That’s one case this algorithm doesn’t handle well. Hopefully, it won’t happen, but if it does, the algorithm doesn’t have a great way of handling it; you would just implement some default tiebreaker for that case. Or what if you have two objects associated with the same grid cell, but both of them have the same anchor box shape? Again, that’s another case this algorithm doesn’t handle well. You would implement some default way of tiebreaking if that happens; hopefully it won’t happen with your data set, or it won’t happen much at all, and so it shouldn’t affect performance much.
So, that’s it for anchor boxes. And even though I motivated anchor boxes as a way to deal with what happens if two objects appear in the same grid cell, in practice that happens quite rarely, especially if you use a 19 by 19 rather than a 3 by 3 grid. The chance of two objects having their midpoints in the same one of those 361 cells, well, it does happen, but it doesn’t happen that often. Maybe an even better motivation, or an even better result that anchor boxes give you, is that they allow your learning algorithm to specialize better. In particular, if your data set has some tall, skinny objects like pedestrians, and some wide objects like cars, then this allows your learning algorithm to specialize, so that some of the outputs can specialize in detecting wide, fat objects like cars, and some of the output units can specialize in detecting tall, skinny objects like pedestrians. So finally, how do you choose the anchor boxes? People used to just choose them by hand, or choose maybe five or ten anchor box shapes that span a variety of shapes that seem to cover the types of objects you want to detect. As a more advanced version, just an advanced comment for those of you who have more background in machine learning, an even better way to do this, from one of the later YOLO research papers, is to use a K-means algorithm to group together the types of object shapes you tend to get, and then use that to select a set of anchor boxes that are most stereotypically representative of the maybe dozens of object classes you’re trying to detect. But that’s a more advanced way to automatically choose the anchor boxes. And if you just choose by hand a variety of shapes that reasonably spans the set of object shapes you expect to detect, some tall, skinny ones, some wide, fat ones, that should work fine as well. So that’s it for anchor boxes. In the next video, let’s take everything we’ve seen and tie it back together into the YOLO algorithm.
09_yolo-algorithm
You’ve already seen most of the components of object detection. In this video, let’s put all the components together to form the YOLO object detection algorithm.
First, let’s see how you construct your training set. Suppose you’re trying to train an algorithm to detect three objects: pedestrians, cars, and motorcycles. You won’t need an explicit background class, since $p_c = 0$ already covers the case where none of these objects is present; the class labels are just these three. If you’re using two anchor boxes, then the output y will be three by three, because you are using a three by three grid, by two, which is the number of anchors, by eight, because that’s the dimension of each anchor’s part of the vector. Eight is actually five plus the number of classes: five because you have $p_c$ and the four bounding box numbers, and then c1, c2, c3, whose dimension equals the number of classes. You can either view this as three by three by two by eight, or as three by three by sixteen. So to construct the training set, you go through each of these nine grid cells and form the appropriate target vector y. Take this first grid cell: there’s nothing worth detecting in it. None of the three classes, pedestrian, car, and motorcycle, appears in the upper left grid cell, and so the target y corresponding to that grid cell would be equal to this, where $p_c$ for the first anchor box is zero because there’s nothing associated with the first anchor box, and is also zero for the second anchor box, and all of the other values are don’t cares. Now, most of the grid cells have nothing in them, but for that box over there, you would have this target vector y. Assuming that your training set has a bounding box like this for the car, it’s just a little bit wider than it is tall. And so if your anchor boxes are these, this is anchor box one and this is anchor box two, then the red box has a slightly higher IoU with anchor box two, and so the car gets associated with the lower portion of the vector. So notice that $p_c$ associated with anchor box one is zero, and you have don’t cares for all of those components. Then $p_c$ for anchor box two is equal to one; you use these numbers to specify the position of the red bounding box, and then specify that the correct object is class two, that is, a car. So you go through this, and for each of your nine grid positions, each of your three by three grid positions, you come up with a vector like this, a 16-dimensional vector. And that’s why the final output volume is going to be 3 by 3 by 16. As usual, for simplicity on the slide I’ve used a 3 by 3 grid; in practice it might be more like 19 by 19 by 16. Or in fact, if you use more anchor boxes, maybe 19 by 19 by 5 times 8, because five times eight is 40, so it would be 19 by 19 by 40 if you use five anchor boxes. So that’s the training set, and you train a ConvNet that inputs an image, maybe 100 by 100 by 3, and the ConvNet then finally outputs this output volume, in our example 3 by 3 by 16, or 3 by 3 by 2 by 8.
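To make the bookkeeping concrete, here is a small sketch (with made-up values) of what the 3 by 3 by 2 by 8 target volume looks like for an image with a single car in the lower-middle grid cell, matched to the wide anchor box; it also shows the equivalent flattened 3 by 3 by 16 view.

```python
import numpy as np

GRID, NUM_ANCHORS, NUM_CLASSES = 3, 2, 3
y = np.zeros((GRID, GRID, NUM_ANCHORS, 5 + NUM_CLASSES))   # 3 x 3 x 2 x 8

# Upper-left cell: no object, so p_c stays 0 for both anchors and the remaining
# seven numbers per anchor are "don't cares" (masked out of the loss).

# The car in the lower-middle cell, matched to anchor 2 (index 1) because its
# wide shape has the higher IoU with that anchor. Values are illustrative.
row, col, anchor = 2, 1, 1
y[row, col, anchor, 0] = 1.0                        # p_c = 1
y[row, col, anchor, 1:5] = [0.5, 0.85, 0.2, 0.35]   # bx, by, bh, bw
y[row, col, anchor, 5:8] = [0, 1, 0]                # c1, c2, c3: class 2, a car

print(y.shape)                           # (3, 3, 2, 8)
print(y.reshape(GRID, GRID, -1).shape)   # (3, 3, 16), the flattened view
```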
Next, let’s look at how your algorithm can make predictions. Given an image, your neural network will output this 3 by 3 by 2 by 8 volume, where for each of the nine grid cells you get a vector like that. So for the grid cell here on the upper left, if there’s no object there, hopefully your neural network will output zero here, and zero here, and it will output some other values. Your neural network can’t output a question mark, can’t output a don’t care, so it will put out some numbers for the rest. But those numbers will basically be ignored, because the neural network is telling you that there’s no object there, so it doesn’t really matter whether the rest of the output is a sensible bounding box or a car; it will basically just be some set of numbers, more or less noise. In contrast, for this box over here, hopefully the output y for that box at the bottom left would be something like zero for $p_c$ of anchor box one, followed by a bunch of numbers that are just noise, and then one for $p_c$ of anchor box two, together with a set of numbers that specifies a pretty accurate bounding box for the car. So that’s how the neural network will make predictions. Finally, you run this through non-max suppression.
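As a sketch of how you might read detections out of that raw output volume, here is a small Python helper that flattens the grid-by-anchor predictions into a list of candidate boxes and drops the low-confidence, essentially-noise entries described above. The function name, the threshold value, and the choice of score as $p_c$ times the most likely class probability are assumptions for illustration.

```python
import numpy as np

def extract_boxes(pred, threshold=0.6):
    # pred: the raw (grid, grid, num_anchors, 5 + num_classes) network output.
    # Returns a flat list of (score, class_id, bx, by, bh, bw) detections,
    # dropping the low-confidence entries where the network says "no object here".
    detections = []
    grid, _, num_anchors, _ = pred.shape
    for row in range(grid):
        for col in range(grid):
            for a in range(num_anchors):
                pc = pred[row, col, a, 0]
                class_probs = pred[row, col, a, 5:]
                cls = int(np.argmax(class_probs))
                score = pc * class_probs[cls]        # confidence for the best class
                if score >= threshold:
                    bx, by, bh, bw = pred[row, col, a, 1:5]
                    detections.append((score, cls, bx, by, bh, bw))
    return detections

# e.g. detections = extract_boxes(network_output)
# where network_output is the hypothetical (3, 3, 2, 8) prediction volume.
```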
So just to make it interesting, let’s look at a new test set image. Here’s how you would run non-max suppression. If you’re using two anchor boxes, then for each of the nine grid cells you get two predicted bounding boxes. Some of them will have a very low probability, a very low $p_c$, but you still get two predicted bounding boxes for each of the nine grid cells. So let’s say those are the bounding boxes you get. Notice that some of the bounding boxes can go outside the height and width of the grid cell they came from. Next, you get rid of the low probability predictions, the ones where even the neural network says this object probably isn’t there. And then finally, if you have three classes you’re trying to detect, pedestrians, cars, and motorcycles, what you do is, for each of the three classes, independently run non-max suppression on the objects that were predicted to come from that class. So run non-max suppression for the predictions of the pedestrian class, run non-max suppression for the car class, and run non-max suppression for the motorcycle class; you run it basically three times to generate the final predictions. And so the output of this is, hopefully, that you will have detected all the cars and all the pedestrians in this image.
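Continuing the earlier sketch, here is one way you might implement the per-class non-max suppression step on the candidate list produced by extract_boxes above. The IoU helper, the greedy keep-the-highest-score loop, and the 0.5 overlap threshold are common choices but are assumptions here, not the exact settings used in the programming exercise.

```python
def iou(box1, box2):
    # IoU of two boxes given as (bx, by, bh, bw) with midpoint coordinates.
    x1, y1, h1, w1 = box1
    x2, y2, h2, w2 = box2
    xa, ya = max(x1 - w1 / 2, x2 - w2 / 2), max(y1 - h1 / 2, y2 - h2 / 2)
    xb, yb = min(x1 + w1 / 2, x2 + w2 / 2), min(y1 + h1 / 2, y2 + h2 / 2)
    inter = max(0.0, xb - xa) * max(0.0, yb - ya)
    return inter / (w1 * h1 + w2 * h2 - inter)

def non_max_suppression(dets, iou_threshold=0.5):
    # Greedy NMS over detections (score, class_id, bx, by, bh, bw) of one class:
    # repeatedly keep the highest-scoring box and discard boxes that overlap it too much.
    dets = sorted(dets, key=lambda d: d[0], reverse=True)
    kept = []
    while dets:
        best = dets.pop(0)
        kept.append(best)
        dets = [d for d in dets if iou(best[2:], d[2:]) < iou_threshold]
    return kept

def nms_per_class(detections, num_classes=3, iou_threshold=0.5):
    # Run non-max suppression independently for each class
    # (pedestrian, car, motorcycle) and pool the survivors.
    final = []
    for c in range(num_classes):
        final += non_max_suppression([d for d in detections if d[1] == c],
                                     iou_threshold)
    return final

# e.g. final_boxes = nms_per_class(extract_boxes(network_output))
```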
So that’s it for the YOLO object detection algorithm, which is really one of the most effective object detection algorithms and encompasses many of the best ideas across the entire computer vision literature that relate to object detection. You get a chance to practice implementing many components of this yourself in this week’s programming exercise, and I hope you enjoy it. There’s also an optional video that follows this one, which you can watch or not as you please; either way, I look forward to seeing you next week.
10_optional-region-proposals
If you look at the object detection literature, there’s a set of ideas called region proposals that’s been very influential in computer vision as well. I wanted to make this video optional because I tend to use the region proposal set of algorithms a bit less often, but nonetheless, it has been an influential body of work and an idea that you might come across in your own work. Let’s take a look.
So if you recall the sliding windows idea, you would take a trained classifier and run it across all of these different windows to see if there’s a car, a pedestrian, or maybe a motorcycle. Now, you could run the algorithm convolutionally, but one downside is that it still classifies a lot of regions where there’s clearly no object. This rectangle down here is pretty much blank; there’s clearly nothing interesting there to classify, and maybe it was also running on this rectangle, which looks like there’s nothing that interesting there either. So what Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik proposed in the paper cited at the bottom of the slide is an algorithm called R-CNN, which stands for Regions with convolutional networks, or regions with CNNs. What it does is try to pick just a few regions on which it makes sense to run your ConvNet classifier. So rather than running sliding windows over every single position, you instead select just a few windows and run your ConvNet classifier on just those windows. The way they perform the region proposals is to run what is called a segmentation algorithm, which results in the output on the right, in order to figure out what could be objects. So, for example, the segmentation algorithm finds a blob over here, and so you might pick that bounding box and say, “Let’s run a classifier on that blob.” It also finds this little green blob there, so you might run the classifier on that rectangle to see if there’s something interesting there. And in this case, if you run a classifier on this blue blob, hopefully you find the pedestrian, and if you run it on this light cyan blob, maybe you’ll find a car, maybe not; I’m not sure. So you find maybe 2000 blobs, place bounding boxes around those roughly 2000 blobs, and run your classifier on just those 2000 blobs, which can be a much smaller number of positions on which to run your ConvNet classifier than if you had to run it at every single position throughout the image. You can also run your ConvNet not just on square-shaped regions, but on tall, skinny regions to try to find pedestrians, on wide, fat regions to try to find cars, and at multiple scales as well. So that’s the R-CNN, or Regions with CNN features, idea.
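To give a rough feel for the region proposal step, here is a small sketch that over-segments an image into blobs and turns each blob’s bounding box into a candidate region. It uses scikit-image’s Felzenszwalb segmentation as a simple stand-in for the selective search procedure used in the R-CNN paper, and the parameter values are illustrative only.

```python
from skimage import data, segmentation, measure

# Over-segment an image into blobs; each blob's bounding box becomes a candidate
# region on which a ConvNet classifier could then be run, instead of running the
# classifier at every sliding-window position.
image = data.astronaut()                              # any RGB test image
blobs = segmentation.felzenszwalb(image, scale=200, sigma=0.8, min_size=100)

proposals = []
for region in measure.regionprops(blobs + 1):         # +1 so no blob is treated as background
    min_row, min_col, max_row, max_col = region.bbox  # bounding box around the blob
    proposals.append((min_row, min_col, max_row, max_col))

print(len(proposals), "candidate regions instead of every sliding-window position")
# In R-CNN, each proposal would now be cropped, classified by the ConvNet, and
# given a refined bounding box (bx, by, bh, bw).
```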
Now, it turns out the R-CNN algorithm is still quite slow, so there’s been a line of work exploring how to speed it up. The basic R-CNN algorithm proposes regions using some segmentation algorithm and then classifies the proposed regions one at a time. For each of the regions, it outputs a label: is there a car, is there a pedestrian, is there a motorcycle there? It also outputs a bounding box, so you can get an accurate bounding box if indeed there is an object in that region. So just to be clear, the R-CNN algorithm doesn’t just trust the bounding box it was given; it also outputs a bounding box, bx, by, bh, bw, in order to get a more accurate bounding box than whatever box happened to surround the blob that the image segmentation algorithm gave it. So it can get pretty accurate bounding boxes. One downside of the R-CNN algorithm, though, is that it is quite slow, so over the years there have been a few improvements. Ross Girshick proposed the fast R-CNN algorithm, which is basically the R-CNN algorithm but with a convolutional implementation of sliding windows. Whereas the original implementation would classify the regions one at a time, fast R-CNN uses a convolutional implementation of sliding windows, roughly similar to the idea you saw in the fourth video of this week, and that speeds up R-CNN quite a bit. It turns out that one of the remaining problems of the fast R-CNN algorithm is that the clustering step used to propose the regions is still quite slow, and so a different group, Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun, proposed the faster R-CNN algorithm, which uses a convolutional neural network, instead of one of the more traditional segmentation algorithms, to propose the blobs or regions, and that wound up running quite a bit faster than the fast R-CNN algorithm. Although, I think most implementations of faster R-CNN are usually still quite a bit slower than the YOLO algorithm.
So the idea of region proposals has been quite influential in computer vision, and I wanted you to know about these ideas because you may see others still using them. For myself, and this is my personal opinion, not the opinion of the computer vision research community as a whole, I think region proposals are an interesting idea, but not having two separate steps, first proposing regions and then classifying them, and instead being able to do everything at the same time, similar to the YOLO or You Only Look Once algorithm, seems to me like a more promising direction for the long term. But that’s my personal opinion and not necessarily the opinion of the whole computer vision research community, so feel free to take it with a grain of salt. Still, you might come across others using the R-CNN idea, so it is worth learning as well so that you can understand other algorithms better. So we’ve now finished up our material for this week on object detection. I hope you enjoy working on this week’s programming exercise, and I look forward to seeing you next week.