01_foundations-of-convolutional-neural-networks

Note

These are my personal notes from studying the first week of the Convolutional Neural Networks course; the copyright belongs to deeplearning.ai.

01_computer-vision

Welcome to this course on Convolutional Networks. Computer vision is one of the areas that’s been advancing rapidly thanks to deep learning. Deep learning computer vision is now helping self-driving cars figure out where the other cars and pedestrians around them are, so as to avoid them. It is making face recognition work much better than ever before, so that perhaps some of you will soon, or perhaps already, be able to unlock a phone, or even unlock a door, using just your face. And if you look on your cell phone, I bet you have many apps that show you pictures of food, or pictures of a hotel, or just fun pictures of scenery. And some of the companies that build those apps are using deep learning to help show you the most attractive, the most beautiful, or the most relevant pictures. And I think deep learning is even enabling new types of art to be created. So, here are the two reasons I’m excited about deep learning for computer vision, and why I think you might be too. First, rapid advances in computer vision are enabling brand new applications that were just impossible a few years ago. And by learning these tools, perhaps you will be able to invent some of these new products and applications. Second, even if you don’t end up building computer vision systems per se, I found that because the computer vision research community has been so creative and so inventive in coming up with new neural network architectures and algorithms, it actually creates a lot of cross-fertilization into other areas as well. For example, when I was working on speech recognition, I sometimes actually took inspiration from ideas from computer vision and borrowed them into the speech literature. So, even if you don’t end up working on computer vision, I hope that you find some of the ideas you learn about in this course helpful for some of your algorithms and your architectures. So with that, let’s get started.

Here are some examples of computer vision problems we’ll study in this course. You’ve already seen image classification, sometimes also called image recognition, where you might take as input, say, a 64 by 64 image and try to figure out, is that a cat? Another example of a computer vision problem is object detection. So, if you’re building a self-driving car, maybe you don’t just need to figure out that there are other cars in this image; instead, you need to figure out the position of the other cars in this picture, so that your car can avoid them. In object detection, usually, we have to not just figure out that there are other objects, say cars, in the picture, but also draw boxes around them, or have some other way of recognizing where in the picture these objects are. And notice also, in this example, that there can be multiple cars in the same picture, or at least every one of them within a certain distance of your car. Here’s another example, maybe a more fun one, which is neural style transfer. Let’s say you have a picture, and you want this picture repainted in a different style. So in neural style transfer, you have a content image, and you have a style image. The image on the right is actually a Picasso. And you can have a neural network put them together to repaint the content image (that is, the image on the left), but in the style of the image on the right, and you end up with the image at the bottom. So, algorithms like these are enabling new types of artwork to be created. And in this course, you’ll learn how to do this yourself as well.


One of the challenges of computer vision problems is that the inputs can get really big. For example, in previous courses, you’ve worked with 64 by 64 images. And so that’s 64 by 64 by 3 because there are three color channels, and if you multiply that out, that’s 12,288. So x, the input feature vector, has dimension 12,288. And that’s not too bad. But 64 by 64 is actually a very small image. If you work with larger images, maybe this is a 1000 pixel by 1000 pixel image, and that’s actually just one megapixel. But the dimension of the input features will be 1000 by 1000 by 3, because you have three RGB channels, and that’s three million. If you are viewing this on a smaller screen, this might not be apparent, but this is actually a low res 64 by 64 image, and this is a higher res 1000 by 1000 image. But if you have three million input features, then this means that x here will be three million dimensional. And so, if in the first hidden layer maybe you have just 1000 hidden units, then the total number of weights, that is the matrix W1 if you use a standard or fully connected network like we had in courses one or two, will be a 1000 by 3 million dimensional matrix, because x is now in R^(3 million), with 3m denoting three million. And this means that this matrix will have three billion parameters, which is just very, very large. And with that many parameters, it’s difficult to get enough data to prevent a neural network from overfitting. And also, the computational requirements and the memory requirements to train a neural network with three billion parameters are just a bit infeasible.
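To see where these numbers come from, here is a rough sketch of the arithmetic (illustration only, not course code):

```python
# Parameter count of a fully connected first layer: W1 has shape (hidden_units, n_x).
n_x_small = 64 * 64 * 3        # 12,288 input features
n_x_large = 1000 * 1000 * 3    # 3,000,000 input features
hidden_units = 1000

print(hidden_units * n_x_small)  # 12,288,000 weights
print(hidden_units * n_x_large)  # 3,000,000,000 weights -- three billion
```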

But for computer vision applications, you don’t want to be stuck using only tiny little images. You want to use large images. To do that, you need to implement the convolution operation, which is one of the fundamental building blocks of convolutional neural networks. Let’s see what this means, and how you can implement this, in the next video. And we’ll illustrate convolutions using the example of edge detection.

02_edge-detection-example

The convolution operation is one of the fundamental building blocks of a convolutional neural network. Using edge detection as the motivating example in this video, you will see how the convolution operation works.


In previous videos, I have talked about how the early layers of the neural network might detect edges, then some later layers might detect parts of objects, and then even later layers may detect complete objects, like people’s faces in this case. In this video, you’ll see how you can detect edges in an image. Let’s take an example. Given a picture like that, for a computer to figure out what the objects in this picture are, the first thing you might do is maybe detect vertical edges in this image. For example, this image has all those vertical lines where the buildings are, as well as the kind of vertical lines of these pedestrians, and so those get detected in this vertical edge detector output. And you might also want to detect horizontal edges, so for example, there is a very strong horizontal line where this railing is, and that also gets detected sort of roughly here. How do you detect edges in an image like this?


Let us look at an example. Here is a 6 by 6 grayscale image, and because this is a grayscale image, this is just a 6 by 6 by 1 matrix rather than 6 by 6 by 3, because there are no separate RGB channels. In order to detect edges, or let’s say vertical edges, in this image, what you can do is construct a 3 by 3 matrix, and in the terminology of convolutional neural networks, this is going to be called a filter. And I am going to construct a 3 by 3 filter or 3 by 3 matrix that looks like this: 1, 1, 1, 0, 0, 0, -1, -1, -1. Sometimes research papers will call this a kernel instead of a filter, but I am going to use the filter terminology in these videos. And what you are going to do is take the 6 by 6 image and convolve it with the 3 by 3 filter, and the convolution operation is denoted by this asterisk. One slightly unfortunate thing about the notation is that in mathematics, the asterisk is the standard symbol for convolution, but in Python, this is also used to denote multiplication or maybe element-wise multiplication. So this asterisk has dual purposes, it is overloaded notation, but I will try to be clear in these videos when this asterisk refers to convolution.


The output of this convolution operator will be a 4 by 4 matrix, which you can interpret, which you can think of, as a 4 by 4 image. The way you compute this 4 by 4 output is as follows: to compute the first element, the upper left element of this 4 by 4 matrix, what you are going to do is take the 3 by 3 filter and paste it on top of the upper left 3 by 3 region of your original input image. I have written here 1, 1, 1, 0, 0, 0, -1, -1, -1. And what you should do is take the element-wise product, so going down the first column you get 3 times 1, plus 1 times 1, plus 2 times 1; then the middle column gives you 0 times 0, plus 5 times 0, plus 7 times 0; and then the rightmost column gives 1 times -1, plus 8 times -1, plus 2 times -1. Adding up these nine numbers will give you negative 5, and so I’m going to fill in negative 5 over here. You can add up these nine numbers in any order of course; it is just that I went down the first column, then the second column, then the third.


Next, to figure out what this second element is, you are going to take the blue square and shift it one step to the right like so. Let me get rid of the green marks here. You are going to do the same element-wise product and then addition. You have zero times one, plus five times one, plus seven times one, plus one times zero, plus eight times zero, plus two times zero, plus two times negative one, plus nine times negative one, plus five times negative one, and if you add up those nine numbers, you end up with negative four, and so on.

If you shift this to the right, do the nine products and add them up, you get zero and then over here you should get 8.

Just to verify, you have 2 plus 9 plus 5, that’s 16, from the left column. Then the middle column gives you zero, and the rightmost column gives (4 plus 1 plus 3) times negative 1, that’s -8. So that is 16 from the left column minus 8, and that gives you 8, like we have over here.


Next, in order to get this element in the next row, what you do is take the blue square and now shift it one down so you have it in that position, and again repeat the element-wise product and addition exercise. If you do that, you should get negative 10 here. If you shift it one to the right, you should get negative 2, and then 2, and then 3, and so on. Then fill in all the rest of the elements of the matrix.

To be clear, this -16 would be obtained from this lower right 3 by 3 region. So a 6 by 6 matrix convolved with a 3 by 3 matrix gives you a 4 by 4 matrix. And these are images and filters; these are really just matrices of various dimensions. But the matrix on the left is convenient to interpret as an image, the one in the middle we interpret as a filter, and the one on the right you can interpret as maybe another image. And this turns out to be a vertical edge detector, and you’ll see why on the next slide. Before going on though, just one other comment, which is that if you implement this in a programming language, then in practice, most programming languages will have some different function, rather than an asterisk, to denote convolution. For example, in the exercise, you use or implement a function called conv_forward in Python. If you do this in TensorFlow, there is a function tf.nn.conv2d. And in other deep learning programming frameworks, like the Keras framework we shall see later in this course, there is a function called Conv2D that implements convolution, and so on. But all the deep learning frameworks that have good support for computer vision will have some function for implementing this convolution operator.
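To make the mechanics above concrete, here is a minimal NumPy sketch of the operation (this is not the course’s conv_forward; the function name and layout are just illustrative, and the vertical edge filter is written with its columns of 1s, 0s, and -1s as listed in the text):

```python
import numpy as np

def conv2d(image, filt):
    """Deep-learning-style 'convolution' with stride 1 and no padding:
    slide the filter over the image, take element-wise products, and sum."""
    n = image.shape[0]
    f = filt.shape[0]
    out_dim = n - f + 1                       # e.g. 6 - 3 + 1 = 4
    out = np.zeros((out_dim, out_dim))
    for i in range(out_dim):
        for j in range(out_dim):
            region = image[i:i + f, j:j + f]   # current f x f window
            out[i, j] = np.sum(region * filt)  # nine products, then sum
    return out

vertical_edge_filter = np.array([[1, 0, -1],   # columns are 1s, 0s, -1s
                                 [1, 0, -1],
                                 [1, 0, -1]])
```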


Why is this doing vertical edge detection? Let’s look at another example. To illustrate this, we are going to use a simplified image. Here is a simple 6 by 6 image where the left half of the image is 10 and the right half is zero. If you plot this as a picture, it might look like this, where the left half, the 10s, gives you brighter pixel intensity values and the right half gives you darker pixel intensity values. I am using that shade of gray to denote zeros, although maybe it could also be drawn as black. But in this image, there is clearly a very strong vertical edge right down the middle of this image as it transitions from white to black, or white to a darker color. When you convolve this with the 3 by 3 filter, which can be visualized as follows, with lighter, brighter pixels on the left, mid-tone zeros in the middle, and darker pixels on the right, what you get is this matrix on the right. Just to verify this math if you want, this zero, for example, is obtained by taking the element-wise products with this 3 by 3 block, so you get from the left column 10 plus 10 plus 10, then zeros in the middle, and then -10, -10, -10, which is why you end up with zero over here. Whereas in contrast, that 30 would be obtained from this block, which you get from having 10 plus 10 plus 10 and then minus zero, minus zero, minus zero, which is why you end up with a 30 over there. Now, if you plot this rightmost matrix as an image, it will look like that, where there is this lighter region right in the middle, and that corresponds to having detected this vertical edge down the middle of your 6 by 6 image. In case the dimensions here seem a little bit wrong, in that the detected edge seems really thick, that’s only because we are working with very small images in this example. If you were using, say, a 1000 by 1000 image rather than a 6 by 6 image, then you would find that this does a pretty good job of detecting the vertical edges in your image. In this example, the bright region in the middle is just the output image’s way of saying that it looks like there is a strong vertical edge right down the middle of the image. Maybe one intuition to take away from vertical edge detection is that, since we are using a 3 by 3 filter, a vertical edge is a three by three region where there are bright pixels on the left, you do not care that much what is in the middle, and there are dark pixels on the right. The middle of this 6 by 6 image really is where there are bright pixels on the left and darker pixels on the right, and that is why it thinks there is a vertical edge over there. The convolution operation gives you a convenient way to specify how to find these vertical edges in an image.
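As a quick, self-contained check of the numbers above (a sketch using SciPy’s correlate2d, which performs exactly the sliding element-wise-product-and-sum described here):

```python
import numpy as np
from scipy import signal

# 6x6 image: left half bright (10s), right half dark (0s)
image = np.zeros((6, 6))
image[:, :3] = 10.0

vertical_edge_filter = np.array([[1., 0., -1.],
                                 [1., 0., -1.],
                                 [1., 0., -1.]])

output = signal.correlate2d(image, vertical_edge_filter, mode='valid')
print(output)
# Every row of the 4x4 output is [0., 30., 30., 0.]: the bright band down
# the middle is the detected vertical edge.
```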

You have now seen how the convolution operator works. In the next video, you will see how to take this and use it as one of the basic building blocks of a Convolutional Neural Network.

03_more-edge-detection

You’ve seen how the convolution operation allows you to implement a vertical edge detector. In this video, you’ll learn the difference between positive and negative edges, that is, the difference between light to dark versus dark to light edge transitions. And you’ll also see other types of edge detectors, as well as how to have an algorithm learn, rather than have us hand code an edge detector as we’ve been doing so far. So let’s get started.


Here’s the example you saw from the previous video, where you have this image, six by six, there’s light on the left and dark on the right, and convolving it with the vertical edge detection filter results in detecting the vertical edge down the middle of the image. What happens in an image where the colors are flipped, where it is darker on the left and brighter on the right? So the 10s are now on the right half of the image and the 0s on the left. If you convolve it with the same edge detection filter, you end up with negative 30s, instead of 30, down the middle, and you can plot that as a picture that maybe looks like that. So because the shade of the transition is reversed, the 30s now get reversed as well. **And the negative 30s show that this is a dark to light rather than a light to dark transition.** And if you don’t care which of these two cases it is, you could take absolute values of this output matrix. But this particular filter does distinguish between light to dark and dark to light edges.


Let’s see some more examples of edge detection. This three by three filter we’ve seen allows you to detect vertical edges. So maybe it should not surprise you too much that this three by three filter will allow you to detect horizontal edges. So as a reminder, a vertical edge according to this filter is a three by three region where the pixels are relatively bright on the left part and relatively dark on the right part. Similarly, a horizontal edge would be a three by three region where the pixels are relatively bright on the top row and relatively dark on the bottom row. So here’s one example, and this is a more complex one, where you have 10s in the upper left and lower right-hand corners. So if you draw this as an image, this would be an image which is going to be darker where there are 0s, so I’m going to shade in the darker regions, and then lighter in the upper left and lower right-hand corners. And if you convolve this with a horizontal edge detector, you end up with this. And so just to take a couple of examples, this 30 here corresponds to this three by three region, where indeed there are bright pixels on top and darker pixels on the bottom. It’s kind of over here. And so it finds a strong positive edge there. And this -30 here corresponds to this region, which is actually brighter on the bottom and darker on top. So that is a negative edge in this example. And again, this is kind of an artifact of the fact that we’re working with relatively small images, that this is just a six by six image. But these intermediate values, like this -10, for example, just reflect the fact that the filter here captures part of the positive edge on the left and part of the negative edge on the right, and so blending those together gives you some intermediate value. But if this was a very large, say a thousand by a thousand, image with this type of checkerboard pattern, then you wouldn’t really see these transition regions; the intermediate values would be quite small relative to the size of the image.


So in summary, different filters allow you to find vertical and horizontal edges. It turns out that the three by three vertical edge detection filter we’ve used is just one possible choice. And historically, in the computer vision literature, there was a fair amount of debate about what is the best set of numbers to use. So here’s something else you could use, which is maybe 1, 2, 1, 0, 0, 0, -1, -2, -1. This is called a Sobel filter. And the advantage of this is it puts a little bit more weight on the central row, the central pixel, and this makes it maybe a little bit more robust. But computer vision researchers will use other sets of numbers as well, like maybe instead of a 1, 2, 1, it should be a 3, 10, 3, and then -3, -10, -3. This is called a Scharr filter, and it has yet other slightly different properties. And this is just for vertical edge detection; if you flip it 90 degrees, you get horizontal edge detection. And with the rise of deep learning, one of the things we learned is that when you really want to detect edges in some complicated image, maybe you don’t need to have computer vision researchers handpick these nine numbers. Maybe you can just learn them and treat the nine numbers of this matrix as parameters, which you can then learn using back propagation. And the goal is to learn nine parameters so that when you take the image, the six by six image, and convolve it with your three by three filter, this gives you a good edge detector. And what you’ll see in later videos is that by just treating these nine numbers as parameters, backprop can choose to learn 1, 1, 1, 0, 0, 0, -1, -1, -1 if it wants, or learn the Sobel filter or the Scharr filter, or, more likely, learn something else that’s even better at capturing the statistics of your data than any of these hand coded filters. And rather than just vertical and horizontal edges, maybe it can learn to detect edges that are at 45 degrees or 70 degrees or 73 degrees or at whatever orientation it chooses. And so by just letting all of these numbers be parameters and learning them automatically from data, we find that neural networks can actually learn low level features, can learn features such as edges, even more robustly than computer vision researchers are generally able to code up by hand. But underlying all these computations is still this convolution operation, which allows back propagation to learn whatever three by three filter it wants and then to apply it throughout the entire image, at this position, at this position, at this position, in order to output whatever feature it’s trying to detect. Be it vertical edges, horizontal edges, or edges at some other angle, or even some other filter that we might not even have a name for in English.
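For reference, here is a sketch of the hand-designed filters mentioned above written as NumPy arrays (the numbers in the text are listed column by column, so the columns below are the 1s/2s, the zeros, and the negatives):

```python
import numpy as np

# Basic vertical edge filter (columns: 1s, 0s, -1s)
basic_vertical = np.array([[1, 0, -1],
                           [1, 0, -1],
                           [1, 0, -1]])

# Sobel filter: more weight on the central row
sobel_vertical = np.array([[1, 0, -1],
                           [2, 0, -2],
                           [1, 0, -1]])

# Scharr filter: a different weighting with a similar intent
scharr_vertical = np.array([[ 3, 0,  -3],
                            [10, 0, -10],
                            [ 3, 0,  -3]])

# Rotating any of these by 90 degrees gives the corresponding horizontal
# edge detector, e.g. np.rot90(sobel_vertical).
```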

So the idea that you can treat these nine numbers as parameters to be learned has been one of the most powerful ideas in computer vision. And later in this course, later this week, we’ll actually talk about the details of how you go about using back propagation to learn these nine numbers. But first, let’s talk about some other details, some other variations, on the basic convolution operation. In the next two videos, I want to discuss with you how to use padding as well as different strides for convolutions. And these two will become important pieces of this convolutional building block of convolutional neural networks. So let’s go on to the next video.

04_padding

In order to build deep neural networks, one modification to the basic convolutional operation that you need to really use is padding. Let’s see how it works. What we saw in earlier videos is that if you take a six by six image and convolve it with a three by three filter, you end up with a four by four output, a four by four matrix, and that’s because the number of possible positions for the three by three filter to fit in your six by six matrix is only, sort of, four by four. And the math of this turns out to be that if you have an n by n image and convolve that with an f by f filter, then the dimension of the output will be (n - f + 1) by (n - f + 1). And in this example, six minus three plus one is equal to four, which is why you wound up with a four by four output. So there are two downsides to this. One is that every time you apply a convolutional operator, your image shrinks, so you go from six by six down to four by four, and you can only do this a few times before your image starts getting really small, maybe it shrinks down to one by one or something. So maybe you don’t want your image to shrink every time you detect edges or other features on it; that’s one downside. And the second downside is that if you look at the pixel at the corner or the edge, this little pixel is touched or used in only one of the outputs, because it touches only that one three by three region. Whereas if you take a pixel in the middle, say this pixel, then there are a lot of three by three regions that overlap that pixel, and so it’s as if pixels on the corners or on the edges are used much less in the output. So you’re throwing away a lot of the information near the edge of the image.


So, to solve both of these problems, both the shrinking output (and when you build really deep neural networks, you’ll see why you don’t want the image to shrink on every step, because if you have maybe a hundred layers of a deep net, and it shrinks a bit on every layer, then after a hundred layers you end up with a very small image) and the problem of throwing away a lot of the information from the edges of the image, what you can do is, before applying the convolutional operation, pad the image. So in this case, let’s say you pad the image with an additional border of one pixel all around the edges. If you do that, then instead of a six by six image, you’ve now padded this to an eight by eight image, and if you convolve an eight by eight image with a three by three filter, the output is now not four by four but six by six, so you’ve managed to preserve the original input size of six by six. By convention, when you pad, you pad with zeros, and if p is the padding amount, so in this case p is equal to one because we’re padding all around with an extra border of one pixel, then the output becomes (n + 2p - f + 1) by (n + 2p - f + 1). So this becomes six plus two times one minus three plus one, by the same thing, and six plus two minus three plus one equals six. So you end up with a six by six image that preserves the size of the original image. And this corner pixel now actually influences several of these cells of the output, and so the effect of throwing away, or at least counting less, the information from the corners or edges of the image is reduced. And I’ve shown here the effect of padding the border with just one pixel. If you want, you can also pad the border with two pixels, in which case you add on another border here, and you can pad with even more pixels if you choose. So what I’m drawing here would be padding p equal to two.


In terms of how much to pad, it turns out there are two common choices, called valid convolutions and same convolutions. These are not really great names, but a valid convolution basically means no padding. And so in this case you might have an n by n image convolved with an f by f filter, and this would give you an (n - f + 1) by (n - f + 1) dimensional output. So this is like the example we had previously in the earlier videos, where we had a six by six image convolved with the three by three filter, and that gave you a four by four output. The other most common choice of padding is called a same convolution, and that means you pad so that the output size is the same as the input size. So if we actually look at this formula, when you pad by p pixels, then it’s as if n goes to n plus 2p, and then you have the rest of this, minus f plus one. So if we have an n by n image and pad a border of p pixels all around, then the output size along this dimension is n plus 2p minus f plus one. And so, if you want n plus 2p minus f plus one to be equal to n, so that the output size is the same as the input size, then if you take this and solve for p, n cancels out on both sides, and this implies that p is equal to (f minus 1) over 2. So when f is odd, by choosing the padding size to be as follows, you can make sure that the output size is the same as the input size. And that’s why, for example, when the filter was three by three, as happened on the previous slide, the padding that would make the output size the same as the input size was three minus one over two, which is one. And as another example, if your filter is five by five, so f is equal to five, then if you plug it into that equation you find that a padding of two is required to keep the output size the same as the input size. And by convention in computer vision, f is usually odd. It’s actually almost always odd, and you rarely see even numbered filters used in computer vision. And I think there are two reasons for that. One is that if f were even, then you would need some asymmetric padding; only if f is odd does this type of same convolution give a natural padding region with the same dimension all around, rather than padding more on the left and less on the right, or something asymmetric like that. And second, when you have an odd dimension filter, such as three by three or five by five, then it has a central position, and sometimes in computer vision it’s nice to have a distinguished pixel, a pixel you can call the central pixel, so you can talk about the position of the filter. Maybe none of this is a great reason why f is pretty much always odd, but if you look at the convolution literature, you see three by three filters are very common, you see some five by fives and seven by sevens, and actually, later we’ll also talk about one by one filters and why those make sense. But just by convention, I recommend you use odd numbered filters as well. I think you can probably get just fine performance even if you use an even value for f, but sticking to the common computer vision convention, I usually just use odd numbered f.
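Here is a small sketch of this arithmetic in Python (hypothetical helper names, just to illustrate the formulas):

```python
def conv_output_size(n, f, p=0):
    """Output size of a stride-1 convolution: n + 2p - f + 1."""
    return n + 2 * p - f + 1

def same_padding(f):
    """Padding that keeps the output size equal to the input size, for odd f."""
    return (f - 1) // 2

print(conv_output_size(6, 3))                     # valid convolution: 4
print(same_padding(3), same_padding(5))           # 1, 2
print(conv_output_size(6, 3, same_padding(3)))    # same convolution: 6
```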

So you’ve now seen how to use padded convolutions. To specify the padding for your convolution operation, you can either specify the value for p, or you can just say that this is a valid convolution, which means p equals zero, or you can say this is a same convolution, which means pad as much as you need to make sure the output has the same dimension as the input. So that’s it for padding. In the next video, let’s talk about how you can implement strided convolutions.

05_strided-convolutions

Strided convolutions is another piece of the basic building block of convolutions as used in Convolutional Neural Networks.

Let me show you an example. Let’s say you want to convolve this seven by seven image with this three by three filter, except that instead of doing the usual way, we are going to do it with a stride of two.


What that means is you take the element-wise product as usual in this upper left three by three region, then multiply and add, and that gives you 91. But then instead of stepping the blue box over by one step, we are going to step it over by two steps. So, we are going to make it hop over two steps like so. Notice how the upper left hand corner has gone from here to here, jumping over one position. And then you do the usual element-wise product and summing, and it turns out to be 100.

And now we are going to do that again, and make the blue box jump over by two steps. You end up there, and that gives you 83.

Now, when you go to the next row, you again take two steps instead of one step, so we’re going to move the blue box over there. Notice how we are stepping over one of the positions, and then this gives you 69, and now you again step over two steps, and this gives you 91, and so on, 127. And then for the final row, 44, 72, and 74.

In this example, we convolved this seven by seven matrix with this three by three filter and we got a three by three output. The input and output dimensions turn out to be governed by the following formula: if you have an n by n image convolved with an f by f filter, and you use padding p and stride s (in this example, s is equal to two), then you end up with an output that is (n + 2p - f)/s + 1 by (n + 2p - f)/s + 1, because you are now stepping s steps at a time instead of just one step at a time. In our example, we have (seven plus zero minus three) divided by two, the stride, plus one; that’s four over two plus one, which equals three, which is why we wound up with this three by three output.

Now, just one last detail, which is: what if this fraction is not an integer? In that case, we’re going to round this down, and this notation denotes the floor of z, which means taking z and rounding down to the nearest integer. The way this is implemented is that you take this type of blue box multiplication only if the blue box is fully contained within the image, or the image plus the padding; if any part of the blue box hangs outside, then you just do not do that computation. So if that’s the convention, that your three by three filter must lie entirely within your image, or the image plus the padding region, before a corresponding output is generated, then the right thing to do to compute the output dimension is to round down in case (n + 2p - f)/s is not an integer.


Just to summarize the dimensions: if you have an n by n matrix, or n by n image, that you convolve with an f by f matrix, or f by f filter, with padding p and stride s, then the output size will be ⌊(n + 2p - f)/s + 1⌋ by ⌊(n + 2p - f)/s + 1⌋. It is nice if we can choose all of these numbers so that the result is an integer, although sometimes you don’t have to do that, and rounding down is just fine as well. But please feel free to work through a few examples of values of n, f, p and s yourself, to convince yourself if you want that this formula is correct for the output size.
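Extending the earlier helper sketch with a stride argument (again, just an illustration of the formula, not course code):

```python
import math

def conv_output_size(n, f, p=0, s=1):
    """floor((n + 2p - f)/s) + 1: output size of a padded, strided convolution."""
    return math.floor((n + 2 * p - f) / s) + 1

print(conv_output_size(7, 3, p=0, s=2))  # 3, matching the 7x7 example above
print(conv_output_size(6, 3, p=1, s=1))  # 6, a same convolution
```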

Now, before moving on, there is a technical comment I want to make about cross-correlation versus convolution, which won’t affect what you have to do to implement convolutional neural networks. If you read a different math textbook or signal processing textbook, there is one other possible inconsistency in the notation, which is that, if you look at a typical math textbook, the way convolution is defined, before doing the element-wise product and summing there is actually one other step that you would first take. To convolve this six by six matrix with this three by three filter, you would first take the three by three filter and flip it on the horizontal as well as the vertical axis, so this filter, 3, 4, 5, 1, 0, 2, -1, 9, 7, gets mirrored both vertically and horizontally. And it is this flipped matrix that you would then copy over here. To compute the output, you would take two times seven, plus three times two, plus seven times five, and so on; you would multiply out the elements of this flipped matrix in order to compute the upper left-hand element of the four by four output. Then you take those nine numbers and shift them over by one, and shift them over by one, and so on. The way we’ve defined the convolution operation in these videos is that we’ve skipped this flipping operation. Technically, the operation we’ve been using for the last few videos is sometimes called cross-correlation instead of convolution, but in the deep learning literature, by convention, we just call this the convolution operation. Just to summarize, by convention in machine learning, we usually do not bother with this flipping operation; technically, this operation is maybe better called cross-correlation, but most of the deep learning literature just calls it the convolution operator, and so I’m going to use that convention in these videos as well. And if you read a lot of the machine learning literature, you’ll find most people just call this the convolution operator without bothering to use these flips. It turns out that in signal processing, or in certain branches of mathematics, doing the flipping in the definition of convolution causes the convolution operator to enjoy the property that (A convolved with B) convolved with C is equal to A convolved with (B convolved with C), and this is called associativity in mathematics. This is nice for some signal processing applications, but for deep neural networks it really doesn’t matter, and so omitting this double mirroring operation just simplifies the code and makes the neural networks work just as well. And by convention, most of us just call this convolution, even though mathematicians sometimes prefer to call it cross-correlation. But this should not affect anything you have to implement in the programming exercises and should not affect your ability to read and understand the deep learning literature.
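Here is a small sketch of this distinction using SciPy (the filter values are the ones mentioned above; correlate2d and convolve2d are standard SciPy functions):

```python
import numpy as np
from scipy import signal

image = np.random.randn(6, 6)
filt = np.array([[ 3, 4, 5],
                 [ 1, 0, 2],
                 [-1, 9, 7]])

# What deep learning calls "convolution" is, mathematically, cross-correlation:
dl_conv = signal.correlate2d(image, filt, mode='valid')

# The textbook convolution flips the filter on both axes first; these two match:
math_conv = signal.convolve2d(image, filt, mode='valid')
flipped = np.flip(filt)  # mirror both vertically and horizontally
print(np.allclose(math_conv, signal.correlate2d(image, flipped, mode='valid')))  # True
```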

So you’ve now seen how to carry out convolutions, and you’ve seen how to use padding as well as strides in convolutions. But so far, all we’ve been using is convolutions over matrices, like over a six by six matrix. In the next video, you’ll see how to carry out convolutions over volumes, and this will make what you can do with convolutions much more powerful. Let’s go on to the next video.

06_convolutions-over-volume

You’ve seen how convolutions over 2D images works. Now, let’s see how you can implement convolutions over, not just 2D images, but over three dimensional volumes.


Let’s start with an example. Let’s say you want to detect features, not just in a grayscale image, but in an RGB image. So, an RGB image, instead of being a six by six image, could be six by six by three, where the three here corresponds to the three color channels. So, you can think of this as a stack of three six by six images. In order to detect edges or some other feature in this image, you convolve this, not with a three by three filter as we did previously, but now with a 3D filter that is going to be three by three by three. So the filter itself will also have three layers corresponding to the red, green, and blue channels. So to give these things some names, this first six here is the height of the image, that’s the width, and this three is the number of channels. And your filter similarly has a height, a width, and a number of channels. And the number of channels in your image must match the number of channels in your filter, so these two numbers have to be equal. We’ll see on the next slide how this convolution operation actually works, but the output of this will be a four by four image. And notice this is four by four by one, there’s no longer a three at the end.


Let’s go through in detail how this works, but let’s use a more nicely drawn image. So here’s the six by six by three image, and here’s a three by three by three filter, and this last number, the number of channels, matches between the image and the filter. So to simplify the drawing of this three by three by three filter, instead of drawing it as a stack of three matrices, I’m also going to sometimes just draw it as a three dimensional cube, like that. So to compute the output of this convolutional operation, what you would do is take the three by three by three filter and first place it in that upper left most position. Notice that this three by three by three filter has 27 numbers, or 27 parameters, that’s three cubed. And so, what you do is take each of these 27 numbers and multiply them with the corresponding numbers from the red, green, and blue channels of the image; so take the first nine numbers from the red channel, then the nine beneath them for the green channel, then the nine beneath that for the blue channel, and multiply them with the corresponding 27 numbers that get covered by this yellow cube shown on the left. Then add up all those numbers, and this gives you the first number in the output. Then to compute the next output, you take this cube and slide it over by one, and again do the 27 multiplications, add up the 27 numbers, and that gives you the next output; do it for the next position over, that gives the third output, and so on. That gives you the fourth, and then one row down, and then the next one, to the next one, to the next one, and so on, you get the idea, until at the very end, that’s the position you’ll have for the final output. So, what does this allow you to do? Well, here’s an example. This filter is three by three by three. So, if you want to detect edges in the red channel of the image, then you could have the first filter, the red channel, be 1, 1, 1, 0, 0, 0, -1, -1, -1 as usual, and have the green channel be all zeros, and have the blue channel be all zeros. And if you have these three stacked together to form your three by three by three filter, then this would be a filter that detects vertical edges, but only in the red channel. Alternatively, if you don’t care what color the vertical edge is in, then you might have a filter like this, with the same 1, 1, 1, 0, 0, 0, -1, -1, -1 pattern in all three channels. So, with this second alternative set of parameters, you then have an edge detector, a three by three by three edge detector, that detects edges in any color.

And with different choices of these parameters, you can get different feature detectors out of this three by three by three filter. And by convention, in computer vision, when you have an input with a certain height, a certain width, and a certain number of channels, then your filter will have a potentially different height, a different width, but the same number of channels. And in theory, it’s possible to have a filter that maybe only looks at the red channel, or only the green channel, or only the blue channel. And once again, you notice that convolving a volume, a six by six by three convolved with a three by three by three, gives a four by four, a 2D output.


Now that you know how to convolve over volumes, there is one last idea that will be crucial for building convolutional neural networks, which is: what if we don’t just want to detect vertical edges? What if we want to detect vertical edges and horizontal edges, and maybe 45 degree edges and maybe 70 degree edges as well? In other words, what if you want to use multiple filters at the same time? So, here’s the picture we had from the previous slide: we had six by six by three convolved with the three by three by three, giving four by four, and maybe this is a vertical edge detector, or maybe it’s meant to detect some other feature. Now, maybe a second filter, denoted by this orange color, could be a horizontal edge detector. So, convolving with the first filter gives you this first four by four output, and convolving with the second filter gives you a different four by four output. And what we can do is then take these two four by four outputs, put this first one in the front, and take this second filter output and, let me draw it here, put it in the back as follows, so that by stacking these two together, you end up with a four by four by two output volume. And you can think of the volume as a box; if we draw it, I guess it would look like this. So this would be a four by four by two output volume, which is the result of taking your six by six by three image and convolving it with, or applying, two different three by three by three filters to it, resulting in two four by four outputs that then get stacked up to form a four by four by two volume. And the two here comes from the fact that we used two different filters. So, let’s just summarize the dimensions. If you have an n by n by n_C input image, so in the example that’s six by six by three, where n_C is the number of channels, and you convolve that with an f by f by n_C filter, and again this has to be the same n_C, so in this example three by three by three, then what you get is (n - f + 1) by (n - f + 1) by n_C prime, or really the n_C of the next layer, which is the number of filters that you use. So this in our example would be four by four by two. And I wrote this assuming that you use a stride of one and no padding; if you use a different stride or padding, then this n minus f plus one would be affected in the usual way, as we saw in the previous videos. So this idea of convolution over volumes turns out to be really powerful. One small part of it is that you can now operate directly on RGB images with three channels. But even more important is that you can now detect two features, like vertical and horizontal edges, or 10, or maybe 128, or maybe several hundred different features, and the output will then have a number of channels equal to the number of filters you are detecting. And a note on notation: I’ve been using the number of channels to denote this last dimension; in the literature, people will also often call this the depth of the 3D volume, and both notations, channels and depth, are commonly used in the literature. But I find depth more confusing, because you usually talk about the depth of the neural network as well, so I’m going to use the term channels in these videos to refer to the size of this third dimension of these filters.
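Here is a NumPy sketch of convolving a volume with several filters at once (stride one, no padding; the function name and argument layout are illustrative, not any particular library’s API):

```python
import numpy as np

def conv_volume(image, filters):
    """image: (n, n, n_C); filters: (num_filters, f, f, n_C).
    Returns an (n - f + 1, n - f + 1, num_filters) output volume."""
    n, _, n_c = image.shape
    num_filters, f, _, n_c_filt = filters.shape
    assert n_c == n_c_filt, "channels of image and filters must match"
    out_dim = n - f + 1
    out = np.zeros((out_dim, out_dim, num_filters))
    for k in range(num_filters):
        for i in range(out_dim):
            for j in range(out_dim):
                region = image[i:i + f, j:j + f, :]         # f x f x n_C block
                out[i, j, k] = np.sum(region * filters[k])  # 27 products, summed
    return out

image = np.random.randn(6, 6, 3)
filters = np.random.randn(2, 3, 3, 3)       # two 3x3x3 filters
print(conv_volume(image, filters).shape)    # (4, 4, 2)
```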

So now that you know how to implement convolutions over volumes, you now are ready to implement one layer of the convolutional neural network. Let’s see how to do that in the next video.

07_one-layer-of-a-convolutional-network

Now get ready to see how to build one layer of a convolutional neural network. Let’s go through an example.


You saw in the previous video how to take a 3D volume and convolve it with, say, two different filters, in order to get, in this example, two different 4 by 4 outputs. So let’s say convolving with the first filter gives this first 4 by 4 output, and convolving with the second filter gives a different 4 by 4 output. The final thing needed to turn this into a convolutional neural net layer is that for each of these we’re going to add a bias, which is going to be a real number, and with Python broadcasting, you add that same number to every one of these 16 elements. And then you apply a non-linearity, which for this illustration is a ReLU non-linearity, and this gives you a 4 by 4 output after applying the bias and the non-linearity. And then for this thing at the bottom as well, you add some different bias, again a real number, so you add that single number to all 16 numbers, and then apply some non-linearity, let’s say a ReLU non-linearity, and this gives you a different 4 by 4 output. Then, same as we did before, we take these and stack them up as follows, so we end up with a 4 by 4 by 2 output. This computation, where you go from a 6 by 6 by 3 to a 4 by 4 by 2, is one layer of a convolutional neural network. So to map this back to one layer of forward propagation in a standard, non-convolutional neural network, remember that one step of forward prop was something like this: z[1] = W[1] a[0] + b[1], where a[0] was equal to x, and then you apply the non-linearity to get a[1], so that’s g(z[1]). So in this analogy, this input here is a[0], that is, x. And these filters here play a role similar to W[1]. And you remember during the convolution operation, you were taking these 27 numbers, or really 27 times 2, because you have two filters; you’re taking all of these numbers and multiplying them. So you’re really computing a linear function to get this 4 x 4 matrix. So that 4 x 4 matrix, the output of the convolution operation, plays a role similar to W[1] a[0]; that’s really the output of this 4 x 4, as well as that 4 x 4. And then the other thing you do is add the bias. So this thing here, before applying the ReLU, plays a role similar to z[1]. And then finally, by applying the non-linearity, this output becomes your activation at the next layer. So this is how you go from a[0] to a[1]: first the linear operation, and the convolution is really applying a linear operation, then you add the biases, and then you apply the ReLU operation. And you’ve gone from a 6 by 6 by 3 dimensional a[0], through one layer of the neural network, to a 4 by 4 by 2 dimensional a[1]. So 6 by 6 by 3 has gone to 4 by 4 by 2, and that is one layer of a convolutional net. Now, in this example we have two filters, so we had two feature detectors if you will, which is why we wound up with an output that is 4 by 4 by 2. But if, for example, we instead had 10 filters instead of 2, then we would have wound up with a 4 by 4 by 10 dimensional output volume, because we’d be taking 10 of these maps, not just two of them, and stacking them up to form a 4 by 4 by 10 output volume, and that’s what a[1] would be.
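Here is a sketch of that forward step in NumPy, reusing the hypothetical conv_volume helper from the earlier snippet (so this illustrates z = conv(a_prev, W) + b followed by a ReLU; it is not the course’s exact implementation):

```python
import numpy as np

def conv_layer_forward(a_prev, filters, biases):
    """a_prev: (n, n, n_C_prev); filters: (num_filters, f, f, n_C_prev);
    biases: (num_filters,), one real number per filter."""
    z = conv_volume(a_prev, filters)     # (n - f + 1, n - f + 1, num_filters)
    z = z + biases.reshape(1, 1, -1)     # broadcast each bias over its 4x4 map
    return np.maximum(z, 0)              # ReLU non-linearity

a0 = np.random.randn(6, 6, 3)
W1 = np.random.randn(2, 3, 3, 3)         # two 3x3x3 filters
b1 = np.random.randn(2)
a1 = conv_layer_forward(a0, W1, b1)
print(a1.shape)  # (4, 4, 2)
```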


So, to make sure you understand this, let’s go through an exercise. Suppose you have 10 filters, not just two filters, that are 3 by 3 by 3, in one layer of a neural network. How many parameters does this layer have? Well, let’s figure this out. Each filter is a 3 x 3 x 3 volume, so each filter has 27 parameters; there are 27 numbers to be learned, plus the bias. So that was the b parameter, and this gives you 28 parameters. And then if you imagine that on the previous slide we had drawn two filters, but now you actually have ten of these, 1, 2, …, 10 of these, then all together you’ll have 28 times 10, so that will be 280 parameters. Notice one nice thing about this, which is that no matter how big the input image is, the input image could be 1,000 by 1,000 or 5,000 by 5,000, the number of parameters you have still remains fixed at 280. And you can use these ten filters to detect features, vertical edges, horizontal edges, maybe other features, anywhere even in a very, very large image, with just a very small number of parameters. So this is really one property of convolutional neural networks that makes them less prone to overfitting. So once you’ve learned 10 feature detectors that work, you could apply this even to large images, and the number of parameters still stays fixed and relatively small, 280 in this example.
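The same count in code (just the arithmetic from the exercise):

```python
# 10 filters of shape 3x3x3, each with one bias term.
f, n_c_prev, num_filters = 3, 3, 10
params_per_filter = f * f * n_c_prev + 1   # 27 weights + 1 bias = 28
print(params_per_filter * num_filters)     # 280, independent of the input image size
```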


All right, so to wrap up this video, let’s just summarize the notation we are going to use to describe one layer, a convolutional layer, in a convolutional neural network. If layer l is a convolution layer, I am going to use f^[l] to denote the filter size. So previously we’ve been saying the filters are f by f, and now this superscript square bracket l just denotes that layer l uses an f by f filter. And as usual, the superscript square bracket l is the notation we’re using to refer to a particular layer l. I’m going to use p^[l] to denote the amount of padding, and again, the amount of padding can also be specified just by saying that you want a valid convolution, which means no padding, or a same convolution, which means you choose the padding so that the output size has the same height and width as the input size. And then I’m going to use s^[l] to denote the stride. Now, the input to this layer is going to have some dimension; it’s going to be n by n by the number of channels of the previous layer. I’m going to modify this notation a little bit and use the superscript [l-1], because that’s the activation from the previous layer, so n^[l-1] by n^[l-1] by n_C^[l-1]. In the example so far, we’ve been using images of the same height and width, but in case the height and width differ, I am going to use subscripts H and W, so the input to layer l is n_H^[l-1] by n_W^[l-1] by n_C^[l-1]. In layer l, the size of the input volume carries the superscript square bracket l-1, because the input to this layer is whatever you had as the output of the previous layer. And then this layer of the neural network will itself output a volume, which will be n_H^[l] by n_W^[l] by n_C^[l]; that will be the size of the output. And whereas we saw before that the output volume size, or at least the height and width, is given by the formula (n + 2p - f)/s + 1, where you take the floor of that and round down, in this new notation the output volume of layer l has dimension given by the dimension from the previous layer, plus twice the padding used in layer l, minus the filter size used in layer l, divided by the stride of layer l, plus one, rounded down: n_H^[l] = ⌊(n_H^[l-1] + 2p^[l] - f^[l])/s^[l] + 1⌋. And technically this is true for the height, and the same is true for the width as well: you cross out H and put in W, and the same formula, with either the height or the width plugged in, computes the height or width of the output volume. So that’s how n_H^[l-1] relates to n_H^[l], and n_W^[l-1] relates to n_W^[l]. Now, how about the number of channels, where does that number come from? Let’s take a look. The depth of the output volume, well, we know from the previous examples that that’s equal to the number of filters we have in that layer. So when we had two filters, the output volume was 4 by 4 by 2, and if you had 10 filters, your output volume was 4 by 4 by 10. So the number of channels in the output volume, n_C^[l], is just the number of filters we’re using in this layer of the neural network. Next, how about the size of each filter? Well, each filter is going to be f^[l] by f^[l] by some number. So what is this last number? Well, we saw that you needed to convolve a 6 by 6 by 3 image with a 3 by 3 by 3 filter.
And so the number of channels in your filter must match the number of channels in your input, so this number should match that number, which is why each filter is going to be f^[l] by f^[l] by n_C^[l-1]. And the output of this layer, after applying the bias and the non-linearity, is going to be the activations of this layer, a^[l]. And as we’ve already seen, that will be a 3D volume of dimension n_H^[l] by n_W^[l] by n_C^[l]. And when you are using a vectorized implementation, or batch gradient descent, or mini-batch gradient descent, then you actually output A^[l], which is a set of m activations if you have m examples, so that would be m by n_H^[l] by n_W^[l] by n_C^[l]. If, say, you’re using batch gradient descent, in the programming exercises this will be the ordering of the variables, with the index over the training examples first, and then these three dimensions. Next, how about the weights, or the parameters, the W parameter? Well, we saw already what the filter dimensions are: the filters are f^[l] by f^[l] by n_C^[l-1], but that’s the dimension of one filter. How many filters do we have? Well, n_C^[l] is the total number of filters, so the weights, really all of the filters put together, will have dimension f^[l] by f^[l] by n_C^[l-1] times n_C^[l], because this last quantity is the number of filters in layer l. And then finally, you have the bias parameters, and you have one bias parameter, one real number, for each filter. So the bias will have n_C^[l] variables; it’s just a vector of this dimension, although later on we’ll see that in code it is more convenient to represent it as a 1 by 1 by 1 by n_C^[l] four dimensional matrix, or four dimensional tensor.
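Putting the notation together, here is a small helper sketch (hypothetical names) that computes the output dimensions and parameter shapes of a conv layer from f^[l], p^[l], s^[l], and the number of filters:

```python
import math

def conv_layer_dims(n_h_prev, n_w_prev, n_c_prev, f, p, s, num_filters):
    """Returns (output volume shape, weight shape, bias shape) for layer l."""
    n_h = math.floor((n_h_prev + 2 * p - f) / s) + 1
    n_w = math.floor((n_w_prev + 2 * p - f) / s) + 1
    n_c = num_filters
    weights_shape = (f, f, n_c_prev, n_c)  # all the filters put together
    bias_shape = (1, 1, 1, n_c)            # one real number per filter
    return (n_h, n_w, n_c), weights_shape, bias_shape

print(conv_layer_dims(39, 39, 3, f=3, p=0, s=1, num_filters=10))
# ((37, 37, 10), (3, 3, 3, 10), (1, 1, 1, 10))
```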

So I know that was a lot of notation, and this is the convention I’ll use for the most part. I just want to mention, in case you search online and look at open source code, that there isn’t a completely universal standard convention about the ordering of height, width, and channels. So if you look at source code on GitHub, or at these open source implementations, you’ll find that some authors use the other order instead, where the channels are put first, and you sometimes see that ordering of the variables. And in fact, in multiple common frameworks there is a variable or parameter that controls whether you list the number of channels first or list the number of channels last when indexing into these volumes. I think both of these conventions work okay, so long as you’re consistent. And unfortunately, maybe this is one piece of notation where there isn’t consistency in the deep learning literature, but I’m going to use this convention for these videos, where we list height and width, and then the number of channels last.

So I know that was suddenly a lot of new notation, and maybe you’re thinking, wow, that’s a lot of notation, how do I remember all of it? Don’t worry about it, you don’t need to remember all of this notation, and through this week’s exercises you’ll become more familiar with it. But the key point I hope you take away from this video is how one layer of a convolutional neural network works, and the computations involved in taking the activations of one layer and mapping them to the activations of the next layer. And next, now that you know how one layer of a convolutional neural network works, let’s stack a bunch of these together to actually form a deeper convolutional neural network. Let’s go on to the next video to see that.

08_simple-convolutional-network-example

In the last video, you saw the building blocks of a single layer, of a single convolution layer in the ConvNet. Now let’s go through a concrete example of a deep convolutional neural network. And this will give you some practice with the notation that we introduced toward the end of the last video as well.


Let's say you have an image and you want to do image classification, or image recognition, where you want to take as input an image x and decide, is this a cat or not, 0 or 1; so it's a classification problem. Let's build an example of a ConvNet you could use for this task. For the sake of this example, I'm going to use a fairly small image. Let's say this image is 39 x 39 x 3; this choice just makes some of the numbers work out a bit better. So nH[0] = nW[0] = 39, the height and width are both 39, and the number of channels in layer 0, nc[0], is equal to 3. Let's say the first layer uses a set of 3 by 3 filters to detect features, so f = 3, or really f[1] = 3, because we're using 3 by 3 filters. And let's say we're using a stride of 1 and no padding, so we're using a valid convolution, and let's say you have 10 filters. Then the activations in the next layer of the neural network will be 37 x 37 x 10, and this 10 comes from the fact that you use 10 filters. And 37 comes from the formula (n + 2p - f)/s + 1, so you have (39 + 0 - 3)/1 + 1 = 37. That's why the output is 37 by 37; it's a valid convolution, and that's the output size. So in our notation you would have nH[1] = nW[1] = 37 and nc[1] = 10, so nc[1] is also equal to the number of filters in the first layer. And this becomes the dimension of the activation at the first layer. Let's say you now have another convolutional layer, and let's say this time you use 5 by 5 filters. So in our notation f[2] for the next layer of the neural network is 5, and let's say we use a stride of 2 this time, no padding, and say 20 filters. Then the output of this will be another volume, this time 17 x 17 x 20. Notice that, because you're now using a stride of 2, the dimension has shrunk much faster: 37 x 37 has gone down in size by slightly more than a factor of 2 to 17 x 17. And because you're using 20 filters, the number of channels is now 20. So the activation a[2] would be that dimension, and so nH[2] = nW[2] = 17 and nc[2] = 20. All right, let's apply one last convolutional layer. So let's say that you use a 5 by 5 filter again and, again, a stride of 2. If you do that, I'll skip the math, but you end up with 7 x 7, and let's say you use 40 filters, no padding, 40 filters. You end up with 7 x 7 x 40. So now what you've done is taken your 39 x 39 x 3 input image and computed 7 x 7 x 40 features for this image. And then finally, what's commonly done is, if you take this 7 x 7 x 40, 7 times 7 times 40 is actually 1,960. So what we can do is take this volume and flatten it, or unroll it, into just 1,960 units. Just flatten it out into a vector, and then feed this to a logistic regression unit or a softmax unit, depending on whether you're trying to recognize a cat or no cat, or trying to recognize any one of K different objects, and then just have this give the final predicted output for the neural network. So just to be clear, this last step is just taking all of these numbers, all 1,960 of them, and unrolling them into a very long vector. Then you just have one long vector that you can feed into a softmax or a logistic regression unit in order to make a prediction for the final output. So this would be a pretty typical example of a ConvNet. A lot of the work in designing a convolutional neural network is selecting hyperparameters like these: deciding what's the filter size, what's the stride, what's the padding, and how many filters are used?
And both later this week as well as next week, we’ll give some suggestions and some guidelines on how to make these choices. But for now, maybe one thing to take away from this is that as you go deeper in a neural network, typically you start off with larger images, 39 by 39. And then the height and width will stay the same for a while and gradually trend down as you go deeper in the neural network. It’s gone from 39 to 37 to 17 to 7. Whereas the number of channels will generally increase. It’s gone from 3 to 10 to 20 to 40, and you see this general trend in a lot of other convolutional neural networks as well. So we’ll get more guidelines about how to design these parameters in later videos. But you’ve now seen your first example of a convolutional neural network, or a ConvNet for short. So congratulations on that.
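As a quick sanity check of the dimensions above, here is a short sketch (my own) that applies the output-size formula to each layer of this example:

```python
def conv_output_size(n, f, s, p=0):
    return (n + 2 * p - f) // s + 1       # (n + 2p - f)/s + 1, rounded down

n = 39                                    # 39 x 39 x 3 input
for f, s, n_filters in [(3, 1, 10), (5, 2, 20), (5, 2, 40)]:
    n = conv_output_size(n, f, s)
    print(f"{n} x {n} x {n_filters}")
# 37 x 37 x 10
# 17 x 17 x 20
# 7 x 7 x 40   ->  7 * 7 * 40 = 1960 units after flattening
```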


And it turns out that in a typical ConvNet, there are usually three types of layers. One is the convolutional layer, and often we'll denote that as a Conv layer. And that's what we've been using in the previous network. It turns out that there are two other common types of layers that you haven't seen yet but we'll talk about in the next couple of videos. One is called a pooling layer, often I'll call this Pool. And then the last is a fully connected layer called FC. And although it's possible to design a pretty good neural network using just convolutional layers, most neural network architectures will also have a few pooling layers and a few fully connected layers. Fortunately, pooling layers and fully connected layers are a bit simpler than convolutional layers to define. So we'll do that quickly in the next two videos, and then you'll have a sense of all of the most common types of layers in a convolutional neural network, and you'll be able to put together even more powerful networks than the one we just saw.

So congrats again on seeing your first full convolutional neural network. We'll also talk later this week about how to train these networks, but first let's talk briefly about pooling and fully connected layers. To train these, we'll be using backpropagation, which you're already familiar with. But in the next video, let's quickly go over how to implement a pooling layer for your ConvNet.

09_pooling-layers

Other than convolutional layers, ConvNets often also use pooling layers to reduce the size of the representation, to speed up the computation, as well as to make some of the features it detects a bit more robust. Let's take a look.


Let's go through an example of pooling, and then we'll talk about why you might want to do this. Suppose you have a four by four input, and you want to apply a type of pooling called max pooling. And the output of this particular implementation of max pooling will be a two by two output. The way you do that is quite simple: take your four by four input and break it into different regions, and I'm going to color the four regions as follows. Then, in the output, which is two by two, each of the outputs will just be the max from the corresponding shaded region. So for the upper left, the max of those four numbers is nine. For the upper right, the max of the blue numbers is two. Lower left, the biggest number is six, and lower right, the biggest number is three. So to compute each of the numbers on the right, we took the max over a two by two region. This is as if you apply a filter size of two, because you're taking two by two regions, and a stride of two. So these are actually the hyperparameters of max pooling: it's like a two by two region that gives you the nine, then you step over two steps to look at the next region, which gives you the two; then for the next row, you step down two steps to get the six, and step to the right by two steps to get the three. So because the squares are two by two, f is equal to two, and because you stride by two, s is equal to two. Here's the intuition behind what max pooling is doing. If you think of this four by four region as some set of features, the activations in some layer of the neural network, then a large number means that it's maybe detected a particular feature. So the upper left-hand quadrant has this particular feature; it may be a vertical edge, or maybe an eye or a whisker if you're trying to detect a cat. Clearly, that feature exists in the upper left-hand quadrant. Whereas this feature, maybe a cat eye detector, doesn't really exist in the upper right-hand quadrant. So what the max operation does is, if a feature is detected anywhere in one of these quadrants, it remains preserved in the output of max pooling. So what the max operator really does is say: if this feature is detected anywhere in this filter, keep a high number. But if this feature is not detected, so maybe this feature doesn't exist in the upper right-hand quadrant, then the max of all those numbers is still quite small. So maybe that's the intuition behind max pooling. But I have to admit, I think the main reason people use max pooling is because it's been found in a lot of experiments to work well, and as for the intuition I just described, despite it being often cited, I don't know that anyone fully knows whether that is the real underlying reason that max pooling works well in ConvNets. One interesting property of max pooling is that it has a set of hyperparameters but no parameters to learn. There's actually nothing for gradient descent to learn. Once you fix f and s, it's just a fixed computation, and gradient descent doesn't change anything.
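Here is a minimal numpy sketch of this computation (my own code; the 4 by 4 input values are illustrative, chosen so the maxes come out to 9, 2, 6, and 3 as above). The same function does average pooling if you swap the max for a mean:

```python
import numpy as np

def pool2d(x, f, s, mode="max"):
    """Pool a single 2D channel with filter size f and stride s."""
    nH = (x.shape[0] - f) // s + 1
    nW = (x.shape[1] - f) // s + 1
    reduce_fn = np.max if mode == "max" else np.mean   # mode="avg" -> average pooling
    out = np.zeros((nH, nW))
    for i in range(nH):
        for j in range(nW):
            out[i, j] = reduce_fn(x[i * s:i * s + f, j * s:j * s + f])
    return out

x = np.array([[1, 3, 2, 1],
              [2, 9, 1, 1],
              [1, 3, 2, 3],
              [5, 6, 1, 2]])
print(pool2d(x, f=2, s=2))   # [[9. 2.]
                             #  [6. 3.]]
```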

Let's go through an example with some different hyperparameters. Here, let's say you have a five by five input, and we're going to apply max pooling with a filter size that's three by three, so f is equal to three, and let's use a stride of one.

So in this case, the output size is going to be three by three. And the formula we developed in the previous videos for figuring out the output size of a conv layer, (n + 2p - f)/s + 1, also works for max pooling. But in this example, let's compute each of the elements of this three by three output. For the upper left-hand element, we're going to look over that region; notice this is a three by three region because the filter size is three, and we take the max there, so that will be nine. Then we shift over by one, because we're using a stride of one, and the max in the blue box is nine. Shift over again, and the max of the blue box is five. Then let's go on to the next row; with a stride of one, we're just stepping down by one step. So the max in that region is nine, the max in the next region is nine, and the max in the next region, now with two fives, is five. And then finally, the max of the next region is eight, the max of the next is six, and the max of the last region is [inaudible]. Okay, so this set of hyperparameters, f equals three, s equals one, gives the output shown. Now, so far, I've shown max pooling on 2D inputs.


If you have a 3D input, then the output will have the same number of channels. So for example, if you have a five by five by two input, then the output will be three by three by two, and the way you compute max pooling is to perform the computation we just described on each of the channels independently. So the first channel, which is shown here on top, is pooled as before, and then for the second channel, the one I just drew at the bottom, you do the same computation on that slice of the volume, and that gives you the second slice of the output. More generally, if the input were five by five by some number of channels, the output would be three by three by that same number of channels, and the max pooling computation is done independently on each of the nc channels. So, that's max pooling.
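Reusing the `pool2d` sketch from above, pooling a 3D volume is just the same computation applied to each channel (the random input here is only to show the shapes):

```python
volume = np.random.randn(5, 5, 2)                       # nH x nW x nC
pooled = np.stack(
    [pool2d(volume[:, :, c], f=3, s=1) for c in range(volume.shape[2])],
    axis=-1)                                            # pool each channel independently
print(pooled.shape)                                     # (3, 3, 2)
```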

There is one other type of pooling that isn't used very often, but I'll mention it briefly, which is average pooling. It does pretty much what you'd expect: instead of taking the max within each filter, you take the average. So in this example, the average of the numbers in purple is 3.75, then there is 1.25, and four, and two. And so this is average pooling with hyperparameters f equals two, s equals two; we can choose other hyperparameters as well. So these days, max pooling is used much more often than average pooling, with one exception, which is that sometimes, very deep in a neural network, you might use average pooling to collapse your representation from, say, 7 by 7 by 1,000: averaging over the spatial extent, you get 1 by 1 by 1,000. We'll see an example of this later. But you'll see max pooling used much more in neural networks than average pooling.


So just to summarize, the hyperparameters for pooling are f, the filter size, and s, the stride. A common choice of hyperparameters is f equals two, s equals two; this is used quite often and has the effect of shrinking the height and width of the representation by a factor of two. I've also seen f equals three, s equals two used. And then the other hyperparameter is just like a binary bit that says: are you using max pooling or are you using average pooling? If you want, you can add an extra hyperparameter for the padding, although this is very, very rarely used. When you do max pooling, usually you do not use any padding, although there is one exception that we'll see next week as well. But for the most part, max pooling does not use any padding, so the most common value of p by far is p equals zero. The input of max pooling is a volume of size nH by nW by nC, and it outputs a volume of size, assuming no padding, (nH - f)/s + 1 by (nW - f)/s + 1 by nC. So the number of input channels is equal to the number of output channels, because pooling applies to each of your channels independently. One thing to note about pooling is that there are no parameters to learn. So when we implement backprop, you'll find that there are no parameters that backprop will adapt through max pooling. Instead, there are just these hyperparameters that you set once, maybe by hand or using cross-validation, and then beyond that you are done. It's just a fixed function that the neural network computes in one of its layers, and there is actually nothing to learn; it's just a fixed function.
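In code form, the pooling output shape is just the conv formula with p = 0 and the channel count passed through unchanged (again a sketch of my own, with illustrative numbers):

```python
def pool_output_shape(nH, nW, nC, f, s):
    return ((nH - f) // s + 1,
            (nW - f) // s + 1,
            nC)                    # number of channels is unchanged

print(pool_output_shape(28, 28, 6, f=2, s=2))   # (14, 14, 6)
```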

So, that’s it for pooling. You now know how to build convolutional layers and pooling layers. In the next video, let’s see a more complex example of a ConvNet. One that will also allow us to introduce fully connected layers.

10_cnn-example

You now know pretty much all the building blocks of building a full convolutional neural network. Let’s look at an example.


Let's say you're inputting an image which is 32 x 32 x 3, so it's an RGB image, and maybe you're trying to do handwritten digit recognition. So you have a number like 7 in a 32 x 32 RGB image, and you're trying to recognize which one of the 10 digits from zero to nine this is. Let's build a neural network to do this. What I'm going to use in this slide is actually quite similar to one of the classic neural networks called LeNet-5, which was created by Yann LeCun many years ago. What I'll show here isn't exactly LeNet-5, but it's inspired by it, and many of the parameter choices were inspired by it. So with a 32 x 32 x 3 input, let's say that the first layer uses a 5 x 5 filter and a stride of 1, and no padding. The output of this layer, if you use 6 filters, would be 28 x 28 x 6, and we're going to call this layer Conv 1. So you apply 6 filters, add a bias, apply the non-linearity, maybe a ReLU non-linearity, and that's the Conv 1 output. Next, let's apply a pooling layer. I'm going to apply max pooling here with a 2 x 2 filter and a stride of 2, so f = 2, s = 2; when I don't write a padding, it means the padding is zero. This should reduce the height and width of the representation by a factor of 2, so 28 x 28 now becomes 14 x 14, and the number of channels remains the same, so 14 x 14 x 6, and we're going to call this the Pool 1 output. Now, it turns out that in the ConvNet literature there are two slightly inconsistent conventions about what you call a layer. One convention is that this is called one layer, so this would be Layer 1 of the neural network; another convention is to count the conv layer as a layer and the pooling layer as a separate layer. When people report the number of layers in a neural network, usually they just count the layers that have weights, that have parameters. And because the pooling layer has no weights, no parameters, only a few hyperparameters, I'm going to use the convention that Conv 1 and Pool 1, taken together, are treated as Layer 1, although sometimes, if you read articles online or research papers, you'll hear about the conv layer and the pooling layer as if they are two separate layers. These are just two slightly inconsistent terminologies, but when I count layers, I'm just going to count layers that have weights, so I treat both of these together as Layer 1. And in the names Conv 1 and Pool 1 used here, the 1 at the end also refers to the fact that I view both of these as part of Layer 1 of the neural network, and Pool 1 is grouped into Layer 1 because it doesn't have its own weights.


Next, given a 14 x 14 x 6 volume, let's apply another convolutional layer to it. Let's use a filter size that's 5 x 5, a stride of 1, and let's use 16 filters this time. So now you end up with a 10 x 10 x 16 volume, which I'll call Conv 2. Then in this network let's do max pooling with f = 2, s = 2 again. You can probably guess the output of this: f = 2, s = 2 should reduce the height and width by a factor of 2, so you're left with 5 x 5 x 16, the same number of channels as before. I'm going to call this Pool 2, and in our convention this is Layer 2 of the neural network, because it has one set of weights, in the Conv 2 layer.


Now, 5 x 5 x 16 is equal to 400. So let's flatten our Pool 2 output into a 400 x 1 dimensional vector. Think of this as unrolling it into a set of 400 neurons, like so. And what we're going to do is then take these 400 units and build the next layer as having 120 units. So this is actually our first fully connected layer; I'm going to call it FC3, because we have 400 units densely connected to 120 units. This fully connected layer is just like the single neural network layer that you saw in Courses 1 and 2: it's a standard neural network layer where you have a weight matrix W[3] of dimension 120 x 400. It's called fully connected because each of the 400 units here is connected to each of the 120 units here, and you also have a bias parameter, which is just 120 dimensional, since there are 120 outputs. And then lastly, let's take the 120 units and add another layer, this time smaller; let's say we have 84 units here, and I'm going to call this fully connected Layer 4, FC4. Finally, we now have 84 real numbers that you can feed to a softmax unit. And if you're trying to do handwritten digit recognition, to recognize whether this digit is 0, 1, 2, and so on up to 9, then this would be a softmax with 10 outputs. So this is a reasonably typical example of what a convolutional neural network might look like. And I know this seems like a lot of hyperparameters; we'll give you some more specific suggestions later for how to choose them. Maybe one common guideline is to actually not try to invent your own settings of hyperparameters, but to look in the literature to see what hyperparameters have worked for others, and to just choose an architecture that has worked well for someone else; there's a chance it will work for your application as well. We'll see more about that next week. For now, I'll just point out that as you go deeper in the neural network, usually nH and nW, the height and width, will decrease. As pointed out earlier, it goes from 32 x 32, to 28 x 28, to 14 x 14, to 10 x 10, to 5 x 5. So as you go deeper, usually the height and width will decrease, whereas the number of channels will increase: it's gone from 3 to 6 to 16, and then you have your fully connected layers at the end. You'll also often see one or more conv layers followed by a pooling layer, then one or more conv layers followed by a pooling layer, then a few fully connected layers, and finally a softmax.
And this is another pretty common pattern you see in neural networks.
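For concreteness, here is one way to write down the network above. This uses the tf.keras layer API purely as a sketch of my own (the lecture doesn't tie the example to any particular framework, and the activation choices and layer names here are my assumptions):

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(6, 5, strides=1, padding="valid", activation="relu",
                           input_shape=(32, 32, 3), name="conv1"),      # 28 x 28 x 6
    tf.keras.layers.MaxPooling2D(pool_size=2, strides=2, name="pool1"), # 14 x 14 x 6
    tf.keras.layers.Conv2D(16, 5, strides=1, padding="valid", activation="relu",
                           name="conv2"),                               # 10 x 10 x 16
    tf.keras.layers.MaxPooling2D(pool_size=2, strides=2, name="pool2"), # 5 x 5 x 16
    tf.keras.layers.Flatten(),                                          # 400 units
    tf.keras.layers.Dense(120, activation="relu", name="fc3"),
    tf.keras.layers.Dense(84, activation="relu", name="fc4"),
    tf.keras.layers.Dense(10, activation="softmax"),                    # 10 digit classes
])
model.summary()   # prints the activation shapes and parameter counts per layer
```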


So let's just go through, for this neural network, some more details of the activation shape, the activation size, and the number of parameters at each layer. The input was 32 x 32 x 3, and if you multiply out those numbers you get 3,072, so the activation a[0] has size 3,072 (it's really 32 x 32 x 3). There are no parameters at the input layer. As you look at the different layers, feel free to work out the activation shapes and activation sizes yourself. Just to point out a few things: first, notice that the max pooling layers don't have any parameters. Second, notice that the conv layers tend to have relatively few parameters, as we discussed in earlier videos; in fact, a lot of the parameters tend to be in the fully connected layers of the neural network. And you'll notice also that the activation size tends to go down gradually as you go deeper in the neural network; if it drops too quickly, that's usually not great for performance either. So it starts at 3,072, rises to 4,704 after Conv 1, falls to 1,600 after Conv 2, then to 400, 120, and 84, until finally you have your softmax output. You'll find that a lot of ConvNets have patterns similar to these.
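Here is a small script (mine, not the lecture's table) that tabulates those quantities for the network above, using (f * f * nc_prev + 1) * nc parameters for a conv layer and (n_in + 1) * n_out for a fully connected layer:

```python
layers = [
    ("Input",   (32, 32, 3),  0),
    ("CONV1",   (28, 28, 6),  (5 * 5 * 3 + 1) * 6),     # f=5, s=1, 6 filters
    ("POOL1",   (14, 14, 6),  0),
    ("CONV2",   (10, 10, 16), (5 * 5 * 6 + 1) * 16),    # f=5, s=1, 16 filters
    ("POOL2",   (5, 5, 16),   0),
    ("FC3",     (120,),       (400 + 1) * 120),
    ("FC4",     (84,),        (120 + 1) * 84),
    ("Softmax", (10,),        (84 + 1) * 10),
]
for name, shape, n_params in layers:
    size = 1
    for d in shape:
        size *= d                 # activation size = product of the shape dimensions
    print(f"{name:8s} shape {str(shape):14s} size {size:5d}  params {n_params}")
```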

So you've now seen the basic building blocks of convolutional neural networks: the conv layer, the pooling layer, and the fully connected layer. A lot of computer vision research has gone into figuring out how to put together these basic building blocks to build effective neural networks, and putting these things together actually requires quite a bit of insight. I think that one of the best ways for you to gain intuition about how to put these things together is to see a number of concrete examples of how others have done it. So what I want to do next week is show you a few concrete examples, beyond this first one that you just saw, of how people have successfully put these things together to build very effective neural networks. Through those videos next week, I hope you'll develop your own intuitions about how these things are built, and we'll look at concrete examples of architectures that maybe you can use, exactly as developed by someone else, for your own application. So we'll do that next week. But before wrapping up this week's videos, just one last thing: in the next video I'll talk a little bit about why you might want to use convolutions, some of the benefits and advantages of using convolutions, as well as how to put it all together and how to take a neural network like the one you just saw and actually train it on a training set to perform image recognition or some other task. So with that, let's go on to the last video of this week.

11_why-convolutions

For this final video of this week, let's talk a bit about why convolutions are so useful when you include them in your neural networks. And then finally, let's briefly talk about how to put this all together and how you could train a convolutional neural network when you have a labeled training set.


I think there are two main advantages of convolutional layers over just using fully connected layers, and the advantages are parameter sharing and sparsity of connections. Let me illustrate with an example. Let's say you have a 32 by 32 by 3 dimensional image; this actually comes from the example in the previous video. Let's say you use a five by five filter with six filters, and so this gives you a 28 by 28 by 6 dimensional output. Now, 32 by 32 by 3 is 3,072, and 28 by 28 by 6, if you multiply all those numbers, is 4,704. So if you were to create a neural network with 3,072 units in one layer and 4,704 units in the next layer, and if you were to connect every one of these neurons, then the number of parameters in the weight matrix would be 3,072 times 4,704, which is about 14 million. So that's just a lot of parameters to train. And today you can train neural networks with even more parameters than 14 million, but considering that this is just a pretty small image, this is a lot of parameters to train. And of course, if this were a 1,000 by 1,000 image, then this weight matrix would become infeasibly large. But if you look at the number of parameters in the convolutional layer, each filter is five by five by three, so each filter has 75 parameters, plus a bias parameter, that's 76 parameters per filter, and you have six filters, so the total number of parameters is 76 times 6, which is 456 parameters. And so the number of parameters in this conv layer remains quite small.
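The same comparison in code, as a quick check of my own:

```python
fc_params = (32 * 32 * 3) * (28 * 28 * 6)   # fully connecting the two volumes
conv_params = (5 * 5 * 3 + 1) * 6           # six 5x5x3 filters, each with one bias
print(f"{fc_params:,} vs {conv_params}")    # 14,450,688 vs 456
```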


And the reason that a ConvNet gets away with so few parameters is really two reasons. One is parameter sharing. Parameter sharing is motivated by the observation that a feature detector, such as a vertical edge detector, that's useful in one part of the image is probably useful in another part of the image. What that means is that if you've figured out, say, a three by three filter for detecting vertical edges, you can then apply the same three by three filter over here, and at the next position over, and the next position over, and so on. So each of these feature detectors, each of these outputs, can use the same parameters in lots of different positions of your input image in order to detect, say, a vertical edge or some other feature. And I think this is true for low-level features like edges, as well as for higher-level features, like maybe detecting the eye that indicates a face or a cat. But sharing, in this case, the same nine parameters to compute all 16 of these outputs is one of the ways the number of parameters is reduced. It also just seems intuitive that a feature detector like a vertical edge detector computed for the upper left-hand corner of the image seems like it will probably be useful, has a good chance of being useful, for the lower right-hand corner of the image. So maybe you don't need to learn separate feature detectors for the upper left and the lower right-hand corners of the image. And maybe you do have a dataset where the upper left-hand corner and lower right-hand corner have different distributions, so they may look a little bit different, but they might be similar enough that sharing feature detectors all across the image works just fine. The second way that ConvNets get away with having relatively few parameters is by having sparse connections. Here's what I mean: if you look at this zero, it is computed via a three by three convolution, and so it depends only on this three by three grid of input cells. It is as if this output unit on the right is connected only to nine out of these six by six, 36, input features; in particular, the rest of these pixel values do not have any effect on that output. So that's what I mean by sparsity of connections. As another example, this output depends only on these nine input features, so it's as if only those nine input features are connected to this output, and the other pixels just don't affect this output at all. And so, through these two mechanisms, a neural network has a lot fewer parameters, which allows it to be trained with smaller training sets, and it is less prone to overfitting. And sometimes you also hear about convolutional neural networks being very good at capturing translation invariance. That's the observation that a picture of a cat shifted a couple of pixels to the right is still pretty clearly a cat. And the convolutional structure helps the neural network encode the fact that an image shifted by a few pixels should result in pretty similar features and should probably be assigned the same output label. The fact that you are applying the same filter across all the positions of the image, both in the early layers and in the later layers, helps a neural network automatically learn to be more robust, or to better capture, the desirable property of translation invariance.
So these are maybe a couple of the reasons why convolutions, and convolutional neural networks, work so well in computer vision.


Finally, let's put it all together and see how you can train one of these networks. Let's say you want to build a cat detector and you have a labeled training set as follows, where X is an image and the labels y can be binary, or one of K classes. And let's say you've chosen a convolutional neural network structure, maybe taking the image as input, then having convolutional and pooling layers, and then some fully connected layers, followed by a softmax output that produces y hat. The conv layers and the fully connected layers will have various parameters W, as well as biases b. So any setting of the parameters lets you define a cost function similar to what we have seen in the previous courses: having randomly initialized the parameters W and b, you can compute the cost J = (1/m) * sum over i of L(y hat(i), y(i)), the average of the losses of the neural network's predictions over your entire training set of m examples. So to train this neural network, all you need to do is use gradient descent, or some other algorithm like gradient descent with momentum, or RMSprop, or Adam, in order to optimize all the parameters of the neural network to try to reduce the cost function J. And you'll find that if you do this, you can build a very effective cat detector or some other detector.
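If you built the model with the tf.keras sketch from the earlier section, the whole training recipe collapses to a couple of lines (again just a sketch; `X_train` and `y_train` are hypothetical arrays of images and integer labels):

```python
model.compile(optimizer="adam",                        # or SGD with momentum, RMSprop, ...
              loss="sparse_categorical_crossentropy",  # J = (1/m) * sum of per-example losses
              metrics=["accuracy"])
model.fit(X_train, y_train, epochs=10, batch_size=64)  # gradient descent on all W and b
```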

So, congratulations on finishing this week's videos. You've now seen all the basic building blocks of a convolutional neural network and how to put them together into an effective image recognition system. In this week's programming exercises, all of these things will become more concrete, and you'll get a chance to practice implementing them yourself and seeing them work. Next week, we'll continue to go deeper into convolutional neural networks. I mentioned earlier that there are a lot of hyperparameters in convolutional neural networks. So what I want to do next week is show you a few concrete examples of some of the most effective convolutional neural networks, so you can start to recognize the patterns of what types of network architectures are effective. One thing that people often do is just take an architecture that someone else has found and published in a research paper and use that for their application, and by seeing some more concrete examples next week, you'll also learn how to do that better. Beyond that, next week we'll also develop intuitions about what makes ConvNets work well, and then in the rest of the course we'll see a variety of other computer vision applications such as object detection and neural style transfer, which creates new forms of artwork using these algorithms. So that's it for this week; best of luck with the homework, and I look forward to seeing you next week.
