04_special-applications-face-recognition-neural-style-transfer

Note

This is my personal note after studying the course of the 4th week convolutional neural networks and the copyright belongs to deeplearning.ai.

01_face-recognition

01_what-is-face-recognition

Hi, and welcome to this fourth and final week of this course on convolutional neural networks. By now, you’ve learned a lot about confidence. What I want to do this week is show you a couple important special applications of confidence. We’ll start the face recognition, and then go on later this week to neurosal transfer, which you get to implement in the problem exercise as well to create your own artwork.


But first, let’s start the face recognition and just for fun, I want to show you a demo. When I was leading by those AI group, one of the teams I worked with led by Yuanqing Lin had built a face recognition system that I thought is really cool. Let’s take a look. So, I’m going to play this video here, but I can also get whoever is editing this raw video configure out to this better to splice in the raw video or take the one I’m playing here. I want to show you a face recognition demo. I’m in Baidu’s headquarters in China. Most companies require that to get inside, you swipe an ID card like this one but here we don’t need that. Using face recognition, check what I can do. When I walk up, it recognizes my face, it says, “Welcome Andrew,” and I just walk right through without ever having to use my ID card. Let me show you something else. I’m actually here with Lin Yuanqing, the director of IDL which developed all of this face recognition technology. I’m gonna hand him my ID card, which has my face printed on it, and he’s going to use it to try to sneak in using my picture instead of a live human. I’m gonna use Andrew’s card and try to sneak in and see what happens. So the system is not recognizing it, it refuses to recognize. Okay. Now, I’m going to use my own face. So face recognition technology like this is taking off very rapidly in China and I hope that this type of technology soon makes it way to other countries.. So, pretty cool, right? The video you just saw demoed both face recognition as well as liveness detection. The latter meaning making sure that you are a live human. It turns out liveness detection can be implemented using supervised learning as well to predict live human versus not live human but I want to spend less time on that. Instead, I want to focus our time on talking about how to build the face recognition portion of the system. First, let’s start by going over some of the terminology used in face recognition. In the face recognition literature, people often talk about face verification and face recognition. This is the face verification problem which is if you’re given an input image as well as a name or ID of a person and the job of the system is to verify whether or not the input image is that of the claimed person. So, sometimes this is also called a one to one problem where you just want to know if the person is the person they claim to be. So, the recognition problem is much harder than the verification problem. To see why, let’s say, you have a verification system that’s 99 percent accurate. So, 99 percent might not be too bad but now suppose that K is equal to 100 in a recognition system. If you apply this system to a recognition task with a 100 people in your database, you now have a hundred times of chance of making a mistake and if the chance of making mistakes on each person is just one percent. So, if you have a database of a 100 persons and if you want an acceptable recognition error, you might actually need a verification system with maybe 99.9 or even higher accuracy before you can run it on a database of 100 persons that have a high chance and still have a high chance of getting incorrect. In fact, if you have a database of 100 persons currently just be even quite a bit higher than 99 percent for that to work well. But what we do in the next few videos is focus on building a face verification system as a building block and then if the accuracy is high enough, then you probably use that in a recognition system as well.

So in the next video, we’ll start describing how you can build a face verification system. It turns out one of the reasons that is a difficult problem is you need to solve a one shot learning problem. Let’s see in the next video what that means.

02_one-shot-learning

One of the challenges of face recognition is that you need to solve the one-shot learning problem. What that means is that for most face recognition applications you need to be able to recognize a person given just one single image, or given just one example of that person’s face. And, historically, deep learning algorithms don’t work well if you have only one training example. Let’s see an example of what this means and talk about how to address this problem.


Let’s say you have a database of four pictures of employees in you’re organization. These are actually some of my colleagues at Deeplearning.AI, Khan, Danielle, Younes and Thian. Now let’s say someone shows up at the office and they want to be let through the turnstile. What the system has to do is, despite ever having seen only one image of Danielle, to recognize that this is actually the same person. And, in contrast, if it sees someone that’s not in this database, then it should recognize that this is not any of the four persons in the database. So in the one shot learning problem, you have to learn from just one example to recognize the person again. And you need this for most face recognition systems because you might have only one picture of each of your employees or of your team members in your employee database. So one approach you could try is to input the image of the person, feed it too a ConvNet. And have it output a label, y, using a softmax unit with four outputs or maybe five outputs corresponding to each of these four persons or none of the above. So that would be 5 outputs in the softmax. But this really doesn’t work well. Because if you have such a small training set it is really not enough to train a robust neural network for this task. And also what if a new person joins your team? So now you have 5 persons you need to recognize, so there should now be six outputs. Do you have to retrain the ConvNet every time? That just doesn’t seem like a good approach. So to carry out face recognition, to carry out one-shot learning.


So instead, to make this work, what you’re going to do instead is learn a similarity function. In particular, you want a neural network to learn a function which going to denote d, which inputs two images and outputs the degree of difference between the two images. So if the two images are of the same person, you want this to output a small number. And if the two images are of two very different people you want it to output a large number. So during recognition time, if the degree of difference between them is less than some threshold called tau, which is a hyperparameter. Then you would predict that these two pictures are the same person. And if it is greater than tau, you would predict that these are different persons. And so this is how you address the face verification problem. To use this for a recognition task, what you do is, given this new picture, you will use this function d to compare these two images. And maybe I’ll output a very large number, let’s say 10, for this example. And then you compare this with the second image in your database. And because these two are the same person, hopefully you output a very small number. You do this for the other images in your database and so on. And based on this, you would figure out that this is actually that person, which is Danielle. And in contrast, if someone not in your database shows up, as you use the function d to make all of these pairwise comparisons, hopefully d will output have a very large number for all four pairwise comparisons. And then you say that this is not any one of the four persons in the database. Notice how this allows you to solve the one-shot learning problem. So long as you can learn this function d, which inputs a pair of images and tells you, basically, if they’re the same person or different persons. Then if you have someone new join your team, you can add a fifth person to your database, and it just works fine.

So you’ve seen how learning this function d, which inputs two images, allows you to address the one-shot learning problem. In the next video, let’s take a look at how you can actually train the neural network to learn dysfunction d.

03_siamese-network

The job of the function d, which you learned about in the last video, is to input two faces and tell you how similar or how different they are. A good way to do this is to use a Siamese network. Let’s take a look.


You’re used to seeing pictures of confidence like these where you input an image, let’s say x1. And through a sequence of convolutional and pulling and fully connected layers, end up with a feature vector like that. And sometimes this is fed to a softmax unit to make a classification. We’re not going to use that in this video. Instead, we’re going to focus on this vector of let’s say 128 numbers computed by some fully connected layer that is deeper in the network. And I’m going to give this list of 128 numbers a name. I’m going to call this f of x1, and you should think of f of x1 as an encoding of the input image x1. So it’s taken the input image, here this picture of Kian, and is re-representing it as a vector of 128 numbers. The way you can build a face recognition system is then that if you want to compare two pictures, let’s say this first picture with this second picture here. What you can do is feed this second picture to the same neural network with the same parameters and get a different vector of 128 numbers, which encodes this second picture. So I’m going to call this second picture. So I’m going to call this encoding of this second picture f of x2, and here I’m using x1 and x2 just to denote two input images. They don’t necessarily have to be the first and second examples in your training sets. It can be any two pictures. Finally, if you believe that these encodings are a good representation of these two images, what you can do is then define the image d of distance between x1 and x2 as the norm of the difference between the encodings of these two images. So this idea of running two identical, convolutional neural networks on two different inputs and then comparing them, sometimes that’s called a Siamese neural network architecture. And a lot of the ideas I’m presenting here came from this paper due to Yaniv Taigman, Ming Yang, Marc’Aurelio Ranzato, and Lior Wolf in the research system that they developed called DeepFace. And many of the ideas I’m presenting here came from a paper due to Yaniv Taigman, Ming Yang, Marc’Aurelio Ranzato, and Lior Wolf in a system that they developed called DeepFace.


So how do you train this Siamese neural network? Remember that these two neural networks have the same parameters. So what you want to do is really train the neural network so that the encoding that it computes results in a function d that tells you when two pictures are of the same person. So more formally, the parameters of the neural network define an encoding f of xi. So given any input image xi, the neural network outputs this 128 dimensional encoding f of xi. So more formally, what you want to do is learn parameters so that if two pictures, xi and xj, are of the same person, then you want that distance between their encodings to be small. And in the previous slide, l was using x1 and x2, but it’s really any pair xi and xj from your training set. And in contrast, if xi and xj are of different persons, then you want that distance between their encodings to be large. So as you vary the parameters in all of these layers of the neural network, you end up with different encodings. And what you can do is use back propagation and vary all those parameters in order to make sure these conditions are satisfied.

So you’ve learned about the Siamese network architecture and have a sense of what you want the neural network to output for you in terms of what would make a good encoding. But how do you actually define an objective function to make a neural network learn to do what we just discussed here? Let’s see how you can do that in the next video using the triplet loss function.

04_triplet-loss

One way to learn the parameters of the neural network so that it gives you a good encoding for your pictures of faces is to define an applied gradient descent on the triplet loss function. Let’s see what that means.


To apply the triplet loss, you need to compare pairs of images. For example, given this picture, to learn the parameters of the neural network, you have to look at several pictures at the same time. For example, given this pair of images, you want their encodings to be similar because these are the same person. Whereas, given this pair of images, you want their encodings to be quite different because these are different persons. In the terminology of the triplet loss, what you’re going do is always look at one anchor image and then you want to distance between the anchor and the positive image, really a positive example, meaning as the same person to be similar. Whereas, you want the anchor when pairs are compared to the negative example for their distances to be much further apart. So, this is what gives rise to the term triplet loss, which is that you’ll always be looking at three images at a time. You’ll be looking at an anchor image, a positive image, as well as a negative image. And I’m going to abbreviate anchor positive and negative as A, P, and N. So to formalize this, what you want is for the parameters of your neural network of your encodings to have the following property, which is that you want the encoding between the anchor minus the encoding of the positive example, you want this to be small and in particular, you want this to be less than or equal to the distance of the squared norm between the encoding of the anchor and the encoding of the negative, where of course, this is d of A, P and this is d of A, N. And you can think of d as a distance function, which is why we named it with the alphabet d. Now, if we move to term from the right side of this equation to the left side, what you end up with is f of A minus f of P squared minus, let’s take the right-hand side now, minus F of N squared, you want this to be less than or equal to zero. But now, we’re going to make a slight change to this expression, which is one trivial way to make sure this is satisfied, is to just learn everything equals zero. If f always equals zero, then this is zero minus zero, which is zero, this is zero minus zero which is zero. And so, well, by saying f of any image equals a vector of all zeroes, you can almost trivially satisfy this equation. So, to make sure that the neural network doesn’t just output zero for all the encoding, so to make sure that it doesn’t set all the encodings equal to each other. Another way for the neural network to give a trivial output is if the encoding for every image was identical to the encoding to every other image, in which case, you again get zero minus zero. So to prevent a neural network from doing that, what we’re going to do is modify this objective to say that, this doesn’t need to be just less than or equal to zero, it needs to be quite a bit smaller than zero. So, in particular, if we say this needs to be less than negative alpha, where alpha is another hyperparameter, then this prevents a neural network from outputting the trivial solutions. And by convention, usually, we write plus alpha instead of negative alpha there. And this is also called, a margin, which is terminology that you’d be familiar with if you’ve also seen the literature on support vector machines, but don’t worry about it if you haven’t. And we can also modify this equation on top by adding this margin parameter. So to give an example, let’s say the margin is set to 0.2. If in this example, d of the anchor and the positive is equal to 0.5, then you won’t be satisfied if d between the anchor and the negative was just a little bit bigger, say 0.51. Even though 0.51 is bigger than 0.5, you’re saying, that’s not good enough, we want a dfA, N to be much bigger than dfA, P and in particular, you want this to be at least 0.7 or higher. Alternatively, to achieve this margin or this gap of at least 0.2, you could either push this up or push this down so that there is at least this gap of this alpha, hyperparameter alpha 0.2 between the distance between the anchor and the positive versus the anchor and the negative. So that’s what having a margin parameter here does, which is it pushes the anchor positive pair and the anchor negative pair further away from each other. So, let’s take this equation we have here at the bottom, and on the next slide, formalize it, and define the triplet loss function. So, the triplet loss function is defined on triples of images.


So, given three images, A, P, and N, the anchor positive and negative examples. So the positive examples is of the same person as the anchor, but the negative is of a different person than the anchor. We’re going to define the loss as follows. The loss on this example, which is really defined on a triplet of images is, let me first copy over what we had on the previous slide. So, that was fA minus fP squared minus fA minus fN squared, and then plus alpha, the margin parameter. And what you want is for this to be less than or equal to zero. So, to define the loss function, let’s take the max between this and zero. So, the effect of taking the max here is that, so long as this is less than zero, then the loss is zero, because the max is something less than equal to zero, when zero is going to be zero. So, so long as you achieve the goal of making this thing I’ve underlined in green, so long as you’ve achieved the objective of making that less than or equal to zero, then the loss on this example is equals to zero. But if on the other hand, if this is greater than zero, then if you take the max, the max we end up selecting, this thing I’ve underlined in green, and so you would have a positive loss. So by trying to minimize this, this has the effect of trying to send this thing to be zero, less than or equal to zero. And then, so long as there’s zero or less than or equal to zero, the neural network doesn’t care how much further negative it is. So, this is how you define the loss on a single triplet and the overall cost function for your neural network can be sum over a training set of these individual losses on different triplets. So, if you have a training set of say 10,000 pictures with 1,000 different persons, what you’d have to do is take your 10,000 pictures and use it to generate, to select triplets like this and then train your learning algorithm using gradient descent on this type of cost function, which is really defined on triplets of images drawn from your training set. Notice that in order to define this dataset of triplets, you do need some pairs of A and P. Pairs of pictures of the same person. So the purpose of training your system, you do need a dataset where you have multiple pictures of the same person. That’s why in this example, I said if you have 10,000 pictures of 1,000 different person, so maybe have 10 pictures on average of each of your 1,000 persons to make up your entire dataset. If you had just one picture of each person, then you can’t actually train this system. But of course after training, if you’re applying this, but of course after having trained the system, you can then apply it to your one shot learning problem where for your face recognition system, maybe you have only a single picture of someone you might be trying to recognize. But for your training set, you do need to make sure you have multiple images of the same person at least for some people in your training set so that you can have pairs of anchor and positive images.


Now, how do you actually choose these triplets to form your training set? One of the problems if you choose A, P, and N randomly from your training set subject to A and P being from the same person, and A and N being different persons, one of the problems is that if you choose them so that they’re at random, then this constraint is very easy to satisfy. Because given two randomly chosen pictures of people, chances are A and N are much different than A and P. I hope you still recognize this notation, this d(A, P) was what we had written on the last few slides as this encoding. So this is just equal to this squared known distance between the encodings that we have on the previous slide. But if A and N are two randomly chosen different persons, then there is a very high chance that this will be much bigger more than the margin alpha that that term on the left. And so, the neural network won’t learn much from it. So to construct a training set, what you want to do is to choose triplets A, P, and N that are hard to train on. So in particular, what you want is for all triplets that this constraint be satisfied. So, a triplet that is hard will be if you choose values for A, P, and N so that maybe d(A, P) is actually quite close to d(A,N). So in that case, the learning algorithm has to try extra hard to take this thing on the right and try to push it up or take this thing on the left and try to push it down so that there is at least a margin of alpha between the left side and the right side. And the effect of choosing these triplets is that it increases the computational efficiency of your learning algorithm. If you choose your triplets randomly, then too many triplets would be really easy, and so, gradient descent won’t do anything because your neural network will just get them right, pretty much all the time. And it’s only by using hard triplets that the gradient descent procedure has to do some work to try to push these quantities further away from those quantities. And if you’re interested, the details are presented in this paper by Florian Schroff, Dmitry Kalinichenko, and James Philbin, where they have a system called FaceNet, which is where a lot of the ideas I’m presenting in this video come from.

By the way, this is also a fun fact about how algorithms are often named in the deep learning world, which is if you work in a certain domain, then we call that blank. You often have a system called blank net or deep blank. So, we’ve been talking about face recognition. So this paper is called FaceNet, and in the last video, you just saw deep face. But this idea of a blank net or deep blank is a very popular way of naming algorithms in the deep learning world. And you should feel free to take a look at that paper if you want to learn some of these other details for speeding up your algorithm by choosing the most useful triplets to train on, it is a nice paper.


So, just to wrap up, to train on triplet loss, you need to take your training set and map it to a lot of triples. So, here is our triple with an anchor and a positive, both for the same person and the negative of a different person. Here’s another one where the anchor and positive are of the same person but the anchor and negative are of different persons and so on. And what you do having defined this training sets of anchor positive and negative triples is use gradient descent to try to minimize the cost function J we defined on an earlier slide, and that will have the effect of that propagating to all of the parameters of the neural network in order to learn an encoding so that d of two images will be small when these two images are of the same person, and they’ll be large when these are two images of different persons.

. Now, it turns out that today’s face recognition systems especially the large scale commercial face recognition systems are trained on very large datasets. Datasets north of a million images is not uncommon, some companies are using north of 10 million images and some companies have north of 100 million images with which to try to train these systems. So these are very large datasets even by modern standards, these dataset assets are not easy to acquire. Fortunately, some of these companies have trained these large networks and posted parameters online. So, rather than trying to train one of these networks from scratch, this is one domain where because of the share data volume sizes, this is one domain where often it might be useful for you to download someone else’s pre-train model, rather than do everything from scratch yourself. But even if you do download someone else’s pre-train model, I think it’s still useful to know how these algorithms were trained or in case you need to apply these ideas from scratch yourself for some application. So that’s it for the triplet loss. In the next video, I want to show you also some other variations on siamese networks and how to train these systems. Let’s go onto the next video.

05_face-verification-and-binary-classification


The Triplet Loss is one good way to learn the parameters of a continent for face recognition. There’s another way to learn these parameters. Let me show you how face recognition can also be posed as a straight binary classification problem. Another way to train a neural network, is to take this pair of neural networks to take this Siamese Network and have them both compute these embeddings, maybe 128 dimensional embeddings, maybe even higher dimensional, and then have these be input to a logistic regression unit to then just make a prediction. Where the target output will be one if both of these are the same persons, and zero if both of these are of different persons. So, this is a way to treat face recognition just as a binary classification problem. And this is an alternative to the triplet loss for training a system like this. Now, what does this final logistic regression unit actually do? The output y hat will be a sigmoid function, applied to some set of features but rather than just feeding in, these encodings, what you can do is take the differences between the encodings. So, let me show you what I mean. Let’s say, I write a sum over K equals 1 to 128 of the absolute value, taken element wise between the two different encodings. Let me just finish writing this out and then we’ll see what this means. In this notation, f of x i is the encoding of the image $x_i$ and the substitute k means to just select out the kth components of this vector. This is taking the element Y’s difference in absolute values between these two encodings. And what you might do is think of these 128 numbers as features that you then feed into logistic regression. And, you’ll find that logistic regression can add additional parameters $w_i$, and $b$ similar to a normal logistic regression unit. And you would train appropriate weighting on these 128 features in order to predict whether or not these two images are of the same person or of different persons. So, this will be one pretty useful way to learn to predict zero or one whether these are the same person or different persons.


And there are a few other variations on how you can compute this formula that I had underlined in green. For example, another formula could be this k minus f of $x_j$, k squared divided by f of x i on plus f of x j k. This is sometimes called the chi square form. This is the Greek alphabet chi. But this is sometimes called a $\chi$ square similarity. And this and other variations are explored in this deep face paper, which I referenced earlier as well. So in this learning formulation, the input is a pair of images, so this is really your training input x and the output y is either zero or one depending on whether you’re inputting a pair of similar or dissimilar images. And same as before, you’re training is Siamese Network so that means that, this neural network up here has parameters that are what they’re really tied to the parameters in this lower neural network. And this system can work pretty well as well. Lastly, just to mention, one computational trick that can help neural deployment significantly, which is that, if this is the new image, so this is an employee walking in hoping that the turnstile the doorway will open for them and that this is from your database image. Then instead of having to compute, this embedding every single time, where you can do is actually pre-compute that, so, when the new employee walks in, what you can do is use this upper components to compute that encoding and use it, then compare it to your pre-computed encoding and then use that to make a prediction y hat. Because you don’t need to store the raw images and also because if you have a very large database of employees, you don’t need to compute these encodings every single time for every employee database. This idea of free computing, some of these encodings can save a significant computation. And this type of pre-computation works both for this type of Siamese Central architecture where you treat face recognition as a binary classification problem, as well as, when you were learning encodings maybe using the Triplet Loss function as described in the last couple of videos.

And so just to wrap up, to treat face verification supervised learning, you create a training set of just pairs of images now is of triplets of pairs of images where the target label is one. When these are a pair of pictures of the same person and where the tag label is zero, when these are pictures of different persons and you use different pairs to train the neural network to train the scientists that were using back propagation.

So, this version that you just saw of treating face verification and by extension face recognition as a binary classification problem, this works quite well as well. As sort of that, I hope that you now know, whether it would take to train your own face verification or your own face recognition system one that can do one.

02_neural-style-transfer

01_what-is-neural-style-transfer

One of the most fun and exciting applications of ConvNet recently has been Neural Style Transfer. You get to implement this yourself and generate your own artwork in the problem exercise. But what is Neural Style Transfer? Let me show you a few examples.


Let’s say you take this image, this is actually taken from the Stanford University not far from my Stanford office and you want this picture recreated in the style of this image on the right. This is actually Van Gogh’s, Starry Night painting. What Neural Style Transfer allows you to do is generated new image like the one below which is a picture of the Stanford University Campus that painted but drawn in the style of the image on the right. In order to describe how you can implement this yourself, I’m going to use C to denote the content image, S to denote the style image, and G to denote the image you will generate. Here’s another example, let’s say you have this content image so let’s see this is of the Golden Gate Bridge in San Francisco and you have this style image, this is actually Pablo Picasso image. You can then combine these to generate this image G which is the Golden Gate painted in the style of that Picasso shown on the right. The examples shown on this slide were generated by Justin Johnson.

What you’ll learn in the next few videos is how you can generate these images yourself. In order to implement Neural Style Transfer, you need to look at the features extracted by ConvNet at various layers, the shallow and the deeper layers of a ConvNet. Before diving into how you can implement a Neural Style Transfer, what I want to do in the next video is try to give you better intuition about whether all these layers of a ConvNet really computing. Let’s take a look at that in the next video.

02_what-are-deep-convnets-learning

What are deep ConvNets really learning? In this video, I want to share with you some visualizations that will help you hone your intuition about what the deeper layers of a ConvNet really are doing. And this will help us think through how you can implement neural style transfer as well.


Let’s start with an example. Lets say you’ve trained a ConvNet, this is an alex net like network, and you want to visualize what the hidden units in different layers are computing. Here’s what you can do. Let’s start with a hidden unit in layer 1. And suppose you scan through your training sets and find out what are the images or what are the image patches that maximize that unit’s activation. So in other words pause your training set through your neural network, and figure out what is the image that maximizes that particular unit’s activation. Now, notice that a hidden unit in layer 1, will see only a relatively small portion of the neural network. And so if you visualize, if you plot what activated unit’s activation, it makes makes sense to plot just a small image patches, because all of the image that that particular unit sees. So if you pick one hidden unit and find the nine input images that maximizes that unit’s activation, you might find nine image patches like this.

So looks like that in the lower region of an image that this particular hidden unit sees, it’s looking for an egde or a line that looks like that. So those are the nine image patches that maximally activate one hidden unit’s activation. Now, you can then pick a different hidden unit in layer 1 and do the same thing.

So that’s a different hidden unit, and looks like this second one, represented by these 9 image patches here. Looks like this hidden unit is looking for a line sort of in that portion of its input region, we’ll also call this receptive field. And if you do this for other hidden units, you’ll find other hidden units, tend to activate in image patches that look like that.

This one seems to have a preference for a vertical light edge, but with a preference that the left side of it be green.

This one really prefers orange colors, and this is an interesting image patch. This red and green together will make a brownish or a brownish-orangish color, but the neuron is still happy to activate with that, and so on.

So this is nine different representative neurons and for each of them the nine image patches that they maximally activate on. So this gives you a sense that, units, train hidden units in layer 1, they’re often looking for relatively simple features such as edge or a particular shade of color. And all of the examples I’m using in this video come from this paper by Mathew Zeiler and Rob Fergus, titled visualizing and understanding convolutional networks. And I’m just going to use one of the simpler ways to visualize what a hidden unit in a neural network is computing. If you read their paper, they have some other more sophisticated ways of visualizing when the ConvNet is running as well.


But now you have repeated this procedure several times for nine hidden units in layer 1. What if you do this for some of the hidden units in the deeper layers of the neuron network. And what does the neural network then learning at a deeper layers. So in the deeper layers, a hidden unit will see a larger region of the image. Where at the extreme end each pixel could hypothetically affect the output of these later layers of the neural network. So later units are actually seen larger image patches, I’m still going to plot the image patches as the same size on these slides. But if we repeat this procedure, this is what you had previously for layer 1, and this is a visualization of what maximally activates nine different hidden units in layer 2. So I want to be clear about what this visualization is. These are the nine patches that cause one hidden unit to be highly activated. And then each grouping, this is a different set of nine image patches that cause one hidden unit to be activated. So this visualization shows nine hidden units in layer 2, and for each of them shows nine image patches that causes that hidden unit to have a very large output, a very large activation. And you can repeat these for deeper layers as well.

Now, on this slide, I know it’s kind of hard to see these tiny little image patches, so let me zoom in for some of them. For layer 1, this is what you saw. So for example, this is that first unit we saw which was highly activated, if in the region of the input image, you can see there’s an edge maybe at that angle.

Now let’s zoom in for layer 2 as well, to that visualization. So this is interesting, layer 2 looks it’s detecting more complex shapes and patterns. So for example, this hidden unit looks like it’s looking for a vertical texture with lots of vertical lines. This hidden unit looks like its highly activated when there’s a rounder shape to the left part of the image. Here’s one that is looking for very thin vertical lines and so on. And so the features the second layer is detecting are getting more complicated.

How about layer 3? Let’s zoom into that, in fact let me zoom in even bigger, so you can see this better, these are the things that maximally activate layer 3. But let’s zoom in even bigger, and so this is pretty interesting again. It looks like there is a hidden unit that seems to respond highly to a rounder shape in the lower left hand portion of the image, maybe. So that ends up detecting a lot of cars, dogs and wonders is even starting to detect people. And this one look like it is detecting certain textures like honeycomb shapes, or square shapes, this irregular texture. And some of these it’s difficult to look at and manually figure out what is it detecting, but it is clearly starting to detect more complex patterns.

How about the next layer? Well, here is layer 4, and you’ll see that the features or the patterns is detecting or even more complex. It looks like this has learned almost a dog detector, but all these dogs likewise similar, right? Is this, I don’t know what dog species or dog breed this is. But now all those are dogs, but they look relatively similar as dogs go. Looks like this hidden unit and therefore it is detecting water. This looks like it is actually detecting the legs of a bird and so on.

And then layer 5 is detecting even more sophisticated things. So you’ll notice there’s also a neuron that seems to be a dog detector, but set of dogs detecting here seems to be more varied. And then this seems to be detecting keyboards and things with a keyboard like texture, although maybe lots of dots against background. I think this neuron here may be detecting text, it’s always hard to be sure. And then this one here is detecting flowers. So we’ve gone a long way from detecting relatively simple things such as edges in layer 1 to textures in layer 2, up to detecting very complex objects in the deeper layers.

So I hope this gives you some better intuition about what the shallow and deeper layers of a neural network are computing. Next, let’s use this intuition to start building a neural-style transfer algorithm.

03_cost-function

To build a Neural Style Transfer system, let’s define a cost function for the generated image. What you see later is that by minimizing this cost function, you can generate the image that you want.


Remember what the problem formulation is. You’re given a content image C, given a style image S and you goal is to generate a new image G. In order to implement neural style transfer, what you’re going to do is define a cost function J of G that measures how good is a particular generated image and we’ll use gradient to descent to minimize J of G in order to generate this image. How good is a particular image? Well, we’re going to define two parts to this cost function. The first part is called the content cost. This is a function of the content image and of the generated image and what it does is it measures how similar is the contents of the generated image to the content of the content image C. And then going to add that to a style cost function which is now a function of S,G and what this does is it measures how similar is the style of the image G to the style of the image S. Finally, we’ll weight these with two hyper parameters alpha and beta to specify the relative weighting between the content costs and the style cost. It seems redundant to use two different hyper parameters to specify the relative cost of the weighting. One hyper parameter seems like it would be enough but the original authors of the Neural Style Transfer Algorithm, use two different hyper parameters. I’m just going to follow their convention here.

The Neural Style Transfer Algorithm I’m going to present in the next few videos is due to Leon Gatys, Alexander Ecker and Matthias. Their papers is not too hard to read so after watching these few videos if you wish, I certainly encourage you to take a look at their paper as well if you want.


The way the algorithm would run is as follows, having to find the cost function J of G in order to actually generate a new image what you do is the following. You would initialize the generated image G randomly so it might be 100 by 100 by 3 or 500 by 500 by 3 or whatever dimension you want it to be. Then we’ll define the cost function J of G on the previous slide. What you can do is use gradient descent to minimize this so you can update G as G minus the derivative respect to the cost function of J of G. In this process, you’re actually updating the pixel values of this image G which is a 100 by 100 by 3 maybe rgb channel image. Here’s an example, let’s say you start with this content image and this style image. This is a another probably Picasso image. Then when you initialize G randomly, you’re initial randomly generated image is just this white noise image with each pixel value chosen at random. As you run gradient descent, you minimize the cost function J of G slowly through the pixel value so then you get slowly an image that looks more and more like your content image rendered in the style of your style image.

In this video, you saw the overall outline of the Neural Style Transfer Algorithm where you define a cost function for the generated image G and minimize it. Next, we need to see how to define the content cost function as well as the style cost function. Let’s take a look at that starting in the next video.

04_content-cost-function

The cost function of the neural style transfer algorithm had a content cost component and a style cost component. Let’s start by defining the content cost component.


Remember that this is the overall cost function of the neural style transfer algorithm. So, let’s figure out what should the content cost function be. Let’s say that you use hidden layer l to compute the content cost. If l is a very small number, if you use hidden layer one, then it will really force your generated image to pixel values very similar to your content image. Whereas, if you use a very deep layer, then it’s just asking, “Well, if there is a dog in your content image, then make sure there is a dog somewhere in your generated image. “ So in practice, layer l chosen somewhere in between. It’s neither too shallow nor too deep in the neural network. And because you program this yourself, in the problem exercise that you did at the end of this week, I’ll leave you to gain some intuitions with the concrete examples in the problem exercise as well. But usually, I was chosen to be somewhere in the middle of the layers of the neural network, neither too shallow nor too deep. What you can do is then use a pre-trained ConvNet, maybe a VGG network, or could be some other neural network as well. And now, you want to measure, given a content image and given a generated image, how similar are they in content. So let’s let this a_superscript_l and this be the activations of layer l on these two images, on the images C and G. So, if these two activations are similar, then that would seem to imply that both images have similar content. So, what we’ll do is define J_content(C,G) as just how soon or how different are these two activations. So, we’ll take the element-wise difference between these hidden unit activations in layer l, between when you pass in the content image compared to when you pass in the generated image, and take that squared. And you could have a normalization constant in front or not, so it’s just one of the two or something else. It doesn’t really matter since this can be adjusted as well by this hyperparameter alpha. So, just be clear on using this notation as if both of these have been unrolled into vectors, so then, this becomes the square root of the l_2 norm between this and this, after you’ve unrolled them both into vectors. There’s really just the element-wise sum of squared differences between these two activation. But it’s really just the element-wise sum of squares of differences between the activations in layer l, between the images in C and G. And so, when later you perform gradient descent on J_of_G to try to find a value of G, so that the overall cost is low, this will incentivize the algorithm to find an image G, so that these hidden layer activations are similar to what you got for the content image.

So, that’s how you define the content cost function for the neural style transfer. Next, let’s move on to the style cost function.

05_style-cost-function

In the last video, you saw how to define the content cost function for the neural style transfer. Next, let’s take a look at the style cost function.


So, what is the style of an image mean? Let’s say you have an input image like this, they used to seeing a convnet like that, compute features that there’s different layers. And let’s say you’ve chosen some layer L, maybe that layer to define the measure of the style of an image. What we need to do is define the style as the correlation between activations across different channels in this layer L activation. So here’s what I mean by that. Let’s say you take that layer L activation. So this is going to be nh by nw by nc block of activations, and we’re going to ask how correlated are the activations across different channels. So to explain what I mean by this may be slightly cryptic phrase, let’s take this block of activations and let me shade the different channels by a different colors. So in this below example, we have say five channels and which is why I have five shades of color here. In practice, of course, in neural network we usually have a lot more channels than five, but using just five makes it drawing easier. But to capture the style of an image, what you’re going to do is the following. Let’s look at the first two channels. Let’s see for the red channel and the yellow channel and say how correlated are activations in these first two channels. So, for example, in the lower right hand corner, you have some activation in the first channel and some activation in the second channel. So that gives you a pair of numbers. And what you do is look at different positions across this block of activations and just look at those two pairs of numbers, one in the first channel, the red channel, one in the yellow channel, the second channel. And you just look at these two pairs of numbers and see when you look across all of these positions, all of these nh by nw positions, how correlated are these two numbers. So, why does this capture style?


Let’s look another example. Here’s one of the visualizations from the earlier video. This comes from again the paper by Matthew Zeiler and Rob Fergus that I have reference earlier. And let’s say for the sake of arguments, that the red neuron corresponds to, and let’s say for the sake of arguments, that the red channel corresponds to this neurons (at the second grid cell which is circled in red color), so we’re trying to figure out if there’s this little vertical texture in a particular position in the nh and let’s say that this second channel, this yellow second channel corresponds to this neuron (at the 4th grid cell which is circled in yellow color), which is vaguely looking for orange colored patches. What does it mean for these two channels to be highly correlated? Well, if they’re highly correlated what that means is whatever part of the image has this type of subtle vertical texture, that part of the image will probably have these orange-ish tint. And what does it mean for them to be uncorrelated? Well, it means that whenever there is this vertical texture, it’s probably won’t have that orange-ish tint. And so the correlation tells you which of these high level texture components tend to occur or not occur together in part of an image and that’s the degree of correlation that gives you one way of measuring how often these different high level features, such as vertical texture or this orange tint or other things as well, how often they occur and how often they occur together and don’t occur together in different parts of an image. And so, if we use the degree of correlation between channels as a measure of the style, then what you can do is measure the degree to which in your generated image, this first channel is correlated or uncorrelated with the second channel and that will tell you in the generated image how often this type of vertical texture occurs or doesn’t occur with this orange-ish tint and this gives you a measure of how similar is the style of the generated image to the style of the input style image.


So let’s now formalize this intuition. So what you can to do is given an image computes something called a style matrix, which will measure all those correlations we talks about on the last slide. So, more formally, let’s let a superscript l, subscript i, j,k denote the activation at position i,j,k in hidden layer l. So i indexes into the height, j indexes into the width, and k indexes across the different channels. So, in the previous slide, we had five channels that k will index across those five channels. So what the style matrix will do is you’re going to compute a matrix clauses G superscript square bracketed l. This is going to be an nc by nc dimensional matrix, so it’d be a square matrix. Remember you have nc channels and so you have an nc by nc dimensional matrix in order to measure how correlated each pair of them is. So particular G, l, k, k prime will measure how correlated are the activations in channel k compared to the activations in channel k prime. Well here, k and k prime will range from 1 through nc, the number of channels they’re all up in that layer. So more formally, the way you compute G, l and I’m just going to write down the formula for computing one elements. So the k, k prime elements of this. This is going to be sum of a i, sum of a j, of deactivation and that layer i, j, k times the activation at i, j, k prime. So, here, remember i and j index across to a different positions in the block, indexes over the height and width. So i is the sum from one to nh and j is a sum from one to nw and k here and k prime index over the channel so k and k prime range from one to the total number of channels in that layer of the neural network. So all this is doing is summing over the different positions that the image over the height and width and just multiplying the activations together of the channels k and k prime and that’s the definition of G,k,k prime. And you do this for every value of k and k prime to compute this matrix G, also called the style matrix. And so notice that if both of these activations tend to be large together, then G, k, k prime will be large, whereas if they are uncorrelated then g,k, k prime might be small. And technically, I’ve been using the term correlation to convey intuition but this is actually the unnormalized cross-variance of the areas because we’re not subtracting out the mean and this is just multiplied by these elements directly. So this is how you compute the style of an image. And you’d actually do this for both the style image s,n for the generated image G. So just to distinguish that this is the style image, maybe let me add a round bracket S there, just to denote that this is the style image for the image S and those are the activations on the image S. And what you do is then compute the same thing for the generated image. So it’s really the same thing summarized sum of a j, a, i, j, k, l, a, i, j,k,l and the summation indices are the same. Let’s follow this and you want to just denote this is for the generated image, I’ll just put the round brackets G there. So, now, you have two matrices they capture what is the style with the image s and what is the style of the image G. And, by the way, we’ve been using the alphabet capital G to denote these matrices. In linear algebra, these are also called the Gram matrix of these in called grand matrices but in this video, I’m just going to use the term style matrix because this term Gram matrix that most of these using capital G to denote these matrices. Finally, the cost function, the style cost function. If you’re doing this on layer l between s and G, you can now define that to be just the difference between these two matrices, G l, G square and these are matrices. So just take it from the previous one. This is just the sum of squares of the element wise differences between these two matrices and just divides this out this is going to be sum over k, sum over k prime of these differences of s, k, k prime minus G l, G, k, k prime and then the sum of square of the elements. The authors actually used this for the normalization constants two times of nh, nw, in that layer, nc in that layer and I’ll square this and you can put this up here as well. But a normalization constant doesn’t matter that much because this causes multiplied by some hyperparameter b anyway.


So just to finish up, this is the style cost function defined using layer l and as you saw on the previous slide, this is basically the Frobenius norm between the two star matrices computed on the image s and on the image G Frobenius on squared and never by the just low normalization constants, which isn’t that important. And, finally, it turns out that you get more visually pleasing results if you use the style cost function from multiple different layers. So, the overall style cost function, you can define as sum over all the different layers of the style cost function for that layer. We should define them all weighted by some set of parameters, by some set of additional hyperparameters, which we’ll denote as lambda l here. So what it does is allows you to use different layers in a neural network. Well of the early ones, which measure relatively simpler low level features like edges as well as some later layers, which measure high level features and cause a neural network to take both low level and high level correlations into account when computing style. And, in the following exercise, you gain more intuition about what might be reasonable choices for this type of parameter lambda as well. And so just to wrap this up, you can now define the overall cost function as alpha times the content cost between c and G plus beta times the style cost between s and G and then just create in the sense or a more sophisticated optimization algorithm if you want in order to try to find an image G that normalize, that tries to minimize this cost function j of G. And if you do that, you can generate pretty good looking neural artistic and if you do that you’ll be able to generate some pretty nice novel artwork.

So that’s it for neural style transfer and I hope you have fun implementing it in this week’s printing exercise. Before wrapping up this week, there’s just one last thing I want to share of you, which is how to do convolutions over 1D or 3D data rather than over only 2D images. Let’s go into the last video.

06_1d-and-3d-generalizations

You have learned a lot about ConvNets, everything ranging from the architecture of the ConvNet to how to use it for image recognition, to object detection, to face recognition and neural-style transfer. And even though most of the discussion has focused on images, on sort of 2D data, because images are so pervasive. It turns out that many of the ideas you’ve learned about also apply, not just to 2D images but also to 1D data as well as to 3D data. Let’s take a look.


In the first week of this course, you learned about the 2D convolution, where you might input a 14 x 14 image and convolve that with a 5 x 5 filter. And you saw how 14 x 14 convolved with 5 x 5, this gives you a 10 x 10 output. And if you have multiple channels, maybe those 14 x 14 x 3, then it would be 5 x 5 that matches the same 3. And then if you have multiple filters, say 16 filters, you end up with 10 x 10 x 16. It turns out that a similar idea can be applied to 1D data as well. For example, on the left is an EKG signal, also called an electrocardioagram. Basically if you place an electrode over your chest, this measures the little voltages that vary across your chest as your heart beats. Because the little electric waves generated by your heart’s beating can be measured with a pair of electrodes. And so this is an EKG of someone’s heart beating. And so each of these peaks corresponds to one heartbeat. So if you want to use EKG signals to make medical diagnoses, for example, then you would have 1D data because what EKG data is, is it’s a time series showing the voltage at each instant in time. So rather than a 14 x 14 dimensional input, maybe you just have a 14 dimensional input. And in that case, you might want to convolve this with a 1 dimensional filter. So rather than the 5 by 5, you just have 5 dimensional filter. So with 2D data what a convolution will allow you to do was to take the same 5 x 5 feature detector and apply it across at different positions throughout the image. And that’s how you wound up with your 10 x 10 output. What a 1D filter allows you to do is take your 5 dimensional filter and similarly apply that in lots of different positions throughout this 1D signal. And so if you apply this convolution, what you find is that a 14 dimensional thing convolved with this 5 dimensional thing, this would give you a 10 dimensional output. And again, if you have multiple channels, you might have in this case you can use just 1 channel, if you have 1 lead or 1 electrode for EKG, so times 5 x 1. And if you have 16 filters, maybe end up with 10 x 16 over there, and this could be one layer of your ConvNet. And then for the next layer of your ConvNet, if you input a 10 x 16 dimensional input and you might convolve that with a 5 dimensional filter again. Then these have 16 channels, so that has a match. And we have 32 filters, then the output of another layer would be 6 x 32, if you have 32 filters, right? And the analogy to the the 2D data, this is similar to all of the 10 x 10 x 16 data and convolve it with a 5 x 5 x 16, and that has to match. That will give you a 6 by 6 dimensional output, and you have 32 filters, that’s where the 32 comes from. So all of these ideas apply also to 1D data, where you can have the same feature detector, such as this, apply to a variety of positions. For example, to detect the different heartbeats in an EKG signal. But to use the same set of features to detect the heartbeats even at different positions along these time series, and so ConvNet can be used even on 1D data. For along with 1D data applications, you actually use a recurrent neural network, which you learn about in the next course. But some people can also try using ConvNets in these problems. And in the next course on sequence models, which we will talk about recurring neural networks and LCM and other models like that. We’ll talk about the pros and cons of using 1D ConvNets versus some of those other models that are explicitly designed to sequenced data. So that’s the generalization from 2D to 1D.


How about 3D data? Well, what is three dimensional data? It is that, instead of having a 1D list of numbers or a 2D matrix of numbers, you now have a 3D block, a three dimensional input volume of numbers. So here’s the example of that which is if you take a CT scan, this is a type of X-ray scan that gives a three dimensional model of your body. But what a CT scan does is it takes different slices through your body. So as you scan through a CT scan which I’m doing here, you can look at different slices of the human torso to see how they look and so this data is fundamentally three dimensional. And one way to think of this data is if your data now has some height, some width, and then also some depth. Where this is the different slices through this volume, are the different slices through the torso.

So if you want to apply a ConvNet to detect features in this three dimensional CAT scan or CT scan, then you can generalize the ideas from the first slide to three dimensional convolutions as well. So if you have a 3D volume, and for the sake of simplicity let’s say is 14 x 14 x 14 and so this is the height, width, and depth of the input CT scan. And again, just like images they’ll all have to be square, a 3D volume doesn’t have to be a perfect cube as well. So the height and width of a image can be different, and in the same way the height and width and the depth of a CT scan can be different. But I’m just using 14 x 14 x 14 here to simplify the discussion. And if you convolve this with a now a 5 x 5 x 5 filter, so you’re filters now are also three dimensional then this would give you a 10 x 10 x 10 volume. And technically, you could also have by 1, if this is the number of channels. So this is just a 3D volume, but your data can also have different numbers of channels, then this would be times 1 as well. Because the number of channels here and the number of channels here has to match. And then if you have 16 filters did a 5 x 5 x 5 x 1 then the next output will be a 10 x 10 x 10 x 16. So this could be one layer of your ConvNet over 3D data, and if the next layer of the ConvNet convolves this again with a 5 x 5 x 5 x 16 dimensional filter. So this number of channels has to match data as usual, and if you have 32 filters then similar to what you saw was ConvNet of the images. Now you’ll end up with a 6 x 6 x 6 volume across 32 channels. So 3D data can also be learned on, sort of directly using a three dimensional ConvNet. And what these filters do is really detect features across your 3D data, CAT scans, medical scans as one example of 3D volumes. But another example of data, you could treat as a 3D volume would be movie data, where the different slices could be different slices in time through a movie. And you could use this to detect motion or people taking actions in movies.

So that’s it on generalization of ConvNets from 2D data to also 1D as well as 3D data. Image data is so pervasive that the vast majority of ConvNets are on 2D data, on image data, but I hope that these other models will be helpful to you as well. So this is it, this is the last video of this week and the last video of this course on ConvNets. You’ve learned a lot about ConvNets and I hope you find many of these ideas useful for your future work. So congratulations on finishing these videos. I hope you enjoyed this week’s exercise and I look forward also to seeing you in the next course on sequence models.

查看本网站请使用全局科学上网
欢迎打赏来支持我的免费分享
0%