01_practical-aspects-of-deep-learning

Note

This is my personal note for the first week of the course Improving Deep Neural Networks: Hyperparameter Tuning, Regularization and Optimization; the copyright belongs to deeplearning.ai.

01_setting-up-your-machine-learning-application

01_train-dev-test-sets

Welcome to this course on the practical aspects of deep learning. Perhaps now you've learned how to implement a neural network. In this week you'll learn the practical aspects of how to make your neural network work well, ranging from things like hyperparameter tuning, to how to set up your data, to how to make sure your optimization algorithm runs quickly so that your learning algorithm learns in a reasonable time. In this first week we'll first talk about how to set up your machine learning problem, then we'll talk about regularization, and we'll talk about some tricks for making sure your neural network implementation is correct.

With that, let’s get started. Making good choices in how you set up your training, development, and test sets can make a huge difference in helping you quickly find a good high performance neural network.

When training a neural network you have to make a lot of decisions, such as: how many layers will your neural network have? How many hidden units do you want each layer to have? What's the learning rate? What are the activation functions you want to use for the different layers? When you're starting on a new application, it's almost impossible to correctly guess the right values for all of these, and for other hyperparameter choices, on your first attempt. So in practice applied machine learning is a highly iterative process, in which you often start with an idea, such as wanting to build a neural network with a certain number of layers, a certain number of hidden units, maybe on certain data sets, and so on. Then you just have to code it up and try it by running your code. You run an experiment and you get back a result that tells you how well this particular network, or this particular configuration, works. Based on the outcome, you might then refine your ideas and change your choices and maybe keep iterating in order to try to find a better and better neural network.

Today, deep learning has found great success in a lot of areas, ranging from natural language processing to computer vision to speech recognition to a lot of applications on structured data. Structured data includes everything from advertisements to web search, which isn't just Internet search engines; it's also, for example, shopping websites, really any website that wants to deliver great search results when you enter terms into a search bar. It includes computer security, logistics, such as figuring out where to send drivers to pick up and drop off things, and many more. So what I'm seeing is that sometimes a researcher with a lot of experience in NLP might try to do something in computer vision, or maybe a researcher with a lot of experience in speech recognition might jump in and try to do something on advertising, or someone from security might want to jump in and do something on logistics. And what I've seen is that intuitions from one domain or from one application area often do not transfer to other application areas. The best choices may depend on the amount of data you have, the number of input features you have, your computer configuration, and whether you're training on GPUs or CPUs, and if so exactly what configuration of GPUs and CPUs, and many other things. So for a lot of applications I think it's almost impossible, even for very experienced deep learning people, to correctly guess the best choice of hyperparameters the very first time. And so today, applied deep learning is a very iterative process where you just have to go around this cycle many times to hopefully find a good choice of network for your application. So one of the things that determines how quickly you can make progress is how efficiently you can go around this cycle, and setting up your data sets well in terms of your train, development and test sets can make you much more efficient at that.


So if this is your training data, let's draw that as a big box. Then traditionally you might take all the data you have and carve off some portion of it to be your training set, some portion of it to be your hold-out cross validation set, which is sometimes also called the development set. For brevity I'm just going to call this the dev set, but all of these terms mean roughly the same thing. And then you might carve out some final portion of it to be your test set. The workflow is that you keep on training algorithms on your training set, and use your dev set, or hold-out cross validation set, to see which of many different models performs best on your dev set. Then, after having done this long enough, when you have a final model that you want to evaluate, you take the best model you have found and evaluate it on your test set, in order to get an unbiased estimate of how well your algorithm is doing.

In the previous era of machine learning, it was common practice to take all your data and split it according to maybe a 70/30% ratio; people often talk about a 70/30 train/test split if you don't have an explicit dev set, or maybe a 60/20/20% split in terms of 60% train, 20% dev and 20% test. Several years ago this was widely considered best practice in machine learning. If you have maybe 100 examples in total, maybe 1,000 examples, maybe 10,000 examples, these sorts of ratios were perfectly reasonable rules of thumb. But in the modern big data era, where, for example, you might have a million examples in total, the trend is that your dev and test sets have been becoming a much smaller percentage of the total. Remember, the goal of the dev set, or development set, is that you're going to test different algorithms on it and see which algorithm works better. So the dev set just needs to be big enough for you to evaluate, say, two different algorithm choices or ten different algorithm choices and quickly decide which one is doing better, and you might not need a whole 20% of your data for that. For example, if you have a million training examples you might decide that just having 10,000 examples in your dev set is more than enough to evaluate which one or two algorithms does better. In a similar vein, the main goal of your test set is, given your final classifier, to give you a pretty confident estimate of how well it's doing. Again, if you have a million examples you might decide that 10,000 examples is more than enough to evaluate a single classifier and give you a good estimate of how well it's doing. So in this example where you have a million examples, if you need just 10,000 for your dev set and 10,000 for your test set, your ratio will be more like this: 10,000 is 1% of 1 million, so you'll have 98% train, 1% dev, 1% test. I've also seen applications where, if you have even more than a million examples, you might end up with 99.5% train and 0.25% dev, 0.25% test, or maybe 0.4% dev and 0.1% test.

So just to recap, when setting up your machine learning problem, I'll often set it up into train, dev and test sets. If you have a relatively small dataset, these traditional ratios might be okay, but if you have a much larger data set, it's also fine to set your dev and test sets to be much smaller than 20% or even 10% of your data. We'll give more specific guidelines on the sizes of dev and test sets later in this specialization.
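To make the split concrete, here is a minimal numpy sketch of carving a shuffled dataset into 98%/1%/1% train/dev/test sets; the array and the sizes are placeholders chosen just for this illustration.

import numpy as np

m = 1_000_000                                    # total number of examples (placeholder)
X = np.random.randn(m, 2)                        # placeholder feature matrix
perm = np.random.permutation(m)                  # shuffle before splitting

n_dev, n_test = 10_000, 10_000                   # 1% dev, 1% test in the big-data regime
dev_idx = perm[:n_dev]
test_idx = perm[n_dev:n_dev + n_test]
train_idx = perm[n_dev + n_test:]                # remaining 98% is the training set

X_train, X_dev, X_test = X[train_idx], X[dev_idx], X[test_idx]
print(X_train.shape, X_dev.shape, X_test.shape)  # (980000, 2) (10000, 2) (10000, 2)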

One other trend we’re seeing in the era of modern deep learning is that more and more people train on mismatched train and test distributions.

Let's say you're building an app that lets users upload a lot of pictures, and your goal is to find pictures of cats in order to show your users; maybe all your users are cat lovers. Maybe your training set comes from cat pictures downloaded off the Internet, but your dev and test sets comprise cat pictures from users using your app. So maybe your training set has a lot of pictures crawled off the Internet, but the dev and test sets are pictures uploaded by users. It turns out a lot of webpages have very high resolution, very professional, very nicely framed pictures of cats, but maybe your users are uploading blurrier, lower resolution images taken with a cell phone camera in a more casual condition. And so these two distributions of data may be different. The rule of thumb I'd encourage you to follow in this case is to make sure that the dev and test sets come from the same distribution. We'll say more about this particular guideline as well, but because you will be using the dev set to evaluate a lot of different models and trying really hard to improve performance on the dev set, it's nice if your dev set comes from the same distribution as your test set. But because deep learning algorithms have such a huge hunger for training data, one trend I'm seeing is that you might use all sorts of creative tactics, such as crawling webpages, in order to acquire a much bigger training set than you would otherwise have, even if part of the cost is that your training set data might not come from the same distribution as your dev and test sets. You'll find that, so long as you follow this rule of thumb, progress in your machine learning algorithm will be faster. I'll give a more detailed explanation for this particular rule of thumb later in the specialization as well.

Finally, it might be okay not to have a test set. Remember, the goal of the test set is to give you an unbiased estimate of the performance of your final network, the network that you selected. But if you don't need that unbiased estimate, then it might be okay not to have a test set. So what you do, if you have only a dev set but not a test set, is train on the training set, try different model architectures, evaluate them on the dev set, and use that to iterate and try to get to a good model. Because you've fit your data to the dev set, this no longer gives you an unbiased estimate of performance, but if you don't need one, that might be perfectly fine. In the machine learning world, when you have just a train and a dev set but no separate test set, most people will call the training set the training set and call the dev set the test set. But what they actually end up doing is using the test set as a hold-out cross validation set, which maybe isn't a completely great use of terminology, because they're then overfitting to the test set. So when a team tells you that they have only a train and a test set, I would be cautious and think: do they really have a train and dev set? Because they're overfitting to the test set. Culturally, it might be difficult to change some of these teams' terminology and get them to call it a train/dev set rather than a train/test set, even though I think calling it a train and development set would be more correct terminology. And this is actually okay practice if you don't need a completely unbiased estimate of the performance of your algorithm. So having set up train, dev and test sets will allow you to iterate more quickly. It will also allow you to more efficiently measure the bias and variance of your algorithm, so you can more efficiently select ways to improve it. Let's start to talk about that in the next video.

02_bias-variance

I've noticed that almost all the really good machine learning practitioners tend to have a very sophisticated understanding of bias and variance. Bias and variance is one of those concepts that's easily learned but difficult to master: even if you think you've seen the basic concepts, there's often more nuance to it than you'd expect. In the deep learning era, another trend is that there's been less discussion of what's called the bias-variance trade-off. You might have heard of the bias-variance trade-off, but in the deep learning era there's less of a trade-off: we still talk about bias, and we still talk about variance, but we just talk less about the bias-variance trade-off. Let's see what this means.

Let's say you have a data set that looks like this. If you fit a straight line to the data, maybe a logistic regression fit, that's not a very good fit to the data. This is a case of high bias; we say that this is underfitting the data.

On the opposite end, if you fit an incredibly complex classifier, maybe a deep neural network, or a neural network with lots of hidden units, maybe you can fit the data perfectly, but that doesn't look like a great fit either. This is a classifier with high variance, and it is overfitting the data.

And there might be some classifier in between, with a medium level of complexity, that maybe fits it correctly like that.

That looks like a much more reasonable fit to the data, so we call that just right; it's somewhere in between. So in a 2D example like this, with just two features, x1 and x2, you can plot the data and visualize bias and variance. In high dimensional problems, you can't plot the data and visualize the decision boundary.

Instead, there are a couple of different metrics that we'll look at to try to understand bias and variance. So, continuing our example of cat picture classification, where that's a positive example and that's a negative example, the two key numbers to look at to understand bias and variance will be the train set error and the dev set (or development set) error. For the sake of argument, let's say that recognizing cats in pictures is something that people can do nearly perfectly, right?

So let’s say, your training set error is 1% and your dev set error is, for the sake of argument, let’s say is 11%.

So in this example, you're doing very well on the training set, but you're doing relatively poorly on the development set. So this looks like you might have overfit the training set, that somehow you're not generalizing well to the hold-out cross-validation set, the development set. And so if you have an example like this, we would say it has high variance.

So by looking at the training set error and the development set error, you would be able to render a diagnosis of your algorithm having high variance. Now, let’s say, that you measure your training set and your dev set error, and you get a different result.

Let’s say, that your training set error is 15%. I’m writing your training set error in the top row, and your dev set error is 16%.

In this case, assuming that humans achieve roughly 0% error, that humans can look at these pictures and just tell whether it's a cat or not, then it looks like the algorithm is not even doing very well on the training set. If it's not even fitting the training data well, then it is underfitting the data, and so this algorithm has high bias. In contrast, it is actually generalizing at a reasonable level to the dev set, because performance on the dev set is only 1% worse than performance on the training set. So this algorithm has a problem of high bias, because it was not even fitting the training set well. This is similar to the leftmost plot we had on the previous slide.

Now, here's another example. Let's say that you have 15% training set error, so that's pretty high bias, but when you evaluate it on the dev set it does even worse, maybe 30%.

In this case, I would diagnose this algorithm as having high bias, because it's not doing that well on the training set, and high variance. So this has really the worst of both worlds. And one last example: if you have 0.5% training set error and 1% dev set error, then maybe your users are quite happy that you have a cat classifier with only 1% error, and you have low bias and low variance.

One subtlety that I'll just briefly mention, and leave to a later video to discuss in detail, is that this analysis is predicated on the assumption that human-level performance gets nearly 0% error or, more generally, that the optimal error, sometimes called Bayes error, is nearly 0%. I don't want to go into detail on this in this particular video, but it turns out that if the optimal error, or Bayes error, were much higher, say 15%, then if you look at this classifier, 15% training error is actually perfectly reasonable, and you wouldn't see it as high bias; it also has pretty low variance. The case of how to analyze bias and variance when no classifier can do very well, for example if you have really blurry images so that even a human, or just no system, could possibly do very well, is one where Bayes error is much higher, and then there are some details of how this analysis changes. But leaving aside this subtlety for now, the takeaway is that by looking at your training set error you can get a sense of how well you are fitting, at least, the training data, and that tells you whether you have a bias problem. Then looking at how much higher your error goes when you go from the training set to the dev set gives you a sense of how bad the variance problem is, that is, how good a job you're doing at generalizing from the training set to the dev set. All this is under the assumption that Bayes error is quite small and that your training and dev sets are drawn from the same distribution. If those assumptions are violated, there's a more sophisticated analysis you could do, which we'll talk about in a later video.
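As a rough illustration of this read-out, here is a tiny sketch that turns train and dev error into a bias/variance diagnosis; the threshold and the assumed Bayes error are arbitrary numbers chosen only for this example, and it assumes the train and dev sets come from the same distribution.

def diagnose(train_err, dev_err, bayes_err=0.0, gap=0.05):
    # Rough bias/variance read-out from train and dev error (as fractions).
    # bayes_err is the assumed optimal (Bayes) error; gap is an arbitrary
    # threshold used only for illustration.
    high_bias = (train_err - bayes_err) > gap        # not fitting the training set well
    high_variance = (dev_err - train_err) > gap      # not generalizing to the dev set
    return high_bias, high_variance

print(diagnose(0.01, 0.11))    # (False, True)  -> high variance
print(diagnose(0.15, 0.16))    # (True, False)  -> high bias
print(diagnose(0.15, 0.30))    # (True, True)   -> both
print(diagnose(0.005, 0.01))   # (False, False) -> low bias and low variance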

Now, on the previous slide you saw what high bias and what high variance look like, and I guess you have a sense of what a good classifier can look like. What does high bias and high variance look like? That is kind of the worst of both worlds. So you'll remember that we said a classifier like this has high bias, because it underfits the data.

So this would be a classifier that is mostly linear and therefore underfits the data; we're drawing this in purple. But if somehow your classifier does some weird things, then it is actually overfitting parts of the data as well.

So the classifier that I drew in purple has both high bias and high variance. It has high bias because, by being a mostly linear classifier, it's just not fitting this quadratic line shape that well; but by having too much flexibility in the middle, it somehow bends to catch this example and this example, overfitting those two examples as well. So this classifier kind of has high bias because it was mostly linear, where you needed maybe a curved or quadratic function, and it has high variance because it had too much flexibility to fit those two mislabeled, or outlier, examples in the middle as well. In case this seems contrived: well, this example is a little bit contrived in two dimensions, but with very high dimensional inputs you actually do get things with high bias in some regions and high variance in some regions, and so it is possible to get classifiers like this with high dimensional inputs that seem less contrived.

So to summarize, you’ve seen how by looking at your algorithm’s error on the training set and your algorithm’s error on the dev set you can try to diagnose, whether it has problems of high bias or high variance, or maybe both, or maybe neither. And depending on whether your algorithm suffers from bias or variance, it turns out that there are different things you could try. So in the next video, I want to present to you, what I call a basic recipe for Machine Learning, that lets you more systematically try to improve your algorithm, depending on whether it has high bias or high variance issues. So let’s go on to the next video.

03_basic-recipe-for-machine-learning

In the previous video, you saw how looking at training error and dev error can help you diagnose whether your algorithm has a bias or a variance problem, or maybe both. It turns out that this information lets you much more systematically go about improving your algorithm's performance, using what I call a basic recipe for machine learning. Let's take a look.


When training a neural network, here's a basic recipe I will use. After having trained an initial model, I will first ask: does your algorithm have high bias? To evaluate whether there is high bias, you should look at, really, the training set or training data performance. If it does have high bias, so it does not even fit the training set that well, some things you could try are to pick a bigger network, meaning more hidden layers or more hidden units, or to train it longer, maybe by running gradient descent longer or trying some more advanced optimization algorithms, which we'll talk about later in this course. Or you can also try, and this is a maybe-it-works, maybe-it-won't kind of thing, but we'll see later that there are a lot of different neural network architectures, and maybe you can find a new network architecture that's better suited for this problem. I'm putting this in parentheses because it's one of those things you just have to try: maybe you can make it work, maybe not. Whereas getting a bigger network almost always helps, and training longer doesn't always help, but it certainly never hurts. So when training a learning algorithm, I would try these things until I can at least get rid of the bias problem, as in go back after I've tried something and keep doing that until I can fit, at least, the training set pretty well.

And usually if you have a big enough network, you should be able to fit the training data well, so long as it's a problem that is possible for someone to do, alright? If the images are very blurry, it may be impossible to fit them. But if at least a human can do well on the task, if you think Bayes error is not too high, then by training a big enough network you should be able to, hopefully, do well, at least on the training set, to at least fit or overfit the training set. Once you reduce bias to acceptable amounts, then ask: do you have a variance problem? To evaluate that, I would look at dev set performance: are you able to generalize from a pretty good training set performance to having a pretty good dev set performance? If you have high variance, well, the best way to solve a high variance problem is to get more data, if you can get it; that can only help. But sometimes you can't get more data. Or you could try regularization, which we'll talk about in the next video, to try to reduce overfitting. And then also, again, sometimes you just have to try it, but if you can find a more appropriate neural network architecture, sometimes that can reduce your variance problem as well as your bias problem.

But how to do that? It's harder to be totally systematic about that. So I try these things and I kind of keep going back until, hopefully, I find something with both low bias and low variance, whereupon you would be done. So a couple of points to notice. First, depending on whether you have high bias or high variance, the set of things you should try could be quite different. So I'll usually use the training and dev set errors to try to diagnose whether you have a bias or variance problem, and then use that to select the appropriate subset of things to try. For example, if you actually have a high bias problem, getting more training data is actually not going to help, or at least it's not the most efficient thing to do. So being clear on how much of a bias problem or variance problem (or both) you have can help you focus on selecting the most useful things to try.

Second, in the earlier era of machine learning, there used to be a lot of discussion of the so-called bias-variance tradeoff. The reason was that, for a lot of the things you could try, you could increase bias and reduce variance, or reduce bias and increase variance. Back in the pre-deep-learning era, we didn't have as many tools that just reduce bias or that just reduce variance without hurting the other one. But in the modern deep learning, big data era, so long as you can keep training a bigger network, and so long as you can keep getting more data, which isn't always the case for either of these, but if it is, then getting a bigger network almost always just reduces your bias without necessarily hurting your variance, so long as you regularize appropriately. And getting more data pretty much always reduces your variance and doesn't hurt your bias much. So what's really happened is that, with these two steps, the ability to train a bigger network or to get more data, we now have tools to drive down bias and just drive down bias, or drive down variance and just drive down variance, without really hurting the other thing that much. I think this has been one of the big reasons that deep learning has been so useful for supervised learning: there's much less of this tradeoff where you have to carefully balance bias and variance, and you just have more options for reducing bias or reducing variance without necessarily increasing the other one.
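Purely to illustrate the control flow of this recipe, here is a small sketch; train, train_error, dev_error and get_more_data are hypothetical stand-ins for whatever training, evaluation and data-collection code your own project provides, and the thresholds and the "hidden_units" key are arbitrary choices made only for this sketch.

def basic_recipe(train, train_error, dev_error, get_more_data, config,
                 bias_target=0.05, variance_target=0.05):
    # The helpers are whatever your project provides; this only shows the loop structure.
    model = train(config)
    while train_error(model) > bias_target:                          # high bias
        config["hidden_units"] *= 2                                  # bigger network (or train longer)
        model = train(config)
    while dev_error(model) - train_error(model) > variance_target:   # high variance
        config = get_more_data(config)                               # more data (or add regularization)
        model = train(config)
    return model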

And, in fact, as long as you have a well-regularized network, training a bigger network almost never hurts. We'll talk about regularization starting from the next video. The main cost of training a neural network that's too big is just computational time, so long as you're regularizing.

So I hope this gives you a sense of the basic structure of how to organize your machine learning problem to diagnose bias and variance, and then select the right operation to make progress on your problem. One of the things I mentioned several times in this video is regularization, which is a very useful technique for reducing variance. There is a little bit of a bias-variance tradeoff when you use regularization: it might increase the bias a little bit, although often not too much if you have a big enough network. But let's dive into more detail in the next video so you can better understand how to apply regularization to your neural network.

02_regularizing-your-neural-network

01_regularization

If you suspect your neural network is overfitting your data, that is, you have a high variance problem, one of the first things you should try is probably regularization. The other way to address high variance is to get more training data, which is also quite reliable, but you can't always get more training data, or it could be expensive to get. Adding regularization will often help to prevent overfitting, or to reduce the errors in your network. So let's see how regularization works.

regularization for logistic regression

Let's develop these ideas using logistic regression. Recall that for logistic regression, you try to minimize the cost function J, which is defined as a sum over your training examples of the losses of the individual predictions on the different examples, where you recall that w and b are the parameters of logistic regression.


So w is an n_x-dimensional parameter vector, and b is a real number. To add regularization to logistic regression, what you do is add to the cost this thing: lambda, which is called the regularization parameter (I'll say more about that in a second), times 1/2m, times the norm of w squared. Here, the norm of w squared is just equal to the sum from j equals 1 to n_x of w_j squared, which can also be written w transpose w; it's just the squared Euclidean norm of the parameter vector w. And this is called L2 regularization, because here you're using the Euclidean norm, also called the L2 norm, of the parameter vector w.

Now, why do you regularize just the parameter w? Why don't we add something about b as well? In practice, you could do this, but I usually just omit it. If you look at your parameters, w is usually a pretty high-dimensional parameter vector, especially with a high variance problem; maybe w just has a lot of parameters, so you aren't fitting all the parameters well, whereas b is just a single number. So almost all the parameters are in w rather than b, and if you add this last term, in practice it won't make much of a difference, because b is just one parameter out of a very large number of parameters. In practice, I usually just don't bother to include it, but you can if you want.

So L2 regularization is the most common type of regularization. You might have also heard some people talk about L1 regularization. That's when, instead of this L2 norm, you add a term that is lambda/m times the sum of the absolute values of w. This is also called the L1 norm of the parameter vector w, hence the little subscript 1, and whether you put m or 2m in the denominator is just a scaling constant. If you use L1 regularization, then w will end up being sparse, which means that the w vector will have a lot of zeros in it. Some people say that this can help with compressing the model, because if a set of parameters is zero you need less memory to store the model. Although I find that, in practice, L1 regularization to make your model sparse helps only a little bit, so I don't think it's used that much, at least not for the purpose of compressing your model. When people train networks, L2 regularization is just used much, much more often.

One last detail: lambda here is called the regularization parameter, and usually you set it using your development set, or using hold-out cross validation. You try a variety of values and see what does best, in terms of trading off between doing well on your training set versus keeping the norm of your parameters small, which helps prevent overfitting. So lambda is another hyperparameter that you might have to tune.
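Written out, the regularized logistic regression cost described above is:

$$J(w,b) = \frac{1}{m}\sum_{i=1}^{m}\mathcal{L}(\hat{y}^{(i)}, y^{(i)}) + \frac{\lambda}{2m}\|w\|_2^2, \qquad \|w\|_2^2 = \sum_{j=1}^{n_x} w_j^2 = w^T w$$

With L1 regularization you instead add $\frac{\lambda}{m}\|w\|_1 = \frac{\lambda}{m}\sum_{j=1}^{n_x}|w_j|$ (whether the denominator is $m$ or $2m$ is just a scaling constant).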

And by the way, for the programming exercises, lambda is a reserved keyword in the Python programming language. So in the programming exercise, we’ll have lambd, without the a, so as not to clash with the reserved keyword in Python. So we use lambd to represent the lambda regularization parameter. So this is how you implement L2 regularization for logistic regression.
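As a minimal numpy sketch of how this could look (the function name and argument shapes are assumptions made only for this illustration, following the lambd naming convention mentioned above):

import numpy as np

def l2_regularized_cost(A, Y, w, lambd):
    # A: predictions sigmoid(w.T x + b), shape (1, m); Y: labels, shape (1, m)
    # w: weight vector, shape (n_x, 1); lambd: regularization parameter
    # ("lambd" because "lambda" is a reserved keyword in Python)
    m = Y.shape[1]
    cross_entropy = -np.sum(Y * np.log(A) + (1 - Y) * np.log(1 - A)) / m
    l2_penalty = (lambd / (2 * m)) * np.sum(np.square(w))
    return cross_entropy + l2_penalty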

regularization for neural network

How about a neural network? In a neural network, you have a cost function that’s a function of all of your parameters, w[1], b[1] through w[L], b[L], where capital L is the number of layers in your neural network. And so the cost function is this, sum of the losses, summed over your m training examples.


And so to add regularization, you add lambda over 2m times the sum, over all of your parameter matrices w[l], of their squared norm, where this squared norm of a matrix is defined as the sum over i and over j of each element of that matrix squared. If you want the indices of this summation, this is a sum from i = 1 through n[l-1] and a sum from j = 1 through n[l], because w is an n[l-1] by n[l] dimensional matrix, where these are the numbers of units in layer l-1 and in layer l. This matrix norm, it turns out, is called the Frobenius norm of the matrix, denoted with an F in the subscript. For arcane linear algebra technical reasons, this is not called the L2 norm of a matrix; instead, it's called the Frobenius norm. I know it sounds like it would be more natural to just call it the L2 norm of the matrix, but for really arcane reasons that you don't need to know, by convention this is called the Frobenius norm. It just means the sum of squares of the elements of a matrix.

So how do you implement gradient descent with this? Previously, we would compute dw using backprop, where backprop gives us the partial derivative of J with respect to w, or really w[l] for any given layer l, and then you update w[l] as w[l] minus the learning rate times dw[l]. This is before we added the extra regularization term to the objective. Now that we've added the regularization term to the objective, what you do is take dw[l] and add to it lambda/m times w[l], and then you compute the update, same as before. And it turns out that with this new definition, dw[l] is still a correct definition of the derivative of your cost function with respect to your parameters, now that you've added the extra regularization term at the end.

It's for this reason that L2 regularization is sometimes also called weight decay. If I take this definition of dw[l] and just plug it in, then you see that the update is w[l] = w[l] minus the learning rate alpha times (the thing from backprop plus lambda/m times w[l]). So this is equal to w[l] minus alpha lambda/m times w[l], minus alpha times the thing you got from backprop. And this term shows that, whatever the matrix w[l] is, you're going to make it a little bit smaller. It's actually as if you're taking the matrix w and multiplying it by 1 minus alpha lambda/m: you're taking the matrix w and subtracting alpha lambda/m times it, so you're multiplying the matrix w by a number which is going to be a little bit less than 1. So this is why L2 regularization is also called weight decay, because it's just like ordinary gradient descent, where you update w by subtracting alpha times the original gradient you got from backprop, but now you're also multiplying w by this thing, which is a little bit less than 1. So the alternative name for L2 regularization is weight decay. I'm not really going to use that name, but the intuition for why it's called weight decay is that this first term amounts to multiplying the weight matrix by a number slightly less than 1.

So that's how you implement L2 regularization in a neural network. Now, one question that people have asked me is: why does regularization prevent overfitting? Let's look at the next video and gain some intuition for how regularization prevents overfitting.
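Here is a small sketch of the resulting update for one weight matrix; the function name and arguments are illustrative only, but the two lines mirror the dw[l] and w[l] updates described above.

import numpy as np

def l2_regularized_update(W, dW_backprop, lambd, m, alpha):
    # dW_backprop is the gradient of the unregularized cost from backprop;
    # adding (lambd / m) * W gives the gradient of the regularized cost.
    # The update shrinks W by a factor (1 - alpha * lambd / m) before the
    # usual gradient step, which is why this is also called weight decay.
    dW = dW_backprop + (lambd / m) * W
    return W - alpha * dW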

02_why-regularization-reduces-overfitting

Why does regularization help with overfitting? Why does it help with reducing variance problems? Let's go through a couple of examples to gain some intuition about how it works. So, recall the pictures of high bias and high variance from the earlier video, which look something like this. Now, let's say you're fitting a large and deep neural network.

I know I haven't drawn this one too large or too deep, but think of some neural network that is currently overfitting. So you have some cost function like J of w, b equals the sum of the losses. What we did for regularization was add this extra term that penalizes the weight matrices for being too large; that was the Frobenius norm. So why is it that shrinking the L2 norm, or the Frobenius norm, of the parameters might cause less overfitting?

One piece of intuition is that if you crank the regularization parameter lambda to be really, really big, you'll be really incentivized to set the weight matrices W to be reasonably close to zero. So one piece of intuition is that maybe it sets the weights so close to zero for a lot of hidden units that it basically zeroes out a lot of the impact of these hidden units. If that's the case, then this much simplified neural network becomes a much smaller neural network; in fact, it is almost like a logistic regression unit, but stacked multiple layers deep. And so that will take you from the overfitting case much closer to the high bias case on the left.

But hopefully there'll be an intermediate value of lambda that results in something closer to the just-right case in the middle. So the intuition is that by cranking up lambda to be really big, you'll set W close to zero; in practice this isn't exactly what happens, but we can think of it as zeroing out, or at least reducing, the impact of a lot of the hidden units, so you end up with what might feel like a simpler network that gets closer and closer to as if you were just using logistic regression. The intuition of completely zeroing out a bunch of hidden units isn't quite right. It turns out that what actually happens is you'll still use all the hidden units, but each of them will just have a much smaller effect. You do end up with a simpler network, as if you had a smaller network, which is therefore less prone to overfitting. This intuition will become clearer when you implement regularization in the programming exercise and actually see some of these variance-reduction results yourself.

Here's another attempt at additional intuition for why regularization helps prevent overfitting. For this, I'm going to assume that we're using the tanh activation function, which looks like this: g(z) = tanh(z). Notice that so long as z is quite small, so if z takes on only a smallish range of values, maybe around here, then you're just using the linear regime of the tanh function. It's only if z is allowed to wander to larger or smaller values that the activation function starts to become less linear. So the intuition you might take away from this is that if lambda, the regularization parameter, is large, then your parameters will be relatively small, because they are penalized for being large in the cost function. And if the weights W are small, then because z = Wa + b (ignoring b for now), z will also be relatively small. In particular, if z ends up taking relatively small values, within this whole linear range, then g(z) will be roughly linear, so it's as if every layer is roughly linear, as if it is just linear regression. And we saw in course one that if every layer is linear, then your whole network is just a linear network; even a very deep network with a linear activation function is only able to compute a linear function. It's not able to fit those very complicated, very non-linear decision boundaries that allow it to overfit data sets like we saw in the overfitting, high-variance case on the previous slide. So just to summarize: if the regularization parameter becomes very large, the parameters W become very small, so z will be relatively small (or, really, it takes on a small range of values). And so the activation function, if it is tanh, say, will be relatively linear, and your whole neural network will be computing something not too far from a big linear function, which is a pretty simple function rather than a very complex, highly non-linear function, and so is much less able to overfit.
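In symbols, the chain of intuition described above is:

$$\lambda \text{ large} \;\Rightarrow\; W^{[l]} \text{ small} \;\Rightarrow\; z^{[l]} = W^{[l]}a^{[l-1]} + b^{[l]} \text{ small} \;\Rightarrow\; g(z^{[l]}) = \tanh(z^{[l]}) \approx z^{[l]},$$

so every layer behaves roughly linearly and the whole network computes something close to a linear function, which cannot carve out very complicated non-linear decision boundaries.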

And again, when you implement regularization yourself in the programming exercise, you'll be able to see some of these effects yourself. Before wrapping up our discussion on regularization, I just want to give you one implementational tip. When implementing regularization, we took our definition of the cost function J and we actually modified it by adding this extra term that penalizes the weights for being too large.


And so if you implement gradient descent, one of the steps to debug gradient descent is to plot the cost function J as a function of the number of iterations of gradient descent, and you want to see that the cost function J decreases monotonically after every iteration of gradient descent. If you're implementing regularization, then please remember that J now has this new definition. If you plot the old definition of J, just the first term, then you might not see it decrease monotonically. So to debug gradient descent, make sure that you're plotting the new definition of J that includes the second term as well; otherwise you might not see J decrease monotonically on every single iteration. So that's it for L2 regularization, which is actually the regularization technique that I use the most in training deep learning models. In deep learning there is another sometimes-used regularization technique called dropout regularization. Let's take a look at that in the next video.

03_dropout-regularization

In addition to L2 regularization, another very powerful regularization technique is called “dropout.”

Let’s see how that works. Let’s say you train a neural network like the one on the left and there’s over-fitting. Here’s what you do with dropout. Let me make a copy of the neural network.

With dropout, what we're going to do is go through each of the layers of the network and set some probability of eliminating a node in the neural network. Let's say that for each of these layers, we're going to go through each node, toss a coin, and have a 0.5 chance of keeping each node and a 0.5 chance of removing each node.

So, after the coin tosses, maybe we'll decide to eliminate those nodes; then what you do is actually remove all the outgoing connections from those nodes as well. So you end up with a much smaller, really much diminished network.

And then you do back propagation training with one example on this much diminished network. Then on different examples, you would toss a set of coins again, keep a different set of nodes, and drop out, or eliminate, different nodes. And so for each training example, you would train it using one of these diminished networks.

So, maybe it seems like a slightly crazy technique: you just go around knocking out nodes at random. But this actually works. You can imagine that because you're training a much smaller network on each example, maybe this gives you a sense of why you end up able to regularize the network, because these much smaller networks are being trained.

Inverted dropout

Let's look at how you implement dropout. There are a few ways of implementing dropout. I'm going to show you the most common one, which is a technique called inverted dropout.

For the sake of completeness, let’s say we want to illustrate this with layer l=3. So, in the code I’m going to write- there will be a bunch of 3s here. I’m just illustrating how to represent dropout in a single layer.

So, what we are going to do is set a vector d, and d3 is going to be the dropout vector for layer 3. d3 is going to be np.random.rand(...) with the same shape as a3, and then we check whether it is less than some number, which I'm going to call keep_prob. keep_prob is a number: it was 0.5 in the previous example, and maybe now I'll use 0.8 in this example, and it is the probability that a given hidden unit will be kept. So if keep_prob = 0.8, this means that there's a 0.2 chance of eliminating any hidden unit. What this does is generate a random matrix, and it works as well if you have vectorized: d3 will be a matrix, and for each example and each hidden unit there's a 0.8 chance that the corresponding entry of d3 will be one, and a 20% (0.2) chance that it will be zero. In other words, each of these random numbers has a 0.8 chance of being less than 0.8, so of being one (true), and a 0.2 chance of being zero (false).

import numpy as np
# a3 is the activation of layer 3, as computed by forward propagation
keep_prob = 0.8  # probability of keeping a given hidden unit
d3 = np.random.rand(a3.shape[0], a3.shape[1]) < keep_prob  # dropout mask for layer 3
a3 = np.multiply(a3, d3)  # zero out the dropped units (elementwise, same as a3 * d3)
a3 /= keep_prob  # inverted dropout: scale up so the expected value of a3 is unchanged

There was a 20% chance of each of the elements of d3 being zero, so the multiply operation ends up zeroing out the corresponding elements of a3. If you do this in Python, technically d3 will be a boolean array whose values are True and False rather than one and zero, but the multiply operation works and will interpret the True and False values as one and zero.

Then finally, we're going to take a3 and scale it up by dividing by 0.8, or really dividing by our keep_prob parameter.

So, let me explain what this final step is doing. Let's say, for the sake of argument, that you have 50 units, or 50 neurons, in the third hidden layer, so maybe a3 is 50 by 1 dimensional, or with vectorization maybe it's 50 by m dimensional. If you have an 80% chance of keeping each of them and a 20% chance of eliminating them, this means that on average you end up with 10 units shut off, or zeroed out. Now, if you look at the value of z[4], $Z^{[4]}=W^{[4]}\cdot a^{[3]}+b^{[4]}$, then on expectation this will be reduced by 20%, by which I mean that 20% of the elements of a3 will be zeroed out. So, in order not to reduce the expected value of z[4], what you do is take a3 and divide it by 0.8, because this corrects, or bumps back up, by roughly the 20% that you need, so that the expected value of a3 is not changed. This line is what's called the inverted dropout technique, and its effect is that, no matter what you set keep_prob to, whether it's 0.8 or 0.9 or even 1 (if it's set to 1 then there's no dropout, because you're keeping everything) or 0.5 or whatever, dividing by keep_prob ensures that the expected value of a3 remains the same.


And it turns out that at test time, when you're trying to evaluate a neural network, which we'll talk about on the next slide, this inverted dropout technique, with the line that divides by keep_prob, makes test time easier because you have less of a scaling problem.

By far the most common implementation of dropout today, as far as I know, is inverted dropout, and I recommend you just implement this. There were some early iterations of dropout that missed this divide-by-keep_prob line, and so at test time the scaling becomes more complicated; but again, people tend not to use those other versions. So, what you do is use the d vector, and you'll notice that for different training examples, you zero out different hidden units. In fact, if you make multiple passes through the same training set, then on different passes through the training set you should randomly zero out different hidden units. It's not that for one example you keep zeroing out the same hidden units; rather, on iteration one of gradient descent you might zero out some hidden units, and on the second iteration of gradient descent, where you go through the training set a second time, maybe you'll zero out a different pattern of hidden units.

And the vector d, or d3 for the third layer, is used to decide what to zero out, both in forward prop as well as in backprop. We are just showing forward prop here.

Don’t use dropout at test time

At test time, you're given some x for which you want to make a prediction. Using our standard notation, I'm going to use $a^{[0]}$, the activations of the zeroth layer, to denote the test example x. What we're going to do is not use dropout at test time. In particular, $z^{[1]} = W^{[1]}a^{[0]} + b^{[1]}$, $a^{[1]} = g^{[1]}(z^{[1]})$, $z^{[2]} = W^{[2]}a^{[1]} + b^{[2]}$, $a^{[2]} = g^{[2]}(z^{[2]})$, and so on, until you get to the last layer and make a prediction $\hat{y}$. Notice that at test time you're not using dropout explicitly and you're not tossing coins at random to decide which hidden units to eliminate. That's because when you are making predictions at test time, you don't really want your output to be random; implementing dropout at test time would just add noise to your predictions. In theory, one thing you could do is run the prediction process many times with different hidden units randomly dropped out and average across them, but that's computationally inefficient and would give you roughly the same result, very similar results, to this procedure. And just to mention, regarding the inverted dropout technique: remember the step on the previous slide where we divided by keep_prob. The effect of that is to ensure that, even though you don't implement dropout at test time, the expected value of these activations doesn't change, so you don't need to add an extra funny scaling parameter at test time. That's different from training time.
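To make the train/test contrast concrete, here is a small sketch of a single layer's forward step with inverted dropout; the function, the choice of tanh as activation, and the argument shapes are assumptions made only for this illustration.

import numpy as np

def layer_forward_with_dropout(a_prev, W, b, keep_prob, training):
    # One layer's forward step with inverted dropout (illustrative sketch).
    z = np.dot(W, a_prev) + b
    a = np.tanh(z)                                   # any activation; tanh just for the example
    if training:
        d = np.random.rand(*a.shape) < keep_prob     # random dropout mask
        a = np.multiply(a, d)                        # zero out the dropped units
        a /= keep_prob                               # inverted dropout scaling
    return a                                         # at test time: no mask, no extra scaling

# training-time call vs. test-time call (shapes are placeholders)
W, b = np.random.randn(5, 4), np.zeros((5, 1))
x = np.random.randn(4, 3)
a_train = layer_forward_with_dropout(x, W, b, keep_prob=0.8, training=True)
a_test = layer_forward_with_dropout(x, W, b, keep_prob=0.8, training=False)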

So that's dropout. When you implement this in this week's programming exercise, you'll gain more firsthand experience with it as well. But why does it really work? What I want to do in the next video is give you some better intuition about what dropout is really doing. Let's go on to the next video.

04_understanding-dropout

Dropout does this seemingly crazy thing of randomly knocking out units in your network. Why does it work so well as a regularizer? Let's gain some better intuition. In the previous video, I gave the intuition that dropout randomly knocks out units in your network, so it's as if on every iteration you're working with a smaller neural network, and using a smaller neural network seems like it should have a regularizing effect.

spread out weights

Here's a second intuition, which is to look at it from the perspective of a single unit, say this one. Now, for this unit to do its job, it has four inputs and it needs to generate some meaningful output. With dropout, the inputs can get randomly eliminated.

Sometimes those two units will get eliminated, sometimes a different unit will get eliminated. So what this means is that this unit, which I'm circling in purple, can't rely on any one feature, because any one feature, any one of its own inputs, could go away at random. So this unit will be reluctant to put all of its bets on, say, just this one input; we'd be reluctant to put too much weight on any one input because it could go away. So this unit will be more motivated to spread out the weights and give a little bit of weight to each of its four inputs. And by spreading out the weights, this will tend to have the effect of shrinking the squared norm of the weights. So, similar to what we saw with L2 regularization, the effect of implementing dropout is that it shrinks the weights and so helps prevent overfitting. In fact, it turns out that dropout can formally be shown to be an adaptive form of L2 regularization, where the L2 penalty on different weights is different, depending on the size of the activations being multiplied into those weights.

But to summarize, it is possible to show that dropout has a similar effect to L2 regularization; only the L2-like penalty applied to different weights can be a little bit different, so it's even more adaptive to the scale of different inputs.

One more detail for when you’re implementing drop out

Here’s a network where you have three input features.

There are seven hidden units here, then seven, three, two, one. So, one of the parameters we had to choose was keep_prob, which is the chance of keeping a unit in each layer. It is also feasible to vary keep_prob by layer. So for the first layer, your matrix W1 will be three by seven, your second weight matrix W2 will be seven by seven, W3 will be seven by three, and so on. W2 is actually the biggest weight matrix, because the largest set of parameters is in W2, which is seven by seven. So to reduce overfitting of that matrix, maybe for that layer, I guess this is layer two, you might have a keep_prob that's relatively low, say 0.5, whereas for different layers, where you might worry less about overfitting, you could have a higher keep_prob, maybe 0.7. And for layers where you don't worry about overfitting at all, you can have a keep_prob of 1.0.

For clarity, the numbers I'm drawing in the purple boxes are the different keep_prob values for different layers. Notice that a keep_prob of 1.0 means that you're keeping every unit, so you're really not using dropout for that layer. But for layers where you're more worried about overfitting, really the layers with a lot of parameters, you can set keep_prob to be smaller to apply a more powerful form of dropout. It's kind of like cranking up the regularization parameter lambda of L2 regularization, where you try to regularize some layers more than others. Technically, you can also apply dropout to the input layer, where you can have some chance of just zeroing out one or more of the input features, although in practice you usually don't do that very often; a keep_prob of 1.0 is quite common for the input layer. You could also use a very high value, maybe 0.9, but it's much less likely that you'd want to eliminate half of the input features. So usually keep_prob for the input layer, if you apply dropout to it at all, will be a number close to one.

So just to summarize: if you're more worried about some layers overfitting than others, you can set a lower keep_prob for some layers than others. The downside is that this gives you even more hyperparameters to search over using cross-validation. One alternative might be to have some layers where you apply dropout and some layers where you don't, and then just have one hyperparameter, a single keep_prob, for the layers where you do apply dropout.
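As a small illustration (the exact numbers are arbitrary choices, not values from the course), the per-layer keep_prob settings for the network above could be written as something like:

keep_prob = {
    0: 1.0,  # input layer: dropout rarely applied here
    1: 0.7,  # first hidden layer
    2: 0.5,  # layer with the big 7x7 weight matrix: strongest dropout
    3: 0.7,  # smaller later layers
    4: 1.0,  # no dropout where overfitting is not a worry
}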

And before we wrap up, just a couple of implementational tips. Many of the first successful implementations of dropout were in computer vision. In computer vision, the input size is so big, inputting all these pixels, that you almost never have enough data, and so dropout is very frequently used in computer vision; there are some computer vision researchers that pretty much always use it, almost as a default. But really, the thing to remember is that dropout is a regularization technique: it helps prevent overfitting. And so, unless my algorithm is overfitting, I wouldn't actually bother to use dropout, so it's used somewhat less often in other application areas. It's just that with computer vision, you usually don't have enough data, so you're almost always overfitting, which is why there tend to be some computer vision researchers who swear by dropout. But their intuition doesn't always generalize, I think, to other disciplines.

One big downside of dropout is that the cost function J is no longer well-defined. On every iteration, you are randomly killing off a bunch of nodes, and so, if you are double-checking the performance of gradient descent, it's actually harder to double-check that you have a well-defined cost function J that is going downhill on every iteration, because the cost function J that you're optimizing is actually less well defined, or is certainly hard to calculate. So you lose this debugging tool of plotting a graph like this.

So what I usually do is turn off dropout, by setting keep_prob equal to one, run my code and make sure that it is monotonically decreasing J, and then turn on dropout and hope that I didn't introduce bugs into my code when adding dropout. Because you need other ways, other than plotting these figures, to make sure that your code is working with gradient descent, and that it's working even with dropout. So with that, there are still a few more regularization techniques that are worth knowing. Let's talk about a few more such techniques in the next video.

05_other-regularization-methods

In addition to L2 regularization and dropout regularization, there are a few other techniques for reducing overfitting in your neural network.

cat recognition

Let's take a look. Let's say you're fitting a cat classifier. If you are overfitting, getting more training data can help, but getting more training data can be expensive and sometimes you just can't get more data. What you can do instead is augment your training set by taking an image like this.

For example, flipping it horizontally and adding that to your training set as well. So now, instead of just this one example in your training set, you can add this second example. Because your training set is now a bit redundant, this isn't as good as if you had collected an additional set of brand-new independent examples, but you can do it without needing to pay the expense of going out to take more pictures of cats. Then, other than flipping horizontally, you can also take random crops of the image: here we've rotated and sort of randomly zoomed into the image, and this still looks like a cat. So by taking random distortions and translations of the image you can augment your data set and make additional fake training examples. Again, these extra fake training examples don't add as much information as a brand-new independent example of a cat would, but because you can do this almost for free, other than some computational cost, this can be an inexpensive way to give your algorithm more data and therefore sort of regularize it and reduce overfitting. And by synthesizing examples like this, what you're really telling your algorithm is that if something is a cat, then flipping it horizontally is still a cat (notice I didn't flip it vertically, because maybe we don't want upside-down cats), and randomly zooming in to part of the image probably still gives a cat.
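A minimal numpy sketch of the two augmentations mentioned here, horizontal flipping and random cropping; the crop fraction and the function itself are assumptions made only for illustration (real pipelines usually also resize the crop back to the original size).

import numpy as np

def augment(image, rng=np.random):
    # image: numpy array of shape (height, width, channels); purely illustrative.
    flipped = image[:, ::-1, :]                          # horizontal flip: still a cat
    h, w, _ = image.shape
    ch, cw = int(0.8 * h), int(0.8 * w)                  # crop to 80% of each side (arbitrary choice)
    top = rng.randint(0, h - ch + 1)                     # random crop offsets
    left = rng.randint(0, w - cw + 1)
    cropped = image[top:top + ch, left:left + cw, :]     # random zoomed-in view of the cat
    return flipped, cropped

# example: two extra "fake" training examples from one original image
extra1, extra2 = augment(np.zeros((100, 150, 3)))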

optical character recognition

For optical character recognition, you can also augment your data set by taking digits and imposing random rotations and distortions on them. If you add these to your training set, they are still the digit four. For illustration I applied a very strong distortion, so these look like very wavy fours; I made the distortion strong to make the example clearer, but in practice you don't need to distort the fours quite so aggressively. A more subtle distortion than what I'm showing here is usually used in practice, because these really do look like warped fours. So data augmentation can be used as a regularization technique; in effect it works similarly to regularization.

early stopping

There's one other technique that is often used, called early stopping. As you run gradient descent, you plot either the training error (the 0-1 classification error on the training set) or the cost function J you're optimizing, and that should decrease monotonically, because as you train, your cost function J should go down. With early stopping you plot this, and you also plot your dev set error. Again, this could be the classification error on the dev set, or something like the cost function, such as the logistic loss or the log loss, evaluated on the dev set. What you typically find is that your dev set error goes down for a while and then starts to increase.

What early stopping does is say: it looks like your neural network was doing best around that iteration, so let's just stop training halfway through and take whatever parameter values achieved that best dev set error (a small sketch of this procedure follows). So why does this work? Well, when you haven't run many iterations yet, your parameters w will be close to zero, because with random initialization you probably initialize w to small random values; before you train for a long time, w is still quite small. As you iterate, as you train, w gets bigger and bigger, until at the end you may have a much larger value of the parameters w for your neural network. So by stopping halfway, early stopping leaves you with a mid-size norm of w. Similar to L2 regularization, by picking a neural network with a smaller norm for your parameters w, hopefully your neural network is overfitting less. The term early stopping refers to the fact that you're just stopping the training of your neural network earlier. I sometimes use early stopping when training a neural network.
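A minimal sketch of this procedure (the callables `train_step` and `dev_error` are hypothetical stand-ins for one gradient-descent step and the dev-set error, not course code):

```python
import copy

def train_with_early_stopping(train_step, dev_error, params, max_iters=1000, patience=20):
    """Early-stopping sketch: keep the parameters with the best dev-set error
    and stop once the dev error has not improved for `patience` steps.

    train_step(params) -> params and dev_error(params) -> float are hypothetical
    callables for one gradient-descent step and the dev-set error."""
    best_err, best_params, since_best = float("inf"), copy.deepcopy(params), 0
    for _ in range(max_iters):
        params = train_step(params)
        err = dev_error(params)
        if err < best_err:
            best_err, best_params, since_best = err, copy.deepcopy(params), 0
        else:
            since_best += 1
            if since_best >= patience:    # dev error stopped improving
                break                     # stop training "halfway"
    return best_params, best_err          # parameters with the best dev-set error
```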

But it does have one downside; let me explain. I think of the machine learning process as comprising several different tasks. One is that you want to optimize the cost function J, and we have various tools for that, such as gradient descent; later we'll talk about other algorithms like momentum, RMSprop and Adam. After optimizing the cost function J, you also want to not overfit, and we have some tools for that, such as regularization, getting more data, and so on. Now, in machine learning we already have a lot of hyperparameters to search over, and it's already very complicated to choose among the space of possible algorithms. So I find machine learning easier to think about when I have one set of tools for optimizing the cost function J, and when I'm focusing on optimizing J, all I care about is finding w and b so that J(w, b) is as small as possible; I just don't think about anything else. Then it's a completely separate task to not overfit, in other words to reduce variance, and when I'm doing that I have a separate set of tools for it. This principle is sometimes called orthogonalization: the idea that you want to be able to think about one task at a time. I'll say more about orthogonalization in a later video, so if you don't fully get the concept yet, don't worry about it.

To me, the main downside of early stopping is that it couples these two tasks, so you can no longer work on the two problems independently. By stopping gradient descent early, you're breaking off whatever you were doing to optimize the cost function J, so you're not doing a great job of reducing it, while at the same time you're trying to not overfit. So instead of using different tools to solve the two problems, you're using one tool that mixes the two, and this makes the set of things you could try more complicated to think about.

Rather than using early stopping, one alternative is to just use L2 regularization; then you can train the neural network as long as possible. I find that this makes the search space of hyperparameters easier to decompose and easier to search over. The downside, though, is that you might have to try a lot of values of the regularization parameter lambda, which makes the search more computationally expensive.

The advantage of early stopping is that by running the gradient descent process just once, you get to try out small values of w, mid-size values of w, and large values of w, without needing to try a lot of values of the L2 regularization hyperparameter lambda. If this concept doesn't completely make sense to you yet, don't worry about it; we'll talk about orthogonalization in greater detail in a later video, and I think it will make more sense then. Despite its disadvantage, many people do use early stopping. I personally prefer to just use L2 regularization and try different values of lambda, assuming you can afford the computation to do so. But early stopping does let you get a similar effect without needing to explicitly try lots of different values of lambda.

So you've now seen how to use data augmentation, as well as, if you wish, early stopping, in order to reduce variance and prevent overfitting in your neural network. Next let's talk about some techniques for setting up your optimization problem to make your training go quickly.

03_setting-up-your-optimization-problem

01_normalizing-inputs

When training a neural network, one of the techniques that will speed up your training is if you normalize your inputs. Let’s see what that means.

Suppose you have a training set with two input features, so the input features x are two dimensional, and here's a scatter plot of your training set. Normalizing your inputs corresponds to two steps. The first is to subtract out, or zero out, the mean: you set $\mu = \dfrac{1}{m}\sum_{i=1}^{m}x^{(i)}$, which is a vector, and then $x := x-\mu$ for every training example, so you move the training set until it has zero mean. The second step is to normalize the variances. Notice here that the feature x1 has a much larger variance than the feature x2. So you set $\sigma^{2} = \dfrac{1}{m}\sum\limits_{i=1}^{m}x^{(i)^{2}}$, where the squaring is element-wise; in Python this is roughly sigma_squared = np.mean(X**2, axis=1, keepdims=True) if X has shape (n_x, m) and has already been mean-centered. So sigma squared is a vector with the variance of each feature; since we've already subtracted out the mean, the element-wise square of x is just the variance. Then you take each example and divide it element-wise by $\sigma$, the vector of standard deviations. In pictures, you end up with a data set where the variances of x1 and x2 are both equal to one.

One tip: if you use this to scale your training data, then use the same $\mu$ and $\sigma^{2}$ to normalize your test set. In particular, you don't want to normalize the training set and the test set differently. Whatever the values of $\mu$ and $\sigma^{2}$ are, use them in the same two formulas so that you scale your test set in exactly the same way, rather than estimating $\mu$ and $\sigma^{2}$ separately on the training set and the test set. You want your data, both training and test examples, to go through the same transformation, defined by the same $\mu$ and $\sigma^{2}$ computed on your training data.
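As a small numpy sketch of both steps (my own helper names, not course code), assuming X has shape (n_x, m) with one column per example:

```python
import numpy as np

# Normalization sketch: estimate mu and sigma^2 on the training set only,
# then apply the SAME transformation to the test set.
def fit_normalizer(X_train):
    mu = np.mean(X_train, axis=1, keepdims=True)                    # per-feature mean
    sigma2 = np.mean((X_train - mu) ** 2, axis=1, keepdims=True)    # per-feature variance
    return mu, sigma2

def apply_normalizer(X, mu, sigma2, eps=1e-8):
    return (X - mu) / np.sqrt(sigma2 + eps)                         # zero mean, unit variance

rng = np.random.default_rng(1)
X_train = rng.standard_normal((2, 1000)) * np.array([[1000.0], [1.0]])
mu, sigma2 = fit_normalizer(X_train)
X_train_norm = apply_normalizer(X_train, mu, sigma2)
# X_test_norm = apply_normalizer(X_test, mu, sigma2)   # reuse the training mu and sigma2
```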

So, why do we do this? Why do we want to normalize the input features? Recall that the cost function is defined as written on the top right. It turns out that if you use unnormalized input features, it's more likely that your cost function will look like a very squished-out, elongated bowl, where the minimum you're trying to find is maybe way over there.

If your features are on very different scales, say feature x1 ranges from 1 to 1,000 and feature x2 ranges from 0 to 1, then the ranges of values taken by the parameters w1 and w2 will end up being very different. Maybe these axes should be labeled w1 and w2, but I'll plot w and b; then your cost function can be a very elongated bowl, and if you plot the contours of this function, you get very elongated ellipses. Whereas if you normalize the features, your cost function will on average look more symmetric. If you're running gradient descent on a cost function like the one on the left, you might have to use a very small learning rate, because gradient descent might need a lot of steps, oscillating back and forth, before it finally finds its way to the minimum.

Whereas if you have more spherical contours, then wherever you start, gradient descent can pretty much go straight to the minimum. You can take much larger steps with gradient descent, rather than needing to oscillate around like in the picture on the left.

Of course, in practice w is a high-dimensional vector, so trying to plot this in 2D doesn't convey all the intuitions correctly. But the rough intuition is that your cost function will be more round and easier to optimize when your features are all on similar scales: not one from 1 to 1,000 and another from 0 to 1, but mostly from minus one to one, or at least with similar variances. That just makes your cost function J easier and faster to optimize. In practice, if one feature, say x1, ranges from 0 to 1, x2 ranges from -1 to 1, and x3 ranges from 1 to 2, these are fairly similar ranges, so this works just fine. It's when they're on dramatically different ranges, like one from 1 to 1,000 and another from 0 to 1, that it really hurts your optimization algorithm. By just setting all of them to zero mean and, say, variance one, as we did on the last slide, you guarantee that all your features are on a similar scale, which will usually help your learning algorithm run faster. So if your input features come on very different scales, maybe some from 0 to 1 and some from 1 to 1,000, it's important to normalize your features. If your features come in on similar scales, this step is less important. Although performing this type of normalization pretty much never does any harm, I'll often do it anyway if I'm not sure whether or not it will help speed up training for my algorithm.

So that's it for normalizing your input features. Next, let's keep talking about ways to speed up the training of your neural network.

02_vanishing-exploding-gradients

One of the problems of training neural networks, especially very deep neural networks, is vanishing and exploding gradients. What that means is that when you're training a very deep network, your derivatives or your slopes can sometimes get either very, very big or very, very small, maybe even exponentially small, and this makes training difficult. In this video you'll see what the problem of exploding and vanishing gradients really means, as well as how careful choices of the random weight initialization can significantly reduce this problem.

Let's say you're training a very deep neural network like this. To save space on the slide, I've drawn it as if you have only two hidden units per layer, but it could be more. This neural network has parameters W[1], W[2], W[3] and so on up to W[L].

For the sake of simplicity, let's say we're using a linear activation function $g(z) = z$ and $b^{[l]} = 0$. In that case you can show that the output is $\hat{y} = W^{[L]} W^{[L-1]}\cdots W^{[2]} W^{[1]}x$. Now, let's say that each weight matrix $W^{[l]}$ is just a little bit larger than the identity, for example $W^{[l]}=\left[ \begin{array}{cc} 1.5 & 0 \\ 0 & 1.5\end{array} \right]$. (Technically the last one has different dimensions, so say this holds for all the other weight matrices.) Then $\hat y = W^{[L]}\left[ \begin{array}{cc}1.5 & 0 \\ 0 & 1.5\end{array} \right]^{L-1}x$, because we assume each of these matrices equals 1.5 times the identity matrix, so $\hat y = W^{[L]}\,1.5^{L-1}\left[ \begin{array}{cc} 1 & 0 \\ 0 & 1\end{array} \right]x$, which is essentially $\hat y = W^{[L]}\,1.5^{L-1}x$. If L is large, for a very deep neural network, $\hat y$ will be very large: it grows exponentially, like 1.5 to the number of layers. So for a very deep neural network, the value of $\hat y$ will explode.

Now, conversely, if we replace 1.5 with 0.5, something less than 1, then this becomes 0.5 to the power of L: $\hat y = W^{[L]}\,0.5^{L-1}x$. With each of your matrices being less than the identity, if, say, x1 and x2 both equal 1, then the activations will be 1/2, 1/2, then 1/4, 1/4, then 1/8, 1/8, and so on, until they become $1/2^{L}$.

So the activation values will decrease exponentially as a function of the depth, the number of layers L, of the network. In a very deep network, the activations end up decreasing exponentially.
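To make this concrete, here is a tiny numpy illustration (mine, not from the course) of repeatedly multiplying by 1.5 times the identity versus 0.5 times the identity:

```python
import numpy as np

# Toy illustration: with linear activations and b = 0, the output is a product
# of the weight matrices times x. Weights slightly above or below the identity
# make the activations grow or shrink exponentially with the depth L.
L = 50
x = np.ones((2, 1))

a_explode, a_vanish = x, x
W_big = np.array([[1.5, 0.0], [0.0, 1.5]])     # 1.5 * identity
W_small = np.array([[0.5, 0.0], [0.0, 0.5]])   # 0.5 * identity
for _ in range(L):
    a_explode = W_big @ a_explode
    a_vanish = W_small @ a_vanish

print(a_explode[0, 0])   # ~1.5**50 ≈ 6.4e8   (explodes)
print(a_vanish[0, 0])    # ~0.5**50 ≈ 8.9e-16 (vanishes)
```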

So the intuition I hope you take away from this is that if the weights W are all just a little bit bigger than one, or just a little bit bigger than the identity matrix, then with a very deep network the activations can explode. And if W is just a little bit less than the identity, say with 0.9, 0.9 on the diagonal, then with a very deep network the activations will decrease exponentially.

And even though I went through this argument in terms of activations increasing or decreasing exponentially as a function of L, a similar argument shows that the derivatives, the gradients you compute during backpropagation, will also increase or decrease exponentially as a function of the number of layers. With some modern neural networks, L can be around 150; Microsoft recently got great results with a 152-layer neural network. But with such a deep neural network, if your activations or gradients increase or decrease exponentially as a function of L, these values can get really big or really small, and this makes training difficult. In particular, if your gradients are exponentially small in L, gradient descent will take tiny little steps, and it will take a long time for gradient descent to learn anything.

To summarize, you’ve seen how deep networks suffer from the problems of vanishing or exploding gradients. In fact, for a long time this problem was a huge barrier to training deep neural networks. It turns out there’s a partial solution that doesn’t completely solve this problem but it helps a lot which is careful choice of how you initialize the weights. To see that, let’s go to the next video.

03_weight-initialization-for-deep-networks

In the last video you saw how very deep neural networks can have the problems of vanishing and exploding gradients. It turns out that a partial solution to this, which doesn't solve it entirely but helps a lot, is a better or more careful choice of the random initialization for your neural network. To understand this, let's start with the example of initializing the weights for a single neuron, and then we'll generalize to a deep network.

Let's go through an example with just a single neuron, and then we'll talk about the deep network later. A single neuron might take four input features x1 through x4, compute some a = g(z), and output some $\hat y$. Later on, in a deeper net, these inputs will be activations from some layer $a^{[l]}$, but for now let's just call them x. So z = w1 x1 + w2 x2 + ... + wn xn, and let's set b = 0, so we can ignore b for now. In order for z to not blow up and not become too small, you notice that the larger n is, the smaller you want each wi to be.

Because z is the sum of the wi xi, if you're adding up a lot of these terms, you want each term to be smaller. One reasonable thing to do is to set the variance of wi to 1/n, $\mathrm{Var}(w_i) = \frac{1}{n}$, where n is the number of input features going into the neuron. So in practice, for layer l you can set the weight matrix as W_l = np.random.randn(n_l, n_prev) * np.sqrt(1/n_prev), where n_prev is $n^{[l-1]}$, the number of units feeding into each unit of layer l. It turns out that if you're using a ReLU activation function, setting the variance to 2/n works a little bit better.

So you often see this initialization, especially if you're using a ReLU activation function, i.e. if g(z) = ReLU(z). Depending on how familiar you are with random variables: sampling a standard Gaussian random variable and multiplying it by the square root of this term sets the variance of the weights to 2/n.

The reason I went from n to $n^{[l-1]}$ is that this single-neuron example, which is like logistic regression, has n input features, but in the more general case, each unit of layer l has $n^{[l-1]}$ inputs.

So if the input features or activations are roughly mean 0 and variance 1, this causes z to also take on a similar scale. This doesn't solve the problem, but it definitely helps reduce the vanishing and exploding gradients problem, because it sets each of the weight matrices W to be not too much bigger than 1 and not too much less than 1, so the activations don't explode or vanish too quickly.

Let me just mention some other variants. The version we just described assumes a ReLU activation function, and comes from a paper by He et al. A few other variants: if you are using a tanh activation function, there is a paper showing that instead of the constant 2 it's better to use the constant 1, so you use 1 over $n^{[l-1]}$ instead of 2 over $n^{[l-1]}$, and multiply by the square root of that. That square root term replaces the previous one when you're using a tanh activation function. This is called Xavier initialization.

Another version, proposed by Yoshua Bengio and his colleagues, which you might see in some papers, is to use the formula $\sqrt{\frac{2}{n^{[l-1]}+n^{[l]}}}$, which has some other theoretical justification. But I would say that if you're using a ReLU activation function, which is really the most common activation function, use $\sqrt{\frac{2}{n^{[l-1]}}}$. If you're using tanh, you could try $\sqrt{\frac{1}{n^{[l-1]}}}$ instead, and some authors will also use $\sqrt{\frac{2}{n^{[l-1]}}}$. In practice, all of these formulas just give you a starting point: a default value for the variance of the initialization of your weight matrices. If you wish, this variance parameter could be another hyperparameter you tune: you could put another multiplier in front of this formula and tune that multiplier as part of your hyperparameter search.
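As a short sketch of these initialization rules (my own helper, not course code), using He initialization for ReLU layers and Xavier initialization for tanh layers:

```python
import numpy as np

def initialize_parameters(layer_dims, activation="relu", seed=0):
    """Initialization sketch: He init, sqrt(2/n^[l-1]), for ReLU layers and
    Xavier init, sqrt(1/n^[l-1]), for tanh layers.
    layer_dims = [n_x, n_1, ..., n_L]."""
    rng = np.random.default_rng(seed)
    params = {}
    for l in range(1, len(layer_dims)):
        n_prev, n_curr = layer_dims[l - 1], layer_dims[l]
        scale = np.sqrt(2.0 / n_prev) if activation == "relu" else np.sqrt(1.0 / n_prev)
        params["W" + str(l)] = rng.standard_normal((n_curr, n_prev)) * scale
        params["b" + str(l)] = np.zeros((n_curr, 1))
    return params

params = initialize_parameters([4, 5, 3, 1], activation="relu")
print(params["W1"].std())   # roughly sqrt(2/4) ≈ 0.71, up to sampling noise
```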

Sometimes tuning this hyperparameter has a modest-sized effect. It's not one of the first hyperparameters I would try to tune, though I have seen problems where tuning it helps a reasonable amount; it's usually lower down for me in terms of importance relative to the other hyperparameters you can tune.

So I hope that gives you some intuition about the problem of vanishing and exploding gradients, as well as how to choose a reasonable scaling for initializing the weights. Hopefully that keeps your weights from exploding too quickly or decaying to zero too quickly, so you can train a reasonably deep network without the weights or the gradients exploding or vanishing too much. When you train deep networks, this is another trick that will help your neural networks train much more quickly.

04_numerical-approximation-of-gradients

When you implement back propagation you'll find that there's a test called gradient checking that can really help you make sure that your implementation of backprop is correct, because sometimes you write all these equations and you're just not 100% sure you've got all the details of back propagation right. So in order to build up to gradient checking, let's first talk about how to numerically approximate gradients, and in the next video we'll talk about how to implement gradient checking to make sure your implementation of backprop is correct.

one sided difference


So let's take the function f and replot it here. Remember, this is f of theta equals theta cubed, and let's again start at some value of theta, say theta equals 1. Now, instead of just nudging theta to the right to get theta plus epsilon, we're going to nudge it both to the right and to the left, to get theta minus epsilon as well as theta plus epsilon. So this is 1, this is 1.01, this is 0.99, where again epsilon is the same as before, 0.01. It turns out that rather than taking the small triangle and computing its height over its width, you get a much better estimate of the gradient if you take the point f of theta minus epsilon and the point f of theta plus epsilon and compute the height over width of this bigger triangle. For technical reasons which I won't go into, the height over width of this bigger green triangle gives you a much better approximation to the derivative at theta. You can see it as if you had two triangles, one in the upper right and one in the lower left, and you're taking both of them into account by using the bigger green triangle.

two sided difference


So rather than a one-sided difference, you're taking a two-sided difference. Let's work out the math. This point here is f of theta plus epsilon, and this point here is f of theta minus epsilon, so the height of the big green triangle is f of theta plus epsilon minus f of theta minus epsilon. The width goes from theta minus epsilon to theta plus epsilon, so it is 2 epsilon. The height over the width is therefore $\frac{f(\theta+\epsilon)-f(\theta-\epsilon)}{2\epsilon}$, and this should hopefully be close to g of theta. Plugging in the values, remember f of theta is theta cubed, so this is $(1.01^3 - 0.99^3)/(2 \times 0.01)$. Feel free to pause the video and check on a calculator: you should get 3.0001. From the previous slide we saw that g of theta is 3 theta squared, so at theta equals 1 this is 3, and these two values are very close to each other: the approximation error is now 0.0001. Whereas on the previous slide, taking the one-sided difference $(f(\theta+\epsilon)-f(\theta))/\epsilon$, we got 3.0301, so the approximation error was 0.0301 rather than 0.0001. With this two-sided way of approximating the derivative, you find the estimate is extremely close to 3, which gives you much greater confidence that g of theta is probably a correct implementation of the derivative of f. When you use this method for gradient checking and back propagation, it turns out to run twice as slow as the one-sided difference; in practice I think it's worth it, because it's just much more accurate. A little bit of optional theory for those of you who are more familiar with calculus, and it's okay if you don't follow this: for a differentiable f, the derivative equals the limit of $\frac{f(\theta+\epsilon)-f(\theta-\epsilon)}{2\epsilon}$ as epsilon goes to 0. (The definition of a limit is something you'd learn in a calculus class, but I won't go into it here.) It turns out that for a non-zero value of epsilon, the error of this two-sided approximation, $f'(\theta) \approx \frac{f(\theta+\epsilon)-f(\theta-\epsilon)}{2\epsilon}$, is on the order of epsilon squared, $O(\epsilon^2)$, and remember epsilon is a very small number. So if epsilon is 0.01, as it is here, then epsilon squared is 0.0001. The big-O notation means the error is some constant times epsilon squared; here the constant happens to be 1, which is exactly our approximation error. Whereas if we were to use the one-sided formula, $f'(\theta) \approx \frac{f(\theta+\epsilon)-f(\theta)}{\epsilon}$, the error is on the order of epsilon, $O(\epsilon)$.

And again, when epsilon is a number less than 1, epsilon is actually much bigger than epsilon squared, which is why the one-sided formula is a much less accurate approximation than the two-sided one. That's why, when doing gradient checking, we'd rather use the two-sided difference, $f'(\theta) \approx \frac{f(\theta+\epsilon)-f(\theta-\epsilon)}{2\epsilon}$, rather than the one-sided difference, $f'(\theta) \approx \frac{f(\theta+\epsilon)-f(\theta)}{\epsilon}$, which is less accurate.

If you didn't understand my last two comments, don't worry about it; that's really more for those of you who are a bit more familiar with calculus and numerical approximation. The takeaway is that the two-sided difference formula is much more accurate.
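As a quick numerical check of these two claims (a small Python snippet, not from the course):

```python
import numpy as np

f = lambda theta: theta ** 3       # f(theta) = theta^3
g = lambda theta: 3 * theta ** 2   # analytic derivative f'(theta)

theta, eps = 1.0, 0.01
two_sided = (f(theta + eps) - f(theta - eps)) / (2 * eps)   # 3.0001
one_sided = (f(theta + eps) - f(theta)) / eps               # 3.0301

print(abs(two_sided - g(theta)))   # ~1e-4: error is O(eps^2)
print(abs(one_sided - g(theta)))   # ~3e-2: error is O(eps)
```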

And so that's what we're going to use when we do gradient checking in the next video. So you've seen how, by taking a two-sided difference, you can numerically verify whether a function g of theta that someone gives you is a correct implementation of the derivative of a function f. Let's now see how we can use this to verify whether your back propagation implementation is correct, or whether there might be a bug in there that you need to go and tease out.

05_gradient-checking

Gradient checking is a technique that's helped me save tons of time and find bugs in my implementations of back propagation many times. Let's see how you can use it too, to debug or to verify that your implementation of backprop is correct.

So your neural network has some set of parameters, W[1], b[1] and so on up to W[L], b[L]. To implement gradient checking, the first thing you do is take all your parameters and reshape them into a giant vector theta: take each W, which is a matrix, and reshape it into a vector, then concatenate all of them so that you have one giant vector theta. Instead of the cost function J being a function of the Ws and bs, it is now a function of theta. Next, with W and b ordered the same way, you take dW[1], db[1] and so on and concatenate them into a giant vector d theta of the same dimension as theta. As before, you reshape dW[1] (a matrix) into a vector, db[1] is already a vector, and so on for all the dW's, which are matrices. Remember, dW[1] has the same dimension as W[1] and db[1] has the same dimension as b[1], so the same reshaping and concatenation operation turns all of these derivatives into a giant vector d theta, which has the same dimension as theta.

So the question is now: is d theta the gradient, the slope, of the cost function J? Here's how you implement gradient checking, often abbreviated to grad check. First, remember that J is now a function of the giant parameter vector theta, so it expands to J(theta 1, theta 2, theta 3, ...), for however many components the giant parameter vector theta has.


To implement grad check, you implement a loop so that for each i, i.e. for each component of theta, you compute d theta approx i using a two-sided difference: take J of (theta 1, theta 2, up to theta i plus epsilon, ...), i.e. nudge theta i up by epsilon while keeping everything else the same, and, because it's a two-sided difference, do the same on the other side with theta i minus epsilon, leaving all the other elements of theta alone. Then take the difference and divide by 2 epsilon: $d\theta_{approx}[i] = \frac{J(\theta_1,\ldots,\theta_i+\epsilon,\ldots) - J(\theta_1,\ldots,\theta_i-\epsilon,\ldots)}{2\epsilon}$. From the previous video, this should be approximately equal to d theta i, which is supposed to be the partial derivative of J with respect to theta i, if d theta is indeed the gradient of the cost function J. You compute this for every value of i, and at the end you have two vectors: d theta approx, which has the same dimension as d theta, and both of these have the same dimension as theta. Then you check whether these two vectors, $d\theta_{approx}$ and $d\theta$, are approximately equal to each other.

So, in detail, how do you define whether two vectors are reasonably close to each other? What I do is the following: compute the Euclidean distance between these two vectors, $\|d\theta_{approx} - d\theta\|_2$, i.e. the sum of the squares of the element-wise differences, followed by a square root (notice there's no square on the norm). Then, to normalize by the lengths of these vectors, divide by $\|d\theta_{approx}\|_2 + \|d\theta\|_2$, the sum of the Euclidean lengths of the two vectors. The role of the denominator is that, in case either of these vectors is really small or really large, it turns this formula into a ratio: $\frac{\|d\theta_{approx} - d\theta\|_2}{\|d\theta_{approx}\|_2 + \|d\theta\|_2}$.

When implementing this in practice, I use epsilon equal to 10^-7. With this value of epsilon, if you find that this formula gives you a value like 10^-7 or smaller, that's great: it means your derivative approximation is very likely correct; that's just a very small value. If it's on the order of 10^-5, I would take a careful look. Maybe this is okay, but I would double-check the components of the vector and make sure that none of them is too large; if some components of the difference are very large, then maybe you have a bug somewhere. And if the formula gives something on the order of 10^-3, then I would be much more concerned that there's a bug somewhere; you should really be getting values much smaller than 10^-3. If it's bigger than 10^-3, I'd be seriously worried that there might be a bug, and I would then look at the individual components of theta to see if there's a specific value of i for which d theta approx i is very different from d theta i, and use that to try to track down whether some of your derivative computations might be incorrect. After some amount of debugging, if it finally ends up being a very small value, then you probably do have a correct implementation.
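Putting the loop and the ratio check together, a minimal grad-check sketch might look like this (my own helper names; `J` and `dtheta` stand in for your cost function over the flattened parameters and your backprop gradient):

```python
import numpy as np

def grad_check(J, theta, dtheta, eps=1e-7):
    """Gradient-checking sketch. J(theta) -> float is the cost as a function of
    the flattened parameter vector theta; dtheta is the backprop gradient,
    flattened in the same order."""
    dtheta_approx = np.zeros_like(theta)
    for i in range(theta.size):
        theta_plus, theta_minus = theta.copy(), theta.copy()
        theta_plus[i] += eps
        theta_minus[i] -= eps
        dtheta_approx[i] = (J(theta_plus) - J(theta_minus)) / (2 * eps)  # two-sided difference
    numerator = np.linalg.norm(dtheta_approx - dtheta)
    denominator = np.linalg.norm(dtheta_approx) + np.linalg.norm(dtheta)
    return numerator / denominator   # ~1e-7 great, ~1e-5 inspect, >=1e-3 likely a bug

# Toy check: J(theta) = 0.5 * ||theta||^2 has gradient exactly theta.
theta = np.random.randn(5)
print(grad_check(lambda t: 0.5 * np.sum(t ** 2), theta, theta))   # tiny (near machine precision)
```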

So when implementing a neural network, what often happens is I'll implement forward prop and backprop, and then I might find that grad check gives a relatively big value. Then I'd suspect there must be a bug, go in and debug, debug, debug, and after debugging for a while, if I find that it passes grad check with a small value, I can be much more confident that the implementation is correct.

So you now know how gradient checking works. This has helped me find lots of bugs in my implementations of neural nets, and I hope it’ll help you too. In the next video, I want to share with you some tips or some notes on how to actually implement gradient checking. Let’s go onto the next video.

06_gradient-checking-implementation-notes

In the last video you learned about gradient checking. In this video, I want to share with you some practical tips or some notes on how to actually go about implementing this for your neural network.

First, don't use grad check in training; use it only to debug. What I mean is that computing d theta approx i for all the values of i is a very slow computation, so to implement gradient descent you'd just use backprop to compute d theta, and it's only when you're debugging that you'd compute d theta approx to make sure it's close to d theta. Once you've done that, turn grad check off; don't run it during every iteration of gradient descent, because that's just much too slow.

Second, if an algorithm fails grad check, look at the individual components and try to identify the bug. What I mean is: if d theta approx is very far from d theta, look at the different values of i to see for which ones d theta approx i is really very different from d theta i. For example, if you find that the components that are very far off all correspond to db[l] for some layer or layers, while the components for dW are quite close, then, remembering that different components of theta correspond to different components of b and W, maybe the bug is in how you're computing db, the derivative with respect to the parameters b. And similarly, vice versa: if you find that the components of d theta approx that are very far from d theta all came from dW, or from dW in a certain layer, that might help you hone in on the location of the bug. This doesn't always let you identify the bug right away, but sometimes it gives you some guesses about where to track it down.

Next, when doing grad check, remember your regularization term if you're using regularization. So if your cost function is $J(\theta) = \frac{1}{m}\sum_{i=1}^{m}\mathcal{L}(\hat y^{(i)}, y^{(i)}) + \frac{\lambda}{2m}\sum_{l}\|W^{[l]}\|_F^2$, then this is the definition of J, and d theta should be the gradient of J with respect to theta including this regularization term. So just remember to include that term.

Next, grad check doesn't work with dropout, because on every iteration dropout randomly eliminates a different subset of the hidden units, and there isn't an easy-to-compute cost function J that dropout is doing gradient descent on. It turns out that dropout can be viewed as optimizing some cost function J, but it's a cost function defined by summing over all the exponentially many subsets of nodes that could be eliminated on any iteration, so it's very difficult to compute; you're just sampling the cost function each time you eliminate a different random subset when you use dropout. So it's difficult to use grad check to double-check your computation with dropout. What I usually do is implement grad check without dropout: if you want, you can set keep_prob equal to 1.0, run grad check, and then turn dropout back on and hope that the implementation of dropout was correct. There are other things you could do, like fixing the pattern of nodes that get dropped and verifying that grad check is correct for that fixed pattern, but in practice I don't usually do that. My recommendation is: turn off dropout, use grad check to double-check that your algorithm is at least correct without dropout, and then turn dropout back on.

Finally, this is a subtlety. It rarely happens, but it's not impossible that your implementation of gradient descent is correct when w and b are close to 0, at random initialization, but that as you run gradient descent and w and b become bigger, your implementation of backprop becomes more and more inaccurate. So one thing you could do, though I don't do this very often, is run grad check at random initialization, then train the network for a while so that w and b have some time to wander away from 0, from your small random initial values, and then run grad check again after you've trained for some number of iterations. So that's it for gradient checking.

conclusion of this week

And congratulations on getting to the end of this week's materials. This week you've learned how to set up your train, dev, and test sets, how to analyze bias and variance, and what to do if you have high bias, high variance, or both. You also saw how to apply different forms of regularization, like L2 regularization and dropout, to your neural network, some tricks for speeding up its training, and finally gradient checking. You've seen a lot this week, and you'll get to exercise many of these ideas in this week's programming exercise. Best of luck with that, and I look forward to seeing you in the week two materials.
