17_large-scale-machine-learning note17

01_gradient-descent-with-large-datasets

01_learning-with-large-datasets

In the next few videos, we'll talk about large-scale machine learning, that is, algorithms for dealing with big data sets. If you look back at the recent 5- or 10-year history of machine learning, one of the reasons that learning algorithms work so much better now than even, say, 5 years ago is just the sheer amount of data that we now have and can train our algorithms on. In these next few videos, we'll talk about algorithms for dealing with such massive data sets.

machine learning and data

So why do we want to use such large data sets? We've already seen that one of the best ways to get a high-performance machine learning system is to take a low-bias learning algorithm and train it on a lot of data. One early example we saw was classifying between confusable words, as in "For breakfast, I ate two (TWO) eggs," and in that example, so long as you feed the algorithm a lot of data, it seems to do very well. It's results like these that have led to the saying in machine learning that often it's not who has the best algorithm that wins, it's who has the most data. So we want to learn from large data sets, at least when we can get them.

learning with large datasets

But learning with large data sets comes with its own unique problems, specifically, computational problems. Let's say your training set size is m = 100,000,000, which is actually pretty realistic for many modern data sets. If you look at the US Census data set, with roughly 300 million people in the US, you can usually get hundreds of millions of records; if you look at the amount of traffic that popular websites get, you easily get training sets that are much larger than hundreds of millions of examples. Now say you want to train a linear regression model, or maybe a logistic regression model, with the usual gradient descent rule. If you look at what you need to do to compute the gradient, then when m is a hundred million, you need to carry out a summation over a hundred million terms in order to compute these derivative terms and perform a single step of descent. Because of the computational expense of summing over a hundred million entries just to take one step of gradient descent, in the next few videos we'll talk about techniques for either replacing this with something else or finding more efficient ways to compute this derivative. By the end of this sequence of videos on large-scale machine learning, you'll know how to fit models, linear regression, logistic regression, neural networks and so on, even to data sets with, say, a hundred million examples.

Of course, before we put in the effort to train a model with a hundred million examples, we should also ask ourselves: why not use just a thousand examples? We could randomly pick a subset of a thousand examples out of the hundred million and train our algorithm on just those. So before investing the effort into developing the software needed to train these massive models, it is often a good sanity check to ask whether training on just a thousand examples might do just as well. The way to sanity-check whether a much smaller training set of size m = 1,000 might do just as well is the usual method of plotting learning curves. If you plot the training objective J_train(theta) and the cross-validation objective J_cv(theta) and the curves show a persistent gap, then this looks like a high-variance learning algorithm, and we can be more confident that adding extra training examples would improve performance. Whereas in contrast, if the two curves converge to a high plateau, then this looks like the classical high-bias learning algorithm; and in the latter case, if you were to plot this up to, say, m = 1,000, it seems unlikely that increasing m to a hundred million will do much better, and you'd be just fine sticking with m = 1,000 rather than investing a lot of effort to figure out how to scale up the algorithm.
Of course, if you were in the high-bias situation shown by the figure on the right, then one natural thing to do would be to add extra features, or extra hidden units to your neural network, and so on, so that you end up in a situation closer to the high-variance one on the left, where the curves still have a gap at m = 1,000. That then gives you more confidence that adding the infrastructure to scale the algorithm to much more than a thousand examples would actually be a good use of your time. A sketch of this sanity check is below.
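As a concrete illustration, here is a minimal Python sketch of the learning-curve sanity check. The helpers `train_model` and `compute_cost` are hypothetical stand-ins for whatever training routine and cost function you are actually using:

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical helpers: train_model fits the model (e.g. linear regression)
# and returns parameters theta; compute_cost evaluates the cost J.
from my_model import train_model, compute_cost  # assumed to exist

def learning_curve_check(X, y, X_cv, y_cv, max_m=1000, step=50):
    """Plot J_train and J_cv for training-set sizes up to max_m."""
    sizes = list(range(step, max_m + 1, step))
    j_train, j_cv = [], []
    for m in sizes:
        theta = train_model(X[:m], y[:m])   # train on the first m examples only
        j_train.append(compute_cost(theta, X[:m], y[:m]))
        j_cv.append(compute_cost(theta, X_cv, y_cv))
    plt.plot(sizes, j_train, label='J_train')
    plt.plot(sizes, j_cv, label='J_cv')
    plt.xlabel('m (training-set size)')
    plt.ylabel('cost')
    plt.legend()
    plt.show()
```

A persistent gap between the two curves at m = 1,000 suggests high variance, so scaling up to the full data set is likely worth it; two curves converging to a high plateau suggest high bias, in which case more data alone will not help much.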

So in large-scale machine learning, we like to come up with computationally reasonable, or computationally efficient, ways to deal with very big data sets. In the next few videos, we'll see two main ideas: the first is called stochastic gradient descent and the second is called MapReduce, both for dealing with very big data sets. After you've learned about these methods, hopefully they will allow you to scale up your learning algorithms to big data and get much better performance on many different applications.

02_stochastic-gradient-descent

For many learning algorithms, among them linear regression, logistic regression, and neural networks, the way we derived the algorithm was by coming up with a cost function, or an optimization objective, and then using an algorithm like gradient descent to minimize it. When we have a very large training set, gradient descent becomes a computationally very expensive procedure. In this video, we'll talk about a modification to the basic gradient descent algorithm called stochastic gradient descent, which will allow us to scale these algorithms to much bigger training sets.

linear regression with gradient descent

Suppose you are training a linear regression model using gradient descent. As a quick recap, the hypothesis is $h_\theta(x) = \sum_j \theta_j x_j$, and the cost function is $J_{train}(\theta) = \frac{1}{2m}\sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)})^2$, one half of the average squared error of your hypothesis on your m training examples; plotted as a function of the parameters theta 0 and theta 1, the cost function J is a sort of bow-shaped function. Gradient descent then repeatedly updates the parameters theta in its inner loop using the familiar update rule. Now, in the rest of this video, I'm going to keep using linear regression as the running example, but the ideas of stochastic gradient descent are fully general and also apply to other learning algorithms, like logistic regression, neural networks, and other algorithms that are based on running gradient descent on a specific training set.

linear regression with gradient descent

So here's a picture of what gradient descent does: if the parameters are initialized at some point, then successive iterations of gradient descent will take the parameters along a trajectory that heads pretty directly to the global minimum. Now, the problem with gradient descent is that if m is large, then computing this derivative term can be very expensive, because it requires summing over all m examples. So suppose m is 300 million; in the United States there are about 300 million people, and so US Census data may have on the order of that many records. If you want to fit a linear regression model to that, then you need to sum over 300 million records, and that's very expensive.

To give the algorithm a name, this particular version of gradient descent is also called batch gradient descent. The term "batch" refers to the fact that we're looking at all of the training examples at a time: we call it sort of a batch of all of the training examples. It really isn't the best name, but this is what machine learning people call this particular version of gradient descent. And if you imagine that you really have 300 million census records stored away on disk, the way this algorithm works is that you need to read all 300 million records into your computer in order to compute this derivative term. You need to stream all of these records through your computer, because you can't store them all in memory, and slowly accumulate the sum in order to compute the derivative. Having done all that work, you can take one step of gradient descent. And then you need to do the whole thing again: scan through all 300 million records, accumulate the sums, and take another little step; then do that again for a third step, and so on. So it's going to take a long time for the algorithm to converge. A minimal sketch of batch gradient descent is below.
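To make the cost concrete, here is a minimal NumPy sketch of batch gradient descent for linear regression; variable names are just illustrative, and X is assumed to already include the bias column of ones:

```python
import numpy as np

def batch_gradient_descent(X, y, alpha=0.01, num_iters=100):
    """Batch gradient descent for linear regression.
    X: (m, n+1) design matrix (first column all ones); y: (m,) targets."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(num_iters):
        # Every single parameter update requires a full pass over all m examples.
        grad = (X @ theta - y) @ X / m   # (1/m) * sum_i (h(x(i)) - y(i)) * x(i)
        theta -= alpha * grad
    return theta
```

With m = 300 million, that one `X @ theta` line hides a sum over every record in the data set, and it has to run once per gradient step.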

batch gradient descent vs stochastic gradient descent

In contrast to batch gradient descent, what we are going to do is come up with a different algorithm that doesn't need to look at all the training examples in every single iteration, but needs to look at only a single training example per iteration. Before moving on to the new algorithm, here's the batch gradient descent algorithm written out again, with J_train(theta) being the cost function and the usual update; the term used in the gradient descent rule is the partial derivative, with respect to the parameter theta_j, of our optimization objective J_train(theta).

Now, let's look at the more efficient algorithm that scales better to large data sets. To work up to the algorithm called stochastic gradient descent, let's write the cost function in a slightly different way: define the cost of the parameters theta with respect to a training example (x(i), y(i)) to be one half times the squared error that the hypothesis incurs on that example. So this cost term measures how well the hypothesis is doing on a single example (x(i), y(i)). You'll notice that the overall cost function J_train can now be written in an equivalent form: J_train is just the average, over the m training examples, of the cost of the hypothesis on each example (x(i), y(i)).

Armed with this view of the cost function for linear regression, let me now write out what stochastic gradient descent does. The first step is to randomly shuffle the data set; by that I just mean randomly reorder your m training examples. It's a sort of standard pre-processing step; we'll come back to it in a minute. The main work of stochastic gradient descent is then done in the following loop: repeat for i = 1 through m, so we repeatedly scan through the training examples, and perform the update theta_j := theta_j - alpha (h_theta(x(i)) - y(i)) x(i)_j, as usual for all values of j. You'll notice that this term is exactly what we had inside the summation for batch gradient descent. In fact, for those of you who know calculus, it's possible to show that this term is the partial derivative, with respect to the parameter theta_j, of the cost of the parameters theta on (x(i), y(i)), where cost is the quantity defined previously. And just to wrap up the algorithm, let me close the curly braces.

So what stochastic gradient descent is doing is scanning through the training examples. First it looks at the first training example (x(1), y(1)) and, looking at only this first example, takes basically a little gradient descent step with respect to the cost of just this one example. In other words, we look at the first example and modify the parameters a little bit to fit just the first training example a little better. Having done this, the inner for-loop then goes on to the second training example, and there it takes another little step in parameter space, modifying the parameters just a little to fit the second training example a little better. Having done that, it goes on to the third training example.
And it modifies the parameters to try to fit just the third training example a little bit better, and so on, until it gets through the entire training set. The outer repeat loop may then cause it to take multiple passes over the entire training set. This view of stochastic gradient descent also motivates why we wanted to start by randomly shuffling the data set: it ensures that when we scan through the training set, we visit the training examples in some sort of randomly sorted order. Depending on whether your data already came randomly sorted or originally came sorted in some strange order, in practice this just speeds up the convergence of stochastic gradient descent a little bit. So in the interest of safety, it's usually better to randomly shuffle the data set if you aren't sure whether it came to you in randomly sorted order.

But more importantly, another view of stochastic gradient descent is that it's a lot like batch gradient descent, except that rather than waiting to sum up the gradient terms over all m training examples, we take the gradient term from just one single training example and start making progress in improving the parameters already. So rather than waiting until we've taken a full pass through all 300 million United States Census records, say, before we can modify the parameters at all, with stochastic gradient descent we just need to look at a single training example, and we're already starting to move the parameters towards the global minimum.

So, here's the algorithm written out again, where the first step is to randomly shuffle the data and the second step is where the real work is done: the update to theta with respect to a single training example (x(i), y(i)). Below is a minimal sketch of what this looks like in code.
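Here is a minimal sketch under the same assumptions as the batch version above (X includes the bias column); in this sketch the shuffle is simply redone on each pass, which also works:

```python
import numpy as np

def stochastic_gradient_descent(X, y, alpha=0.01, num_passes=1):
    """Stochastic gradient descent for linear regression:
    update theta after every single example."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(num_passes):            # typically 1 to 10 passes
        for i in np.random.permutation(m): # visit examples in random order
            err = X[i] @ theta - y[i]      # h_theta(x(i)) - y(i)
            theta -= alpha * err * X[i]    # step using just this one example
    return theta
```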

stochastic gradient descent

So, let's see what this algorithm does to the parameters. Previously, we saw that batch gradient descent, the algorithm that looks at all the training examples at a time, will tend to take a reasonably straight-line trajectory to the global minimum. In contrast, with stochastic gradient descent every iteration is much faster, because we don't need to sum over all the training examples; but every iteration is just trying to fit a single training example better. So if we were to start stochastic gradient descent at some point, the first iteration may take the parameters in one direction; maybe in the second iteration, looking at just the second example, just by chance we get a bit unlucky and head in a bad direction; in the third iteration, where we try to modify the parameters to fit just the third training example better, maybe we end up heading in yet another direction; and then the fourth example, the fifth, the sixth, the seventh, and so on. As you run stochastic gradient descent, what you find is that it will generally move the parameters in the direction of the global minimum, but not always, and so it takes a somewhat random-looking, circuitous path towards the global minimum.

In fact, as you run stochastic gradient descent, it doesn't actually converge in the same sense as batch gradient descent does; what it ends up doing is wandering around continuously in some region close to the global minimum, rather than reaching the global minimum and staying there. In practice this isn't a problem, because so long as the parameters end up in some region pretty close to the global minimum, that will be a pretty good hypothesis; so usually running stochastic gradient descent gets us a parameter near the global minimum, and that's good enough for most practical purposes.

Just one final detail. In stochastic gradient descent, we had this outer loop that says to repeat the inner loop multiple times. So how many times do we repeat it? Depending on the size of the training set, doing this loop just a single time may be enough, and up to maybe 10 times may be typical, so we may end up repeating this inner loop anywhere from once to ten times. If we have a truly massive data set, like the US Census example I've been talking about with 300 million examples, so the inner loop runs for i = 1 through 300 million, it's possible that by the time you've taken a single pass through your data set you might already have a perfectly good hypothesis; in which case you might need to run the inner loop only once, if m is very, very large. But in general, taking anywhere from 1 to 10 passes through your data set is fairly common; it really depends on the size of your training set. Contrast this with batch gradient descent: with batch gradient descent, after taking a pass through your entire training set, you would have taken just one single gradient descent step,
one of those little baby steps of gradient descent, one small step, and this is why stochastic gradient descent can be much faster.

So, that was the stochastic gradient descent algorithm. If you implement it, hopefully it will allow you to scale up many of your learning algorithms to much bigger data sets and get much better performance that way.

03_mini-batch-gradient-descent

In the previous video, we talked about stochastic gradient descent and how it can be much faster than batch gradient descent. In this video, let's talk about another variation on these ideas called mini-batch gradient descent, which can sometimes work even a bit faster than stochastic gradient descent.

mini-batch gradient descent

To summarize the algorithms we've talked about so far: in batch gradient descent we use all m examples in each iteration, whereas in stochastic gradient descent we use a single example in each iteration. What mini-batch gradient descent does is somewhere in between. Specifically, with this algorithm we use b examples in each iteration, where b is a parameter called the "mini-batch size"; so the idea is that this is somewhere in between batch gradient descent and stochastic gradient descent. This is just like batch gradient descent, except that I'm going to use a much smaller batch size. A typical choice for the value of b might be b = 10, and a typical range might be anywhere from b = 2 up to b = 100; so that's a pretty typical range of values for the mini-batch size. The idea is that rather than using one example at a time or m examples at a time, we use b examples at a time. So let me just write this out informally: we're going to get the next b examples from the training set, say b = 10, so that may be some set of examples starting at (x(i), y(i)), and if it's 10 examples, the indexing runs up to (x(i+9), y(i+9)); that's 10 examples altogether. Then we perform essentially a gradient descent update using these 10 examples: theta_j gets updated as theta_j minus the learning rate alpha times one tenth times the sum over k = i through i+9 of (h_theta(x(k)) - y(k)) times x(k)_j. So in this expression we're summing the gradient terms over the ten examples; the ten is the mini-batch size, and the 9 in i+9 again comes from the choice of the parameter b. After this we increase i by ten, go on to the next ten examples, and keep moving like this. So let me just write out the entire algorithm in full.

Mini-batch gradient descent algorithm

In order to simplify the indexing, for this slide I'm going to assume we have a mini-batch size of ten and a training set size of a thousand. What we do is loop for i = 1, 11, 21, 31, and so on, stepping in increments of 10 because we look at 10 examples at a time, and then perform this sort of gradient descent update using ten examples at a time; the 10 and the i+9 are a consequence of having chosen the mini-batch size to be ten. This outer for-loop ends at 991, because if I have 1,000 training examples, then I need 100 steps of size 10 to get through the training set. So this is mini-batch gradient descent.

Compared to batch gradient descent, this also allows us to make progress much faster. Taking again our running example of US Census data with 300 million training examples, what we're saying is that after looking at just the first 10 examples we can start to make progress in improving the parameters theta; we don't need to scan through the entire training set. We just look at the first 10 examples, this lets us start making progress, then we look at the second ten examples and modify the parameters a little bit again, and so on. That is why mini-batch gradient descent can be faster than batch gradient descent: you can start making progress in modifying the parameters after looking at just ten examples, rather than needing to wait until you've scanned through every single one of the 300 million training examples.

So how about mini-batch gradient descent versus stochastic gradient descent; why do we want to look at b examples at a time rather than just a single example at a time, as in stochastic gradient descent? The answer is vectorization. In particular, mini-batch gradient descent is likely to outperform stochastic gradient descent only if you have a good vectorized implementation. In that case, the sum over 10 examples can be performed in a more vectorized way, which allows you to partially parallelize your computation over the ten examples. In other words, by using appropriate vectorization to compute the gradient terms, you can sometimes use good numerical linear algebra libraries and parallelize your gradient computations over the b examples; whereas if you were looking at just a single example at a time with stochastic gradient descent, there isn't much, or at least less, to parallelize over. One disadvantage of mini-batch gradient descent is that there is now this extra parameter b, the mini-batch size, which you may have to fiddle with and which may therefore take time. But if you have a good vectorized implementation, this can sometimes run even faster than stochastic gradient descent. A vectorized sketch is below.
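Here is a sketch of mini-batch gradient descent with the inner update vectorized over the b examples; again the names are illustrative, and X is assumed to include the bias column:

```python
import numpy as np

def minibatch_gradient_descent(X, y, alpha=0.01, b=10, num_passes=1):
    """Mini-batch gradient descent with mini-batch size b."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(num_passes):
        for i in range(0, m, b):                 # i steps by b: 0, 10, 20, ...
            Xb, yb = X[i:i + b], y[i:i + b]      # the next b examples
            # One vectorized update over the whole mini-batch; this matrix
            # product is where a good linear algebra library can parallelize.
            grad = (Xb @ theta - yb) @ Xb / len(yb)
            theta -= alpha * grad
    return theta
```

(The code indexes from 0 in the usual programming convention, where the lecture's indexing runs i = 1, 11, 21, ….)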

So that was mini-batch gradient descent, an algorithm that in some sense does something in between what stochastic gradient descent does and what batch gradient descent does. If you choose a reasonable value of b (I usually use b = 10, but other values anywhere from, say, 2 to 100 would be reasonably common) and you use a good vectorized implementation, it can sometimes be faster than both stochastic gradient descent and batch gradient descent.

04_stochastic-gradient-descent-convergence

You now know about the stochastic gradient descent algorithm. But when you're running the algorithm, how do you make sure that it's completely debugged and is converging okay? Equally important, how do you tune the learning rate alpha with stochastic gradient descent? In this video we'll talk about some techniques for doing these things: for making sure it's converging and for picking the learning rate alpha.

checking for convergence of SGD

Back when we were using batch gradient descent, our standard way of making sure that gradient descent was converging was to plot the optimization cost function as a function of the number of iterations and make sure that this cost function is decreasing on every iteration. When the training set size was small, we could do that, because we could compute the sum pretty efficiently. But when you have a massive training set, you don't want to have to pause stochastic gradient descent periodically in order to compute this cost function, since that requires a sum over your entire training set. The whole point of stochastic gradient descent was to start making progress after looking at just a single example, without needing to occasionally scan through the entire training set right in the middle of the algorithm, just to compute things like the cost function on the entire training set.

So for stochastic gradient descent, in order to check that the algorithm is converging, here's what we can do instead. Take the definition of cost that we had previously: the cost of the parameters theta with respect to a single training example is just one half of the squared error on that training example. Then, while stochastic gradient descent is learning, right before we train on a specific example (x(i), y(i)), that is, right before we update theta using that example, let's compute the cost of that example. To say the same thing in slightly different words: as stochastic gradient descent is scanning through the training set, right before we update theta using a specific training example (x(i), y(i)), let's compute how well the hypothesis is doing on that example. We want to do this before updating theta, because if we've just updated theta using that example, the hypothesis might be doing better on it than would be representative.

Finally, in order to check for convergence of stochastic gradient descent, what we can do is, every thousand iterations or so, plot these costs that we've been computing in the previous step, averaged over, say, the last thousand examples processed by the algorithm. If you do this, it gives you a running estimate of how well the algorithm is doing on the last 1,000 training examples it has seen. So, in contrast to computing J_train periodically, which requires scanning through the entire training set, with this procedure it doesn't cost much to compute these costs as part of stochastic gradient descent, right before updating the parameters theta. All we do is average the last 1,000 costs every thousand iterations or so and plot that, and by looking at those plots, we can check whether stochastic gradient descent is converging. A code sketch is below.
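Here is a minimal sketch of this monitoring idea, recording each example's cost just before the parameters are updated with it:

```python
import numpy as np
import matplotlib.pyplot as plt

def sgd_with_monitoring(X, y, alpha=0.01, window=1000):
    """SGD that plots the cost averaged over every `window` examples."""
    m, n = X.shape
    theta = np.zeros(n)
    recent, averages = [], []
    for i in np.random.permutation(m):
        err = X[i] @ theta - y[i]
        recent.append(0.5 * err ** 2)   # cost(theta, (x(i), y(i))) BEFORE updating
        theta -= alpha * err * X[i]     # now take the SGD step
        if len(recent) == window:
            averages.append(np.mean(recent))
            recent = []
    plt.plot(averages)
    plt.xlabel('groups of %d examples' % window)
    plt.ylabel('average cost')
    plt.show()
    return theta
```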

checking for convergence of SGD

So here are a few examples of what these plots might look like. Suppose you have plotted the cost averaged over the last thousand examples. Because these are averages over just a thousand examples, they are going to be a little bit noisy, and so the plot may not decrease on every single iteration. If you get a figure where the cost goes down and then plateaus, flattening out from around some point, that would be a pretty decent run of the algorithm: if this is what your cost looks like, then maybe your learning algorithm has converged. If you then try a smaller learning rate, something you might see is that the algorithm initially learns more slowly, so the cost goes down more slowly; but eventually, with the smaller learning rate, it is possible for the algorithm to end up at a very slightly better solution. So the red line might represent the behavior of stochastic gradient descent using a smaller learning rate. The reason this happens is that, remember, stochastic gradient descent doesn't just converge to the global minimum; instead, the parameters oscillate a bit around the global minimum, and by using a smaller learning rate, you end up with smaller oscillations. Sometimes this little difference is negligible, and sometimes, with a smaller learning rate, you can get a slightly better value for the parameters.

checking for convergence of SGD
Here are some other things that might happen. Let's say you run stochastic gradient descent and average over a thousand examples when plotting these costs. Here might be the result of another one of these plots, and again it kind of looks like it has converged. If you were to take this number, a thousand, and increase it to averaging over 5,000 examples, then it's possible that you might get a smoother curve; that's the effect of increasing the number of examples you average over. The disadvantage of making this too big, of course, is that now you get one data point only every 5,000 examples, and so the feedback you get on how well your learning algorithm is doing is more delayed, because you get one data point on your plot only every 5,000 examples rather than every 1,000 examples.

checking for convergence of SGD

Along a similar vein, sometimes you may run gradient descent and end up with a plot where it looks like the cost just isn't decreasing at all; it looks like the algorithm just isn't learning, because the curve is flat. But again, if you were to increase the number of examples you average over, it's possible that you'd see something like the red line, where the cost actually is decreasing; it's just that the blue line, averaged over only 1,000 examples, was too noisy for you to see the actual trend, and averaging over 5,000 examples instead of 1,000 may help. Of course, even when you average over a larger number of examples, say 5,000, it's also possible that the learning curve still ends up flat. If you see that, then that's a firmer verification that, unfortunately, the algorithm just isn't learning much for whatever reason, and you need to either change the learning rate, or change the features, or change something else about the algorithm.

checking for convergence of SGD

Finally, one last thing you might see when you plot these curves is a curve that actually looks like it's increasing. If that's the case, then it's a sign that the algorithm is diverging, and what you really should do is use a smaller value of the learning rate alpha.

checking for convergence of SGD

So hopefully this gives you a sense of the range of phenomena you might see when you plot these costs averaged over some range of examples, as well as the sorts of things you might try in response to the different plots. If the plot looks too noisy, or if it wiggles up and down too much, then try increasing the number of examples you're averaging over so you can see the overall trend better. And if you see that the errors, the costs, are actually increasing, try using a smaller value of alpha.

decreasing the learning rate of SGD

Finally, it's worth examining the issue of the learning rate just a little bit more. We saw that when we run stochastic gradient descent, the algorithm will start somewhere and meander towards the minimum, but it won't truly converge; instead, it wanders around the minimum forever. So you end up with a parameter value that is hopefully close to the global minimum, but not exactly at it. In most typical implementations of stochastic gradient descent, the learning rate alpha is held constant, and what you end up with is exactly a picture like this.

If you want stochastic gradient descent to actually converge to the global minimum, there is one thing you can do, which is to slowly decrease the learning rate alpha over time. A pretty typical way of doing that would be to set alpha equal to some constant 1 divided by (iteration number plus constant 2). Here, the iteration number is the number of iterations of stochastic gradient descent you've run, so it's really the number of training examples you've seen, and const1 and const2 are additional parameters of the algorithm that you might have to play with a bit in order to get good performance. One of the reasons people tend not to do this is that you end up needing to spend time playing with these two extra parameters, constant 1 and constant 2, and this makes the algorithm more finicky: it's just more parameters to fiddle with in order to make the algorithm work well. But if you manage to tune the parameters well, the picture you get is that the algorithm actually meanders towards the minimum, and as it gets closer, because the learning rate is decreasing, the meanderings get smaller and smaller until it pretty much converges to the global minimum. The reason this formula makes sense is that as the algorithm runs, the iteration number becomes large, so alpha slowly becomes small, and you take smaller and smaller steps until the algorithm hopefully converges to the global minimum.

So if you do slowly decrease alpha to zero, you can end up with a slightly better hypothesis. But because of the extra work needed to fiddle with the constants, and because frankly we're usually pretty happy with any parameter value that is pretty close to the global minimum, this process of decreasing alpha slowly is typically not done; keeping the learning rate alpha constant is the more common form of stochastic gradient descent, although you will see people use either version. A sketch of the decay schedule is below.
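If you did want to try the decaying learning rate, a minimal sketch of the schedule might look like this, with const1 and const2 as the two extra parameters you would have to tune for your problem:

```python
def decayed_alpha(t, const1=100.0, const2=1000.0):
    """Learning-rate schedule alpha = const1 / (t + const2),
    where t is the number of training examples seen so far."""
    return const1 / (t + const2)

# Inside the SGD loop, one might then write:
#   alpha = decayed_alpha(t)
#   theta -= alpha * err * x_i
#   t += 1
```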

To summarize: in this video we talked about a way of approximately monitoring how stochastic gradient descent is doing in terms of optimizing the cost function. This method does not require scanning over the entire training set periodically to compute the cost function on the entire training set; instead, it looks at, say, only the last thousand examples or so. You can use this method both to make sure that stochastic gradient descent is okay and converging, and to tune the learning rate alpha.

summary

Learning with Large Datasets

We mainly benefit from a very large dataset when our algorithm has high variance when m is small. Recall that if our algorithm has high bias, more data will be of little benefit on its own.
Datasets can often approach such sizes as m = 100,000,000. In this case, our gradient descent step will have to make a summation over all one hundred million examples. We will want to try to avoid this – the approaches for doing so are described below.

Stochastic Gradient Descent

Stochastic gradient descent is an alternative to classic (or batch) gradient descent and is more efficient and scalable to large data sets.
Stochastic gradient descent is written out in a different but similar way:
$$cost(\theta,(x^{(i)}, y^{(i)})) = \dfrac{1}{2}(h_{\theta}(x^{(i)}) - y^{(i)})^2$$
The only difference in the above cost function is the elimination of the $m$ constant within $\dfrac{1}{2m}$: it measures the cost on a single training example.
$$J_{train}(\theta) = \dfrac{1}{m} \displaystyle \sum_{i=1}^m cost(\theta, (x^{(i)}, y^{(i)}))$$
$J_{train}$ is now just the average of the cost applied to all of our training examples.
The algorithm is as follows

  1. Randomly ‘shuffle’ the dataset
  2. Repeat (typically 1–10 times): for $i = 1\dots m$, update $\Theta_j := \Theta_j - \alpha (h_{\Theta}(x^{(i)}) - y^{(i)}) \cdot x^{(i)}_j$ for every $j = 0, \dots, n$

This algorithm will only try to fit one training example at a time. This way we can make progress in gradient descent without having to scan all m training examples first. Stochastic gradient descent will be unlikely to converge at the global minimum and will instead wander around it randomly, but usually yields a result that is close enough. Stochastic gradient descent will usually take 1-10 passes through your data set to get near the global minimum.

Mini-Batch Gradient Descent

Mini-batch gradient descent can sometimes be even faster than stochastic gradient descent. Instead of using all m examples as in batch gradient descent, and instead of using only 1 example as in stochastic gradient descent, we will use some in-between number of examples b.
Typical values for b range from 2-100 or so.
For example, with b=10 and m=1000:
Repeat:
For $i = 1,11,21,31,\dots,991$
$$\theta_j := \theta_j - \alpha \dfrac{1}{10} \displaystyle \sum_{k=i}^{i+9} (h_\theta(x^{(k)}) - y^{(k)})x_j^{(k)}$$
We’re simply summing over ten examples at a time. The advantage of computing more than one example at a time is that we can use vectorized implementations over the b examples.

Stochastic Gradient Descent Convergence

How do we choose the learning rate α for stochastic gradient descent? Also, how do we debug stochastic gradient descent to make sure it is getting as close as possible to the global optimum?
One strategy is to plot the average cost of the hypothesis applied to every 1000 or so training examples. We can compute and save these costs during the gradient descent iterations.
With a smaller learning rate, it is possible that you may get a slightly better solution with stochastic gradient descent. That is because stochastic gradient descent will oscillate and jump around the global minimum, and it will make smaller random jumps with a smaller learning rate.
If you increase the number of examples you average over to plot the performance of your algorithm, the plot’s line will become smoother.
With a very small number of examples for the average, the line will be too noisy and it will be difficult to find the trend.
One strategy for trying to actually converge at the global minimum is to slowly decrease α over time. For example, $\alpha = \dfrac{const1}{iterationNumber + const2}$
However, this is not often done because people don’t want to have to fiddle with even more parameters.

02_advanced-topics

01_online-learning

In this video, I'd like to talk about a new large-scale machine learning setting called the online learning setting. The online learning setting allows us to model problems where we have a continuous flood or stream of data coming in and we would like an algorithm to learn from it. Today, many of the largest websites use different versions of online learning algorithms to learn from the flood of users that keep coming back to the website. Specifically, if you have a continuous stream of data generated by a continuous stream of users coming to your website, what you can do is use an online learning algorithm to learn user preferences from the stream of data and use that to optimize some of the decisions on your website.

online learning context

Suppose you run a shipping service, so users come and ask you to help ship their package from location A to location B. Suppose you run a website where users repeatedly come and tell you where they want to send the package from and where they want to send it to (the origin and destination), and your website offers to ship the package for some asking price: I'll ship your package for $50, I'll ship it for $20. Based on the price that you offer, users sometimes choose to use your shipping service (that's a positive example) and sometimes they go away and do not purchase it. So let's say we want a learning algorithm to help us optimize the asking price we offer to our users. Specifically, let's say we come up with features that capture properties of the user (whatever we know about their demographics), the origin and destination of the package, and the price that we offer them for shipping the package. What we want to do is learn the probability that they will elect to ship the package using our shipping service, given these features; and again, as a reminder, the features x also capture the price we're asking. If we could estimate the chance that they'll agree to use our service for any given price, then we can try to pick a price such that they have a pretty high probability of choosing our website, while simultaneously offering us a fair return, a fair profit, for shipping their package. So if we can learn the probability that y = 1 given the price and the other features, we can use this to choose appropriate prices as new users come to us. To model the probability of y = 1, we can use logistic regression or a neural network or some other algorithm like that, but let's start with logistic regression.

online learning algorithm

Now, if you have a website that just runs continuously, here's what an online learning algorithm would do. I'm going to write "repeat forever"; this just means that our website is going to keep on staying up. What happens is that occasionally a user will come, and for that user we get some (x, y) pair: the features x are the origin and destination specified by the user and the price we happened to offer them this time around, and y is either one or zero depending on whether or not they chose to use our shipping service. Once we get this (x, y) pair, the online learning algorithm updates the parameters theta using just this example: theta_j gets updated as theta_j minus the learning rate alpha times the usual gradient descent term for logistic regression, and we do this for j = 0 up to n, and that's the close of the curly brace.

For other learning algorithms, instead of writing (x, y) I was writing things like (x(i), y(i)); but in this online learning setting we're actually discarding the notion of there being a fixed training set. Instead, what happens is that we get an example, we learn using that example, and then we throw that example away: we discard it and never use it again, and that's why we look at one example at a time. We learn from an example, then discard it, which is why we're also doing away with the notion of a fixed training set indexed by i. And if you really run a major website with a continuous stream of users coming, then this sort of online learning algorithm is a pretty reasonable choice, because data is essentially free: if you have so much data that it is essentially unlimited, then there is really no need to look at a training example more than once. Of course, if you had only a small number of users, then rather than using an online learning algorithm like this, you might be better off saving all your data in a fixed training set and running some algorithm over that training set. But if you really have a continuous stream of data, then an online learning algorithm can be very effective.

I should also mention that one interesting effect of this sort of online learning algorithm is that it can adapt to changing user preferences. In particular, if over time, because of changes in the economy, users become more price-sensitive and less willing to pay high prices, or if they become less price-sensitive and willing to pay higher prices, or if different things become more important to users, or if you start to have new types of users coming to your website, this sort of online learning algorithm can adapt and keep track of what your changing population of users may be willing to pay for. It does that because, if your pool of users changes, these updates to the parameters theta will slowly adapt your parameters to whatever your latest pool of users looks like. A minimal sketch of the update is below.
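Here is a minimal sketch of the online logistic regression update; `get_next_user_event` is a hypothetical stand-in for however your website observes the next (x, y) pair:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def online_update(theta, x, y, alpha=0.05):
    """Learn from a single (x, y) pair, then discard it.
    x includes the asking price among its features;
    y is 1 if the user chose our shipping service, else 0."""
    err = sigmoid(theta @ x) - y          # h_theta(x) - y for logistic regression
    return theta - alpha * err * x

# Conceptual serving loop ("repeat forever"):
#   theta = np.zeros(n)
#   while True:
#       x, y = get_next_user_event()        # hypothetical stream of (x, y) pairs
#       theta = online_update(theta, x, y)  # use the example once, then discard
```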

online learning algorithm

Here's another example of a sort of application to which you might apply online learning: product search, in which we want to apply a learning algorithm to learn to give good search listings to a user. Let's say you run an online store that sells mobile phones, and you have a user interface where a user can come to your website and type in a query like "Android phone 1080p camera" (1080p is a specification for the video camera that you might have on a phone). Suppose we have a hundred phones in our store, and because of the way our website is laid out, when a user types in a search query, we would like to find a choice of ten different phones to show the user. What we'd like to do is have a learning algorithm help us figure out which ten phones, out of the 100, we should return to the user in response to a search query like this one.

Here's how we can pose the problem. For each phone, given a specific user query, we can construct a feature vector x. The feature vector x might capture different properties of the phone, and it might capture how well the user's search query matches the phone: how many words in the query match the name of the phone, how many words match the description of the phone, and so on. So the features x capture properties of the phone and how similar, or how well, the phone matches the user query along different dimensions. What we'd like to do is estimate the probability that a user will click on the link for a specific phone, because we want to show the user phones that they are likely to want to buy, phones that they have a high probability of clicking on in the web browser. So I'm going to define y = 1 if the user clicks on the link for a phone and y = 0 otherwise, and what I would like to learn is the probability that the user clicks on a specific phone, given the features x, which capture properties of the phone and how well the query matches it. To give this problem a name, in the language of people who run websites like this, the problem of learning this is called the problem of learning the predicted click-through rate, the predicted CTR; it just means learning the probability that the user will click on the specific link that you offer them, so CTR is an abbreviation for click-through rate. Then, if you can estimate the predicted click-through rate for any particular phone, we can show the user the ten phones they are most likely to click on: out of the hundred phones, we compute the predicted CTR for each of the 100 phones and just select the 10 phones the user is most likely to click on. This is a pretty reasonable way to decide what ten results to show the user.
Just to be clear: suppose that every time a user does a search, we return ten results. What that does is give us ten (x, y) pairs; it actually gives us ten training examples every time a user comes to our website, because for each of the ten phones we chose to show the user, we get a feature vector x, and we also observe a value for y, depending on whether or not the user clicked on that link. So one way to run a website like this is to continuously show the user your ten best guesses for which phones they might like; each time a user comes, you get ten examples, ten (x, y) pairs, then use an online learning algorithm to update the parameters with essentially 10 steps of gradient descent on those 10 examples, and then throw the data away. If you really have a continuous stream of users coming to your website, this is a pretty reasonable way to learn the parameters of your algorithm, so as to show users the ten phones that are most promising and that they are most likely to click on. So this is a product search problem, or a learning-to-rank-phones, learning-to-search-for-phones example. A minimal ranking sketch is below.
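A sketch of the ranking step: score all 100 phones by predicted CTR under the current parameters and return the ten best. `phone_features` is assumed to be one feature vector per phone for the current query:

```python
import numpy as np

def top_k_phones(theta, phone_features, k=10):
    """Rank phones by predicted click-through rate p(y=1 | x; theta)
    and return the indices of the k most promising ones."""
    scores = 1.0 / (1.0 + np.exp(-(phone_features @ theta)))  # predicted CTR
    return np.argsort(scores)[::-1][:k]   # indices of the k highest scores
```

Each of the ten phones shown then yields one (x, y) pair, clicked or not, which can be fed through the online update above before being discarded.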

So, I'll quickly mention a few other examples. One is, if you have a website and you're trying to decide what special offer to show the user; this is very similar to the phones example. Or if you have a website and show different users different news articles: if you're a news-aggregator website, you can again use a similar system to select and show the user the news articles that they are most likely to be interested in and most likely to click on. Closely related to special offers are product recommendations; and in fact, if you have a collaborative filtering system, you can even imagine it giving you additional features to feed into a logistic regression classifier, to try to predict the click-through rate for different products that you might recommend to a user.

Of course, I should say that any of these problems could also have been formulated as a standard machine learning problem with a fixed training set. Maybe you can run your website for a few days, save away a fixed training set, and run a learning algorithm on that. But these are the actual sorts of problems where large companies get so much data that there's really no need to save away a fixed training set; instead, you can use an online learning algorithm to learn continuously from the data that users are generating on your website. So that was the online learning setting. As we saw, the algorithm we apply to it is really very similar to the stochastic gradient descent algorithm, only instead of scanning through a fixed training set, we get one example from a user, learn from that example, then discard it and move on. If you have a continuous stream of data for some application, this sort of algorithm may be well worth considering.

And of course, one other advantage of online learning is that if you have a changing pool of users, or if the things you're trying to predict are slowly changing, like slowly changing user tastes, the online learning algorithm can slowly adapt your learned hypothesis to whatever the latest set of user behaviors looks like.

02_map-reduce-and-data-parallelism

In the last few videos, we talked about stochastic gradient descent and other variations of the stochastic gradient descent algorithm, including the adaptation to online learning; but all of those algorithms could be run on one machine, on one computer. Some machine learning problems are just too big to run on one machine; sometimes you have so much data that you just don't ever want to run all of it through a single computer, no matter what algorithm you use on that computer.

So in this video I'd like to talk about a different approach to large-scale machine learning, called the MapReduce approach. Even though we have quite a few videos on stochastic gradient descent and we're going to spend relatively less time on MapReduce, please don't judge the relative importance of MapReduce versus gradient descent based on the amount of time I spend on these ideas. Many people would say that MapReduce is at least as important an idea as gradient descent, and some would say it's even more important; it's just relatively simpler to explain, which is why I'm going to spend less time on it. Using these ideas, you might be able to scale learning algorithms to even far larger problems than is possible using stochastic gradient descent. Here's the idea.

map-reduce

Let's say we want to fit a linear regression model or a logistic regression model or some such, and let's start again with batch gradient descent; so that's our batch gradient descent learning rule. To keep the writing on this slide tractable, I'm going to assume throughout that we have m = 400 examples. Of course, by our standards of large-scale machine learning, m = 400 is pretty small, and this might more commonly be applied to problems where you have maybe closer to 400 million examples; but just to make the writing on the slide simpler, I'm going to pretend we have 400 examples. In that case, the batch gradient descent learning rule has this 1/400 and a sum from i = 1 through 400 over my 400 examples, and if m is large, this is a computationally expensive step.

So here's what the MapReduce idea does. I should say that the MapReduce idea is due to two researchers, Jeff Dean and Sanjay Ghemawat. Jeff Dean, by the way, is one of the most legendary engineers in all of Silicon Valley, and he built a large fraction of the architectural infrastructure that all of Google runs on today. But here's the MapReduce idea. Let's say I have some training set, denoted by this box of (x, y) pairs, from (x(1), y(1)) down to my 400th example (x(m), y(m)). So that's my training set with 400 training examples. In the MapReduce idea, one thing we can do is split this training set into different subsets. I'm going to assume for this example that I have 4 computers, or 4 machines, to run in parallel on my training set, which is why I'm splitting it into 4 pieces; if you have 10 machines or 100 machines, you would split your training set into 10 pieces or 100 pieces.

What the first of my 4 machines does is use just the first quarter of my training set, so just the first 100 training examples. In particular, it looks at this summation and computes it for just the first 100 training examples: the first machine computes a variable temp(1)_j equal to the sum from i = 1 through 100 of (h_theta(x(i)) - y(i)) times x(i)_j; that's just the gradient descent term from above. Similarly, I'm going to take the second quarter of my data and send it to my second machine, which uses training examples 101 through 200 and computes a similar variable temp(2)_j, the same sum over examples 101 through 200; and machines 3 and 4 likewise use the third and fourth quarters of my training set. So now each machine has to sum over 100 examples instead of 400, and so it has to do only a quarter of the work and can presumably do it about four times as fast.

Finally, after all these machines have done this work, I take these temp variables and put them back together: I send them all to a centralized master server, and what the master does is combine the results. In particular, it updates the parameter theta_j as theta_j minus the learning rate alpha times 1/400 times (temp(1)_j + temp(2)_j + temp(3)_j + temp(4)_j), and of course we do this separately for each j = 0 up to n, the number of features.
I hope the role of this equation is clear: the centralized master server takes the results $temp_j^{(1)}, temp_j^{(2)}, temp_j^{(3)}, temp_j^{(4)}$ and adds them up. The sum of these four partial sums is exactly the sum over all 400 training examples that batch gradient descent was originally computing, and with the factor of $\alpha \cdot \frac{1}{400}$ in front, this update is exactly equivalent to the batch gradient descent algorithm. The only difference is that instead of needing to sum over all four hundred training examples on just one machine, we can divide the workload across four machines.
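To make the arithmetic concrete, here is a minimal numpy sketch that simulates the four machines sequentially on synthetic data. The variable name `temp` follows the lecture; the data, dimensions, and helper names are hypothetical.

```python
import numpy as np

# Synthetic stand-ins for the 400-example training set (hypothetical data).
m, n = 400, 5
X = np.hstack([np.ones((m, 1)), np.random.randn(m, n)])  # intercept + features
y = np.random.randn(m)
theta = np.zeros(n + 1)
alpha = 0.01

def partial_gradient(X_chunk, y_chunk, theta):
    # "Map" step, run on one machine: sum_i (h_theta(x_i) - y_i) * x_i
    return X_chunk.T @ (X_chunk @ theta - y_chunk)

# Simulate the 4 machines sequentially: each gets one quarter of the data.
temps = [partial_gradient(Xc, yc, theta)
         for Xc, yc in zip(np.array_split(X, 4), np.array_split(y, 4))]

# "Reduce" step on the master: add the four partial sums and update theta.
# This reproduces one batch gradient descent step over all 400 examples.
theta = theta - alpha * (1.0 / m) * sum(temps)
```

The combined update is numerically identical to summing over all 400 examples on one machine; only the work is divided.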

map reduce and data parallelism

So, here’s what the general picture of the MapReduce technique looks like. We have some training set, and if we want to parallelize across four machines, we take the training set and split it as evenly as we can into four subsets. Then we send the 4 subsets of the training data to 4 different computers, and each of the 4 computers computes a summation over just one quarter of the training set. Finally, each computer sends its result to a centralized server, which combines the results together. On the previous slide, the bulk of the work in gradient descent was computing the sum from i equals 1 to 400 of something; more generally, the sum from i equals 1 to m of the gradient descent formula. Because each of the four computers does just a quarter of the work, you can potentially get up to a 4x speedup. In particular, if there were no network latencies and no cost to sending data back and forth over the network, you could get up to a 4x speedup; in practice, because of network latencies, the overhead of combining the results afterwards, and other factors, you get slightly less than a 4x speedup. But nonetheless, this sort of MapReduce approach does offer us a way to process much larger data sets than is possible using a single computer.

If you are thinking of applying MapReduce to some learning algorithm in order to speed it up by parallelizing the computation over different computers, the key question to ask yourself is: can your learning algorithm be expressed as a summation over the training set? It turns out that many learning algorithms can indeed be expressed as computing sums of functions over the training set, and the computational expense of running them on large data sets is precisely that they need to sum over a very large training set. So whenever the bulk of the work of your learning algorithm can be expressed as a sum over the training set, MapReduce is a good candidate for scaling it to very, very large data sets, as the sketch below illustrates.
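In code, this criterion amounts to isolating the per-shard summation; a minimal sketch, where the function and shard names are hypothetical:

```python
# A learning algorithm is a MapReduce candidate when its expensive step
# has the form sum_i f(x_i, y_i) over the training set.
def mapreduce_sum(per_shard_sum, shards):
    partials = [per_shard_sum(shard) for shard in shards]  # "map": one job per machine
    return sum(partials)                                   # "reduce": combine on the master
```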

map reduce and data parallelism

Let’s look at just one more example. Say we want to use one of the advanced optimization algorithms, things like L-BFGS or conjugate gradient, and say we want to train logistic regression. For that, we need to provide two main quantities. First, the advanced optimization algorithms like L-BFGS and conjugate gradient need a routine to compute the cost function, the optimization objective. For logistic regression, you remember the cost function is a sum over the training set:

$$J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)} \log h_{\theta}(x^{(i)}) + (1 - y^{(i)}) \log(1 - h_{\theta}(x^{(i)}))\right]$$

So if you’re parallelizing over ten machines, you would split the training set across the ten machines and have each machine compute that sum over just one tenth of the training data. Second, the advanced optimization algorithms need a routine to compute the partial derivative terms, which for logistic regression can also be expressed as a sum over the training set:

$$\frac{\partial}{\partial \theta_j} J(\theta) = \frac{1}{m}\sum_{i=1}^{m}\left(h_{\theta}(x^{(i)}) - y^{(i)}\right) x_j^{(i)}$$

So once again, similar to our earlier example, you would have each machine compute that summation over just its small fraction of the training data. Finally, having computed all of these partial sums, the machines send their results to a centralized server, which adds them up. This corresponds to adding up those $temp^{(i)}$ or $temp_j^{(i)}$ variables, which were computed locally on machine number i; the centralized server sums them to get the overall cost function and the overall partial derivatives, which it then passes to the advanced optimization algorithm. More broadly, by taking other learning algorithms and expressing them in this summation form, in terms of computing sums of functions over the training set, you can use the MapReduce technique to parallelize other learning algorithms as well, and scale them to very large training sets.
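A sketch of how this might look in code, assuming scipy is available for the L-BFGS call; the ten machines are simulated sequentially, and the data and helper names are hypothetical:

```python
import numpy as np
from scipy.optimize import minimize

def shard_cost_and_grad(theta, X_chunk, y_chunk):
    """One machine: partial sums of the logistic cost and gradient on its shard."""
    h = 1.0 / (1.0 + np.exp(-(X_chunk @ theta)))
    cost_sum = -np.sum(y_chunk * np.log(h) + (1 - y_chunk) * np.log(1 - h))
    grad_sum = X_chunk.T @ (h - y_chunk)
    return cost_sum, grad_sum

def cost_and_grad(theta, shards, m):
    """Master: add every machine's partial sums, then divide by m."""
    parts = [shard_cost_and_grad(theta, Xc, yc) for Xc, yc in shards]
    return (sum(p[0] for p in parts) / m,
            sum(p[1] for p in parts) / m)

# Hypothetical data, split across 10 simulated machines.
m, n = 1000, 4
X = np.hstack([np.ones((m, 1)), np.random.randn(m, n)])
y = (np.random.rand(m) > 0.5).astype(float)
shards = list(zip(np.array_split(X, 10), np.array_split(y, 10)))

# jac=True tells scipy that the routine returns (cost, gradient) together.
res = minimize(cost_and_grad, np.zeros(n + 1), args=(shards, m),
               jac=True, method="L-BFGS-B")
```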

map reduce and data parallelism

Finally, as one last comment, so far we have been discussing MapReduce as a way to parallelize over multiple computers, maybe multiple computers in a computer cluster or in a data center. It turns out that even if you have just a single computer, MapReduce can still be applicable. In particular, many computers now have multiple processing cores: you can have multiple CPUs, and within each CPU multiple cores. If you have a large training set and, say, a computer with 4 computing cores, then even on a single computer you can split the training set into pieces and send the pieces to different cores within a single box, like a single desktop computer or a single server, and use MapReduce this way to divvy up the workload. Each core carries out the sum over, say, one quarter of your training set, and then the partial sums are combined to get the summation over the entire training set. The advantage of thinking about MapReduce this way, as parallelizing over cores within a single machine rather than over multiple machines, is that you don’t have to worry about network latency: all the communication, all the sending of the partial results back and forth, happens within a single machine. So network latency becomes much less of an issue compared to parallelizing over different computers in a data center.
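A minimal sketch of the single-machine, multi-core version using Python's standard multiprocessing module; the data and worker names are hypothetical:

```python
import numpy as np
from multiprocessing import Pool

def partial_gradient(args):
    """Runs on one core: gradient summed over its piece of the training set."""
    X_chunk, y_chunk, theta = args
    return X_chunk.T @ (X_chunk @ theta - y_chunk)

if __name__ == "__main__":
    m, n = 400, 5
    X, y = np.random.randn(m, n), np.random.randn(m)   # hypothetical data
    theta, alpha = np.zeros(n), 0.01

    jobs = [(Xc, yc, theta)
            for Xc, yc in zip(np.array_split(X, 4), np.array_split(y, 4))]
    with Pool(processes=4) as pool:        # one worker per core on this box
        temps = pool.map(partial_gradient, jobs)

    # Combining happens in-process, so there is no network latency to pay.
    theta = theta - alpha * (1.0 / m) * sum(temps)
```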

Finally, one last caveat on parallelizing within a multi-core machine. Depending on the details of your implementation, it turns out that some numerical linear algebra libraries can automatically parallelize their linear algebra operations across multiple cores within the machine. So if you’re fortunate enough to be using one of those libraries (and certainly this does not apply to every library), and if you have a good vectorized implementation of your learning algorithm, sometimes you can just implement your standard learning algorithm in a vectorized fashion, not worry about parallelization, and let the numerical linear algebra library take care of it for you; you don’t need to implement the parallelization yourself. For other problems, though, finding and using the MapReduce formulation to parallelize across cores yourself might be a good idea as well and could speed up your learning algorithm.
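For instance, if your numpy build is linked against a multithreaded BLAS (such as OpenBLAS or MKL), a fully vectorized gradient like the sketch below is already parallelized across cores by the library, with no MapReduce bookkeeping on your side:

```python
import numpy as np

# One vectorized line: the matrix-vector products are dispatched to the
# underlying BLAS, which may run them across multiple cores automatically.
def gradient(theta, X, y):
    return X.T @ (X @ theta - y) / len(y)
```

The thread count is typically controlled by the library itself, for example via environment variables such as OPENBLAS_NUM_THREADS or MKL_NUM_THREADS, depending on which BLAS your numpy build uses.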

In this video, we talked about the MapReduce approach to parallelizing machine learning by taking a data set and spreading it across many computers in a data center; the same ideas apply to parallelizing across multiple cores within a single computer as well. Today there are good open source implementations of MapReduce; many people use an open source system called Hadoop. Using either your own implementation or someone else’s open source implementation, you can use these ideas to parallelize learning algorithms and get them to run on much larger data sets than is possible using just a single machine.

summary

Online Learning

With a continuous stream of users to a website, we can run an endless loop that gets (x,y), where we collect some user actions for the features in x to predict some behavior y.
You can update θ for each individual (x,y) pair as you collect them. This way, you can adapt to new pools of users, since you are continuously updating θ.
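A minimal sketch of that loop, assuming a logistic-regression hypothesis; get_next_user_example() is a hypothetical stand-in for the live user stream, stubbed here with random data so the snippet runs:

```python
import numpy as np

n_features = 10                      # hypothetical feature count
theta = np.zeros(n_features)
alpha = 0.01

def get_next_user_example():
    """Stand-in for the live user stream (hypothetical; random data here)."""
    x = np.random.randn(n_features)
    y = float(np.random.rand() > 0.5)
    return x, y

while True:
    x, y = get_next_user_example()            # observe one (x, y) pair
    h = 1.0 / (1.0 + np.exp(-(x @ theta)))    # logistic prediction
    theta = theta - alpha * (h - y) * x       # learn from it, then discard it
```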

Map Reduce and Data Parallelism

We can divide up batch gradient descent and dispatch the cost function for a subset of the data to many different machines so that we can train our algorithm in parallel.
You can split your training set into z subsets corresponding to the number of machines you have. On each of those machines calculate $\displaystyle \sum_{i=p}^{q}(h_{\theta}(x^{(i)}) - y^{(i)}) \cdot x_j^{(i)}$, where we’ve split the data starting at p and ending at q.
MapReduce will take all these dispatched (or ‘mapped’) jobs and ‘reduce’ them by calculating:
$$\theta_j := \theta_j - \alpha \dfrac{1}{m}\left(temp_j^{(1)} + temp_j^{(2)} + \cdots + temp_j^{(z)}\right)$$
For all $j = 0, \dots, n$.
This simply takes the partial sums computed on all the machines, adds them up, scales by the learning rate and by $\frac{1}{m}$ (the total number of training examples), and updates θ, exactly reproducing one step of batch gradient descent.
Your learning algorithm is MapReduce-able if it can be expressed as computing sums of functions over the training set. Linear regression and logistic regression are easily parallelizable.
For neural networks, you can compute forward propagation and back propagation on subsets of your data on many machines. Those machines can report their derivatives back to a ‘master’ server that will combine them.
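A sketch of that data-parallel pattern for a small network with one tanh hidden layer and a sigmoid output (cross-entropy loss); the architecture, data, and names are all hypothetical, and the four machines are simulated sequentially:

```python
import numpy as np

def shard_nn_gradients(X, y, W1, W2):
    """One machine: forward prop + backprop on its shard; returns summed grads."""
    A1 = np.tanh(X @ W1)                      # hidden activations
    h = 1.0 / (1.0 + np.exp(-(A1 @ W2)))      # sigmoid output
    d2 = h - y                                # output error (cross-entropy + sigmoid)
    dW2 = A1.T @ d2                           # gradient for W2, summed over shard
    d1 = (d2 @ W2.T) * (1.0 - A1 ** 2)        # backprop through tanh
    dW1 = X.T @ d1                            # gradient for W1, summed over shard
    return dW1, dW2

# Hypothetical data, split across 4 simulated machines.
m, d, hdim = 400, 5, 8
X, y = np.random.randn(m, d), (np.random.rand(m, 1) > 0.5).astype(float)
W1, W2 = 0.1 * np.random.randn(d, hdim), 0.1 * np.random.randn(hdim, 1)

parts = [shard_nn_gradients(Xc, yc, W1, W2)
         for Xc, yc in zip(np.array_split(X, 4), np.array_split(y, 4))]

# Master: add the partial gradients, average over m, and take one step.
dW1 = sum(p[0] for p in parts) / m
dW2 = sum(p[1] for p in parts) / m
W1, W2 = W1 - 0.1 * dW1, W2 - 0.1 * dW2
```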
