01_predicting-movie-ratings
01_predicting-movie-ratings
In this next set of videos, I would like to tell you about recommender systems. There are two reasons, I had two motivations for why I wanted to talk about recommender systems. The first is just that it is an important application of machine learning. Over the last few years, occasionally I visit different, you know, technology companies here in Silicon Valley and I often talk to people working on machine learning applications there and so I’ve asked people what are the most important applications of machine learning or what are the machine learning applications that you would most like to get an improvement in the performance of. And one of the most frequent answers I heard was that there are many groups out in Silicon Valley now, trying to build better recommender systems. So, if you think about what the websites are like Amazon, or what Netflix or what eBay, or what iTunes Genius, made by Apple does, there are many websites or systems that try to recommend new products to use. So, Amazon recommends new books to you, Netflix try to recommend new movies to you, and so on. And these sorts of recommender systems, that look at what books you may have purchased in the past, or what movies you have rated in the past, but these are the systems that are responsible for today, a substantial fraction of Amazon’s revenue and for a company like Netflix, the recommendations that they make to the users is also responsible for a substantial fraction of the movies watched by their users. And so an improvement in performance of a recommender system can have a substantial and immediate impact on the bottom line of many of these companies. Recommender systems is kind of a funny problem, within academic machine learning so that we could go to an academic machine learning conference, the problem of recommender systems, actually receives relatively little attention, or at least it’s sort of a smaller fraction of what goes on within Academia. But if you look at what’s happening, many technology companies, the ability to build these systems seems to be a high priority for many companies. And that’s one of the reasons why I want to talk about them in this class.
The second reason that I want to talk about recommender systems is that as we approach the last few sets of videos of this class I wanted to talk about a few of the big ideas in machine learning and share with you, you know, some of the big ideas in machine learning. And we’ve already seen in this class that features are important for machine learning, the features you choose will have a big effect on the performance of your learning algorithm.
So there’s this big idea in machine learning, which is that for some problems, maybe not all problems, but some problems, there are algorithms that can try to automatically learn a good set of features for you. So rather than trying to hand design, or hand code the features, which is mostly what we’ve been doing so far, there are a few settings where you might be able to have an algorithm, just to learn what feature to use, and the recommender systems is just one example of that sort of setting.
There are many others, but engraved through recommender systems, will be able to go a little bit into this idea of learning the features and you’ll be able to see at least one example of this, I think, big idea in machine learning as well.
So, without further ado, let’s get started, and talk about the recommender system problem formulation. As my running example, I’m going to use the modern problem of predicting movie ratings. So, here’s a problem. Imagine that you’re a website or a company that sells or rents out movies, or what have you. And so, you know, Amazon, and Netflix, and I think iTunes are all examples of companies that do this, and let’s say you let your users rate different movies, using a 1 to 5 star rating.
So, users may, you know, something one, two, three, four or five stars. In order to make this example just a little bit nicer, I’m going to allow 0 to 5 stars as well, because that just makes some of the math come out just nicer. Although most of these websites use the 1 to 5 star scale. So here, I have 5 movies. You know, Love That Lasts, Romance Forever, Cute Puppies of Love, Nonstop Car Chases, and Swords vs. Karate. And we have 4 users, which, calling, you know, Alice, Bob, Carol, and Dave, with initials A, B, C, and D, we’ll call them users 1, 2, 3, and 4. So, let’s say Alice really likes Love That Lasts and rates that 5 stars, likes Romance Forever, rates it 5 stars. She did not watch Cute Puppies of Love, and did rate it, so we don’t have a rating for that, and Alice really did not like Nonstop Car Chases or Swords vs. Karate. And a different user Bob, user two, maybe rated a different set of movies, maybe she likes to Love at Last, did not to watch Romance Forever, just have a rating of 4, a 0, a 0, and maybe our 3rd user, rates this 0, did not watch that one, 0, 5, 5, and, you know, let’s just fill in some of the numbers. And so just to introduce a bit of notation, this notation that we’ll be using throughout, I’m going to use NU to denote the number of users. So in this example, NU will be equal to 4. So the u-subscript stands for users and Nm, going to use to denote the number of movies, so here I have five movies so Nm equals equals 5. And you know for this example, I have for this example, I have loosely 3 maybe romantic or romantic comedy movies and 2 action movies and you know, if you look at this small example, it looks like Alice and Bob are giving high ratings to these romantic comedies or movies about love, and giving very low ratings about the action movies, and for Carol and Dave, it’s the opposite, right? Carol and Dave, users three and four, really like the action movies and give them high ratings, but don’t like the romance and love- type movies as much. Specifically, in the recommender system problem, we are given the following data. Our data comprises the following: we have these values r(i, j), and r(i, j) is 1 if user J has rated movie I. So our users rate only some of the movies, and so, you know, we don’t have ratings for those movies. And whenever r(i, j) is equal to 1, whenever user j has rated movie i, we also get this number y(i, j), which is the rating given by user j to movie i. And so, y(i, j) would be a number from zero to five, depending on the star rating, zero to five stars that user gave that particular movie. So, the recommender system problem is given this data that has give these r(i, j)’s and the y(i, j)’s to look through the data and look at all the movie ratings that are missing and to try to predict what these values of the question marks should be. In the particular example, I have a very small number of movies and a very small number of users and so most users have rated most movies but in the realistic settings your users each of your users may have rated only a minuscule fraction of your movies but looking at this data, you know, if Alice and Bob both like the romantic movies maybe we think that Alice would have given this a five. Maybe we think Bob would have given this a 4.5 or some high value, as we think maybe Carol and Dave were doing these very low ratings. And Dave, well, if Dave really likes action movies, maybe he would have given Swords and Karate a 4 rating or maybe a 5 rating, okay? And so, our job in developing a recommender system is to come up with a learning algorithm that can automatically go fill in these missing values for us so that we can look at, say, the movies that the user has not yet watched, and recommend new movies to that user to watch. You try to predict what else might be interesting to a user. So that’s the formalism of the recommender system problem.
In the next video we’ll start to develop a learning algorithm to address this problem.
summary
Problem Formulation
Recommendation is currently a very popular application of machine learning.
Say we are trying to recommend movies to customers. We can use the following definitions
- $n_u =$ number of users
- $n_m =$ number of movies
- $r(i,j) = 1$ if user j has rated movie i
- $y(i,j) =$ rating given by user j to movie i (defined only if r(i,j)=1)
02_content-based-recommendations
In the last video, we talked about the recommender systems problem where for example you might have a set of movies and you may have a set of users, each who have rated some subset of the movies. They’ve rated the movies one to five stars or zero to five stars. And what we would like to do is look at these users and predict how they would have rated other movies that they have not yet rated. In this video I’d like to talk about our first approach to building a recommender system. This approach is called content based recommendations.
Here’s our data set from before and just to remind you of a bit of notation, I was using nu to denote the number of users and so that’s equal to 4, and nm to denote the number of movies, I have 5 movies. So, how do I predict what these missing values would be? Let’s suppose that for each of these movies I have a set of features for them. In particular, let’s say that for each of the movies have two features which I’m going to denote x1 and x2. Where x1 measures the degree to which a movie is a romantic movie and x2 measures the degree to which a movie is an action movie. So, if you take a movie, Love at last, you know it’s 0.9 rating on the romance scale. This is a highly romantic movie, but zero on the action scale. So, almost no action in that movie. Romance forever is a 1.0, lot of romance and 0.01 action. I don’t know, maybe there’s a minor car crash in that movie or something. So there’s a little bit of action. Skipping one, let’s do Swords vs karate, maybe that has a 0 romance rating and no romance at all in that but plenty of action. And Nonstop car chases, maybe again there’s a tiny bit of romance in that movie but mainly action. And Cute puppies of love mainly a romance movie with no action at all. So if we have features like these, then each movie can be represented with a feature vector. Let’s take movie one. So let’s call these movies 1, 2, 3, 4, and 5. But my first movie, Love at last, I have my two features, 0.9 and 0. And so these are features x1 and x2. And let’s add an extra feature as usual, which is my interceptor feature x0 = 1. And so putting these together I would then have a feature x1. The superscript 1 denotes it’s the feature vector for my first movie, and this feature vector is equal to 1. The first 1 there is this interceptor. And then my two feature is 0.90 like so. So for Love at last I would have a feature vector x1, for the movie Romance forever I may have a software feature of vector x2, and so on, and for Swords vs karate I would have a different feature vector x superscript 5. Also, consistence with our earlier node notation that we were using, we’re going to set n to be the number of features not counting this x0 interceptor. So n is equal to 2 because it’s we have two features x1 and x2 capturing the degree of romance and the degree of action in each movie. Now in order to make predictions here’s one thing that we do which is that we could treat predicting the ratings of each user as a separate linear regression problem. So specifically, let’s say that for each user j, we’re going to learn the parameter vector theta j, which would be an R3 in this case.
More generally, theta (j) would be an R (n+1), where n is the number of features not counting the set term. And we’re going to predict user j as rating movie i with just the inner product between parameters vectors theta and the features xi. So let’s take a specific example. Let’s take user 1, so that would be Alice. And associated with Alice would be some parameter vector theta 1. And our second user, Bob, will be associated a different parameter vector theta 2. Carol will be associated with a different parameter vector theta 3 and Dave a different parameter vector theta 4. So let’s say you want to make a prediction for what Alice will think of the movie Cute puppies of love. Well that movie is going to have some parameter vector x3 where we have that x3 is going to be equal to 1, which is my intercept term and then 0.99 and then 0. And let’s say, for this example, let’s say that we’ve somehow already gotten a parameter vector theta 1 for Alice. We’ll say it later exactly how we come up with this parameter vector. But let’s just say for now that some unspecified learning algorithm has learned the parameter vector theta 1 and is equal to this 0,5,0. So our prediction for this entry is going to be equal to theta 1, that is Alice’s parameter vector, transpose x3, that is the feature vector for the Cute puppies of love movie, number 3. And so the inner product between these two vectors is gonna be 5 times 0.99, which is equal to 4.95. And so my prediction for this value over here is going to be 4.95. And maybe that seems like a reasonable value if indeed this is my parameter vector theta 1. So, all we’re doing here is we’re applying a different copy of this linear regression for each user, and we’re saying that what Alice does is Alice has some parameter vector theta 1 that she uses, that we use to predict her ratings as a function of how romantic and how action packed a movie is. And Bob and Carol and Dave, each of them have a different linear function of the romanticness and actionness, or degree of romance and degree of action in a movie and that that’s how we’re gonna predict that their star ratings. More formally, here’s how we can write down the problem. Our notation is that r(i,j) is equal to 1 if user j has rated movie i and y(i,j) is the rating of that movie, if that rating exists. That is, if that user has actually rated that movie. And, on the previous slide we also defined these, theta j, which is a parameter for the user xi, which is a feature vector for a specific movie. And for each user and each movie, we predict that rating as follows. So let me introduce just temporarily introduce one extra bit of notation mj. We’re gonna use mj to denote the number of users rated by movie j. We don’t need this notation only for this line. Now in order to learn the parameter vector for theta j, well how do we do so. This is basically a linear regression problem. So what we can do is just choose a parameter vector theta j so that the predicted values here are as close as possible to the values that we observed in our training sets and the values we observed in our data. So let’s write that down. In order to learn the parameter vector theta j, let’s minimize over the parameter vector theta j of sum, and I want to sum over all movies that user j has rated. So we write it as sum over all values of i. That’s a :r(i,j) equals 1. So the way to read this summation syntax is this is summation over all the values of i, so the r(i.j) is equal to 1. So you’ll be summing over all the movies that user j has rated. And then I’m going to compute theta j, transpose x i. So that’s the prediction of using j’s rating on movie i,- y (i,j). So that’s the actual observed rating squared. And then, let me just divide by the number of movies that user j has actually rated. So let’s just divide by 1 over 2m j. And so this is just like the least squares regressions. It’s just like linear regression, where we want to choose the parameter vector theta j to minimize this type of squared error term. And if you want, you can also add in irregularization terms so plus lambda over 2m and this is really 2mj because we have mj examples. User j has rated that many movies, it’s not like we have that many data points with which to fit the parameters of theta j. And then let me add in my usual regularization term here of theta j k squared. As usual, this sum is from k equals 1 through n, so here, theta j is going to be an n plus 1 dimensional vector, where in our early example n was equal to 2. But more broadly, more generally n is the number of features we have per movie. And so as usual we don’t regularize over theta 0. We don’t regularize over the bias terms. The sum is from k equals 1 through n. So if you minimize this as a function of theta j you get a good solution, you get a pretty good estimate of a parameter vector theta j with which to make predictions for user j’s movie ratings. For recommender systems, I’m gonna change this notation a little bit. So to simplify the subsequent math, I with to get rid of this term mj. So that’s just a constant, right? So I can delete it without changing the value of theta j that I get out of this optimization. So if you imagine taking this whole equation, taking this whole expression and multiplying it by mj, get rid of that constant.
And when I minimize this, I should still get the same value of theta j as before. So just to repeat what we wrote on the previous slide, here’s our optimization objective. In order to learn theta j which is the parameter for user j, we’re going to minimize over theta j of this optimization objectives. So this is our usual squared error term and then this is our regularizations term. Now of course in building a recommender system, we don’t just want to learn parameters for a single user. We want to learn parameters for all of our users. I have n subscript u users, so I want to learn all of these parameters. And so, what I’m going to do is take this optimization objective and just add the mixture summation there. So this expression here with the one half on top of this is exactly the same as what we had on top. Except that now instead of just doing this for a specific user theta j, I’m going to sum my objective over all of my users and then minimize this overall optimization objective, minimize this overall cost on. And when I minimize this as a function of theta 1, theta 2, up to theta nu, I will get a separate parameter vector for each user. And I can then use that to make predictions for all of my users, for all of my n subscript users.
So putting everything together, this was our optimization objective on top. And to give this thing a name, I’ll just call this J(theta1, …, theta nu). So j as usual is my optimization objective, which I’m trying to minimize. Next, in order to actually do the minimization, if you were to derive the gradient descent update, these are the equations that you would get. So you take theta j, k, and subtract from an alpha, which is the learning rate, times these terms over here on the right. So there’s slightly different cases when k equals 0 and when k does not equal 0. Because our regularization term here regularizes only the values of theta jk for k not equal to 0, so we don’t regularize theta 0, so with slightly different updates when k equals 0 and k is not equal to 0. And this term over here, for example, is just the partial derivative with respect to your parameter, that of your optimization objective. Right and so this is just gradient descent and I’ve already computed the derivatives and plugged them into here. And if this gradient descent update look a lot like what we have here for linear regression. That’s because these are essentially the same as linear regression. The only minor difference is that for linear regression we have these 1 over m terms, this really would’ve been 1 over mj. But because earlier when we are deriving the optimization objective, we got rid of this, that’s why we don’t have this 1 over m term. But otherwise, it’s really some of my training examples of the ever times xk plus that regularization term, plus that term of regularization contributes to the derivative. And so if you’re using gradient descent here’s how you can minimize the cost function j to learn all the parameters. And using these formulas for the derivative if you want, you can also plug them into a more advanced optimization algorithm, like conjugate gradient or LBFGS or what have you. And use that to try to minimize the cost function j as well.
So hopefully you now know how you can apply essentially a deviation on linear regression in order to predict different movie ratings by different users. This particular algorithm is called a content based recommendations, or a content based approach, because we assume that we have available to us features for the different movies. And so where features that capture what is the content of these movies, of how romantic is this movie, how much action is in this movie. And we’re really using features of a content of the movies to make our predictions. But for many movies, we don’t actually have such features. Or maybe very difficult to get such features for all of our movies, for all of whatever items we’re trying to sell. And so, in the next video, we’ll start to talk about an approach to recommender systems that isn’t content based and does not assume that we have someone else giving us all of these features for all of the movies in our data set.
summary
Content Based Recommendations
We can introduce two features, $x_1$ and $x_2$ which represents how much romance or how much action a movie may have (on a scale of 0−1).
One approach is that we could do linear regression for every single user. For each user j, learn a parameter $\theta^{(j)} \in \mathbb{R}^3$. Predict user j as rating movie i with $(\theta^{(j)})^Tx^{(i)}$ stars.
$\theta^{(j)} =$ parameter vector for user j
$x^{(i)} =$ feature vector for movie i
For user j, movie i, predicted rating: $(\theta^{(j)})^T(x^{(i)})$
$m^{(j)} =$ number of movies rated by user j
To learn $\theta^{(j)}$, we do the following
$$min_{\theta^{(j)}} = \dfrac{1}{2}\displaystyle \sum_{i:r(i,j)=1} ((\theta^{(j)})^T(x^{(i)}) - y^{(i,j)})^2 + \dfrac{\lambda}{2} \sum_{k=1}^n(\theta_k^{(j)})^2$$
This is our familiar linear regression. The base of the first summation is choosing all i such that $r(i,j) = 1$.
To get the parameters for all our users, we do the following:
$$min_{\theta^{(1)},\dots,\theta^{(n_u)}} = \dfrac{1}{2}\displaystyle \sum_{j=1}^{n_u} \sum_{i:r(i,j)=1} ((\theta^{(j)})^T(x^{(i)}) - y^{(i,j)})^2 + \dfrac{\lambda}{2} \sum_{j=1}^{n_u} \sum_{k=1}^n(\theta_k^{(j)})^2$$
We can apply our linear regression gradient descent update using the above cost function.
The only real difference is that we eliminate the constant $\dfrac{1}{m}$.
02_collaborative-filtering
01_collaborative-filtering
In this video we’ll talk about an approach to building a recommender system that’s called collaborative filtering. The algorithm that we’re talking about has a very interesting property that it does what is called feature learning and by that I mean that this will be an algorithm that can start to learn for itself what features to use.
Here was the data set that we had and we had assumed that for each movie, someone had come and told us how romantic that movie was and how much action there was in that movie. But as you can imagine it can be very difficult and time consuming and expensive to actually try to get someone to, you know, watch each movie and tell you how romantic each movie and how action packed is each movie, and often you’ll want even more features than just these two. So where do you get these features from? So let’s change the problem a bit and suppose that we have a data set where we do not know the values of these features. So we’re given the data set of movies and of how the users rated them, but we have no idea how romantic each movie is and we have no idea how action packed each movie is so I’ve replaced all of these things with question marks. But now let’s make a slightly different assumption. Let’s say we’ve gone to each of our users, and each of our users has told has told us how much they like the romantic movies and how much they like action packed movies. So Alice has associated a current of theta 1. Bob theta 2. Carol theta 3. Dave theta 4. And let’s say we also use this and that Alice tells us that she really likes romantic movies and so there’s a five there which is the multiplier associated with X1 and lets say that Alice tells us she really doesn’t like action movies and so there’s a 0 there. And Bob tells us something similar so we have theta 2 over here. Whereas Carol tells us that she really likes action movies which is why there’s a 5 there, that’s the multiplier associated with X2, and remember there’s also X0 equals 1 and let’s say that Carol tells us she doesn’t like romantic movies and so on, similarly for Dave. So let’s assume that somehow we can go to users and each user J just tells us what is the value of theta J for them. And so basically specifies to us of how much they like different types of movies. If we can get these parameters theta from our users then it turns out that it becomes possible to try to infer what are the values of x1 and x2 for each movie. Let’s look at an example. Let’s look at movie 1. So that movie 1 has associated with it a feature vector x1. And you know this movie is called Love at last but let’s ignore that. Let’s pretend we don’t know what this movie is about, so let’s ignore the title of this movie. All we know is that Alice loved this move. Bob loved this movie. Carol and Dave hated this movie. So what can we infer? Well, we know from the feature vectors that Alice and Bob love romantic movies because they told us that there’s a 5 here. Whereas Carol and Dave, we know that they hate romantic movies and that they love action movies. So because those are the parameter vectors that you know, uses 3 and 4, Carol and Dave, gave us. And so based on the fact that movie 1 is loved by Alice and Bob and hated by Carol and Dave, we might reasonably conclude that this is probably a romantic movie, it is probably not much of an action movie. this example is a little bit mathematically simplified but what we’re really asking is what feature vector should X1 be so that theta 1 transpose x1 is approximately equal to 5, that’s Alice’s rating, and theta 2 transpose x1 is also approximately equal to 5, and theta 3 transpose x1 is approximately equal to 0, so this would be Carol’s rating, and theta 4 transpose X1 is approximately equal to 0. And from this it looks like, you know, X1 equals one that’s the intercept term, and then 1.0, 0.0, that makes sense given what we know of Alice, Bob, Carol, and Dave’s preferences for movies and the way they rated this movie. And so more generally, we can go down this list and try to figure out what might be reasonable features for these other movies as well.
Let’s formalize this problem of learning the features XI. Let’s say that our users have given us their preferences. So let’s say that our users have come and, you know, told us these values for theta 1 through theta of NU and we want to learn the feature vector XI for movie number I. What we can do is therefore pose the following optimization problem. So we want to sum over all the indices J for which we have a rating for movie I because we’re trying to learn the features for movie I that is this feature vector XI. So and then what we want to do is minimize this squared error, so we want to choose features XI, so that, you know, the predictive value of how user J rates movie I will be similar, will be not too far in the squared error sense of the actual value YIJ that we actually observe in the rating of user j on movie I. So, just to summarize what this term does is it tries to choose features XI so that for all the users J that have rated that movie, the algorithm also predicts a value for how that user would have rated that movie that is not too far, in the squared error sense, from the actual value that the user had rated that movie. So that’s the squared error term. As usual, we can also add this sort of regularization term to prevent the features from becoming too big. So this is how we would learn the features for one specific movie but what we want to do is learn all the features for all the movies and so what I’m going to do is add this extra summation here so I’m going to sum over all Nm movies, N subscript m movies, and minimize this objective on top that sums of all movies. And if you do that, you end up with the following optimization problem. And if you minimize this, you have hopefully a reasonable set of features for all of your movies.
So putting everything together, what we, the algorithm we talked about in the previous video and the algorithm that we just talked about in this video. In the previous video, what we showed was that you know, if you have a set of movie ratings, so if you have the data the rij’s and then you have the yij’s that will be the movie ratings. Then given features for your different movies we can learn these parameters theta. So if you knew the features, you can learn the parameters theta for your different users. And what we showed earlier in this video is that if your users are willing to give you parameters, then you can estimate features for the different movies. So this is kind of a chicken and egg problem. Which comes first? You know, do we want if we can get the thetas, we can know the Xs. If we have the Xs, we can learn the thetas. And what you can do is, and then this actually works, what you can do is in fact randomly guess some value of the thetas. Now based on your initial random guess for the thetas, you can then go ahead and use the procedure that we just talked about in order to learn features for your different movies. Now given some initial set of features for your movies you can then use this first method that we talked about in the previous video to try to get an even better estimate for your parameters theta. Now that you have a better setting of the parameters theta for your users, we can use that to maybe even get a better set of features and so on. We can sort of keep iterating, going back and forth and optimizing theta, x theta, x theta, nd this actually works and if you do this, this will actually cause your album to converge to a reasonable set of features for you movies and a reasonable set of parameters for your different users. So this is a basic collaborative filtering algorithm. This isn’t actually the final algorithm that we’re going to use. In the next video we are going to be able to improve on this algorithm and make it quite a bit more computationally efficient. But, hopefully this gives you a sense of how you can formulate a problem where you can simultaneously learn the parameters and simultaneously learn the features from the different movies. And for this problem, for the recommender system problem, this is possible only because each user rates multiple movies and hopefully each movie is rated by multiple users. And so you can do this back and forth process to estimate theta and x.
So to summarize, in this video we’ve seen an initial collaborative filtering algorithm. The term collaborative filtering refers to the observation that when you run this algorithm with a large set of users, what all of these users are effectively doing are sort of collaboratively–or collaborating to get better movie ratings for everyone because with every user rating some subset with the movies, every user is helping the algorithm a little bit to learn better features, and then by helping– by rating a few movies myself, I will be helping the system learn better features and then these features can be used by the system to make better movie predictions for everyone else. And so there is a sense of collaboration where every user is helping the system learn better features for the common good. This is this collaborative filtering.
And, in the next video what we going to do is take the ideas that have worked out, and try to develop a better an even better algorithm, a slightly better technique for collaborative filtering.
summary
It can be very difficult to find features such as “amount of romance” or “amount of action” in a movie. To figure this out, we can use feature finders .
We can let the users tell us how much they like the different genres, providing their parameter vector immediately for us.
To infer the features from given parameters, we use the squared error function with regularization over all the users:
$$min_{x^{(1)},\dots,x^{(n_m)}} \dfrac{1}{2} \displaystyle \sum_{i=1}^{n_m} \sum_{j:r(i,j)=1} ((\theta^{(j)})^T x^{(i)} - y^{(i,j)})^2 + \dfrac{\lambda}{2}\sum_{i=1}^{n_m} \sum_{k=1}^{n} (x_k^{(i)})^2$$
You can also randomly guess the values for theta to guess the features repeatedly. You will actually converge to a good set of features.
02_collaborative-filtering-algorithm
In the last couple videos, we talked about the ideas of how, first, if you’re given features for movies, you can use that to learn parameters data for users. And second, if you’re given parameters for the users, you can use that to learn features for the movies.
In this video we’re going to take those ideas and put them together to come up with a collaborative filtering algorithm. So one of the things we worked out earlier is that if you have features for the movies then you can solve this minimization problem to find the parameters theta for your users. And then we also worked that out, if you are given the parameters theta, you can also use that to estimate the features x, and you can do that by solving this minimization problem. So one thing you could do is actually go back and forth. Maybe randomly initialize the parameters and then solve for theta, solve for x, solve for theta, solve for x.
But, it turns out that there is a more efficient algorithm that doesn’t need to go back and forth between the x’s and the thetas, but that can solve for theta and x simultaneously. And here it is. What we are going to do, is basically take both of these optimization objectives, and put them into the same objective. So I’m going to define the new optimization objective j, which is a cost function, that is a function of my features x and a function of my parameters theta. And, it’s basically the two optimization objectives I had on top, but I put together. So, in order to explain this, first, I want to point out that this term over here, this squared error term, is the same as this squared error term and the summations look a little bit different, but let’s see what the summations are really doing. The first summation is sum over all users J and then sum over all movies rated by that user. So, this is really summing over all pairs IJ, that correspond to a movie that was rated by a user. Sum over J says, for every user, the sum of all the movies rated by that user. This summation down here, just does things in the opposite order. This says for every movie I, sum over all the users J that have rated that movie and so, you know these summations, both of these are just summations over all pairs ij for which r of i J is equal to 1. It’s just something over all the user movie pairs for which you have a rating. and so those two terms up there is just exactly this first term, and I’ve just written the summation here explicitly, where I’m just saying the sum of all pairs IJ, such that RIJ is equal to 1. So what we’re going to do is define a combined optimization objective that we want to minimize in order to solve simultaneously for x and theta. And then the other terms in the optimization objective are this, which is a regularization in terms of theta. So that came down here and the final piece is this term which is my optimization objective for the x’s and that became this. And this optimization objective j actually has an interesting property that if you were to hold the x’s constant and just minimize with respect to the thetas then you’d be solving exactly this problem, whereas if you were to do the opposite, if you were to hold the thetas constant, and minimize j only with respect to the x’s, then it becomes equivalent to this. Because either this term or this term is constant if you’re minimizing only the respective x’s or only respective thetas.
So here’s an optimization objective that puts together my cost functions in terms of x and in terms of theta. And in order to come up with just one optimization problem, what we’re going to do, is treat this cost function, as a function of my features x and of my user pro user parameters data and just minimize this whole thing, as a function of both the Xs and a function of the thetas. And really the only difference between this and the older algorithm is that, instead of going back and forth, previously we talked about minimizing with respect to theta then minimizing with respect to x, whereas minimizing with respect to theta, minimizing with respect to x and so on. In this new version instead of sequentially going between the 2 sets of parameters x and theta, what we are going to do is just minimize with respect to both sets of parameters simultaneously.
Finally one last detail is that when we’re learning the features this way. Previously we have been using this convention that we have a feature x0 equals one that corresponds to an interceptor. When we are using this sort of formalism where we’re are actually learning the features, we are actually going to do away with this convention. And so the features we are going to learn x, will be in Rn. Whereas previously we had features x and Rn + 1 including the intercept term. By getting rid of x0 we now just have x in Rn. And so similarly, because the parameters theta is in the same dimension, we now also have theta in RN because if there’s no x0, then there’s no need parameter theta 0 as well. And the reason we do away with this convention is because we’re now learning all the features, right? So there is no need to hard code the feature that is always equal to one. Because if the algorithm really wants a feature that is always equal to 1, it can choose to learn one for itself. So if the algorithm chooses, it can set the feature X1 equals 1. So there’s no need to hard code the feature of 001, the algorithm now has the flexibility to just learn it by itself.
$$J(x,\theta) = \dfrac{1}{2} \displaystyle \sum_{(i,j):r(i,j)=1}((\theta^{(j)})^Tx^{(i)} - y^{(i,j)})^2 + \dfrac{\lambda}{2}\sum_{i=1}^{n_m} \sum_{k=1}^{n} (x_k^{(i)})^2 + \dfrac{\lambda}{2}\sum_{j=1}^{n_u} \sum_{k=1}^{n} (\theta_k^{(j)})^2$$
So, putting everything together, here is our collaborative filtering algorithm. First we are going to initialize x and theta to small random values. And this is a little bit like neural network training, where there we were also initializing all the parameters of a neural network to small random values. Next we’re then going to minimize the cost function using great descent or one of the advance optimization algorithms. So, if you take derivatives you find that the great descent like these and so this term here is the partial derivative of the cost function, I’m not going to write that out, with respect to the feature value Xik and similarly this term here is also a partial derivative value of the cost function with respect to the parameter theta that we’re minimizing. And just as a reminder, in this formula that we no longer have this X0 equals 1 and so we have that x is in Rn and theta is a Rn. In this new formalism, we’re regularizing every one of our perimeters theta, you know, every one of our parameters Xn. There’s no longer the special case theta zero, which was regularized differently, or which was not regularized compared to the parameters theta 1 down to theta. So there is now no longer a theta 0, which is why in these updates, I did not break out a special case for k equals 0. So we then use gradient descent to minimize the cost function j with respect to the features x and with respect to the parameters theta. And finally, given a user, if a user has some parameters, theta, and if there’s a movie with some sort of learned features x, we would then predict that that movie would be given a star rating by that user of theta transpose j. Or just to fill those in, then we’re saying that if user J has not yet rated movie I, then what we do is predict that user J is going to rate movie I according to theta J transpose Xi.
So that’s the collaborative filtering algorithm and if you implement this algorithm you actually get a pretty decent algorithm that will simultaneously learn good features for hopefully all the movies as well as learn parameters for all the users and hopefully give pretty good predictions for how different users will rate different movies that they have not yet rated.
summary
To speed things up, we can simultaneously minimize our features and our parameters:
$$J(x,\theta) = \dfrac{1}{2} \displaystyle \sum_{(i,j):r(i,j)=1}((\theta^{(j)})^Tx^{(i)} - y^{(i,j)})^2 + \dfrac{\lambda}{2}\sum_{i=1}^{n_m} \sum_{k=1}^{n} (x_k^{(i)})^2 + \dfrac{\lambda}{2}\sum_{j=1}^{n_u} \sum_{k=1}^{n} (\theta_k^{(j)})^2$$
It looks very complicated, but we’ve only combined the cost function for theta and the cost function for x.
Because the algorithm can learn them itself, the bias units where $x_0=1$ have been removed, therefore $x∈ℝ^n$ and $θ∈ℝ^n$.
These are the steps in the algorithm:
- Initialize $x^{(i)},…,x^{(n_m)},\theta^{(1)},…,\theta^{(n_u)}$ to small random values. This serves to break symmetry and ensures that the algorithm learns features $x^{(i)},…,x^{(n_m)}$ that are different from each other.
- Minimize $J(x^{(i)},…,x^{(n_m)},\theta^{(1)},…,\theta^{(n_u)})$ using gradient descent (or an advanced optimization algorithm).E.g. for every $j=1,…,n_u,i=1,…n_m$:
$$x_k^{(i)} := x_k^{(i)} - \alpha\left (\displaystyle \sum_{j:r(i,j)=1}{((\theta^{(j)})^T x^{(i)} - y^{(i,j)}) \theta_k^{(j)}} + \lambda x_k^{(i)} \right)$$
$$\theta_k^{(j)} := \theta_k^{(j)} - \alpha\left (\displaystyle \sum_{i:r(i,j)=1}{((\theta^{(j)})^T x^{(i)} - y^{(i,j)}) x_k^{(i)}} + \lambda \theta_k^{(j)} \right)$$ - For a user with parameters θ and a movie with (learned) features x, predict a star rating of $\theta^Tx$.
03_low-rank-matrix-factorization
01_vectorization-low-rank-matrix-factorization
In the last few videos, we talked about a collaborative filtering algorithm. In this video I’m going to say a little bit about the vectorization implementation of this algorithm. And also talk a little bit about other things you can do with this algorithm.
For example, one of the things you can do is, given one product can you find other products that are related to this so that for example, a user has recently been looking at one product. Are there other related products that you could recommend to this user? So let’s see what we could do about that. What I’d like to do is work out an alternative way of writing out the predictions of the collaborative filtering algorithm. To start, here is our data set with our five movies and what I’m going to do is take all the ratings by all the users and group them into a matrix.
So, here we have five movies and four users, and so this matrix y is going to be a 5 by 4 matrix. It’s just you know, taking all of the elements, all of this data. Including question marks, and grouping them into this matrix. And of course the elements of this matrix of the (i, j) element of this matrix is really what we were previously writing as y superscript i, j. It’s the rating given to movie i by user j.
Given this matrix y of all the ratings that we have, there’s an alternative way of writing out all the predictive ratings of the algorithm. And, in particular if you look at what a certain user predicts on a certain movie, what user j predicts on movie i is given by this formula. And so, if you have a matrix of the predicted ratings, what you would have is the following matrix where the i, j entry. So this corresponds to the rating that we predict using j will give to movie i is exactly equal to that theta j transpose XI, and so, you know, this is a matrix where this first element the one-one element is a predictive rating of user one or movie one and this element, this is the one-two element is the predicted rating of user two on movie one, and so on, and this is the predicted rating of user one on the last movie and if you want, you know, this rating is what we would have predicted for this value and this rating is what we would have predicted for that value, and so on. Now, given this matrix of predictive ratings there is then a simpler or vectorized way of writing these out. In particular if I define the matrix x, and this is going to be just like the matrix we had earlier for linear regression to be sort of x1 transpose x2 transpose down to x of nm transpose. So I’m take all the features for my movies and stack them in rows. So if you think of each movie as one example and stack all of the features of the different movies and rows. And if we also to find a matrix capital theta, and what I’m going to do is take each of the per user parameter vectors, and stack them in rows, like so. So that’s theta 1, which is the parameter vector for the first user. And, you know, theta 2, and so, you must stack them in rows like this to define a matrix capital theta and so I have nu parameter vectors all stacked in rows like this. Now given this definition for the matrix x and this definition for the matrix theta in order to have a vectorized way of computing the matrix of all the predictions you can just compute x times the matrix theta transpose, and that gives you a vectorized way of computing this matrix over here. To give the collaborative filtering algorithm that you’ve been using another name. The algorithm that we’re using is also called low rank matrix factorization. And so if you hear people talk about low rank matrix factorization that’s essentially exactly the algorithm that we have been talking about. And this term comes from the property that this matrix x times theta transpose has a mathematical property in linear algebra called that this is a low rank matrix and so that’s what gives rise to this name low rank matrix factorization for these algorithms, because of this low rank property of this matrix x theta transpose. In case you don’t know what low rank means or in case you don’t know what a low rank matrix is, don’t worry about it. You really don’t need to know that in order to use this algorithm. But if you’re an expert in linear algebra, that’s what gives this algorithm, this other name of low rank matrix factorization.
Finally, having run the collaborative filtering algorithm here’s something else that you can do which is use the learned features in order to find related movies. Specifically for each product i really for each movie i, we’ve learned a feature vector xi. So, you know, when you learn a certain features without really know that can the advance what the different features are going to be, but if you run the algorithm and perfectly the features will tend to capture what are the important aspects of these different movies or different products or what have you. What are the important aspects that cause some users to like certain movies and cause some users to like different sets of movies. So maybe you end up learning a feature, you know, where x1 equals romance, x2 equals action similar to an earlier video and maybe you learned a different feature x3 which is a degree to which this is a comedy. Then some feature x4 which is, you know, some other thing. And you have N features all together and after you have learned features it’s actually often pretty difficult to go in to the learned features and come up with a human understandable interpretation of what these features really are. But in practice, you know, the features even though these features can be hard to visualize. It can be hard to figure out just what these features are. Usually, it will learn features that are very meaningful for capturing whatever are the most important or the most salient properties of a movie that causes you to like or dislike it. And so now let’s say we want to address the following problem. Say you have some specific movie i and you want to find other movies j that are related to that movie. And so well, why would you want to do this? Right, maybe you have a user that’s browsing movies, and they’re currently watching movie j, than what’s a reasonable movie to recommend to them to watch after they’re done with movie j? Or if someone’s recently purchased movie j, well, what’s a different movie that would be reasonable to recommend to them for them to consider purchasing. So, now that you have learned these feature vectors, this gives us a very convenient way to measure how similar two movies are. In particular, movie i has a feature vector xi. and so if you can find a different movie, j, so that the distance between xi and xj is small, then this is a pretty strong indication that, you know, movies j and i are somehow similar. At least in the sense that some of them likes movie i, maybe more likely to like movie j as well. So, just to recap, if your user is looking at some movie i and if you want to find the 5 most similar movies to that movie in order to recommend 5 new movies to them, what you do is find the five movies j, with the smallest distance between the features between these different movies. And this could give you a few different movies to recommend to your user.
So with that, hopefully, you now know how to use a vectorized implementation to compute all the predicted ratings of all the users and all the movies, and also how to do things like use learned features to find what might be movies and what might be products that aren’t related to each other.
summary
Vectorization: Low Rank Matrix Factorization
Given matrices X (each row containing features of a particular movie) and Θ (each row containing the weights for those features for a given user), then the full matrix Y of all predicted ratings of all movies by all users is given simply by: $$Y = X\Theta^T$$.
Predicting how similar two movies i and j are can be done using the distance between their respective feature vectors x. Specifically, we are looking for a small value of $||x^{(i)} - x^{(j)}||$.
02_implementational-detail-mean-normalization
By now you’ve seen all of the main pieces of the recommender system algorithm or the collaborative filtering algorithm. In this video I want to just share one last implementational detail, namely mean normalization, which can sometimes just make the algorithm work a little bit better.
To motivate the idea of mean normalization, let’s consider an example of where there’s a user that has not rated any movies. So, in addition to our four users, Alice, Bob, Carol, and Dave, I’ve added a fifth user, Eve, who hasn’t rated any movies. Let’s see what our collaborative filtering algorithm will do on this user. Let’s say that n is equal to 2 and so we’re going to learn two features and we are going to have to learn a parameter vector theta 5, which is going to be in R2, remember this is now vectors in Rn not Rn+1, we’ll learn the parameter vector theta 5 for our user number 5, Eve. So if we look in the first term in this optimization objective, well the user Eve hasn’t rated any movies, so there are no movies for which Rij is equal to one for the user Eve and so this first term plays no role at all in determining theta 5 because there are no movies that Eve has rated. And so the only term that effects theta 5 is this term. And so we’re saying that we want to choose vector theta 5 so that the last regularization term is as small as possible. In other words we want to minimize this lambda over 2 theta 5 subscript 1 squared plus theta 5 subscript 2 squared so that’s the component of the regularization term that corresponds to user 5, and of course if your goal is to minimize this term, then what you’re going to end up with is just theta 5 equals 0 0. Because a regularization term is encouraging us to set parameters close to 0 and if there is no data to try to pull the parameters away from 0, because this first term doesn’t effect theta 5, we just end up with theta 5 equals the vector of all zeros. And so when we go to predict how user 5 would rate any movie, we have that theta 5 transpose xi, for any i, that’s just going to be equal to zero. Because theta 5 is 0 for any value of x, this inner product is going to be equal to 0. And what we’re going to have therefore, is that we’re going to predict that Eve is going to rate every single movie with zero stars. But this doesn’t seem very useful does it? I mean if you look at the different movies, Love at Last, this first movie, a couple people rated it 5 stars. And for even the Swords vs. Karate, someone rated it 5 stars. So some people do like some movies. It seems not useful to just predict that Eve is going to rate everything 0 stars. And in fact if we’re predicting that eve is going to rate everything 0 stars, we also don’t have any good way of recommending any movies to her, because you know all of these movies are getting exactly the same predicted rating for Eve so there’s no one movie with a higher predicted rating that we could recommend to her, so, that’s not very good. The idea of mean normalization will let us fix this problem. So here’s how it works. As before let me group all of my movie ratings into this matrix Y, so just take all of these ratings and group them into matrix Y. And this column over here of all question marks corresponds to Eve’s not having rated any movies.
Now to perform mean normalization what I’m going to do is compute the average rating that each movie obtained. And I’m going to store that in a vector that we’ll call mu. So the first movie got two 5-star and two 0-star ratings, so the average of that is a 2.5-star rating. The second movie had an average of 2.5-stars and so on. And the final movie that has 0, 0, 5, 0. And the average of 0, 0, 5, 0, that averages out to an average of 1.25 rating. And what I’m going to do is look at all the movie ratings and I’m going to subtract off the mean rating. So this first element 5 I’m going to subtract off 2.5 and that gives me 2.5. And the second element 5 subtract off of 2.5, get a 2.5. And then the 0, 0, subtract off 2.5 and you get -2.5, -2.5. In other words, what I’m going to do is take my matrix of movie ratings, take this wide matrix, and subtract form each row the average rating for that movie. So, what I’m doing is just normalizing each movie to have an average rating of zero. And so just one last example. If you look at this last row, 0 0 5 0. We’re going to subtract 1.25, and so I end up with these values over here. So now and of course the question marks stay a question mark. So each movie in this new matrix Y has an average rating of 0. What I’m going to do then, is take this set of ratings and use it with my collaborative filtering algorithm. So I’m going to pretend that this was the data that I had gotten from my users, or pretend that these are the actual ratings I had gotten from the users, and I’m going to use this as my data set with which to learn my parameters theta J and my features XI - from these mean normalized movie ratings. When I want to make predictions of movie ratings, what I’m going to do is the following: for user J on movie I, I’m gonna predict theta J transpose XI, where X and theta are the parameters that I’ve learned from this mean normalized data set. But, because on the data set, I had subtracted off the means in order to make a prediction on movie i, I’m going to need to add back in the mean, and so i’m going to add back in mu i. And so that’s going to be my prediction where in my training data subtracted off all the means and so when we make predictions and we need to add back in these means mu i for movie i. And so specifically if you user 5 which is Eve, the same argument as the previous slide still applies in the sense that Eve had not rated any movies and so the learned parameter for user 5 is still going to be equal to 0, 0. And so what we’re going to get then is that on a particular movie i we’re going to predict for Eve theta 5, transpose xi plus add back in mu i and so this first component is going to be equal to zero, if theta five is equal to zero. And so on movie i, we are going to end a predicting mu i. And, this actually makes sense. It means that on movie 1 we’re going to predict Eve rates it 2.5. On movie 2 we’re gonna predict Eve rates it 2.5. On movie 3 we’re gonna predict Eve rates it at 2 and so on. This actually makes sense, because it says that if Eve hasn’t rated any movies and we just don’t know anything about this new user Eve, what we’re going to do is just predict for each of the movies, what are the average rating that those movies got. Finally, as an aside, in this video we talked about mean normalization, where we normalized each row of the matrix y, to have mean 0. In case you have some movies with no ratings, so it is analogous to a user who hasn’t rated anything, but in case you have some movies with no ratings, you can also play with versions of the algorithm, where you normalize the different columns to have means zero, instead of normalizing the rows to have mean zero, although that’s maybe less important, because if you really have a movie with no rating, maybe you just shouldn’t recommend that movie to anyone, anyway. And so, taking care of the case of a user who hasn’t rated anything might be more important than taking care of the case of a movie that hasn’t gotten a single rating.
So to summarize, that’s how you can do mean normalization as a sort of pre-processing step for collaborative filtering. Depending on your data set, this might some times make your implementation work just a little bit better.
summary
If the ranking system for movies is used from the previous lectures, then new users (who have watched no movies), will be assigned new movies incorrectly. Specifically, they will be assigned θ with all components equal to zero due to the minimization of the regularization term. That is, we assume that the new user will rank all movies 0, which does not seem intuitively correct.
We rectify this problem by normalizing the data relative to the mean. First, we use a matrix Y to store the data from previous ratings, where the ith row of Y is the ratings for the ith movie and the jth column corresponds to the ratings for the jth user.
We can now define a vector $\mu = [\mu_1, \mu_2, \dots , \mu_{n_m}]$ such that $\mu_i = \frac{\sum_{j:r(i,j)=1}{Y_{i,j}}}{\sum_{j}{r(i,j)}}$
Which is effectively the mean of the previous ratings for the ith movie (where only movies that have been watched by users are counted). We now can normalize the data by subtracting u, the mean rating, from the actual ratings for each user (column in matrix Y):
As an example, consider the following matrix Y and mean ratings μ:
$$Y = \begin{bmatrix} 5 & 5 & 0 & 0 \ 4 & ? & ? & 0 \ 0 & 0 & 5 & 4 \ 0 & 0 & 5 & 0 \ \end{bmatrix}, \quad \mu = \begin{bmatrix} 2.5 \ 2 \ 2.25 \ 1.25 \ \end{bmatrix}$$
The resulting Y′ vector is:
$$Y’ = \begin{bmatrix} 2.5 & 2.5 & -2.5 & -2.5 \ 2 & ? & ? & -2 \ -.2.25 & -2.25 & 3.75 & 1.25 \ -1.25 & -1.25 & 3.75 & -1.25 \end{bmatrix}$$
Now we must slightly modify the linear regression prediction to include the mean normalization term:
$$(\theta^{(j)})^T x^{(i)} + \mu_i$$
Now, for a new user, the initial predicted values will be equal to the μ term instead of simply being initialized to zero, which is more accurate.