Note
This personal note was written after studying the open course Machine Learning by Andrew Ng on Coursera. All images and audio in this note come from that course.
01_motivation
01_motivation-i-data-compression
In this video, I’d like to start talking about a second type of unsupervised learning problem called dimensionality reduction.
There are a couple of different reasons why one might want to do dimensionality reduction. One is data compression, and as we’ll see later, a few videos later, data compression not only allows us to compress the data and have it therefore use up less computer memory or disk space, but it will also allow us to speed up our learning algorithms.
But first, let’s start by talking about what is dimensionality reduction. As a motivating example, let’s say that we’ve collected a data set with many, many, many features, and I’ve plotted just two of them here.
And let’s say that, unknown to us, two of the features were actually the length of something in centimeters, and a different feature, x2, is the length of the same thing in inches. So this gives us a highly redundant representation, and maybe instead of having two separate features x1 and x2, both of which basically measure the length, what we want to do is reduce the data to one dimension and just have one number measuring this length. In case this example seems a bit contrived, this centimeters-and-inches example is actually not that unrealistic, and not that different from things that I see happening in industry. If you have hundreds or thousands of features, it is often this easy to lose track of exactly what features you have. Sometimes you may have a few different engineering teams: maybe one engineering team gives you two hundred features, a second engineering team gives you another three hundred features, and a third engineering team gives you five hundred features, so you have a thousand features all together. It actually becomes hard to keep track of exactly which features you got from which team, and it’s actually not that hard to end up with highly redundant features like these. And if the length in centimeters were rounded off to the nearest centimeter, and the length in inches were rounded off to the nearest inch, then that’s why these examples don’t lie perfectly on a straight line: because of round-off error to the nearest centimeter or the nearest inch. And if we can reduce the data to one dimension instead of two dimensions, that reduces the redundancy.
For a different example, maybe one that seems a little less contrived: for many years I’ve been working with autonomous helicopters, or I’ve been working with pilots that fly helicopters. And so, if you were to do a survey or a test of these different pilots, you might have one feature, x1, which is maybe the skill of these helicopter pilots, and maybe x2 could be the pilot enjoyment, that is, how much they enjoy flying. And maybe these two features will be highly correlated, and what you really care about might be this direction, a different feature that really measures pilot aptitude. I’m making up the name aptitude, of course, but again, if you have highly correlated features, maybe you really want to reduce the dimension.
So, let me say a little bit more about what it really means to reduce the dimension of the data from two dimensions, from 2D, down to one dimension, or 1D. Let me color in these examples by using different colors.
And in this case, by reducing the dimension what I mean is that I would like to find maybe this line, this direction on which most of the data seems to lie, and project all the data onto that line. By doing so, what I can do is just measure the position of each of the examples on that line and come up with a new feature, z1. To specify a position on the line I need only one number, so z1 is a new feature that specifies the location of each of those points on this green line. What this means is that whereas previously, if x(1) was my first example, then in order to represent x(1) originally I needed a two-dimensional number, or a two-dimensional feature vector; instead, I can now use just z1 to represent my first example, and that’s going to be a real number. And similarly, if x(2) is my second example over there, then whereas previously this required two numbers to represent, if I instead compute the projection of that black cross onto the line, now I only need one real number, z2, to represent the location of this point on the line. And so on through my m examples. So, just to summarize: if we allow ourselves to approximate the original data set by projecting all of my original examples onto this green line over here, then I need only one real number to specify the position of a point on the line, and so I can use just one number to represent the location of each of my training examples after they’ve been projected onto that green line. So this is an approximation to the original training set, because I have projected all of my training examples onto a line. But now I need to keep around only one number for each of my examples, and so this halves the memory requirement, or space requirement, or what have you, for storing my data. And perhaps more interestingly, more importantly, what we’ll see in a later video as well is that this will allow us to make our learning algorithms run more quickly. And that is actually, perhaps, even the more interesting application of this data compression, rather than reducing the memory or disk space requirement for storing the data.
On the previous slide we showed an example of reducing data from 2D to 1D. On this slide, I’m going to show another example of reducing data from three-dimensional, 3D, to two-dimensional, 2D. By the way, in a more typical example of dimensionality reduction we might have thousand-dimensional data, or 1000D data, that we want to reduce to, let’s say, a hundred dimensions, or 100D, but because of the limitations of what I can plot on the slide, I’m going to use examples of 3D to 2D, or 2D to 1D. So, let’s have a data set like that shown here. I would have a set of examples x(i) which are points in R3, so I have three-dimensional examples. I know it might be a little bit hard to see this on the slide, but I’ll show a 3D point cloud in a little bit. It might be hard to see here, but all of this data maybe lies roughly on a plane, like so. And so what we can do with dimensionality reduction is take all of this data and project it down onto a two-dimensional plane. So, here what I’ve done is, I’ve taken all the data and projected it so that it all lies on the plane. Now, finally, in order to specify the location of a point within a plane, we need two numbers, right? We need to specify the location of a point along this axis, and then also specify its location along that axis. So we need two numbers, maybe called z1 and z2, to specify the location of a point within a plane. And so, what that means is that we can now represent each training example using the two numbers that I’ve drawn here, z1 and z2. So our data can be represented using vectors z which are in R2. And by these subscripts, z subscript 1 and z subscript 2, what I just mean is that my vectors z are two-dimensional vectors with components z1 and z2. And so a particular example z(i) is the two-dimensional vector with components z(i)1 and z(i)2. On the previous slide, when I was reducing data to one dimension, I had only z1, right? And that is what z subscript 1 on the previous slide was; but here I have two-dimensional data, so I have z1 and z2 as the two components of the data.
Now, let me just make sure that these figures make sense. So let me just reshow these exact three figures again but with 3D plots.
So the process we went through was the one shown here: on the left is the original data set, in the middle the data set projected onto 2D, and on the right the 2D data set with z1 and z2 as the axes. Let’s look at them a little bit further. Here’s my original data set, shown on the left, and so I had started off with a 3D point cloud like so, where the axes are labeled x1, x2, x3, and so these are 3D points, but most of the data maybe lies roughly, not too far, from some 2D plane. So what we can do is take this data, and here’s my middle figure: I’m going to project it onto 2D. So I’ve projected this data so that all of it now lies on this 2D surface. As you can see, all the data lies on a plane, ‘cause we’ve projected everything onto a plane, and so what this means is that now I need only two numbers, z1 and z2, to represent the location of a point on the plane. And so that’s the process that we can go through to reduce our data from three dimensions to two dimensions. So that’s dimensionality reduction and how we can use it to compress our data. And as we’ll see later, this will allow us to make some of our learning algorithms run much faster as well, but we’ll get to that only in a later video.
summary
- We may want to reduce the dimension of our features if we have a lot of redundant data.
- To do this, we find two highly correlated features, plot them, and find a new line that seems to describe both features accurately. We then project all of our examples onto this single line to get a single new feature (a toy sketch of this projection follows below).
Doing dimensionality reduction will reduce the total data we have to store in computer memory and will speed up our learning algorithm.
Note: in dimensionality reduction, we are reducing our features rather than our number of examples. Our variable m will stay the same size; n, the number of features each example from $x^{(1)}$ to $x^{(m)}$ carries, will be reduced.
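To make the projection described in this summary concrete, here is a toy Octave sketch; the five data points and the direction vector u are made-up values for illustration, not anything from the course.

% Toy 2D data (made-up values) whose two features are highly correlated
X = [2.5 2.4; 0.5 0.7; 2.2 2.9; 1.9 2.2; 3.1 3.0];  % one example per row (m x 2)
u = [1; 1] / sqrt(2);     % an assumed unit-length direction for the line
z = X * u;                % m x 1: one real number per example (the new feature)
X_approx = z * u';        % the examples projected back onto the line in 2D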
02_motivation-ii-visualization
In the last video, we talked about dimensionality reduction for the purpose of compressing the data. In this video, I’d like to tell you about a second application of dimensionality reduction, and that is to visualize the data. For a lot of machine learning applications, it really helps us to develop effective learning algorithms if we can understand our data better, if there is some way of visualizing the data better, and so dimensionality reduction often offers us another useful tool to do so.
Let’s start with an example. Let’s say we’ve collected a large data set of many statistics and facts about different countries around the world. So maybe the first feature, x1, is the country’s GDP, or Gross Domestic Product, x2 is the per capita GDP, meaning the per person GDP, x3 is the human development index, x4 is life expectancy, then x5, x6, and so on. And we may have a huge data set like this, with maybe 50 features for every country, and a huge set of countries. So is there something we can do to try to understand our data better? Given this huge table of numbers, how do you visualize this data? If you have 50 features, it’s very difficult to plot 50-dimensional data.
What is a good way to examine this data? Using dimensionality reduction, what we can do is, instead of having each country represented by this feature vector x(i), which is 50-dimensional, so instead of having a country like Canada represented by 50 numbers, let’s say we can come up with a different feature representation: these z vectors, which are in R2.
If that’s the case, if we can have just a pair of numbers, z1 and z2, that somehow summarizes my 50 numbers, then maybe what we can do is plot these countries in R2 and use that to try to understand the space of features of different countries a little better. So here, what you can do is reduce the data from 50 dimensions, from 50D, to 2D, so you can plot this as a two-dimensional plot. And when you do that, it turns out that if you look at the output of the dimensionality reduction algorithm, it usually doesn’t ascribe a physical meaning to these new features $z_1,z_2$. It’s often up to us to figure out roughly what these features mean. But if you plot those features, here is what you might find.
So here, every country is represented by a point z(i), which is in R2, and each of those dots in this figure represents a country. So here’s z1 and here’s z2. And you might find, for example, that the horizontal axis, the z1 axis, corresponds roughly to the overall country size, or the overall economic activity of a country: the overall GDP, the overall economic size of a country. Whereas the vertical axis in our data might correspond to the per person GDP, or the per person well-being, or the per person economic activity. And you might find that, given these 50 features, these are really the two main dimensions of the variation. And so out here you may have a country like the U.S.A., which has a very large GDP and a relatively high per-person GDP as well. Whereas here you might have a country like Singapore, which actually has a very high per person GDP as well, but because Singapore is a much smaller country, the overall economic size of Singapore is much smaller than the US. And over here you would have countries where individuals are, unfortunately, somewhat less well off: maybe shorter life expectancy, less health care, less economic maturity, maybe smaller countries. Whereas a point like this would correspond to a country that has a substantial amount of economic activity, but where individuals tend to be somewhat less well off. So you might find that the axes z1 and z2 can help you most succinctly capture what are really the two main dimensions of the variation amongst different countries, such as the overall economic activity of the country, reflected by the size of the country’s overall economy, as well as the per-person individual well-being, measured by per-person GDP, per-person healthcare, and things like that. So that’s how you can use dimensionality reduction to reduce data from 50 dimensions, or whatever, down to two dimensions, or maybe down to three dimensions, so that you can plot it and understand your data better.
In the next video, we’ll start to develop a specific algorithm, called PCA, or Principal Component Analysis, which will allow us to do this and also do the earlier application I talked about of compressing the data.
summary
Motivation II: Visualization
It is not easy to visualize data that is more than three dimensions. We can reduce the dimensions of our data to 3 or less in order to plot it.
We need to find new features, $z_1,z_2$ (and perhaps $z_3$ ) that can effectively summarize all the other features.
Example: hundreds of features related to a country’s economic system may all be combined into one feature that you call “Economic Activity.”
02_principal-component-analysis
01_principal-component-analysis-problem-formulation
For the problem of dimensionality reduction, by far the most popular, by far the most commonly used algorithm is something called principal component analysis, or PCA.
In this video, I’d like to start talking about the problem formulation for PCA. In other words, let’s try to formulate, precisely, exactly what we would like PCA to do. Let’s say we have a data set like this. So, this is a data set of examples x in R2, and let’s say I want to reduce the dimension of the data from two-dimensional to one-dimensional. In other words, I would like to find a line onto which to project the data. So what seems like a good line onto which to project the data? A line like this might be a pretty good choice. And the reason we think this might be a good choice is that if you look at where the projected versions of the points fall, so I take this point and project it down here, this point gets projected here, to here, to here, to here, what we find is that the distance between each point and its projected version is pretty small. That is, these blue line segments are pretty short. So what PCA does, formally, is it tries to find a lower-dimensional surface, really a line in this case, onto which to project the data so that the sum of squares of these little blue line segments is minimized. The length of those blue line segments is sometimes also called the projection error. And so what PCA does is it tries to find a surface onto which to project the data so as to minimize that.
As an aside, before applying PCA, it’s standard practice to first perform mean normalization and feature scaling, so that the features x1 and x2 have zero mean and comparable ranges of values. I’ve already done this for this example, but I’ll come back to this later and talk more about feature scaling and mean normalization in the context of PCA. But coming back to this example, in contrast to the red line that I just drew, here’s a different line onto which I could project my data, which is this magenta line. And, as we’ll see, this magenta line is a much worse direction onto which to project my data, right? So if I were to project my data onto the magenta line, we’d get a set of points like that. And the projection errors, that is, these blue line segments, will be huge. So these points have to move a huge distance in order to get projected onto the magenta line. And so that’s why PCA, principal component analysis, will choose something like the red line rather than the magenta line down here.
Let’s write out the PCA problem a little more formally. The goal of PCA, if we want to reduce data from two dimensions to one dimension, is that we’re going to try to find a vector u(1), which is in Rn, so in R2 in this case, that gives the direction onto which to project the data so as to minimize the projection error. So in this example I’m hoping that PCA will find this vector, which I want to call u(1), so that when I project the data onto the line that I define by extending out this vector, I end up with pretty small reconstruction errors, and the projected data looks like this. And by the way, I should mention that whether PCA gives me u(1) or -u(1) doesn’t matter. So if it gives me a positive vector in this direction, that’s fine; if it gives me the opposite vector facing in the opposite direction, that would be minus u(1), let’s draw that in blue instead. Whether it gives a positive u(1) or a negative u(1) doesn’t matter, because each of these vectors defines the same red line onto which I’m projecting my data. So this is the case of reducing data from two dimensions to one dimension. In the more general case, we have n-dimensional data and we want to reduce it to k dimensions. In that case we want to find not just a single vector onto which to project the data, but k directions onto which to project the data, so as to minimize this projection error. So here’s an example: if I have a 3D point cloud like this, then maybe what I want to do is find a pair of vectors, let’s draw these in red, a pair of vectors extending from the origin, here’s u(1) and here’s u(2), and together they define a plane, or a 2D surface, onto which I am going to project my data. For those of you that are familiar with linear algebra, for those of you that are really experts in linear algebra, the formal definition is that we are going to find a set of vectors u(1), u(2), maybe up to u(k), and what we’re going to do is project the data onto the linear subspace spanned by this set of k vectors. But if you’re not familiar with linear algebra, just think of it as finding k directions, instead of just one direction, onto which to project the data. So finding a k-dimensional surface, really a 2D plane in this case, shown in this figure, where we can define the position of the points in a plane using k directions. And that’s why for PCA we want to find k vectors onto which to project the data. So more formally, in PCA what we want to do is find a way to project the data so as to minimize the projection distance, which is the distance between the points and their projections. And so in this 3D example too: given a point, we would take the point and project it onto this 2D surface, and the projection error would be the distance between the point and where it gets projected down onto my 2D surface. And so what PCA does is try to find a line, or a plane, or whatever, onto which to project the data, to try to minimize that squared projection error, that 90-degree, or orthogonal, projection error.
Finally, one question I sometimes get asked is how PCA relates to linear regression, because when explaining PCA, I sometimes end up drawing diagrams like these, and that looks a little bit like linear regression. It turns out PCA is not linear regression, and despite some cosmetic similarity, these are actually totally different algorithms. If we were doing linear regression, what we would do, on the left, is try to predict the value of some variable y given some input features x. And so in linear regression, what we’re doing is fitting a straight line so as to minimize the squared error between each point and the straight line. So what we’re minimizing would be the squared magnitude of these blue lines. And notice that I’m drawing these blue lines vertically: these blue lines are the vertical distances between the points and the values predicted by the hypothesis. Whereas in contrast, in PCA, what it does is try to minimize the magnitude of these blue lines, which are drawn at an angle. These are really the shortest, orthogonal distances, the shortest distances between each point x and this red line. And this gives very different effects, depending on the dataset. And more generally, when you’re doing linear regression, there is this distinguished variable y that we’re trying to predict; all that linear regression does is take all the values of x and try to use them to predict y. Whereas in PCA, there is no distinguished, or special, variable y that we’re trying to predict. Instead, we have a list of features x1, x2, and so on, up to xn, and all of these features are treated equally, so no one of them is special.
As one last example, if I have three-dimensional data and I want to reduce data from 3D to 2D, so maybe I wanna find two directions, u(1) and u(2), onto which to project my data. Then what I have is I have three features, x1, x2, x3, and all of these are treated alike. All of these are treated symmetrically and there’s no special variable y that I’m trying to predict. And so PCA is not a linear regression, and even though at some cosmetic level they might look related, these are actually very different algorithms. So hopefully you now understand what PCA is doing. It’s trying to find a lower dimensional surface onto which to project the data, so as to minimize this squared projection error. To minimize the square distance between each point and the location of where it gets projected. In the next video, we’ll start to talk about how to actually find this lower dimensional surface onto which to project the data.
summary
Principal Component Analysis Problem Formulation
The most popular dimensionality reduction algorithm is Principal Component Analysis (PCA)
Problem formulation
Given two features, $x_1$ and $x_2$ we want to find a single line that effectively describes both features at once. We then map our old features onto this new line to get a new single feature.
The same can be done with three features, where we map them to a plane.
The goal of PCA is to minimize the average of the squared distances from every point to the projection line. This is the projection error.
Reduce from 2d to 1d: find a direction (a vector $u^{(1)} \in \mathbb{R}^n$) onto which to project the data so as to minimize the projection error.
The more general case is as follows:
Reduce from n-dimension to k-dimension: Find k vectors $u^{(1)}, u^{(2)}, \dots, u^{(k)}$ onto which to project the data so as to minimize the projection error.
If we are converting from 3d to 2d, we will project our data onto two directions (a plane), so k will be 2.
PCA is not linear regression
In linear regression, we are minimizing the squared error from every point to our predictor line. These are vertical distances.
In PCA, we are minimizing the shortest distance, or shortest orthogonal distance, from each data point to the projection line.
More generally, in linear regression we are taking all our examples in x and applying the parameters in Θ to predict y.
In PCA, we are taking a number of features $x_1, x_2, \dots, x_n$, and finding the lower-dimensional surface that is closest to all of them. We aren’t trying to predict any result and we aren’t applying any theta weights to the features.
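As a side-by-side comparison of the two objectives, written in the notation this note already uses (and assuming the usual course convention of a $\frac{1}{2m}$ factor for the linear regression cost), the quantities being minimized are roughly:

$$\text{Linear regression:}\quad \min_{\theta}\ \dfrac{1}{2m}\sum^m_{i=1}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2 \qquad\qquad \text{PCA:}\quad \min_{u^{(1)},\dots,u^{(k)}}\ \dfrac{1}{m}\sum^m_{i=1}||x^{(i)} - x_{approx}^{(i)}||^2$$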
02_principal-component-analysis-algorithm
In this video I’d like to tell you about the principal component analysis algorithm. By the end of this video, you’ll know how to implement PCA for yourself and use it to reduce the dimension of your data. Before applying PCA, there is a data pre-processing step which you should always do.
Given the training set of examples, it is important to always perform mean normalization, and then, depending on your data, maybe perform feature scaling as well. This is very similar to the mean normalization and feature scaling process that we have for supervised learning. In fact, it’s exactly the same procedure, except that we’re doing it now to our unlabeled data, x(1) through x(m). So for mean normalization, we first compute the mean of each feature, and then we replace each feature x with x minus its mean, and so this makes each feature now have exactly zero mean. If the different features have very different scales, for example, if x1 is the size of a house and x2 is the number of bedrooms, to use our earlier example, we then also scale each feature to have a comparable range of values. And so, similar to what we had with supervised learning, we would take x superscript i subscript j, that’s the jth feature of the ith example, subtract off the mean, that’s what we have on top, and then divide by sj. Here, sj is some measure of the range of values of feature j. So it could be the max minus min value, or, more commonly, it is the standard deviation of feature j. Having done this sort of data pre-processing, here’s what the PCA algorithm does.
We saw from the previous video that what PCA does is try to find a lower-dimensional subspace onto which to project the data, so as to minimize the sum of the squared projection errors, the squares of the lengths of those blue lines. And so what we want to do specifically is find a vector u(1) which specifies that direction, or, if we want to reduce to two dimensions, find two vectors, u(1) and u(2), to define the surface onto which to project the data. So, just as a quick reminder of what reducing the dimension of the data means: for this example on the left, we were given the examples x(i), which are in R2, and what we’d like to do is find a set of numbers z(i) in R to represent our data. So that’s what reduction from 2D to 1D means. Specifically, by projecting the data onto this red line, we need only one number to specify the position of a point on the line. So I’m going to call that number z, or z1. z here is a real number, so it’s like a one-dimensional vector, and z1 just refers to the first component of this one-by-one matrix, or this one-dimensional vector. And so we need only one number to specify the position of a point. So if this example here was my example x(1), then maybe that gets mapped here, and if this example was x(2), maybe that example gets mapped there. And so this point here will be z(1) and this point here will be z(2), and similarly we would have those other points: maybe x(3), x(4), x(5) get mapped to z(3), z(4), z(5). So what PCA has to do is come up with a way to compute two things: one is how to compute these vectors, u(1), or in this case u(1) and u(2), and the other is how to compute these numbers z. So in the example on the left, we’re reducing the data from 2D to 1D. In the example on the right, we would be reducing data from three-dimensional, as in R3, to z(i), which is now two-dimensional. So these z vectors would now be two-dimensional vectors, z1 and z2, and so we need to give a way to compute these new representations, the z1 and z2 of the data, as well. So how do you compute all of these quantities? It turns out that the mathematical derivation, the mathematical proof, of what is the right value of u(1), u(2), z(1), z(2), and so on, is very complicated and beyond the scope of the course. But it turns out that the procedure to actually find the value of u(1) that you want is not that hard, even though the mathematical proof that this value is the correct one is somewhat more involved than I want to get into. Let me just describe the specific procedure that you have to implement in order to compute all of these things: the vectors u(1), u(2), and the vectors z.
Here’s the procedure. Let’s say we want to reduce the data from n dimensions to k dimensions. What we’re going to do is first compute something called the covariance matrix, which is commonly denoted by the capital Greek letter Sigma. It’s a bit unfortunate that the Greek letter Sigma looks exactly like the summation symbol: this Greek letter Sigma is used to denote a matrix, whereas this here is a summation symbol. Hopefully in these slides there won’t be ambiguity about which is the Sigma matrix and which is a summation symbol, and hopefully it will be clear from context when I’m using each one. So how do you compute this matrix? Let’s say we want to store it in an Octave variable called Sigma. What we need to do is compute something called the eigenvectors of the matrix Sigma, and in Octave, the way you do that is you use the command [U, S, V] = svd(Sigma). SVD, by the way, stands for singular value decomposition. This is more advanced linear algebra than you actually need to know, but it turns out that when Sigma is a covariance matrix, there are a few ways to compute these eigenvectors. If you are an expert in linear algebra and you’ve heard of eigenvectors before, you may know that there is another Octave function called eig which can also be used to compute the same thing, and it turns out that the svd function and the eig function will give you the same vectors, although svd is a little more numerically stable. So I tend to use svd, although I have a few friends that use the eig function to do this as well; when you apply it to a covariance matrix Sigma, it gives you the same thing. This is because the covariance matrix always satisfies a mathematical property called symmetric positive definite. You really don’t need to know what that means, but svd and eig are different functions, and when they are applied to a covariance matrix, which can be proved to always satisfy this mathematical property, they’ll always give you the same thing. Okay, that was probably much more linear algebra than you needed to know; in case none of that made sense, don’t worry about it. All you need to know is that this is the one command you should implement in Octave. And if you’re implementing this in a different language than Octave or MATLAB, what you should do is find a numerical linear algebra library that can compute the SVD, or singular value decomposition; there are many such libraries for probably all of the major programming languages. You can use that to compute the matrices U, S, and V of the covariance matrix Sigma. So just to fill in some more details: this covariance matrix Sigma will be an n by n matrix, and one way to see that is, if you look at the definition, x(i) is an n by 1 vector and x(i) transpose is 1 by n, so the product of these two is an n by n matrix, and when we add up all of these terms, we still have an n by n matrix. And what the SVD outputs is three matrices, U, S, and V. The thing you really need out of the SVD is the U matrix. The U matrix will also be an n by n matrix, and if we look at the columns of the U matrix, it turns out that the columns of the U matrix will be exactly those vectors, u(1), u(2), and so on. So U will be an n by n matrix, and if we want to reduce the data from n dimensions down to k dimensions, then what we need to do is take the first k vectors;
that gives us u(1) up to u(k), which are the k directions onto which we want to project the data.
So from this SVD numerical linear algebra routine we get this matrix U; we’ll call its columns u(1) through u(n). So, just to wrap up the description of the rest of the procedure: from the SVD routine we get these matrices U, S, and V, and we’re going to use the first k columns of this U matrix to get u(1) through u(k). Now, the other thing we need to do is take my original data set x, which is in Rn, and find a lower-dimensional representation z, which is in Rk, for this data. So the way we’re going to do that is take the first k columns of the U matrix and construct a matrix by stacking up u(1), u(2), and so on up to u(k) in columns. It’s really basically taking this part of the matrix, the first k columns, and so this is going to be an n by k matrix. I’m going to give this matrix a name: I’m going to call it U subscript “reduce,” sort of a reduced version of the U matrix, and I’m going to use it to reduce the dimension of my data. And the way I’m going to compute z is to let z be equal to this Ureduce matrix transpose times x. Or alternatively, to write down what this transpose means: when I take the transpose of this Ureduce matrix, what I end up with is these vectors now in rows, u(1) transpose down to u(k) transpose; then I take that times x, and that’s how I get my vector z. Just to make sure that these dimensions make sense: this matrix here is going to be k by n, and x here is going to be n by 1, so the product will be k by 1. And so z is a k-dimensional vector, which is exactly what we wanted. And of course these x’s here can be examples in our training set, examples in our cross-validation set, or examples in our test set, and, for example, if I wanted to take training example i, I can write this as x(i), and that’s what gives me z(i) over there.
So, to summarize, here’s the PCA algorithm on one slide. First, mean normalization, to ensure that every feature has zero mean, and optional feature scaling; you really should do feature scaling if your features take on very different ranges of values. After this pre-processing, we compute the covariance matrix Sigma like so. By the way, if your data is given as a matrix, if you have your data given in rows, if you have a matrix X which is your training set written in rows, where the rows are x(1) transpose down to x(m) transpose, then this covariance matrix Sigma actually has a nice vectorized implementation. In Octave you can just run Sigma equals 1 over m, times X, which is this matrix up here, transpose, times X, and this simple expression is the vectorized implementation of how to compute the matrix Sigma. I’m not going to prove that today; this is the correct vectorization. If you want, you can either numerically test this yourself by trying it out in Octave and making sure that both implementations give the same answer, or you can try to prove it yourself mathematically. Either way, this is the correct vectorized implementation. Next, we apply the SVD routine to get U, S, and V. Then we grab the first k columns of the U matrix to form Ureduce, and finally this defines how we go from a feature vector x to this reduced-dimension representation z. And similar to K-means, if you’re applying PCA, the way you’d apply it is with vectors x in Rn; this is not done with the x0 = 1 convention.
So, that was the PCA algorithm. One thing I didn’t do was give a mathematical proof that the u(1) and u(2) and so on, and the z and so on, that you get out of this procedure really are the choices that minimize the squared projection error. Remember, we said that what PCA tries to do is find a surface or a line onto which to project the data so as to minimize the squared projection error. I didn’t prove that, and the mathematical proof is beyond the scope of this course. But fortunately, the PCA algorithm can be implemented in not too many lines of Octave code, and if you implement it, this is actually what will work, and will work well: you get a very effective dimensionality reduction algorithm that does do the right thing of minimizing the squared projection error.
summary
Before we can apply PCA, there is a data pre-processing step we must perform:
Data preprocessing
Given training set: $x^{(1)}, x^{(2)}, \dots, x^{(m)}$
Preprocess (feature scaling/mean normalization):
$\mu_j = \dfrac{1}{m}\sum^m_{i=1}x_j^{(i)}$
Replace each $x_j^{(i)}$ with $x_j^{(i)} - \mu_j$
If different features on different scales (e.g., $x_1$ = size of house, $x_2$ = number of bedrooms), scale features to have comparable range of values.
Above, we first subtract the mean of each feature from the original feature. Then we scale all the features $x_j^{(i)} = \dfrac{x_j^{(i)} - \mu_j}{s_j}$
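A minimal Octave sketch of this preprocessing, assuming the training examples are stored row-wise in an m×n matrix X and using the standard deviation for $s_j$:

mu = mean(X);              % 1 x n row vector of feature means
s  = std(X);               % 1 x n row vector of feature standard deviations
X_norm = (X - mu) ./ s;    % mean-normalize and scale (Octave broadcasts mu and s over the rows)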
We can define specifically what it means to reduce from 2d to 1d data as follows:
$$z^{(i)} = (u^{(1)})^T x^{(i)}$$
The z values are all real numbers and are the projections of our features onto $u^{(1)}$.
So, PCA has two tasks: figure out $u^{(1)},\dots,u^{(k)}$ and also to find $z^{(1)}, z^{(2)}, \dots, z^{(m)}$.
The mathematical proof for the following procedure is complicated and beyond the scope of this course.
1. Compute “covariance matrix”
$$\Sigma = \dfrac{1}{m}\sum^m_{i=1}(x^{(i)})(x^{(i)})^T$$
This can be vectorized in Octave as:
Sigma = (1/m) * X' * X;
We denote the covariance matrix with a capital sigma (which happens to be the same symbol for summation, confusingly—they represent entirely different things).
Note that $x^{(i)}$ is an n×1 vector and $(x^{(i)})^T$ is a 1×n vector, while X is an m×n matrix (row-wise stored examples). The product $(x^{(i)})(x^{(i)})^T$ will be an n×n matrix, which are the dimensions of Σ.
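As the lecture suggests, you can numerically check that the vectorized line above matches the summation definition; a quick Octave sketch, assuming X already holds the mean-normalized examples row-wise:

[m, n] = size(X);
Sigma_loop = zeros(n, n);
for i = 1:m
  Sigma_loop = Sigma_loop + X(i, :)' * X(i, :);   % (x^(i))(x^(i))^T is n x n
end
Sigma_loop = Sigma_loop / m;
Sigma_vec = (1/m) * X' * X;                        % vectorized form from above
disp(max(abs(Sigma_loop(:) - Sigma_vec(:))))       % should be ~0 up to floating-point error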
2. Compute “eigenvectors” of covariance matrix Σ
[U,S,V] = svd(Sigma);
svd() is the ‘singular value decomposition’, a built-in Octave function.
What we actually want out of svd() is the ‘U’ matrix of the Sigma covariance matrix: $U \in \mathbb{R}^{n \times n}$. U contains $u^{(1)},\dots,u^{(n)}$, which is exactly what we want.
3. Take the first k columns of the U matrix and compute z
We’ll assign the first k columns of U to a variable called ‘Ureduce’. This will be an n×k matrix. We compute z with:
$$z^{(i)} = U_{reduce}^T \cdot x^{(i)}$$
$U_{reduce}^T$ will have dimensions k×n while $x^{(i)}$ will have dimensions n×1. The product $U_{reduce}^T \cdot x^{(i)}$ will have dimensions k×1.
To summarize, the whole algorithm in octave is roughly:
Sigma = (1/m) * X' * X; % compute the covariance matrix
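The recap above shows only the covariance step; a minimal sketch of the remaining steps, under the same assumptions (X is an m×n matrix of mean-normalized examples stored row-wise, and k has already been chosen), might look like:

[U, S, V] = svd(Sigma);   % eigenvectors of the covariance matrix
Ureduce = U(:, 1:k);      % keep the first k columns of U (n x k)
Z = X * Ureduce;          % m x k: the i-th row of Z is (z^(i))' = (Ureduce' * x^(i))'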
03_applying-pca
01_reconstruction-from-compressed-representation
In some of the earlier videos, I was talking about PCA as a compression algorithm, where you may have, say, 1,000-dimensional data and compress it to a 100-dimensional feature vector, or have three-dimensional data and compress it to a two-dimensional representation. So, if this is a compression algorithm, there should be a way to go back from this compressed representation to an approximation of your original high-dimensional data. So given z(i), which may be 100-dimensional, how do you go back to your original representation x(i), which was maybe 1,000-dimensional? In this video, I’d like to describe how to do that. In the PCA algorithm, we may have an example like this, so maybe that’s my example x(1), and maybe that’s my example x(2). And what we do is take these examples and project them onto this one-dimensional surface, and then we need just a real number, say z(1), to specify the location of these points after they’ve been projected onto this one-dimensional surface. So, given a point z(1), how can we go back to this original two-dimensional space? In particular, given a point z, which is in R, can we map it back to some approximate representation x in R2 of whatever the original value of the data was? Whereas z equals Ureduce transpose times x, if you want to go in the opposite direction, the equation for that is: x approx equals Ureduce times z. And again, just to check the dimensions: Ureduce is going to be an n by k matrix, z is going to be a k by 1 vector, so when you multiply these out, that’s going to be n by 1, and so x approx is going to be an n-dimensional vector. And the intent of PCA, that is, if the squared projection error is not too big, is that this x approx will be close to whatever was the original value of x that you used to derive z in the first place. To show a picture of what this looks like: what you get back from this procedure are points that lie on the projection of the data onto the green line. So to take our earlier example, if we started off with this value of x(1) and we got this value of z(1), then if you plug z(1) through this formula to get x(1) approx, this point here would be x(1) approx, which is going to be in R2. And similarly, if you do the same procedure, this would be x(2) approx. And that’s a pretty decent approximation to the original data. So that’s how you go back from your low-dimensional representation z to an uncompressed representation of the data, and we get back an approximation to the original data x. We also call this process reconstruction of the original data, where we think of trying to reconstruct the original value of x from the compressed representation. So, given an unlabeled data set, you now know how to apply PCA and take your high-dimensional features x and map them to this lower-dimensional representation z, and from this video, hopefully, you now also know how to take this low-dimensional representation z and map it back up to an approximation of your original high-dimensional data.
Now that you know how to implement and apply PCA, what I’d like to do next is talk about some of the mechanics of how to actually use PCA well. And in particular in the next video, I’d like to talk about how to choose k, which is how to choose the dimension of the reduced representation vector z.
summary
If we use PCA to compress our data, how can we uncompress our data, or go back to our original number of features?
To go from 1-dimension back to 2d we do: $z \in \mathbb{R} \rightarrow x \in \mathbb{R}^2$.
We can do this with the equation:
$$x_{approx}^{(1)} = U_{reduce} \cdot z^{(1)}$$.
Note that we can only get approximations of our original data.
Note: It turns out that the U matrix has the special property that it is a Unitary Matrix. One of the special properties of a Unitary Matrix is:
$U^{-1} = U^*$ where the “*” means “conjugate transpose”.
Since we are dealing with real numbers here, this is equivalent to:
$U^{-1} = U^T$ So we could compute the inverse and use that, but it would be a waste of energy and compute cycles.
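A one-line Octave sketch of the reconstruction, reusing the Ureduce and Z computed in the earlier algorithm sketch (these variable names are assumptions carried over from there):

X_approx = Z * Ureduce';                              % m x n: row i is (Ureduce * z^(i))'
avg_proj_error = mean(sum((X - X_approx).^2, 2));     % average squared projection error, with X the same normalized matrix used to compute Z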
02_choosing-the-number-of-principal-components
In the PCA algorithm, we take n-dimensional features and reduce them to some k-dimensional feature representation. This number k is a parameter of the PCA algorithm; it is also called the number of principal components, or the number of principal components that we’ve retained. In this video I’d like to give you some guidelines, and tell you how people tend to think about choosing this parameter k for PCA.
In order to choose k, that is, to choose the number of principal components, here are a couple of useful concepts. What PCA tries to do is minimize the average squared projection error. So it tries to minimize this quantity, which I’m writing down, which is the difference between the original data x and the projected version, x-approx-i, which was defined in the last video; so it tries to minimize the squared distance between x and its projection onto that lower-dimensional surface. That’s the average squared projection error. Also, let me define the total variation in the data to be the average length squared of these examples x(i): the total variation in the data is the average, over my training set, of the squared length of each of my training examples. And this says, on average, how far are my training examples from the all-zeros vector, how far on average are my training examples from the origin? When we’re trying to choose k, a pretty common rule of thumb is to choose the smallest value of k so that the ratio between these two quantities is less than 0.01. So in other words, a pretty common way to think about how we choose k is: we want the average squared projection error, that is, the average distance between x and its projection, divided by the total variation of the data, that is, how much the data varies, to be less than, let’s say, 0.01, or less than 1%, which is another way of thinking about it. And the way most people think about choosing k is, rather than choosing k directly, to talk about what this ratio is, whether it is 0.01 or some other number. And if it is 0.01, another way to say this, using the language of PCA, is that 99% of the variance is retained. Don’t worry too much about what this phrase means technically; the phrase “99% of the variance is retained” just means that this quantity on the left is less than 0.01. And so, if you are using PCA and you want to tell someone how many principal components you’ve retained, it is more common to say, well, I chose k so that 99% of the variance was retained. And that’s a useful thing to know: it means that the average squared projection error divided by the total variation was at most 1%. That’s an insightful thing to think about, whereas if you tell someone, “Well, I retained 100 principal components,” or “k was equal to 100 in a thousand-dimensional data set,” it’s a little hard for people to interpret that. So this number 0.01 is what people often use. Another common value is 0.05, so that would be 5%, and if you do that then you’d say 95% of the variance is retained; other numbers, maybe 90% of the variance is retained, maybe as low as 85%. So 90% would correspond to 0.10, to 10%. And so a range of values from 90, 95, 99, maybe as low as 85% of the variance retained would be a fairly typical range of values; maybe 95 to 99 is really the most common range of values that people use. For many data sets you’d be surprised: in order to retain 99% of the variance, you can often reduce the dimension of the data significantly and still retain most of the variance, because for most real-life data sets many features are highly correlated, and so it turns out to be possible to compress the data a lot and still retain 99% of the variance, or 95% of the variance.
So how do you implement this? Well, here’s one algorithm that you might use. If you want to choose the value of k, you might start off with k equals 1, and then run through PCA: compute Ureduce, compute z(1), z(2), up to z(m), compute all of the x(1) approx and so on up to x(m) approx, and then check whether 99% of the variance is retained. If it is, then we’re good and we use k equals 1. But if it isn’t, then we’ll next try k equals 2, and then we’ll again run through this entire procedure and check whether this expression is satisfied, whether this is less than 0.01. And if not, then we do this again: try k equals 3, then k equals 4, and so on, until maybe we get up to k equals 17 and we find that 99% of the variance is retained, and then we use k equals 17. That is one way to choose the smallest value of k such that 99% of the variance is retained. But as you can imagine, this procedure seems horribly inefficient: we’re trying k equals one, k equals two, and doing all these calculations. Fortunately, when you implement PCA, it actually gives us a quantity that makes it much easier to compute these things. Specifically, when you’re calling SVD to get these matrices U, S, and V, when you’re calling svd on the covariance matrix Sigma, it also gives us back this matrix S, and S is going to be a square matrix, an n by n matrix in fact, that is diagonal. So its diagonal entries S11, S22, S33, down to Snn, are going to be the only non-zero elements of this matrix, and everything off the diagonal is going to be zero. So those big O’s that I’m drawing: by that I mean that everything off the diagonal of this matrix, all of those entries, are going to be zeros. And so it is possible to show, and I won’t prove this here, that for a given value of k, this quantity over here can be computed much more simply, as one minus the sum from i equals 1 through k of Sii, divided by the sum from i equals 1 through n of Sii. So just to say that in words, or to take another view of how to explain it: if k equals 3, let’s say, what we’re going to do to compute the numerator is sum from i equals 1 through 3 of Sii, so just compute the sum of the first three diagonal elements; and the denominator is the sum of all of the diagonal entries. One minus that ratio gives me this quantity over here that I’ve circled in blue. And so, what we can do is just test whether this is less than or equal to 0.01. Or, equivalently, we can test whether the sum from i equals 1 through k of Sii, divided by the sum from i equals 1 through n of Sii, is greater than or equal to 0.99, if you want to be sure that 99% of the variance is retained. And so what you can do is just slowly increase k: set k equals one, set k equals two, set k equals three, and so on, and just test this quantity to find the smallest value of k that ensures that 99% of the variance is retained. And if you do this, then you need to call the SVD function only once, because that gives you the S matrix, and once you have the S matrix, you can keep doing this calculation by increasing the value of k in the numerator, so you don’t need to keep calling SVD over and over again to test out the different values of k.
So this procedure is much more efficient, and this can allow you to select the value of K without needing to run PCA from scratch over and over. You just run SVD once, this gives you all of these diagonal numbers, all of these numbers S11, S22 down to SNN, and then you can just you know, vary K in this expression to find the smallest value of K, so that 99% of the variance is retained.
So to summarize, the way that I often choose k when I am using PCA for compression is: I call SVD once on the covariance matrix, and then I use this formula and pick the smallest value of k for which this expression is satisfied. And by the way, even if you were to pick some different value of k, even if you were to pick the value of k manually, say you have thousand-dimensional data and you just want to choose k equals one hundred, then, if you want to explain to others what you just did, a good way to explain the performance of your implementation of PCA is actually to take this quantity and compute what it is; that will tell you what percentage of the variance was retained. And if you report that number, then people that are familiar with PCA can use it to get a good understanding of how well your hundred-dimensional representation is approximating your original data set, because 99% of variance retained is really a measure of your squared reconstruction error: that ratio being at most 0.01 just gives people a good intuitive sense of whether your implementation of PCA is finding a good approximation of your original data set.
So hopefully, that gives you an efficient procedure for choosing the number k, for choosing what dimension to reduce your data to. And if you apply PCA to very high-dimensional data sets, say thousand-dimensional data, then very often, just because data sets tend to have highly correlated features, which is a property of most of the data sets you see, you’ll find that PCA is able to retain ninety-nine percent of the variance, or ninety-five, ninety-nine, some high fraction of the variance, even while compressing the data by a very large factor.
summary
How do we choose k, also called the number of principal components? Recall that k is the dimension we are reducing to.
One way to choose k is by using the following formula:
- Given the average squared projection error: $\dfrac{1}{m}\sum^m_{i=1}||x^{(i)} - x_{approx}^{(i)}||^2$
- Also given the total variation in the data: $\dfrac{1}{m}\sum^m_{i=1}||x^{(i)}||^2$
- Choose k to be the smallest value such that: $\dfrac{\dfrac{1}{m}\sum^m_{i=1}||x^{(i)} - x_{approx}^{(i)}||^2}{\dfrac{1}{m}\sum^m_{i=1}||x^{(i)}||^2} \leq 0.01$
In other words, the squared projection error divided by the total variation should be less than one percent, so that 99% of the variance is retained.
Algorithm for choosing k
- Try PCA with k=1,2,…
- Compute $U_{reduce}$, $z^{(1)},\dots,z^{(m)}$, and $x_{approx}^{(1)},\dots,x_{approx}^{(m)}$
- Check the formula given above that 99% of the variance is retained. If not, go to step one and increase k.
This procedure would actually be horribly inefficient. In Octave, we will call svd:
[U,S,V] = svd(Sigma)
Which gives us a matrix S. We can actually check for 99% of retained variance using the S matrix as follows:
$$\dfrac{\sum_{i=1}^kS_{ii}}{\sum_{i=1}^nS_{ii}} \geq 0.99$$
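In Octave, this check can be run for every k from a single svd call; a sketch that picks the smallest k retaining 99% of the variance:

[U, S, V] = svd(Sigma);
s = diag(S);                     % the diagonal entries S11, ..., Snn
retained = cumsum(s) / sum(s);   % fraction of variance retained for k = 1, 2, ..., n
k = find(retained >= 0.99, 1);   % smallest k with at least 99% of the variance retained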
03_advice-for-applying-pca
In an earlier video, I had said that PCA can be sometimes used to speed up the running time of a learning algorithm. In this video, I’d like to explain how to actually do that, and also say some, just try to give some advice about how to apply PCA.
Here’s how you can use PCA to speed up a learning algorithm, and this supervised learning speed-up is actually the most common use that I personally make of PCA. Let’s say you have a supervised learning problem, note this is a supervised learning problem with inputs x and labels y, and let’s say that your examples x(i) are very high dimensional, say 10,000-dimensional feature vectors. One example of that would be a computer vision problem where you have 100x100 images: 100x100 is 10,000 pixels, and so if x(i) are feature vectors that contain the 10,000 pixel intensity values, then you have 10,000-dimensional feature vectors. With very high-dimensional feature vectors like this, running a learning algorithm can be slow, right? If you feed 10,000-dimensional feature vectors into logistic regression, or a neural network, or a support vector machine, or what have you, just because that’s a lot of data, that’s 10,000 numbers, it can make your learning algorithm run more slowly. Fortunately, with PCA we’ll be able to reduce the dimension of this data and so make our algorithms run more efficiently. Here’s how you do that. We are first going to take our labeled training set and extract just the inputs: we’re just going to extract the x’s and temporarily put aside the y’s. So this will give us an unlabeled training set x(1) through x(m), which are maybe ten-thousand-dimensional examples. Then we apply PCA, and this gives me a reduced-dimension representation of the data, so instead of 10,000-dimensional feature vectors I now have maybe one-thousand-dimensional feature vectors. So that’s like a 10x savings. This gives me, if you will, a new training set: whereas previously my first training example might have been the pair x(1), y(1), the input is now represented by z(1), and so we have a new sort of training example, which is z(1) paired with y(1), and similarly z(2), y(2), and so on, up to z(m), y(m), because my training examples are now represented by this much lower-dimensional representation z(1), z(2), up to z(m). Finally, I can take this reduced-dimension training set and feed it to a learning algorithm, maybe a neural network, maybe logistic regression, and learn a hypothesis h that takes as input these low-dimensional representations z and tries to make predictions. So if I were using logistic regression, for example, I would train a hypothesis that outputs one over one plus e to the negative theta transpose z, that takes as input one of these z vectors and tries to make a prediction. And finally, if you have a new example, maybe a new test example x, what you would do is take your test example x, map it through the same mapping that was found by PCA to get the corresponding z, and that z then gets fed to this hypothesis, which makes a prediction on your input x. One final note: what PCA does is define a mapping from x to z, and this mapping from x to z should be defined by running PCA only on the training set. In particular, this mapping that PCA is learning computes a set of parameters: the feature scaling and mean normalization parameters, and also this matrix Ureduce.
But U_reduce is like a parameter that is learned by PCA, and we should be fitting our parameters only to our training set, not to our cross-validation or test sets; so U_reduce and these other quantities should be obtained by running PCA only on your training set. Then, having found U_reduce, and having found the feature scaling parameters (the mean normalization, and the scale you divide each feature by to get the features onto comparable ranges) on the training set, you can apply the same mapping to other examples, whether they are in your cross-validation set or in your test set. Just to summarize: when you're running PCA, run it only on the training-set portion of the data, not on the cross-validation or test portions. That defines the mapping from x to z, and you can then apply that mapping to your cross-validation set and your test set. By the way, in this example I talked about reducing the data from 10,000 dimensions to 1,000, and this is actually not that unrealistic. For many problems we can reduce the dimension of the data by 5x or maybe 10x and still retain most of the variance, barely affecting performance, in terms of, say, the classification accuracy of the learning algorithm. And by working with lower-dimensional data, our learning algorithm can often run much, much faster.
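To make the fit-only-on-the-training-set point concrete, here is a small sketch (my own helper names, hypothetical data) in which the mean, the feature scales, and U_reduce are all computed from X_train and then reused unchanged on the cross-validation and test splits.

```python
import numpy as np

def fit_pca_mapping(X_train, k):
    """Learn the mean, the feature scales, and U_reduce on the TRAINING set only."""
    mu = X_train.mean(axis=0)
    sigma = X_train.std(axis=0) + 1e-8          # the scale each feature is divided by
    X_norm = (X_train - mu) / sigma
    Cov = (X_norm.T @ X_norm) / X_train.shape[0]
    U, _, _ = np.linalg.svd(Cov)
    return mu, sigma, U[:, :k]

def apply_pca_mapping(X, mu, sigma, U_reduce):
    """Apply an already-fitted mapping to any split (train, cross-validation, or test)."""
    return ((X - mu) / sigma) @ U_reduce

# Hypothetical splits.
X_train, X_cv, X_test = (np.random.rand(m, 400) for m in (200, 50, 50))
mu, sigma, U_reduce = fit_pca_mapping(X_train, k=40)
Z_train = apply_pca_mapping(X_train, mu, sigma, U_reduce)
Z_cv    = apply_pca_mapping(X_cv,    mu, sigma, U_reduce)   # same mu, sigma, U_reduce
Z_test  = apply_pca_mapping(X_test,  mu, sigma, U_reduce)   # never refit on these splits
```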
To summarize, we've so far talked about the following applications of PCA. First is the compression application, where we reduce the memory or disk space needed to store data, and we just talked about using PCA to speed up a learning algorithm. In these applications we often choose k by looking at the percentage of variance retained; for the learning-algorithm speed-up application we'll often retain 99% of the variance, and that would be a very typical choice for k. So that's how you choose k for these compression applications. For visualization applications, on the other hand, we usually only know how to plot two-dimensional or three-dimensional data, so we'll usually choose k = 2 or k = 3, because we can only plot 2D and 3D data sets. That summarizes the main applications of PCA, as well as how to choose the value of k for these different applications.
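A minimal sketch of picking k by variance retained, assuming the usual check on the singular values of the covariance matrix; choose_k and the random data are illustrative, not from the course.

```python
import numpy as np

def choose_k(X, variance_to_retain=0.99):
    """Smallest k such that the top-k singular values of the covariance matrix
    account for at least the requested fraction of the total (variance retained)."""
    X_norm = X - X.mean(axis=0)
    Sigma = (X_norm.T @ X_norm) / X.shape[0]
    _, S, _ = np.linalg.svd(Sigma)
    retained = np.cumsum(S) / np.sum(S)
    return int(np.searchsorted(retained, variance_to_retain) + 1)

X = np.random.rand(200, 50)        # hypothetical data
k = choose_k(X, 0.99)              # typical choice for the compression / speed-up uses
# For visualization you would instead fix k = 2 or k = 3 so the z's can be plotted.
print(k)
```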
I should mention that there is one frequent misuse of PCA; you sometimes hear about others doing this, hopefully not too often, and I just want to mention it so that you know not to do it. The bad use of PCA is to try to use it to prevent over-fitting. Here's the reasoning behind this method: if we have x(i) with n features, but we compress the data and use z(i) instead, that reduces the number of features to k, which could be much smaller. So, the thinking goes, if k is 1,000 and n is 10,000, then with only 1,000-dimensional data maybe we're less likely to over-fit than with 10,000-dimensional data. So some people think of PCA as a way to prevent over-fitting. But, just to emphasize, this is a bad application of PCA and I do not recommend doing it. It's not that this method works badly; if you use it to reduce the dimension of the data to try to prevent over-fitting, it might actually work OK. But it is just not a good way to address over-fitting, and if you're worried about over-fitting, there is a much better way to address it: use regularization, instead of using PCA to reduce the dimension of the data. The reason is that, if you think about how PCA works, it does not use the labels y; you are looking only at your inputs x(i), and you're using them to find a lower-dimensional approximation to your data. So PCA throws away some information; it reduces the dimension of your data without knowing what the values of y are. Using PCA this way is probably OK if, say, 99 percent of the variance is retained, so you're keeping most of the variance, but it might also throw away some valuable information. And it turns out that whether you're retaining 99% of the variance, or 95%, or whatever, just using regularization will often give you at least as good a method for preventing over-fitting, and regularization will often just work better. When you apply linear regression or logistic regression or some other method with regularization, the minimization problem actually knows what the values of y are, and so is less likely to throw away valuable information, whereas PCA doesn't make use of the labels and is more likely to throw away valuable information.
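For contrast, regularization keeps the labels in the objective. The standard regularized logistic regression cost, for example, is

$$ J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\Big[\, y^{(i)}\log h_\theta\big(x^{(i)}\big) + \big(1-y^{(i)}\big)\log\big(1-h_\theta\big(x^{(i)}\big)\big) \Big] + \frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^2 $$

Because the $y^{(i)}$ appear directly in the term being minimized, the regularized fit keeps whatever directions of $x$ are useful for predicting $y$, whereas PCA picks its k directions without ever looking at $y$.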
So, to summarize: it is a good use of PCA if your main motivation is to speed up your learning algorithm, but using PCA to prevent over-fitting is not a good use of PCA, and using regularization is really what many people would recommend doing instead.
Finally, one last misuse of PCA. I should say that PCA is a very useful algorithm, and I often use it for compression or visualization purposes. But what I sometimes see is people using PCA where it shouldn't be used. Here's a pretty common thing that I see: if someone is designing a machine learning system, they may write down a plan like this: get a training set, then run PCA, then train logistic regression, and then test on the test data. So often at the very start of a project, someone will just write out a project plan that says let's do these four steps with PCA inside. Before writing down a project plan that incorporates PCA like this, one very good question to ask is: what if we were to just do the whole thing without using PCA? Often people do not consider this before coming up with a complicated project plan and implementing PCA and so on. So specifically, what I often advise is this: before you implement PCA, take whatever it is you want to do and first consider doing it with your original raw data x(i); only if that doesn't do what you want should you implement PCA and use z(i). In other words, instead of putting PCA into the algorithm from the start, first consider ditching the PCA step and training your learning algorithm on your original raw inputs x(i). Only if you have a reason to believe that doesn't work (only if your learning algorithm runs too slowly, or the memory or disk-space requirement is too large and you want to compress the representation), that is, only if you have evidence or strong reason to believe that using the x(i) won't work, should you implement PCA and consider using the compressed representation. Because what I do see is that sometimes people start off with a project plan that incorporates PCA, when whatever they're doing would have worked just fine even without PCA. So consider that as an alternative as well, before you spend a lot of time getting PCA in, figuring out what k is, and so on.
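One way to act on this advice is to benchmark the plain pipeline first and add PCA only if the numbers say you need it. Here is a rough sketch, using scikit-learn purely for brevity; the models, the 99%-variance setting, and the random data are illustrative assumptions, and in practice you would compare on a cross-validation set rather than the training set.

```python
import time
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical raw data.
X, y = np.random.rand(500, 800), np.random.randint(0, 2, size=500)

# Step 1: try the learning algorithm on the raw x's first.
t0 = time.perf_counter()
raw_model = LogisticRegression(max_iter=1000).fit(X, y)
raw_time = time.perf_counter() - t0

# Step 2: only if the raw version is too slow (or too costly to store),
# add a PCA step that retains 99% of the variance and compare.
t0 = time.perf_counter()
pca_model = make_pipeline(PCA(n_components=0.99, svd_solver="full"),
                          LogisticRegression(max_iter=1000)).fit(X, y)
pca_time = time.perf_counter() - t0

# Training-set accuracy shown only to keep the sketch short; compare on a
# cross-validation set in practice.
print(f"raw: {raw_time:.2f}s  acc={raw_model.score(X, y):.2f}")
print(f"pca: {pca_time:.2f}s  acc={pca_model.score(X, y):.2f}")
```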
So, that's it for PCA. Despite these last few comments, PCA is an incredibly useful algorithm when you use it for the appropriate applications, and I've actually used it pretty often, mostly to speed up the running time of my learning algorithms. But just as common an application of PCA is to use it to compress data, to reduce memory or disk space requirements, or to visualize data. PCA is one of the most commonly used and most powerful unsupervised learning algorithms, and with what you've learned in these videos, hopefully you'll be able to implement PCA and use it for all of these purposes as well.
summary
The most common use of PCA is to speed up supervised learning.
Given a training set with a large number of features (e.g. $x^{(1)},\dots,x^{(m)} \in \mathbb{R}^{10000}$ ) we can use PCA to reduce the number of features in each example of the training set (e.g. $z^{(1)},\dots,z^{(m)} \in \mathbb{R}^{1000}$).
Note that we should define the PCA reduction from $x^{(i)}$ to $z^{(i)}$ only on the training set and not on the cross-validation or test sets. You can apply the mapping $x^{(i)} \mapsto z^{(i)}$ to your cross-validation and test sets after it is defined on the training set.
Applications
- Compression
  - Reduce the memory/disk space needed to store data
  - Speed up a learning algorithm
- Visualization of data
  - Choose k = 2 or k = 3
Bad use of PCA: trying to prevent overfitting. We might think that reducing the features with PCA would be an effective way to address overfitting. It might work, but it is not recommended because PCA does not consider the values of our labels y. Using just regularization will be at least as effective.
Don’t assume you need to do PCA. Try your full machine learning algorithm without PCA first. Then use PCA if you find that you need it.