natural language processing word embeddings

Note

This is my personal lecture note after studying the course nlp sequence models at the 2nd week and the copyright belongs to deeplearning.ai.

01_introduction-to-word-embeddings

01_word-representation

Hello, and welcome back. Last week, we learned about RNNs, GRUs, and LSTMs. In this week, you see how many of these ideas can be applied to NLP, to Natural Language Processing, which is one of the features of AI because it’s really being revolutionized by deep learning. One of the key ideas you learn about is word embeddings, which is a way of representing words. That let your algorithms automatically understand analogies like that, man is to woman, as king is to queen, and many other examples. And through these ideas of word embeddings, you’ll be able to build NPL applications, even with models the size of, usually of relatively small label training sets. Finally towards the end of the week, you’ll see how to debias word embeddings. That’s to reduce undesirable gender or ethnicity or other types of bias that learning algorithms can sometimes pick up. So with that, let’s get started with a discussion on word representation.

So far, we’ve been representing words using a vocabulary of words, and a vocabulary from the previous week might be say, 10,000 words. And we’ve been representing words using a one-hot vector. So for example, if man is word number 5391 in this dictionary, then you represent him with a vector with one in position 5391. And I’m also going to use O subscript 5391 to represent this factor, where O here stands for one-hot. And then, if woman is word number 9853, then you represent it with O subscript 9853 which just has a one in position 9853 and zeros elsewhere. And then other words king, queen, apple, orange will be similarly represented with one-hot vector. One of the weaknesses of this representation is that it treats each word as a thing unto itself, and it doesn’t allow an algorithm to easily generalize the cross words. For example, let’s say you have a language model that has learned that when you see “I want a glass of orange ____”. Well, what do you think the next word will be? Very likely, it’ll be “juice”. But even if the learning algorithm has learned that “I want a glass of orange juice” is a likely sentence, if it sees “I want a glass of apple _____”. As far as it knows the relationship between apple and orange is not any closer as the relationship between any of the other words man, woman, king, queen, and orange. And so, it’s not easy for the learning algorithm to generalize from knowing that orange juice is a popular thing, to recognizing that apple juice might also be a popular thing or a popular phrase. And this is because the any product between any two different one-hot vector is zero. If you take any two vectors say, queen and king and any product of them, the end product is zero. If you take apple and orange and any product of them, the end product is zero. And you couldn’t get distance between any pair of these vectors, which is also the same. So it just doesn’t know that somehow apple and orange are much more similar than king and orange or queen and orange.

So, won’t it be nice if instead of a one-hot presentation we can instead learn a featurized representation with each of these words, a man, woman, king, queen, apple, orange or really for every word in the dictionary, we could learn a set of features and values for each of them. So for example, each of these words, we want to know what is the gender associated with each of these things. So, if gender goes from minus one for male to plus one for female, then the gender associated with man might be minus one, for woman might be plus one. And then eventually, learning these things maybe for king you get minus 0.95, for queen plus 0.97, and for apple and orange sort of genderless. Another feature might be, well how royal are these things. And so the terms, man and woman are not really royal, so they might have feature values close to zero. Whereas king and queen are highly royal. And apple and orange are not really loyal. How about age? Well, man and woman doesn’t connotes much about age. Maybe men and woman implies that they’re adults, but maybe neither necessarily young nor old. So maybe values close to zero. Whereas kings and queens are always almost always adults. And apple and orange might be more neutral with respect to age. And then, another feature for here, is this is a food? Well, man is not a food, woman is not a food, neither are kings and queens, but apples and oranges are foods. And they can be many other features as well ranging from, what is the size of this? What is the cost? Is this something that is a live? Is this an action, or is this a noun, or is this a verb, or is it something else? And so on.

So you can imagine coming up with many features. And for the sake of the illustration let’s say, 300 different features, and what that does is, it allows you to take this list of numbers, I’ve only written four here, but this could be a list of 300 numbers, that then becomes a 300 dimensional vector for representing the word man. And I’m going to use the notation e subscript 5391 to denote a representation like this. And similarly, this vector, this 300 dimensional vector or 300 dimensional vector like this, I would denote e9853 to denote a 300 dimensional vector we could use to represent the word woman. And similarly, for the other examples here. Now, if you use this representation to represent the words orange and apple, then notice that the representations for orange and apple are now quite similar. Some of the features will differ because of the color of an orange, the color an apple, the taste, or some of the features would differ. But by a large, a lot of the features of apple and orange are actually the same, or take on very similar values. And so, this increases the odds of the learning algorithm that has figured out that orange juice is a thing, to also quickly figure out that apple juice is a thing. So this allows it to generalize better across different words. So over the next few videos, we’ll find a way to learn words embeddings. We just need you to learn high dimensional feature vectors like these, that gives a better representation than one-hot vectors for representing different words. And the features we’ll end up learning, won’t have a easy to interpret interpretation like that component one is gender, component two is royal, component three is age and so on. Exactly what they’re representing will be a bit harder to figure out. But nonetheless, the featurized representations we will learn, will allow an algorithm to quickly figure out that apple and orange are more similar than say, king and orange or queen and orange.

If we’re able to learn a 300 dimensional feature vector or 300 dimensional embedding for each words, one of the popular things to do is also to take this 300 dimensional data and embed it say, in a two dimensional space so that you can visualize them. And so, one common algorithm for doing this is the t-SNE algorithm due to Laurens van der Maaten and Geoff Hinton. And if you look at one of these embeddings, one of these representations, you find that words like man and woman tend to get grouped together, king and queen tend to get grouped together, and these are the people which tends to get grouped together. Those are animals who can get grouped together. Fruits will tend to be close to each other. Numbers like one, two, three, four, will be close to each other. And then, maybe the animate objects as whole will also tend to be grouped together. But you see plots like these sometimes on the internet to visualize some of these 300 or higher dimensional embeddings. And maybe this gives you a sense that, word embeddings algorithms like this can learn similar features for concepts that feel like they should be more related, as visualized by that concept that seem to you and me like they should be more similar, end up getting mapped to a more similar feature vectors. And these representations will use these sort of featurized representations in maybe a 300 dimensional space, these are called embeddings. And the reason we call them embeddings is, you can think of a 300 dimensional space. And again, they can’t draw out here in two dimensional space because it’s a 3D one. And what you do is you take every words like orange, and have a three dimensional feature vector so that word orange gets embedded to a point in this 300 dimensional space. And the word apple, gets embedded to a different point in this 300 dimensional space. And of course to visualize it, algorithms like t-SNE, map this to a much lower dimensional space, you can actually plot the 2D data and look at it. But that’s what the term embedding comes from.

Word embeddings has been one of the most important ideas in NLP, in Natural Language Processing. In this video, you saw why you might want to learn or use word embeddings. In the next video, let’s take a deeper look at how you’ll be able to use these algorithms, to build NLP algorithims.

02_using-word-embeddings

In the last video, you saw what it might mean to learn a featurized representations of different words. In this video, you see how we can take these representations and plug them into NLP applications. Let’s start with an example.

Continuing with the named entity recognition example, if you’re trying to detect people’s names. Given a sentence like Sally Johnson is an orange farmer, hopefully, you’ll figure out that Sally Johnson is a person’s name, hence, the outputs 1 like that. And one way to be sure that Sally Johnson has to be a person, rather than say the name of the corporation is that you know orange farmer is a person. So previously, we had talked about one hot representations to represent these words, x(1), x(2), and so on. But if you can now use the featurized representations, the embedding vectors that we talked about in the last video. Then after having trained a model that uses word embeddings as the inputs, if you now see a new input, Robert Lin is an apple farmer. Knowing that orange and apple are very similar will make it easier for your learning algorithm to generalize to figure out that Robert Lin is also a human, is also a person’s name. One of the most interesting cases will be, what if in your test set you see not Robert Lin is an apple farmer, but you see much less common words? What if you see Robert Lin is a durian cultivator? A durian is a rare type of fruit, popular in Singapore and a few other countries. But if you have a small label training set for the named entity recognition task, you might not even have seen the word durian or seen the word cultivator in your training set. I guess technically, this should be a durian cultivator. But if you have learned a word embedding that tells you that durian is a fruit, so it’s like an orange, and a cultivator, someone that cultivates is like a farmer, then you might still be generalize from having seen an orange farmer in your training set to knowing that a durian cultivator is also probably a person. So one of the reasons that word embeddings will be able to do this is the algorithms to learning word embeddings can examine very large text corpuses, maybe found off the Internet. So you can examine very large data sets, maybe a billion words, maybe even up to 100 billion words would be quite reasonable. So very large training sets of just unlabeled text. And by examining tons of unlabeled text, which you can download more or less for free, you can figure out that orange and durian are similar. And farmer and cultivator are similar, and therefore, learn embeddings, that groups them together. Now having discovered that orange and durian are both fruits by reading massive amounts of Internet text, what you can do is then take this word embedding and apply it to your named entity recognition task, for which you might have a much smaller training set, maybe just 100,000 words in your training set, or even much smaller. And so this allows you to carry out transfer learning, where you take information you’ve learned from huge amounts of unlabeled text that you can suck down essentially for free off the Internet to figure out that orange, apple, and durian are fruits. And then transfer that knowledge to a task, such as named entity recognition, for which you may have a relatively small labeled training set. And, of course, for simplicity, l drew this for it only as a unidirectional RNN. If you actually want to carry out the named entity recognition task, you should, of course, use a bidirectional RNN rather than a simpler one I’ve drawn here.

But to summarize, this is how you can carry out transfer learning using word embeddings. Step 1 is to learn word embeddings from a large text corpus, a very large text corpus or you can also download pre-trained word embeddings online. There are several word embeddings that you can find online under very permissive licenses. And you can then take these word embeddings and transfer the embedding to new task, where you have a much smaller labeled training sets. And use this, let’s say, 300 dimensional embedding, to represent your words. One nice thing also about this is you can now use relatively lower dimensional feature vectors. So rather than using a 10,000 dimensional one-hot vector, you can now instead use maybe a 300 dimensional dense vector. Although the one-hot vector is fast and the 300 dimensional vector that you might learn for your embedding will be a dense vector. And then, finally, as you train your model on your new task, on your named entity recognition task with a smaller label data set, one thing you can optionally do is to continue to fine tune, continue to adjust the word embeddings with the new data. In practice, you would do this only if this task 2 has a pretty big data set. If your label data set for step 2 is quite small, then usually, I would not bother to continue to fine tune the word embeddings. So word embeddings tend to make the biggest difference when the task you’re trying to carry out has a relatively smaller training set. So it has been useful for many NLP tasks. And I’ll just name a few. Don’t worry if you don’t know these terms. It has been useful for named entity recognition, for text summarization, for co-reference resolution, for parsing. These are all maybe pretty standard NLP tasks. It has been less useful for language modeling, machine translation, especially if you’re accessing a language modeling or machine translation task for which you have a lot of data just dedicated to that task. So as seen in other transfer learning settings, if you’re trying to transfer from some task A to some task B, the process of transfer learning is just most useful when you happen to have a ton of data for A and a relatively smaller data set for B. And so that’s true for a lot of NLP tasks, and just less true for some language modeling and machine translation settings.

Finally, word embeddings has a interesting relationship to the face encoding ideas that you learned about in the previous course, if you took the convolutional neural networks course. So you will remember that for face recognition, we train this Siamese network architecture that would learn, say, a 128 dimensional representation for different faces. And then you can compare these encodings in order to figure out if these two pictures are of the same face. The words encoding and embedding mean fairly similar things. So in the face recognition literature, people also use the term encoding to refer to these vectors, f(x(i)) and f(x(j)). One difference between the face recognition literature and what we do in word embeddings is that, for face recognition, you wanted to train a neural network that can take as input any face picture, even a picture you’ve never seen before, and have a neural network compute an encoding for that new picture. Whereas what we’ll do, and you’ll understand this better when we go through the next few videos, whereas what we’ll do for learning word embeddings is that we’ll have a fixed vocabulary of, say, 10,000 words. And we’ll learn a vector e1 through, say, e10,000 that just learns a fixed encoding or learns a fixed embedding for each of the words in our vocabulary. So that’s one difference between the set of ideas you saw for face recognition versus what the algorithms we’ll discuss in the next few videos. But the terms encoding and embedding are used somewhat interchangeably. So the difference I just described is not represented by the difference in terminologies. It’s just a difference in how we need to use these algorithms in face recognition, where there’s unlimited sea of pictures you could see in the future. Versus natural language processing, where there might be just a fixed vocabulary, and everything else like that we’ll just declare as an unknown word.

So in this video, you saw how using word embeddings allows you to implement this type of transfer learning. And how, by replacing the one-hot vectors we’re using previously with the embedding vectors, you can allow your algorithms to generalize much better, or you can learn from much less label data. Next, I want to show you just a few more properties of these word embeddings. And then after that, we will talk about algorithms for actually learning these word embeddings. Let’s go on to the next video, where you’ll see how word embeddings can help with reasoning about analogies.

03_properties-of-word-embeddings

By now, you should have a sense of how word embeddings can help you build NLP applications. One of the most fascinating properties of word embeddings is that they can also help with analogy reasoning. And while reasonable analogies may not be by itself the most important NLP application, they might also help convey a sense of what these word embeddings are doing, what these word embeddings can do. Let me show you what I mean here are the featurized representations of a set of words that you might hope a word embedding could capture.

Let’s say I pose a question, man is to woman as king is to what? Many of you will say, man is to woman as king is to queen. But is it possible to have an algorithm figure this out automatically? Well, here’s how you could do it, let’s say that you’re using this four dimensional vector to represent man. So this will be your E5391, although just for this video, let me call this e subscript man. And let’s say that’s the embedding vector for woman, so I’m going to call that e subscript woman, and similarly for king and queen. And for this example, I’m just going to assume you’re using four dimensional embeddings, rather than anywhere from 50 to 1,000 dimensional, which would be more typical. One interesting property of these vectors is that if you take the vector, e man, and subtract the vector e woman, then, You end up with approximately -1, negative another 1 is -2, decimal 0- 0, 0- 0, close to 0- 0, so you get roughly -2 0 0 0. And similarly if you take e king minus e queen, then that’s approximately the same thing. That’s about -1- 0.97, it’s about -2. This is about 1- 1, since kings and queens are both about equally royal. So that’s 0, and then age difference, food difference, 0. And so what this is capturing is that the main difference between man and woman is the gender. And the main difference between king and queen, as represented by these vectors, is also the gender. Which is why the difference e man- e woman, and the difference e king- e queen, are about the same. So one way to carry out this analogy reasoning is, if the algorithm is asked, man is to woman as king is to what? What it can do is compute e man- e woman, and try to find a vector, try to find a word so that e man- e woman is close to e king- e of that new word. And it turns out that when queen is the word plugged in here, then the left hand side is close to the the right hand side. So these ideas were first pointed out by Tomas Mikolov, Wen-tau Yih, and Geoffrey Zweig. And it’s been one of the most remarkable and surprisingly influential results about word embeddings. And I think has helped the whole community get better intuitions about what word embeddings are doing. So let’s formalize how you can turn this into an algorithm.

In pictures, the word embeddings live in maybe a 300 dimensional space. And so the word man is represented as a point in the space, and the word woman is represented as a point in the space. And the word king is represented as another point, and the word queen is represented as another point. And what we pointed out really on the last slide is that the vector difference between man and woman is very similar to the vector difference between king and queen. And this arrow I just drew is really the vector that represents a difference in gender. And remember, these are points we’re plotting in a 300 dimensional space. So in order to carry out this kind of analogical reasoning to figure out, man is to woman is king is to what, what you can do is try to find the word w, So that, This equation holds true, so you want there to be, A high degree of a similarity, between I’m going to use s, And so what you want is to find the word w that maximizes the similarity between, e w compared to e king- e man + e woman Right, so what I did is, I took this e question mark, and replaced that with ew, and then brought ew to just one side of the equation. And then the other three terms to the right hand side of this equation. So we have some appropriate similarity function for measuring how similar is the embedding of some word w to this quantity of the right. Then finding the word that maximizes the similarity should hopefully let you pick out the word queen. And the remarkable thing is, this actually works. If you learn a set of word embeddings and find a word w that maximizes this type of similarity, you can actually get the exact right answer. Depending on the details of the task, but if you look at research papers, it’s not uncommon for research papers to report anywhere from, say, 30% to 75% accuracy on analogy using tasks like these. Where you count an anology attempt as correct only if it guesses the exact word right. So only if, in this case, it picks out the word queen. Before moving on, I just want to clarify what this plot on the left is. Previously, we talked about using algorithms like t-SNE to visualize words. What t-SNE does is, it takes 300-D data, and it maps it in a very non-linear way to a 2D space. And so the mapping that t-SNE learns, this is a very complicated and very non-linear mapping. So after the t-SNE mapping, you should not expect these types of parallelogram relationships, like the one we saw on the left, to hold true. And it’s really in this original 300 dimensional space that you can more reliably count on these types of parallelogram relationships in analogy pairs to hold true. And it may hold true after a mapping through t-SNE, but in most cases, because of t-SNE’s non-linear mapping, you should not count on that. And many of the parallelogram analogy relationships will be broken by t-SNE.

Now, before moving on, let me just quickly describe the similarity function that is most commonly used. So the most commonly used similarity function is called cosine similarity. So this is the equation we had from the previous slide. So in cosine similarity, you define the similarity between two vectors u and v as u transpose v divided by the lengths by the Euclidean lengths. So ignoring the denominator for now, this is basically the inner product between u and v. And so if u and v are very similar, their inner product will tend to be large. And this is called cosine similarity because this is actually the cosine of the angle between the two vectors, u and v. So that’s the angle phi, so this formula is actually the cosine between them. And so you remember from calculus that if this phi, then the cosine of phi looks like this. So if the angle between them is 0, then the cosine similarity is equal to 1. And if their angle is 90 degrees, the cosine similarity is 0. And then if they’re 180 degrees, or pointing in completely opposite directions, it ends up being -1. So that’s where the term cosine similarity comes from, and it works quite well for these analogy reasoning tasks. If you want, you can also use square distance or Euclidian distance, u-v squared. Technically, this would be a measure of dissimilarity rather than a measure of similarity. So we need to take the negative of this, and this will work okay as well. Although I see cosine similarity being used a bit more often. And the main difference between these is how it normalizes the lengths of the vectors u and v. So one of the remarkable results about word embeddings is the generality of analogy relationships they can learn. So for example, it can learn that man is to woman as boy is to girl, because the vector difference between man and woman, similar to king and queen and boy and girl, is primarily just the gender. It can learn that Ottawa, which is the capital of Canada, that Ottawa is to Canada as Nairobi is to Kenya. So that’s the city capital is to the name of the country. It can learn that big is to bigger as tall is to taller, and it can learn things like that. Yen is to Japan, since yen is the currency of Japan, as ruble is to Russia. And all of these things can be learned just by running a word embedding learning algorithm on the large text corpus. It can spot all of these patterns by itself, just by running from very large bodies of text.

So in this video, you saw how word embeddings can be used for analogy reasoning. And while you might not be trying to build an analogy reasoning system yourself as an application, this I hope conveys some intuition about the types of feature-like representations that these representations can learn. And you also saw how cosine similarity can be a way to measure the similarity between two different word embeddings. Now, we talked a lot about properties of these embeddings and how you can use them. Next, let’s talk about how you’d actually learn these word embeddings, let’s go on to the next video.

04_embedding-matrix

Let’s start to formalize the problem of learning a good word embedding. When you implement an algorithm to learn a word embedding, what you end up learning is an embedding matrix. Let’s take a look at what I means.

Let’s say, as usual we’re using our 10,000-word vocabulary. So, the vocabulary has A, Aaron, Orange, Zulu, maybe also unknown word as a token. What we’re going to do is learn embedding matrix E, which is going to be a 300 dimensional by 10,000 dimensional matrix, if you have 10,000 words vocabulary or maybe 10,001 is our word token, there’s one extra token. And the columns of this matrix would be the different embeddings for the 10,000 different words you have in your vocabulary. So, Orange was word number 6257 in our vocabulary of 10,000 words. So, one piece of notation we’ll use is that 06257 was the one-hot vector with zeros everywhere and a one in position 6257. And so, this will be a 10,000-dimensional vector with a one in just one position. So, this isn’t quite a drawn scale. Yes, this should be as tall as the embedding matrix on the left is wide. And if the embedding matrix is called capital E then notice that if you take E and multiply it by just one-hot vector by 0 of 6257, then this will be a 300-dimensional vector. So, E is 300 by 10,000 and 0 is 10,000 by 1. So, the product will be 300 by 1, so with 300-dimensional vector and notice that to compute the first element of this vector, of this 300-dimensional vector, what you do is you will multiply the first row of the matrix E with this. But all of these elements are zero except for element 6257 and so you end up with zero times this, zero times this, zero times this, and so on. And then, 1 times whatever this is, and zero times this, zero times this, zero times and so on. And so, you end up with the first element as whatever is that elements up there, under the Orange column. And then, for the second element of this 300-dimensional vector we’re computing, you would take the vector 0657 and multiply it by the second row with the matrix E. So again, you have zero times this, plus zero times this, plus zero times all of these are the elements and then one times this, and then zero times everything else and add that together. So you end up with this and so on as you go down the rest of this column. So, that’s why the embedding matrix E times this one-hot vector here winds up selecting out this 300-dimensional column corresponding to the word Orange. So, this is going to be equal to E 6257 which is the notation we’re going to use to represent the embedding vector that 300 by one dimensional vector for the word Orange. And more generally, E for a specific word W, this is going to be embedding for a word W. And more generally, E times O substitute J, one-hot vector with one that position J, this is going to be E_J and that’s going to be the embedding for word J in the vocabulary. So, the thing to remember from this slide is that our goal will be to learn an embedding matrix E and what you see in the next video is you initialize E randomly and you’re straight in the sense to learn all the parameters of this 300 by 10,000 dimensional matrix and E times this one-hot vector gives you the embedding vector. Now just one note, when we’re writing the equation, it’ll be convenient to write this type of notation where you take the matrix E and multiply it by the one-hot vector O. But if when you’re implementing this, it is not efficient to actually implement this as a mass matrix vector multiplication because the one-hot vectors, now this is a relatively high dimensional vector and most of these elements are zero. So, it’s actually not efficient to use a matrix vector multiplication to implement this because if we multiply a whole bunch of things by zeros and so the practice, you would actually use a specialized function to just look up a column of the Matrix E rather than do this with the matrix multiplication. But writing of the map, it is just convenient to write it out this way. So, in Keras’s for example there is a embedding layer and we use the embedding layer then it more efficiently just pulls out the column you want from the embedding matrix rather than does it with a much slower matrix vector multiplication.

So, in this video you saw the notations were used to describe algorithms to learning these embeddings and the key terminology is this matrix capital E which contain all the embeddings for the words of the vocabulary. In the next video, we’ll start to talk about specific algorithms for learning this matrix E. Let’s go onto the next video.

02_learning-word-embeddings-word2vec-glove

01_learning-word-embeddings

In this video, you’ll start to learn some concrete algorithms for learning word embeddings. In the history of deep learning as applied to learning word embeddings, people actually started off with relatively complex algorithms. And then over time, researchers discovered they can use simpler and simpler and simpler algorithms and still get very good results especially for a large dataset. But what happened is, some of the algorithms that are most popular today, they are so simple that if I present them first, it might seem almost a little bit magical, how can something this simple work? So, what I’m going to do is start off with some of the slightly more complex algorithms because I think it’s actually easier to develop intuition about why they should work, and then we’ll move on to simplify these algorithms and show you some of the simple algorithms that also give very good results. So, let’s get started.

Let’s say you’re building a language model and you do it with a neural network. So, during training, you might want your neural network to do something like input, I want a glass of orange, and then predict the next word in the sequence. And below each of these words, I have also written down the index in the vocabulary of the different words. So it turns out that building a neural language model is the small way to learn a set of embeddings. And the ideas I present on this slide were due to Yoshua Bengio, Rejean Ducharme, Pascals Vincent, and Christian Jauvin. So, here’s how you can build a neural network to predict the next word in the sequence. Let me take the list of words, I want a glass of orange, and let’s start with the first word I. So I’m going to construct one add vector corresponding to the word I. So there’s a one add vector with a one in position, 4343. So this is going to be 10,000 dimensional vector. And what we’re going to do is then have a matrix of parameters E, and take E times O to get an embedding vector e4343, and this step really means that e4343 is obtained by the matrix E times the one add vector 43. And then we’ll do the same for all of the other words. So the word want, is where 9665 one add vector, multiply by E to get the embedding vector. And similarly, for all the other words. A, is a first word in dictionary, alphabetic comes first, so there is O one, gets this E one. And similarly, for the other words in this phrase. So now you have a bunch of three dimensional embedding, so each of this is a 300 dimensional embedding vector. And what we can do, is fill all of them into a neural network. So here is the neural network layer. And then this neural network feeds to a softmax, which has it’s own parameters as well. And a softmax classifies among the 10,000 possible outputs in the vocab for those final word we’re trying to predict. And so, if in the training slide we saw the word juice then, the target for the softmax in training repeat that it should predict the other word juice was what came after this. So this hidden name here will have his own parameters. So have some, I’m going to call this W1 and there’s also B1. The softmax there was this own parameters W2, B2, and they’re using 300 dimensional word embeddings, then here we have six words. So, this would be six times 300. So this layer or this input will be a 1,800 dimensional vector obtained by taking your six embedding vectors and stacking them together.

Well, what’s actually more commonly done is to have a fixed historical window. So for example, you might decide that you always want to predict the next word given say the previous four words, where four here is a hyperparameter of the algorithm. So this is how you adjust to either very long or very short sentences or you decide to always just look at the previous four words, so you say, I will still use those four words. And so, let’s just get rid of these. And so, if you’re always using a four word history, this means that your neural network will input a 1,200 dimensional feature vector, go into this layer, then have a softmax and try to predict the output. And again, variety of choices. And using a fixed history, just means that you can deal with even arbitrarily long sentences because the input sizes are always fixed. So, the parameters of this model will be this matrix E, and use the same matrix E for all the words. So you don’t have different matrices for different positions in the proceedings four words, is the same matrix E. And then, these weights are also parameters of the algorithm and you can use backprop to perform gradient descent to maximize the likelihood of your training set to just repeatedly predict given four words in a sequence, what is the next word in your text corpus? And it turns out that this algorithm we’ll learn pretty decent word embeddings. And the reason is, if you remember our orange juice, apple juice example, is in the algorithm’s incentive to learn pretty similar word embeddings for orange and apple because doing so allows it to fit the training set better because it’s going to see orange juice sometimes, or see apple juice sometimes, and so, if you have only a 300 dimensional feature vector to represent all of these words, the algorithm will find that it fits the training set fast. If apples, oranges, and grapes, and pears, and so on and maybe also durians which is a very rare fruit and that with similar feature vectors. So, this is one of the earlier and pretty successful algorithms for learning word embeddings, for learning this matrix E. But now let’s generalize this algorithm and see how we can derive even simpler algorithms.

So, I want to illustrate the other algorithms using a more complex sentence as our example. Let’s say that in your training set, you have this longer sentence, I want a glass of orange juice to go along with my cereal. So, what we saw on the last slide was that the job of the algorithm was to predict some word juice, which we are going to call the target words, and it was given some context which was the last four words. And so, if your goal is to learn a embedding of researchers I’ve experimented with many different types of context. If it goes to build a language model then is natural for the context to be a few words right before the target word. But if your goal is into learn the language model per se, then you can choose other contexts. For example, you can pose a learning problem where the context is the four words on the left and right. So, you can take the four words on the left and right as the context, and what that means is that we’re posing a learning problem where the algorithm is given four words on the left. So, a glass of orange, and four words on the right, to go along with, and this has to predict the word in the middle. And posing a learning problem like this where you have the embeddings of the left four words and the right four words feed into a neural network, similar to what you saw in the previous slide, to try to predict the word in the middle, try to put it target word in the middle, this can also be used to learn word embeddings. Or if you want to use a simpler context, maybe you’ll just use the last one word. So given just the word orange, what comes after orange? So this will be different learning problem where you tell it one word, orange, and will say well, what do you think is the next word. And you can construct a neural network that just fits in the word, the one previous word or the embedding of the one previous word to a neural network as you try to predict the next word. Or, one thing that works surprisingly well is to take a nearby one word. Some might tell you that, well, take the word glass, is somewhere close by. Some might say, I saw the word glass and then there’s another words somewhere close to glass, what do you think that word is? So, that’ll be using nearby one word as the context. And we’ll formalize this in the next video but this is the idea of a Skip-Gram model, and just an example of a simpler algorithm where the context is now much simpler, is just one word rather than four words, but this works remarkably well. So what researchers found was that if you really want to build a language model, it’s natural to use the last few words as a context. But if your main goal is really to learn a word embedding, then you can use all of these other contexts and they will result in very meaningful work embeddings as well. I will formalize the details of this in the next video where we talk about the Word2Vec model.

To summarize, in this video you saw how the language modeling problem which causes the pose of machines learning problem where you input the context like the last four words and predicts some target words, how posing that problem allows you to learn input word embedding. In the next video, you’ll see how using even simpler context and even simpler learning algorithms to mark from context to target word, can also allow you to learn a good word embedding. Let’s go on to the next video where we’ll discuss the Walter VEC.

02_word2vec

In the last video, you saw how you can learn a neural language model in order to get good word embeddings. In this video, you see the Word2Vec algorithm which is simple and comfortably more efficient way to learn this types of embeddings. Lets take a look.

Most of the ideas I’ll present in this video are due to Tomas Mikolov, Kai Chen, Greg Corrado, and Jeff Dean. Let’s say you’re given this sentence in your training set. In the skip-gram model, what we’re going to do is come up with a few context to target pairs to create our supervised learning problem. So rather than having the context be always the last four words or the last end words immediately before the target word, what I’m going to do is, say, randomly pick a word to be the context word. And let’s say we chose the word orange. And what we’re going to do is randomly pick another word within some window. Say plus minus five words or plus minus ten words of the context word and we choose that to be target word. So maybe just by chance you might pick juice to be a target word, that’s just one word later. Or you might choose two words before. So you have another pair where the target could be glass or, Maybe just by chance you choose the word my as the target. And so we’ll set up a supervised learning problem where given the context word, you’re asked to predict what is a randomly chosen word within say, a plus minus ten word window, or plus minus five or ten word window of that input context word. And obviously, this is not a very easy learning problem, because within plus minus 10 words of the word orange, it could be a lot of different words. But a goal of setting up this supervised learning problem, isn’t to do well on the supervised learning problem per se, it is that we want to use this learning problem to learn good word embeddings.

So, here are the details of the model. Let’s say that we’ll continue to our vocab of 10,000 words. And some have been on vocab sizes that exceeds a million words. But the basic supervised learning problem we’re going to solve is that we want to learn the mapping from some Context c, such as the word orange to some target, which we will call t, which might be the word juice or the word glass or the word my, if we use the example from the previous slide. So in our vocabulary, orange is word 6257, and the word juice is the word 4834 in our vocab of 10,000 words. And so that’s the input x that you want to learn to map to that open y. So to represent the input such as the word orange, you can start out with some one hot vector which is going to be write as $o_c$, so there’s a one hot vector for the context words. And then similar to what you saw on the last video you can take the embedding matrix E, multiply E by the vector $o_c$, and this gives you your embedding vector for the input context word, so here $e_c$ is equal to capital E times that one hot vector. Then in this new network that we formed we’re going to take this vector $e_c$ and feed it to a softmax unit. So I’ve been drawing softmax unit as a node in a neural network. That’s not an o, that’s a softmax unit. And then there’s a drop in the softmax unit to output $\hat{y}$. So to write out this model in detail. This is the model, the softmax model, probability of different tanka words given the input context word as e to the e, theta t transpose,$e_c$. Divided by some over all words, so we’re going to say, sum from J equals one to all 10,000 words of e to the theta j transposed $e_c$. So here theta T is the parameter associated with, I’ll put t, but really there’s a chance of a particular word, t, being the label. So I’ve left off the biased term to solve mass but we could include that too if we wish. And then finally the loss function for softmax will be the usual. So we use y to represent the target word. And we use a one-hot representation for y hat and y here. Then the lost would be The negative log liklihood, so sum from i equals 1 to 10,000 of $y_ilog(\hat{y}_i)$. So that’s a usual loss for softmax where we’re representing the target y as a one hot vector. So this would be a one hot vector with just 1 1 and the rest zeros. And if the target word is juice, then it’d be element 4834 from up here. That is equal to 1 and the rest will be equal to 0. And similarly Y hat will be a 10,000 dimensional vector output by the softmax unit with probabilities for all 10,000 possible targets words. So to summarize, this is the overall little model, little neural network with basically looking up the embedding and then just a soft max unit. And the matrix E will have a lot of parameters, so the matrix E has parameters corresponding to all of these embedding vectors, $e_c$. And then the softmax unit also has parameters that gives the theta T parameters but if you optimize this loss function with respect to the all of these parameters, you actually get a pretty good set of embedding vectors. So this is called the skip-gram model because is taking as input one word like orange and then tr$y_i$ng to predict some words skipping a few words from the left or the right side. To predict what comes little bit before little bit after the context words.

Now, it turns out there are a couple problems with using this algorithm. And the primary problem is computational speed. In particular, for the softmax model, every time you want to evaluate this probability, you need to carry out a sum over all 10,000 words in your vocabulary. And maybe 10,000 isn’t too bad, but if you’re using a vocabulary of size 100,000 or a 1,000,000, it gets really slow to sum up over this denominator every single time. And, in fact, 10,000 is actually already that will be quite slow, but it makes even harder to scale to larger vocabularies. So there are a few solutions to this, one which you see in the literature is to use a hierarchical softmax classifier. And what that means is, instead of trying to categorize something into all 10,000 carries on one go. Imagine if you have one classifier, it tells you is the target word in the first 5,000 words in the vocabulary? Or is in the second 5,000 words in the vocabulary? And lets say this binary cost that it tells you this is in the first 5,000 words, think of second class to tell you that this in the first 2,500 words of vocab or in the second 2,500 words vocab and so on. Until eventually you get down to classify exactly what word it is, so that the leaf of this tree, and so having a tree of classifiers like this, means that each of the retriever nodes of the tree can be just a binding classifier. And so you don’t need to sum over all 10,000 words or else it will capsize in order to make a single classification. In fact, the computational classifying tree like this scales like log of the vocab size rather than linear in vocab size. So this is called a hierarchical softmax classifier. I should mention in practice, the hierarchical softmax classifier doesn’t use a perfectly balanced tree or this perfectly symmetric tree, with equal numbers of words on the left and right sides of each branch. In practice, the hierarchical softmax classifier can be developed so that the common words tend to be on top, whereas the less common words like durian can be buried much deeper in the tree. Because you see the more common words more often, and so you might need only a few traversals to get to common words like the and of. Whereas you see less frequent words like durian much less often, so it says okay that are buried deep in the tree because you don’t need to go that deep. So there are various heuristics for building the tree how you used to build the hierarchical software spire. So this is one idea you see in the literature, the speeding up the softmax classification. But I won’t spend too much more time. And you can read more details of this on the paper that I referenced by Thomas and others, on the first slide. But I won’t spend too much more time on this. Because in the next video, where she talk about a different method, called nectar sampling, which I think is even simpler. And also works really well for speeding up the softmax classifier and the problem of needing the sum over the entire cap size in the denominator. So you see more of that in the next video. But before moving on, one quick Topic I want you to understand is how to sample the context C. So once you sample the context C, the target T can be sampled within, say, a plus minus ten word window of the context C, but how do you choose the context C? One thing you could do is just sample uniformly, at random, from your training corpus. When we do that, you find that there are some words like the, of, a, and, to and so on that appear extremely frequently. And so, if you do that, you find that in your context to target mapping pairs just get these these types of words extremely frequently, whereas there are other words like orange, apple, and also durian that don’t appear that often. And maybe you don’t want your training site to be dominated by these extremely frequently or current words, because then you spend almost all the effort updating $e_c$, for those frequently occurring words. But you want to make sure that you spend some time updating the embedding, even for these less common words like e durian. So in practice the distribution of words $P(c)$ isn’t taken just entirely uniformly at random for the training set purpose, but instead there are different heuristics that you could use in order to balance out something from the common words together with the less common words.

So that’s it for the Word2Vec skip-gram model. If you read the original paper by that I referenced earlier, you find that that paper actually had two versions of this Word2Vec model, the skip gram was one. And the other one is called the CBow, the continuous backwards model, which takes the surrounding contexts from middle word, and uses the surrounding words to try to predict the middle word, and that algorithm also works, it has some advantages and disadvantages. But the key problem with this algorithm with the skip-gram model as presented so far is that the softmax step is very expensive to calculate because needing to sum over your entire vocabulary size into the denominator of the soft packs. In the next video I show you an algorithm that modifies the training objective that makes it run much more efficiently therefore lets you apply this in a much bigger fitting set as well and therefore learn much better word embeddings. Lets go onto the next video.

03_negative-sampling

In the last video, you saw how the Skip-Gram model allows you to construct a supervised learning task. So we map from context to target and how that allows you to learn a useful word embedding. But the downside of that was the Softmax objective was slow to compute. In this video, you’ll see a modified learning problem called negative sampling that allows you to do something similar to the Skip-Gram model you saw just now, but with a much more efficient learning algorithm. Let’s see how you can do this.

Most of the ideas presented in this video are due to Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeff Dean. So what we’re going to do in this algorithm is create a new supervised learning problem. And the problem is, given a pair of words like orange and juice, we’re going to predict, is this a context-target pair? So in this example, orange juice was a positive example. And how about orange and king? Well, that’s a negative example, so I’m going to write 0 for the target. So what we’re going to do is we’re actually going to sample a context and a target word. So in this case, we have orange and juice and we’ll associate that with a label of 1, so just put words in the middle. And then having generated a positive example, so the positive example is generated exactly how we generated it in the previous videos. Sample a context word, look around a window of say, plus-minus ten words and pick a target word. So that’s how you generate the first row of this table with orange, juice, 1. And then to generate a negative example, you’re going to take the same context word and then just pick a word at random from the dictionary. So in this case, I chose the word king at random and we will label that as 0. And then let’s take orange and let’s pick another random word from the dictionary. Under the assumption that if we pick a random word, it probably won’t be associated with the word orange, so orange, book, 0. And let’s pick a few others, orange, maybe just by chance, we’ll pick the 0 and then orange. And then orange, and maybe just by chance, we’ll pick the word of and we’ll put a 0 there. And notice that all of these are labeled as 0 even though the word of actually appears next to orange as well.

So to summarize, the way we generated this data set is, we’ll pick a context word and then pick a target word and that is the first row of this table. That gives us a positive example. So context, target, and then give that a label of 1. And then what we’ll do is for some number of times say, k times, we’re going to take the same context word and then pick random words from the dictionary, king, book, the, of, whatever comes out at random from the dictionary and label all those 0, and those will be our negative examples. And it’s okay if just by chance, one of those words we picked at random from the dictionary happens to appear in the window, in a plus-minus ten word window say, next to the context word, orange. Then we’re going to create a supervised learning problem where the learning algorithm inputs x, inputs this pair of words, and it has to predict the target label to predict the output y. So the problem is really given a pair of words like orange and juice, do you think they appear together? Do you think I got these two words by sampling two words close to each other? Or do you think I got them as one word from the text and one word chosen at random from the dictionary? It’s really to try to distinguish between these two types of distributions from which you might sample a pair of words. So this is how you generate the training set. How do you choose k, Mikolov et al, recommend that maybe k is 5 to 20 for smaller data sets. And if you have a very large data set, then chose k to be smaller. So k equals 2 to 5 for larger data sets, and large values of k for smaller data sets. Okay, and in this example, I’ll just use k = 4.

Next, let’s describe the supervised learning model for learning a mapping from x to y. So here was the Softmax model you saw from the previous video. And here is the training set we got from the previous slide where again, this is going to be the new input x and this is going to be the value of y you’re trying to predict. So to define the model, I’m going to use this to denote, this was c for the context word, this to denote the possible target word, t, and this, I’ll use y to denote 0, 1, this is a context target pair. So what we’re going to do is define a logistic regression model. Say, that the chance of y = 1, given the input c, t pair, we’re going to model this as basically a regression model, but the specific formula we’ll use s sigma applied to theta transpose, theta t transpose, e c. So the parameters are similar as before, you have one parameter vector theta for each possible target word. And a separate parameter vector, really the embedding vector, for each possible context word. And we’re going to use this formula to estimate the probability that y is equal to 1. So if you have k examples here, then you can think of this as having a k to 1 ratio of negative to positive examples. So for every positive examples, you have k negative examples with which to train this logistic regression-like model. And so to draw this as a neural network, if the input word is orange, Which is word 6257, then what you do is, you input the one hop vector passing through e, do the multiplication to get the embedding vector 6257. And then what you have is really 10,000 possible logistic regression classification problems. Where one of these will be the classifier corresponding to, well, is the target word juice or not? And then there will be other words, for example, there might be ones somewhere down here which is predicting, is the word king or not and so on, for these possible words in your vocabulary. So think of this as having 10,000 binary logistic regression classifiers, but instead of training all 10,000 of them on every iteration, we’re only going to train five of them. We’re going to train the one responding to the actual target word we got and then train four randomly chosen negative examples. And this is for the case where k is equal to 4. So instead of having one giant 10,000 way Softmax, which is very expensive to compute, we’ve instead turned it into 10,000 binary classification problems, each of which is quite cheap to compute. And on every iteration, we’re only going to train five of them or more generally, k + 1 of them, of k negative examples and one positive examples. And this is why the computation cost of this algorithm is much lower because you’re updating k + 1, let’s just say units, k + 1 binary classification problems. Which is relatively cheap to do on every iteration rather than updating a 10,000 way Softmax classifier. So you get to play with this algorithm in the problem exercise for this week as well. So this technique is called negative sampling because what you’re doing is, you have a positive example, the orange and then juice. And then you will go and deliberately generate a bunch of negative examples, negative samplings, hence, the name negative sampling, with which to train four more of these binary classifiers. And on every iteration, you choose four different random negative words with which to train your algorithm on.

Now, before wrapping up, one more important detail with this algorithm is, how do you choose the negative examples? So after having chosen the context word orange, how do you sample these words to generate the negative examples? So one thing you could do is sample the words in the middle, the candidate target words. One thing you could do is sample it according to the empirical frequency of words in your corpus. So just sample it according to how often different words appears. But the problem with that is that you end up with a very high representation of words like the, of, and, and so on. One other extreme would be to say, you use 1 over the vocab size, sample the negative examples uniformly at random, but that’s also very non-representative of the distribution of English words. So the authors, Mikolov et al, reported that empirically, what they found to work best was to take this heuristic value, which is a little bit in between the two extremes of sampling from the empirical frequencies, meaning from whatever’s the observed distribution in English text to the uniform distribution. And what they did was they sampled proportional to their frequency of a word to the power of three-fourths. So if f of wi is the observed frequency of a particular word in the English language or in your training set corpus, then by taking it to the power of three-fourths, this is somewhere in-between the extreme of taking uniform distribution. And the other extreme of just taking whatever was the observed distribution in your training set. And so I’m not sure this is very theoretically justified, but multiple researchers are now using this heuristic, and it seems to work decently well.

So to summarize, you’ve seen how you can learn word vectors in a Softmax classier, but it’s very computationally expensive. And in this video, you saw how by changing that to a bunch of binary classification problems, you can very efficiently learn words vectors. And if you run this algorithm, you will be able to learn pretty good word vectors. Now of course, as is the case in other areas of deep learning as well, there are open source implementations. And there are also pre-trained word vectors that others have trained and released online under permissive licenses. And so if you want to get going quickly on a NLP problem, it’d be reasonable to download someone else’s word vectors and use that as a starting point. So that’s it for the Skip-Gram model. In the next video, I want to share with you yet another version of a word embedding learning algorithm that is maybe even simpler than what you’ve seen so far. So in the next video, let’s learn about the Glove algorithm.

04_glove-word-vectors

You learn about several algorithms for computing words embeddings. Another algorithm that has some momentum in the NLP community is the GloVe algorithm. This is not used as much as the Word2Vec or the skip-gram models, but it has some enthusiasts. Because I think, in part of its simplicity. Let’s take a look.

The GloVe algorithm was created by Jeffrey Pennington, Richard Socher, and Chris Manning. And GloVe stands for global vectors for word representation. So, previously, we were sampling pairs of words, context and target words, by picking two words that appear in close proximity to each other in our text corpus. So, what the GloVe algorithm does is, it starts off just by making that explicit. So, let’s say $X_{ij}$ be the number of times that a word i appears in the context of j. And so, here i and j play the role of t and c, so you can think of $X_{ij}$ as being x subscript tc. But, you can go through your training corpus and just count up how many words does a word i appear in the context of a different word j. How many times does the word t appear in context of different words c. And depending on the definition of context and target words, you might have that $X_{ij}$ equals $X_{ji}$. And in fact, if you’re defining context and target in terms of whether or not they appear within plus minus 10 words of each other, then it would be a symmetric relationship. Although, if your choice of context was that, the context is always the word immediately before the target word, then $X_{ij}$ and $X_{ji}$ may not be symmetric like this. But for the purposes of the GloVe algorithm, we can define context and target as whether or not the two words appear in close proximity, say within plus or minus 10 words of each other. So, $X_{ij}$ is a count that captures how often do words i and j appear with each other, or close to each other. So what the GloVe model does is, it optimizes the following. We’re going to minimize the difference between theta i transpose e_j minus log of $X_{ij}$ squared. I’m going to fill in some of the parts of this equation. But again, think of i and j as playing the role of t and c. So this is a bit like what you saw previously with theta t transpose e_c. And what you want is, for this to tell you how related are those two words? How related are words t and c? How related are words i and j as measured by how often they occur with each other? Which is affected by this $X_{ij}$. And so, what we’re going to do is, solve for parameters theta and e using gradient descent to minimize the sum over i equals one to 10,000 sum over j from one to 10,000 of this difference. So you just want to learn vectors, so that their end product is a good predictor for how often the two words occur together. Now, just some additional details, if $X_{ij}$ is equal to zero, then log of 0 is undefined, is negative infinity. And so, what we do is, we want sum over the terms where $X_{ij}$ is equal to zero. And so, what we’re going to do is, add an extra weighting term. So this is going to be a weighting term, and this will be equal to zero if $X_{ij}$ is equal to zero. And we’re going to use a convention that zero log zero is equal to zero. So what this means is, that if $X_{ij}$ is equal to zero, just don’t bother to sum over that $X_{ij}$ pair. So then this log of zero term is not relevant. So this means the sum is sum only over the pairs of words that have co-occurred at least once in that context-target relationship. The other thing that $F(X_{ij})$ does is that, there are some words they just appear very often in the English language like, this, is, of, a, and so on. Sometimes we used to call them stop words but there’s really a continuum between frequent and infrequent words. And then there are also some infrequent words like durion, which you actually still want to take into account, but not as frequently as the more common words. And so, the weighting factor can be a function that gives a meaningful amount of computation, even to the less frequent words like durion, and gives more weight but not an unduly large amount of weight to words like, this, is, of, a, which just appear lost in language. And so, there are various heuristics for choosing this weighting function F that need or gives these words too much weight nor gives the infrequent words too little weight. You can take a look at the GloVe paper, they are referenced in the previous slide, if you want the details of how F can be chosen to be a heuristic to accomplish this. And then, finally, one funny thing about this algorithm is that the roles of theta and e are now completely symmetric. So, theta i and e_j are symmetric in that, if you look at the math, they play pretty much the same role and you could reverse them or sort them around, and they actually end up with the same optimization objective. One way to train the algorithm is to initialize theta and e both uniformly around gradient descent to minimize its objective, and then when you’re done for every word, to then take the average. For a given words w, you can have e final to be equal to the embedding that was trained through this gradient descent procedure, plus theta trained through this gradient descent procedure divided by two, because theta and e in this particular formulation play symmetric roles unlike the earlier models we saw in the previous videos, where theta and e actually play different roles and couldn’t just be averaged like that. That’s it for the GloVe algorithm. I think one confusing part of this algorithm is, if you look at this equation, it seems almost too simple. How could it be that just minimizing a square cost function like this allows you to learn meaningful word embeddings? But it turns out that this works. And the way that the inventors end up with this algorithm was, they were building on the history of much more complicated algorithms like the newer language model, and then later, there came the Word2Vec skip-gram model, and then this came later. And we really hope to simplify all of the earlier algorithms.

Before concluding our discussion of algorithms concerning word embeddings, there’s one more property of them that we should discuss briefly. Which is that? We started off with this featurization view as the motivation for learning word vectors. We said, “Well, maybe the first component of the embedding vector to represent gender, the second component to represent how royal it is, then the age and then whether it’s a food, and so on.” But when you learn a word embedding using one of the algorithms that we’ve seen, such as the GloVe algorithm that we just saw on the previous slide, what happens is, you cannot guarantee that the individual components of the embeddings are interpretable. Why is that? Well, let’s say that there is some space where the first axis is gender and the second axis is royal. What you can do is guarantee that the first axis of the embedding vector is aligned with this axis of meaning, of gender, royal, age and food. And in particular, the learning algorithm might choose this to be axis of the first dimension. So, given maybe a context of words, so the first dimension might be this axis and the second dimension might be this. Or it might not even be orthogonal, maybe it’ll be a second non-orthogonal axis, could be the second component of the word embeddings you actually learn. And when we see this, if you have a subsequent understanding of linear algebra is that, if there was some invertible matrix A, then this could just as easily be replaced with A times theta i transpose A inverse transpose e_j. Because we expand this out, this is equal to theta i transpose A transpose A inverse transpose times e_j. And so, the middle term cancels out and we’re left with theta i transpose e_j, same as before. Don’t worry if you didn’t follow the linear algebra, but that’s a brief proof that shows that with an algorithm like this, you can’t guarantee that the axis used to represent the features will be well-aligned with what might be easily humanly interpretable axis. In particular, the first feature might be a combination of gender, and royal, and age, and food, and cost, and size, is it a noun or an action verb, and all the other features. It’s very difficult to look at individual components, individual rows of the embedding matrix and assign the human interpretation to that. But despite this type of linear transformation, the parallelogram map that we worked out when we were describing analogies, that still works. And so, despite this potentially arbitrary linear transformation of the features, you end up learning the parallelogram map for figure analogies still works.

So, that’s it for learning word embeddings. You’ve now seen a variety of algorithms for learning these word embeddings and you get to play them more in this week’s programming exercise as well. Next, I’d like to show you how you can use these algorithms to carry out sentiment classification. Let’s go onto the next video.

03_applications-using-word-embeddings

01_sentiment-classification

Sentiment classification is the task of looking at a piece of text and telling if someone likes or dislikes the thing they’re talking about. It is one of the most important building blocks in NLP and is used in many applications. One of the challenges of sentiment classification is you might not have a huge label training set for it. But with word embeddings, you’re able to build good sentiment classifiers even with only modest-size label training sets. Let’s see how you can do that.

So here’s an example of a sentiment classification problem. The input X is a piece of text and the output Y that you want to predict is what is the sentiment, such as the star rating of, let’s say, a restaurant review. So if someone says, “The dessert is excellent” and they give it a four-star review, “Service was quite slow” two-star review, “Good for a quick meal but nothing special” three-star review. And this is a pretty harsh review, “Completely lacking in good taste, good service, and good ambiance.” That’s a one-star review. So if you can train a system to map from X or Y based on a label data set like this, then you could use it to monitor comments that people are saying about maybe a restaurant that you run. So people might also post messages about your restaurant on social media, on Twitter, or Facebook, or Instagram, or other forms of social media. And if you have a sentiment classifier, they can look just a piece of text and figure out how positive or negative is the sentiment of the poster toward your restaurant. Then you can also be able to keep track of whether or not there are any problems or if your restaurant is getting better or worse over time. So one of the challenges of sentiment classification is you might not have a huge label data set. So for sentimental classification task, training sets with maybe anywhere from 10,000 to maybe 100,000 words would not be uncommon. Sometimes even smaller than 10,000 words and word embeddings that you can take can help you to much better understand especially when you have a small training set. So here’s what you can do.

We’ll go for a couple different algorithms in this video. Here’s a simple sentiment classification model. You can take a sentence like “dessert is excellent” and look up those words in your dictionary. We use a 10,000-word dictionary as usual. And let’s build a classifier to map it to the output Y that this was four stars. So given these four words, as usual, we can take these four words and look up the one-hot vector. So there’s 0 8 9 2 8 which is a one-hot vector multiplied by the embedding matrix E, which can learn from a much larger text corpus. It can learn in embedding from, say, a billion words or a hundred billion words, and use that to extract out the embedding vector for the word “the”, and then do the same for “dessert”, do the same for “is” and do the same for “excellent”. And if this was trained on a very large data set, like a hundred billion words, then this allows you to take a lot of knowledge even from infrequent words and apply them to your problem, even words that weren’t in your labeled training set. Now here’s one way to build a classifier, which is that you can take these vectors, let’s say these are 300-dimensional vectors, and you could then just sum or average them. And I’m just going to put a bigger average operator here and you could use sum or average. And this gives you a 300-dimensional feature vector that you then pass to a soft-max classifier which then outputs Y-hat. And so the softmax can output what are the probabilities of the five possible outcomes from one-star up to five-star. So this will be assortment of the five possible outcomes to predict what is Y. So notice that by using the average operation here, this particular algorithm works for reviews that are short or long because even if a review that is 100 words long, you can just sum or average all the feature vectors for all hundred words and so that gives you a representation, a 300-dimensional feature representation, that you can then pass into your sentiment classifier. So this average will work decently well. And what it does is it really averages the meanings of all the words or sums the meaning of all the words in your example. And this will work to [inaudible]. So one of the problems with this algorithm is it ignores word order. In particular, this is a very negative review, “Completely lacking in good taste, good service, and good ambiance”. But the word good appears a lot. This is a lot. Good, good, good. So if you use an algorithm like this that ignores word order and just sums or averages all of the embeddings for the different words, then you end up having a lot of the representation of good in your final feature vector and your classifier will probably think this is a good review even though this is actually very harsh. This is a one-star review. So here’s a more sophisticated model which is that, instead of just summing all of your word embeddings, you can instead use a RNN for sentiment classification. So here’s what you can do. You can take that review, “Completely lacking in good taste, good service, and good ambiance”, and find for each of them, the one-hot vector. And so I’m going to just skip the one-hot vector representation but take the one-hot vectors, multiply it by the embedding matrix E as usual, then this gives you the embedding vectors and then you can feed these into an RNN. And the job of the RNN is to then compute the representation at the last time step that allows you to predict Y-hat. So this is an example of a many-to-one RNN architecture which we saw in the previous week. And with an algorithm like this, it will be much better at taking word sequence into account and realize that “things are lacking in good taste” is a negative review and “not good” a negative review unlike the previous algorithm, which just sums everything together into a big-word vector mush and doesn’t realize that “not good” has a very different meaning than the words “good” or “lacking in good taste” and so on.

And so if you train this algorithm, you end up with a pretty decent sentiment classification algorithm and because your word embeddings can be trained from a much larger data set, this will do a better job generalizing to maybe even new words now that you’ll see in your training set, such as if someone else says, “Completely absent of good taste, good service, and good ambiance” or something, then even if the word “absent” is not in your label training set, if it was in your 1 billion or 100 billion word corpus used to train the word embeddings, it might still get this right and generalize much better even to words that were in the training set used to train the word embeddings but not necessarily in the label training set that you had for specifically the sentiment classification problem.

So that’s it for sentiment classification, and I hope this gives you a sense of how once you’ve learned or downloaded from online a word embedding, this allows you to quite quickly build pretty effective NLP systems.

02_debiasing-word-embeddings

Machine learning and AI algorithms are increasingly trusted to help with, or to make, extremely important decisions. And so we like to make sure that as much as possible that they’re free of undesirable forms of bias, such as gender bias, ethnicity bias and so on. What I want to do in this video is show you some of the ideas for diminishing or eliminating these forms of bias in word embeddings. When I use the term bias in this video, I don’t mean the bias variants or sense of the bias, instead I mean gender, ethnicity, sexual orientation bias. That’s a different sense of bias then is typically used in the technical discussion on machine learning.

But mostly the problem, we talked about how word embeddings can learn analogies like man is to woman as king is to queen. But what if you ask it, man is to computer programmer as woman is to what? And so the authors of this paper Tolga Bolukbasi, Kai-Wei Chang, James Zou, Venkatesh Saligrama, and Adam Kalai found a somewhat horrifying result where a learned word embedding might output Man:Computer_Programmer as Woman:Homemaker. And that just seems wrong and it enforces a very unhealthy gender stereotype. It’d be much more preferable to have algorithm output man is to computer programmer as a woman is to computer programmer. And they found also, Father:Doctor as Mother is to what? And the really unfortunate result is that some learned word embeddings would output Mother:Nurse. So word embeddings can reflect the gender, ethnicity, age, sexual orientation, and other biases of the text used to train the model. One that I’m especially passionate about is bias relating to socioeconomic status. I think that every person, whether you come from a wealthy family, or a low income family, or anywhere in between, I think everyone should have great opportunities. And because machine learning algorithms are being used to make very important decisions. They’re influencing everything ranging from college admissions, to the way people find jobs, to loan applications, whether your application for a loan gets approved, to in the criminal justice system, even sentencing guidelines. Learning algorithms are making very important decisions and so I think it’s important that we try to change learning algorithms to diminish as much as is possible, or, ideally, eliminate these types of undesirable biases. Now in the case of word embeddings, they can pick up the biases of the text used to train the model and so the biases they pick up or tend to reflect the biases in text as is written by people. Over many decades, over many centuries, I think humanity has made progress in reducing these types of bias. And I think maybe fortunately for AI, I think we actually have better ideas for quickly reducing the bias in AI than for quickly reducing the bias in the human race. Although I think we’re by no means done for AI as well and there’s still a lot of research and hard work to be done to reduce these types of biases in our learning algorithms. But what I want to do in this video is share with you one example of a set of ideas due to the paper referenced at the bottom by Bolukbasi and others on reducing the bias in word embeddings. So here’s the idea.

Let’s say that we’ve already learned a word embedding, so the word babysitter is here, the word doctor is here. We have grandmother here, and grandfather here. Maybe the word girl is embedded there, the word boy is embedded there. And maybe she is embedded here, and he is embedded there. So the first thing we’re going to do it is identify the direction corresponding to a particular bias we want to reduce or eliminate. And, for illustration, I’m going to focus on gender bias but these ideas are applicable to all of the other types of bias that I mention on the previous slide as well. And so how do you identify the direction corresponding to the bias? For the case of gender, what we can do is take the embedding vector for he and subtract the embedding vector for she, because that differs by gender. And take e male, subtract e female, and take a few of these and average them, right? And take a few of these differences and basically average them. And this will allow you to figure out in this case that what looks like this direction(the horizontal direction in the slide) is the gender direction, or the bias direction. Whereas this direction(the vertical direction in the slide) is unrelated to the particular bias we’re trying to address. So this is the non-bias direction. And in this case, the bias direction, think of this as a 1D subspace whereas a non-bias direction, this will be 299-dimensional subspace. Okay, and I’ve simplified the description a little bit in the original paper. The bias direction can be higher than 1-dimensional, and rather than take an average, as I’m describing it here, it’s actually found using a more complicated algorithm called a SVD, a singular value decomposition. Which is closely related to, if you’re familiar with principle component analysis, it uses ideas similar to the pca or the principle component analysis algorithm.

After that, the next step is a neutralization step. So for every word that’s not definitional, project it to get rid of bias. So there are some words that intrinsically capture gender. So words like grandmother, grandfather, girl, boy, she, he, a gender is intrinsic in the definition. Whereas there are other word like doctor and babysitter that we want to be gender neutral. And really, in the more general case, you might want words like doctor or babysitter to be ethnicity neutral or sexual orientation neutral, and so on, but we’ll just use gender as the illustrating example here. But so for every word that is not definitional, this basically means not words like grandmother and grandfather, which really have a very legitimate gender component, because, by definition, grandmothers are female, and grandfathers are male. So for words like doctor and babysitter, let’s just project them onto this axis to reduce their components, or to eliminate their component, in the bias direction. So reduce their component in this horizontal direction. So that’s the second neutralize step.

And then the final step is called equalization in which you might have pairs of words such as grandmother and grandfather, or girl and boy, where you want the only difference in their embedding to be the gender. And so, why do you want that? Well in this example, the distance, or the similarity, between babysitter and grandmother is actually smaller than the distance between babysitter and grandfather. And so this maybe reinforces an unhealthy, or maybe undesirable, bias that grandmothers end up babysitting more than grandfathers. So in the final equalization step, what we’d like to do is to make sure that words like grandmother and grandfather are both exactly the same similarity, or exactly the same distance, from words that should be gender neutral, such as babysitter or such as doctor. So there are a few linear algebra steps for that. But what it will basically do is move grandmother and grandfather to a pair of points that are equidistant from this axis in the middle. And so the effect of that is that now the distance between babysitter, compared to these two words, will be exactly the same. And so, in general, there are many pairs of words like this grandmother-grandfather, boy-girl, sorority-fraternity, girlhood-boyhood, sister-brother, niece-nephew, daughter-son, that you might want to carry out through this equalization step.

So the final detail is, how do you decide what word to neutralize? So for example, the word doctor seems like a word you should neutralize to make it non-gender-specific or non-ethnicity-specific. Whereas the words grandmother and grandmother should not be made non-gender-specific. And there are also words like beard, right, that it’s just a statistical fact that men are much more likely to have beards than women, so maybe beards should be closer to male than female. And so what the authors did is train a classifier to try to figure out what words are definitional, what words should be gender-specific and what words should not be. And it turns out that most words in the English language are not definitional, meaning that gender is not part of the definition. And it’s such a relatively small subset of words like this, grandmother-grandfather, girl-boy, sorority-fraternity, and so on that should not be neutralized. And so a linear classifier can tell you what words to pass through the neutralization step to project out this bias direction, to project it on to this essentially 299-dimensional subspace. And then, finally, the number of pairs you want to equalize, that’s actually also relatively small, and is, at least for the gender example, it is quite feasible to hand-pick most of the pairs you want to equalize. So the full algorithm is a bit more complicated than I present it here, you can take a look at the paper for the full details. And you also get to play with a few of these ideas in the programming exercises as well.

So to summarize, I think that reducing or eliminating bias of our learning algorithms is a very important problem because these algorithms are being asked to help with or to make more and more important decisions in society. In this video I shared just one set of ideas for how to go about trying to address this problem, but this is still a very much an ongoing area of active research by many researchers. So that’s it for this week’s videos. Best of luck with this week’s programming exercises and I look forward to seeing you next week.