Note
This is my personal lecture note from the first week of the Sequence Models course; the copyright belongs to deeplearning.ai.
01_why-sequence-models
Welcome to this fifth course on deep learning. In this course, you learn about sequence models, one of the most exciting areas in deep learning. Models like recurrent neural networks or RNNs have transformed speech recognition, natural language processing and other areas. And in this course, you learn how to build these models for yourself. Let’s start by looking at a few examples of where sequence models can be useful.
In speech recognition you are given an input audio clip X and asked to map it to a text transcript Y. Both the input and the output here are sequence data, because X is an audio clip that plays out over time and Y, the output, is a sequence of words. So sequence models such as recurrent neural networks and other variations you'll learn about in a little bit have been very useful for speech recognition. Music generation is another example of a problem with sequence data. In this case, only the output Y is a sequence; the input can be the empty set, or it can be a single integer, maybe referring to the genre of music you want to generate, or maybe the first few notes of the piece of music you want. So here X can be nothing or maybe just an integer, and the output Y is a sequence. In sentiment classification the input X is a sequence: given an input phrase like, "There is nothing to like in this movie," how many stars do you think this review will be? Sequence models are also very useful for DNA sequence analysis. Your DNA is represented via the four alphabets A, C, G, and T, and so given a DNA sequence, can you label which part of this DNA sequence, say, corresponds to a protein? In machine translation you are given an input sentence, "Voulez-vous chanter avec moi?", and you're asked to output the translation in a different language. In video activity recognition you might be given a sequence of video frames and asked to recognize the activity. And in named entity recognition you might be given a sentence and asked to identify the people in that sentence. So all of these problems can be addressed as supervised learning with labeled data X, Y as the training set. But, as you can tell from this list of examples, there are a lot of different types of sequence problems. In some, both the input X and the output Y are sequences, and in that case X and Y can have different lengths (as in speech recognition), or X and Y can have the same length (as in the DNA and named entity recognition examples). And in some of these examples only X or only Y is a sequence. So in this course you learn about sequence models that are applicable to all of these different settings.
So I hope this gives you a sense of the exciting set of problems that sequence models might be able to help you address. With that, let's go on to the next video, where we start to define the notation we'll use to build these sequence models.
02_notation
In the last video, you saw some of the wide range of applications to which you can apply sequence models. Let's start by defining a notation that we'll use to build up these sequence models.
As a motivating example, let's say you want to build a sequence model to input a sentence like this: Harry Potter and Hermione Granger invented a new spell. These are characters, by the way, from the Harry Potter sequence of novels by J. K. Rowling. And let's say you want a sequence model to automatically tell you where the people's names are in this sentence. This is a problem called named entity recognition, and it is used by search engines, for example, to index all of, say, the last 24 hours' news by all the people mentioned in the news articles, so that they can index them appropriately. Named entity recognition systems can be used to find people's names, company names, times, locations, country names, currency names, and so on in different types of text. Now, given this input x, let's say you want a model to output y that has one output per input word, and the target output y tells you, for each of the input words, whether that word is part of a person's name. Technically this maybe isn't the best output representation; there are some more sophisticated output representations that tell you not just whether a word is part of a person's name, but where people's names start and end in the sentence, so you know that Harry Potter starts here and ends here, and Hermione Granger starts here and ends here. But for this motivating example, I'm just going to stick with this simpler output representation. Now, the input is a sequence of nine words, so eventually we're going to have nine sets of features to represent these nine words. To index into the positions in the sequence, I'm going to use x with superscript angle brackets 1, 2, 3 and so on up to x angle brackets nine, and more generally $x^{<t>}$ to index into the t-th position, with t suggesting that these are temporal sequences. Similarly, I'll use $y^{<t>}$ for the outputs, and $T_x$ and $T_y$ to denote the lengths of the input and output sequences; in this example $T_x = T_y = 9$.
So, to represent a word in the sentence, the first thing you do is come up with a vocabulary, sometimes also called a dictionary, and that means making a list of the words that you will use in your representations. So the first word in the vocabulary is a; that will be the first word in the dictionary. The second word is Aaron, and then a little bit further down is the word and, and then eventually you get to the word Harry, then eventually the word Potter, and then all the way down, maybe the last word in the dictionary is Zulu. So a will be word one, Aaron is word two, and in my dictionary the word and appears at positional index 367, Harry appears in position 4075, Potter in position 6830, and Zulu, the last word in the dictionary, is maybe word 10,000. So in this example I'm going to use a dictionary of size 10,000 words. This is quite small by modern NLP standards. For reasonably sized commercial applications, dictionary sizes of 30,000 to 50,000 are more common, and 100,000 is not uncommon, and some of the large Internet companies will use dictionary sizes that are maybe a million words or even bigger than that. But you see a lot of commercial applications use dictionary sizes of maybe 30,000 or maybe 50,000 words. I'm going to use 10,000 for illustration since it's a nice round number. Having chosen a dictionary of 10,000 words, one way to build this dictionary would be to look through your training set and find the top 10,000 occurring words, or to look through some of the online dictionaries that tell you what the most common 10,000 words in the English language are. What you can then do is use one-hot representations to represent each of these words. For example, $x^{<1>}$, which represents the word Harry, would be a vector with all zeros except for a 1 in position 4075, because that was the position of Harry in the dictionary. Then $x^{<2>}$ will similarly be a vector of all zeros except for a 1 in position 6830 and zeros everywhere else. The word and is at position 367, so $x^{<3>}$ would be a vector with a 1 in position 367 and zeros everywhere else. Each of these would be a 10,000-dimensional vector if your vocabulary has 10,000 words. And because a is the first word in the dictionary, $x^{<7>}$, which corresponds to the word a, would be the vector with a 1 in the first element and zeros everywhere else. So in this representation, $x^{<t>}$ for each position t in the sentence is a one-hot vector, one-hot because it has exactly one 1 and everything else is zero, and the goal is, given this representation for the input x, to learn a mapping to the target output y as a supervised learning problem. One more detail: if you encounter a word that is not in your vocabulary, you create a new token, a fake word called Unknown Word, written <UNK>, to represent it.
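To make this concrete, here is a minimal numpy sketch of the one-hot representation; the tiny vocabulary and indices here are just illustrative (a real dictionary would have on the order of 10,000 entries built from the training corpus).

```python
import numpy as np

# Illustrative vocabulary; a real one would hold ~10,000 words from the training corpus.
vocab = ["a", "aaron", "and", "harry", "potter", "zulu"]
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word, vocab_size):
    """Return a one-hot column vector: all zeros except a 1 at the word's index."""
    vec = np.zeros((vocab_size, 1))
    vec[word_to_index[word]] = 1.0
    return vec

x1 = one_hot("harry", len(vocab))   # x^<1> for the sentence "Harry Potter and ..."
x2 = one_hot("potter", len(vocab))  # x^<2>
```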
So, to summarize, in this video we described a notation for describing your training set, for both x and y, for sequence data. In the next video let's start to describe Recurrent Neural Networks for learning the mapping from x to y.
03_recurrent-neural-network-model
In the last video, you saw the notation we use to define sequence learning problems. Now, let's talk about how you can build a model, build a neural network, to learn the mapping from X to Y.
Now, one thing you could do is try to use a standard neural network for this task. So in our previous example, we had nine input words. You could imagine trying to take these nine input words, maybe the nine one-hot vectors, and feeding them into a standard neural network, maybe a few hidden layers, and then eventually have it output the nine values zero or one that tell you whether each word is part of a person's name. But this turns out not to work well, and there are really two main problems with this. The first is that the inputs and outputs can be different lengths in different examples: it's not as if every single example has the same input length $T_x$ or the same output length $T_y$, and even if you padded or zero-padded every input up to some maximum length, that still wouldn't seem like a good representation. The second, and maybe more serious, problem is that a naive architecture like this doesn't share features learned across different positions of text; in particular, if the network has learned that Harry appearing in position one is a sign of a person's name, it would be nice if it automatically figured out that Harry appearing in some other position is likely part of a person's name too, much as convolutional networks share what's learned in one part of an image with other parts. A better representation also reduces the number of parameters; with 10,000-dimensional one-hot vectors per word, the weight matrix of that first layer would end up enormous. A recurrent neural network, which we'll start to describe next, doesn't have either of these disadvantages.
So what is a recurrent neural network? Let's build one up. If you are reading the sentence from left to right, the first word you read is, say, $x^{<1>}$. What we're going to do is take the first word and feed it into a neural network layer. I'm going to draw it like this: that's a hidden layer of the first neural network, and we can have the neural network try to predict the output, so is this part of a person's name or not? What a recurrent neural network does is, when it then goes on to read the second word in the sentence, say $x^{<2>}$, instead of just predicting $\hat{y}^{<2>}$ using only $x^{<2>}$, it also gets to input some information from what it computed at time-step one. In particular, the activation value from time-step one is passed on to time-step two. Then, at the next time-step, the recurrent neural network inputs the third word $x^{<3>}$ and tries to output some prediction $\hat{y}^{<3>}$, and so on, up until the last time-step, where it inputs $x^{<T_x>}$ and outputs $\hat{y}^{<T_y>}$. In this example $T_x = T_y$; the architecture will change a bit if they're not identical. So at each time-step, the recurrent neural network passes its activation on to the next time-step for it to use. To kick off the whole thing, we also use a made-up activation at time zero, $a^{<0>}$, which is usually the vector of zeros.
The recurrent neural network scans through the data from left to right, and the parameters it uses for each time-step are shared. So there will be a set of parameters, which we'll describe in greater detail on the next slide, but the parameters governing the connection from $x^{<1>}$ to the hidden layer will be some set of parameters we're going to write as $W_{ax}$, and it's the same parameters $W_{ax}$ that are used at every time-step. The activations, the horizontal connections, will be governed by some set of parameters $W_{aa}$, and it's the same $W_{aa}$ used at every time-step, and similarly there is some $W_{ya}$ that governs the output predictions. I'll describe in the next slide exactly how these parameters work. So in this recurrent neural network, when making the prediction for $\hat{y}^{<3>}$, it gets information not only from $x^{<3>}$ but also from $x^{<1>}$ and $x^{<2>}$, because the information from $x^{<1>}$ can pass through the activations to help the prediction of $\hat{y}^{<3>}$. Now, one weakness of this RNN is that it only uses information that is earlier in the sequence to make a prediction; in particular, when predicting $\hat{y}^{<3>}$, it doesn't use information about the words $x^{<4>}$, $x^{<5>}$, $x^{<6>}$ and so on. This is a problem because if you're given a sentence, "He said, 'Teddy Roosevelt was a great president,'" then in order to decide whether the word Teddy is part of a person's name, it would be really useful to know not just information from the first two words but information from the later words in the sentence as well, because you could also have the sentence, "He said, 'Teddy bears are on sale!'" Given just the first three words, it's not possible to know for sure whether the word Teddy is part of a person's name: in the first example it is, in the second example it is not, but you can't tell the difference if you look only at the first three words. So one limitation of this particular neural network structure is that the prediction at a certain time uses information from the inputs earlier in the sequence but not information later in the sequence. We will address this in a later video, where we talk about bidirectional recurrent neural networks, or BRNNs. But for now, this simpler unidirectional neural network architecture will suffice for us to explain the key concepts, and we just have to make a quick modification to these ideas later to enable, say, the prediction of $\hat{y}^{<3>}$ to use both information earlier in the sequence and information later in the sequence; we'll get to that in a later video.
So let's now write out explicitly the calculations this neural network does. Here's a cleaned-up version of the picture of the neural network. As I mentioned previously, you typically start off with the input $a^{<0>}$ equal to the vector of all zeros. Next, here is what forward propagation looks like. To compute $a^{<1>}$, you apply an activation function g: $a^{<1>} = g(W_{aa}a^{<0>} + W_{ax}x^{<1>} + b_a)$, where $b_a$ is a bias. And then to compute $\hat{y}^{<1>}$, the prediction at time-step one, you apply some activation function, maybe a different one than the one above: $\hat{y}^{<1>} = g(W_{ya}a^{<1>} + b_y)$. The notation convention I'm going to use for the subscripts of these matrices is, for example in $W_{ax}$, the second index means that this matrix is going to be multiplied by some x-like quantity, and the first index means it's used to compute some a-like quantity. Similarly, notice that $W_{ya}$ is multiplied by some a-like quantity to compute a y-type quantity. The activation function used to compute the activations will often be a tanh, which is a common choice for an RNN, and sometimes ReLUs are also used, although tanh is a pretty common choice; we have other ways of preventing the vanishing gradient problem, which we'll talk about later this week. Depending on what your output y is, if it is a binary classification problem then you would use a sigmoid activation function, or it could be a softmax if you have a k-way classification problem; the choice of activation function here depends on what type of output y you have. So for the named entity recognition task, where y is either zero or one, this second g could be a sigmoid activation function. You could write $g_2$ if you want to emphasize that these could be different activation functions, but I usually won't do that. More generally, at time t, $a^{<t>} = g(W_{aa}a^{<t-1>} + W_{ax}x^{<t>} + b_a)$ and $\hat{y}^{<t>} = g(W_{ya}a^{<t>} + b_y)$. These equations define forward propagation in this neural network: you start off with $a^{<0>}$ as the vector of zeros, then using $a^{<0>}$ and $x^{<1>}$ you compute $a^{<1>}$ and $\hat{y}^{<1>}$, then you take $x^{<2>}$ and use $x^{<2>}$ and $a^{<1>}$ to compute $a^{<2>}$ and $\hat{y}^{<2>}$, and so on, carrying out forward propagation from the left to the right of this picture. Now, in order to help us develop more complex neural networks, I'm going to take this notation and simplify it a little bit. So let me copy these two equations to the next slide.
Here they are, and to simplify the notation a bit, I'm going to take the first equation and write it in a slightly simpler way: $a^{<t>} = g(W_a[a^{<t-1>}, x^{<t>}] + b_a)$, where the notation $[a^{<t-1>}, x^{<t>}]$ means stacking those two vectors together into one column vector, and $W_a$ is the matrix $[W_{aa}\ W_{ax}]$ placed side by side. For example, if a were 100-dimensional and x were 10,000-dimensional, then $W_{aa}$ would be 100 by 100, $W_{ax}$ would be 100 by 10,000, and stacking them horizontally gives a 100 by 10,100 matrix $W_a$. You can check that $W_a$ times $[a^{<t-1>}, x^{<t>}]$ works out to exactly $W_{aa}a^{<t-1>} + W_{ax}x^{<t>}$, so this is just a compressed way of writing the same thing. In the same spirit, the output equation becomes $\hat{y}^{<t>} = g(W_y a^{<t>} + b_y)$, where the subscript now just has one letter denoting what type of quantity is being computed. We'll use this simplified notation when we develop more complex models like GRUs and LSTMs later this week.
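Here is a minimal numpy sketch of one forward step of a basic RNN cell using this simplified notation; the function name, the softmax output activation, and the dimensions below are illustrative choices, not something fixed by the lecture.

```python
import numpy as np

def rnn_cell_forward(x_t, a_prev, Wa, ba, Wy, by):
    """One time-step of a basic RNN:
    a^<t>     = tanh(Wa [a^<t-1>, x^<t>] + ba)
    y_hat^<t> = softmax(Wy a^<t> + by)
    """
    concat = np.concatenate([a_prev, x_t], axis=0)   # stack a^<t-1> on top of x^<t>
    a_t = np.tanh(Wa @ concat + ba)
    z = Wy @ a_t + by
    y_hat_t = np.exp(z - z.max()) / np.exp(z - z.max()).sum()  # softmax over the vocabulary
    return a_t, y_hat_t

# Illustrative dimensions: 100-dim activation, 10,000-dim one-hot input and output.
n_a, n_x, n_y = 100, 10000, 10000
Wa = np.random.randn(n_a, n_a + n_x) * 0.01   # [W_aa | W_ax] stacked side by side
ba = np.zeros((n_a, 1))
Wy = np.random.randn(n_y, n_a) * 0.01
by = np.zeros((n_y, 1))
```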
So, that's it. You now know what a basic recurrent neural network is. Next, let's talk about backpropagation and how you learn with these RNNs.
04_backpropagation-through-time
You’ve already learned about the basic structure of an RNN. In this video, you’ll see how backpropagation in a recurrent neural network works. As usual, when you implement this in one of the programming frameworks, often, the programming framework will automatically take care of backpropagation. But I think it’s still useful to have a rough sense of how backprop works in RNNs. Let’s take a look.
You've seen how, for forward prop, you compute these activations from left to right in the neural network, and so you output all of the predictions. In backprop, as you might already have guessed, you end up carrying out backpropagation calculations in basically the opposite direction of the forward prop arrows.
So, let's go through the forward propagation calculation. You're given this input sequence $x^{<1>}, x^{<2>}, x^{<3>}$, up to $x^{<T_x>}$. Using $x^{<1>}$ and $a^{<0>}$ you compute the activation $a^{<1>}$; then using $x^{<2>}$ together with $a^{<1>}$ you compute $a^{<2>}$, and so on up to $a^{<T_x>}$, with all of these activations depending on the shared parameters $W_a$ and $b_a$. From each $a^{<t>}$ you then compute a prediction $\hat{y}^{<t>}$ using the parameters $W_y$ and $b_y$. Next, to compute backpropagation, you need a loss function. For the named entity recognition task, the element-wise loss at a single time-step is the standard cross-entropy loss, $\mathcal{L}^{<t>}(\hat{y}^{<t>}, y^{<t>}) = -y^{<t>}\log\hat{y}^{<t>} - (1 - y^{<t>})\log(1 - \hat{y}^{<t>})$, and the overall loss of the entire sequence is the sum over all time-steps, $\mathcal{L} = \sum_{t=1}^{T_y}\mathcal{L}^{<t>}(\hat{y}^{<t>}, y^{<t>})$. Backpropagation then passes messages in the opposite direction of the forward propagation arrows, which allows you to compute all the derivatives you need to update the parameters with gradient descent. The most significant message passing in this computation runs from right to left through the activations, decreasing the time index as it goes, which is why this algorithm has the somewhat grand name of backpropagation through time.
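As a small sketch of the overall cost, here is the summed per-step cross-entropy loss for the named entity task; it assumes scalar per-step predictions (one probability per word) and is illustrative rather than the exact code the course uses.

```python
import numpy as np

def sequence_loss(y_hats, ys):
    """Total cross-entropy loss summed over time-steps.
    y_hats: list of predicted probabilities, one per time-step
    ys:     list of 0/1 labels (is this word part of a name?), one per time-step
    """
    total = 0.0
    for y_hat, y in zip(y_hats, ys):
        # element-wise loss L^<t> = -y log(y_hat) - (1 - y) log(1 - y_hat)
        total += -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
    return total
```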
So, I hope that gives you a sense of how forward prop and backprop in an RNN work. So far, you've only seen this main motivating example, in which the length of the input sequence was equal to the length of the output sequence. In the next video, I want to show you a much wider range of RNN architectures, which will let you tackle a much wider set of applications. Let's go on to the next video.
05_different-types-of-rnns
So far, you’ve seen an RNN architecture where the number of inputs, Tx, is equal to the number of outputs, Ty. It turns out that for other applications, Tx and Ty may not always be the same, and in this video, you’ll see a much richer family of RNN architectures.
You might remember this slide from the first video of this week, where the input x and the output y can be many different types, and it's not always the case that $T_x$ has to be equal to $T_y$. In particular, in the music generation example, $T_x$ can be length one or even an empty set. In an example like movie sentiment classification, the output y could be just an integer from 1 to 5, whereas the input is a sequence. In named entity recognition, in the example we're using, the input length and the output length are identical, but there are also some problems where the input length and the output length can be different: they're both sequences but with different lengths, such as machine translation, where a French sentence and an English sentence can take different numbers of words to say the same thing. So it turns out that we can modify the basic RNN architecture to address all of these problems. The presentation in this video was inspired by a blog post by Andrej Karpathy, titled The Unreasonable Effectiveness of Recurrent Neural Networks.
Let's go through some examples. The example you've seen so far uses $T_x$ equal to $T_y$: we had an input sequence $x^{<1>}, x^{<2>}$ up to $x^{<T_x>}$, and a recurrent neural network that inputs $x^{<1>}$ and computes $\hat{y}^{<1>}, \hat{y}^{<2>}$, and so on up to $\hat{y}^{<T_y>}$. In earlier diagrams I was drawing a bunch of circles to denote neurons, but I'm just going to omit those little circles for most of this video, to keep the notation simpler. So this is what you might call a many-to-many architecture, because the input sequence has many inputs and the output sequence also has many outputs. Now, let's look at a different example. Let's say you want to address sentiment classification. Here, x might be a piece of text, such as a movie review that says, "There is nothing to like in this movie." So x is going to be a sequence, and y might be a number from 1 to 5, or maybe 0 or 1 (is this a positive review or a negative review?), or a number from 1 to 5 (do you think this is a one-star, two-star, three, four, or five-star review?). In this case, we can simplify the neural network architecture as follows. We input $x^{<1>}, x^{<2>}$, so the words one at a time; if the input text was "There is nothing to like in this movie," then "There is nothing to like in this movie" would be the input. And then, rather than having an output at every single time-step, we can just have the RNN read in the entire sentence and output y at the last time-step, once it has already input the entire sentence. So this neural network would be a many-to-one architecture, because it has many inputs, it inputs many words, and then it just outputs one number. For the sake of completeness, there is also a one-to-one architecture. This one is maybe less interesting: it's just a small standard neural network, where we have some input x and some output y, and this would be the type of neural network that we covered in the first two courses in this sequence. Now, in addition to many-to-one, you can also have a one-to-many architecture. An example of a one-to-many neural network architecture would be music generation, and in fact you get to implement this yourself in one of the programming exercises for this course, where your goal is to have a neural network output a set of notes corresponding to a piece of music. The input x could be maybe just an integer telling it what genre of music you want, or what the first note of the music is; and if you don't want to input anything, x could be a null input, or it could always be the vector of zeros. For that, the neural network architecture would be: input x, have your RNN output the first value, then, with no further inputs, output the second value, then go on to output the third value, and so on, until you synthesize the last note of the musical piece. If you want, you can have an input $a^{<0>}$ as well. One technical note, which you'll see in a later video, is that when you're actually generating sequences, you often take the synthesized output at one step and feed it to the next time-step as an input as well, so the network architecture actually ends up looking like that. So, we've talked about many-to-many, many-to-one, one-to-many, as well as one-to-one. It turns out there's one more interesting example of many-to-many which is worth describing, which is when the input and the output lengths are different.
So, in the many-to-many example you saw just now, the input length and the output length have to be exactly the same. For an application like machine translation, the number of words in the input sentence, say a French sentence, and the number of words in the output sentence, say the translation into English, could be different. So here's an alternative neural network architecture, where you might have the neural network first read in the sentence, so first read the input, say the French sentence that you want to translate to English, and having done that, you then have the neural network output the translation, all of the $\hat{y}$'s up to $\hat{y}^{<T_y>}$. With this architecture, $T_x$ and $T_y$ can be different lengths, and again, you can draw in the $a^{<0>}$ if you want. This neural network architecture has two distinct parts: there's the encoder, which takes as input, say, a French sentence, and then there's the decoder, which, having read in the sentence, outputs the translation into a different language. So this would be an example of a many-to-many architecture. By the end of this week, you'll have a good understanding of all the components needed to build these types of architectures. And then, technically, there's one other architecture which we'll talk about only in week four, which is attention-based architectures, and which maybe isn't cleanly captured by one of the diagrams we've drawn so far.
So, to summarize the wide range of RNN architectures: there is one-to-one, although if it's one-to-one, this is just a standard generic neural network, and you don't need an RNN for it. There is one-to-many; this was the music generation or sequence generation example. Then there's many-to-one; that would be an example of sentiment classification, where you might want to read as input all the text of a movie review and then try to figure out whether they liked the movie or not. There is many-to-many, so the named entity recognition example we've been using was this, where $T_x$ is equal to $T_y$. And then, finally, there's the other version of many-to-many, where for applications like machine translation, $T_x$ and $T_y$ no longer have to be the same. So now you know most of the building blocks for building pretty much all of these neural networks, except that there are some subtleties with sequence generation, which is what we'll discuss in the next video.
So, I hope you saw from this video that using the basic building blocks of an RNN, there's already a wide range of models that you might be able to put together. But as I mentioned, there are some subtleties to sequence generation, which you'll also get to implement yourself in this week's programming exercise, where you implement a language model and hopefully generate some fun sequences or some fun pieces of text. So, what I want to do in the next video is go deeper into sequence generation. Let's see the details in the next video.
06_language-model-and-sequence-generation
Language modeling is one of the most basic and important tasks in natural language processing. It's also one that RNNs do very well. In this video, you learn how to build a language model using an RNN, and this will lead up to a fun programming exercise at the end of this week, where you build a language model and use it to generate Shakespeare-like text, or other types of text. Let's get started.
So what is a language model? Let's say you're building a speech recognition system and you hear the sentence, "The apple and pear salad was delicious." So what did you just hear me say? Did I say "the apple and pair salad," or did I say "the apple and pear salad"? You probably think the second sentence is much more likely, and in fact that's what a good speech recognition system would pick, even though these two sentences sound exactly the same. The way a speech recognition system picks the second sentence is by using a language model, which tells it the probability of either of these two sentences. For example, a language model might say that the chance of the first sentence is $3.2 \times 10^{-13}$, and the chance of the second sentence is, say, $5.7 \times 10^{-10}$. With these probabilities, the second sentence is more likely by over a factor of $10^3$ compared to the first sentence, and that's why a speech recognition system will pick the second choice. So what a language model does is, given any sentence, its job is to tell you the probability of that particular sentence. By probability of a sentence I mean: if you were to pick up a random newspaper, open a random email, pick a random webpage, or listen to the next thing someone, a friend of yours, says, what is the chance that the next sentence you encounter somewhere out there in the world will be a particular sentence like "the apple and pear salad"? This is a fundamental component both for speech recognition systems, as you've just seen, and for machine translation systems, where a translation system wants to output only sentences that are likely. So the basic job of a language model is to input a sentence, which I'm going to write as a sequence $y^{<1>}$, $y^{<2>}$ up to $y^{<T_y>}$ (for language models it's useful to represent the sentence as outputs y rather than inputs x), and to estimate the probability of that particular sequence of words, $P(y^{<1>}, y^{<2>}, \ldots, y^{<T_y>})$.
So how do you build a language model? To build such a model using an RNN, you first need a training set comprising a large corpus of English text, or text from whatever language you want to build a language model of. The word corpus is NLP terminology that just means a large body, a very large set, of English sentences. Let's say you get a sentence in your training set as follows: "Cats average 15 hours of sleep a day." The first thing you would do is tokenize this sentence. That means you form a vocabulary, as we saw in an earlier video, and then map each of these words to, say, one-hot vectors, or to indices in your vocabulary. One thing you might also want to do is model when sentences end, so another common thing is to add an extra token called EOS, which stands for End Of Sentence, to help you figure out when a sentence ends. We'll talk more about this later, but the EOS token can be appended to the end of every sentence in your training set if you want your model to explicitly capture when sentences end. We won't use the end-of-sentence token for the programming exercise at the end of this week, but for some applications you might want to use it, and we'll see later where it comes in handy. So in this example, we have $y^{<1>}, y^{<2>}, y^{<3>}$, 4, 5, 6, 7, 8, 9: nine targets in this example if you append the end-of-sentence token. In the tokenization step, you can also decide whether or not the period should be a token; in this example I'm just ignoring punctuation, so I'm using "day" as the last token and omitting the period. If you want to treat the period or other punctuation as an explicit token, then you can add the period to your vocabulary as well. Now, one other detail: what if some of the words in your training set are not in your vocabulary? If your vocabulary has 10,000 words, maybe the 10,000 most common words in English, then the term Mau, as in the Egyptian Mau, which is a breed of cat, might not be among your top 10,000 tokens. In that case you take the word Mau and replace it with a unique token called <UNK>, which stands for unknown word, and you just model the chance of the unknown word instead of that specific word. Having carried out the tokenization step, which basically means taking the input sentence and mapping it to the individual tokens or individual words in your vocabulary, let's next build an RNN to model the probability of these different sequences. One of the things we'll see on the next slide is that you end up setting the inputs $x^{<t>} = y^{<t-1>}$, so that when predicting word t the network is fed the correct words up to t-1.
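Here is a small sketch of this tokenization step, with <EOS> appended and <UNK> substituted for out-of-vocabulary words; the tiny vocabulary and its indices are only illustrative.

```python
# Illustrative vocabulary indices; a real vocabulary would have ~10,000 entries.
vocab = {"cats": 0, "average": 1, "15": 2, "hours": 3, "of": 4,
         "sleep": 5, "a": 6, "day": 7, "<EOS>": 8, "<UNK>": 9}

def tokenize(sentence, vocab, append_eos=True):
    """Map each word to its vocabulary index, using <UNK> for out-of-vocabulary words."""
    tokens = [vocab.get(w.lower(), vocab["<UNK>"]) for w in sentence.split()]
    if append_eos:
        tokens.append(vocab["<EOS>"])
    return tokens

tokens = tokenize("Cats average 15 hours of sleep a day", vocab)
# -> [0, 1, 2, 3, 4, 5, 6, 7, 8]  (nine targets y^<1> ... y^<9>, ending in <EOS>)
```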
So let's go on to build the RNN model, and I'm going to continue to use this sentence as the running example. Here's what the RNN architecture looks like. At the first time-step you compute some activation $a^{<1>}$ as a function of some input $x^{<1>}$, and $x^{<1>}$ will just be set to the vector of all zeros; the previous activation $a^{<0>}$ is, by convention, also set to the vector of zeros. What $a^{<1>}$ does is make a softmax prediction to try to figure out the probability of the first word, $y^{<1>}$. So what this step does is, through a softmax, predict the probability of any word in the dictionary being the first word: what's the chance that the first word is a? What's the chance that the first word is Aaron? What's the chance that the first word is cats? All the way to, what's the chance that the first word is Zulu, or an unknown word, or the end-of-sentence token? So $\hat{y}^{<1>}$ is the output of a softmax; it just predicts the chance of the first word being whatever it ends up being, and in our example it winds up being the word cats. This would be a 10,000-way softmax output if you have a 10,000-word vocabulary, or 10,002, I guess, if you count the unknown word and the end-of-sentence token as two additional tokens. Then the RNN steps forward to the next step, passing the activation $a^{<1>}$ along. At this step its job is to try to figure out what the second word is, but now we also give it the correct first word: we tell it that, in reality, the first word was actually cats, that's $y^{<1>}$, and this is why $x^{<2>} = y^{<1>}$. At the second step the output is again predicted by a softmax, and the RNN's job is to predict the chance of the second word being any particular word, a or Aaron or cats or Zulu or the unknown word or EOS or whatever, given what came previously. In this case the right answer is average, since the sentence starts with "cats average." Then you go on to the next step of the RNN, where you compute $a^{<3>}$. To predict the third word, which is 15, we can now give it the first two words, so we tell it that "cats average" are the first two words; the next input is $x^{<3>} = y^{<2>}$, so the word average is input, and its job is to figure out the next word in the sequence, in other words the probability of any word in the dictionary given that what just came before was "cats average." In this case the right answer is 15, and so on, until at the end, at time-step 9, you feed it $x^{<9>} = y^{<8>}$, which is the word day. This produces $a^{<9>}$, whose job is to output $\hat{y}^{<9>}$, which happens to be the EOS token: what's the chance of the next token given everything that came before, and hopefully it predicts a high chance of the EOS end-of-sentence token. So each step in the RNN looks at some set of preceding words, such as, given the first three words, what is the distribution over the next word, and the RNN learns to predict one word at a time, going from left to right. Next, to train this network, we define the cost function. At a certain time t, if the true word was $y^{<t>}$ and the network's softmax predicted some $\hat{y}^{<t>}$, then the loss at that time-step is the softmax loss $\mathcal{L}^{<t>}(\hat{y}^{<t>}, y^{<t>}) = -\sum_i y_i^{<t>}\log\hat{y}_i^{<t>}$, and the overall loss is just the sum of the losses over all time-steps, $\mathcal{L} = \sum_t \mathcal{L}^{<t>}(\hat{y}^{<t>}, y^{<t>})$. If you train this RNN on a large training set, then it can do things like: given an initial set of words, predict the chance of the next word; and, given a new sentence, say $y^{<1>}, y^{<2>}, y^{<3>}$ with just three words for simplicity, compute the probability of the whole sentence as a product of conditionals, $P(y^{<1>}, y^{<2>}, y^{<3>}) = P(y^{<1>})\,P(y^{<2>} \mid y^{<1>})\,P(y^{<3>} \mid y^{<1>}, y^{<2>})$, where each conditional probability is read off one of the softmax outputs.
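A small sketch of these two quantities, showing that the training loss is just the negative log of the probability the model assigns to the sentence; it assumes each softmax output is a 1-D numpy array of word probabilities, and the function names are illustrative.

```python
import numpy as np

def language_model_loss(y_hat_seq, y_seq):
    """Sum of per-step softmax cross-entropy losses:
    L = sum_t ( -log y_hat^<t>[correct word] )
    y_hat_seq: list of 1-D softmax output vectors, one per time-step
    y_seq:     list of correct word indices, one per time-step"""
    return sum(-np.log(y_hat[idx]) for y_hat, idx in zip(y_hat_seq, y_seq))

def sentence_log_prob(y_hat_seq, y_seq):
    """log P(y^<1>, ..., y^<Ty>) = sum_t log P(y^<t> | y^<1>, ..., y^<t-1>),
    where each conditional is read off the softmax at time-step t."""
    return sum(np.log(y_hat[idx]) for y_hat, idx in zip(y_hat_seq, y_seq))
```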
So that's the basic structure of how you can train a language model using an RNN. If some of these ideas still seem a little bit abstract, don't worry about it; you'll get to practice all of these ideas in the programming exercise. But next, it turns out one of the most fun things you can do with a language model is to sample sequences from the model. Let's take a look at that in the next video.
07_sampling-novel-sequences
After you train a sequence model, one of the ways you can informally get a sense of what it has learned is to have it sample novel sequences. Let's take a look at how you could do that.
So remember that a sequence model models the chance of any particular sequence of words as follows, and what we'd like to do is sample from this distribution to generate novel sequences of words. The network was trained using the structure shown at the top, but to sample, you do something slightly different. What you want to do is first sample the first word you want your model to generate. For that, you input the usual $x^{<1>} = 0$, $a^{<0>} = 0$, and now your first time-step will output a softmax probability over possible outputs. So what you do is randomly sample according to this softmax distribution. The softmax distribution tells you: what is the chance that the first word is a, what is the chance that it's Aaron, what's the chance it's Zulu, what is the chance that the first word is the unknown word token, and maybe there's some chance it's the end-of-sentence token.
And then you take this vector and use, for example, the numpy command np.random.choice to sample according to the distribution defined by this vector of probabilities, and that lets you sample the first word. Next you go on to the second time-step. Remember that during training the second time-step expects the correct first word $y^{<1>}$ as input, but what you do now is take the $\hat{y}^{<1>}$ that you just sampled and pass that in as the input to the next time-step. So whatever word you just chose at the first time-step gets passed as the input at the second position, and then the softmax will make a prediction for $\hat{y}^{<2>}$. For example, let's say that after you sample the first word, it happened to be "The," which is a very common choice of first word. Then you pass in "The" as $x^{<2>}$, which is now equal to $\hat{y}^{<1>}$, and now you're trying to figure out the chance of the second word given that the first word is "The"; this is going to be $\hat{y}^{<2>}$. Then you again use this type of sampling to sample $\hat{y}^{<2>}$. At the next time-step, you take whatever choice you made, represent it, say, as a one-hot encoding, pass it to the next time-step, sample the third word given whatever you chose, and you keep going until you get to the last time-step. So how do you know when the sequence ends? Well, one thing you could do is, if the end-of-sentence token is part of your vocabulary, keep sampling until you generate an EOS token; that tells you you've hit the end of a sentence and you can stop. Alternatively, if you do not include it in your vocabulary, you can just decide to sample 20 words or 100 words or something, and keep going until you've reached that number of time-steps. This procedure will sometimes generate an unknown word token. If you want to make sure that your algorithm never generates that token, one thing you can do is reject any sample that came out as the unknown word token and keep resampling from the rest of the vocabulary until you get a word that's not an unknown word; or you can just leave it in the output if you don't mind having an unknown word in the output. So this is how you generate a randomly chosen sentence from your RNN language model.
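Here is a minimal sketch of this sampling loop using np.random.choice; the helper rnn_step(x, a), assumed to return the softmax distribution and the next activation, is a hypothetical stand-in for a trained RNN cell.

```python
import numpy as np

def sample_sequence(rnn_step, vocab_size, eos_index, a0, x0, max_len=100):
    """Sample a sequence from a trained RNN language model.
    rnn_step(x, a) is assumed to return (y_hat, a_next), where y_hat is the
    softmax distribution over the vocabulary at this time-step."""
    a, x = a0, x0                       # start with zero vectors
    sampled = []
    for _ in range(max_len):
        y_hat, a = rnn_step(x, a)
        # draw the next word index according to the softmax probabilities
        idx = np.random.choice(vocab_size, p=y_hat.ravel())
        if idx == eos_index:            # stop once the <EOS> token is sampled
            break
        sampled.append(idx)
        x = np.zeros_like(x)            # feed the sampled word in as the next input
        x[idx] = 1.0
    return sampled
```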
Now, so far we've been building a word-level RNN, by which I mean the vocabulary consists of words from English. Depending on your application, one thing you can do is also build a character-level RNN. In that case your vocabulary would just be the characters a through z, as well as maybe space and punctuation if you wish, and the digits 0 to 9. If you want to distinguish uppercase from lowercase, you can include the uppercase letters as well, and one thing you can do is just look at your training set, look at the characters that appear there, and use those to define the vocabulary. If you build a character-level language model rather than a word-level language model, then your sequence $y^{<1>}, y^{<2>}, y^{<3>}$ would be the individual characters in your training data rather than the individual words. So for our previous example, the sentence "Cats average 15 hours of sleep a day," c would be $y^{<1>}$, a would be $y^{<2>}$, t would be $y^{<3>}$, the space would be $y^{<4>}$, and so on. Using a character-level language model has some pros and cons. One advantage is that you don't ever have to worry about unknown word tokens; in particular, a character-level language model is able to assign a sequence like Mau a non-zero probability, whereas if Mau was not in your vocabulary for the word-level language model, you would just have to assign it the unknown word token. But the main disadvantage of the character-level language model is that you end up with much longer sequences: many English sentences will have 10 to 20 words but may have many dozens of characters. So character-level language models are not as good as word-level language models at capturing long-range dependencies between how the earlier parts of the sentence affect the later parts, and character-level models are also just more computationally expensive to train. The trend I've been seeing in natural language processing is that, for the most part, word-level language models are still used, but as computers get faster there are more and more applications where people are, at least in some special cases, starting to look at more character-level models. They tend to be much more computationally expensive to train, so they are not in widespread use today, except maybe for specialized applications where you might need to deal with unknown or out-of-vocabulary words a lot, or where you have a more specialized vocabulary. So with these methods, what you can now do is build an RNN to look at a corpus of English text, build a word-level or character-level language model, and sample from the language model that you've trained.
So here are some examples of text that were generated from a language model, actually from a character-level language model, and you get to implement something like this yourself in the programming exercise. If the model was trained on news articles, then it generates text like that shown on the left, and this looks vaguely like news text: not quite grammatical, but maybe it sounds a little bit like things that could appear in the news, "Concussion epidemic to be examined." And if it was trained on Shakespearean text, then it generates stuff that sounds like Shakespeare could have written it: "The mortal moon hath her eclipse in love. And subject of this thou art another this fold. When besser be my love to me see sabl's. For whose are ruse of mine eyes heaves."
So that's it for the basic RNN and how you can build a language model using it, as well as sample from the language model that you've trained. In the next few videos, I want to discuss some of the challenges of training RNNs, as well as how to address some of these challenges, specifically vanishing gradients, by building even more powerful variants of the RNN. So in the next video, let's talk about the problem of vanishing gradients, and we'll go on to talk about the GRU, the Gated Recurrent Unit, as well as the LSTM model.
08_vanishing-gradients-with-rnns
You've learned about how RNNs work and how they can be applied to problems like named entity recognition, as well as to language modeling, and you saw how backpropagation can be used to train an RNN. It turns out that one of the problems with the basic RNN algorithm is that it runs into vanishing gradient problems. Let's discuss that, and then in the next few videos, we'll talk about some solutions that will help to address this problem.
So, you've seen pictures of RNNs that look like this. Let's take a language modeling example. Say you see this sentence: "The cat, which already ate a bunch of food that was delicious, dot dot dot, was full." To be consistent, because cat is singular, it should be "the cat ... was full," whereas "The cats, which already ate a bunch of food that was delicious, and apples, and pears, and so on, ... were full." So to be consistent, it should be "cat was" or "cats were." This is one example of when language can have very long-term dependencies, where a word from much earlier can affect what needs to come much later in the sentence. But it turns out the basic RNN we've seen so far is not very good at capturing very long-term dependencies. To explain why, you might remember from our earlier discussions of training very deep neural networks that we talked about the vanishing gradients problem. If you have a very, very deep neural network, say 100 layers or even much deeper, you carry out forward prop from left to right and then backprop, and we said that the gradient from the output y has a very hard time propagating back to affect the weights of the earlier layers, to affect the computations in the earlier layers. An RNN has a similar problem: forward prop goes from left to right and backprop goes from right to left, and it can be quite difficult, because of the same vanishing gradients problem, for the errors associated with the later time-steps to affect the computations that happen earlier. In practice, what this means is that it might be difficult to get a neural network to realize that it needs to memorize whether it just saw a singular noun or a plural noun, so that later on in the sequence it can generate either "was" or "were," depending on whether it was singular or plural. And notice that in English, the stuff in the middle could be arbitrarily long, right? So you might need to memorize the singular/plural for a very long time before you get to use that bit of information. Because of this problem, the basic RNN model has mainly local influences, meaning that an output like $\hat{y}^{<3>}$ is mainly influenced by values close to it, and a value late in the sequence is mainly influenced by inputs that are somewhere close; it's difficult for the output to be strongly influenced by an input that was very early in the sequence, because whatever the output is, whether it got it right or got it wrong, it's just very difficult for the error to backpropagate all the way to the beginning of the sequence and therefore to modify how the neural network does its computations earlier in the sequence. So this is a weakness of the basic RNN algorithm, one which we'll address in the next few videos; if we don't address it, then RNNs tend not to be very good at capturing long-range dependencies. And even though this discussion has focused on vanishing gradients, you'll remember that when we talked about very deep neural networks, we also talked about exploding gradients: when doing backprop, the gradients might not just decrease exponentially, they may also increase exponentially with the number of layers you go through.
It turns out that vanishing gradients tend to be the bigger problem when training RNNs, although when exploding gradients happen, they can be catastrophic, because the exponentially large gradients can cause your parameters to become so large that your neural network parameters get really messed up. It turns out that exploding gradients are easier to spot, because the parameters just blow up and you might often see NaNs, or not-a-numbers, meaning results of numerical overflow in your neural network computations. If you do see exploding gradients, one solution is to apply gradient clipping. All that means is: look at your gradient vectors, and if they are bigger than some threshold, rescale some of your gradient vectors so that they're not too big, so they're clipped according to some maximum value. So if your derivatives do explode or you see NaNs, just apply gradient clipping, and that's a relatively robust solution that will take care of exploding gradients. But vanishing gradients are much harder to solve, and they will be the subject of the next few videos.
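Here is a minimal sketch of gradient clipping by rescaling to a maximum norm; the threshold value is a tunable choice, and element-wise clipping of each entry to a fixed range is another common variant.

```python
import numpy as np

def clip_by_norm(grads, max_norm=5.0):
    """Rescale the whole set of gradient arrays if their combined norm exceeds max_norm."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total_norm > max_norm:
        scale = max_norm / total_norm
        grads = [g * scale for g in grads]   # every gradient shrunk by the same factor
    return grads
```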
So to summarize: in an earlier course, you saw how, in training a very deep neural network, you can run into vanishing gradient or exploding gradient problems, where the derivative either decreases exponentially or grows exponentially as a function of the number of layers. In an RNN, say an RNN processing data over 1,000 time-steps or 10,000 time-steps, that's basically a 1,000-layer or 10,000-layer neural network, and so it too runs into these types of problems. Exploding gradients you can sort of address by just using gradient clipping, but vanishing gradients take more work to address. So what we'll do in the next video is talk about the GRU, the Gated Recurrent Unit, which is a very effective solution for addressing the vanishing gradient problem and will allow your neural network to capture much longer range dependencies. So, let's go on to the next video.
09_gated-recurrent-unit-gru
You've seen how a basic RNN works. In this video, you learn about the Gated Recurrent Unit, which is a modification to the RNN hidden layer that makes it much better at capturing long-range connections and helps a lot with the vanishing gradient problem. Let's take a look.
You've already seen the formula for computing the activations at time t of an RNN: it's the activation function applied to $W_a$ times the activations at the previous time-step and the current input, plus $b_a$, that is, $a^{<t>} = g(W_a[a^{<t-1>}, x^{<t>}] + b_a)$. I'm going to draw this as a picture: the RNN unit drawn as a box which inputs $a^{<t-1>}$, the activation from the last time-step, and also inputs $x^{<t>}$; these two combine, and after multiplying by the weights and applying the tanh activation, the box outputs the new activation $a^{<t>}$, which can also be passed to a softmax unit to produce the prediction $\hat{y}^{<t>}$. I'm showing this picture of the RNN unit because we're going to use a similar picture to explain the GRU unit.
Many of the ideas of the GRU are due to two papers, by Cho et al. and by Chung et al. (Junyoung Chung, Caglar Gulcehre, Kyunghyun Cho, and Yoshua Bengio). I'm sometimes going to refer to the sentence we saw in the last video to motivate this: given a sentence like that, you might need to remember that the cat was singular, to make sure later on you use "was" rather than "were" ("the cat was full" versus "the cats were full"). So as we read this sentence from left to right, the GRU unit is going to have a new variable called c, which stands for cell, as in memory cell. What the memory cell does is provide a bit of memory to remember, for example, whether cat was singular or plural, so that when the network gets much further into the sentence it can still take into consideration whether the subject of the sentence was singular or plural. At time t the memory cell will have some value $c^{<t>}$, and what we'll see is that the GRU unit actually outputs an activation value $a^{<t>}$ that's equal to $c^{<t>}$. For now I want to use the different symbols c and a to denote the memory cell value and the output activation value, even though they are the same; I'm using this notation because when we talk about LSTMs a little bit later, these will be two different values, but for the GRU, $c^{<t>}$ is equal to the output activation $a^{<t>}$. So here are the equations that govern the computations of a GRU unit. At every time-step, we consider overwriting the memory cell with a candidate value $\tilde{c}^{<t>}$, a candidate for replacing $c^{<t>}$, computed with a tanh activation: $\tilde{c}^{<t>} = \tanh(W_c[c^{<t-1>}, x^{<t>}] + b_c)$, where $W_c$ and $b_c$ are parameters and the input to the tanh is the previous value of the memory cell (which is also the previous activation value) together with the current input $x^{<t>}$. Then the key idea of the GRU is a gate, which I'm going to call $\Gamma_u$, where u stands for update: $\Gamma_u = \sigma(W_u[c^{<t-1>}, x^{<t>}] + b_u)$. Because it's computed with a sigmoid, $\Gamma_u$ is always between zero and one, and for intuition you can think of it as being either pretty close to zero or pretty close to one. The gate decides whether or not we actually update the memory cell: in the cat sentence, the memory cell might be set when we see the word "cat" (remembering that the subject is singular), and the gate keeps that value unchanged all the way until we reach "was," where it gets used, after which the memory is free to be updated again. The actual update equation is $c^{<t>} = \Gamma_u * \tilde{c}^{<t>} + (1 - \Gamma_u) * c^{<t-1>}$: if the gate is one, take the new candidate value; if the gate is zero, just hang on to the old value.
Now I just want to talk over some more details of how you implement this. In the equations I've written, $c^{<t>}$ can be a vector; if you have, say, a 100-dimensional hidden activation value, then $c^{<t>}$, $\tilde{c}^{<t>}$, and $\Gamma_u$ are all vectors of the same dimension, and the multiplications in the update equation are element-wise. So the gate is really a vector of values, mostly close to zero or close to one, telling you, for each element of the 100-dimensional memory cell, which bits to update at each time-step: you can keep some bits constant while updating other bits. For example, maybe one bit remembers whether the cat is singular or plural, while some other bits keep track of the fact that you're talking about food. Because the gate can be so close to zero, the memory cell value $c^{<t>}$ is maintained almost exactly across many, many time-steps, and that's what helps the GRU avoid the vanishing gradient problem and allows the network to learn even very long-range dependencies.
Let me now describe the full GRU unit. To do that, let me copy the three main equations, the candidate, the update gate, and the update rule, to the next slide. For the full GRU unit, I'm going to make one change, which is to add one more gate to the first equation, the one calculating the candidate new value for the memory cell. This other gate is $\Gamma_r$; you can think of r as standing for relevance. This gate $\Gamma_r$ tells you how relevant $c^{<t-1>}$ is to computing the next candidate $\tilde{c}^{<t>}$. So the full GRU equations are: $\tilde{c}^{<t>} = \tanh(W_c[\Gamma_r * c^{<t-1>}, x^{<t>}] + b_c)$, $\Gamma_u = \sigma(W_u[c^{<t-1>}, x^{<t>}] + b_u)$, $\Gamma_r = \sigma(W_r[c^{<t-1>}, x^{<t>}] + b_r)$, $c^{<t>} = \Gamma_u * \tilde{c}^{<t>} + (1 - \Gamma_u) * c^{<t-1>}$, and $a^{<t>} = c^{<t>}$. As you can imagine, there are many possible variations of this design; researchers have experimented over many years with slightly different versions, and this particular one has proven robust and useful across many different problems. In academic papers you'll also sometimes see an alternative notation where the gates are written r and z and the hidden state is written h instead of c, but I wanted to use a notation that stays consistent between the GRU and the LSTM.
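As a concrete illustration, here is a minimal numpy sketch of one forward step of the full GRU following the equations above; the parameter names (Wc, bc, Wu, bu, Wr, br) and the dict layout are just illustrative choices.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_cell_forward(x_t, c_prev, params):
    """One time-step of a full GRU.
    params is a dict with keys Wr, br, Wu, bu, Wc, bc (illustrative names)."""
    concat = np.concatenate([c_prev, x_t], axis=0)
    gamma_r = sigmoid(params["Wr"] @ concat + params["br"])     # relevance gate
    gamma_u = sigmoid(params["Wu"] @ concat + params["bu"])     # update gate
    gated = np.concatenate([gamma_r * c_prev, x_t], axis=0)
    c_tilde = np.tanh(params["Wc"] @ gated + params["bc"])      # candidate cell value
    c_t = gamma_u * c_tilde + (1.0 - gamma_u) * c_prev          # element-wise blend
    a_t = c_t                                                   # for the GRU, a^<t> = c^<t>
    return a_t, c_t
```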
So that's it for the GRU, the Gated Recurrent Unit. This is one of the ideas that has enabled RNNs to become much better at capturing very long-range dependencies and has made RNNs much more effective. Next, as I briefly mentioned, the other most commonly used variation of this class of ideas is something called the LSTM unit, the Long Short Term Memory unit. Let's take a look at that in the next video.
10_long-short-term-memory-lstm
In the last video, you learned about the GRU, the gated recurrent units, and how that can allow you to learn very long range connections in a sequence. The other type of unit that allows you to do this very well is the LSTM or the long short term memory units, and this is even more powerful than the GRU. Let’s take a look.
Here are the equations from the previous video for the GRU. For the GRU, we had $a^{<t>} = c^{<t>}$, together with two gates, the update gate $\Gamma_u$ and the relevance gate $\Gamma_r$, where $\tilde{c}^{<t>}$ is the candidate new value for the memory cell. The LSTM is an even slightly more powerful and more general version, and it's due to Sepp Hochreiter and Jürgen Schmidhuber's seminal paper, which had a huge impact on sequence modeling. In the LSTM, we no longer have $a^{<t>} = c^{<t>}$; the equations we'll use are: $\tilde{c}^{<t>} = \tanh(W_c[a^{<t-1>}, x^{<t>}] + b_c)$ (notice that it uses $a^{<t-1>}$ rather than $c^{<t-1>}$, and we're not using the relevance gate here, although a variation of the LSTM could include it); an update gate $\Gamma_u = \sigma(W_u[a^{<t-1>}, x^{<t>}] + b_u)$; a forget gate $\Gamma_f = \sigma(W_f[a^{<t-1>}, x^{<t>}] + b_f)$; and an output gate $\Gamma_o = \sigma(W_o[a^{<t-1>}, x^{<t>}] + b_o)$. The memory cell update is then $c^{<t>} = \Gamma_u * \tilde{c}^{<t>} + \Gamma_f * c^{<t-1>}$, and the output activation is $a^{<t>} = \Gamma_o * \tanh(c^{<t>})$. So the new feature of the LSTM compared to the GRU is that instead of having one update gate control both terms, it has two separate gates, the update gate and the forget gate, which gives the memory cell the option of keeping the old value $c^{<t-1>}$ and just adding some new value to it.
So, here again are the equations governing the behavior of the LSTM. Once again, it's traditional to explain these things using pictures, so let me draw one here. And if these pictures are too complicated, don't worry about it; I personally find the equations easier to understand than the picture, but I'll show the picture here for the intuitions it conveys. The picture here was very much inspired by a blog post by Chris Olah, titled Understanding LSTM Networks, and the diagram drawn here is quite similar to one that he drew in his blog post. The key things to take away from this picture are that you use $a^{<t-1>}$ and $x^{<t>}$ to compute all of the gate values and the candidate $\tilde{c}^{<t>}$, and these then combine element-wise to determine the new memory cell value $c^{<t>}$ and the output activation $a^{<t>}$. One nice thing you can see is that, if you hook up several of these LSTM units in temporal sequence, there's a line running along the top through which, as long as the forget and update gates are set appropriately, the value $c^{<0>}$ can be passed all the way to the right and preserved for many time-steps. That's why the LSTM, as well as the GRU, is very good at memorizing certain values for a long time. One common variation you may see is a peephole connection, where the gate values depend not just on $a^{<t-1>}$ and $x^{<t>}$ but also on the previous memory cell value $c^{<t-1>}$; with a peephole connection, the i-th element of $c^{<t-1>}$ only affects the i-th element of the corresponding gate, so the relation is one-to-one.
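For completeness, here is a minimal numpy sketch of one forward step of an LSTM following the equations above; the parameter names (Wc, bc, Wu, bu, Wf, bf, Wo, bo) and the dict layout are illustrative, and this sketch omits the peephole variation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_cell_forward(x_t, a_prev, c_prev, p):
    """One time-step of an LSTM.
    p is a dict with keys Wc, bc, Wu, bu, Wf, bf, Wo, bo (illustrative names)."""
    concat = np.concatenate([a_prev, x_t], axis=0)
    c_tilde = np.tanh(p["Wc"] @ concat + p["bc"])       # candidate cell value
    gamma_u = sigmoid(p["Wu"] @ concat + p["bu"])       # update gate
    gamma_f = sigmoid(p["Wf"] @ concat + p["bf"])       # forget gate
    gamma_o = sigmoid(p["Wo"] @ concat + p["bo"])       # output gate
    c_t = gamma_u * c_tilde + gamma_f * c_prev          # new memory cell
    a_t = gamma_o * np.tanh(c_t)                        # output activation
    return a_t, c_t
```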
So, that's it for the LSTM. When should you use a GRU, and when should you use an LSTM? There isn't widespread consensus on this. Even though I presented GRUs first, in the history of deep learning LSTMs actually came much earlier, and GRUs are a relatively recent invention that was perhaps derived as a simplification of the more complicated LSTM model. Researchers have tried both of these models on many different problems, and on different problems different algorithms will win out, so there isn't a universally superior algorithm, which is why I wanted to show you both of them. But I feel like when I'm using these, the advantage of the GRU is that it's a simpler model, so it's actually easier to build a much bigger network; it only has two gates, so computationally it runs a bit faster and scales to building somewhat bigger models. But the LSTM is more powerful and more effective, since it has three gates instead of two. If you want to pick one to use, I think the LSTM has been the historically more proven choice, so if you had to pick one, most people today would still use the LSTM as the default first thing to try. Although, in the last few years, GRUs have been gaining a lot of momentum, and I feel like more and more teams are also using GRUs because they're a bit simpler but often work just as well, and it might be easier to scale them to even bigger problems. So, that's it for LSTMs. With either GRUs or LSTMs, you'll be able to build neural networks that can capture much longer-range dependencies.
11_bidirectional-rnn
By now, you've seen most of the key building blocks of RNNs. But there are just two more ideas that let you build much more powerful models. One is bidirectional RNNs, which let you, at a point in time, take information from both earlier and later in the sequence; we'll talk about that in this video. The second is deep RNNs, which you'll see in the next video. So let's start with bidirectional RNNs.
So, to motivate bidirectional RNNs, let's look at this network, which you've seen a few times before in the context of named entity recognition. One of the problems with this network is that, to figure out whether the third word Teddy is part of a person's name, it's not enough to just look at the first part of the sentence. To tell whether $\hat{y}^{<3>}$ should be zero or one, you need more information than just the first three words, because the first three words don't tell you whether they're talking about Teddy bears or about the former US president, Teddy Roosevelt. So this is a unidirectional, or forward-directional-only, RNN. And this comment I just made is true whether these cells are standard RNN blocks, GRU units, or LSTM blocks; all of these blocks are in a forward-only direction. What a bidirectional RNN, or BRNN, does is fix this issue. A bidirectional RNN works as follows.
I'm going to use a simplified four-input, or maybe four-word, sentence, so we have four inputs, $x^{<1>}$ through $x^{<4>}$. This network's hidden layer will have a forward recurrent component, so I'm going to call these $\overrightarrow{a}^{<1>}$, $\overrightarrow{a}^{<2>}$, $\overrightarrow{a}^{<3>}$ and $\overrightarrow{a}^{<4>}$, with a right arrow drawn over them to denote the forward recurrent component, and they'll be connected as before. Each of these four recurrent units inputs the current x and then feeds in to help predict $\hat{y}^{<1>}$, $\hat{y}^{<2>}$, $\hat{y}^{<3>}$, and $\hat{y}^{<4>}$. So far I haven't done anything; basically, we've redrawn the RNN from the previous slide, but with the arrows placed in slightly funny positions. I drew the arrows in these slightly funny positions because what we're going to do is add a backward recurrent layer. So we'd have $\overleftarrow{a}^{<1>}$, with a left arrow to denote a backward connection, and then $\overleftarrow{a}^{<2>}$, $\overleftarrow{a}^{<3>}$ and $\overleftarrow{a}^{<4>}$, and these backward units will be connected to each other, going backward in time. Notice that this network defines an acyclic graph. Given an input sequence $x^{<1>}$ through $x^{<4>}$, the forward sequence will first compute $\overrightarrow{a}^{<1>}$, then use that to compute $\overrightarrow{a}^{<2>}$, then $\overrightarrow{a}^{<3>}$, then $\overrightarrow{a}^{<4>}$, whereas the backward sequence starts by computing $\overleftarrow{a}^{<4>}$ and then goes back to compute $\overleftarrow{a}^{<3>}$. Note that you're computing network activations here, so this is not backprop, this is forward prop, but part of the forward prop computation goes from left to right and part goes from right to left in this diagram. Having computed $\overleftarrow{a}^{<3>}$, you can then use those activations to compute $\overleftarrow{a}^{<2>}$ and then $\overleftarrow{a}^{<1>}$, and finally, having computed all of the activations, you can make your predictions. For example, your network computes $\hat{y}^{<t>} = g(W_y[\overrightarrow{a}^{<t>}, \overleftarrow{a}^{<t>}] + b_y)$, with both the forward activation and the backward activation at time t being fed in to make the prediction at time t. So if you look at the prediction at time-step three, for example, information from $x^{<1>}$ can flow through $\overrightarrow{a}^{<1>}$ to $\overrightarrow{a}^{<2>}$ to $\overrightarrow{a}^{<3>}$ to $\hat{y}^{<3>}$, so information from $x^{<1>}$, $x^{<2>}$, $x^{<3>}$ is all taken into account, while information from $x^{<4>}$ can flow through $\overleftarrow{a}^{<4>}$ to $\overleftarrow{a}^{<3>}$ to $\hat{y}^{<3>}$. So this allows the prediction at time three to take as input information from the past, information from the present, which goes into both the forward and the backward components at this step, and information from the future. In particular, given a phrase like "He said, 'Teddy Roosevelt...,'" to predict whether Teddy is part of a person's name, you take into account information from both the past and the future. So this is the bidirectional recurrent neural network, and these blocks can be not just standard RNN blocks but also GRU blocks or LSTM blocks. In fact, for a lot of text, for a lot of natural language processing problems, a bidirectional RNN with LSTM blocks is commonly used.
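Here's a rough numpy sketch of BRNN inference; the helper names forward_step and backward_step are hypothetical stand-ins for whatever recurrent cell (basic RNN, GRU, or LSTM) you use, each assumed to return the next activation, and the softmax output is one possible choice of g.

```python
import numpy as np

def bidirectional_forward(xs, forward_step, backward_step, Wy, by, a0_fwd, a0_bwd):
    """Run the forward pass left-to-right and the backward pass right-to-left,
    then predict from both activations at each time-step."""
    T = len(xs)
    a_fwd, a_bwd = [None] * T, [None] * T
    a = a0_fwd
    for t in range(T):                          # left to right
        a = forward_step(xs[t], a)
        a_fwd[t] = a
    a = a0_bwd
    for t in reversed(range(T)):                # right to left
        a = backward_step(xs[t], a)
        a_bwd[t] = a
    # y_hat^<t> = g(Wy [a_fwd^<t>, a_bwd^<t>] + by); here g is a softmax
    y_hats = []
    for t in range(T):
        z = Wy @ np.concatenate([a_fwd[t], a_bwd[t]], axis=0) + by
        y_hats.append(np.exp(z - z.max()) / np.exp(z - z.max()).sum())
    return y_hats
```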
So if you have an NLP problem where you have the complete sentence and you're trying to label things in the sentence, a bidirectional RNN with LSTM blocks, both forward and backward, would be a pretty reasonable first thing to try. So, that's it for the bidirectional RNN. This is a modification you can make to the basic RNN architecture, or to the GRU or the LSTM, and by making this change you get a model that uses an RNN, GRU or LSTM and is able to make predictions anywhere, even in the middle of a sequence, by taking into account information potentially from the entire sequence. The disadvantage of the bidirectional RNN is that you need the entire sequence of data before you can make predictions anywhere. So, for example, if you're building a speech recognition system, then the BRNN lets you take into account the entire speech utterance, but with this straightforward implementation you need to wait for the person to stop talking, to get the entire utterance, before you can process it and make a speech recognition prediction. So for real-time speech recognition applications, there are somewhat more complex modules, rather than just the standard bidirectional RNN you've seen here. But for a lot of natural language processing applications, where you can get the entire sentence all at the same time, the standard BRNN algorithm is actually very effective.
So, that's it for BRNNs. In the next and final video for this week, let's talk about how to take all of these ideas, RNNs, LSTMs, GRUs and the bidirectional versions, and construct deep versions of them.
12_deep-rnns
The different versions of RNNs you've seen so far will already work quite well by themselves. But for learning very complex functions, it is sometimes useful to stack multiple layers of RNNs together to build even deeper versions of these models. In this video, you'll see how to build these deeper RNNs. Let's take a look.
So you remember that for a standard neural network, you have an input x that is fed into a hidden layer with activations, say, $a^{[1]}$ for the first hidden layer, then into the next layer with activations $a^{[2]}$, then maybe another layer with activations $a^{[3]}$, and then you make a prediction $\hat{y}$. A deep RNN is a bit like this, taking the network I just drew by hand and unrolling it in time. So here's the standard RNN you've seen so far, but I've changed the notation a little bit: instead of writing $a^{<0>}$ for the activation at time zero, I've added a square bracket [1] to denote that this is for layer one. So the notation we're going to use is $a^{[l]<t>}$ to denote the activation of layer l at time t. A deep RNN with, say, three hidden layers stacked on top of each other computes an activation such as $a^{[2]<3>}$ from two inputs, the activation of the same layer at the previous time-step and the activation of the layer below at the current time-step: $a^{[2]<3>} = g(W_a^{[2]}[a^{[2]<2>}, a^{[1]<3>}] + b_a^{[2]})$, where each layer l has its own parameters $W_a^{[l]}$ and $b_a^{[l]}$ that are shared across time within that layer.
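Here is a small sketch of one time-step of a stacked RNN following the layer equation above; the function name, the (Wa, ba) parameter layout, and the tanh cell are illustrative, and in practice each layer could just as well be a GRU or LSTM cell.

```python
import numpy as np

def deep_rnn_step(x_t, a_prev_per_layer, params_per_layer):
    """One time-step of a stacked (deep) RNN.
    Layer l takes the same-layer activation from the previous time-step and the
    activation of the layer below at the current time-step:
    a^[l]<t> = tanh(Wa^[l] [a^[l]<t-1>, a^[l-1]<t>] + ba^[l]).
    a_prev_per_layer: list of a^[l]<t-1>, one per layer
    params_per_layer: list of (Wa, ba) tuples, one per layer"""
    input_below = x_t                       # layer 1 sees the input x^<t>
    a_new = []
    for a_prev, (Wa, ba) in zip(a_prev_per_layer, params_per_layer):
        concat = np.concatenate([a_prev, input_below], axis=0)
        a_l = np.tanh(Wa @ concat + ba)
        a_new.append(a_l)
        input_below = a_l                   # becomes the input to the layer above
    return a_new
```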
One thing you do see sometimes is recurrent layers that are stacked on top of each other, and then the output at each time-step feeds into a bunch of deep layers that are not connected horizontally, a deep network that finally predicts $\hat{y}^{<1>}$; you can have the same deep network predicting $\hat{y}^{<2>}$, and so on for $\hat{y}^{<3>}$ and $\hat{y}^{<4>}$. So this is a type of network architecture that we're seeing a little bit more of: a few recurrent layers connected in time, followed by a deep network on top that does not have the horizontal connections. And quite often, these blocks don't just have to be standard RNN blocks, the simple RNN model; they can also be GRU blocks or LSTM blocks. Finally, you can also build deep versions of the bidirectional RNN. Because deep RNNs are quite computationally expensive to train, and there's often a large temporal extent already, you just don't see as many deep recurrent layers; this example has, I guess, three deep recurrent layers connected in time, and you don't see as many deep recurrent layers as you would see layers in a deep conventional neural network.
So that’s it for deep RNNs. With what you’ve seen this week, ranging from the basic RNN, the basic recurrent unit, to the GRU, to the LSTM, to the bidirectional RNN, to the deep versions of this that you just saw, you now have a very rich toolbox for constructing very powerful models for learning sequence models. I hope you enjoyed this week’s videos. Best of luck with the problem exercises and I look forward to seeing you next week.