Note
This is my personal lecture note from the first week of the Sequence Models course; the copyright belongs to deeplearning.ai.
01_why-sequence-models
Welcome to this fifth course on deep learning. In this course, you learn about sequence models, one of the most exciting areas in deep learning. Models like recurrent neural networks or RNNs have transformed speech recognition, natural language processing and other areas. And in this course, you learn how to build these models for yourself. Let’s start by looking at a few examples of where sequence models can be useful.
In speech recognition you are given an input audio clip X and asked to map it to a text transcript Y. Both the input and the output here are sequence data, because X is an audio clip that plays out over time and Y, the output, is a sequence of words. So sequence models such as recurrent neural networks and other variations you'll learn about in a little bit have been very useful for speech recognition. Music generation is another example of a problem with sequence data. In this case, only the output Y is a sequence; the input can be the empty set, or it can be a single integer, maybe referring to the genre of music you want to generate, or maybe the first few notes of the piece of music you want. So here X can be nothing or maybe just an integer, and the output Y is a sequence. In sentiment classification the input X is a sequence: given an input phrase like, "There is nothing to like in this movie," how many stars do you think this review will be? Sequence models are also very useful for DNA sequence analysis. Your DNA is represented via the four alphabets A, C, G, and T, and so given a DNA sequence, can you label which part of this DNA sequence, say, corresponds to a protein? In machine translation you are given an input sentence, "Voulez-vous chanter avec moi?", and you're asked to output the translation in a different language. In video activity recognition you might be given a sequence of video frames and asked to recognize the activity. And in named entity recognition you might be given a sentence and asked to identify the people in that sentence. So all of these problems can be addressed as supervised learning with labeled data X, Y as the training set. But, as you can tell from this list of examples, there are a lot of different types of sequence problems. In some, both the input X and the output Y are sequences, and in that case X and Y can have different lengths (as in speech recognition), or X and Y can have the same length (as in the DNA and named entity recognition examples). And in some of these examples only X or only Y is a sequence. So in this course you learn about sequence models that are applicable to all of these different settings.
So I hope this gives you a sense of the exciting set of problems that sequence models might be able to help you address. With that, let's go on to the next video, where we start to define the notation we'll use to build these sequence models.
02_notation
In the last video, you saw some of the wide range of applications to which you can apply sequence models. Let's start by defining a notation that we'll use to build up these sequence models.
As a motivating example, let's say you want to build a sequence model to input a sentence like this: Harry Potter and Hermione Granger invented a new spell. These are characters, by the way, from the Harry Potter sequence of novels by J. K. Rowling. And let's say you want a sequence model to automatically tell you where the people's names are in this sentence. This is a problem called named entity recognition, and it is used by search engines, for example, to index all of, say, the last 24 hours' news by all the people mentioned in the news articles, so that they can index them appropriately. Named entity recognition systems can be used to find people's names, company names, times, locations, country names, currency names, and so on in different types of text. Now, given this input x, let's say you want a model to output y that has one output per input word, and the target output y tells you, for each of the input words, whether that word is part of a person's name. Technically this maybe isn't the best output representation; there are some more sophisticated output representations that tell you not just whether a word is part of a person's name, but where people's names start and end in the sentence, so you know that Harry Potter starts here and ends here, and Hermione Granger starts here and ends here. But for this motivating example, I'm just going to stick with this simpler output representation. Now, the input is a sequence of nine words, so eventually we're going to have nine sets of features to represent these nine words. To index into the positions in the sequence, I'm going to use x with superscript angle brackets 1, 2, 3 and so on up to x angle brackets nine, and more generally $x^{<t>}$ to index into the t-th position, with t suggesting that these are temporal sequences. Similarly, I'll use $y^{<t>}$ for the outputs, and $T_x$ and $T_y$ to denote the lengths of the input and output sequences; in this example $T_x = T_y = 9$.
So, to represent a word in the sentence, the first thing you do is come up with a vocabulary, sometimes also called a dictionary, and that means making a list of the words that you will use in your representations. So the first word in the vocabulary is a; that will be the first word in the dictionary. The second word is Aaron, and then a little bit further down is the word and, and then eventually you get to the word Harry, then eventually the word Potter, and then all the way down, maybe the last word in the dictionary is Zulu. So a will be word one, Aaron is word two, and in my dictionary the word and appears at positional index 367, Harry appears in position 4075, Potter in position 6830, and Zulu, the last word in the dictionary, is maybe word 10,000. So in this example I'm going to use a dictionary of size 10,000 words. This is quite small by modern NLP standards. For reasonably sized commercial applications, dictionary sizes of 30,000 to 50,000 are more common, and 100,000 is not uncommon, and some of the large Internet companies will use dictionary sizes that are maybe a million words or even bigger than that. But you see a lot of commercial applications use dictionary sizes of maybe 30,000 or maybe 50,000 words. I'm going to use 10,000 for illustration since it's a nice round number. Having chosen a dictionary of 10,000 words, one way to build this dictionary would be to look through your training set and find the top 10,000 occurring words, or to look through some of the online dictionaries that tell you what the most common 10,000 words in the English language are. What you can then do is use one-hot representations to represent each of these words. For example, $x^{<1>}$, which represents the word Harry, would be a vector with all zeros except for a 1 in position 4075, because that was the position of Harry in the dictionary. Then $x^{<2>}$ will similarly be a vector of all zeros except for a 1 in position 6830 and zeros everywhere else. The word and is at position 367, so $x^{<3>}$ would be a vector with a 1 in position 367 and zeros everywhere else. Each of these would be a 10,000-dimensional vector if your vocabulary has 10,000 words. And because a is the first word in the dictionary, $x^{<7>}$, which corresponds to the word a, would be the vector with a 1 in the first element and zeros everywhere else. So in this representation, $x^{<t>}$ for each position t in the sentence is a one-hot vector, one-hot because it has exactly one 1 and everything else is zero, and the goal is, given this representation for the input x, to learn a mapping to the target output y as a supervised learning problem. One more detail: if you encounter a word that is not in your vocabulary, you create a new token, a fake word called Unknown Word, written <UNK>, to represent it.
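To make this concrete, here is a minimal numpy sketch of the one-hot representation; the tiny vocabulary and indices here are just illustrative (a real dictionary would have on the order of 10,000 entries built from the training corpus).

```python
import numpy as np

# Illustrative vocabulary; a real one would hold ~10,000 words from the training corpus.
vocab = ["a", "aaron", "and", "harry", "potter", "zulu"]
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word, vocab_size):
    """Return a one-hot column vector: all zeros except a 1 at the word's index."""
    vec = np.zeros((vocab_size, 1))
    vec[word_to_index[word]] = 1.0
    return vec

x1 = one_hot("harry", len(vocab))   # x^<1> for the sentence "Harry Potter and ..."
x2 = one_hot("potter", len(vocab))  # x^<2>
```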
So, to summarize, in this video we described a notation for describing your training set, for both x and y, for sequence data. In the next video let's start to describe Recurrent Neural Networks for learning the mapping from x to y.
03_recurrent-neural-network-model
In the last video, you saw the notation we use to define sequence learning problems. Now, let's talk about how you can build a model, build a neural network, to learn the mapping from X to Y.
Now, one thing you could do is try to use a standard neural network for this task. So in our previous example, we had nine input words. You could imagine trying to take these nine input words, maybe the nine one-hot vectors, and feeding them into a standard neural network, maybe a few hidden layers, and then eventually have it output the nine values zero or one that tell you whether each word is part of a person's name. But this turns out not to work well, and there are really two main problems with this. The first is that the inputs and outputs can be different lengths in different examples: it's not as if every single example has the same input length $T_x$ or the same output length $T_y$, and even if you padded or zero-padded every input up to some maximum length, that still wouldn't seem like a good representation. The second, and maybe more serious, problem is that a naive architecture like this doesn't share features learned across different positions of text; in particular, if the network has learned that Harry appearing in position one is a sign of a person's name, it would be nice if it automatically figured out that Harry appearing in some other position is likely part of a person's name too, much as convolutional networks share what's learned in one part of an image with other parts. A better representation also reduces the number of parameters; with 10,000-dimensional one-hot vectors per word, the weight matrix of that first layer would end up enormous. A recurrent neural network, which we'll start to describe next, doesn't have either of these disadvantages.
So what is a recurrent neural network? Let's build one up. If you are reading the sentence from left to right, the first word you read is, say, $x^{<1>}$. What we're going to do is take the first word and feed it into a neural network layer. I'm going to draw it like this: that's a hidden layer of the first neural network, and we can have the neural network try to predict the output, so is this part of a person's name or not? What a recurrent neural network does is, when it then goes on to read the second word in the sentence, say $x^{<2>}$, instead of just predicting $\hat{y}^{<2>}$ using only $x^{<2>}$, it also gets to input some information from what it computed at time-step one. In particular, the activation value from time-step one is passed on to time-step two. Then, at the next time-step, the recurrent neural network inputs the third word $x^{<3>}$ and tries to output some prediction $\hat{y}^{<3>}$, and so on, up until the last time-step, where it inputs $x^{<T_x>}$ and outputs $\hat{y}^{<T_y>}$. In this example $T_x = T_y$; the architecture will change a bit if they're not identical. So at each time-step, the recurrent neural network passes its activation on to the next time-step for it to use. To kick off the whole thing, we also use a made-up activation at time zero, $a^{<0>}$, which is usually the vector of zeros.
The recurrent neural network scans through the data from left to right, and the parameters it uses for each time-step are shared. So there will be a set of parameters, which we'll describe in greater detail on the next slide, but the parameters governing the connection from $x^{<1>}$ to the hidden layer will be some set of parameters we're going to write as $W_{ax}$, and it's the same parameters $W_{ax}$ that are used at every time-step. The activations, the horizontal connections, will be governed by some set of parameters $W_{aa}$, and it's the same $W_{aa}$ used at every time-step, and similarly there is some $W_{ya}$ that governs the output predictions. I'll describe in the next slide exactly how these parameters work. So in this recurrent neural network, when making the prediction for $\hat{y}^{<3>}$, it gets information not only from $x^{<3>}$ but also from $x^{<1>}$ and $x^{<2>}$, because the information from $x^{<1>}$ can pass through the activations to help the prediction of $\hat{y}^{<3>}$. Now, one weakness of this RNN is that it only uses information that is earlier in the sequence to make a prediction; in particular, when predicting $\hat{y}^{<3>}$, it doesn't use information about the words $x^{<4>}$, $x^{<5>}$, $x^{<6>}$ and so on. This is a problem because if you're given a sentence, "He said, 'Teddy Roosevelt was a great president,'" then in order to decide whether the word Teddy is part of a person's name, it would be really useful to know not just information from the first two words but information from the later words in the sentence as well, because you could also have the sentence, "He said, 'Teddy bears are on sale!'" Given just the first three words, it's not possible to know for sure whether the word Teddy is part of a person's name: in the first example it is, in the second example it is not, but you can't tell the difference if you look only at the first three words. So one limitation of this particular neural network structure is that the prediction at a certain time uses information from the inputs earlier in the sequence but not information later in the sequence. We will address this in a later video, where we talk about bidirectional recurrent neural networks, or BRNNs. But for now, this simpler unidirectional neural network architecture will suffice for us to explain the key concepts, and we just have to make a quick modification to these ideas later to enable, say, the prediction of $\hat{y}^{<3>}$ to use both information earlier in the sequence and information later in the sequence; we'll get to that in a later video.
So let's now write out explicitly the calculations this neural network does. Here's a cleaned-up version of the picture of the neural network. As I mentioned previously, you typically start off with the input $a^{<0>}$ equal to the vector of all zeros. Next, here is what forward propagation looks like. To compute $a^{<1>}$, you apply an activation function g: $a^{<1>} = g(W_{aa}a^{<0>} + W_{ax}x^{<1>} + b_a)$, where $b_a$ is a bias. And then to compute $\hat{y}^{<1>}$, the prediction at time-step one, you apply some activation function, maybe a different one than the one above: $\hat{y}^{<1>} = g(W_{ya}a^{<1>} + b_y)$. The notation convention I'm going to use for the subscripts of these matrices is, for example in $W_{ax}$, the second index means that this matrix is going to be multiplied by some x-like quantity, and the first index means it's used to compute some a-like quantity. Similarly, notice that $W_{ya}$ is multiplied by some a-like quantity to compute a y-type quantity. The activation function used to compute the activations will often be a tanh, which is a common choice for an RNN, and sometimes ReLUs are also used, although tanh is a pretty common choice; we have other ways of preventing the vanishing gradient problem, which we'll talk about later this week. Depending on what your output y is, if it is a binary classification problem then you would use a sigmoid activation function, or it could be a softmax if you have a k-way classification problem; the choice of activation function here depends on what type of output y you have. So for the named entity recognition task, where y is either zero or one, this second g could be a sigmoid activation function. You could write $g_2$ if you want to emphasize that these could be different activation functions, but I usually won't do that. More generally, at time t, $a^{<t>} = g(W_{aa}a^{<t-1>} + W_{ax}x^{<t>} + b_a)$ and $\hat{y}^{<t>} = g(W_{ya}a^{<t>} + b_y)$. These equations define forward propagation in this neural network: you start off with $a^{<0>}$ as the vector of zeros, then using $a^{<0>}$ and $x^{<1>}$ you compute $a^{<1>}$ and $\hat{y}^{<1>}$, then you take $x^{<2>}$ and use $x^{<2>}$ and $a^{<1>}$ to compute $a^{<2>}$ and $\hat{y}^{<2>}$, and so on, carrying out forward propagation from the left to the right of this picture. Now, in order to help us develop more complex neural networks, I'm going to take this notation and simplify it a little bit. So let me copy these two equations to the next slide.
Here they are, and to simplify the notation a bit, I'm going to take the first equation and write it in a slightly simpler way: $a^{<t>} = g(W_a[a^{<t-1>}, x^{<t>}] + b_a)$, where the notation $[a^{<t-1>}, x^{<t>}]$ means stacking those two vectors together into one column vector, and $W_a$ is the matrix $[W_{aa}\ W_{ax}]$ placed side by side. For example, if a were 100-dimensional and x were 10,000-dimensional, then $W_{aa}$ would be 100 by 100, $W_{ax}$ would be 100 by 10,000, and stacking them horizontally gives a 100 by 10,100 matrix $W_a$. You can check that $W_a$ times $[a^{<t-1>}, x^{<t>}]$ works out to exactly $W_{aa}a^{<t-1>} + W_{ax}x^{<t>}$, so this is just a compressed way of writing the same thing. In the same spirit, the output equation becomes $\hat{y}^{<t>} = g(W_y a^{<t>} + b_y)$, where the subscript now just has one letter denoting what type of quantity is being computed. We'll use this simplified notation when we develop more complex models like GRUs and LSTMs later this week.
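Here is a minimal numpy sketch of one forward step of a basic RNN cell using this simplified notation; the function name, the softmax output activation, and the dimensions below are illustrative choices, not something fixed by the lecture.

```python
import numpy as np

def rnn_cell_forward(x_t, a_prev, Wa, ba, Wy, by):
    """One time-step of a basic RNN:
    a^<t>     = tanh(Wa [a^<t-1>, x^<t>] + ba)
    y_hat^<t> = softmax(Wy a^<t> + by)
    """
    concat = np.concatenate([a_prev, x_t], axis=0)   # stack a^<t-1> on top of x^<t>
    a_t = np.tanh(Wa @ concat + ba)
    z = Wy @ a_t + by
    y_hat_t = np.exp(z - z.max()) / np.exp(z - z.max()).sum()  # softmax over the vocabulary
    return a_t, y_hat_t

# Illustrative dimensions: 100-dim activation, 10,000-dim one-hot input and output.
n_a, n_x, n_y = 100, 10000, 10000
Wa = np.random.randn(n_a, n_a + n_x) * 0.01   # [W_aa | W_ax] stacked side by side
ba = np.zeros((n_a, 1))
Wy = np.random.randn(n_y, n_a) * 0.01
by = np.zeros((n_y, 1))
```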
So, that's it. You now know what a basic recurrent neural network is. Next, let's talk about backpropagation and how you learn with these RNNs.
04_backpropagation-through-time
You’ve already learned about the basic structure of an RNN. In this video, you’ll see how backpropagation in a recurrent neural network works. As usual, when you implement this in one of the programming frameworks, often, the programming framework will automatically take care of backpropagation. But I think it’s still useful to have a rough sense of how backprop works in RNNs. Let’s take a look.
You've seen how, for forward prop, you compute these activations from left to right in the neural network, and so you output all of the predictions. In backprop, as you might already have guessed, you end up carrying out backpropagation calculations in basically the opposite direction of the forward prop arrows.
So, let's go through the forward propagation calculation. You're given this input sequence $x^{<1>}, x^{<2>}, x^{<3>}$, up to $x^{<T_x>}$. Using $x^{<1>}$ and $a^{<0>}$ you compute the activation $a^{<1>}$; then using $x^{<2>}$ together with $a^{<1>}$ you compute $a^{<2>}$, and so on up to $a^{<T_x>}$, with all of these activations depending on the shared parameters $W_a$ and $b_a$. From each $a^{<t>}$ you then compute a prediction $\hat{y}^{<t>}$ using the parameters $W_y$ and $b_y$. Next, to compute backpropagation, you need a loss function. For the named entity recognition task, the element-wise loss at a single time-step is the standard cross-entropy loss, $\mathcal{L}^{<t>}(\hat{y}^{<t>}, y^{<t>}) = -y^{<t>}\log\hat{y}^{<t>} - (1 - y^{<t>})\log(1 - \hat{y}^{<t>})$, and the overall loss of the entire sequence is the sum over all time-steps, $\mathcal{L} = \sum_{t=1}^{T_y}\mathcal{L}^{<t>}(\hat{y}^{<t>}, y^{<t>})$. Backpropagation then passes messages in the opposite direction of the forward propagation arrows, which allows you to compute all the derivatives you need to update the parameters with gradient descent. The most significant message passing in this computation runs from right to left through the activations, decreasing the time index as it goes, which is why this algorithm has the somewhat grand name of backpropagation through time.
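As a small sketch of the overall cost, here is the summed per-step cross-entropy loss for the named entity task; it assumes scalar per-step predictions (one probability per word) and is illustrative rather than the exact code the course uses.

```python
import numpy as np

def sequence_loss(y_hats, ys):
    """Total cross-entropy loss summed over time-steps.
    y_hats: list of predicted probabilities, one per time-step
    ys:     list of 0/1 labels (is this word part of a name?), one per time-step
    """
    total = 0.0
    for y_hat, y in zip(y_hats, ys):
        # element-wise loss L^<t> = -y log(y_hat) - (1 - y) log(1 - y_hat)
        total += -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
    return total
```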
So, I hope that gives you a sense of how forward prop and backprop in an RNN work. So far, you've only seen this main motivating example, in which the length of the input sequence was equal to the length of the output sequence. In the next video, I want to show you a much wider range of RNN architectures, which will let you tackle a much wider set of applications. Let's go on to the next video.
05_different-types-of-rnns
So far, you’ve seen an RNN architecture where the number of inputs, Tx, is equal to the number of outputs, Ty. It turns out that for other applications, Tx and Ty may not always be the same, and in this video, you’ll see a much richer family of RNN architectures.
You might remember this slide from the first video of this week, where the input x and the output y can be many different types, and it's not always the case that $T_x$ has to be equal to $T_y$. In particular, in the music generation example, $T_x$ can be length one or even an empty set. In an example like movie sentiment classification, the output y could be just an integer from 1 to 5, whereas the input is a sequence. In named entity recognition, in the example we're using, the input length and the output length are identical, but there are also some problems where the input length and the output length can be different: they're both sequences but with different lengths, such as machine translation, where a French sentence and an English sentence can take different numbers of words to say the same thing. So it turns out that we can modify the basic RNN architecture to address all of these problems. The presentation in this video was inspired by a blog post by Andrej Karpathy, titled The Unreasonable Effectiveness of Recurrent Neural Networks.
Let's go through some examples. The example you've seen so far uses $T_x$ equal to $T_y$: we had an input sequence $x^{<1>}, x^{<2>}$ up to $x^{<T_x>}$, and a recurrent neural network that inputs $x^{<1>}$ and computes $\hat{y}^{<1>}, \hat{y}^{<2>}$, and so on up to $\hat{y}^{<T_y>}$. In earlier diagrams I was drawing a bunch of circles to denote neurons, but I'm just going to omit those little circles for most of this video, to keep the notation simpler. So this is what you might call a many-to-many architecture, because the input sequence has many inputs and the output sequence also has many outputs. Now, let's look at a different example. Let's say you want to address sentiment classification. Here, x might be a piece of text, such as a movie review that says, "There is nothing to like in this movie." So x is going to be a sequence, and y might be a number from 1 to 5, or maybe 0 or 1 (is this a positive review or a negative review?), or a number from 1 to 5 (do you think this is a one-star, two-star, three, four, or five-star review?). In this case, we can simplify the neural network architecture as follows. We input $x^{<1>}, x^{<2>}$, so the words one at a time; if the input text was "There is nothing to like in this movie," then "There is nothing to like in this movie" would be the input. And then, rather than having an output at every single time-step, we can just have the RNN read in the entire sentence and output y at the last time-step, once it has already input the entire sentence. So this neural network would be a many-to-one architecture, because it has many inputs, it inputs many words, and then it just outputs one number. For the sake of completeness, there is also a one-to-one architecture. This one is maybe less interesting: it's just a small standard neural network, where we have some input x and some output y, and this would be the type of neural network that we covered in the first two courses in this sequence. Now, in addition to many-to-one, you can also have a one-to-many architecture. An example of a one-to-many neural network architecture would be music generation, and in fact you get to implement this yourself in one of the programming exercises for this course, where your goal is to have a neural network output a set of notes corresponding to a piece of music. The input x could be maybe just an integer telling it what genre of music you want, or what the first note of the music is; and if you don't want to input anything, x could be a null input, or it could always be the vector of zeros. For that, the neural network architecture would be: input x, have your RNN output the first value, then, with no further inputs, output the second value, then go on to output the third value, and so on, until you synthesize the last note of the musical piece. If you want, you can have an input $a^{<0>}$ as well. One technical note, which you'll see in a later video, is that when you're actually generating sequences, you often take the synthesized output at one step and feed it to the next time-step as an input as well, so the network architecture actually ends up looking like that. So, we've talked about many-to-many, many-to-one, one-to-many, as well as one-to-one. It turns out there's one more interesting example of many-to-many which is worth describing, which is when the input and the output lengths are different.
So, in the many-to-many example you saw just now, the input length and the output length have to be exactly the same. For an application like machine translation, the number of words in the input sentence, say a French sentence, and the number of words in the output sentence, say the translation into English, could be different. So here's an alternative neural network architecture, where you might have the neural network first read in the sentence, so first read the input, say the French sentence that you want to translate to English, and having done that, you then have the neural network output the translation, all of the $\hat{y}$'s up to $\hat{y}^{<T_y>}$. With this architecture, $T_x$ and $T_y$ can be different lengths, and again, you can draw in the $a^{<0>}$ if you want. This neural network architecture has two distinct parts: there's the encoder, which takes as input, say, a French sentence, and then there's the decoder, which, having read in the sentence, outputs the translation into a different language. So this would be an example of a many-to-many architecture. By the end of this week, you'll have a good understanding of all the components needed to build these types of architectures. And then, technically, there's one other architecture which we'll talk about only in week four, which is attention-based architectures, and which maybe isn't cleanly captured by one of the diagrams we've drawn so far.
So, to summarize the wide range of RNN architectures: there is one-to-one, although if it's one-to-one, this is just a standard generic neural network, and you don't need an RNN for it. There is one-to-many; this was the music generation or sequence generation example. Then there's many-to-one; that would be an example of sentiment classification, where you might want to read as input all the text of a movie review and then try to figure out whether they liked the movie or not. There is many-to-many, so the named entity recognition example we've been using was this, where $T_x$ is equal to $T_y$. And then, finally, there's the other version of many-to-many, where for applications like machine translation, $T_x$ and $T_y$ no longer have to be the same. So now you know most of the building blocks for building pretty much all of these neural networks, except that there are some subtleties with sequence generation, which is what we'll discuss in the next video.
So, I hope you saw from this video that using the basic building blocks of an RNN, there's already a wide range of models that you might be able to put together. But as I mentioned, there are some subtleties to sequence generation, which you'll also get to implement yourself in this week's programming exercise, where you implement a language model and hopefully generate some fun sequences or some fun pieces of text. So, what I want to do in the next video is go deeper into sequence generation. Let's see the details in the next video.
06_language-model-and-sequence-generation
Language modeling is one of the most basic and important tasks in natural language processing. It's also one that RNNs do very well. In this video, you learn how to build a language model using an RNN, and this will lead up to a fun programming exercise at the end of this week, where you build a language model and use it to generate Shakespeare-like text, or other types of text. Let's get started.
So what is a language model? Let's say you're building a speech recognition system and you hear the sentence, "The apple and pear salad was delicious." So what did you just hear me say? Did I say "the apple and pair salad," or did I say "the apple and pear salad"? You probably think the second sentence is much more likely, and in fact that's what a good speech recognition system would pick, even though these two sentences sound exactly the same. The way a speech recognition system picks the second sentence is by using a language model, which tells it the probability of either of these two sentences. For example, a language model might say that the chance of the first sentence is $3.2 \times 10^{-13}$, and the chance of the second sentence is, say, $5.7 \times 10^{-10}$. With these probabilities, the second sentence is more likely by over a factor of $10^3$ compared to the first sentence, and that's why a speech recognition system will pick the second choice. So what a language model does is, given any sentence, its job is to tell you the probability of that particular sentence. By probability of a sentence I mean: if you were to pick up a random newspaper, open a random email, pick a random webpage, or listen to the next thing someone, a friend of yours, says, what is the chance that the next sentence you encounter somewhere out there in the world will be a particular sentence like "the apple and pear salad"? This is a fundamental component both for speech recognition systems, as you've just seen, and for machine translation systems, where a translation system wants to output only sentences that are likely. So the basic job of a language model is to input a sentence, which I'm going to write as a sequence $y^{<1>}$, $y^{<2>}$ up to $y^{<T_y>}$ (for language models it's useful to represent the sentence as outputs y rather than inputs x), and to estimate the probability of that particular sequence of words, $P(y^{<1>}, y^{<2>}, \ldots, y^{<T_y>})$.
So how do you build a language model? To build such a model using an RNN, you first need a training set comprising a large corpus of English text, or text from whatever language you want to build a language model of. The word corpus is NLP terminology that just means a large body, a very large set, of English sentences. Let's say you get a sentence in your training set as follows: "Cats average 15 hours of sleep a day." The first thing you would do is tokenize this sentence. That means you form a vocabulary, as we saw in an earlier video, and then map each of these words to, say, one-hot vectors, or to indices in your vocabulary. One thing you might also want to do is model when sentences end, so another common thing is to add an extra token called EOS, which stands for End Of Sentence, to help you figure out when a sentence ends. We'll talk more about this later, but the EOS token can be appended to the end of every sentence in your training set if you want your model to explicitly capture when sentences end. We won't use the end-of-sentence token for the programming exercise at the end of this week, but for some applications you might want to use it, and we'll see later where it comes in handy. So in this example, we have $y^{<1>}, y^{<2>}, y^{<3>}$, 4, 5, 6, 7, 8, 9: nine targets in this example if you append the end-of-sentence token. In the tokenization step, you can also decide whether or not the period should be a token; in this example I'm just ignoring punctuation, so I'm using "day" as the last token and omitting the period. If you want to treat the period or other punctuation as an explicit token, then you can add the period to your vocabulary as well. Now, one other detail: what if some of the words in your training set are not in your vocabulary? If your vocabulary has 10,000 words, maybe the 10,000 most common words in English, then the term Mau, as in the Egyptian Mau, which is a breed of cat, might not be among your top 10,000 tokens. In that case you take the word Mau and replace it with a unique token called <UNK>, which stands for unknown word, and you just model the chance of the unknown word instead of that specific word. Having carried out the tokenization step, which basically means taking the input sentence and mapping it to the individual tokens or individual words in your vocabulary, let's next build an RNN to model the probability of these different sequences. One of the things we'll see on the next slide is that you end up setting the inputs $x^{<t>} = y^{<t-1>}$, so that when predicting word t the network is fed the correct words up to t-1.
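Here is a small sketch of this tokenization step, with <EOS> appended and <UNK> substituted for out-of-vocabulary words; the tiny vocabulary and its indices are only illustrative.

```python
# Illustrative vocabulary indices; a real vocabulary would have ~10,000 entries.
vocab = {"cats": 0, "average": 1, "15": 2, "hours": 3, "of": 4,
         "sleep": 5, "a": 6, "day": 7, "<EOS>": 8, "<UNK>": 9}

def tokenize(sentence, vocab, append_eos=True):
    """Map each word to its vocabulary index, using <UNK> for out-of-vocabulary words."""
    tokens = [vocab.get(w.lower(), vocab["<UNK>"]) for w in sentence.split()]
    if append_eos:
        tokens.append(vocab["<EOS>"])
    return tokens

tokens = tokenize("Cats average 15 hours of sleep a day", vocab)
# -> [0, 1, 2, 3, 4, 5, 6, 7, 8]  (nine targets y^<1> ... y^<9>, ending in <EOS>)
```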
So let's go on to build the RNN model, and I'm going to continue to use this sentence as the running example. Here's what the RNN architecture looks like. At the first time-step you compute some activation $a^{<1>}$ as a function of some input $x^{<1>}$, and $x^{<1>}$ will just be set to the vector of all zeros; the previous activation $a^{<0>}$ is, by convention, also set to the vector of zeros. What $a^{<1>}$ does is make a softmax prediction to try to figure out the probability of the first word, $y^{<1>}$. So what this step does is, through a softmax, predict the probability of any word in the dictionary being the first word: what's the chance that the first word is a? What's the chance that the first word is Aaron? What's the chance that the first word is cats? All the way to, what's the chance that the first word is Zulu, or an unknown word, or the end-of-sentence token? So $\hat{y}^{<1>}$ is the output of a softmax; it just predicts the chance of the first word being whatever it ends up being, and in our example it winds up being the word cats. This would be a 10,000-way softmax output if you have a 10,000-word vocabulary, or 10,002, I guess, if you count the unknown word and the end-of-sentence token as two additional tokens. Then the RNN steps forward to the next step, passing the activation $a^{<1>}$ along. At this step its job is to try to figure out what the second word is, but now we also give it the correct first word: we tell it that, in reality, the first word was actually cats, that's $y^{<1>}$, and this is why $x^{<2>} = y^{<1>}$. At the second step the output is again predicted by a softmax, and the RNN's job is to predict the chance of the second word being any particular word, a or Aaron or cats or Zulu or the unknown word or EOS or whatever, given what came previously. In this case the right answer is average, since the sentence starts with "cats average." Then you go on to the next step of the RNN, where you compute $a^{<3>}$. To predict the third word, which is 15, we can now give it the first two words, so we tell it that "cats average" are the first two words; the next input is $x^{<3>} = y^{<2>}$, so the word average is input, and its job is to figure out the next word in the sequence, in other words the probability of any word in the dictionary given that what just came before was "cats average." In this case the right answer is 15, and so on, until at the end, at time-step 9, you feed it $x^{<9>} = y^{<8>}$, which is the word day. This produces $a^{<9>}$, whose job is to output $\hat{y}^{<9>}$, which happens to be the EOS token: what's the chance of the next token given everything that came before, and hopefully it predicts a high chance of the EOS end-of-sentence token. So each step in the RNN looks at some set of preceding words, such as, given the first three words, what is the distribution over the next word, and the RNN learns to predict one word at a time, going from left to right. Next, to train this network, we define the cost function. At a certain time t, if the true word was $y^{<t>}$ and the network's softmax predicted some $\hat{y}^{<t>}$, then the loss at that time-step is the softmax loss $\mathcal{L}^{<t>}(\hat{y}^{<t>}, y^{<t>}) = -\sum_i y_i^{<t>}\log\hat{y}_i^{<t>}$, and the overall loss is just the sum of the losses over all time-steps, $\mathcal{L} = \sum_t \mathcal{L}^{<t>}(\hat{y}^{<t>}, y^{<t>})$. If you train this RNN on a large training set, then it can do things like: given an initial set of words, predict the chance of the next word; and, given a new sentence, say $y^{<1>}, y^{<2>}, y^{<3>}$ with just three words for simplicity, compute the probability of the whole sentence as a product of conditionals, $P(y^{<1>}, y^{<2>}, y^{<3>}) = P(y^{<1>})\,P(y^{<2>} \mid y^{<1>})\,P(y^{<3>} \mid y^{<1>}, y^{<2>})$, where each conditional probability is read off one of the softmax outputs.
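A small sketch of these two quantities, showing that the training loss is just the negative log of the probability the model assigns to the sentence; it assumes each softmax output is a 1-D numpy array of word probabilities, and the function names are illustrative.

```python
import numpy as np

def language_model_loss(y_hat_seq, y_seq):
    """Sum of per-step softmax cross-entropy losses:
    L = sum_t ( -log y_hat^<t>[correct word] )
    y_hat_seq: list of 1-D softmax output vectors, one per time-step
    y_seq:     list of correct word indices, one per time-step"""
    return sum(-np.log(y_hat[idx]) for y_hat, idx in zip(y_hat_seq, y_seq))

def sentence_log_prob(y_hat_seq, y_seq):
    """log P(y^<1>, ..., y^<Ty>) = sum_t log P(y^<t> | y^<1>, ..., y^<t-1>),
    where each conditional is read off the softmax at time-step t."""
    return sum(np.log(y_hat[idx]) for y_hat, idx in zip(y_hat_seq, y_seq))
```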
So that's the basic structure of how you can train a language model using an RNN. If some of these ideas still seem a little bit abstract, don't worry about it; you'll get to practice all of these ideas in the programming exercise. But next, it turns out one of the most fun things you can do with a language model is to sample sequences from the model. Let's take a look at that in the next video.
07_sampling-novel-sequences
After you train a sequence model, one of the ways you can informally get a sense of what it has learned is to have it sample novel sequences. Let's take a look at how you could do that.
So remember that a sequence model models the chance of any particular sequence of words as follows, and what we'd like to do is sample from this distribution to generate novel sequences of words. The network was trained using the structure shown at the top, but to sample, you do something slightly different. What you want to do is first sample the first word you want your model to generate. For that, you input the usual $x^{<1>} = 0$, $a^{<0>} = 0$, and now your first time-step will output a softmax probability over possible outputs. So what you do is randomly sample according to this softmax distribution. The softmax distribution tells you: what is the chance that the first word is a, what is the chance that it's Aaron, what's the chance it's Zulu, what is the chance that the first word is the unknown word token, and maybe there's some chance it's the end-of-sentence token.
And then you take this vector and use, for example, the numpy command np.random.choice to sample according to the distribution defined by this vector of probabilities, and that lets you sample the first word. Next you go on to the second time-step. Remember that during training the second time-step expects the correct first word $y^{<1>}$ as input, but what you do now is take the $\hat{y}^{<1>}$ that you just sampled and pass that in as the input to the next time-step. So whatever word you just chose at the first time-step gets passed as the input at the second position, and then the softmax will make a prediction for $\hat{y}^{<2>}$. For example, let's say that after you sample the first word, it happened to be "The," which is a very common choice of first word. Then you pass in "The" as $x^{<2>}$, which is now equal to $\hat{y}^{<1>}$, and now you're trying to figure out the chance of the second word given that the first word is "The"; this is going to be $\hat{y}^{<2>}$. Then you again use this type of sampling to sample $\hat{y}^{<2>}$. At the next time-step, you take whatever choice you made, represent it, say, as a one-hot encoding, pass it to the next time-step, sample the third word given whatever you chose, and you keep going until you get to the last time-step. So how do you know when the sequence ends? Well, one thing you could do is, if the end-of-sentence token is part of your vocabulary, keep sampling until you generate an EOS token; that tells you you've hit the end of a sentence and you can stop. Alternatively, if you do not include it in your vocabulary, you can just decide to sample 20 words or 100 words or something, and keep going until you've reached that number of time-steps. This procedure will sometimes generate an unknown word token. If you want to make sure that your algorithm never generates that token, one thing you can do is reject any sample that came out as the unknown word token and keep resampling from the rest of the vocabulary until you get a word that's not an unknown word; or you can just leave it in the output if you don't mind having an unknown word in the output. So this is how you generate a randomly chosen sentence from your RNN language model.
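Here is a minimal sketch of this sampling loop using np.random.choice; the helper rnn_step(x, a), assumed to return the softmax distribution and the next activation, is a hypothetical stand-in for a trained RNN cell.

```python
import numpy as np

def sample_sequence(rnn_step, vocab_size, eos_index, a0, x0, max_len=100):
    """Sample a sequence from a trained RNN language model.
    rnn_step(x, a) is assumed to return (y_hat, a_next), where y_hat is the
    softmax distribution over the vocabulary at this time-step."""
    a, x = a0, x0                       # start with zero vectors
    sampled = []
    for _ in range(max_len):
        y_hat, a = rnn_step(x, a)
        # draw the next word index according to the softmax probabilities
        idx = np.random.choice(vocab_size, p=y_hat.ravel())
        if idx == eos_index:            # stop once the <EOS> token is sampled
            break
        sampled.append(idx)
        x = np.zeros_like(x)            # feed the sampled word in as the next input
        x[idx] = 1.0
    return sampled
```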
Now, so far we've been building a word-level RNN, by which I mean the vocabulary consists of words from English. Depending on your application, one thing you can do is also build a character-level RNN. In that case your vocabulary would just be the characters a through z, as well as maybe space and punctuation if you wish, and the digits 0 to 9. If you want to distinguish uppercase from lowercase, you can include the uppercase letters as well, and one thing you can do is just look at your training set, look at the characters that appear there, and use those to define the vocabulary. If you build a character-level language model rather than a word-level language model, then your sequence $y^{<1>}, y^{<2>}, y^{<3>}$ would be the individual characters in your training data rather than the individual words. So for our previous example, the sentence "Cats average 15 hours of sleep a day," c would be $y^{<1>}$, a would be $y^{<2>}$, t would be $y^{<3>}$, the space would be $y^{<4>}$, and so on. Using a character-level language model has some pros and cons. One advantage is that you don't ever have to worry about unknown word tokens; in particular, a character-level language model is able to assign a sequence like Mau a non-zero probability, whereas if Mau was not in your vocabulary for the word-level language model, you would just have to assign it the unknown word token. But the main disadvantage of the character-level language model is that you end up with much longer sequences: many English sentences will have 10 to 20 words but may have many dozens of characters. So character-level language models are not as good as word-level language models at capturing long-range dependencies between how the earlier parts of the sentence affect the later parts, and character-level models are also just more computationally expensive to train. The trend I've been seeing in natural language processing is that, for the most part, word-level language models are still used, but as computers get faster there are more and more applications where people are, at least in some special cases, starting to look at more character-level models. They tend to be much more computationally expensive to train, so they are not in widespread use today, except maybe for specialized applications where you might need to deal with unknown or out-of-vocabulary words a lot, or where you have a more specialized vocabulary. So with these methods, what you can now do is build an RNN to look at a corpus of English text, build a word-level or character-level language model, and sample from the language model that you've trained.
So here are some examples of text that were generated from a language model, actually from a character-level language model, and you get to implement something like this yourself in the programming exercise. If the model was trained on news articles, then it generates text like that shown on the left, and this looks vaguely like news text: not quite grammatical, but maybe it sounds a little bit like things that could appear in the news, "Concussion epidemic to be examined." And if it was trained on Shakespearean text, then it generates stuff that sounds like Shakespeare could have written it: "The mortal moon hath her eclipse in love. And subject of this thou art another this fold. When besser be my love to me see sabl's. For whose are ruse of mine eyes heaves."
So that's it for the basic RNN and how you can build a language model using it, as well as sample from the language model that you've trained. In the next few videos, I want to discuss some of the challenges of training RNNs, as well as how to address some of these challenges, specifically vanishing gradients, by building even more powerful variants of the RNN. So in the next video, let's talk about the problem of vanishing gradients, and we'll go on to talk about the GRU, the Gated Recurrent Unit, as well as the LSTM model.
08_vanishing-gradients-with-rnns
You've learned about how RNNs work and how they can be applied to problems like named entity recognition, as well as to language modeling, and you saw how backpropagation can be used to train an RNN. It turns out that one of the problems with the basic RNN algorithm is that it runs into vanishing gradient problems. Let's discuss that, and then in the next few videos, we'll talk about some solutions that will help to address this problem.
So, you've seen pictures of RNNs that look like this. Let's take a language modeling example. Say you see this sentence: "The cat, which already ate a bunch of food that was delicious, dot dot dot, was full." To be consistent, because cat is singular, it should be "the cat ... was full," whereas "The cats, which already ate a bunch of food that was delicious, and apples, and pears, and so on, ... were full." So to be consistent, it should be "cat was" or "cats were." This is one example of when language can have very long-term dependencies, where a word from much earlier can affect what needs to come much later in the sentence. But it turns out the basic RNN we've seen so far is not very good at capturing very long-term dependencies. To explain why, you might remember from our earlier discussions of training very deep neural networks that we talked about the vanishing gradients problem. If you have a very, very deep neural network, say 100 layers or even much deeper, you carry out forward prop from left to right and then backprop, and we said that the gradient from the output y has a very hard time propagating back to affect the weights of the earlier layers, to affect the computations in the earlier layers. An RNN has a similar problem: forward prop goes from left to right and backprop goes from right to left, and it can be quite difficult, because of the same vanishing gradients problem, for the errors associated with the later time-steps to affect the computations that happen earlier. In practice, what this means is that it might be difficult to get a neural network to realize that it needs to memorize whether it just saw a singular noun or a plural noun, so that later on in the sequence it can generate either "was" or "were," depending on whether it was singular or plural. And notice that in English, the stuff in the middle could be arbitrarily long, right? So you might need to memorize the singular/plural for a very long time before you get to use that bit of information. Because of this problem, the basic RNN model has mainly local influences, meaning that an output like $\hat{y}^{<3>}$ is mainly influenced by values close to it, and a value late in the sequence is mainly influenced by inputs that are somewhere close; it's difficult for the output to be strongly influenced by an input that was very early in the sequence, because whatever the output is, whether it got it right or got it wrong, it's just very difficult for the error to backpropagate all the way to the beginning of the sequence and therefore to modify how the neural network does its computations earlier in the sequence. So this is a weakness of the basic RNN algorithm, one which we'll address in the next few videos; if we don't address it, then RNNs tend not to be very good at capturing long-range dependencies. And even though this discussion has focused on vanishing gradients, you'll remember that when we talked about very deep neural networks, we also talked about exploding gradients: when doing backprop, the gradients might not just decrease exponentially, they may also increase exponentially with the number of layers you go through.
It turns out that vanishing gradients tend to be the bigger problem when training RNNs, although when exploding gradients happen, they can be catastrophic, because the exponentially large gradients can cause your parameters to become so large that your neural network parameters get really messed up. It turns out that exploding gradients are easier to spot, because the parameters just blow up and you might often see NaNs, or not-a-numbers, meaning results of numerical overflow in your neural network computations. If you do see exploding gradients, one solution is to apply gradient clipping. All that means is: look at your gradient vectors, and if they are bigger than some threshold, rescale some of your gradient vectors so that they're not too big, so they're clipped according to some maximum value. So if your derivatives do explode or you see NaNs, just apply gradient clipping, and that's a relatively robust solution that will take care of exploding gradients. But vanishing gradients are much harder to solve, and they will be the subject of the next few videos.
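Here is a minimal sketch of gradient clipping by rescaling to a maximum norm; the threshold value is a tunable choice, and element-wise clipping of each entry to a fixed range is another common variant.

```python
import numpy as np

def clip_by_norm(grads, max_norm=5.0):
    """Rescale the whole set of gradient arrays if their combined norm exceeds max_norm."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total_norm > max_norm:
        scale = max_norm / total_norm
        grads = [g * scale for g in grads]   # every gradient shrunk by the same factor
    return grads
```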
So to summarize: in an earlier course, you saw how, in training a very deep neural network, you can run into vanishing gradient or exploding gradient problems, where the derivative either decreases exponentially or grows exponentially as a function of the number of layers. In an RNN, say an RNN processing data over 1,000 time-steps or 10,000 time-steps, that's basically a 1,000-layer or 10,000-layer neural network, and so it too runs into these types of problems. Exploding gradients you can sort of address by just using gradient clipping, but vanishing gradients take more work to address. So what we'll do in the next video is talk about the GRU, the Gated Recurrent Unit, which is a very effective solution for addressing the vanishing gradient problem and will allow your neural network to capture much longer range dependencies. So, let's go on to the next video.
09_gated-recurrent-unit-gru
You've seen how a basic RNN works. In this video, you learn about the Gated Recurrent Unit, which is a modification to the RNN hidden layer that makes it much better at capturing long-range connections and helps a lot with the vanishing gradient problem. Let's take a look.
You've already seen the formula for computing the activations at time t of an RNN: it's the activation function applied to $W_a$ times the activations at the previous time-step and the current input, plus $b_a$, that is, $a^{<t>} = g(W_a[a^{<t-1>}, x^{<t>}] + b_a)$. I'm going to draw this as a picture: the RNN unit drawn as a box which inputs $a^{<t-1>}$, the activation from the last time-step, and also inputs $x^{<t>}$; these two combine, and after multiplying by the weights and applying the tanh activation, the box outputs the new activation $a^{<t>}$, which can also be passed to a softmax unit to produce the prediction $\hat{y}^{<t>}$. I'm showing this picture of the RNN unit because we're going to use a similar picture to explain the GRU unit.
Many of the ideas of the GRU are due to two papers, by Cho et al. and by Chung et al. (Junyoung Chung, Caglar Gulcehre, Kyunghyun Cho, and Yoshua Bengio). I'm sometimes going to refer to the sentence we saw in the last video to motivate this: given a sentence like that, you might need to remember that the cat was singular, to make sure later on you use "was" rather than "were" ("the cat was full" versus "the cats were full"). So as we read this sentence from left to right, the GRU unit is going to have a new variable called c, which stands for cell, as in memory cell. What the memory cell does is provide a bit of memory to remember, for example, whether cat was singular or plural, so that when the network gets much further into the sentence it can still take into consideration whether the subject of the sentence was singular or plural. At time t the memory cell will have some value $c^{<t>}$, and what we'll see is that the GRU unit actually outputs an activation value $a^{<t>}$ that's equal to $c^{<t>}$. For now I want to use the different symbols c and a to denote the memory cell value and the output activation value, even though they are the same; I'm using this notation because when we talk about LSTMs a little bit later, these will be two different values, but for the GRU, $c^{<t>}$ is equal to the output activation $a^{<t>}$. So here are the equations that govern the computations of a GRU unit. At every time-step, we consider overwriting the memory cell with a candidate value $\tilde{c}^{<t>}$, a candidate for replacing $c^{<t>}$, computed with a tanh activation: $\tilde{c}^{<t>} = \tanh(W_c[c^{<t-1>}, x^{<t>}] + b_c)$, where $W_c$ and $b_c$ are parameters and the input to the tanh is the previous value of the memory cell (which is also the previous activation value) together with the current input $x^{<t>}$. Then the key idea of the GRU is a gate, which I'm going to call $\Gamma_u$, where u stands for update: $\Gamma_u = \sigma(W_u[c^{<t-1>}, x^{<t>}] + b_u)$. Because it's computed with a sigmoid, $\Gamma_u$ is always between zero and one, and for intuition you can think of it as being either pretty close to zero or pretty close to one. The gate decides whether or not we actually update the memory cell: in the cat sentence, the memory cell might be set when we see the word "cat" (remembering that the subject is singular), and the gate keeps that value unchanged all the way until we reach "was," where it gets used, after which the memory is free to be updated again. The actual update equation is $c^{<t>} = \Gamma_u * \tilde{c}^{<t>} + (1 - \Gamma_u) * c^{<t-1>}$: if the gate is one, take the new candidate value; if the gate is zero, just hang on to the old value.
Now I just want to talk over some more details of how you implement this. In the equations I've written, $c^{<t>}$ can be a vector; if you have, say, a 100-dimensional hidden activation value, then $c^{<t>}$, $\tilde{c}^{<t>}$, and $\Gamma_u$ are all vectors of the same dimension, and the multiplications in the update equation are element-wise. So the gate is really a vector of values, mostly close to zero or close to one, telling you, for each element of the 100-dimensional memory cell, which bits to update at each time-step: you can keep some bits constant while updating other bits. For example, maybe one bit remembers whether the cat is singular or plural, while some other bits keep track of the fact that you're talking about food. Because the gate can be so close to zero, the memory cell value $c^{<t>}$ is maintained almost exactly across many, many time-steps, and that's what helps the GRU avoid the vanishing gradient problem and allows the network to learn even very long-range dependencies.
Let me now describe the full GRU unit. To do that, let me copy the three main equations, the candidate, the update gate, and the update rule, to the next slide. For the full GRU unit, I'm going to make one change, which is to add one more gate to the first equation, the one calculating the candidate new value for the memory cell. This other gate is $\Gamma_r$; you can think of r as standing for relevance. This gate $\Gamma_r$ tells you how relevant $c^{<t-1>}$ is to computing the next candidate $\tilde{c}^{<t>}$. So the full GRU equations are: $\tilde{c}^{<t>} = \tanh(W_c[\Gamma_r * c^{<t-1>}, x^{<t>}] + b_c)$, $\Gamma_u = \sigma(W_u[c^{<t-1>}, x^{<t>}] + b_u)$, $\Gamma_r = \sigma(W_r[c^{<t-1>}, x^{<t>}] + b_r)$, $c^{<t>} = \Gamma_u * \tilde{c}^{<t>} + (1 - \Gamma_u) * c^{<t-1>}$, and $a^{<t>} = c^{<t>}$. As you can imagine, there are many possible variations of this design; researchers have experimented over many years with slightly different versions, and this particular one has proven robust and useful across many different problems. In academic papers you'll also sometimes see an alternative notation where the gates are written r and z and the hidden state is written h instead of c, but I wanted to use a notation that stays consistent between the GRU and the LSTM.
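As a concrete illustration, here is a minimal numpy sketch of one forward step of the full GRU following the equations above; the parameter names (Wc, bc, Wu, bu, Wr, br) and the dict layout are just illustrative choices.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_cell_forward(x_t, c_prev, params):
    """One time-step of a full GRU.
    params is a dict with keys Wr, br, Wu, bu, Wc, bc (illustrative names)."""
    concat = np.concatenate([c_prev, x_t], axis=0)
    gamma_r = sigmoid(params["Wr"] @ concat + params["br"])     # relevance gate
    gamma_u = sigmoid(params["Wu"] @ concat + params["bu"])     # update gate
    gated = np.concatenate([gamma_r * c_prev, x_t], axis=0)
    c_tilde = np.tanh(params["Wc"] @ gated + params["bc"])      # candidate cell value
    c_t = gamma_u * c_tilde + (1.0 - gamma_u) * c_prev          # element-wise blend
    a_t = c_t                                                   # for the GRU, a^<t> = c^<t>
    return a_t, c_t
```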
So that's it for the GRU, the Gated Recurrent Unit. This is one of the ideas that has enabled RNNs to become much better at capturing very long-range dependencies and has made RNNs much more effective. Next, as I briefly mentioned, the other most commonly used variation of this class of ideas is something called the LSTM unit, the Long Short Term Memory unit. Let's take a look at that in the next video.
10_long-short-term-memory-lstm
In the last video, you learned about the GRU, the gated recurrent units, and how that can allow you to learn very long range connections in a sequence. The other type of unit that allows you to do this very well is the LSTM or the long short term memory units, and this is even more powerful than the GRU. Let’s take a look.
Here are the equations from the previous video for the GRU. For the GRU, we had $a^{<t>} = c^{<t>}$, together with two gates, the update gate $\Gamma_u$ and the relevance gate $\Gamma_r$, where $\tilde{c}^{<t>}$ is the candidate new value for the memory cell. The LSTM is an even slightly more powerful and more general version, and it's due to Sepp Hochreiter and Jürgen Schmidhuber's seminal paper, which had a huge impact on sequence modeling. In the LSTM, we no longer have $a^{<t>} = c^{<t>}$; the equations we'll use are: $\tilde{c}^{<t>} = \tanh(W_c[a^{<t-1>}, x^{<t>}] + b_c)$ (notice that it uses $a^{<t-1>}$ rather than $c^{<t-1>}$, and we're not using the relevance gate here, although a variation of the LSTM could include it); an update gate $\Gamma_u = \sigma(W_u[a^{<t-1>}, x^{<t>}] + b_u)$; a forget gate $\Gamma_f = \sigma(W_f[a^{<t-1>}, x^{<t>}] + b_f)$; and an output gate $\Gamma_o = \sigma(W_o[a^{<t-1>}, x^{<t>}] + b_o)$. The memory cell update is then $c^{<t>} = \Gamma_u * \tilde{c}^{<t>} + \Gamma_f * c^{<t-1>}$, and the output activation is $a^{<t>} = \Gamma_o * \tanh(c^{<t>})$. So the new feature of the LSTM compared to the GRU is that instead of having one update gate control both terms, it has two separate gates, the update gate and the forget gate, which gives the memory cell the option of keeping the old value $c^{<t-1>}$ and just adding some new value to it.
So, here again are the equations governing the behavior of the LSTM. Once again, it's traditional to explain these things using pictures, so let me draw one here. And if these pictures are too complicated, don't worry about it; I personally find the equations easier to understand than the picture, but I'll show the picture here for the intuitions it conveys. The picture here was very much inspired by a blog post by Chris Olah, titled Understanding LSTM Networks, and the diagram drawn here is quite similar to one that he drew in his blog post. The key things to take away from this picture are that you use $a^{<t-1>}$ and $x^{<t>}$ to compute all of the gate values and the candidate $\tilde{c}^{<t>}$, and these then combine element-wise to determine the new memory cell value $c^{<t>}$ and the output activation $a^{<t>}$. One nice thing you can see is that, if you hook up several of these LSTM units in temporal sequence, there's a line running along the top through which, as long as the forget and update gates are set appropriately, the value $c^{<0>}$ can be passed all the way to the right and preserved for many time-steps. That's why the LSTM, as well as the GRU, is very good at memorizing certain values for a long time. One common variation you may see is a peephole connection, where the gate values depend not just on $a^{<t-1>}$ and $x^{<t>}$ but also on the previous memory cell value $c^{<t-1>}$; with a peephole connection, the i-th element of $c^{<t-1>}$ only affects the i-th element of the corresponding gate, so the relation is one-to-one.
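For completeness, here is a minimal numpy sketch of one forward step of an LSTM following the equations above; the parameter names (Wc, bc, Wu, bu, Wf, bf, Wo, bo) and the dict layout are illustrative, and this sketch omits the peephole variation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_cell_forward(x_t, a_prev, c_prev, p):
    """One time-step of an LSTM.
    p is a dict with keys Wc, bc, Wu, bu, Wf, bf, Wo, bo (illustrative names)."""
    concat = np.concatenate([a_prev, x_t], axis=0)
    c_tilde = np.tanh(p["Wc"] @ concat + p["bc"])       # candidate cell value
    gamma_u = sigmoid(p["Wu"] @ concat + p["bu"])       # update gate
    gamma_f = sigmoid(p["Wf"] @ concat + p["bf"])       # forget gate
    gamma_o = sigmoid(p["Wo"] @ concat + p["bo"])       # output gate
    c_t = gamma_u * c_tilde + gamma_f * c_prev          # new memory cell
    a_t = gamma_o * np.tanh(c_t)                        # output activation
    return a_t, c_t
```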
So, that's it for the LSTM. When should you use a GRU, and when should you use an LSTM? There isn't widespread consensus on this. Even though I presented GRUs first, in the history of deep learning LSTMs actually came much earlier, and GRUs are a relatively recent invention that was perhaps derived as a simplification of the more complicated LSTM model. Researchers have tried both of these models on many different problems, and on different problems different algorithms will win out, so there isn't a universally superior algorithm, which is why I wanted to show you both of them. But I feel like when I'm using these, the advantage of the GRU is that it's a simpler model, so it's actually easier to build a much bigger network; it only has two gates, so computationally it runs a bit faster and scales to building somewhat bigger models. But the LSTM is more powerful and more effective, since it has three gates instead of two. If you want to pick one to use, I think the LSTM has been the historically more proven choice, so if you had to pick one, most people today would still use the LSTM as the default first thing to try. Although, in the last few years, GRUs have been gaining a lot of momentum, and I feel like more and more teams are also using GRUs because they're a bit simpler but often work just as well, and it might be easier to scale them to even bigger problems. So, that's it for LSTMs. With either GRUs or LSTMs, you'll be able to build neural networks that can capture much longer-range dependencies.
11_bidirectional-rnn
By now, you've seen most of the key building blocks of RNNs. But there are just two more ideas that let you build much more powerful models. One is bidirectional RNNs, which let you, at a point in time, take information from both earlier and later in the sequence; we'll talk about that in this video. The second is deep RNNs, which you'll see in the next video. So let's start with bidirectional RNNs.
So, to motivate bidirectional RNNs, let's look at this network, which you've seen a few times before in the context of named entity recognition. One of the problems with this network is that, to figure out whether the third word Teddy is part of a person's name, it's not enough to just look at the first part of the sentence. To tell whether $\hat{y}^{<3>}$ should be zero or one, you need more information than just the first three words, because the first three words don't tell you whether they're talking about Teddy bears or about the former US president, Teddy Roosevelt. So this is a unidirectional, or forward-directional-only, RNN. And this comment I just made is true whether these cells are standard RNN blocks, GRU units, or LSTM blocks; all of these blocks are in a forward-only direction. What a bidirectional RNN, or BRNN, does is fix this issue. A bidirectional RNN works as follows.
I'm going to use a simplified four-input, or maybe four-word, sentence, so we have four inputs, $x^{<1>}$ through $x^{<4>}$. This network's hidden layer will have a forward recurrent component, so I'm going to call these $\overrightarrow{a}^{<1>}$, $\overrightarrow{a}^{<2>}$, $\overrightarrow{a}^{<3>}$ and $\overrightarrow{a}^{<4>}$, with a right arrow drawn over them to denote the forward recurrent component, and they'll be connected as before. Each of these four recurrent units inputs the current x and then feeds in to help predict $\hat{y}^{<1>}$, $\hat{y}^{<2>}$, $\hat{y}^{<3>}$, and $\hat{y}^{<4>}$. So far I haven't done anything; basically, we've redrawn the RNN from the previous slide, but with the arrows placed in slightly funny positions. I drew the arrows in these slightly funny positions because what we're going to do is add a backward recurrent layer. So we'd have $\overleftarrow{a}^{<1>}$, with a left arrow to denote a backward connection, and then $\overleftarrow{a}^{<2>}$, $\overleftarrow{a}^{<3>}$ and $\overleftarrow{a}^{<4>}$, and these backward units will be connected to each other, going backward in time. Notice that this network defines an acyclic graph. Given an input sequence $x^{<1>}$ through $x^{<4>}$, the forward sequence will first compute $\overrightarrow{a}^{<1>}$, then use that to compute $\overrightarrow{a}^{<2>}$, then $\overrightarrow{a}^{<3>}$, then $\overrightarrow{a}^{<4>}$, whereas the backward sequence starts by computing $\overleftarrow{a}^{<4>}$ and then goes back to compute $\overleftarrow{a}^{<3>}$. Note that you're computing network activations here, so this is not backprop, this is forward prop, but part of the forward prop computation goes from left to right and part goes from right to left in this diagram. Having computed $\overleftarrow{a}^{<3>}$, you can then use those activations to compute $\overleftarrow{a}^{<2>}$ and then $\overleftarrow{a}^{<1>}$, and finally, having computed all of the activations, you can make your predictions. For example, your network computes $\hat{y}^{<t>} = g(W_y[\overrightarrow{a}^{<t>}, \overleftarrow{a}^{<t>}] + b_y)$, with both the forward activation and the backward activation at time t being fed in to make the prediction at time t. So if you look at the prediction at time-step three, for example, information from $x^{<1>}$ can flow through $\overrightarrow{a}^{<1>}$ to $\overrightarrow{a}^{<2>}$ to $\overrightarrow{a}^{<3>}$ to $\hat{y}^{<3>}$, so information from $x^{<1>}$, $x^{<2>}$, $x^{<3>}$ is all taken into account, while information from $x^{<4>}$ can flow through $\overleftarrow{a}^{<4>}$ to $\overleftarrow{a}^{<3>}$ to $\hat{y}^{<3>}$. So this allows the prediction at time three to take as input information from the past, information from the present, which goes into both the forward and the backward components at this step, and information from the future. In particular, given a phrase like "He said, 'Teddy Roosevelt...,'" to predict whether Teddy is part of a person's name, you take into account information from both the past and the future. So this is the bidirectional recurrent neural network, and these blocks can be not just standard RNN blocks but also GRU blocks or LSTM blocks. In fact, for a lot of text, for a lot of natural language processing problems, a bidirectional RNN with LSTM blocks is commonly used.
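Here's a rough numpy sketch of BRNN inference; the helper names forward_step and backward_step are hypothetical stand-ins for whatever recurrent cell (basic RNN, GRU, or LSTM) you use, each assumed to return the next activation, and the softmax output is one possible choice of g.

```python
import numpy as np

def bidirectional_forward(xs, forward_step, backward_step, Wy, by, a0_fwd, a0_bwd):
    """Run the forward pass left-to-right and the backward pass right-to-left,
    then predict from both activations at each time-step."""
    T = len(xs)
    a_fwd, a_bwd = [None] * T, [None] * T
    a = a0_fwd
    for t in range(T):                          # left to right
        a = forward_step(xs[t], a)
        a_fwd[t] = a
    a = a0_bwd
    for t in reversed(range(T)):                # right to left
        a = backward_step(xs[t], a)
        a_bwd[t] = a
    # y_hat^<t> = g(Wy [a_fwd^<t>, a_bwd^<t>] + by); here g is a softmax
    y_hats = []
    for t in range(T):
        z = Wy @ np.concatenate([a_fwd[t], a_bwd[t]], axis=0) + by
        y_hats.append(np.exp(z - z.max()) / np.exp(z - z.max()).sum())
    return y_hats
```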
So if you have an NLP problem where you have the complete sentence and you're trying to label things in the sentence, a bidirectional RNN with LSTM blocks, both forward and backward, would be a pretty reasonable first thing to try. So, that's it for the bidirectional RNN. This is a modification you can make to the basic RNN architecture, or to the GRU or the LSTM, and by making this change you get a model that uses an RNN, GRU or LSTM and is able to make predictions anywhere, even in the middle of a sequence, by taking into account information potentially from the entire sequence. The disadvantage of the bidirectional RNN is that you need the entire sequence of data before you can make predictions anywhere. So, for example, if you're building a speech recognition system, then the BRNN lets you take into account the entire speech utterance, but with this straightforward implementation you need to wait for the person to stop talking, to get the entire utterance, before you can process it and make a speech recognition prediction. So for real-time speech recognition applications, there are somewhat more complex modules, rather than just the standard bidirectional RNN you've seen here. But for a lot of natural language processing applications, where you can get the entire sentence all at the same time, the standard BRNN algorithm is actually very effective.
So, that's it for BRNNs. In the next and final video for this week, let's talk about how to take all of these ideas, RNNs, LSTMs, GRUs and the bidirectional versions, and construct deep versions of them.
12_deep-rnns
The different versions of RNNs you've seen so far will already work quite well by themselves. But for learning very complex functions, it is sometimes useful to stack multiple layers of RNNs together to build even deeper versions of these models. In this video, you'll see how to build these deeper RNNs. Let's take a look.
So you remember that for a standard neural network, you have an input x that is fed into a hidden layer with activations, say, $a^{[1]}$ for the first hidden layer, then into the next layer with activations $a^{[2]}$, then maybe another layer with activations $a^{[3]}$, and then you make a prediction $\hat{y}$. A deep RNN is a bit like this, taking the network I just drew by hand and unrolling it in time. So here's the standard RNN you've seen so far, but I've changed the notation a little bit: instead of writing $a^{<0>}$ for the activation at time zero, I've added a square bracket [1] to denote that this is for layer one. So the notation we're going to use is $a^{[l]<t>}$ to denote the activation of layer l at time t. A deep RNN with, say, three hidden layers stacked on top of each other computes an activation such as $a^{[2]<3>}$ from two inputs, the activation of the same layer at the previous time-step and the activation of the layer below at the current time-step: $a^{[2]<3>} = g(W_a^{[2]}[a^{[2]<2>}, a^{[1]<3>}] + b_a^{[2]})$, where each layer l has its own parameters $W_a^{[l]}$ and $b_a^{[l]}$ that are shared across time within that layer.
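Here is a small sketch of one time-step of a stacked RNN following the layer equation above; the function name, the (Wa, ba) parameter layout, and the tanh cell are illustrative, and in practice each layer could just as well be a GRU or LSTM cell.

```python
import numpy as np

def deep_rnn_step(x_t, a_prev_per_layer, params_per_layer):
    """One time-step of a stacked (deep) RNN.
    Layer l takes the same-layer activation from the previous time-step and the
    activation of the layer below at the current time-step:
    a^[l]<t> = tanh(Wa^[l] [a^[l]<t-1>, a^[l-1]<t>] + ba^[l]).
    a_prev_per_layer: list of a^[l]<t-1>, one per layer
    params_per_layer: list of (Wa, ba) tuples, one per layer"""
    input_below = x_t                       # layer 1 sees the input x^<t>
    a_new = []
    for a_prev, (Wa, ba) in zip(a_prev_per_layer, params_per_layer):
        concat = np.concatenate([a_prev, input_below], axis=0)
        a_l = np.tanh(Wa @ concat + ba)
        a_new.append(a_l)
        input_below = a_l                   # becomes the input to the layer above
    return a_new
```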
One thing you do see sometimes is recurrent layers that are stacked on top of each other, and then the output at each time-step feeds into a bunch of deep layers that are not connected horizontally, a deep network that finally predicts $\hat{y}^{<1>}$; you can have the same deep network predicting $\hat{y}^{<2>}$, and so on for $\hat{y}^{<3>}$ and $\hat{y}^{<4>}$. So this is a type of network architecture that we're seeing a little bit more of: a few recurrent layers connected in time, followed by a deep network on top that does not have the horizontal connections. And quite often, these blocks don't just have to be standard RNN blocks, the simple RNN model; they can also be GRU blocks or LSTM blocks. Finally, you can also build deep versions of the bidirectional RNN. Because deep RNNs are quite computationally expensive to train, and there's often a large temporal extent already, you just don't see as many deep recurrent layers; this example has, I guess, three deep recurrent layers connected in time, and you don't see as many deep recurrent layers as you would see layers in a deep conventional neural network.
So that’s it for deep RNNs. With what you’ve seen this week, ranging from the basic RNN, the basic recurrent unit, to the GRU, to the LSTM, to the bidirectional RNN, to the deep versions of this that you just saw, you now have a very rich toolbox for constructing very powerful models for learning sequence models. I hope you enjoyed this week’s videos. Best of luck with the problem exercises and I look forward to seeing you next week.