Sequence Models: Attention Mechanism

Note

This is my personal lecture note after studying week 3 of the course Sequence Models; the copyright belongs to deeplearning.ai.

01_various-sequence-to-sequence-architectures

01_basic-models

Hello, and welcome to this final week of this course, as well as to the final week of this sequence of five courses in the deep learning specialization. You're nearly at the finish line. This week, you'll hear about sequence-to-sequence models, which are useful for everything from machine translation to speech recognition. Let's start with the basic models; later this week you'll hear about beam search and the attention model, and we'll wrap up the discussion with models for audio data, like speech. Let's get started.


Let's say you want to input a French sentence like "Jane visite l'Afrique en septembre," and you want to translate it to the English sentence "Jane is visiting Africa in September." As usual, let's use $x^{<1>}$ through, in this case, $x^{<5>}$ to represent the words in the input sequence, and we'll use $y^{<1>}$ through $y^{<6>}$ to represent the words in the output sequence. So, how can you train a neural network to input the sequence x and output the sequence y? Well, here's something you could do, and the ideas I'm about to present are mainly from these two papers, one due to Ilya Sutskever, Oriol Vinyals, and Quoc Le, and the other by Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. First, let's have a network, which we're going to call the encoder network, built as an RNN, and this could be a GRU or an LSTM, and feed in the input French words one word at a time. After ingesting the input sequence, the RNN outputs a vector that represents the input sentence. After that, you can build a decoder network, which I'm going to draw here, which takes as input the encoding output by the encoder network shown in black on the left, and then can be trained to output the translation one word at a time, until eventually it outputs, say, the end-of-sequence or end-of-sentence token, upon which the decoder stops. And as usual, we could take the generated tokens and feed them back in as the input at the next time step, like we were doing before when synthesizing text using a language model. One of the most remarkable recent results in deep learning is that this model works: given enough pairs of French and English sentences, if you train the model to input a French sentence and output the corresponding English translation, this will actually work decently well. This model simply uses an encoder network, whose job it is to find an encoding of the input French sentence, and then uses a decoder network to generate the corresponding English translation.
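
As a rough sketch of this encoder-decoder idea, here is a minimal PyTorch example; the layer sizes, vocabulary handling, and training details are illustrative assumptions, not the exact architectures from the papers above:

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    """Minimal encoder-decoder: a GRU encodes the source sentence into a single
    vector, and a second GRU decodes the translation one token at a time."""
    def __init__(self, src_vocab, tgt_vocab, emb_dim=256, hidden_dim=512):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, emb_dim)
        self.tgt_emb = nn.Embedding(tgt_vocab, emb_dim)
        self.encoder = nn.GRU(emb_dim, hidden_dim, batch_first=True)
        self.decoder = nn.GRU(emb_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, tgt_vocab)

    def forward(self, src_ids, tgt_ids):
        # Encode: the final hidden state summarizes the whole source sentence.
        _, h = self.encoder(self.src_emb(src_ids))
        # Decode: condition on that summary; during training the reference
        # translation is fed in as the decoder input (teacher forcing).
        dec_out, _ = self.decoder(self.tgt_emb(tgt_ids), h)
        return self.out(dec_out)  # (batch, tgt_len, tgt_vocab) logits

model = Seq2Seq(src_vocab=10000, tgt_vocab=10000)
logits = model(torch.randint(0, 10000, (1, 5)),   # e.g. "Jane visite l'Afrique en septembre"
               torch.randint(0, 10000, (1, 6)))   # e.g. "Jane is visiting Africa in September"
print(logits.shape)  # torch.Size([1, 6, 10000])
```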


An architecture very similar to this also works for image captioning. Given an image like the one shown here, you might want it to be automatically captioned as "a cat sitting on a chair." So how do you train a neural network to input an image and output a caption like that phrase up there? Here's what you can do. From the earlier course on ConvNets, you've seen how you can input an image into a convolutional network, maybe a pre-trained AlexNet, and have that learn an encoding, or learn a set of features, of the input image. So this is actually the AlexNet architecture, and if we get rid of the final softmax unit, the pre-trained AlexNet can give you a 4096-dimensional feature vector with which to represent this picture of a cat. And so this pre-trained network can be the encoder network for the image, and you now have a 4096-dimensional vector that represents the image. You can then take this and feed it to an RNN, whose job it is to generate the caption one word at a time. Similar to what we saw with machine translation, translating from French to English, you can now input a feature vector describing the input and then have it generate an output sequence, or output set of words, one word at a time. And this actually works pretty well for image captioning, especially if the caption you want to generate is not too long. As far as I know, this type of model was first proposed by Junhua Mao, Wei Xu, Yi Yang, Jiang Wang, Zhiheng Huang, and Alan Yuille, although it turns out there were multiple groups coming up with very similar models independently and at about the same time. Two other groups that had done very similar work at about the same time, and I think independently of Mao et al., were Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan, as well as Andrej Karpathy and Fei-Fei Li.
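
Here is a similarly hedged sketch of that captioning setup, assuming torchvision's pretrained AlexNet as the encoder (dropping its final classification layer to get the 4096-dimensional feature vector) and a small GRU decoder; dimensions and preprocessing are illustrative:

```python
import torch
import torch.nn as nn
from torchvision import models

# Encoder: pretrained AlexNet with the final classification layer removed,
# leaving a 4096-dimensional feature vector for the image.
alexnet = models.alexnet(weights="DEFAULT")
alexnet.classifier = nn.Sequential(*list(alexnet.classifier.children())[:-1])
alexnet.eval()

# Decoder: an RNN that starts from the image features and emits the caption word by word.
class CaptionDecoder(nn.Module):
    def __init__(self, vocab_size, emb_dim=256, hidden_dim=512, feat_dim=4096):
        super().__init__()
        self.init_h = nn.Linear(feat_dim, hidden_dim)   # map image features to the initial state
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, img_feat, caption_ids):
        h0 = torch.tanh(self.init_h(img_feat)).unsqueeze(0)
        rnn_out, _ = self.rnn(self.emb(caption_ids), h0)
        return self.out(rnn_out)

with torch.no_grad():
    feat = alexnet(torch.randn(1, 3, 224, 224))           # (1, 4096) image features
decoder = CaptionDecoder(vocab_size=10000)
logits = decoder(feat, torch.randint(0, 10000, (1, 7)))   # e.g. "a cat sitting on a chair <EOS>"
```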

So, you've now seen how a basic sequence-to-sequence model works, and how a basic image-to-sequence, or image captioning, model works. But there are some differences between how you would run a model like this to generate a sequence, compared to how you were synthesizing novel text using a language model. One of the key differences is that you don't want a randomly chosen translation; you want the most likely translation. Similarly, you don't want a randomly chosen caption; you want the best, most likely caption. So let's see in the next video how you go about generating that.

02_picking-the-most-likely-sentence

There are some similarities between the sequence-to-sequence machine translation model and the language models that you worked with in the first week of this course, but there are some significant differences as well. Let's take a look.


So, you can think of machine translation as building a conditional language model. Here's what I mean: in language modeling, this was the network we built in the first week, and this model allows you to estimate the probability of a sentence. That's what a language model does. You can also use it to generate novel sentences, and when you were doing that, the inputs $x^{<1>}$, $x^{<2>}$ here were such that, for example, $x^{<2>}$ would just be the previous output $\hat{y}^{<1>}$ fed back in. But the exact values of $x^{<1>}$, $x^{<2>}$, and so on were not important, so just to clean this up for this slide, I'm going to cross them off: $x^{<1>}$ could be the vector of all zeros, and $x^{<2>}$, $x^{<3>}$, and so on are just the previous outputs you generated. So that was the language model. The machine translation model looks as follows, and I'm going to use a couple of different colors, green and purple, to denote respectively the encoder network in green and the decoder network in purple. You'll notice that the decoder network looks pretty much identical to the language model we had up there. So the machine translation model is very similar to the language model, except that instead of always starting off with the vector of all zeros, it has an encoder network that figures out some representation for the input sentence, and it starts off the decoder network with that representation of the input sentence rather than with the representation of all zeros. That's why I call this a conditional language model: instead of modeling the probability of any sentence, it is now modeling the probability of, say, the output English translation, conditioned on some input French sentence. In other words, you're trying to estimate the probability of an English translation: what's the chance that the translation is "Jane is visiting Africa in September," conditioned on the input French sentence "Jane visite l'Afrique en septembre"? So this is really the probability of an English sentence conditioned on an input French sentence, which is why it is a conditional language model.


Now, if you want to apply this model to actually translate a sentence from French into English, given this input French sentence, the model can tell you the probability of different corresponding English translations. So x is the French sentence "Jane visite l'Afrique en septembre," and this now tells you the probability of different English translations of that French input. And what you do not want is to sample outputs at random. If you sample words from this distribution, P(y|x), maybe one time you get a pretty good translation, "Jane is visiting Africa in September." But maybe another time you get a different translation, "Jane is going to be visiting Africa in September," which sounds a little awkward but is not a terrible translation, just not the best one. And sometimes, just by chance, you get yet another one: "In September, Jane will visit Africa." And maybe, just by chance, sometimes you sample a really bad translation: "Her African friend welcomed Jane in September." So, when you're using this model for machine translation, you're not trying to sample at random from this distribution. Instead, what you would like is to find the English sentence, y, that maximizes that conditional probability. So in developing a machine translation system, one of the things you need to do is come up with an algorithm that can actually find the value of y that maximizes this term over here. The most common algorithm for doing this is called beam search, and it's something you'll see in the next video. But before moving on to describe beam search, you might wonder, why not just use greedy search? So, what is greedy search?
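
Written out in the notation of this course, the objective is to search over output sentences for

$$\arg\max_{y^{<1>}, \ldots, y^{<T_y>}} P\big(y^{<1>}, \ldots, y^{<T_y>} \mid x\big)$$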


Well, greedy search is an algorithm from computer science which says: to generate the first word, just pick whatever is the most likely first word according to your conditional language model, i.e., your machine translation model; then, after having picked the first word, pick whatever second word seems most likely, then pick the third word that seems most likely, and so on. This algorithm is called greedy search. What you would really like is to pick the entire sequence of words, $y^{<1>}, y^{<2>}$, up to $y^{<T_y>}$, that maximizes the joint probability of the whole thing. And it turns out that the greedy approach, where you just pick the best first word, and then, after having picked the best first word, try to pick the best second word, and then, after that, try to pick the best third word, doesn't really work. To demonstrate that, let's consider the following two translations. The first one is a better translation, so hopefully, in our machine translation model, it will say that P(y|x) is higher for the first sentence; it's just a better, more succinct translation of the French input. The second one is not a bad translation, it's just more verbose, with more unnecessary words. But if the algorithm has picked "Jane is" as the first two words, then because "going" is a more common English word, the probability of "Jane is going," given the French input, might actually be higher than the probability of "Jane is visiting," given the French sentence. So it's quite possible that if you just pick the third word based on whatever maximizes the probability of just the first three words, you end up choosing option number two. But this ultimately ends up resulting in a less optimal sentence, a less good sentence as measured by this model for P(y|x). I know this was maybe a slightly hand-wavy argument, but it is an example of a broader phenomenon: if you want to find the sequence of words $y^{<1>}, y^{<2>}$, all the way up to the final word, that together maximize the probability, it's not always optimal to just pick one word at a time. And, of course, the total number of combinations of words in the English sentence is exponentially large. If you have just 10,000 words in your dictionary and you're contemplating translations that are up to ten words long, then there are 10,000 to the tenth power possible sentences that are ten words long, picking words from a vocabulary, or dictionary, of size 10,000. So this is just a huge space of possible sentences, and it's impossible to evaluate them all, which is why the most common thing to do is to use an approximate search algorithm.
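
For concreteness, here is a minimal sketch of greedy decoding, assuming a hypothetical decode_step(x, prefix) function that returns the model's distribution over the next word given the French input x and the English words generated so far:

```python
import numpy as np

def greedy_decode(x, decode_step, vocab, eos="<EOS>", max_len=30):
    """Pick the single most likely next word at every step (no lookahead)."""
    prefix = []
    for _ in range(max_len):
        probs = decode_step(x, prefix)       # P(next word | x, prefix) over the vocabulary
        word = vocab[int(np.argmax(probs))]  # greedily take the best next word
        prefix.append(word)
        if word == eos:
            break
    return prefix
```

Because each choice is made locally, "Jane is going ..." can beat "Jane is visiting ..." at the third step even though the full "visiting" sentence has the higher overall probability.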

And what an approximate search algorithm does is that it will try, although it won't always succeed, to pick the sentence, y, that maximizes that conditional probability. And even though it's not guaranteed to find the value of y that maximizes this, it usually does a good enough job.

So, to summarize, in this video you saw how machine translation can be posed as a conditional language modeling problem. But one major difference between this and the earlier language modeling problems is that rather than wanting to generate a sentence at random, you want to find the most likely English sentence, the most likely English translation. But the set of all English sentences of a certain length is too large to exhaustively enumerate, so we have to resort to a search algorithm. With that, let's go on to the next video, where you'll learn about the beam search algorithm.

03_beam-search

In this video, you'll learn about the beam search algorithm. In the last video, you saw how, for machine translation, given an input French sentence, you don't want to output a random English translation; you want to output the best and most likely English translation. The same is also true for speech recognition, where given an input audio clip, you don't want to output a random text transcript of that audio; you want to output the best, most likely, text transcript. Beam search is the most widely used algorithm to do this.


And in this video, you'll see how to get beam search to work for yourself. Let's try beam search using our running example of the French sentence "Jane visite l'Afrique en septembre," hopefully being translated into "Jane visits Africa in September." The first thing beam search has to do is try to pick the first word of the English translation that it's going to output. So here I've listed, say, the 10,000 words in the vocabulary, and to simplify the problem a bit, I'm going to ignore capitalization, so I'm just listing all the words in lowercase. In the first step of beam search, I use this network fragment, with the encoder shown in green and the decoder shown in purple, to evaluate the probability of the first word. That is, what's the probability of the first output $y^{<1>}$, given the input French sentence x? So, whereas greedy search would pick only the single most likely word and move on, beam search can consider multiple alternatives. The beam search algorithm has a parameter called B, the beam width, and for this example I'm going to set the beam width to three. What this means is that beam search will consider not just one possibility but three at a time. In particular, let's say that, evaluating this probability over different choices of the first word, it finds that "in", "Jane", and "September" are the three most likely possibilities for the first word of the English output. Then beam search will store away in computer memory that it wants to try all three of these words; if the beam width parameter were set differently, say to 10, then you would keep track of not just three but of the ten most likely possible choices for the first word. So, to be clear, in order to perform this first step of beam search, what you need to do is run the input French sentence through the encoder network, and then at the first step of the decoder network you get a softmax output over all 10,000 possibilities. Then you take those 10,000 possible outputs and keep in memory the top three.


Let's go on to the second step of beam search. Having picked "in", "Jane", and "September" as the three most likely choices for the first word, what beam search will now do is, for each of these three choices, consider what the second word should be. So after "in", maybe the second word is "a", or maybe it's "Aaron" (I'm just listing words from the vocabulary, from the dictionary); somewhere down the list will be "September", somewhere down the list there's "visit", and all the way at the end the last word is "zulu". To evaluate the probability of the second word, it will use this network fragment, with the encoder in green, and for the decoder portion, when trying to decide what comes after "in", remember that the decoder first outputs $\hat{y}^{<1>}$. So I'm going to set this $\hat{y}^{<1>}$ to the word "in" and feed it back in, so there's the word "in", because the question is: given that the first word was "in", what is the second word? And this will then output $\hat{y}^{<2>}$. By hardwiring $\hat{y}^{<1>}$ here, really the input here, to be the first word "in", this network fragment can be used to evaluate the probability of the second word given the input French sentence and given that the first word of the translation is "in". Now, notice that what we really need in this second step of beam search is to find the pair of first and second words that is most likely, not just the most likely second word on its own. By the rules of conditional probability, this can be expressed as the probability of the first word times the probability of the second word given the first, which you are getting from this network fragment. So if, for each of the three words you've chosen, "in", "Jane", and "September", you saved away the first probability, you can multiply it by this second probability to get the probability of the first and second words together.

So now you've seen how, if the first word was "in", you can evaluate the probability of the second word. If the first word was "Jane", you do the same thing: the sentence could be "Jane a", "Jane Aaron", and so on down to "Jane is", "Jane visits", and so forth. And you would use this network fragment, let me draw it in as well, where you hardwire $\hat{y}^{<1>}$ to be "Jane". With the first word $\hat{y}^{<1>}$ hardwired as "Jane", this network fragment can tell you the probability of the second word given that the first word is "Jane". And then, same as above, you can multiply by $P(y^{<1>} \mid x)$ to get the probability of $y^{<1>}$ and $y^{<2>}$ for each of these 10,000 different possible choices for the second word. Finally, you do the same thing for "September", considering words from "a" down to "zulu", and use this network fragment, which I'll draw in as well, to see, if the first word was "September", what the most likely options for the second word are. So for this second step of beam search, because we're continuing to use a beam width of three, and because there are 10,000 words in the vocabulary, you'd end up considering three times 10,000, or 30,000, possibilities, because there are 10,000 here, 10,000 here, and 10,000 here: the beam width times the number of words in the vocabulary. What you do is evaluate all of these 30,000 options according to the probability of the first and second words, and then pick the top three. So you cut these 30,000 possibilities down to three again, down to the beam width. Let's say that of those 30,000 choices the most likely were "in September", "Jane is", and "Jane visits" (sorry, this is a bit messy, but those are the most likely three out of the 30,000 choices); those are what beam search would memorize and take on to the next step of beam search. Notice one thing: if beam search decides that the most likely choices for the first and second words are "in September", "Jane is", or "Jane visits", then that means it is now rejecting "September" as a candidate for the first word of the output English translation. So we're now down to two possibilities for the first word, but we still have a beam width of three, keeping track of three choices for pairs of $y^{<1>}, y^{<2>}$, before going on to the third step of beam search. One more thing to note: because the beam width is equal to three, at every step you instantiate three copies of the network to evaluate these partial sentence fragments and their outputs. It's because the beam width is equal to three that you have three copies of the network with different choices for the first word, but these three copies of the network can be used very efficiently to evaluate all 30,000 options for the second word. So you don't need to instantiate 30,000 copies of the network; three copies of the network suffice to very quickly evaluate all 10,000 possible outputs at that softmax output for $y^{<2>}$.

Let's just quickly illustrate one more step of beam search. We said that the most likely choices for the first two words were "in September", "Jane is", and "Jane visits", and for each of these pairs of words we have saved away in computer memory the probability of $y^{<1>}$ and $y^{<2>}$ given the input x, given the French sentence. So, similar to before, we now want to consider what the third word should be. Is it "in September a"? "in September Aaron"? All the way down to "in September zulu". To evaluate possible choices for the third word, you use this network fragment, where you hardwire the first word to be "in" and the second word to be "September". This network fragment lets you evaluate the probability of the third word given the input French sentence x and given that the first two words of the English output are "in September". You then do the same thing for the second fragment, like so, and the same thing for "Jane visits". Beam search will then once again pick the top three possibilities; maybe it finds that "in September, Jane" is likely, or "Jane is visiting" is likely, or "Jane visits Africa" is likely for the first three words, and then it keeps going. You go on to the fourth step of beam search, add one more word, and on it goes. The outcome of this process, hopefully, is that, adding one word at a time, beam search will decide that "Jane visits Africa in September," terminated by the end-of-sentence symbol (using that symbol is quite common), is a likely output English sentence. You'll see more details of this yourself in this week's exercise, where you get to play with beam search. So with a beam width of three, beam search considers three possibilities at a time. Notice that if the beam width were set equal to one, so it only considers one possibility at a time, then this essentially becomes the greedy search algorithm, which we discussed in the last video. But by considering multiple possibilities, say three or ten or some other number, at the same time, beam search will usually find a much better output sentence than greedy search.
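
Putting those steps together, here is a compact beam search sketch, again assuming a hypothetical decode_step(x, prefix) that returns the distribution over the next word; log probabilities are used to avoid numerical underflow, a refinement discussed in the next section. With beam_width=1 it reduces to the greedy search above:

```python
import numpy as np

def beam_search(x, decode_step, vocab, beam_width=3, eos="<EOS>", max_len=30):
    """Keep the beam_width most likely partial translations at every step."""
    beams = [([], 0.0)]                        # (partial sentence, log probability)
    completed = []
    for _ in range(max_len):
        candidates = []
        for prefix, logp in beams:
            probs = decode_step(x, prefix)     # P(next word | x, prefix)
            for i, p in enumerate(probs):      # expand each beam by every vocabulary word
                candidates.append((prefix + [vocab[i]], logp + np.log(p + 1e-12)))
        # Keep only the top beam_width of the beam_width * |vocab| candidates.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for prefix, logp in candidates[:beam_width]:
            (completed if prefix[-1] == eos else beams).append((prefix, logp))
        if not beams:                          # all surviving candidates have ended
            break
    return max(completed + beams, key=lambda c: c[1])
```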

You've now seen how beam search works, but it turns out there are some additional tips and tricks, some refinements, that help make beam search work even better. Let's go on to the next video to take a look.

04_refinements-to-beam-search

In the last video, you saw the basic beam search algorithm. In this video, you'll learn some little changes that make it work even better. Length normalization is a small change to the beam search algorithm that can help you get much better results. Here's what it is.


Beam search is maximizing this probability, and the product here is just expressing the observation that $P(y^{<1>}, \ldots, y^{<T_y>} \mid x)$ can be written as $P(y^{<1>} \mid x) \times P(y^{<2>} \mid x, y^{<1>}) \times \cdots \times P(y^{<T_y>} \mid x, y^{<1>}, \ldots, y^{<T_y - 1>})$. Maybe this notation is a bit more scary and intimidating than it needs to be, but these are the probabilities you saw previously. Now, if you're implementing this, these probabilities are all numbers less than 1, often much less than 1, and multiplying a lot of numbers less than 1 results in a tiny, tiny number, which can cause numerical underflow, meaning that it's too small for the floating-point representation in your computer to store accurately. So in practice, instead of maximizing this product, we take logs. If you insert a log there, then the log of a product becomes a sum of logs, and maximizing this sum of log probabilities should give you the same result in terms of selecting the most likely sentence y. By taking logs, you end up with a more numerically stable algorithm that is less prone to numerical rounding errors or numerical underflow. And because the logarithmic function is a strictly monotonically increasing function, we know that maximizing log P(y|x) should give you the same result as maximizing P(y|x): the same value of y that maximizes one should also maximize the other. So in most implementations, you keep track of the sum of the logs of the probabilities rather than the product of the probabilities. Now, there's one other change to this objective function that makes the machine translation algorithm work even better. If you refer to the original objective up here, if you have a very long sentence, the probability of that sentence is going to be low, because you're multiplying many terms, lots of numbers less than 1, to estimate the probability of that sentence. If you multiply many numbers that are less than 1 together, you just tend to end up with a smaller probability. So this objective function has an undesirable effect: it may unnaturally tend to prefer very short translations, very short outputs, because the probability of a short sentence is determined by multiplying fewer of these numbers that are less than 1, and so the product isn't quite as small. And by the way, the same thing is true for the log version: the log of a probability is always less than or equal to 0, so you're working in the negative range of the log function, and the more terms you add together, the more negative this thing becomes. So there's one other change to the algorithm that makes it work better, which is that instead of using this as the objective you're trying to maximize, you can normalize it by the number of words in your translation. This takes the average of the log of the probability of each word, and it significantly reduces the penalty for outputting longer translations. In practice, as a heuristic, instead of dividing by $T_y$, the number of words in the output sentence, sometimes you use a softer approach: you divide by $T_y$ to the power of $\alpha$, where maybe $\alpha$ equals 0.7. So if $\alpha$ were equal to 1, then yes, you're completely normalizing by length.
If $\alpha$ were equal to 0, then $T_y$ to the 0 would be 1, and you're just not normalizing at all. So this is somewhere in between full normalization and no normalization, and $\alpha$ is another hyperparameter that you can tune to try to get the best results. I have to admit, using $\alpha$ this way is a heuristic, or a hack; there isn't a great theoretical justification for it, but people have found that it works well in practice, so many groups do this. And you can try out different values of $\alpha$ and see which one gives you the best result.
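
Putting the two refinements together, the normalized log-likelihood objective used to score candidate translations is

$$\frac{1}{T_y^{\alpha}} \sum_{t=1}^{T_y} \log P\big(y^{<t>} \mid x, y^{<1>}, \ldots, y^{<t-1>}\big), \qquad \text{e.g. } \alpha = 0.7$$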

So, just to wrap up how you run beam search: as you run beam search, you see a lot of sentences with length equal to 1, a lot of sentences with length equal to 2, a lot of sentences with length equal to 3, and so on. Maybe you run beam search for 30 steps and consider output sentences up to length 30, let's say. With a beam width of 3, you would be keeping track of the top three possibilities for each of these possible sentence lengths: 1, 2, 3, 4, and so on, up to 30. Then you look at all of these output sentences and score them with this objective; you take your top sentences and compute this objective function on the sentences that you saw during the beam search process. And then, finally, of all of the sentences that you evaluate this way, you pick the one that achieves the highest value of this normalized log probability objective (sometimes it's called a normalized log likelihood objective). And that would be the final translation your system outputs. So that's how you implement beam search, and you get to play with this yourself in this week's programming exercise.


Finally, a few implementational details: how do you choose the beam width B? Here are the pros and cons of setting B to be very large versus very small. If the beam width is very large, then you consider a lot of possibilities, and so you tend to get a better result because you are considering a lot of different options, but the algorithm will be slower, and the memory requirements will also grow, because you're keeping a lot more possibilities around. Whereas if you use a very small beam width, then you get a worse result, because you're keeping fewer possibilities in mind as the algorithm runs, but you get the result faster and the memory requirements will also be lower. In the previous video we used in our running example a beam width of three, so we were keeping three possibilities in mind; in practice, that is on the small side. In production systems, it's not uncommon to see a beam width of maybe around 10, and I think a beam width of 100 would be considered very large for a production system, depending on the application. But for research systems, where people want to squeeze out every last drop of performance in order to publish a paper with the best possible result, it's not uncommon to see people use beam widths of 1,000 or 3,000; this is very application and domain dependent. So I would say try out a variety of values of B as you work through your application, but note that when B gets very large, there are often diminishing returns. For many applications, I would expect to see a big gain as you go from a beam width of 1, which is essentially greedy search, to 3, to maybe 10, but the gains as you go from 1,000 to 3,000 in beam width might not be as big. And for those of you who have taken a lot of computer science courses before and are familiar with search algorithms like BFS, Breadth First Search, or DFS, Depth First Search: the way to think about beam search is that, unlike those algorithms, which are exact search algorithms, beam search runs much faster but does not guarantee to find the exact maximum for this argmax that you would like to find. If you haven't heard of Breadth First Search or Depth First Search, don't worry about it; it's not important for our purposes. But if you have, this is how beam search relates to those algorithms.

So that's it for beam search, which is a widely used algorithm in many production systems and many commercial systems. Now, earlier in this sequence of deep learning courses, we talked a lot about error analysis. It turns out that one of the most useful tools I've found is to be able to do error analysis on beam search. You sometimes wonder: should I increase my beam width? Is my beam width working well enough? There are some simple things you can compute to give you guidance on whether you need to work on improving your search algorithm. Let's talk about that in the next video.

05_error-analysis-in-beam-search

In the third course of this sequence of five courses, you saw how error analysis can help you focus your time on doing the most useful work for your project. Now, beam search is an approximate search algorithm, also called a heuristic search algorithm, and so it doesn't always output the most likely sentence; it's only keeping track of B equals 3 or 10 or 100 top possibilities. So what if beam search makes a mistake? In this video, you'll learn how error analysis interacts with beam search and how you can figure out whether it is the beam search algorithm that's causing problems and worth spending time on, or whether it might be your RNN model that is causing problems and worth spending time on. Let's take a look at how to do error analysis with beam search.


Let's use this example of "Jane visite l'Afrique en septembre." Let's say that in your machine translation dev set, your development set, the human provided the translation "Jane visits Africa in September," and I'm going to call this y*. So it is a pretty good translation written by a human. Then let's say that when you run beam search on your learned RNN model, your learned translation model, it ends up with this translation, which we will call y-hat: "Jane visited Africa last September," which is a much worse translation of the French sentence; it actually changes the meaning, so it's not a good translation. Now, your model has two main components. There is the neural network model, the sequence-to-sequence model, which we'll just call the RNN model; it's really an encoder and a decoder. And you have your beam search algorithm, which you're running with some beam width B. Wouldn't it be nice if you could attribute this error, this not very good translation, to one of these two components? Was it the RNN, really the neural network, that is more to blame, or is it the beam search algorithm that is more to blame? What you saw in the third course of the sequence is that it's always tempting to collect more training data, because that never hurts; in a similar way, it's always tempting to increase the beam width, because that never hurts, or pretty much never hurts. But just as getting more training data by itself might not get you to the level of performance you want, increasing the beam width by itself might not get you to where you want to go. So how do you decide whether or not improving the search algorithm is a good use of your time? Here's how you can break the problem down and figure out what's actually a good use of your time. Now, the RNN, the neural network (what we call the RNN here really means the encoder and the decoder), computes P(y|x). So, for example, for the sentence "Jane visits Africa in September," you plug in "Jane visits Africa in September" (again, I'm ignoring upper versus lowercase, and so on), and this computes P(y|x). It turns out that the most useful thing for you to do at this point is to use this model to compute P(y*|x) as well as P(y-hat|x), and then see which of these two is bigger. It's possible that the left side is bigger than the right-hand side; it's also possible that P(y*|x) is less than or equal to P(y-hat|x). Depending on which of these two cases holds true, you'd be able to more clearly ascribe this particular error, this particular bad translation, to either the RNN or the beam search algorithm being more at fault. So let's look at the logic behind this.


Here are the two sentences from the previous slide. And remember, we're going to compute P(y*|x) and P(y-hat|x) and see which of these two is bigger. There are going to be two cases. In case 1, P(y*|x) as output by the RNN model is greater than P(y-hat|x). What does this mean? Well, the beam search algorithm chose y-hat, right? The way you got y-hat was that you had an RNN computing P(y|x), and beam search's job was to try to find a value of y that gives the arg max. But in this case, y* actually attains a higher value of P(y|x) than y-hat. So what this allows you to conclude is that beam search is failing to give you the value of y that maximizes P(y|x), because the one job beam search had was to find the value of y that makes this really big, but it chose y-hat even though y* actually gets a bigger value. So in this case, you can conclude that beam search is at fault. Now, how about the other case? In case 2, P(y*|x) is less than or equal to P(y-hat|x). Either case 1 or case 2 has to hold true. What do you conclude under case 2? Well, in our example, y* is a better translation than y-hat, but according to the RNN, P(y*|x) is less than P(y-hat|x), so it's saying that y* is a less likely output than y-hat. In this case, it seems that the RNN model is at fault, and it might be worth spending more time working on the RNN. There are some subtleties here pertaining to length normalization that I'm glossing over: if you are using some sort of length normalization, then instead of evaluating these probabilities, you should be evaluating the optimization objective that takes length normalization into account. But ignoring that complication for now, in this case, what this tells you is that even though y* is a better translation, the RNN ascribed a lower probability to y* than to the inferior translation. So in this case, I would say the RNN model is at fault. So the error analysis process looks as follows.


You go through the development set and find the mistakes that the algorithm made on the development set. In this example, let's say that P(y*|x) was 2 × 10⁻¹⁰, whereas P(y-hat|x) was 1 × 10⁻¹⁰. Using the logic from the previous slide, in this case we see that beam search actually chose y-hat, which has a lower probability than y*, so I would say beam search is at fault, and I'll abbreviate that B. Then you go through a second mistake, a second bad output by the algorithm, look at these probabilities, and maybe for the second example you conclude the model is at fault; I'm going to abbreviate the RNN model with R. And you go through more examples, and sometimes the beam search is at fault, sometimes the model is at fault, and so on. Through this process, you can carry out error analysis to figure out what fraction of errors are due to beam search versus the RNN model. With an error analysis process like this, for every example in your dev set where the algorithm gives a much worse output than the human translation, you can try to ascribe the error either to the search algorithm or to the objective function, that is, to the RNN model that generates the objective function that beam search is supposed to be maximizing. Through this, you can figure out which of these two components is responsible for more errors. And only if you find that beam search is responsible for a lot of errors is it maybe worth working hard to increase the beam width. Whereas, in contrast, if you find that the RNN model is at fault, then you could do a deeper level of analysis to figure out whether you want to add regularization, get more training data, try a different network architecture, or something else. And a lot of the techniques that you saw in the third course in the sequence will be applicable there.
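
A minimal sketch of this tallying process, assuming a hypothetical log_prob(model, x, y) helper that scores a complete sentence y under the RNN (if you use length normalization, score with the normalized objective instead):

```python
def attribute_errors(dev_errors, model, log_prob):
    """For each dev-set mistake, decide whether beam search or the RNN is at fault."""
    counts = {"beam_search": 0, "rnn_model": 0}
    for x, y_star, y_hat in dev_errors:      # (French input, human translation, model output)
        p_star = log_prob(model, x, y_star)
        p_hat = log_prob(model, x, y_hat)
        if p_star > p_hat:
            counts["beam_search"] += 1       # search missed a higher-scoring sentence
        else:
            counts["rnn_model"] += 1         # the model prefers the worse translation
    total = max(sum(counts.values()), 1)
    return {k: v / total for k, v in counts.items()}
```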

So that’s it for error analysis using beam search. I found this particular error analysis process very useful whenever you have an approximate optimization algorithm, such as beam search that is working to optimize some sort of objective, some sort of cost function that is output by a learning algorithm, such as a sequence-to-sequence model or a sequence-to-sequence RNN that we’ve been discussing in these lectures. So with that, I hope that you’ll be more efficient at making these types of models work well for your applications.

06_bleu-score-optional

One of the challenges of machine translation is that, given a French sentence, there could be multiple English translations that are equally good translations of that French sentence. So how do you evaluate a machine translation system if there are multiple equally good answers, unlike, say, image recognition, where there's one right answer and you just measure accuracy? If there are multiple great answers, how do you measure accuracy? The way this is conventionally done is through something called the BLEU score. So, in this optional video, I want to give you a sense of how the BLEU score works.


Let's say you are given a French sentence, "Le chat est sur le tapis," and you are given a reference, human-generated translation of this, which is "The cat is on the mat." But there are multiple pretty good translations of this, so a different human might translate it as "There is a cat on the mat," and both of these are perfectly fine translations of the French sentence. What the BLEU score does is, given a machine-generated translation, it allows you to automatically compute a score that measures how good that machine translation is. The intuition is that so long as the machine-generated translation is pretty close to any of the references provided by humans, it will get a high BLEU score. BLEU, by the way, stands for bilingual evaluation understudy. In the theater world, an understudy is someone who learns the role of a more senior actor so they can take over that role if necessary; the motivation for BLEU is that, whereas you could ask human evaluators to evaluate the machine translation system, the BLEU score is an understudy, a substitute, for having humans evaluate every output of a machine translation system. The BLEU score is due to Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. This paper has been incredibly influential and is actually quite readable, so I encourage you to take a look if you have time. The intuition behind the BLEU score is that we're going to look at the machine-generated output and see if the types of words it generates appear in at least one of the human-generated references; these human-generated references would be provided as part of the dev set or the test set. Now, let's look at a somewhat extreme example. Let's say that the machine translation system (abbreviating machine translation as MT), the MT output, is "the the the the the the the". This is clearly a pretty terrible translation. One way to measure how good the machine translation output is, is to look at each of the words in the output and see if it appears in the references; this would be called the precision of the machine translation output. In this case, there are seven words in the machine translation output, and every one of these 7 words appears in either Reference 1 or Reference 2 (the word "the" appears in both references), so each of these words looks like a pretty good word to include, and this would have a precision of 7 over 7. It looks like great precision. So that's the basic precision measure: what fraction of the words in the MT output also appear in the references. It is not a particularly useful measure, because it seems to imply that this MT output has very high precision. So instead, what we're going to use is a modified precision measure, in which we give each word credit only up to the maximum number of times it appears in the reference sentences. In Reference 1, the word "the" appears twice; in Reference 2, the word "the" appears just once. 2 is bigger than 1, so we're going to say that the word "the" gets credit up to twice. With this modified precision, we say that it gets a score of 2 out of 7, because out of 7 words, we give it credit for appearing twice. So here, the denominator is the total number of words in the MT output, 7, and the numerator is the count of the number of times the word "the" appears, clipped at 2.
We clip this count at 2; that is, we cap the credit at the maximum count seen in the references. This gives us the modified precision measure. Now, so far, we've been looking at words in isolation. In the BLEU score, you don't want to just look at isolated words; you also want to look at pairs of words.
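
A quick worked version of that unigram example, assuming simple whitespace tokenization and lowercased text:

```python
from collections import Counter

references = ["the cat is on the mat", "there is a cat on the mat"]
mt_output = "the the the the the the the"

counts = Counter(mt_output.split())                      # {'the': 7}
ref_counts = [Counter(r.split()) for r in references]
clipped = {w: min(c, max(rc[w] for rc in ref_counts))    # credit each word at most as many
           for w, c in counts.items()}                   # times as it appears in a reference
modified_precision = sum(clipped.values()) / sum(counts.values())
print(modified_precision)                                # 2/7 ≈ 0.286
```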


Let's define a portion of the BLEU score on bigrams. Bigrams just means pairs of words appearing next to each other. So now, let's see how we could use bigrams to define the BLEU score, and this will just be a portion of the final BLEU score. We'll take unigrams, or single words, as well as bigrams, meaning pairs of words, into account, as well as maybe even longer sequences of words, such as trigrams, which means three words appearing together. Let's continue our example from before. We have the same Reference 1 and Reference 2, but now let's say the machine translation, or MT, system has a slightly better output: "The cat the cat on the mat." Still not a great translation, but maybe better than the last one. So here, the possible bigrams are: there's "the cat" (ignoring case), then there's "cat the", that's another bigram, and then there's "the cat" again, but we've already had that, so let's skip it; and then "cat on" is the next one, then "on the", and "the mat". So these are the bigrams in the machine translation output. Let's count up how many times each of these bigrams appears: "the cat" appears twice, "cat the" appears once, and the others all appear just once. Then, finally, let's define the clipped count, count_clip. To define that, we take this column of numbers but give our algorithm credit only up to the maximum number of times that bigram appears in either Reference 1 or Reference 2. "the cat" appears a maximum of once in either of the references, so I'm going to clip that count to 1. "cat the" doesn't appear in Reference 1 or Reference 2, so I clip that to 0. "cat on", yes, that appears once, so we give it credit once; "on the" appears once, so that gets credit once; and "the mat" appears once. So these are the clipped counts: we take all the counts and clip them, reducing them to be no more than the number of times that bigram appears in at least one of the references. Finally, the modified bigram precision is the sum of the clipped counts, which is 1 + 0 + 1 + 1 + 1 = 4, divided by the total number of bigrams, which is 6. So 4 out of 6, or two-thirds, is the modified precision on bigrams.


So let's just formalize this a little bit further. With what we developed on unigrams, we define the modified precision computed on unigrams as $P_1$; the P stands for precision and the subscript 1 means we're referring to unigrams. It is defined as the sum over the unigrams (meaning the words) appearing in the machine translation output $\hat{y}$ of $\text{Count}_{clip}$ of that unigram, divided by the sum over the unigrams in the machine translation output of the count of that unigram. And this is what we had gotten, 2 out of 7, two slides back. The 1 here refers to unigrams, meaning we're looking at single words in isolation. You can also define $P_n$ as the n-gram version: the sum over the n-grams in the machine translation output of $\text{Count}_{clip}$ of that n-gram, divided by the sum over the n-grams of the count of that n-gram. So these modified precision scores, measured on unigrams, or on bigrams, which we did on the previous slide, or on trigrams, which are triples of words, or even higher values of n for other n-grams, allow you to measure the degree to which the machine translation output is similar to, or overlaps with, the references. One thing you can probably convince yourself of is that if the MT output is exactly the same as either Reference 1 or Reference 2, then all of these values $P_1$, $P_2$, and so on will be equal to 1.0. So to get a modified precision of 1.0, you just have to be exactly equal to one of the references. And sometimes it's possible to achieve this even if you aren't exactly the same as any of the references, but you combine them in a way that hopefully still results in a good translation.


Finally, let's put this together to form the final BLEU score. $P_n$ is the BLEU score computed on n-grams only, i.e., the modified precision computed on n-grams only. By convention, to compute one number, you compute $P_1$, $P_2$, $P_3$, and $P_4$ and combine them using the following formula: you take the average of their logs, $\frac{1}{4}\sum_{n=1}^{4}\log P_n$, and the BLEU score is defined as e raised to that quantity (exponentiation is a strictly monotonically increasing operation), adjusted by one more factor called BP, the brevity penalty. The details maybe aren't super important, but to give you a sense: it turns out that if you output very short translations, it's easier to get high precision, because most of the words you output will probably appear in the references. But we don't want translations that are very short, so BP, the brevity penalty, is an adjustment factor that penalizes translation systems that output translations that are too short. The formula for the brevity penalty is the following.

$$P_n = \frac{\sum_{\text{n-gram} \in \hat{y}} \text{Count}_{clip}(\text{n-gram})}{\sum_{\text{n-gram} \in \hat{y}} \text{Count}(\text{n-gram})}$$

$$\text{BLEU} = BP \cdot \exp\left(\frac{1}{4}\sum_{n=1}^{4}\log P_{n}\right)$$

$$BP = \begin{cases} 1 & \text{if MT output length} > \text{reference output length} \\ \exp\left(1 - \text{reference output length} / \text{MT output length}\right) & \text{otherwise} \end{cases}$$

BP is equal to 1 if your machine translation system outputs things that are longer than the human-generated reference outputs; otherwise, it's a formula like the one above that penalizes translations that are too short. You can find the details in the paper.
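
Here is a compact sketch that puts the pieces together, assuming whitespace tokenization and using the geometric-mean combination of $P_1$ through $P_4$ with the brevity penalty shown above; real implementations, such as the open-source ones mentioned below, add smoothing and other details:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate, references, n):
    cand_counts = Counter(ngrams(candidate, n))
    if not cand_counts:
        return 0.0
    clipped = {g: min(c, max(Counter(ngrams(r, n))[g] for r in references))
               for g, c in cand_counts.items()}
    return sum(clipped.values()) / sum(cand_counts.values())

def bleu(candidate, references, max_n=4):
    candidate = candidate.lower().split()
    references = [r.lower().split() for r in references]
    precisions = [modified_precision(candidate, references, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0:                 # any zero precision makes the geometric mean zero
        return 0.0
    closest_ref = min(references, key=lambda r: abs(len(r) - len(candidate)))
    bp = 1.0 if len(candidate) > len(closest_ref) else \
        math.exp(1 - len(closest_ref) / len(candidate))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

print(bleu("the cat the cat on the mat",
           ["the cat is on the mat", "there is a cat on the mat"]))
```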

So, once again, earlier in this set of courses, you saw the importance of having a single real number evaluation metric. Because it allows you to try out two ideas, see which one achieves a higher score, and then try to stick with the one that achieved the higher score. So the reason the BLEU score was revolutionary for machine translation was because this gave a pretty good, by no means perfect, but pretty good single real number evaluation metric. And so that accelerated the progress of the entire field of machine translation. I hope this video gave you a sense of how the BLEU score works.

In practice, few people implement a BLEU score from scratch; there are open-source implementations that you can download and use to evaluate your own system. Today, the BLEU score is used to evaluate many systems that generate text, such as machine translation systems, as well as, as I showed briefly earlier, image captioning systems, where you would have a neural network generate an image caption and then use the BLEU score to see how much it overlaps with a reference caption, or multiple reference captions, generated by people. So the BLEU score is a useful single real number evaluation metric to use whenever you want your algorithm to generate a piece of text and you want to see whether it has a similar meaning to a reference piece of text generated by humans. It is not used for speech recognition, because in speech recognition there's usually one ground truth, and you just use other measures to see if you got the speech transcription pretty much exactly word-for-word correct. But for things like image captioning, where multiple captions for a picture could be about equally good, or machine translation, where there are multiple translations that are equally good, the BLEU score gives you a way to evaluate them automatically and therefore speed up your development. So with that, I hope you have a sense of how the BLEU score works.

07_attention-model-intuition

For most of this week, you've been using an encoder-decoder architecture for machine translation, where one RNN reads in a sentence and then a different one outputs a sentence. There's a modification to this called the Attention Model that makes all of this work much better. The attention algorithm, the attention idea, has been one of the most influential ideas in deep learning. Let's take a look at how it works.


Consider a very long French sentence like this one. What we are asking this green encoder network to do is to read in the whole sentence, memorize the whole sentence, and store it in the activations shown here. Then the purple network, the decoder network, will generate the English translation: "Jane went to Africa last September and enjoyed the culture and met many wonderful people; she came back raving about how wonderful her trip was, and is tempting me to go too." Now, the way a human translator would translate this sentence is not to first read the whole French sentence, then memorize the whole thing, and then regurgitate an English sentence from scratch. Instead, what the human translator would do is read the first part of it, maybe generate part of the translation, look at the second part, generate a few more words, look at a few more words, generate a few more words, and so on. You work part by part through the sentence, because it's just really difficult to memorize a whole long sentence like that. What you see with the encoder-decoder architecture above is that it works quite well for short sentences, so we might achieve a relatively high BLEU score, but for very long sentences, maybe longer than 30 or 40 words, the performance comes down. The BLEU score might look like this as the sentence length varies: short sentences are just hard to translate, hard to get all the words right, and on long sentences it doesn't do well, because it's just difficult to get a neural network to memorize a super long sentence.

In this and the next video, you'll see the Attention Model, which translates maybe a bit more like a human might, looking at part of the sentence at a time. With an Attention Model, a machine translation system's performance can look like this, because by working on one part of the sentence at a time, you don't see this huge dip, which is really measuring the ability of a neural network to memorize a long sentence, which maybe isn't what we most badly need a neural network to do. In this video, I just want to give you some intuition about how attention works, and then we'll flesh out the details in the next video.


The Attention Model was due to Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio, and even though it was originally developed for machine translation, it has spread to many other application areas as well. This is a very influential, I think very seminal, paper in the deep learning literature. Let's illustrate this with a short sentence; even though these ideas were maybe developed more for long sentences, it'll be easier to illustrate them with a simpler example. We have our usual sentence, "Jane visite l'Afrique en septembre." Let's say that we use an RNN, and in this case a bidirectional RNN, in order to compute some set of features for each of the input words. A bidirectional RNN would ordinarily have outputs $\hat{y}^{<1>}$ up to $\hat{y}^{<5>}$, but we're not doing a word-for-word translation, so let me get rid of the Y's on top. Using a bidirectional RNN, what we've done is, for each of the words, really for each of the five positions in the sentence, compute a very rich set of features about the word in that position and maybe the surrounding words. Now, let's go ahead and generate the English translation. We're going to use another RNN to generate the English translation. Here's my RNN node as usual, and instead of using A to denote the activation, in order to avoid confusion with the activations down here, I'm going to use a different notation: I'm going to use S to denote the hidden state in this RNN up here, so instead of writing $a^{<1>}$ I'm going to write $s^{<1>}$. We hope that the first word this model generates will be "Jane," on the way to generating "Jane visits Africa in September." Now, the question is: when you're trying to generate this first word, this output, what part of the input French sentence should you be looking at? It seems like you should be looking primarily at the first word, and maybe a few other words close by, but you don't need to be looking way at the end of the sentence. So what the Attention Model computes is a set of attention weights, and we're going to use $\alpha^{<1, 1>}$ to denote, when you're generating the first word, how much you should be paying attention to the first piece of input information. Then we'll also come up with a second attention weight, $\alpha^{<1, 2>}$, which tells us, while we're trying to compute the first word, "Jane," how much attention we're paying to the second word of the input, and so on with $\alpha^{<1, 3>}$ and the rest. Together these tell us exactly what the context, which we'll denote C, is that we should be paying attention to, and the context is input to this RNN unit to then try to generate the first word. That's one step of the RNN; we will flesh out all these details in the next video. For the second step of this RNN, we're going to have a new hidden state $s^{<2>}$ and a new set of attention weights: $\alpha^{<2, 1>}$ tells us, when we generate the second word, which I guess will be "visits," that being the ground truth label, how much we should pay attention to the first word in the French input; $\alpha^{<2, 2>}$ tells us how much attention we should pay to the word "visite"; and so on for the other input words.
And of course, the first word we generated, Jane, is also an input to this step, and then we have some context that we're paying attention to at this second step, which is also an input, and together these generate the second word. That leads us to the third step, $s^{<3>}$, where this is an input and we have some new context C that depends on the various $\alpha^{<3, t'>}$ for the different time steps, telling us how much we should be paying attention to the different words from the input French sentence, and so on. So, some things I haven't specified yet, but that I'll go into further detail on in the next video, are how exactly this context is defined; the goal of the context for the third word is really to capture that maybe we should be looking around this part of the sentence. The formula you use to do that I'll defer to the next video, as well as how you compute these attention weights. You'll see in the next video that $\alpha^{<3, t'>}$, which is, when you're trying to generate the third word, I guess this would be Africa, just to get the right output, the amount that this RNN step should be paying attention to the French word at time $t'$, depends on the activations of the bidirectional RNN at time $t'$, that is, on the forward activations and the backward activations at time $t'$, and it will depend on the state from the previous step, in this case $s^{<2>}$, and these things together will influence how much you pay attention to a specific word in the input French sentence. But we'll flesh out all these details in the next video. The key intuition to take away is that in this way the RNN marches forward, generating one word at a time, until eventually it generates maybe the EOS token, and at every step there are these attention weights $\alpha^{<t, t'>}$ that tell it, when you're trying to generate the t-th English word, how much you should be paying attention to the t'-th French word. And this allows it, on every time step, to look only within maybe a local window of the French sentence when generating a specific English word.

I hope this video conveyed some intuition about the Attention Model and that you now have a rough sense of how the algorithm works. Let's go on to the next video to flesh out the details of the Attention Model.

08_attention-model

In the last video, you saw how the attention model allows a neural network to pay attention to only part of an input sentence while it’s generating a translation, much like a human translator might. Let’s now formalize that intuition into the exact details of how you would implement an attention model.


So same as in the previous video, let's assume you have an input sentence and you use a bidirectional RNN, or bidirectional GRU, or bidirectional LSTM, to compute features on every word. In practice, GRUs and LSTMs are often used for this, with maybe LSTMs being more common. For the forward recurrence, you have the forward activation at the first time step, the backward activation at the first time step, the forward activation at the second time step, the backward activation, and so on, up to the forward activation at the fifth time step and the backward activation at the fifth time step. We had an activation of all zeros at time zero here; technically we can also have, I guess, a backward activation at the sixth time step that is a vector of all zeros. Then, to simplify the notation going forward, at every time step, even though you have features computed from both the forward recurrence and the backward recurrence of the bidirectional RNN, I'm just going to use $a^{<t'>}$ to represent both of these concatenated together, $a^{<t'>}=({\overrightarrow a}^{<t'>},{\overleftarrow a}^{<t'>})$. So $a^{<t'>}$ is going to be a feature vector for time step $t'$; to be consistent with our earlier notation, I'm going to use $t'$ to index into the words in the French sentence. Next, we have our forward-only, single-direction RNN with state $s$ to generate the translation. At the first time step, it should generate $y^{<1>}$, and it will have as input some context C. If you want to index it with time, I guess you could write $c^{<1>}$, but sometimes I just write C without the superscript one. This will depend on the attention weights, $\alpha^{<1,1>}$, $\alpha^{<1,2>}$, and so on, which tell us how much attention to pay. So these alpha parameters tell us how much the context depends on the features, or the activations, we're getting from the different time steps. And the way we define the context is actually as a weighted sum of the features from the different time steps, weighted by these attention weights. More formally, the attention weights satisfy the following: they are all non-negative, so each is zero or positive, and they sum to one. We'll see later how to make sure this is true. And we will have the context at time one (I'll often drop that superscript) be the sum over all values of $t'$ of this weighted combination of the activations: $c^{<1>} = \sum_{t'}\alpha^{<1, t'>}a^{<t'>}$. So this term, $\alpha^{<1, t'>}$, is the attention weight, and this term, $a^{<t'>}$, comes from here: $a^{<t'>}=({\overrightarrow a}^{<t'>},{\overleftarrow a}^{<t'>})$. So $\alpha^{<t, t'>}$ is the amount of attention that $y^{<t>}$ should pay to $a^{<t'>}$. In other words, when you're generating the t-th output word, how much should you be paying attention to the t'-th input word. So that's one step of generating the output, and then at the next time step you generate the second output, again as a weighted sum, where now you have a new set of attention weights that define a new weighted sum, which gives a new context. The output $y^{<1>}$ is also an input, and that allows you to generate the second word. Only now this weighted sum becomes the context of the second time step, $c^{<2>} = \sum_{t'}\alpha^{<2, t'>}a^{<t'>}$. So using these context vectors, $c^{<1>}$, $c^{<2>}$, and so on, this network up here, which I've circled in purple, looks like a pretty standard RNN sequence model with the context vectors as input, and we can just generate the translation one word at a time. We have also defined how to compute the context vectors in terms of these attention weights and the features of the input sentence. So the only remaining thing to do is to define how to actually compute the attention weights. Let's do that on the next slide. Just to recap, $\alpha^{<t, t'>}$ is the amount of attention you should pay to $a^{<t'>}$ when you're trying to generate the $t^{th}$ word in the output translation.
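A minimal numpy sketch of this context computation, under assumed, purely illustrative shapes (five input words, forward and backward activations of size 4 each, and a hand-picked set of attention weights standing in for what the model would actually learn):

```python
import numpy as np

Tx, n_a = 5, 4
a_forward = np.random.randn(Tx, n_a)    # forward activations,  one row per t'
a_backward = np.random.randn(Tx, n_a)   # backward activations, one row per t'
a = np.concatenate([a_forward, a_backward], axis=1)  # a^<t'> = (forward, backward)

# Attention weights for generating the first output word: non-negative, sum to one.
alpha_1 = np.array([0.6, 0.2, 0.1, 0.05, 0.05])

# Context for time step 1: c^<1> = sum over t' of alpha^<1,t'> * a^<t'>
c_1 = (alpha_1[:, None] * a).sum(axis=0)
print(c_1.shape)  # (2 * n_a,) = (8,)
```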

So let me just write down the formula, and then we'll talk about how it works. This is the formula you could use to compute $\alpha^{<t, t'>}$: you first compute the terms $e^{<t, t'>}$ and then use essentially a softmax to make sure that the weights sum to one when you sum over $t'$. So for every fixed value of t, these quantities, $\alpha^{<t, t'>} =\frac{\exp(e^{<t, t'>})}{\sum\limits_{t'=1}^{T_x}\exp(e^{<t, t'>})}$, sum to one if you're summing over $t'$. Using this softmax normalization just ensures they properly sum to one. Now how do we compute these factors e? Well, one way to do so is to use a small neural network, as follows.


So $s^{<t-1>}$ was the neural network state from the previous time step. So here is the network we have.

If you're trying to generate $y^{<t>}$, then $s^{<t-1>}$ was the hidden state from the previous step that feeds into $s^{<t>}$, and that's one input to this very small neural network, usually with just one hidden layer, because you need to compute these a lot. And then $a^{<t'>}$, the features from time step $t'$, is the other input. The intuition is that if you want to decide how much attention to pay to the activation at $t'$, the thing it seems like it should depend on the most is your own hidden state activation from the previous time step. You don't have the current state activation yet, because the context feeds into it, so you haven't computed that. But look at whatever your hidden state is of this RNN generating the output translation, and then, for each of the positions, each of the words, look at their features. So it seems pretty natural that $\alpha^{<t, t'>}$ and $e^{<t, t'>}$ should depend on these two quantities. But we don't know what the function is. So one thing you could do is just train a very small neural network to learn whatever this function should be, and trust backpropagation, trust gradient descent, to learn the right function. It turns out that if you implement this whole model and train it with gradient descent, the whole thing actually works. This little neural network does a pretty decent job telling you how much attention $y^{<t>}$ should pay to $a^{<t'>}$,

and this formula, $\alpha^{<t, t'>} =\frac{\exp(e^{<t, t'>})}{\sum\limits_{t'=1}^{T_x}\exp(e^{<t, t'>})}$, makes sure that the attention weights sum to one. Then, as you chug along generating one word at a time, this neural network actually pays attention to the right parts of the input sentence, and it learns all of this automatically using gradient descent.
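Here is a minimal sketch of that small scoring network plus the softmax normalization. The lecture only says it is a small network with one hidden layer; the specific additive form $e^{<t,t'>} = v^\top \tanh(W_1 s^{<t-1>} + W_2 a^{<t'>})$ used below is an assumption (it matches Bahdanau-style attention), and all shapes and random parameters are illustrative rather than learned:

```python
import numpy as np

def softmax(v):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(v - v.max())
    return e / e.sum()

n_s, n_a, Tx, n_hidden = 6, 4, 5, 10
rng = np.random.default_rng(0)
s_prev = rng.standard_normal(n_s)            # s^<t-1>, decoder state from previous step
a = rng.standard_normal((Tx, 2 * n_a))       # a^<t'>, encoder features, one row per t'
W1 = rng.standard_normal((n_hidden, n_s))    # parameters of the small scoring network;
W2 = rng.standard_normal((n_hidden, 2 * n_a))  # in practice these are learned by backprop
v = rng.standard_normal(n_hidden)

# One score e^<t,t'> per input word
e = np.array([v @ np.tanh(W1 @ s_prev + W2 @ a[t_prime]) for t_prime in range(Tx)])

# alpha^<t,t'> = exp(e^<t,t'>) / sum_{t'} exp(e^<t,t'>)
alpha = softmax(e)
print(alpha, alpha.sum())  # non-negative weights summing to 1
```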

Now, one downside to this algorithm is that it does take quadratic time, or quadratic cost, to run. If you have $T_x$ words in the input and $T_y$ words in the output, then the total number of these attention weights is going to be $T_x \times T_y$; for example, a 60-word input and a 50-word output would mean computing 3,000 attention weights. So this algorithm runs in quadratic cost. Although in machine translation applications, where neither the input nor the output sentence is usually that long, maybe quadratic cost is actually acceptable, there is some research work on trying to reduce this cost as well. So far I've been describing the attention idea in the context of machine translation. Without going too much into detail, this idea has been applied to other problems as well, such as image captioning. In the image captioning problem, the task is to look at a picture and write a caption for it. In the paper cited at the bottom, by Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Richard Zemel, and Yoshua Bengio, they showed that you could have a very similar architecture: look at the picture and pay attention only to parts of the picture at a time while you're writing a caption for it. So if you're interested, I encourage you to take a look at that paper as well. And you get to play with all this and more in the programming exercise.

Whereas machine translation is a very complicated problem, in the programming exercise you get to implement and play with the attention model yourself for the date normalization problem. So the problem is inputting a date like this (this is actually the date of the Apollo Moon landing) and normalizing it into a standard format, or taking a date like this and having a neural network, a sequence-to-sequence model, normalize it into this format. This, by the way, is the birthday of William Shakespeare, or at least it's believed to be. What you'll see in the programming exercise is that you can train a neural network to input dates in any of these formats and have it use an attention model to generate a normalized format for these dates. One other thing that's sometimes fun to do is to look at visualizations of the attention weights. So here's a machine translation example, where we've plotted, in different colors, the magnitude of the different attention weights. I don't want to spend too much time on this, but you'll find that for corresponding input and output words, the attention weights tend to be high, suggesting that when it's generating a specific output word, the model is usually paying attention to the correct words in the input, and all of this, including learning where to pay attention, was learned using backpropagation with an attention model.
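To make the date normalization task mentioned above concrete, here are a few hypothetical input/target pairs of the kind such a model might be trained on; the exact formats and dataset in the programming exercise may differ:

```python
# Hypothetical training pairs: human-readable input -> normalized YYYY-MM-DD target
date_pairs = [
    ("July 20, 1969",   "1969-07-20"),   # Apollo Moon landing
    ("23 April 1564",   "1564-04-23"),   # Shakespeare's (believed) birthday
    ("9 May 2018",      "2018-05-09"),
    ("09/05/2018",      "2018-05-09"),
]
```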

So that's it for the attention model, really one of the most powerful ideas in deep learning. I hope you enjoy implementing and playing with these ideas yourself later in this week's programming exercises.

02_speech-recognition-audio-data

01_speech-recognition

One of the most exciting developments with sequence-to-sequence models has been the rise of very accurate speech recognition. As we're nearing the end of the course, I want to take just a couple of videos to give you a sense of how these sequence-to-sequence models are applied to audio data, such as speech.


So, what is the speech recognition problem? You're given an audio clip, x, and your job is to automatically find a text transcript, y. An audio clip, if you plot it, looks like this: the horizontal axis here is time, and what a microphone does is measure minuscule changes in air pressure; the way you're hearing my voice right now is that your ear is detecting little changes in air pressure, probably generated either by your speakers or by a headset. So an audio clip like this plots the air pressure against time. And if this audio clip is of me saying, "the quick brown fox", then hopefully a speech recognition algorithm can input that audio clip and output that transcript. Because even the human ear doesn't process raw waveforms, but instead has physical structures that measure the intensity of different frequencies, a common pre-processing step for audio data is to take your raw audio clip and generate a spectrogram. This is a plot where the horizontal axis is time, the vertical axis is frequency, and the intensity of the different colors shows the amount of energy: how loud the sound is at different frequencies at different times. These types of spectrograms, or what you might also hear people call filter bank outputs, are a commonly applied pre-processing step before the audio is passed into a learning algorithm, and the human ear does a computation pretty similar to this pre-processing step. One of the most exciting trends in speech recognition is that, once upon a time, speech recognition systems used to be built using phonemes, which are, I want to say, hand-engineered basic units of sound. So "the quick brown fox" would be represented as phonemes. I'm going to simplify a bit: let's say "the" has a "de" and "e" sound, and "quick" has a "ku", "wu", "ik", "k" sound, and linguists used to write out these basic units of sound and try to break language down into them. These aren't the official phonemes, which are written with more complicated notation, but linguists used to hypothesize that writing down audio in terms of these basic units of sound called phonemes would be the best way to do speech recognition. With end-to-end deep learning, though, we're finding that phoneme representations are no longer necessary; instead, you can build systems that input an audio clip and directly output a transcript, without needing hand-engineered representations like these. One of the things that made this possible was moving to much larger data sets. Academic data sets for speech recognition might be as small as 300 hours, and in academia 3,000-hour data sets of transcribed audio would be considered a reasonable size, so a lot of research, a lot of research papers, have been written on data sets of several thousand hours. But the best commercial systems are now trained on over 10,000 hours, and sometimes over 100,000 hours, of audio. It's really this move to much larger transcribed audio data sets, with both x and y, together with deep learning algorithms, that has driven a lot of progress in speech recognition. So, how do you build a speech recognition system?
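As a rough sketch of the spectrogram pre-processing step described above: this assumes a mono WAV file at a hypothetical path "clip.wav", and the window and overlap values are illustrative, not a recommendation from the lecture.

```python
import numpy as np
from scipy.io import wavfile
from scipy.signal import spectrogram

rate, audio = wavfile.read("clip.wav")          # raw waveform and its sample rate
freqs, times, Sxx = spectrogram(audio.astype(np.float64), fs=rate,
                                nperseg=400, noverlap=240)

# Sxx[f, t] is the energy at frequency bin f and time frame t; log energies
# are commonly used as the features fed into the speech model.
log_Sxx = np.log(Sxx + 1e-10)
print(log_Sxx.shape)  # (num_frequency_bins, num_time_frames)
```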


In the last video, we were talking about the attention model. So, one thing you could do is actually use that, where on the horizontal axis you take in different time frames of the audio input, and then you have an attention model try to output the transcript, like "the quick brown fox", or whatever was said.

One other method that seems to work well is to use the CTC cost for speech recognition. CTC stands for Connectionist Temporal Classification and is due to Alex Graves, Santiago Fernández, Faustino Gomez, and Jürgen Schmidhuber. So, here's the idea. Let's say the audio clip was someone saying, "the quick brown fox". We're going to use a neural network structured like this, with an equal number of input x's and output y's, and I have drawn a simple uni-directional RNN here, but in practice this will usually be a bidirectional LSTM or bidirectional GRU, and usually a deeper model. Notice that the number of time steps here is very large; in speech recognition, the number of input time steps is usually much bigger than the number of output time steps. For example, if you have 10 seconds of audio and your features come at 100 hertz, so 100 samples per second, then a 10-second audio clip would end up with a thousand inputs (100 hertz times 10 seconds). But your output might not have a thousand alphabets, might not have a thousand characters. So, what do you do? The CTC cost function allows the RNN to generate an output like this: ttt, then a special character called the blank character, which we're going to write as an underscore here, then h_eee___, then maybe a space, written like this, and then ___qqq__. And this is considered a correct output for the first part of the phrase "the quick", with the q. The basic rule for the CTC cost function is to collapse repeated characters that are not separated by a blank. To be clear, I'm using this underscore to denote a special blank character, which is different from the space character; there is a space between "the" and "quick", so the network should still output a space. By collapsing repeated characters not separated by a blank, the sequence collapses into t, h, e, then space, then q, and this allows your network to have a thousand outputs, by repeating characters over time and inserting a bunch of blank characters, while still ending up with a much shorter output text transcript. So this phrase here, "the quick brown fox", including spaces, actually has 19 characters, and even though the neural network is forced to output a thousand values, by allowing it to insert blanks and repeated characters it can still represent this 19-character transcript with these 1,000 output values of y. So, this paper by Alex Graves, as well as the Deep Speech recognition system, which I was involved in, used this idea to build effective speech recognition systems.
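The collapsing rule itself is simple enough to sketch directly. Here is a minimal Python version, assuming "_" stands for the blank symbol and " " is a real space character; the raw string is just an illustrative decoder output like the one described above:

```python
def ctc_collapse(chars, blank="_"):
    """Collapse a raw CTC output: merge repeated characters not separated
    by the blank symbol, then drop the blank symbols themselves."""
    collapsed = []
    prev = None
    for c in chars:
        if c != prev:          # merge runs of the same character
            if c != blank:     # drop blank symbols
                collapsed.append(c)
        prev = c
    return "".join(collapsed)

raw = "ttt_h_eee___ ___qqq__"
print(ctc_collapse(raw))  # -> "the q"
```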

So, I hope that gives you a rough sense of how speech recognition models work: attention-like models and CTC models present two different options for how to go about building these systems. Now, today, building an effective production-scale speech recognition system is a pretty significant effort and requires a very large data set. But what I'd like to do in the next video is show you how you can build a trigger word detection system, or keyword detection system, which is actually much easier and can be done with an even smaller, more reasonable amount of data. So, let's talk about that in the next video.

02_trigger-word-detection

You've now learned so much about deep learning and sequence models that we can actually describe a trigger word detection system quite simply, on just one slide, as you'll see in this video. With the rise of speech recognition, there have been more and more devices you can wake up with your voice, and those are sometimes called trigger word detection systems. So let's see how you can build a trigger word detection system.


Examples of trigger word systems include Amazon Echo, which is woken up with the word Alexa; the Baidu DuerOS-powered devices, woken up with the phrase xiaodunihao; Apple Siri, woken up with Hey Siri; and Google Home, woken up with Okay Google. So with trigger word detection, if you have, say, an Amazon Echo in your living room, you can walk into the living room and just say, "Alexa, what time is it?", and have it wake up; it'll be triggered by the word Alexa and answer your voice query. If you can build a trigger word detection system, maybe you can make your computer do something by telling it "computer, activate". One of my friends has also worked on turning a particular lamp on and off using a trigger word, kind of as a fun project. But what I want to show you is how you can build a trigger word detection system.


Now, the trigger word detection literature is still evolving, so there actually isn't a single universally agreed-upon algorithm for trigger word detection yet; there isn't wide consensus on what the best algorithm is, so I'm just going to show you one example of an algorithm you can use. You've seen RNNs like this, and what we really do is take an audio clip, maybe compute spectrogram features, and that generates audio features $x^{<1>}, x^{<2>}, x^{<3>}$ that you pass through an RNN. All that remains to be done is to define the target labels y. So if this point in the audio clip is when someone just finished saying the trigger word, such as "Alexa", "xiaodunihao", "Hey Siri", or "Okay Google", then in the training set you can set the target label to be zero for everything before that point and set the target label to one right after it. If a little bit later on the trigger word is said again, say at this point, then you can again set the target label to be one right after that. This type of labeling scheme for an RNN could work; in fact, it can work reasonably well. One slight disadvantage, though, is that it creates a very imbalanced training set, with a lot more zeros than ones.


So one other thing you could do, and this is a little bit of a hack, but it can make the model a little bit easier to train, is that instead of setting only a single time step to output one, you can make it output one for several time steps, or for a fixed period of time, before reverting back to zero. That slightly evens out the ratio of ones to zeros, but it is a bit of a hack. So if this is when, in the audio clip, the trigger word was said, then right after that you set the target label to one for a little while, and if the trigger word is said again at this later point, then again right after that is when you want the RNN to output one, as in the sketch below. You get to play with this more in the programming exercise, but I think you should feel quite proud of yourselves: we've learned enough about deep learning that it just takes one slide to describe something as complicated as trigger word detection, and based on this I hope you'll be able to implement something that works and allows you to detect trigger words. You'll see more of this in the programming exercise.
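A minimal sketch of building these target labels, assuming Ty output time steps and a list of the time steps at which a trigger word just finished being said; the values Ty=1375 and label_width=50 are illustrative only. Setting label_width=1 gives the single-1 scheme, while a larger value is the "output 1 for several time steps" hack described above.

```python
import numpy as np

def make_labels(Ty, trigger_end_steps, label_width=50):
    """Return a length-Ty label vector: 1 for label_width steps after each
    trigger word ends, 0 everywhere else."""
    y = np.zeros(Ty)
    for t_end in trigger_end_steps:
        y[t_end + 1: t_end + 1 + label_width] = 1.0
    return y

y = make_labels(Ty=1375, trigger_end_steps=[400, 900])
print(int(y.sum()))  # 100 ones out of 1375 labels
```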

So that's it for trigger words, and I hope you feel quite proud of yourselves for how much you've learned about deep learning, that you can now describe trigger word detection in just one slide in a few minutes, and that you'll hopefully implement it and get it to work, maybe even make it do something fun in your house, like turning something on or off, or making your computer do something when you or someone else says the trigger word. This is the last technical video of this course, and to wrap up: in this course on sequence models, you learned about RNNs, including both GRUs and LSTMs; in the second week, you learned a lot about word embeddings and how they learn representations of words; and in this week, you learned about the attention model as well as how to use it to process audio data. I hope you have fun implementing all of these ideas in this week's programming exercises. Let's go on to the last video.

conclusion-and-thank-you


Congratulations on making it this far. I just want to wrap up and leave you with a few final thoughts. We've been on quite a journey together, but if you've taken the whole specialization, then you've learned about neural networks and deep learning, how to improve deep neural networks, structuring machine learning projects, convolutional neural networks, and, in this most recent course, sequence models. I know you've worked really hard, and I also hope you feel very proud of yourself for your hard work and for how much you've done.

So I want to leave you with one maybe important thought, which is that I think deep learning is a superpower. With deep learning algorithms, you can make a computer see; you can have a computer synthesize novel art or synthesize music; you can have a computer translate from one language to another; maybe have it read a radiology image and render a medical diagnosis; or build pieces of a car that can drive itself. And if that isn't a superpower, I don't know what is. As we wrap up this sequence of courses, as we wrap up this specialization, I hope that you will find ways to use these ideas to further your career, to pursue your dreams, but perhaps most important, to do whatever you think is the best work you can do for humanity. The world today has its challenges, but with the power of AI and the power of deep learning, I think we can make it a much better place. Now that you have this superpower, I hope you will use it to go out there and make life better for yourself, but also for other people. And of course, I also hope you feel very proud of your accomplishments, of how far you've come, and of all that you've learned. When you complete this sequence of courses, you should also share it on social media, like Twitter or Facebook, and let your friends know.


And finally, the very last thing I want to say to you is congratulations on finishing this course. I hope you feel great about your accomplishments, but I also want to thank you very much. I know that you have a busy life, but despite that, you spent a lot of time watching these videos and maybe a long time also working on the quizzes and the programming exercises. I hope you enjoyed it and got a lot out of the process. I'm also very grateful for all the time you spent and all the hard work you put into learning these materials, so thank you very much.
