01_photo-ocr
01_problem-description-and-pipeline
In this and the next few videos, I want to tell you about a machine learning application example, or a machine learning application case history, centered around an application called Photo OCR. There are three reasons why I want to do this. First, I want to show you an example of how a complex machine learning system can be put together. Second, I want to tell you about the concept of a machine learning pipeline, and how to allocate resources when you're trying to decide what to do next. This can either be in the context of you working by yourself on a big application, or in the context of a team of developers trying to build a complex application together. And finally, the Photo OCR problem also gives me an excuse to tell you about just a couple more interesting ideas for machine learning. One is some ideas for how to apply machine learning to computer vision problems, and the second is the idea of artificial data synthesis, which we'll see in a couple of videos. So, let's start by talking about what the Photo OCR problem is.
Photo OCR stands for Photo Optical Character Recognition. With the growth of digital photography, and more recently the growth of cameras in our cell phones, we now have tons of pictures that we take all over the place. And one of the things that has interested many developers is how to get our computers to understand the content of these pictures a little bit better. The Photo OCR problem focuses on how to get computers to read the text that appears in the images we take. Given an image like this, it might be nice if a computer could read the text in the image, so that if you're trying to look for this picture again you could type in the words "LULA B's" and have it automatically pull up this picture, rather than spending lots of time digging through a photo collection of maybe hundreds or thousands of pictures.
The Photo OCR problem does exactly this, and it does so in several steps. First, given the picture, it has to look through the image and detect where there is text in the picture. After it has done that, or if it successfully does that, it then has to look at these text regions and actually read the text in those regions, and hopefully, if it reads it correctly, it will come up with transcriptions of the text that appears in the image. Whereas OCR, or optical character recognition, of scanned documents is a relatively easy problem, doing OCR from photographs today is still a very difficult machine learning problem. If you can do this, not only can it help our computers understand the content of our photos better, there are also applications like helping blind people: for example, you could provide a blind person with a camera that looks at what's in front of them and reads out the words that may be on the street sign in front of them. Or think of car navigation systems: imagine if your car could read street signs and help you navigate to your destination. In order to perform Photo OCR, here's what we can do. First, we can go through the image and find the regions where there is text in the image. Shown here is one example of text in an image that the Photo OCR system may find.
Second, given the rectangle around that text region, we can then do character segmentation, where we might take this text box that says “Antique Mall” and try to segment it out into the locations of the individual characters.
And finally, having segmented the text out into individual characters, we can then run a classifier, which looks at the images of the individual characters and tries to figure out that the first character is an A, the second character is an N, the third character is a T, and so on, so that by doing all of this, hopefully we can then figure out that this phrase is "LULA B's Antique Mall," and similarly for some of the other words that appear in the image. I should say that there are some photo OCR systems that do even more complex things, like a bit of spelling correction at the end. So if, for example, your character segmentation and character classification system tells you that it sees the word c 1 e a n i n g, then a spelling correction system might tell you that this is probably the word "cleaning," and your character classification algorithm had just mistaken the l for a 1. But for the purpose of what we want to do in this video, let's ignore this last step and just focus on the system that does these three steps of text detection, character segmentation, and character classification.
A system like this is what we call a machine learning pipeline. In particular, here's a picture showing the photo OCR pipeline. We have an image, which is then fed to the text detection system, which finds the text regions; we then segment out the individual characters in the text; and then finally we recognize the individual characters. In many complex machine learning systems, these sorts of pipelines are common, where you have multiple modules (in this example, the text detection, character segmentation, and character recognition modules), each of which may be a machine learning component, or sometimes may not be a machine learning component, but together form a set of modules that act one after another on some piece of data in order to produce the output you want, which in the photo OCR example is the transcription of the text that appeared in the image. If you're designing a machine learning system, one of the most important decisions will often be what exactly the pipeline is that you want to put together. In other words, given the photo OCR problem, how do you break the problem down into a sequence of different modules? How you design the pipeline, and the performance of each of the modules in your pipeline, will often have a big impact on the final performance of your algorithm. If you have a team of engineers working on a problem like this, it's also very common to have different individuals work on different modules. So I could easily imagine text detection being the work of anywhere from 1 to 5 engineers, character segmentation being another 1 to 5 engineers, and character recognition being another 1 to 5 engineers, so having a pipeline like this often offers a natural way to divide up the workload amongst different members of an engineering team as well. Although, of course, all of this work could also be done by just one person, if that's how you want to do it. In complex machine learning systems, the idea of a pipeline, of a machine learning pipeline, is pretty pervasive. And what you just saw is a specific example of how a photo OCR pipeline might work.
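To make the idea of a pipeline a bit more concrete, here is a minimal sketch in Python of how the three stages might be wired together in code. The three stage functions are hypothetical placeholders, not something from the lecture; in a real system each one would wrap its own trained model.

```python
# A minimal sketch of a photo OCR pipeline. The three stage functions are
# hypothetical placeholders; each one would wrap its own trained model
# (text detector, character segmenter, character classifier).

def detect_text_regions(image):
    """Return a list of bounding boxes (x, y, w, h) around text regions."""
    return []  # placeholder: a real system would run a text detector here

def segment_characters(text_patch):
    """Return a list of single-character patches cut from one text region."""
    return []  # placeholder: a real system would run a 1D sliding window here

def classify_character(char_patch):
    """Return the predicted character ('A'-'Z', '0'-'9') for one patch."""
    return "?"  # placeholder: a real system would run a multi-class classifier

def photo_ocr(image):
    """Run the full pipeline: text detection -> segmentation -> recognition."""
    transcriptions = []
    for (x, y, w, h) in detect_text_regions(image):
        patch = image[y:y + h, x:x + w]          # assumes a NumPy-style image
        chars = [classify_character(c) for c in segment_characters(patch)]
        transcriptions.append("".join(chars))
    return transcriptions
```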
In the next few videos I’ll tell you a little bit more about this pipeline, and we’ll continue to use this as an example to illustrate–I think–a few more key concepts of machine learning.
02_sliding-windows
In the previous video, we talked about the photo OCR pipeline and how it works, in which we take an image and pass it through a sequence of machine learning components in order to try to read the text that appears in the image. In this video, I'd like to tell you a little bit more about how the individual components of the pipeline work. In particular, most of this video will center around the discussion of what's called a sliding windows classifier.
The first stage of the pipeline was text detection, where we look at an image like this and try to find the regions of text that appear in the image. Text detection is an unusual problem in computer vision, because depending on the length of the text you're trying to find, the rectangles that you're trying to find can have different aspect ratios. So in order to talk about detecting things in images, let's start with a simpler example of pedestrian detection, and we'll later take the ideas that were developed for pedestrian detection and apply them to text detection. In pedestrian detection, you want to take an image that looks like this, and the whole idea is to find the individual pedestrians that appear in the image. So there's one pedestrian that we found, there's a second one, a third one, a fourth one, a fifth one, and so on. This problem is maybe slightly simpler than text detection, for the reason that the aspect ratio of most pedestrians is pretty similar, so we can just use a fixed aspect ratio for the rectangles that we're trying to find. By aspect ratio I mean the ratio between the height and the width of these rectangles; it's roughly the same for different pedestrians, but for text detection the height-to-width ratio is different for different lines of text. For pedestrian detection, the pedestrians can be different distances away from the camera, and so the height of these rectangles can be different depending on how far away they are, but the aspect ratio is the same.
In order to build a pedestrian detection system, here's how you can go about it. Let's say that we decide to standardize on an aspect ratio of 82 by 36. We could have chosen some rounded number like 80 by 40 or something, but 82 by 36 seems all right. What we would do is then go out and collect large training sets of positive and negative examples. Here are examples of 82 x 36 image patches that do contain pedestrians, and here are examples of images that do not. On this slide I show 12 positive examples with y = 1 and 12 negative examples with y = 0. In a more typical pedestrian detection application, we might have anywhere from 1,000 training examples up to maybe 10,000 training examples, or even more if you can get larger training sets. What you can then do is train a neural network or some other learning algorithm to take this input, an image patch of dimension 82 by 36, and to classify y, that is, to classify that image patch as either containing a pedestrian or not. So this gives you a way of applying supervised learning in order to take an image patch and determine whether or not a pedestrian appears in that image patch.
Now, let's say we get a new image, a test set image like this, and we want to try to find the pedestrians in this image. What we would do is start by taking a rectangular patch of the image, like the one shown up here, so that's maybe an 82 x 36 patch of the image, and run that image patch through our classifier to determine whether or not there is a pedestrian in that patch. Hopefully our classifier will return y = 0 for that patch, since there is no pedestrian.
Next, we then take that green rectangle and we slide it over a bit and then run that new image patch through our classifier to decide if there’s a pedestrian there.
And having done that, we then slide the window further to the right and run that patch through the classifier again. The amount by which you shift the rectangle over each time is a parameter that's sometimes called the step size, and sometimes also called the stride. If you step one pixel at a time, so you use a step size or stride of 1, that usually performs best, but it's more computationally expensive, so using a step size of maybe 4 pixels at a time, or 8 pixels at a time, or some larger number of pixels is more common, since you're then moving the rectangle a bit more each time.
So, using this process, you continue stepping the rectangle over to the right a bit at a time and running each of these patches through the classifier until, eventually, as you slide this window over the different locations in the image, first starting with the first row and then going to further rows in the image, you will have run all of these different image patches at some step size or some stride through your classifier. Now, that was a pretty small rectangle that would only detect pedestrians of one specific size.
What we do next is start to look at larger image patches. So now let's take larger image patches, like those shown here, and run those through the classifier as well.
And by the way, when I say take a larger image patch, what I really mean is that when you take an image patch like this, you're really taking that image patch and resizing it down to 82 x 36, say. So you take this larger patch and resize it down to a smaller image, and it's this smaller image that you pass through your classifier to try to decide if there is a pedestrian in that patch.
And finally, you can do this at even larger scales and run the sliding window process to the end.
And after this whole process, hopefully your algorithm will have detected whether pedestrians appear in the image. So that's how you train a classifier and then use a sliding windows classifier, or sliding windows detector, in order to find pedestrians in an image.
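As a rough illustration, here is a minimal sketch of this multi-scale sliding window procedure in Python. It assumes you already have a trained binary classifier, here called classify_patch, that returns the probability that an 82 x 36 grey-scale patch contains a pedestrian; the stride, scale list, and threshold are illustrative choices, not values from the lecture.

```python
import numpy as np

PATCH_H, PATCH_W = 82, 36   # the standardized pedestrian patch size

def _resize(patch, shape):
    """Crude nearest-neighbour resize, kept dependency-free for this sketch."""
    ys = np.linspace(0, patch.shape[0] - 1, shape[0]).astype(int)
    xs = np.linspace(0, patch.shape[1] - 1, shape[1]).astype(int)
    return patch[np.ix_(ys, xs)]

def sliding_window_detect(image, classify_patch, stride=8,
                          scales=(1.0, 1.5, 2.0), threshold=0.5):
    """Return bounding boxes (x, y, w, h) whose patches the classifier flags."""
    detections = []
    for scale in scales:
        h, w = int(PATCH_H * scale), int(PATCH_W * scale)
        for y in range(0, image.shape[0] - h, stride):
            for x in range(0, image.shape[1] - w, stride):
                # larger patches get resized back down to the training size
                patch = _resize(image[y:y + h, x:x + w], (PATCH_H, PATCH_W))
                if classify_patch(patch) >= threshold:
                    detections.append((x, y, w, h))
    return detections
```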
Let's now return to the text detection example and talk about that stage in our photo OCR pipeline, where our goal is to find the text regions in an image.
Similar to pedestrian detection, you can come up with a labeled training set of positive and negative examples, where the positive examples correspond to regions where text appears. So instead of trying to detect pedestrians, we're now trying to detect text. The positive examples are going to be patches of images where there is text, and the negative examples are going to be patches of images where there isn't text. Having trained this, we can now apply it to a new image, to a test set image.
So here's the image that we've been using as an example.
For this example, we're going to run the sliding windows classifier at just one fixed scale, just for purposes of illustration, meaning that I'm going to use just one rectangle size. So let's say I run my sliding windows classifier on lots of little image patches like this. If I do that, what I'll end up with is a result like this, where the white regions show where my text detection system has found text; the axes of these two figures are the same. So there is a region up here, and of course also a region up here. The fact that this part up here is mostly black means that the classifier does not think it has found any text there, whereas the fact that there's a lot of white stuff here reflects that the classifier thinks it has found a bunch of text over there in the image. What I have done in the image on the lower left is use white to show where the classifier thinks it has found text, and different shades of grey correspond to the probability that was output by the classifier. So the shades of grey correspond to where it thinks it might have found text but has lower confidence, while the bright white corresponds to where the classifier output a very high estimated probability of there being text in that location.
We aren't quite done yet, because what we actually want to do is draw rectangles around all the regions where there is text in the image. So we're going to take one more step, which is to take the output of the classifier and apply to it what is called an expansion operator. What that does is take the image here, and take each of the white blobs, each of the white regions, and expand that white region.
Mathematically, the way you implement that is, to create the image on the right, for every pixel we ask: is it within some distance of a white pixel in the left image? And so, if a specific pixel is within, say, five pixels or ten pixels of a white pixel in the leftmost image, then we'll also color that pixel white in the rightmost image. The effect of this is that we take each of the white blobs in the leftmost image and expand them a bit, grow them a little bit, by looking at the pixels near the white pixels and coloring those nearby pixels white as well.
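If you want to see what this expansion operator looks like in code, here is a minimal sketch, assuming SciPy is available. It treats the classifier output as a boolean mask and applies a morphological dilation with a square neighbourhood, which approximates the "within a few pixels of a white pixel" rule described above.

```python
import numpy as np
from scipy import ndimage

def expand_white_regions(mask, radius=5):
    """mask: 2-D boolean array, True where the classifier found text.

    Any pixel within `radius` pixels (in a square neighbourhood) of a True
    pixel becomes True in the returned mask.
    """
    structure = np.ones((2 * radius + 1, 2 * radius + 1), dtype=bool)
    return ndimage.binary_dilation(mask, structure=structure)
```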
Finally, we are just about done. We can now look at this rightmost image, look at the connected components, the contiguous white regions, and draw bounding boxes around them. In particular, if we look at all the white regions, like this one, this one, this one, and so on, we can use a simple heuristic to rule out rectangles whose aspect ratios look funny, because we know that boxes around text should be much wider than they are tall. So if we ignore the thin, tall blobs like this one and this one, discarding them because they are too tall and thin, and we then draw rectangles around the ones whose aspect ratio, that is whose height-to-width ratio, looks right for text regions, then we can draw bounding boxes around this text region, this text region, and that text region, corresponding to the LULA B's Antique Mall logo, the LULA B's sign, and the little Open sign over there.
This example, by the way, actually misses one piece of text. It's very hard to read, but there is actually one piece of text there, corresponding to this region; its aspect ratio looks wrong, so we discarded it. So the system does okay on this image, but in this particular example the classifier actually missed one piece of text. It's very hard to read because it's written against a transparent window. So that's text detection using sliding windows. Having found these rectangles with text in them, we can now just cut out these image regions and use later stages of the pipeline to try to read the text.
Now, you recall that the second stage of the pipeline was character segmentation. So given an image like the one shown on top, how do we segment out the individual characters in this image? What we can do is again use a supervised learning algorithm with some set of positive and some set of negative examples. What we're going to do is look at an image patch and try to decide whether there is a split between two characters right in the middle of that patch. So for the positive examples, this first example is an image patch whose middle does indeed represent a split between two characters, and the second example again looks like a positive example, because if I split the two characters by putting a line right down the middle, that's the right thing to do. So these are positive examples, where the middle of the image represents a gap or a split between two distinct characters, whereas the negative examples are patches where you would not want to split two characters right down the middle, so they don't represent the midpoint between two characters. So what we'll do is train a classifier, maybe using a neural network, maybe using a different learning algorithm, to classify between the positive and negative examples. Having trained such a classifier, we can then run it on the sort of text that our text detection system has pulled out.
So we start by looking at that rectangle and we ask, "Does the middle of that green rectangle look like the midpoint between two characters?" Hopefully the classifier will say no, and then we slide the window over.
This is a one-dimensional sliding window classifier, because we're going to slide the window only in one straight line from left to right; there are no different rows here, there's only one row. But now, with the classifier in this position, we ask: should we split those two characters, should we put a split right down the middle of this rectangle? Hopefully, the classifier will output y = 1, in which case we will decide to draw a line down there to split the two characters.
Then we slide the window over again; the classifier says no, don't split there. Slide it over again; the classifier says yes, do split there, and so on. We slowly slide the classifier over to the right, and hopefully it will classify this as another positive example, and so on.
And we will slide this window over to the right, running the classifier at every step, and hopefully it will tell us what the right locations are at which to split this image up into individual characters.
And so that's a 1D sliding window for character segmentation.
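Here is a minimal sketch of that 1D sliding window in Python. It assumes a trained classifier, here called is_split_point, that returns the probability that the centre column of a patch falls in the gap between two characters; the patch width, stride, and threshold are illustrative.

```python
def find_character_splits(text_image, is_split_point, patch_w=20, stride=2,
                          threshold=0.5):
    """Return the column indices at which a text image should be split.

    text_image: 2-D array (height x width) containing one detected text region.
    """
    height, width = text_image.shape
    splits = []
    for x in range(0, width - patch_w, stride):
        patch = text_image[:, x:x + patch_w]
        if is_split_point(patch) >= threshold:
            splits.append(x + patch_w // 2)   # split at the window's centre
    return splits

# The individual character patches are then the slices of text_image between
# consecutive split positions (plus the image's left and right edges).
```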
So, here's the overall photo OCR pipeline again. In this video we've talked about the text detection step, where we use sliding windows to detect text, and we've also used a one-dimensional sliding window to do character segmentation, to segment the text image into individual characters. The final step of the pipeline is the character classification step, and that step you might already be more familiar with from the earlier videos on supervised learning, where you can apply a standard supervised learning algorithm, maybe a neural network or maybe something else, to take as input an image like that and classify which of the 26 characters A to Z it is, or maybe 36 characters if you include the numerical digits as well. It's a multi-class classification problem, where you take as input an image containing a character and decide which character appears in that image.
So that was the photo OCR pipeline and how you can use ideas like sliding windows classifiers in order to put together these different components and develop a photo OCR system.
In the next few videos, we'll keep on using the problem of photo OCR to explore some of the interesting issues surrounding building an application like this.
03_getting-lots-of-data-and-artificial-data
I've seen over and over that one of the most reliable ways to get a high-performance machine learning system is to take a low-bias learning algorithm and to train it on a massive training set. But where do you get so much training data from? It turns out that in machine learning there's a fascinating idea called artificial data synthesis. This doesn't apply to every single problem, and applying it to a specific problem often takes some thought, innovation, and insight. But if this idea applies to your machine learning problem, it can sometimes be an easy way to get a huge training set to give to your learning algorithm.
The idea of artificial data synthesis comprises two main variations. The first is if we are essentially creating new data from scratch. And the second is if we already have a small labeled training set and we somehow amplify that training set, or use a small training set to turn it into a larger training set. In this video we'll go over both of those ideas.
Artificial data synthesis for photo OCR
To talk about the artificial data synthesis idea, let's use the character recognition portion of the photo OCR pipeline, where we want to take as input an image and recognize what character it is.
If we go out and collect a large labeled data set, here's what it would look like. For this particular example, I've chosen a square aspect ratio, so we're taking square image patches, and the goal is to take an image patch and recognize the character in the middle of that patch. For the sake of simplicity, I'm going to treat these images as grey-scale images, rather than color images; it turns out that using color doesn't seem to help that much for this particular problem. So given this image patch, we'd like to recognize that it's a T. Given this image patch, we'd like to recognize that it's an S. Given that image patch, we'd like to recognize it as an I, and so on.
So all of these are examples of real images. How can we come up with a much larger training set? Modern computers often have a huge font library, and if you use word processing software, depending on what word processor you use, you might have all of these fonts and many, many more already stored inside. And, in fact, if you go to different websites, there are huge, free font libraries on the internet where you can download many, many different types of fonts: hundreds or perhaps thousands of different fonts.
So if you want more training examples, one thing you can do is just take characters from different fonts and paste these characters against different random backgrounds. For example, you might take this character and paste that C against a random background. If you do that, you now have a training example of an image of the character C. So after some amount of work, and it is a little bit of work to synthesize realistic-looking data, you can get a synthetic training set like that.
Every image shown on the right was actually a synthesized image, where you take a font, maybe a random font downloaded off the web, and you paste an image of one character or a few characters from that font against some other random background image, and then apply maybe a little blurring operation and maybe a little affine distortion, meaning small shearing, scaling, and rotation operations. If you do that, you get a synthetic training set like the one shown here. And this is work; it takes thought and work in order to make the synthetic data look realistic, and if you do a sloppy job of creating the synthetic data, then it actually won't work well. But if you do it well, the synthetic data looks remarkably similar to the real data, and so by using synthetic data you have essentially an unlimited supply of labeled training examples for training a supervised learning algorithm for the character recognition problem. So this is an example of artificial data synthesis where you're basically creating new data from scratch, just generating brand new images from scratch.
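As a rough illustration of this "data from scratch" idea, here is a minimal sketch using Pillow and NumPy. The font paths, patch size, and distortion ranges are all hypothetical placeholders; as the video stresses, a real system would need much more care to make the results look realistic.

```python
import random
import numpy as np
from PIL import Image, ImageDraw, ImageFont

PATCH_SIZE = 32                          # square grey-scale patches
FONT_PATHS = ["fonts/SomeFont.ttf"]      # hypothetical font files you download
CHARS = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"

def synthesize_example(background: Image.Image):
    """Render one random character onto a random crop of a background image."""
    # take a random grey-scale crop of the background
    x = random.randint(0, background.width - PATCH_SIZE)
    y = random.randint(0, background.height - PATCH_SIZE)
    patch = background.crop((x, y, x + PATCH_SIZE, y + PATCH_SIZE)).convert("L")

    # draw a random character from a random font in a random dark grey level
    char = random.choice(CHARS)
    font = ImageFont.truetype(random.choice(FONT_PATHS), size=random.randint(20, 28))
    ImageDraw.Draw(patch).text((4, 2), char, fill=random.randint(0, 80), font=font)

    # a small random rotation as a very simple distortion
    patch = patch.rotate(random.uniform(-10, 10), fillcolor=128)
    return np.array(patch), char

# usage (with your own background photos):
# backgrounds = [Image.open(p) for p in background_paths]
# data = [synthesize_example(random.choice(backgrounds)) for _ in range(10000)]
```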
The other main approach to artificial data synthesis is to take examples that you currently have, that is, take a real example, maybe from a real image, and create additional data from it, so as to amplify your training set. So here is an image of the character A taken from a real image, not a synthesized image, and I have overlaid it with grid lines just for the purpose of illustration; the actual image doesn't have these grid lines. What you can then do is take this character, take this image, and introduce artificial warpings or artificial distortions into the image, so that you can take the image of the A and turn it into 16 new examples. In this way you can take a small labeled training set and amplify it to all of a sudden get a lot more examples.
Again, in order to do this for your application, it does take thought and it does take insight to figure out what a reasonable set of distortions is, that is, what ways of amplifying and multiplying your training set make sense. For the specific example of character recognition, introducing these warpings seems like a natural choice, but for a different machine learning application, there may be different distortions that make more sense.
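Here is a minimal sketch of this second idea, amplifying one real labeled example with small random affine distortions, again using Pillow. The shear, scale, and rotation ranges are illustrative and would need tuning; the point in the video is that the distortions should resemble variation you actually expect to see.

```python
import random
from PIL import Image

def distorted_copies(example: Image.Image, n: int = 16):
    """Return n randomly warped copies of one labelled character image.

    Assumes grey-scale ('L') character images, so 255 = white background.
    """
    copies = []
    w, h = example.size
    for _ in range(n):
        shear = random.uniform(-0.2, 0.2)
        scale = random.uniform(0.9, 1.1)
        # affine coefficients (a, b, c, d, e, f): output pixel (x, y) samples
        # the input at (a*x + b*y + c, d*x + e*y + f)
        coeffs = (scale, shear, 0.0, 0.0, scale, 0.0)
        warped = example.transform((w, h), Image.AFFINE, coeffs, fillcolor=255)
        warped = warped.rotate(random.uniform(-8, 8), fillcolor=255)
        copies.append(warped)
    return copies
```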
Synthesizing data by introducing distortions: Speech recognition
Let me just show one example from the totally different domain of speech recognition. In speech recognition, let's say you have audio clips, and you want to learn from an audio clip to recognize the words spoken in that clip. So let's say you have one labeled training example of someone saying a few specific words. Let's play that audio clip here.
Original Audio
0-1-2-3-4-5. All right, so that's someone counting from 0 to 5, and we want to apply a learning algorithm to recognize the words said in that clip. So, how can we amplify the data set? Well, one thing we can do is introduce additional audio distortions into the data set. Here I'm going to add background sounds to simulate a bad cell phone connection. When you hear beeping sounds, that's actually part of the audio track; there's nothing wrong with your speakers. I'm going to play this now.
0-1-2-3-4-5. Right, so you can listen to that sort of audio clip and still recognize the sounds, so that seems like another useful training example to have. Here's another example, with a noisy background.
Zero, one, two, three, four, five, with the sound of cars driving past and people walking in the background. Here's another one. So taking the original clean audio clip of someone saying 0 1 2 3 4 5, we can automatically synthesize these additional training examples and thus amplify one training example into maybe four different training examples. So let me play this final example as well.
0-1-2-3-4-5. So by taking just one labeled example, where we only had to go through the effort of collecting one labeled example of someone counting from 0 to 5, and then synthesizing additional distortions by introducing different background sounds, we've multiplied this one example into many more examples, with very little work, just by automatically adding these different background sounds to the clean audio.
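For the audio case, the same amplification idea is essentially just mixing the clean clip with different background recordings. Here is a minimal sketch assuming both clips are NumPy arrays of samples at the same sample rate; the mixing level is an illustrative choice.

```python
import numpy as np

def mix_with_background(clean, background, noise_level=0.3):
    """Overlay a background recording on a clean clip at a given level."""
    # tile or trim the background so it matches the clean clip's length
    reps = int(np.ceil(len(clean) / len(background)))
    background = np.tile(background, reps)[:len(clean)]
    mixed = clean + noise_level * background
    # rescale so the result stays in a sensible amplitude range
    return mixed / np.max(np.abs(mixed))

# usage: turn one clean clip into several training examples
# examples = [mix_with_background(clean, bg) for bg in (phone_noise, crowd, street)]
```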
Just one word of warning about synthesizing data by introducing distortions: if you try to do this yourself, the distortions you introduce should be representative of the sources of noise, or distortions, that you might see in the test set.
So, for the character recognition example, the warpings we introduced are actually reasonable, because an image of an A that looks like that could be an image we would actually see in a test set; the image on the upper right really could be an image we could imagine seeing. And for audio, we do want to recognize speech even over a bad cell phone connection, or against different types of background noise, and so for the audio we're again synthesizing examples that are representative of the sorts of examples that we want to classify, that we want to recognize correctly. In contrast, it usually does not help to add purely random, meaningless noise to your data. I'm not sure you can see this, but what we've done here is take the image and, for each pixel in each of these four images, just add some random Gaussian noise to that pixel's brightness. That's just totally meaningless noise, right? And so, unless you're expecting to see this sort of pixel-wise noise in your test set, this sort of purely random, meaningless noise is less likely to be useful. The process of artificial data synthesis is a little bit of an art, and sometimes you just have to try it and see if it works. But if you're trying to decide what sorts of distortions to add, do think about what meaningful distortions you might add that will cause you to generate additional training examples that are at least somewhat representative of the sorts of images you expect to see in your test set.
Discussion on getting more data
Finally, to wrap up this video, I just want to say a couple of words more about this idea of getting lots of data via artificial data synthesis. As always, before expending a lot of effort figuring out how to create artificial training examples, it's often good practice to make sure that you really have a low-bias classifier, so that having a lot more training data will actually help. The standard way to do this is to plot learning curves and make sure that you have a low-bias, high-variance classifier. If you don't have a low-bias classifier, one other thing worth trying is to keep increasing the number of features that your classifier has, or increasing the number of hidden units in your neural network, say, until you actually have a low-bias classifier, and only then put the effort into creating a large artificial training set. What you really want to avoid is spending a whole week, or a few months, figuring out how to get a great artificially synthesized data set, only to realize afterward that your learning algorithm's performance doesn't improve that much even when given a huge training set. So that's my usual advice: first test that you really can make use of a large training set before spending a lot of effort going out to get that large training set.
Second, when I'm working on machine learning problems, one question I often ask the team I'm working with, and often ask my students, is: how much work would it be to get ten times as much data as we currently have? When I face a new machine learning application, very often I will sit down with the team and ask exactly this question. I've asked this question over and over, and I've been very surprised how often the answer is that it's really not that hard, maybe a few days of work at most, to get ten times as much data as we currently have, and very often, if you can get ten times as much data, there will be a way to make your algorithm do much better. So if you ever join a product team working on some machine learning application, this is a very good question to ask yourself and to ask the team, and don't be too surprised if, after a few minutes of brainstorming, your team comes up with a way to get literally ten times as much data, in which case I think you would be a hero to that team, because with ten times as much data I think you'll really get much better performance, just from learning from so much data. So there are several ways to get more data. The first is artificial data synthesis, which comprises both the idea of generating data from scratch, using random fonts and so on, and the idea of taking existing examples and introducing distortions that amplify them into a larger training set. A second way to get a lot of data is to just collect the data and label it yourself. One useful calculation that I often do is to figure out how many minutes, or how many hours or days, it would take for me or for someone else to sit down and collect and label ten times as much data as we currently have. So actually sit down and ask: how long does it really take to collect and label one example? Suppose it takes about ten seconds to label one new example, and suppose that for our application we currently have 1,000 labeled examples, so m = 1,000; then ten times as much data would mean m = 10,000. So I do the calculation: how long is it going to take to manually label 10,000 examples, if it takes me 10 seconds to label one example?
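To make that last calculation concrete: 10,000 examples at about 10 seconds each is roughly 100,000 seconds of labeling, which is a bit under 28 hours, so on the order of a few full working days.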
When you do this calculation, you will often be surprised how little work it takes, sometimes just a small number of days of work. I've seen many teams be very surprised at how little work it could be to just get a lot more data, and that can be a way to give your learning algorithm a huge boost in performance. And sometimes, when you've just managed to do this, you will be a hero in whatever product development team you're working on, because this can be a great way to get much better performance.
Third and finally, one sometimes good way to get a lot of data is to use what's called crowdsourcing. Today, there are a few websites or services that allow you to hire people on the web to fairly inexpensively label large training sets for you. This idea of crowdsourcing, or crowd-sourced data labeling, has an entire academic literature of its own and some of its own complications, for example pertaining to labeler reliability, but there are maybe hundreds of thousands of labelers around the world working fairly inexpensively to help label data for you, so I just wanted to mention that this alternative exists as well. Amazon Mechanical Turk is probably the most popular crowdsourcing option right now. It is often quite a bit of work to get this to work well if you want very high-quality labels, but it is sometimes an option worth considering if you want to hire many people, fairly inexpensively, on the web to label large amounts of data for you.
So in this video, we talked about the idea of artificial data synthesis, of either creating new data from scratch, using random fonts as an example, or amplifying an existing training set by taking existing labeled examples and introducing distortions to them to create extra labeled examples. And finally, one thing I hope you remember from this video is that if you are facing a machine learning problem, it is often worth doing two things. One is a sanity check, with learning curves, that having more data would help. And second, assuming that that's the case, sit down and ask yourself seriously what it would take to get ten times as much data as you currently have. Not always, but sometimes, you may be surprised by how easy that turns out to be, maybe a few days or a few weeks of work, and that can be a great way to give your learning algorithm a huge boost in performance.
04_ceiling-analysis-what-part-of-the-pipeline-to-work-on-next
In earlier videos, I've said over and over that, when you're developing a machine learning system, one of the most valuable resources is your time as the developer, in terms of picking what to work on next. Or, if you have a team of developers or a team of engineers working together on a machine learning system, again, one of the most valuable resources is the time of the engineers or developers working on the system. What you really want to avoid is for you or your colleagues or your friends to spend a lot of time working on some component, only to realize after weeks or months of time spent that all that work just doesn't make a huge difference to the performance of the final system. In this video, what I'd like to do is tell you about something called ceiling analysis. When you or your team are working on a pipelined machine learning system, this can sometimes give you a very strong signal, very strong guidance, on what parts of the pipeline might be the best use of your time to work on.
To talk about ceiling analysis, I'm going to keep on using the example of the photo OCR pipeline. Each of these boxes here, text detection, character segmentation, character recognition, could have even a small engineering team working on it, or maybe the entire system is built by just you, either way.
But the question is: where should you allocate scarce resources? That is, which of these components, which one or two or maybe all three of them, is most worth your effort to try to improve the performance of? Here's the idea of ceiling analysis. As with the development process for other machine learning systems, in order to make decisions about how to develop the system, it's going to be very helpful to have a single real-number evaluation metric for the learning system. So let's say we pick character-level accuracy: given a test set image, what is the fraction of characters in the test image that we recognize correctly? Or you can pick some other single real-number evaluation metric if you want. But let's say that for whatever evaluation metric we pick, we find that the overall system currently has 72% accuracy. In other words, we have some set of test set images, and for each test set image we run it through text detection, then character segmentation, then character recognition, and we find that on our test set the overall accuracy of the entire system is 72% on whatever metric we chose.
Now here's the idea behind ceiling analysis. We're going to go to the first module of our machine learning pipeline, say text detection, and we're going to monkey around with the test set: for every test example, we're going to provide it the correct text detection output. In other words, we're going to go to the test set and manually tell the algorithm where the text is in each of the test examples, so we simulate what happens if we had a text detection system with one hundred percent accuracy for the purpose of detecting text in an image. The way you do that is pretty simple: instead of letting your learning algorithm detect the text in the images, you go to the images and manually label the location of the text in each test set image, and you then let these correct, ground-truth labels of where the text is be what you feed into the next stage of the pipeline, the character segmentation stage. So, just to say that again, by putting a checkmark over here, what I mean is that I'm going to go to my test set and just give it the correct answers, the correct labels, for the text detection part of the pipeline, as if I had a perfect text detection system on my test set. What we do then is run this data through the rest of the pipeline, through character segmentation and character recognition, and use the same evaluation metric as before to measure the overall accuracy of the entire system. With perfect text detection, hopefully the performance will go up, and in this example it goes up to 89%.
Then we keep going and move to the next stage of the pipeline, character segmentation. Again, I'm going to go to my test set and now give it both the correct text detection output and the correct character segmentation output; that is, go to the test set and manually label the correct segmentation of the text into individual characters, and see how much that helps. Let's say the overall system accuracy goes up to 90%. As always, this is the accuracy of the overall system, measured on the final output of the character recognition stage, the final output of the whole pipeline. And finally, I can also give the character recognition system the correct labels, and if I do that too, then, no surprise, I get 100% accuracy.
Now the nice thing about having done this analysis is that we can understand what the upside potential of improving each of these components is. We see that if we get perfect text detection, our performance goes up from 72% to 89%, so that's a 17% performance gain. This means that if we took our current system and spent a lot of time improving text detection, we could potentially improve our system's performance by up to 17%; it seems like it's well worth our while. In contrast, when we went from perfect text detection to also having perfect character segmentation, performance went up only by 1%, and that's a more sobering message: it means that no matter how much time you spend on character segmentation, the upside potential is going to be pretty small, so maybe you do not want a large team of engineers working on character segmentation. This sort of analysis shows that even when you give it perfect character segmentation, performance goes up by only one percent, and that really estimates the ceiling, an upper bound, on how much you can gain by working on that component. Finally, going to perfect character recognition, performance went up by ten percent. So again, you can decide whether a ten percent improvement is worth your while; it tells you that with more effort spent on the last stage of the pipeline, you could improve the performance of the system by that much. Another way of thinking about this is that by going through this sort of analysis, you're trying to figure out the upside potential of improving each of these components, that is, how much you could possibly gain if one of these components became absolutely perfect, and this places an upper bound on the value of improving each component of the system.
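To put the numbers from this example side by side: the current overall system is at 72% accuracy; with perfect text detection the overall accuracy becomes 89% (a 17% gain); adding perfect character segmentation brings it to 90% (a 1% gain); and adding perfect character recognition brings it to 100% (a 10% gain).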
The idea of ceiling analysis is pretty important, so let me illustrate it again with a different, more complex example. Let's say that you want to do face recognition from images: you look at a picture and try to recognize whether or not the person in the picture is a particular friend of yours, and recognize the person shown in the image. This is a slightly artificial example, and this isn't actually how face recognition is done in practice, but we're going to use it to set up an example of what a pipeline might look like, and to give you another example of how a ceiling analysis process might go. So we have a camera image, and let's say that we design a pipeline as follows: the first thing we want to do is pre-processing of the image.
So let's take the image shown on the upper right, and let's say we want to remove the background. We do the pre-processing and the background disappears. Next, we want to detect the face of the person; that's usually done with a learning algorithm, so we'll run a sliding windows classifier to draw a box around the person's face. Having detected the face, it turns out that, if you want to recognize people, the eyes are a highly useful cue; in terms of recognizing your friends, the appearance of their eyes is actually one of the most important cues you use. So let's run another classifier to detect and segment out the eyes of the person, since that will give us useful features for recognizing the person. Then we segment out other parts of the face of interest, maybe the nose and the mouth. Having found the eyes, the nose, and the mouth, all of these give us useful features to feed into, say, a logistic regression classifier, and the classifier's job is then to give us the overall label, to find the label for who we think this person is. So this is a somewhat complicated pipeline; it's actually probably more complicated than what you should use if you actually want to recognize people, but it's an illustrative example that's useful to think about for ceiling analysis.
So how do you go through ceiling analysis for this pipeline? Well, we step through these components one at a time. Let's say your overall system has 85% accuracy. The first thing I do is go to my test set and manually give it the correct background segmentation; that is, go to the test set and use Photoshop or something to manually remove the background, so this is ground-truth background removal, and see how much the accuracy changes. In this example, the accuracy goes up by 0.1%. So this is a strong sign that even with perfect background removal, the performance of your system isn't going to go up that much, so it's maybe not worth a huge effort to work on pre-processing, on background removal. Then we go to the test set and give it the correct face detection output, and then again step through the eyes, nose, and mouth segmentation, in some order, just pick one order: give it the correct location of the eyes, the correct location of the nose, the correct location of the mouth, and then finally, if I also give it the correct overall label, I get 100% accuracy. And so, as I go through the system and give more and more components the correct labels in the test set, the performance of the overall system goes up, and you can look at how much the performance went up at each step. From giving it perfect face detection, it looks like the overall performance of the system went up by 5.9%, so that's a pretty big jump; it means that maybe it's worth quite a bit of effort on better face detection. Performance went up 4% for the eyes segmentation, then 1%, 1%, and 3% for the remaining steps. So it looks like the components most worth working on are face detection, where performance went up by 5.9% when made perfect, eyes segmentation, which gave another 4%, and the final logistic regression classifier, where there's another 3% gap. This tells you which components are most worthwhile to work on. And by the way, I want to tell you a true cautionary story. The reason I put pre-processing, background removal, in this pipeline is because I actually know of a true story where a research team literally had two people spend about a year and a half, 18 months, working on better background removal. I'm obscuring the details for obvious reasons, but there was a computer vision application where a team of two engineers literally spent about 18 months working on better background removal; they actually worked out really complicated algorithms and ended up publishing one research paper. But after all that work, they found that it just did not make a huge difference to the overall performance of the actual application they were working on, and one of them said to me afterward: if only someone had done a ceiling analysis like this beforehand, maybe they could have realized, before their 18 months of work, that they should have spent their effort focusing on some different component, rather than literally spending 18 months working on background removal.
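Adding up the gains quoted above gives the accuracy ladder for this pipeline: 85% for the current system; about 85.1% with perfect background removal (+0.1%); 91% with perfect face detection (+5.9%); 95% with perfect eyes segmentation (+4%); 96% and 97% with perfect nose and mouth segmentation (+1% each); and 100% once the correct final label is given (+3%). These cumulative figures are just the 85% baseline plus the stated gains.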
So, to summarize: pipelines are pretty pervasive in complex machine learning applications, and when you're working on a big machine learning application, your time as a developer is so valuable that you just don't want to waste it working on something that ultimately isn't going to matter. In this video we talked about the idea of ceiling analysis, which I've often found to be a very good tool for identifying which component, if you put a focused effort into it, would actually have a big effect on the overall performance of your final system. Over the years working on machine learning, I've actually learned not to trust my own gut feeling about which components to work on. I've worked on machine learning for a long time, but often when I look at a machine learning problem, I may have some gut feeling of, oh, let's jump on that component and just spend all our time on that. Over the years, I've come to learn not to trust those gut feelings too much. Instead, if you have the sort of machine learning problem where it's possible to structure things into a pipeline and do a ceiling analysis, that is often a much better and much more reliable way of deciding where to put a focused effort to improve the performance of some component, so that you can be reassured that, when you do that, it will actually have a big effect on the final performance of the overall system.