Efficient Forward Architecture Search

[MUSIC]>>So Debadeepta is going to tell us about Efficient Forward
Architecture Search, it’s a very fast approach to doing neural architecture
search. Go ahead.>>Thanks, John. So today, I’ll talk about a method for
doing neural architecture search. But before I talk
about a particular method, let’s see why we care about
searching for architectures. So this was when I was visiting
John in New York in April 2018. This is actually a
picture from DenseNet. Actually, before we even begin, how many of you would
say are very comfortable with at least basics of
training a neural network? Okay. Great. TensorFlow, PyTorch, CNTK, you know one of those. Supervised learning, you have trained a neural network
on some dataset. So often, what happens is that, “Hey, we have to figure out
how many layers we need.” Let’s say this is a
dataset that is new. Let’s say, it’s not the
standard well-known dataset on which there has been a lot of communal
domain knowledge, people have tried
stuff, published stuff. But this is your own proprietary first-party dataset and it’s massive. So what should be the
capacity of your network? What should be the
particular stride length? Or how should you connect these things to each other,
the layers to each other? This is all very frustrating. You are sitting there and
you’re trying things, you have a GPU
budget, I am sure. It’s not unlimited and you want
to try out many of these things, and you want to achieve some kind of performance
that you care about, like, I want good performance. I don’t want to use too much GPU RAM because during production time, that means I have to allocate
too many machines for serving. I also wanted to not have this
much more than this much latency. So there are all these
like multiple objectives that you want to ideally do. But before you get to all of that, in the beginning you
have a new dataset, a new task, you just want to figure out something that works reasonably. Let’s take supervised learning. We are not going to talk
about reinforcement learning. We will talk about reinforcement
learning in the solution methods, but the tasks will be
supervised learning. Here’s the truth, nobody really knows how to set these things
like including people who are experts in the field and even invented neural
networks back in the day. I can tell you, this is
all trial and error. This is very empirical work. This can feel very magical, like we have all these magic
numbers floating around, and not only this, we
have magic numbers floating around in the
training parameters. That is not even in this picture. What’s the learning rate? What’s
the learning rate schedule? How should we warm start? What should we initialize
the weights as? This is the big motivation for trying to take this guesswork away from humans
into the hands of algorithms, and hence there is this
very nice subfield of
machine learning now called
like a massive talent shortage, and I think this room has a biased sampling but in the real
world getting people who have the right skill sets and hiring them and training them if
needed is not easy, right? If you are out there in the field, you probably know this
much better than I do. So first, before we dive into our particular method, we’re going to do a
15-minute literature overview, and before we even go there
into the literature overview, we’re going to look at what
is the problem definition. So the problem definition is, we have a search space, and remember this is
not very difficult, this is not very complicated. All I actually want to do is
given a search space which is the set of all realizable
neural network architectures, I want to find an architecture that fits all my criteria, like
it has this much accuracy, this much latency, and so on, if possible, or comes as close to
them as possible and I’m done, and the training procedure
that hopefully can train it, and I’m all good. This is no different from
any search algorithm. If you’re doing search anywhere, all the same things come into play. There are some challenges,
which is the third column. But the first one is
we need a search space. If you’re searching for
something, you need to define a space to search in. We will see that there are
usually two kinds of spaces, which in informal lingo will either
be called macro or micro. We will see examples of that. Right now, a lot of domain engineering goes into what
should be the right search space, because here’s the trick: if I can somehow come
up with a search space which I know will have all the right primitives, like it has the conv layers,
it has transformers, it has self attention units
and all of these things, batch normalization and
all the right operations, and I can also inject some priors
into the search space that, hey, I know that if I have a conv unit, after that I should not be doing
a batch normalization this way. I know how layer normalization
works from other empirical papers. I can prune out the search space by just injecting a lot of
human domain knowledge, and this debate right now is how much domain knowledge
should I inject into the search space and which
one should we work on. So we will look into some details
of that and pros and cons. Now once you have a subspace, the big problem in architecture
search is that your search is very expensive because
here’s what really happens, in order for you to try out
architectures one after the other and find the right one, you literally have to somehow
find an architecture, bring it all the way to the end and imagine you
have a dataset which is ImageNet-size, or if you’re in language, this is training
a BERT model or something. This becomes your subroutine. There is some algorithm, the search algorithm, which is
searching for architectures and giving you an architecture
somehow and your worker threads have to go and try
them on your dataset and then they will report
back some statistics like, oh, this particular architecture
has this much memory usage, this much latency, this
much accuracy, and so on. You will take these statistics and
update your search algorithm on what to try next or what
other things to try next, and all of these things
actually happen asynchronously in any proper implementation.
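The propose–evaluate–update loop just described can be sketched as follows. This is a toy random-search version, not the system from the talk; `train_and_eval` is a hypothetical stand-in for the expensive train-and-report step, and real systems run it asynchronously on many workers.

```python
import random

# Toy search space: each architecture is just (num_layers, width).
SEARCH_SPACE = [(l, w) for l in (2, 4, 8) for w in (16, 32, 64)]

def train_and_eval(arch):
    """Hypothetical stand-in for the expensive inner step: train the
    candidate and report statistics back (accuracy, latency, ...).
    Here we just fake a score so the loop is runnable."""
    layers, width = arch
    return {"accuracy": 1.0 - 1.0 / (layers * width),
            "latency_ms": 0.5 * layers}

def random_search(budget):
    """Outer search loop: propose -> evaluate -> keep the best."""
    best, best_acc = None, -1.0
    for _ in range(budget):
        arch = random.choice(SEARCH_SPACE)   # search strategy (here: random)
        stats = train_and_eval(arch)         # the expensive point evaluation
        if stats["accuracy"] > best_acc:
            best, best_acc = arch, stats["accuracy"]
    return best, best_acc

best_arch, best_acc = random_search(budget=20)
```

Every real NAS method in this talk replaces the `random.choice` line with something smarter, precisely because each call to the evaluation step is so expensive.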
But all the expensiveness of neural architecture search comes down to the fact that every time,
in order to evaluate, a point function evaluation
is super expensive. So I have this search space
of neural architectures, I pick a particular point, that point is an architecture. I have to go and train it. This training can take,
on a four-GPU box, two days. So this is super expensive, and there are uncountably
many of them, depending upon how I
define my subspace. There could be easily, these are discrete things usually, I’m putting aside
continuous hyperparameters, so they’re not really
uncountably many but easily you could have millions
and millions of things to try. So which ones should I try? I don’t want to use 10 million
GPU hours, right? So the search strategy
is all about how efficiently can we search
through the things. Then performance
estimation is more like, a sub-part of search strategy, is when I’m searching, when somehow this looks like a good candidate to try by
using whatever algorithm, I’m going to go down and
then see how well it performs but I don’t want to
actually train it for two days. So let’s say it takes two days
to train on an ImageNet-sized dataset on a multi-GPU
box, how quickly, how approximately can I
evaluate that function, derive a reliable signal of
how well it performs or not and get out without being fully committed to it because
that takes up a lot of time. So usually if you look at neural
architecture search literature, most of the differences are
along these three dimensions. What’s your search space,
what’s the strategy, and how you are doing performance
estimation. So let’s look at and define some of the search spaces,
because this is important. So this is what is called
as a cell search space and what happens in the
cell search space is, let’s see how many of you are
familiar with let’s say ResNet? Okay, good. So if you know ResNet, you know that what happens
in ResNet is that there is this really nice like skip connection and there’s a ResNet cell and they’re regular cell and then
there’s a reduction cell. That’s to down-sample
the resolution, increase the number of
channels, and so on. But really in ResNet, the design is not that complicated. There is a cell, which is basically this little motif structure right here, and in ResNet all
the cells are the same; this one here is a little
bit more general, the motifs are different, and all you’re doing is lining up the cells and connecting
them via skip connections. If you want, let’s say,
I want a bigger thing, add more cells and all the
cells are designed the same. This is very popular in literature
because the nice part is you can inject a lot of human domain knowledge and
reduce the search space. Your search space is
not fully general, it’s you’re, so let’s say, you say that, hey, I’m going to use the ResNet like backbone which is the
skip connections that you have between cells, and I’m going to keep the search space constrained to finding what the motif, the cell, should be. Then if I want a bigger network, I just add many of
these cells together, stack them, stamp many of these cells together, and
I get bigger networks.
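The stack-the-same-cell construction can be sketched symbolically. This is a toy illustration with strings in place of real layers; `make_cell` and `stack_network` are hypothetical names, not from any paper.

```python
def make_cell(ops):
    """A searched cell: a fixed motif (sequence of named ops).
    In a real system each op would be a conv or pooling layer;
    here they are placeholder strings so the structure is visible."""
    def cell(x):
        for op in ops:
            x = f"{op}({x})"
        return x
    return cell

def stack_network(cell, num_cells, x="input"):
    """The macro skeleton is fixed: stamp the same cell N times,
    with a skip connection around each cell (shown symbolically)."""
    for _ in range(num_cells):
        x = f"add({cell(x)}, {x})"   # residual/skip connection
    return x

cell = make_cell(["conv3x3", "relu"])
net = stack_network(cell, num_cells=2)
```

Only the cell motif is searched; making the network bigger is just increasing `num_cells`, which is exactly the human prior being injected here.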
As you can imagine, you are not searching for what
the outer skeleton should be, and this really reduces the
number of things you have to try, the size
of your search space; there is a lot of human
domain knowledge injection. So some smart people came up with
skip connections are good and having these cells is enough, it seems, and we are going to figure out what the
cell through design should be, and this is the most
popular search space. Then of course, there’s the other
one which is what you would actually expect if you didn’t
know about the cell search space, which is, well, I have a fully general search
space, anything goes, any particular type of
connections and any size, any wonky weird shape is all on the table and this is obviously
much bigger and a lot of research shies away from using this search space because
even for doing this research, you need a lot of GPUs and I will show you results
later that will show that actually you can use this search space and you
can be competitive with the cell search space even if you use a much bigger search space
and get similar results. The slides, we’ll make
available either through the website or e-mail me. So all the links will be there,
so you don’t need to worry. Some really good people maintain this really nice list of
architecture search papers. To just give you an example
of how hot this field is, let me see, okay, so actually the browser is on. So remember, we are only in October, there’s two month left in the year. So there’s a group in Fribourg who maintains this
list of machine learning papers in neural architecture search related papers which either published or not published and I think the bold ones are published
and this is 2019. I’m scrolling down,
we are still in 2019, a little bit more. We’re still in 2019,
even I’m surprised, I didn’t expect I had to scroll so much. Okay, 2018, I see ’18, good. Oh boy, even ’18 was pretty big. Hey, ’17, this is good. Seventeen will be much shorter, ’16, and then the list ends. From ’16, you go all the way back
to ’94, ’90, ’89, ’88. So there was this big gap for, I forget how to do the math, 25 years or something where we
didn’t do anything on that topic. So this is pretty incredible. I think there are 50-60
papers on arXiv. I dread waking up in the morning
and looking at arXiv. So this is a really nice survey article I recommend reading, because as it happens with any field
which is going viral, if you tried to read all the papers
that came out just this month, you would have to give up your day
job and just sit and read. But you don’t have to,
that’s the good news. Because a lot of it is nice and there are a few key ideas that
you really need to know, if you want to get into the field, and I recommend that survey
paper for more details. Within this short talk, I’ll spend five more
minutes to talk about three or four key papers
and then we’ll move on to our paper, which will come out in December but is already on arXiv. So the thing that actually
started or I should say, restarted the modern era of
interest in neural architecture search, and as you saw, this
is not the first paper, so this is probably like
the 10th or 11th paper. The first paper was 1988, the second paper was 1989 at CMU, but this one really restarted. So four or five years into the deep learning
revolution reawakening, this paper came out, and
this paper was super-simple. Some of you I think took
the classes on Explore, Exploit, and Basics of
Reinforcement Learning. So if you did, this should look pretty trivial. So I’m going to do this
policy gradient method, and if you don’t know
RL or policy gradient, these words are alien to you, you don’t need to worry about it
at all. Here’s the simple thing. I’m going to take one neural network. In this case, it’s an RNN, which basically to get
a seq to seq model because we are going to encode neural networks as a
sequence of outputs. Like, oh, it’s conv five
followed by a [inaudible] , followed by a max pool,
and so on and so forth. So this RNN is going
to, like, output, by unrolling the RNN,
and it’s going to output me a network definition. So that’s, I’m going to
call the controller, and when I get the network
definition from it, I’m going to send it
off to, let’s say, some worker machine
somewhere in the Cloud, and I’m going to train it on the data set and let’s
say I have ImageNet, or CIFAR or something, as the
dataset, just to give an example. It will train, I’ll get some accuracy and some
other statistics back, and I am going to update this. So if really good things
happened for that particular, like your network that was
sampled by this controller, during the update process, I’m going to update
in such a way such that those kinds of networks
would be more likely. There’s a little bit of
very trivial math that goes into how you will do the update from
this but it’s very simple. Like you’re just going
to make the network, this network which spits
out other network definitions, make the networks whose training results were good more likely and the
other ones less likely, and we are going to use
essentially like supervised learning updates to do this. To give you a very concrete example, that controller network
gets unrolled, and it gives you
the number of filters, filter height, filter
width, stride height and width. So this is one layer, layer n. Then you keep unrolling
and it’ll give you another layer so on
and so forth, until it outputs stop, and that’s
your network definition. This became quite popular, this is a very well-cited paper. Unfortunately, and I
will show you results comparing all of these later. This was extremely expensive like to do if handle
search on CIFAR-10. I forget the exact number, but it was like 1,800
or 3,000 GPU hours, and this is probably way
more than you want to afford in terms of GPU cost. So the next paper that came out
made this cost go from 1,800 GPU hours to obtain a network which
it does well on CIFAR-10 and ImageNet, to one day. So at that time, it achieved, I think it
beat humans at that time. Don’t quote me on that though, on CIFAR-10 as well
as on Penn Treebank; all those numbers have since become more accurate, because this was
2018 and this is ’19. So this is quite affordable by
deep learning standards. In all the experiments, they used a single 1080 Ti GPU. Now this starts to look much
more attractive because the paper before even though
it was the paper that started, let’s say, restarted the
interest was pretty impractical, they didn’t seem like a good method. It wasn’t efficient at all. This starts to look much more
interesting and attractive. So let’s talk about how
you go about doing this. The student did the same
thing as before where you had the RNN sampling networks and then you went and trained
it and you updated the RNN, and obviously you are doing all
these sampling things in parallel. You sampled, let’s say, 1,000 of networks in parallel and you send them off to 1,000
worker threads and whatnot. But note the big expense
what is the biggest expense? The biggest expense is the cost of doing
forward-prop and backprop, and mostly backprop is much more dominating and
updating all the weights, that is if I sample 10 different architectures
and I send them off, I’m doing 10 times
how many epochs times how many dataset points
are in every epoch, well, divided by the
number of minibatch size. That many backprops, and that’s where all my
forward-prop and backprop, and that’s where all
my compute is going. This is the very expensive part. So here was a completely crazy idea,
and it will sound barbaric when I mention it:
what if we didn’t do that? What if, let’s say, I sampled five architectures
to keep it simple, and the five architectures are
very similar to each other, except they differ in some edges. I have one architecture and it looks almost the same
as the one next to it, except like it has this edge, it doesn’t have this edge. Let’s share their weights. So I’m going to take
those two architectures, put them in the same
computational graphs. So let’s say they have all
these edges common to them. I’m going to let them share it, and I’m going to do
one forward-prop and one backprop through these two
architectures at the same time, and I’m going to update their individual loss
functions individually. But the gradients are like
flowing back in aggregate, and so the edges
which are not common, they only see gradients for each of their respective
loss functions, but the rest of the network, they all see all the bad
gradients coming in, and this sounds crazy, but very efficient because
what you are doing is I’m just doing one
forward-prop and backprop, and let’s say I have K networks, and all K networks I can evaluate
or train at the same time.>>So you’re saying that
you’re basically aggregating all these network
structures together so you have partially connected networks.>>So here’s the example. So what you do is here’s
one big limitation, you have to assume that
there is a big super graph, like there is this
big graph from which all this which is your
subspace which contains all possible networks and you
are sampling from that and let’s say so for illustrative
purposes, this is the big graph. The fully densely, almost
densely connected one. This red one is one sub-graph
which is a particular network, and that RNN sampled one
of these, and I took it. Then I ask it to
sample me another one, and let’s say it sampled
another one which looked like all these
edges were common, but instead of this
edge it went this way, and maybe something else happened. So all these edges which were common, I’m going to create this
new graph which has basically the superposition
of these two graphs together, take the union of those two graphs. So I will have some edges
which are different, but all the other edges are common, and I will treat that as one graph, and I’m going to do just one
forward-prop and backprop through this superposed graph. It’s not really superposition, just take it as the union. There could
be an edge which is common across, let’s say, 10 different graphs, all at the same time, and this
really amortizes the cost of search.
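The shared-supergraph idea can be sketched concretely. This is a toy, assuming a shared bank of per-edge weight matrices and a crude finite-difference update in place of real backprop; the only point is that the shared edge `e1` is updated by both sampled architectures.

```python
import numpy as np

rng = np.random.default_rng(0)

# Shared weight bank: one weight matrix per edge of the supergraph.
weights = {e: rng.normal(size=(4, 4)) for e in ("e1", "e2", "e3")}

# Two sampled sub-networks, as lists of edges; they share edge "e1".
arch_a = ["e1", "e2"]
arch_b = ["e1", "e3"]

def forward(arch, x):
    for e in arch:
        x = np.tanh(weights[e] @ x)
    return x

def sgd_step(arch, x, y, lr=0.01, eps=1e-4):
    """Crude finite-difference gradient step on the shared bank,
    standing in for backprop on a few entries of each edge."""
    for e in arch:
        for idx in [(0, 0), (1, 1)]:        # touch a couple of entries
            weights[e][idx] += eps
            up = np.sum((forward(arch, x) - y) ** 2)
            weights[e][idx] -= 2 * eps
            down = np.sum((forward(arch, x) - y) ** 2)
            weights[e][idx] += eps          # restore
            weights[e][idx] -= lr * (up - down) / (2 * eps)

x, y = rng.normal(size=4), rng.normal(size=4)
before = weights["e1"].copy()
sgd_step(arch_a, x, y)   # gradient step through architecture A
sgd_step(arch_b, x, y)   # ...and through B; both update shared edge e1
```

In the real method one combined forward and backward pass serves all sampled architectures at once, which is where the cost amortization comes from.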
So you might ask why this would ever work, because the edges
should get specialized. Let’s say there is an edge which is a part of
K different graphs. Isn’t this really bad, because that
edge is now trying to be good in terms of its
particular weight values? It will take on trying to be
useful to many graphs at once, instead of specializing
to the particular graph each individual
graph it belongs to. The conjecture is that we have some good regularization
effects, but this works. Well, if you read the
paper, I couldn’t find any deep understanding of why
this would or wouldn’t work. In many ways, maybe you can think
of it as multi-task learning. You’re regularizing the
weights by saying that, “Hey, you have to be good
across many different graphs.”>>[inaudible]>>Multitask from the
weights’ perspective. So the weights have to learn
a value such that they’re good across many different graphs, if you treat every different
graph as a task, loosely. You don’t have to call it a task, but I get some
regularization effects. You can use the same
trick to sample RNNs: get an RNN to sample another RNN. For RNNs, actually, what happens is, your cell search and macro
search are equivalent. There is no macro space. RNNs are cells. You are just rolling out the
cell so everything is nice, and so on and so forth. If you look at very quickly
at the results, let’s see. Yeah, you come down from GPUs, look at the number of GPUs, 800 GPUs, these are the number of days that this was the
paper I showed you. So we multiply this by this, this is a huge GPU budget. Then we come down to one. After doing this, you
are getting this, and now we come down to this. There’s also the
number of parameters. So in almost every
respect you become, this is strictly better, uniformly better on all the
fronts that you care about. So this is a good paper to know. They discovered networks on image datasets that you will probably not come up with if
you are a human being. I don’t know about you, but I will not come up with this. There’s no way I will
figure this network out. If you are doing cell search, this was the things and
this was the things, but it doesn’t really matter. The whole thing is that these are not intuitive things that you will
sit and come up with very likely. This is actually my favorite paper, and this is work from
DeepMind and CMU called DARTS, which came out
sometime mid-last year. This one is nice. There is
no reinforcement learning. There are no controllers. By the way, feel free to stop me
and ask me questions anytime. It will be more fun that way. Because if you think about it, it was weird that the 2017 paper posed this as a full
reinforcement learning problem; more than that, they posed it as a stateful RL problem and
used policy gradients. Because you can actually have
gradient information here. So this is like feature selection. So why would I pose it as the
hardest possible setting and that’s obvious in the
numbers that you get back that it’s so inefficient
in finding something good. There are some limitations
here that they do only like cell-based and you
will see very quickly why. They can’t really handle
big search spaces. They did really well in
terms of search time. So here’s the idea. This is very similar to
the ENAS idea. Hey, imagine the following
setup that I have got. So this is a cell and
the cell, let’s say, has four nodes, this is just for illustration purposes, 0, 1, 2, 3. The nodes here are tensors and
they’re connected by operations. So from this node, so let’s say from going
from zero to one, let’s say I pick ReLu. Then it will take the tensor at zero, apply ReLu to it. You get a new tensor and that
new tensor will show up at one. Usually, this is a discrete
optimization problem. If I keep the
hyperparameters constant, finding architectures, it’s a
discrete optimization problem. I have to figure out which
discrete architecture I want. They did this nice continuous relaxation, which
basically means that I’m going to consider soft weights, weights over all
possible architectures. For illustration purposes,
I have three operators, ReLu, Conv 3 comma 3, and Maxpool. From zero, I can go to one or to any of these nodes by using
any of these three operators. Depending upon which specific
choice of operator I choose, I get a particular
architecture at every place. I could instantiate a particular
architecture based on, let’s say, this one was ReLu, this one was Maxpool, and this one here was Conv and so on and so forth I get
a particular architecture. But instead of that,
I’m going to keep all possible architectures
around at the same time. I’m going to keep weights on them. So now we have two kinds of weights. We have the weights in the architecture, which are like the weights that you would
traditionally understand. If I have a Conv layer, I have
the weights and biases to train if you have biases
at all to train in that. But then I’m going to have
this new kind of weights called weights over the architecture. So I’m going to have, let’s say, a weight Alpha 1, Alpha 2, Alpha 3, Alpha 4, Alpha 5, Alpha 6, so on and so forth where one Alpha corresponds
to one operation. If that Alpha is very high, that means that the
network, for example here, if Alpha 2 is high that
means that the optimizer is trying to say that for
this particular connection, let’s say ReLu is really good, and for this one ReLu is really good, this one Convs are good, and so on and so forth. But note that we’re
keeping everything around, but based on the optimization, whatever values of
weights these take on, are giving you a continuum
of architectures. This is key. You will ask: but I want one architecture
at the end of the day. You take the argmax over all the edges and you get
a particular architecture: take the best weight and the
corresponding operations.
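The mixed operation and the final argmax discretization just described can be sketched in a few lines. This is a minimal numpy sketch under assumed names (`mixed_op`, `discretize`, a three-op candidate set); it is not the DARTS implementation.

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

# Candidate operations on one edge (hyperparameters held constant).
OPS = {
    "relu":     lambda x: np.maximum(x, 0.0),
    "identity": lambda x: x,
    "zero":     lambda x: np.zeros_like(x),
}

def mixed_op(x, alphas):
    """Continuous relaxation: the edge's output is a softmax-weighted
    sum over all candidate ops' outputs, so gradients can flow into
    the architecture weights alphas."""
    w = softmax(alphas)
    return sum(wi * op(x) for wi, op in zip(w, OPS.values()))

def discretize(alphas):
    """At the end, take the argmax to recover one discrete op."""
    return list(OPS)[int(np.argmax(alphas))]

x = np.array([-1.0, 2.0])
alphas = np.array([2.0, 0.5, -1.0])   # architecture weights for this edge
y = mixed_op(x, alphas)
chosen = discretize(alphas)
```

Every op stays instantiated in the mixed sum, which is exactly why, as discussed below, this approach is memory-hungry.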
It turns out that you can do this. There’s some optimization
background here: there’s a thing called
bi-level optimization. There are two levels here,
as you can imagine. The inner loop is, I want to train
my network the regular way, and the upper level is like maintaining weights
over architectures. So I’m going to do this inner
loop and do this outer loop. During all this time,
this optimization is still efficient in terms of backprop and forward-prop, because my cost of backprop is amortized across
all of them all at once. In order to evaluate these weights, it turns out that all I need to do is one forward-prop and one backprop, and I can get the update.
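The inner/outer alternation can be illustrated on a toy problem with one scalar weight and one scalar architecture parameter. The two quadratic objectives below are invented purely for illustration (they are not DARTS’s losses); the point is the alternating update pattern, with a finite difference standing in for backprop.

```python
def train_loss(w, a):
    """Inner objective: fit the weights w given architecture a."""
    return (w - a) ** 2

def val_loss(w, a):
    """Outer objective: judge the architecture on held-out data."""
    return (w + a - 1.0) ** 2

def grad(f, x, eps=1e-5):
    """Central finite difference, standing in for backprop."""
    return (f(x + eps) - f(x - eps)) / (2 * eps)

w, a = 0.0, 0.2
for _ in range(500):
    w -= 0.1 * grad(lambda v: train_loss(v, a), w)  # inner: train weights
    a -= 0.1 * grad(lambda v: val_loss(w, v), a)    # outer: update architecture

# For these toy objectives the alternation settles at w = a = 0.5.
```

This is the first-order approximation flavor: the outer step treats the current weights as fixed rather than differentiating through the inner optimization, which is one source of the approximation mentioned below.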
So again, unlike ENAS, now I’m going to just keep continuous values over all the architectures
and I’m happy. So one problem that
you might guess is, what do you think would
be a problem with this. This sounds good. There’s some problems. The
optimization is approximate. Unless you delve into the math, you might not notice. But practically, would
you want to do this? Or what do you foresee
as the problem?>>Run time for this?>>Run time, this is less than a day. The results are really good at least on CIFAR and you
mentioned it and whatnot. They’re just as good
as ENAS if not better. Something must really bother you. Okay.>>So on those graphs, what are those three lines
laid on the trajectory? What does each line [inaudible] Or did you run it three times?>>No. Each line represents
a particular operation. It’s an operator. So let’s say
you have a set of operators, like ReLU, max pool, conv, and each line represents how much weight you gave over
each one of those operators.>>But the number of operators
is almost uncountable, so it can have any number
of hyper parameter in terms of inside that operator which
makes it a new operator, right?>>So assume the
hyper-parameters are constant.>>But if you’re doing Conv three by three, what if you do it six by six, or 12 by 12? So you [inaudible] different.>>What you do is you have a bag
of like 15 or 20 operators. Usually, it’s like 11. It’s like Conv three by
three is one operator, Conv five-by-five is one operator. This is actually what
you are getting at is this other issue, which I’ll come back to.>>You’re greatly
shortening [inaudible]. You’re constraining
your search space.>>You’re constraining. Yes.
But that’s not specific to DARTS. That’s common in any
cell search space, right? Like, you usually take this
discrete bag of operators, like Conv 3 comma 3 is
one operator which is distinct from Conv 5 comma 5, right? They would just show
up as different lines here. Your point is good. So if I have many of these, then if I have 20 of those I will
have 20 lines everywhere, right? Actually what you are getting
at is that now I have to put all of these things
into GPU RAM, right? GPU RAM is not that plentiful, right? Like let’s say I have 16 gigs or 24 gigs and I don’t have just
have this one cell, right? Like I have many of these cells
together, and each one of them has, let’s say, a floating-point 16-bit
weight over each operator, right? So I have this massive
graph which includes the architecture super graph
that I have to put into GPU. So it’s not very memory efficient. So which again means that
if I increase the number of operations to be more reasonable
and consider more choices, this thing is not going
to fit into GPU, right? So this is the thing that
should bother you. So then this paper immediately came out, like a few
months after that, I think less than three
months after that. By
December this was out. What they did is,
imagine that I only keep in GPU RAM one
path through all the cells. So let’s say I have two,
three cells, right? They all have all these
operators and the weights over them. I’m going to sample a path
here through this cell. I’m going to sample
another path through this cell and sample another
path through the cell. So that gives me a full path
through all these cells. Then I’m going to
only at one time take that sample path and in practice I will sample many
of these things together. Then do the ENAS trick
of sharing the weights which are common amongst
all the samples, so I have efficient forward
prop and backprop, and only put those sampled paths in
GPU at any given moment of time.
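The path-sampling idea can be sketched as follows. A minimal numpy sketch assuming softmax-distributed architecture logits per edge and a 0/1 mask over operators; the names and sizes are illustrative, not the ProxylessNAS code.

```python
import numpy as np

rng = np.random.default_rng(0)

NUM_EDGES, NUM_OPS = 3, 4
# Architecture parameters: one logit per (edge, candidate op).
alphas = rng.normal(size=(NUM_EDGES, NUM_OPS))

def sample_path(alphas):
    """Sample one op per edge from softmax(alphas). Only the sampled
    ops need to be instantiated, so memory stays proportional to a
    single path rather than the whole super graph."""
    logits = alphas - alphas.max(axis=1, keepdims=True)
    probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    return [int(rng.choice(NUM_OPS, p=p)) for p in probs]

def binary_masks(path):
    """0/1 mask per edge: the chosen op is kept, the rest masked out."""
    masks = np.zeros((NUM_EDGES, NUM_OPS))
    for edge, op in enumerate(path):
        masks[edge, op] = 1.0
    return masks

path = sample_path(alphas)
masks = binary_masks(path)
```

Training then only touches the masked-in ops, and the logits are updated so that better-performing paths become more likely to be sampled.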
What this basically allows you to do is,
so these are like the
I have four weights. I somehow sampled. I did like
a binomial distribution, flipped a coin, and
said pick this one. Which means everything else
gets masked out, right? I picked that
architecture and trained it including all the other
samples I had drawn. That’s why they call it proxyless, because this is the other caveat in the neural architecture
search world is that all the time what people
usually do is they, because things are so expensive, that they find a proxy data
set, which is usually, like, I’m going to take reduced
ImageNet or CIFAR-10. Nobody really does MNIST; MNIST just doesn’t transfer. I’m going to take something
which is much tinier than the data set that
I actually care about. But you imagine you
are in product world. Your data sets are probably
way bigger than ImageNet. Let’s say your data set
which is like 10 terabytes. This is not something
you would want to do even on a 10 terabyte
data set, right? So what the common thing that people do is that you sub-sample
like very tiny data set. You’ll train on that,
find a good architecture, and then you scale up. When I say scale up,
usually what people will do is they will do
cell-based architecture search, find a good cell design, and then they will just stack
many of these cells together. Then train that on the full
big data set once, right? That you can afford to do. Then the other things you can do
is to take things into account explicitly. So this is probably important
at serving time. Let’s say you have a KPI: I cannot take more than 10 milliseconds during
the forward prop. You want to put that into the
search, bias the search by giving me models which not only
have good loss but also have, let’s say, good latency, right? Not too high latency, because this is also important to me.
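A latency-aware search objective of the kind just described can be sketched as a small helper. The penalty form, target, and weight below are invented for illustration; different papers combine loss and latency differently.

```python
def search_objective(task_loss, latency_ms, target_ms=10.0, weight=0.1):
    """Bias the search toward models that meet a latency KPI:
    penalize only the amount by which predicted latency exceeds
    the target. All names and constants here are illustrative."""
    penalty = max(0.0, latency_ms - target_ms)
    return task_loss + weight * penalty

fast = search_objective(task_loss=0.50, latency_ms=8.0)    # under budget
slow = search_objective(task_loss=0.45, latency_ms=30.0)   # over budget
```

Under this objective the slightly less accurate but in-budget model wins, which is exactly the trade-off the search is being asked to make.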
can just put them in there. That is not specific to this. You can do it. Okay. So last year
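A toy version of such a constraint-biased objective (my own formulation with made-up numbers, not the exact penalty used by any method in the talk):

```python
# Toy sketch of a latency-biased search objective: rank candidate models
# by task loss plus a penalty for exceeding the serving-time KPI.
def search_objective(task_loss, latency_ms, budget_ms=10.0, lam=0.05):
    overshoot = max(0.0, latency_ms - budget_ms)  # only penalize violations
    return task_loss + lam * overshoot

print(search_objective(0.30, latency_ms=8.0))   # under the 10 ms KPI: 0.3
print(search_objective(0.30, latency_ms=14.0))  # 0.30 + 0.05 * 4, about 0.5
```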
Okay. So last year, actually in this same series, I gave the same talk, but in much more of a tutorial format, an hour and a half; it's on YouTube. So if you want to know more details, including what really happens in the policy-gradient updates, have a look at that. The slides will be made available, so you will have the links.>>Going back to the earlier slide: what are the potential issues when you put the multiple cells together? For example, cells 0, 1, 2, 3 follow different steps, so the convergence from one to three is different from two to three, the errors, or how you connect the network together. Could you comment on that? What are the potential issues there? I want to hear more from you.>>So I don't think I caught the details. Are you saying, do we
have, like, convergence issues?>>No, no, no. I'm asking: for this architecture, when you put the cells together, what are the potential issues?>>All I was pointing out is that if you do this thing, then you maintain weights over your whole architecture, and the memory required for representing this graph blows up, right? So let's say you have a graph with N nodes; for every node you now actually have K copies, right? Because K is the number of operations you have in the bag that you are considering. So the graph becomes something like N times K nodes in size, and that is massive, so you can't fit it into GPU memory.>>My question is, for instance: take one, two, three, which is like
a path two-two-three, where you reach the final step from node two. What if they have a conflict, like the result doesn't converge, or you have some issues with it? What if there is a conflict between these two subsets?>>Which two subsets?>>In the graph, the paths one-two-three, zero-two-three, and two-two-three: eventually, is it using the same architecture to do the work, or three different sub-networks? Then the results from different cells might conflict.>>I think I understand what you mean. Okay. This is a small detail,
but it’s an important detail. You can have compilation failures. So let’s say the RNN or
whatever method you used, you sample the graph and the output. So let say from here to here. So let’s say you used a cons
two by two which [inaudible] , this downsample data, but the tensor coming from here is twice the sample. So now three has to deal
with two inputs coming in, which are of different resolutions. So this is compilation failure. Usually, when I’m writing it by hand, I will just make sure that I
will do the math in my head, that all this will be
a tensor of this size, this many channels and I’m
applying this operator, so I can do the math in
my head and make sure; otherwise, PyTorch or
TensorFlow will throw you a size mismatch error
in the first try. So what usually happens is
you do all these hacks. You make sure you’ll write
extra boilerplate code that if they do things
that are coming in are not of the same size, I’m going to do a
one-cross-one convolution and increase the number
of filters to be, let’s say, the max
of these two so that everybody gets on the
same tensor size. So if I add them, or concatenate them, or whatnot, I don’t get compilation failures. So that’s an important
implementation deal. You can very easily get compilation failure when
you sample a network. Okay. So there’s much more details here and the slide deck is I
think for that is also available. Okay. So now, let’s see.
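The boilerplate fix described here, projecting mismatched inputs to a common size before merging them, can be sketched in PyTorch (a toy module of my own; real generated code handles many more cases):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AlignAndAdd(nn.Module):
    """Hypothetical helper for the boilerplate described above: project two
    incoming tensors to a common channel count (1x1 convs) and a common
    resolution (pooling) before the element-wise add, so a sampled network
    never hits a size-mismatch error."""
    def __init__(self, c1, c2):
        super().__init__()
        c_out = max(c1, c2)  # widen to the max of the two, as in the talk
        self.proj1 = nn.Conv2d(c1, c_out, kernel_size=1)
        self.proj2 = nn.Conv2d(c2, c_out, kernel_size=1)

    def forward(self, x1, x2):
        # If the resolutions differ, pool the larger input down first.
        if x1.shape[-1] > x2.shape[-1]:
            x1 = F.avg_pool2d(x1, x1.shape[-1] // x2.shape[-1])
        elif x2.shape[-1] > x1.shape[-1]:
            x2 = F.avg_pool2d(x2, x2.shape[-1] // x1.shape[-1])
        return self.proj1(x1) + self.proj2(x2)

x1 = torch.randn(1, 16, 32, 32)  # 16 channels at 32x32
x2 = torch.randn(1, 32, 16, 16)  # 32 channels at 16x16
out = AlignAndAdd(16, 32)(x1, x2)
print(out.shape)  # torch.Size([1, 32, 16, 16])
```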
We have 20 minutes. So we're calling this Project Petri Dish because, as opposed to most of the methods that I showed you, we are not doing what is basically called backward search. DARTS, ENAS, and that RNN controller from the first paper that I showed you, which are three good representative examples of the literature, all share an implicit assumption: that somehow you have this super graph, this big graph from which I can subsample, prune down, and search for a good architecture. Okay. That's also human domain knowledge that you have to inject. Let's say you get a dataset which is not vision and not NLP or whatnot, so you really have no intuition on what a good super graph to start with is, or how big you should make it. The bigger you make it, the more search time you will take. You may be able to squeeze out more performance at the end if you are willing to spend more search budget in GPU hours, but this also means you may be very wasteful. So what if, instead of doing backward search, which is what 99 percent of the papers out there do, we could do forward search? Forward search, basically, you should think of like this: we call it Petri dish because, like in a Petri dish, we are just growing the network as needed. It has the following flavor: we start from a very small seed network, let's say one or two layers, and then we ask, hey, should I add another layer or not? If I want to add a layer, what kind should it be? This is the mathematical question we are going to pose, and we are going to try to find an algorithm that answers it.
Now, the nice part about doing this forward instead of backward is, first, that I don't need all those things. Second, I can also do macro search, which is more general: I don't need to assume that I have a cell and reduce the search space that way, baking a lot of human domain knowledge into a cell. I also don't need prior knowledge of a backbone, like a ResNet-like or DenseNet-like backbone. Then one more nice part: let's say you have a 10-terabyte dataset, and you already have a really good model for it. Then someone comes by, usually a dev, and says, "Hey, I have this really cool method, you should try this." But what you actually want is, "Hey, I have this really good thing. Can you start with this and modify it, instead of undoing all the work we have done to come up with a good model? Because we actually have put in a lot of expertise and domain knowledge to come up with this model, and right now it satisfies us." You want that property, and it's not easy to get with the previous methods, because they have the flavor of: give me a dataset, I will try things and give you an architecture. It's not easy to warm start from something that already exists, and also to exploit prior knowledge. Okay. So how many of you know gradient boosting,
or does the word ring a bell? Okay. All right. Even if you don't know, it doesn't matter for this. But if you do know, keep gradient boosting in your head; it'll come in useful and you can understand why. For example, if you are training XGBoost, training a forest of decision trees via gradient boosting, which is a very powerful method especially for tabular data where you already know the low-level features pretty well or they have semantic meaning, you are doing boosting. Boosting is the general term where you add weak learners one by one, growing the model in scope and capacity. If you know gradient boosting, this will look very much like that. In fact, this was one of the earliest papers in this direction: the paper was 1989, and I think the tech report is 1991. It's Cascade-Correlation, from CMU. Actually, I'll skip this because I can just show you the actual method. You start with a small model. Let's say this model is tiny, and you somehow have these candidates. Let's say these candidates are a grab bag of operations: a conv five-by-five, max-pool, and so on and so forth. I want to know where I
should add these candidates. Without actually adding them in, I want to know how useful these candidates are going to be if I were to add them. This is like asking a very hard question: "Well, you say you want to know how useful they are, why don't you just add them and evaluate them?" But that's going to run up my GPU costs, because if I add them into the network and I have 10 different candidates, I have 10 different networks; I'm going to have to train 10 different things, and I don't want to do that. I want to try them all at once, but without really trying them and without affecting the original model. If some of them are useful, I'm going to add them in. Adding them in means they just become part of a new network, a regular network with that particular thing added in, and now I can repeat this process. I may take that model, which has become modified, put it back here, and repeat the process again. How am I going to actually do this? Because this seems like magic. So let's say I have this simple original model, drawn with blue lines for the regular connections. Here's really the fun part.
So let's say I have a candidate, one candidate, and I want to add it in during this initialization phase in such a way that I can see what signal it could provide, without actually affecting the activations or any of the gradients flowing back through the blue connections. I can actually do that; it's not that hard. Let's say I decide to add this candidate here. I connect it by this particular type of dotted line, where the forward is identity and the backward is zero. From electrical engineering, if you've heard of the term Zener diode, this is basically like that: a special type of layer where forward activations flow through as-is, but backward gradients are stopped. Then we have the reverse of that, which is here, this one, where the forward is zero. Which means that if I'm doing forward prop, this contributes zero, and the rest is just as it is. So the sum is just whatever the activations of B are, and then it goes through C. So we have verified that during the forward pass, we don't change the activations that would have flowed through the blue connections. Everything is unchanged. During backward propagation, let's see. You will get signals: this connection's backward is identity, so you will get gradients up to here, but no gradients beyond. The gradient at B will actually be the original one, because there is no dependency on this candidate. So during forward and backward, nothing in the original graph changes, but I get to see some gradients at the candidate.
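These two special connections have a standard one-line realization in PyTorch: `x.detach()` is forward-identity, backward-zero (the stop-gradient), and `x - x.detach()` is forward-zero, backward-identity (the stop-forward). A minimal sketch (the variable names and toy tensors are mine):

```python
import torch

def stop_gradient(x):
    # Forward: identity. Backward: zero. (The "Zener diode" connection.)
    return x.detach()

def stop_forward(x):
    # Forward: zero. Backward: identity.
    return x - x.detach()

# Toy backbone activation B and one candidate with weights w.
b = torch.tensor([1.0, 2.0], requires_grad=True)
w = torch.tensor([0.5, 0.5], requires_grad=True)

cand = w * stop_gradient(b)      # candidate reads B without disturbing it
merged = b + stop_forward(cand)  # forward value is exactly b
merged.sum().backward()

print(merged.detach())  # tensor([1., 2.]): forward pass unchanged
print(b.grad)           # tensor([1., 1.]): same as without the candidate
print(w.grad)           # tensor([1., 2.]): the candidate still gets signal
```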
This should remind you of gradient boosting. Because now I can accumulate the gradient arriving at that candidate with respect to B. B is right next to it, and I can see what gradients the candidate is getting. Here's the intuition, the inspiration that comes from Scott Fahlman's original paper: if this signal is actually highly correlated with your error, which gradients are, then that gives me a proxy signal that adding this candidate is good. Because if I then add this candidate into the main network and train it, it can learn weights whose output opposes the error and reduces the error residual. Let's say right now I have some error. If I were to add this candidate in, I have it on good faith that it will be able to actually reduce the residual of the error with respect to the target values I want. This works for whatever task, regression or classification; it doesn't matter. So that was the middle stage.
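The correlation signal can be illustrated with a toy Pearson-correlation score (pure Python; the residuals and candidate outputs below are invented, and this is the Cascade-Correlation-style intuition, not the exact Petridish scoring code):

```python
# Score each candidate by how strongly its output correlates with the
# current residual error; a highly correlated candidate is a promising add.
def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs) ** 0.5
    vy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (vx * vy)

residual = [0.9, -0.2, 0.4, -0.7]     # current error on four examples
candidate_a = [0.8, -0.1, 0.5, -0.6]  # tracks the residual closely
candidate_b = [0.1, 0.2, -0.1, 0.0]   # mostly unrelated to the residual
score_a = abs(pearson(residual, candidate_a))
score_b = abs(pearson(residual, candidate_b))
print(score_a > score_b)  # the correlated candidate looks more promising
```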
If I decide that I want to actually add it, I will add it in gently. There is no theoretical reason for doing this, other than pure intuition: I'm going to take out these special Zener-diode-like connections and replace them with the regular connections that should be there. But in the beginning I'm going to set this weight to one and this weight to zero, so that I don't have a big disruption in the distribution of gradients that each layer sees. Let backprop and the optimizer figure out what the right values of these weights should be, starting gently from there. In practice, these are called stop-gradient layers, and this one is called a stop-forward layer. Then once you decide that you want a candidate in, you remove those special layers, create the plain network with a scale value, and let the network decide what it should be.>>I had a quick question. You said that you
add it in gently: do you add it in the middle of training, or do you start training after? Do you start re-training after you add this node?>>Let's hold that question for three more slides. If you still have it then, ask me again. Hopefully, this will become clear. So, one of the things was: in order to be efficient, we know that we want to amortize the cost of forward prop and backprop. This is a common running theme; without that, you're not really efficient in GPU hours. So I want to try all these operations. Let's say I have K operations, where K can be 10 or 20. I want to try which one of these K operations I should pick here, and which one of the same K operations I should pick there, all in parallel. I can add them all in at once by doing this same stop-gradient, stop-forward trick. One of the things we do is impose an L1 sparsity penalty on all the candidate operations. Then we pick the operation that has the most weight amongst the K. So there's an L1 regularizer over all of this, plus an L1 regularizer over all of that, and we greedily do selection.
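A toy version of that selection rule (the gate values and penalty weight below are made up; the real method learns the gates jointly with the stop-gradient, stop-forward trick):

```python
# Each candidate op gets a scalar gate; training minimizes
# task_loss + lam * sum(|gate|) (the L1 penalty), and afterwards we
# greedily keep the op whose gate survived the penalty best.
def l1_penalty(gates, lam):
    return lam * sum(abs(w) for w in gates.values())

def select_operation(gates):
    return max(gates, key=lambda k: abs(gates[k]))

gates = {"conv3x3": 0.72, "conv5x5": 0.05, "maxpool": -0.31, "identity": 0.02}
print(select_operation(gates))           # conv3x3
print(round(l1_penalty(gates, 0.1), 3))  # 0.11
```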
So this is again like gradient boosting: we are doing greedy selection. Let's say we decided to add something. Then it becomes part of the model. That happens in this middle stage, where we remove the special layers and do the gentle introduction. Then that model moves on to the third stage, where it actually gets trained fully. Depending upon the signal we derive from that, we decide whether it goes back to being a parent model. These are the parents; all the candidates are evaluated on the parents. The parents produce children, and if the children are good and seem promising, they get to be parents and continue the process. This may remind you of evolutionary search just because of the terms I'm using, but we're not really doing evolutionary search here, nor RL. One of the things we do is warm-starting: we keep all the existing model weights around, and just the new layers that got added in get fresh weights. By backprop, they get updated, so the models are always being trained incrementally over time.>>[inaudible] add layers at the top?>>No, we can add layers anywhere.
That's actually part of the search decision you are asking about. That's a good question: you flip a coin and pick one of the layers as input one. You flip the coin again and that gives you input two. Then you pick where the output should go. So you are trying a lot of these combinations together, which is important because you don't know where you should be connecting; that's part of the search space. This slide is just saying what I said: this whole thing is always happening asynchronously, in a distributed manner. You have a pool of parents, and for every parent we are trying to figure out which candidates we should add where. We train the children, and if we decide we like them, they get to come back as parents.
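The warm-starting mentioned just above amounts to simple bookkeeping, sketched here in plain Python (the layer names and sizes are illustrative, not from the actual codebase):

```python
import copy, random

# A child model inherits every trained weight from its parent; only the
# newly added layer is freshly initialized, then everything trains on.
def make_child(parent_weights, new_layer, rng):
    child = copy.deepcopy(parent_weights)  # inherit all trained weights
    child[new_layer] = [rng.gauss(0.0, 0.1) for _ in range(4)]  # fresh init
    return child

parent = {"conv1": [0.5, -0.2, 0.1, 0.9], "fc": [0.3, 0.3, -0.1, 0.0]}
child = make_child(parent, "conv2_new", random.Random(0))
print(sorted(child))                      # ['conv1', 'conv2_new', 'fc']
print(child["conv1"] == parent["conv1"])  # True: inherited, not re-initialized
```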
I haven’t told you is, how do we decide we like a candidate? I told you the mechanism by
which we would like a candidate. How do we know that these
candidates when they are added, are what going back to the parent
view? So here’s what we do. This is nice because, this is important in
the real world is as you are trying things out
asynchronously in all these skews, you are training them,
you get to keep. So this is a scattered plot. So let’s say you care about flops. This is loss and this is cost. So ideally you want to
find the Pareto frontier. The cost could be memory, could be latency, anything you want, or all three of them. Ideally, you want to find the Pareto frontier and know the best possible performance you can get for a given flops budget, the model that gives you the lowest loss, and look that up. If tomorrow you get faster GPUs, or your serving needs change and you're like, "Hey, 40 milliseconds is okay," in these kinds of scenarios I want to just pick that model out. I don't want to re-run my architecture search and say, "Hey, give me a model which is good for 40 milliseconds." I already have this gallery of models which lies along this approximate Pareto-optimal frontier. Note, I don't know where the true Pareto-optimal frontier is, but as I'm training, I get to build this scatter plot. The scatter plot keeps building up, because these are all the things that come out at the end of the third queue, the child models.
I train them, and I just plot them here. So if you think about it, I know for a fact that there are models, for example like this one far above the frontier, where, if it were a parent, it is highly unlikely that a small modification, like adding a small candidate, would result in a model which lowers the Pareto frontier. Since I'm adding something, the cost moves to the right, and it is very unlikely the loss would drop enough to reach the frontier. I'm making a reasonable smoothness assumption that networks which look similar have similar performance. Which means there is really no reason for me to add such models to the parent queue, because all I really care about is the frontier: I want to lower it as quickly as possible, but I don't know where exactly the frontier is. It is quite possible that this other model, even though it is slightly above the current Pareto frontier, could, if I take it and extend it, go really far down and lower the frontier, and that's good news for me; this is an exploration-exploitation problem. So we explicitly maintain this Pareto frontier during search, and we take a small tolerance region around the boundary, because we don't know which model may be promising. Based on their distance from the current estimate of the lower convex hull of the frontier, we explore: those models get added to the parent queue, and we see if they work out. Does that answer your question? Okay. So, results.
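The frontier-plus-tolerance rule can be sketched in a few lines of plain Python (the rule details and the (cost, loss) points below are my own illustration, not the exact Petridish criterion, which uses the lower convex hull):

```python
# Each model is a (cost, loss) pair; keep the non-dominated set as the
# estimated frontier, and promote to parent anything within a tolerance
# of the frontier's best loss at comparable cost.
def non_dominated(points):
    return [p for p in points
            if not any(q[0] <= p[0] and q[1] <= p[1] and q != p for q in points)]

def promote(points, tol):
    frontier = non_dominated(points)
    def slack(p):  # how far above the frontier's best loss at this cost
        return p[1] - min(f[1] for f in frontier if f[0] <= p[0])
    return [p for p in points if slack(p) <= tol]

models = [(1.0, 0.9), (2.0, 0.5), (3.0, 0.45), (3.0, 0.2), (4.0, 0.8)]
print(non_dominated(models))        # [(1.0, 0.9), (2.0, 0.5), (3.0, 0.2)]
print((3.0, 0.45) in promote(models, tol=0.3))  # True: within tolerance
print((4.0, 0.8) in promote(models, tol=0.3))   # False: far above frontier
```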
it’s easy to show numbers but there’s a lot of caveats
in this kind of research. Right now in the community, we have this like comparison and
fairness reproducibility crisis. True in just all of ML but NAS
makes it even harder, why? Because everybody comes up
with their own search space. So then it’s hard to compare the
merits of a search algorithm. Not only that, one of the things we don’t address is
like hyper parameter tuning. It is very well known like let’s
say if you are a competitor, you have a computing algorithm. Just because you have a
better training subroutine, you can come up with a very good like your accuracy is
maybe much higher. If you’re training subroutine
for training a model, a network has all the tips
and tricks like Cutout, Cut mix, this kind of
augmentation has cosine, warm up in learning rates and this and that and that
makes a huge difference. You will not even be
on the leader board even though you have a
great search algorithm. So that makes a huge difference,
what’s your hardware? People are reporting GPU hours, kTGPU hour is not the
same as a V100 or, like a TPU GPU hour, or an FPGA or GPU hour,
so what’s your hardware? We know we have stochasticity
in training on GPUs and like, how many random seeds? How many trials you did? So unless you are a
lab with lots of GPUs, this may be expensive for you
to do and report properly. This is also a multi-objective
optimization problem, and that it’s not
just about accuracy, it’s about accuracy at certain
flops or memory or latency. We know that like, you know, hey, if you by using very fancy
distributed training, I can train like a two gigabyte
model on ImageNet and be sort out, but that’s not very interesting. You are not going to use that two
gigabyte model in production, and that’s a really
uninteresting result. So there’s lots of work being done. Like there’s one NAS
benchmark which came out. This is from Google and [inaudible], the people whose link I was showing. They actually went and trained close to 1.8 million different architectures and put the results in a lookup table. So if you want to, for example, evaluate your search algorithm, as long as you embrace that implicit search space, you can just do lookups on your laptop. You can do NAS research on a laptop by just downloading that lookup table, because it's a few MB of text.
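Once such a table exists, an evaluation step really is just a dictionary lookup, as in this toy (the architectures and accuracies are invented, not from any released benchmark):

```python
import random

# Tabular-benchmark NAS: with (architecture -> accuracy) precomputed,
# "evaluating" an architecture costs zero GPU hours.
bench = {
    ("conv3x3", "conv3x3", "maxpool"): 0.942,
    ("conv3x3", "conv1x1", "maxpool"): 0.931,
    ("conv1x1", "conv1x1", "identity"): 0.903,
}

def random_search(table, n, rng):
    sampled = rng.sample(list(table), n)  # n random architectures
    return max(sampled, key=table.get)    # best accuracy among them

best = random_search(bench, 2, random.Random(1))
print(bench[best])  # accuracy read straight from the table, no training
```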
But things like weight-sharing and DARTS-like search spaces cannot really work there, including our incremental method; more benchmarks are coming soon, though. I'm pretty sure the community will rally around them, and it will also bring NAS research to communities who cannot afford it otherwise; unless you're a big industry lab with, like, 5,000 GPUs, you won't even touch this. So it will democratize things a little bit more. Okay. Results. As I said, look at Zoph and Le: that was like 1,600 GPU days, converted into hours here. We do well on both of these fronts. This is not going to blow
you away, but note that we are not doing any RL; this does not need to be an RL problem. We actually showed that using macro search works. You will see that some of these numbers have plus-minus and some don't, because there's also variance in how people report results. We actually run multiple trials with different random seeds, and there's a list of best practices one should follow for reporting results on any machine learning algorithm. Because we found that, hey, a lot of people are taking the best result they found across all their experiments and reporting that. That's essentially saying: I threw a number of darts, one of them struck gold, and I'm going to report that one and put the trained model up on GitHub. That's not very honest. You should really have
variances and reproducibility. But the results won't blow you away because, again, there's no easy way for me to show you the whole gallery of models we produce. We are just picking one of the points that our algorithm produces and putting it in this table, just so that I can show you some comparisons. But this is not a very fair, apples-to-apples comparison by any means, and there's a huge discussion that goes into that. One of the things we were able to show is that if you transfer to ImageNet, which is this process of picking a cell, stacking many cells, and then training, you can actually do quite well with macro search, with no knowledge of a backbone or cells or anything, close to cell-based search. I'll skip the language modeling
since we are out of time. Just to end: there are lots of research problems left. NAS has had really useful successes, but it's by no means a solved problem; if it were solved, we wouldn't have to write architectures by hand anymore. But right now, as I always tell PhD students and potential PhD students: if your research agenda in academia suddenly becomes "let me hand-design a new architecture for object recognition," you should really rethink that, because nobody beats NAS right now on those kinds of things. We have an implementation on GitHub, but a lot of that implementation
is in TensorFlow, and I know TensorFlow is not the most popular right now. So we're redoing everything in PyTorch, and we're also making the code a little bit more robust. So watch that GitHub link; we'll send out a blog post with all the right links. But one of the asks we have, since you're from all over the company, is: can we get access to first-party datasets? We don't need to publish; the method is already published, and we have a defensive patent on it. Interesting new datasets: they don't need to be just vision and language, they can be anything. We would love to get our hands on them, work with you, and see what we can do, just for our own curiosity. Especially if you have production serving needs with constraints you want taken into account, like "give me the best model which uses less than three MB of GPU RAM," those kinds of scenarios. We would love to get our hands on that. I will end here so that you have a bit of a break.
