[MUSIC]>>So Debadeepta is going to tell us about Efficient Forward

Architecture Search, it’s a very fast approach to doing neural architecture

search. Go ahead.>>Thanks, John. So today, I’ll talk about a method for

doing neural architecture search. But before I talk about a particular method, let's see why we care about

searching for architectures. So this was when I was visiting

John in New York in April 2018. This is actually a

picture from DenseNet. Actually, before we even begin, how many of you would

say you are very comfortable with at least the basics of training a neural network? Okay. Great. TensorFlow, PyTorch, CNTK, you know one of those. Supervised learning, you have trained a neural network

on some dataset. So often, what happens is that, “Hey, we have to figure out

how many layers we need.” Let’s say this is a

dataset that is new. Let’s say, it’s not the

standard well-known dataset on which there has been a lot of communal

domain knowledge, people have tried

stuff, published stuff. But this is your own proprietary first-party dataset and it’s massive. So what should be the

capacity of your network? What should be the particular stride length? Or how should you connect these things to each other,

the layers to each other? This is all very frustrating. You are sitting there and

you're trying things, you have a GPU budget, I am sure. It's not unlimited and you want

to try out many of these things, and you want to achieve some kind of performance

that you care about, like, I want good performance. I don’t want to use too much GPU RAM because during production time, that means I have to allocate

too many machines for serving. I also want to have no more than this much latency. So there are all these

like multiple objectives that you ideally want to satisfy. But before you get to all of that, in the beginning you have a new dataset, a new task, and you just want to figure out something that works reasonably. Let's take supervised learning: we are not going to talk about reinforcement learning as the task. We will talk about reinforcement learning in the solution methods, but the tasks will be

supervised learning. Here's the truth: nobody really knows how to set these things, including people who are experts in the field and even invented neural networks back in the day. A lot of this is trial and error. This is very empirical work. This can feel very magical, like we have all these magic

numbers floating around, and not only this, we

have magic numbers floating around in the

training parameters. That is not even in this picture. What’s the learning rate? What’s

the learning rate schedule? How should we warm start? What should we initialize

the weights as? This is the big motivation for trying to take this guesswork out of the hands of humans and put it into the hands of algorithms, and hence there is this very nice subfield of machine learning now called

neural architecture search, and we have all these problems we talked about and there’s also

like a massive talent shortage, and I think this room is a biased sample, but in the real world getting people who have the right skill sets and hiring them and training them if

needed is not easy, right? If you are out there in the field, you probably know this

much better than I do. So first, before we dive into our particular method, we're going to do a 15-minute literature overview, and before we even go there

into the literature overview, we’re going to look at what

is the problem definition. So the problem definition is: we have a search space, and remember, this is not very difficult, this is not very complicated. All I actually want to do is,

given a search space which is the set of all realizable

neural network architectures, I want to find an architecture that fits all my criteria, like it has this much accuracy, this much latency, and so on, if possible, or comes as close to them as possible, plus a training procedure that hopefully can train it, and I'm all good. So as with any search algorithm, this is no different than

any search algorithm. If you're doing search of any kind, all the same things come into play. There are some challenges, which is the third column. But the first one is: we need a search space. If you're searching for

something, you need to define a space to search it. We will see that there are

usually two kinds of spaces, and in informal lingo they will either be called macro or micro. We will see examples of that. Right now, a lot of domain engineering goes into what should be the right search space, because here's the trick: if I can somehow come up with a search space which I know will have all the right primitives, like it has the conv layers,

it has transformers, it has self attention units

and all of these things, batch normalization and

all the right operations, and I can also inject some priors

into the search space that, hey, I know that if I have a conv unit, after that I should not be doing

a batch normalization this way. I know how layer normalization

works from other empirical papers. I can prune out the search space by just injecting a lot of

human domain knowledge, and the debate right now is how much domain knowledge should I inject into the search space, and which one we should work on. So we will look into some details of that, and the pros and cons. Now, once you have a search space, the big problem in architecture search is that the search is very expensive, because

here’s what really happens, in order for you to try out

architectures one after the other and find the right one, you literally have to somehow

find an architecture, train it all the way to the end, and imagine you have a dataset which is ImageNet-sized, or in language this is training a BERT model or something. This becomes your subroutine. There is some algorithm, the search algorithm, which is

searching for architectures and giving you an architecture

somehow and your worker threads have to go and try

them on your dataset and then they will report

back some statistics like, oh, this particular architecture has this much memory usage, this much latency, this

much accuracy, and so on. You take these statistics and update your search algorithm on what to try next, or what other things to try next, and all of these things actually happen asynchronously in any proper implementation.
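The loop just described can be sketched in code. This is a hedged illustration only: the toy search space, the fake statistics, and the latency budget are all invented, and the "training run" is a stand-in function.

```python
import random

# Hedged sketch of the generic NAS loop from the talk: a search
# algorithm proposes architectures, workers evaluate them, and the
# reported statistics drive what to try next.

SEARCH_SPACE = {  # toy search space, invented for illustration
    "num_layers": [2, 4, 8],
    "width": [16, 32, 64],
}

def propose(history):
    # Stand-in search strategy: plain random search.
    return {k: random.choice(v) for k, v in SEARCH_SPACE.items()}

def evaluate(arch):
    # Stand-in for an expensive training run; returns fake stats.
    accuracy = 1.0 - 1.0 / (arch["num_layers"] * arch["width"])
    latency_ms = 0.1 * arch["num_layers"] * arch["width"]
    return {"accuracy": accuracy, "latency_ms": latency_ms}

def search(budget=20, latency_budget_ms=10.0):
    history = []
    for _ in range(budget):
        arch = propose(history)
        history.append((arch, evaluate(arch)))
    # Prefer architectures that meet the latency constraint.
    feasible = [h for h in history
                if h[1]["latency_ms"] <= latency_budget_ms]
    return max(feasible or history, key=lambda h: h[1]["accuracy"])

best_arch, best_stats = search()
```

In a real system the proposals, evaluations, and updates run asynchronously across many workers, exactly as described above; the sketch collapses that into one loop.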

But all the expense of neural architecture search comes down to the fact that every point function evaluation

is super expensive. So I have this search space

of neural architectures; I pick a particular point, and that point is an architecture. I have to go and train it. This training can take two days on a 4-GPU box. So this is super expensive, and there are uncountably

many of them, depending upon how I define my search space. These are usually discrete things, and I'm putting aside continuous hyperparameters, so they're not really uncountably many, but you could easily have millions and millions of things to try. So which ones should I try? I don't want to use 10 million GPU hours, right?
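To make "millions and millions" concrete, here is a hedged back-of-the-envelope count; the numbers are illustrative, not from the talk.

```python
# With L layers and K operator choices per layer (ignoring
# connectivity choices entirely), a discrete search space already
# has K**L distinct architectures.

def space_size(num_layers, num_ops):
    return num_ops ** num_layers

# even a modest space is in the millions
small = space_size(num_layers=10, num_ops=5)  # 9,765,625
```

And each of those points costs a full training run to evaluate, which is exactly why the search strategy matters.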

So the search strategy is all about how efficiently we can search through these things. Then performance

estimation is more like a sub-part of the search strategy: when I'm searching and something looks like a good candidate to try, by using whatever algorithm, I'm going to go and see how well it performs, but I don't want to actually train it for two days. So let's say it takes two days to train on an ImageNet-sized dataset on a 4-GPU box. How quickly, how approximately, can I evaluate that function and derive a reliable signal of

how well it performs or not and get out without being fully committed to it because

that takes up a lot of time. So usually if you look at neural

architecture search literature, most of the differences are

along these three dimensions: what's your search space, what's the search strategy, and how you are doing performance estimation. So let's look at and define some of the search spaces, because this is important. So this is what is called

as a cell search space and what happens in the

cell search space is, let’s see how many of you are

familiar with let’s say ResNet? Okay, good. So if you know ResNet, you know that what happens

in ResNet is that there is this really nice skip connection, and there's a regular cell and then there's a reduction cell, which is there to down-sample the resolution, increase the number of channels, and so on. But really, in ResNet, the design is not that complicated. There is a cell, which is basically this little motif structure right here, and in ResNet all the cells are the same; in this one it is a little bit more general, the motifs are different, and all you're doing is lining up the cells and connecting

them via skip connections. If you want, let’s say,

I want a bigger thing, add more cells and all the

cells are designed the same. This is very popular in literature

because the nice part is you can inject a lot of human domain knowledge and

reduce the search space. Your search space is

not fully general. So let's say you say, hey, I'm going to use the ResNet-like backbone, which is the skip connections that you have between cells, and I'm going to keep the search space constrained to finding what the motif, the cell, should be. Then if I want a bigger network, I just add many of these cells together, stack them, I just stamp many of these cells together, and I get bigger networks.
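A minimal sketch of that stacking idea; the cell and network representations here are invented purely for illustration.

```python
# Search for ONE small cell, then build networks of any depth by
# stacking copies of it, with periodic down-sampling stages in the
# spirit of ResNet.

def make_cell(ops):
    # a cell is just an ordered list of operation names
    return list(ops)

def stack_network(cell, num_cells, reduction_every=2):
    """Stack copies of the same cell; insert a 'reduce' marker
    periodically (down-sample resolution, widen channels)."""
    network = []
    for i in range(num_cells):
        network.append(make_cell(cell))
        if (i + 1) % reduction_every == 0:
            network.append(["reduce"])
    return network

cell = make_cell(["conv3x3", "relu", "skip"])
small = stack_network(cell, num_cells=2)
big = stack_network(cell, num_cells=6)  # bigger network, same cell
```

The point is that only `cell` is searched; network depth is free.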

As you can imagine, you are not searching for what the outer skeleton should be, and this really reduces the number of things you have to try, the size of your search space; there is a lot of human

domain knowledge injection. So some smart people came up with ResNet, and we are going to utilize this information, that skip connections are good and having these cells is enough, it seems, and we are going to figure out what the cell design should be, and this is the most popular search space. Then of course, there's the other

one which is what you would actually expect if you didn’t

know about the cell search space, which is, well, I have a fully general search

space, anything goes, any particular type of

connections and any size, any wonky weird shape is all on the table and this is obviously

much bigger and a lot of research shies away from using this search space because

even for doing this research, you need a lot of GPUs and I will show you results

later that will show that actually you can use this search space and you

can be competitive with the cell search space, even if you use a much bigger search space, and get similar results. The slides will be made available, either through the website or e-mail me, so all the links will be there,

so you don’t need to worry. Some really good people maintain this really nice list of

architecture search papers. To just give you an example

of how hot this field is, let me see, okay, so actually the browser is on. So remember, we are only in October, there’s two month left in the year. So there’s a group in Fribourg who maintains this

list of machine learning papers in neural architecture search related papers which either published or not published and I think the bold ones are published

and this is 2019. I’m scrolling down,

we are still in 2019, a little bit more. We’re still in 2019,

even I'm surprised, I didn't expect I'd have to scroll so much. Okay, 2018, I see '18, good. Oh boy, even '18 was pretty big. Hey, '17, this is good. '17 is much shorter, then '16, and then the list ends. From '16 you go all the way back to '94, '90, '89, '88. So there was this big gap of, I forget how to do the math, 25 years or something, where we didn't do anything on this topic. So this is pretty incredible. I think there are 50-60 papers on arXiv. I dread waking up in the morning and looking at arXiv. So this is a really nice survey article I recommend reading because, as it happens with any field in

which it is going viral, if you tried to read all the papers that came out just this month, you would have to give up your day job and just sit and read. You don't have to, though; that's the good news, because a lot of it is nice, and there are a few key ideas that you really need to know. I recommend that survey paper for more details, and within this short talk, I'll spend five more

minutes to talk about three or four key papers

and then we'll move on to our paper, which will come out in December but is already on arXiv. So the thing that actually

started, or I should say restarted, the modern era of interest in neural architecture search, and as you saw, this is not the first paper; it is probably the 10th or 11th paper. The first paper was 1988, the second was 1989 at CMU, but this one really restarted things. So four or five years into the deep learning revolution's reawakening, this paper came out, and

this paper was super-simple. Some of you I think took

the classes on Explore, Exploit, and Basics of

Reinforcement Learning. So if you did, this should look pretty trivial. So I’m going to do this

policy gradient method, and if you don't know RL or policy gradients, if these words are alien to you, you don't need to worry about it at all. Here's the simple thing. I'm going to take one neural network. In this case, it's an RNN, which basically gives you

a seq to seq model because we are going to encode neural networks as a

sequence of outputs. Like, oh, it's conv 5, followed by a [inaudible], followed by a max pool, and so on and so forth. So this RNN, by unrolling it, is going to output a network definition. So that's what I'm going to call the controller, and when I get the network definition from it, I'm going to send it off to, let's say, a machine somewhere in the Cloud, and I'm going to train it on the dataset. Let's say I have ImageNet or CIFAR or something as the dataset, just to give an example. It will train, and I'll get some accuracy and some

other statistics back, and I am going to update this. So if really good things

happened for that particular, like your network that was

sampled by this controller, during the update process, I’m going to update

in such a way such that those kinds of networks

would be more likely. There's a little bit of very trivial math that goes into how you do the update from this, but it's very simple. You're just going to make this network, which spits out definitions of other networks, make the networks with good training results more likely and the other ones less likely, and we are going to use essentially supervised-learning-style updates to do this. To give you a very concrete example, that controller network

gets unrolled, and it gives you this: the number of filters, filter height, filter width, stride height and width. So this is one layer, layer n. Then you keep unrolling and it'll give you another layer, and so on and so forth, until it outputs stop, and that's your network definition.
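The controller-and-reward loop just described can be sketched as follows. This is a hedged illustration only: a table of logits stands in for the RNN controller, `fake_accuracy` stands in for the expensive training run, and all names and numbers are invented.

```python
import math
import random

random.seed(0)  # fixed seed just so the sketch is reproducible

# One slot per decision the "controller" makes; real controllers
# emit these choices sequentially from an RNN.
CHOICES = {"op": ["conv3x3", "conv5x5", "maxpool"],
           "filters": [16, 32, 64]}
logits = {slot: [0.0] * len(opts) for slot, opts in CHOICES.items()}

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def sample_arch():
    arch, picked = {}, {}
    for slot, opts in CHOICES.items():
        probs = softmax(logits[slot])
        i = random.choices(range(len(opts)), weights=probs)[0]
        arch[slot], picked[slot] = opts[i], i
    return arch, picked

def reinforce_update(picked, reward, lr=0.5):
    # REINFORCE-flavoured update (baseline omitted): increase the
    # log-probability of the sampled choices, scaled by the reward.
    for slot, i in picked.items():
        probs = softmax(logits[slot])
        for j in range(len(logits[slot])):
            indicator = 1.0 if j == i else 0.0
            logits[slot][j] += lr * reward * (indicator - probs[j])

def fake_accuracy(arch):
    # stand-in for training the sampled network on a dataset
    return 1.0 if arch["op"] == "conv3x3" else 0.0

for _ in range(300):
    arch, picked = sample_arch()
    reinforce_update(picked, fake_accuracy(arch))
```

After a few hundred updates the logits drift toward the high-reward choice, which is the whole trick: networks that trained well become more likely to be sampled again.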

This became quite popular; it is a very well-cited paper. Unfortunately, and I will show you results comparing all of these later, this was extremely expensive: a full search on CIFAR-10 took, I forget the exact number, something like 1,800 or 3,000 GPU hours, and this is probably way more than you want to afford in terms of GPU cost. So the next paper that came out

made this cost go from 1,800 GPU hours down to obtaining a network which does well on CIFAR-10, and ImageNet too, in one day. So at that time, I think it beat humans, don't quote me on that though, on CIFAR-10 as well as on Penn Treebank; all those numbers have since improved, because this was 2018 and this is '19. So this is cheap by deep learning standards. In all the experiments, they used a single 1080 Ti GPU. Now this starts to look much

more attractive because the paper before even though

it was the paper that started, let's say restarted, the interest, was pretty impractical; it didn't seem like a good method. It wasn't efficient at all. This starts to look much more

interesting and attractive. So let’s talk about how

you go about doing this. They did the same thing as before, where you had the RNN sampling networks and then you went and trained them and updated the RNN, and obviously you are doing all these sampling things in parallel. You sample, let's say, 1,000 networks in parallel and you send them off to 1,000

worker threads and whatnot. But note the big expense; what is the biggest expense? The biggest expense is the cost of doing forward-prop and backprop, and mostly backprop, which is much more dominating, and

updating all the weights. That is, if I sample 10 different architectures and send them off, I'm doing 10 times the number of epochs, times the number of dataset points in every epoch divided by the minibatch size; that many backprops. That's where all my forward-prop and backprop, and all my compute, is going. This is the very expensive part.
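The accounting above can be made concrete with a hedged back-of-the-envelope; all the numbers are made up for illustration.

```python
# Training N sampled architectures independently costs N times the
# number of gradient updates a single architecture needs.

def num_updates(num_archs, epochs, dataset_size, batch_size):
    steps_per_epoch = dataset_size // batch_size
    return num_archs * epochs * steps_per_epoch

independent = num_updates(10, 50, 50_000, 128)  # 10 separate runs
shared = num_updates(1, 50, 50_000, 128)        # one shared graph
```

The factor of 10 here is exactly the multiplier that weight sharing, described next, tries to remove.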

So this was a completely crazy idea, and it will sound barbaric when I say it: what if we didn't do that? What if, let's say, I sampled five architectures

to keep it simple, and the five architectures are

very similar to each other, except they differ in some edges. I have one architecture and it looks almost the same

as the one next to it, except like it has this edge, it doesn’t have this edge. Let’s share their weights. So I’m going to take

those two architectures, put them in the same

computational graphs. So let’s say they have all

these edges common to them. I’m going to let them share it, and I’m going to do

one forward-prop and one backprop through these two

architectures at the same time, and I’m going to update their individual loss

functions individually. But the gradients are like

flowing back in aggregate, and so the edges

which are not common, they only see gradients for each of their respective

loss functions, but the rest of the network sees all of the gradients coming in. This sounds crazy, but it is very efficient, because

what you are doing is just one forward-prop and one backprop, and let's say I have K networks; all K networks I can evaluate or train at the same time.>>So you're saying that

basically aggregating all these network

structures together so you have partially connected networks.>>So here’s the example. So what you do is here’s

one big limitation, you have to assume that

there is a big super graph, like there is this

big graph, which is your search space, which contains all possible networks, and you

are sampling from that and let’s say so for illustrative

purposes, this is the big graph. The fully densely, almost

densely connected one. This red one is one sub-graph

which is a particular network, and that RNN sampled one

of these, and I took it. Then I ask it to

sample me another one, and let’s say it sampled

another one which looked like all these

edges were common, but instead of this

edge it went this way, and maybe something else happened. So all these edges which were common, I’m going to create this

new graph which has basically the superposition

of these two graphs together, take the union of those two graphs. So I will have some edges

which are different, but all the other edges are common, and I will treat that as one graph, and I’m going to do just one

forward-prop and backprop through this superposed graph. It's not really a superposition, just take it as the union. There could be an edge which is common across, let's say, 10 different graphs all at the same time, and this really amortizes the cost of search.
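The weight-sharing bookkeeping can be sketched as follows; the edge names, the "gradients", and the update rule are all invented, purely to show the sharing mechanics rather than a real training loop.

```python
# Two architectures that share edges keep ONE weight per shared
# edge, so training either one updates the common weights.

shared_weights = {}  # edge name -> weight, the "super graph"

def get_weight(edge):
    return shared_weights.setdefault(edge, 0.0)

def train_step(architecture, grads):
    # architecture: list of edge names; grads: per-edge gradient
    for edge in architecture:
        shared_weights[edge] = get_weight(edge) - 0.1 * grads[edge]

arch_a = ["in->h1", "h1->out"]
arch_b = ["in->h1", "h1->h2", "h2->out"]  # shares "in->h1" with A

train_step(arch_a, {"in->h1": 1.0, "h1->out": 1.0})
train_step(arch_b, {"in->h1": 1.0, "h1->h2": 1.0, "h2->out": 1.0})
# "in->h1" was updated by both steps; the other edges only once
```

The shared edge accumulates updates from every architecture it belongs to, which is both the efficiency win and the specialization worry discussed next.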

So you might ask why this would ever work, because the edges should get specialized. Let's say there is an edge which is part of K different graphs. Isn't this really bad, because that edge, in terms of its particular weight values, will take on trying to be useful to many graphs at once, instead of specializing to the particular graph it belongs to? The conjecture is that we have some good regularization

effects, but this works. Well, if you read the

paper, I couldn’t find any deep understanding of why

this would or wouldn’t work. In many ways, maybe you can think

of it as multi-task learning. You’re regularizing the

weights by saying that, “Hey, you have to be good

across many different graphs.”>>[inaudible]>>Multitask for the

weights perspective. So the weights have to learn

a value such that they're good across many different graphs, if you treat every different graph as a task, loosely speaking. You don't have to call it a task, but I get some regularization effects. You can use the same

tricks to sample RNNs: get an RNN to sample another RNN. For RNNs actually, what happens is that your cell search and macro

search are equivalent. There is no macrospace. RNNs are cells. You are just rolling out the

cell so everything is nice, and so on and so forth. If you look at very quickly

at the results, let's see. Yeah, look at the number of GPUs: 800 GPUs, and these are the number of days, for the paper I showed you. So multiply this by this; this is a huge GPU budget. Then we come down to one. After doing this, you

are getting this, and now we come down to this. There’s also the

number of parameters. So in almost every

respect you become, this is strictly better, uniformly better on all the

fronts that you care about. So this is a good paper to know. They discovered networks on image datasets that you will probably not come up with if

you are a human being. I don’t know about you, but I will not come up with this. There’s no way I will

figure this network out. If you are doing cell search, this was the cell here and this was the cell there, but it doesn't really matter; the whole point is that these are not intuitive things that you would very likely sit down and come up with. This next one is actually my favorite paper; this is work from CMU and DeepMind called DARTS, which came out

sometime mid-last year. This one is nice. There is

no reinforcement learning. There are no controllers. By the way, feel free to stop me and ask questions anytime; it will be more fun that way. Because if you think about it, it was weird that the 2017 paper posed this as a full reinforcement learning problem, more than that, as a stateful RL problem with policy gradients. Because you can actually have gradient information here. This is like feature selection. So why would I pose it as the

hardest possible setting and that’s obvious in the

numbers that you get back that it’s so inefficient

in finding something good. There are some limitations

here: they only do cell-based search, and you

will see very quickly why. They can’t really handle

big search spaces. They did really well in

terms of search time. So here’s the idea. This is very similar to

the ENAS idea is that, hey, imagine the following

that I have got. So this is a cell and

the cell, let’s say, has four nodes, this is just for illustration purposes, 0, 1, 2, 3. The nodes here are tensors and

they’re connected by operations. So from this node, so let’s say from going

from zero to one, let’s say I pick ReLu. Then it will take the tensor at zero, apply ReLu to it. You get a new tensor and that

new tensor will show up at one. Usually, this is a discrete

optimization problem. If I keep the

hyperparameters constant, finding architectures, it’s a

discrete optimization problem. I have to figure out which

discrete architecture I want. They did this nice continuous relaxation, which basically means that I'm going to consider soft weights, weights over all

possible architectures. For illustration purposes,

I have three operators, ReLu, Conv 3 comma 3, and Maxpool. From zero, I can go to one or to any of these nodes by using

any of these three operators. Depending upon which specific

choice of operator I choose, I get a particular

architecture at every place. I could instantiate a particular

architecture based on, let’s say, this one was ReLu, this one was Maxpool, and this one here was Conv and so on and so forth I get

a particular architecture. But instead of that,

I’m going to keep all possible architectures

around at the same time. I'm going to keep weights on them. So now we have two kinds of weights. We have the weights of the network, which are like the weights that you would traditionally understand: if I have a Conv layer, I have

the weights and biases to train if you have biases

at all to train in that. But then I’m going to have

this new kind of weights called weights over the architecture. So I’m going to have, let’s say, a weight Alpha 1, Alpha 2, Alpha 3, Alpha 4, Alpha 5, Alpha 6, so on and so forth where one Alpha corresponds

to one operation. If that Alpha is very high, that means that the

network, for example here, if Alpha 2 is high that

means that the optimizer is trying to say that for

this particular connection, let’s say ReLu is really good, and for this one ReLu is really good, this one Convs are good, and so on and so forth. But note that we’re

keeping everything around, but based on the optimization, whatever values of

weights these take on, are giving you a continuum

of architectures. This is key. You will ask: but I want one architecture at the end of the day. You take the argmax over all the edges and you get a particular architecture: take the best weight and the corresponding operations.
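The continuous relaxation can be sketched concretely. This is a hedged toy: the operations act on plain numbers instead of tensors, and the op set and alpha values are invented.

```python
import math

# DARTS-style mixed operation: instead of picking ONE operation
# per edge, compute a softmax-weighted sum of ALL candidate
# operations, with one alpha (architecture weight) per operation.

OPS = {
    "relu":     lambda x: max(0.0, x),
    "identity": lambda x: x,
    "zero":     lambda x: 0.0,
}

def softmax(xs):
    m = max(xs)
    exps = [math.exp(v - m) for v in xs]
    total = sum(exps)
    return [e / total for e in exps]

def mixed_op(x, alphas):
    """Edge output = sum_k softmax(alpha)_k * op_k(x)."""
    probs = softmax(alphas)
    names = list(OPS)
    return sum(p * OPS[n](x) for p, n in zip(probs, names))

def discretize(alphas):
    """At the end, take the argmax: one concrete op per edge."""
    names = list(OPS)
    return names[max(range(len(alphas)), key=lambda i: alphas[i])]

alphas = [2.0, 0.5, -1.0]     # architecture weights for one edge
y = mixed_op(3.0, alphas)     # soft blend of all three ops
chosen = discretize(alphas)   # "relu" has the largest alpha
```

Because the mixture is differentiable in the alphas, ordinary gradient descent can move weight between operations, which is what makes the search continuous.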

It turns out that you can do this; there's some optimization background here, a thing called bi-level optimization. There are two levels, as you can understand. The inner loop is: I want to train

my network the regular way, and the upper level is maintaining weights over architectures. So I'm going to alternate doing this inner loop and this outer loop. During all this time, the optimization is still efficient in terms of forward-prop and backprop, because my backprop cost is amortized across all of the architectures at once. In order to evaluate these architecture weights, it turns out that all I need to do is one forward-prop and one backprop, and I can update them.
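The two-level loop can be made concrete with a toy alternating-update sketch. Everything here is invented for illustration: the quadratic losses stand in for training and validation loss, and the alpha step is a crude first-order shortcut, not the exact DARTS gradient.

```python
# Alternate a "train" step on the ordinary weights w (inner level)
# and a "validation" step on the architecture weight alpha (outer
# level). The toy objectives have a clean fixed point at w=alpha=1.

w, alpha, lr = 0.0, 5.0, 0.1

def train_loss(w, alpha):   # inner objective (training split)
    return (w - alpha) ** 2

def val_loss(w):            # outer objective (validation split)
    return (w - 1.0) ** 2

for _ in range(500):
    # inner step: fit w to the current architecture
    w -= lr * 2 * (w - alpha)
    # outer step: since the inner loop drives w toward alpha,
    # move alpha in the direction that lowers validation loss
    alpha -= lr * 2 * (w - 1.0)
```

Both variables settle near 1, illustrating the alternation: the inner level chases the architecture, the outer level steers the architecture by validation performance.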

So again, unlike ENAS's discrete sampling, I'm going to keep continuous values over all the architectures, and I'm happy. So one problem that

you might guess is: what do you think would be a problem with this? This sounds good, but there are some problems. The optimization is approximate; unless you delve into the math, you might not see it. But practically, would

you want to do this? Or what do you foresee

as the problem?>>Run time for this?>>Run time, this is less than a day. The results are really good at least on CIFAR and you

mentioned it and whatnot. They’re just as good

as ENAS, if not better. Something must really bother you, though. Okay.>>So on those graphs, what are those three lines laid on the trajectory? What does each line [inaudible]? Did you run it three times?>>No. Each line represents

a particular operation. It’s an operator. So let’s say

you have a set of operators, like ReLU, max pool, conv, and each line represents how much weight you gave to

each one of those operators.>>But the number of operators

is almost uncountable, since any number of hyperparameter settings inside an operator makes it a new operator, right?>>So assume the hyperparameters are constant.>>But if you're convolving three by three, what if you do it six by six, or 12 by 12? So you [inaudible] different.>>What you do is you have a bag of like 15 or 20 operators. Usually, it's like 11. Conv three by three is one operator, conv five by five is one operator. This is actually what

you are getting at, this other thing. [inaudible]>>You're greatly shortening [inaudible], you're constraining your search space.>>You're constraining. Yes.

But that's not unique to DARTS; that's common to any cell search space, right? You usually take this

discrete bag of operators, like Conv 3 comma 3 is

one operator which is distinct from Conv 5 comma 5, right? They would just show

up as different lines here. Your point is good. So if I have many of these, then if I have 20 of those I will

have 20 lines everywhere, right? Actually what you are getting

at is that now I have to put all of these things

into GPU RAM, right? GPU RAM is not that plentiful, right? Like let’s say I have 16 gigs or 24 gigs and I don’t have just

have just this one cell, right? I have many of these cells together, and each one of them has, let's say, a 16-bit floating point weight over each operator, right? So I have this massive

graph which includes the architecture super graph

that I have to put into GPU. So it’s not very memory efficient. So which again means that

if I increase the number of operations to be more reasonable

and consider more choices, this thing is not going

to fit into the GPU, right? So this is the thing that should bother you. So then this paper immediately came out, like a few months after that, I think less than three months after; by December this was out. What they did is: imagine that I only keep in GPU RAM one path through all the cells. So let's say I have two or three cells, right? They all have all these operators and the weights over them. I'm going to sample a path

here through this cell. I’m going to sample

another path through this cell and sample another

path through the cell. So that gives me a full path

through all these cells. Then I’m going to

only at one time take that sample path and in practice I will sample many

of these things together. Then I do the ENAS trick of sharing the weights which are common amongst all the samples, so I have efficient forward-prop and backward-prop, and I only put those sampled paths in GPU at any given moment of time. What this basically

allows you to do is, so these are like the

architecture parameter weights. So let say Alpha, Beta,

I have four weights. I somehow sampled, I did like a binomial distribution, flipped a coin, and said pick this one, which means everything else gets masked out, right? I picked that architecture and trained it, along with all the other samples I had drawn.
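The path-sampling bookkeeping can be sketched as follows; the cell and edge data structures are invented purely for illustration.

```python
import random

# Each edge holds K candidate ops; at each step we sample ONE op
# per edge (a binary mask), so only that path needs to sit in GPU
# memory rather than the whole super graph.

random.seed(0)  # fixed seed just so the sketch is reproducible

CELLS = [  # per cell: per edge, the list of candidate op names
    {"e0": ["conv3x3", "conv5x5", "maxpool"]},
    {"e0": ["conv3x3", "skip"], "e1": ["maxpool", "skip"]},
]

def sample_path(cells):
    """Pick one op per edge in every cell -> a single sub-network."""
    path = []
    for cell in cells:
        chosen = {edge: random.choice(ops)
                  for edge, ops in cell.items()}
        path.append(chosen)
    return path

path = sample_path(CELLS)
# only |edges| ops are active, instead of all K per edge
active_ops = sum(len(cell) for cell in path)
```

In the real method the per-op probabilities are learned rather than uniform, but the memory argument is the same: only the sampled path is instantiated.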

That's why they call it proxyless, because this is the other caveat in the neural architecture search world: because things are so expensive, what people usually do is find a proxy dataset, usually something like reduced ImageNet or CIFAR-10. Nobody really does MNIST; MNIST just doesn't transfer. I'm going to take something

which is much tinier than the data set that

I actually care about. But imagine you are in the product world: your datasets are probably way bigger than ImageNet. Let's say your dataset is 10 terabytes; a full search is not something you would want to do on a 10-terabyte dataset, right? So the common thing people do is sub-sample a very tiny dataset. You'll train on that,

find a good architecture, and then you scale up. When I say scale up,

usually what people will do is cell-based architecture search: find a good cell design, and then just stack many of these cells together. Then train that on the full big dataset once, right? That you can afford to do. Then the other thing you can do

is take into account explicitly. So this is probably important

when you are observing time. Let’s say you have a KPI, I cannot be more than 10 milliseconds during

the forward drop like you know. You want to put that into the

search like bias the search by giving me models which are not

just have good loss but also have, let’s say good latency, right? Like you know not too high-latency because this is also important to me. There are many other things you

can just put them in there. That is not specific to this. You can do it. Okay. So last year

actually, in the same series, I gave this same talk on YouTube, but in much more of a tutorial format, an hour and a half long. So if you want more details, including what really happens in the policy-gradient updates, have a look at that; the slides will be made available, so you will have the links.>>Back to the slide
that you were presenting: what are the potential issues of putting the multiple cells together? For example, the nodes 0, 1, 2, 3 follow different paths before all this, so the convergence from one to three is different from two to three, in the errors, or in how you combine the network together. Could you comment on that? What are the potential issues there? I want to hear more from you.>>So I don't think I
caught the details. Are you saying that we would have convergence issues?>>No, no, no. I'm asking about this architecture, putting the cells together: what are the potential issues?>>All I was pointing out is
that if you do this thing, then you maintain weights over your whole architecture, and the memory required for representing this graph blows up, right? Say you have a graph with V vertices: every vertex now effectively becomes K vertices, where K is the number of operations in the bag you are considering. So the graph grows to something like K times V vertices. That's massive, so you can't fit it into GPU memory.>>My question is, for example, for the paths one-two-three, which is like
the path two-two-three: you reach the final step from two different routes. What if they have a conflict, like the result doesn't converge, or you have some issue with it? What if there is a conflict between these two subsets?>>Which two subsets?>>In the graph, the paths one-two-three, zero-two-three, two-two-three: eventually, is it using the same architecture to do the work, or does it use three different sub-networks? The results from the different cells might conflict.>>I think I understand what you mean. Okay. This is a small detail,
what you mean. Okay. This is a small detail

but it’s an important detail. You can have compilation failures. So let’s say the RNN or

whatever method you used, you sample the graph and the output. So let say from here to here. So let’s say you used a cons

two by two which [inaudible] , this downsample data, but the tensor coming from here is twice the sample. So now three has to deal

with two inputs coming in, which are of different resolutions. So this is compilation failure. Usually, when I’m writing it by hand, I will just make sure that I

will do the math in my head, that all this will be

a tensor of this size, this many channels and I’m

applying this operator, so I can do the math in

my head and make sure; otherwise, PyTorch or

TensorFlow will throw you a size mismatch error

in the first try. So what usually happens is

you do all these hacks. You write extra boilerplate code so that if the incoming tensors are not of the same size, you do a one-by-one convolution and increase the number of filters to, say, the max of the two, so that everything ends up at the same tensor size. Then whether I add them, concatenate them, or whatnot, I don't get compilation failures. That's an important implementation detail: you can very easily get a compilation failure when you sample a network. Okay. There are many more details here, and I think the slide deck for that part is also available. Okay. So now, let's see.
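The shape-matching hack just described can be sketched in a few lines of numpy. This is a minimal illustration under my own naming, not the speaker's actual code: when two sampled inputs disagree on channel count, project the thinner one with a one-by-one convolution before summing.

```python
import numpy as np

def conv1x1(x, w):
    """Pointwise (1x1) convolution: a per-pixel linear map over channels.
    x has shape (C_in, H, W); w has shape (C_out, C_in)."""
    c_in, h, w_px = x.shape
    return (w @ x.reshape(c_in, -1)).reshape(w.shape[0], h, w_px)

def match_and_add(a, b, rng):
    """Project both inputs to max(C_a, C_b) channels so they can be summed.
    Assumes the spatial sizes already agree; a stride mismatch would also
    need resampling."""
    assert a.shape[1:] == b.shape[1:], "spatial mismatch needs resampling too"
    c = max(a.shape[0], b.shape[0])
    if a.shape[0] != c:
        a = conv1x1(a, rng.standard_normal((c, a.shape[0])))
    if b.shape[0] != c:
        b = conv1x1(b, rng.standard_normal((c, b.shape[0])))
    return a + b

rng = np.random.default_rng(0)
a = rng.standard_normal((16, 8, 8))   # a 16-channel feature map
b = rng.standard_normal((32, 8, 8))   # a 32-channel feature map
out = match_and_add(a, b, rng)
print(out.shape)  # (32, 8, 8)
```

In a real network the 1x1 weights would be learned parameters, and a spatial mismatch would additionally need pooling or a strided convolution.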

We have 20 minutes. We're calling this Project Petri Dish because most of the methods I showed you were doing something you could call backward search. DARTS, ENAS, and the RNN controller from the first paper I showed you, which are three principal strands of the literature, all share an implicit assumption: that you have this supergraph, a big graph from which you can subsample, prune down, and search for a good architecture. That's also human domain knowledge you have to inject. Say you get a dataset which is not vision and not NLP, so you really have no intuition about what is a good supergraph to start with or how big to make it. The bigger you make it, the more search time you take. You may be able to squeeze out more performance at the end if you are willing to spend more search budget in GPU hours, but it also means you may be very wasteful. So what if, instead of doing backward search, which is what 99 percent of the papers out there do, we could do forward search? We call it Petri Dish because, as in a Petri dish, we are just growing the network as needed. It has the following flavor: we start from a very small seed network, say one or two layers, and then we ask, hey, should I add another layer or not? And if I add a layer, what kind should it be? This is the mathematical
question we are going to pose, and we are going to find an algorithm that answers it. Now, the nice part about searching forward instead of backward is, first, that I don't need any of that supergraph machinery. Second, I can also do macro search, which is more general: I don't need to assume a cell and shrink the search space that way, baking a lot of human domain knowledge into the cell design. I also don't need prior knowledge of a backbone, like a ResNet-style or DenseNet-style backbone. Another nice property: say you have a 10-terabyte dataset and you already have a really good model for it. Then someone, usually from dev, comes by and says, "Hey, I have this really cool method, you should try it." But instead of trying that, what you actually want is, "Hey, I have this really good thing. Can you start from it and modify it, instead of undoing all the work we have done to come up with a good model? We have put a lot of expertise and domain knowledge into this model, and right now it satisfies us." You want that property, and it's not easy to get with the previous methods, because they have the flavor of: give me a dataset, I will try things and hand you an architecture. It's not easy to warm-start from something that already exists and exploit prior knowledge. Okay. So how many of you know gradient boosting,
or does the word ring a bell? Okay. All right. Even if you don't know it, that's fine for this talk; but if you do, keep gradient boosting in your head and it'll come in useful. For example, when you use XGBoost to train a forest of decision trees via gradient boosting, a very powerful method especially for tabular data where the low-level features are already well understood or carry semantic meaning, you are doing boosting. Boosting is the general term for adding weak learners one by one, growing in scope and capacity. What follows will look very much like that. In fact, one of the earliest papers here is from 1989, and I think the tech report is from 1991: Cascade-Correlation, from CMU. Actually, I'll skip this, because I can just show you the actual method. You start with a small model. Say this model is tiny, and you somehow have
these candidates. Say the candidates are a grab bag of operations: a conv five-by-five, a maxpool, and so on. I want to know where I should add these candidates and, without actually adding them in, how useful they would be if I did add them. That sounds like a very hard question: "Well, if you want to know how useful they are, why don't you just add them and evaluate them?" But that would run up my GPU costs, because if I add them into the network and I have 10 different candidates, I have 10 different networks, and I would have to train 10 different things; I don't want that. I want to try them all at once, but without really trying them and without affecting the original model. If some of them are useful, I add them in. Adding them in means they just become part of a new, regular network with that particular thing added, and now I can repeat the process: take the modified model, put it back here, and go again. How am I going to actually do this,
because this seems like magic? Say I have this simple original model, drawn with regular blue lines for regular connections. Here's the fun part. Say I have one candidate, and during an initialization phase I want to add it in such a way that I can see what signal it would provide, without affecting the activations or any of the gradients flowing back through the blue connections. I can actually do that; it's not that hard. Say this is the candidate, and
it’s not that hard. Let say this candidate,

I decide to add it here. I connect it by this particular type of this dotted line where

the forward is identity, the backward is zero. So electrical engineering, if you’ve heard of

the term Zener diode, this is basically like that. That this is a special

type of layer where forward activations will

flow through just as this, but backward gradients

will be stopped. Then we have the reverse of back, which is here, this one. Which is the forward is zero, which means that if I’m

doing forward prop, this is zero, this is just as it is. So the sum is just whatever

the activations of B are, and then it goes through C. So we have verified that

during forward activations, we don’t change the activations that would have flown

through the blue ones. Everything is unchanged. During backward

propagation, let’s see. So you will get signals. This one, the backward is identity, so you will get gradients up to here, but no more gradients here. The gradient from here will

actually be the original one of B because there is no

dependency on this candidate. So during forward and backward, nothing in the original graph change, but I get to see some gradients here. This should remind you

of gradient boosting. Because now I can accumulate the gradient at that candidate with respect to B; B is right next to it, and I can see what gradients the candidate receives. Here's the intuition, the inspiration that comes from Scott Fahlman's original paper: if this gradient is highly correlated with your error, which gradients are, then that gives me a proxy signal that adding this candidate is good. Because if I then add the candidate into the main network and train it, it can learn weights and encodings that oppose the error and minimize the error residual. Say right now I have some error: if I were to add this candidate in, I have it on good faith that it will be able to reduce the residual of the error with respect to the target values I want. It can be regression or classification; it doesn't matter. So that was the middle stage. If I decide that I want
to actually add it, I will add it in gently. There is no theoretical reason for doing this, only intuition: I replace the special connections with the regular connections they should be, no more Zener-diode-like links, but I initialize this weight to one and this one to zero, so that there is no big disruption in the distribution of gradients each layer sees. Let backprop and the optimizer figure out the right values of these weights, but starting gently from here. In practice these are called stop-gradient layers, and this one is called a stop-forward layer. Then, once you decide you want the candidate in, you remove those special layers and create just a plain network with a scale value, and then let the
network decide what it should be.>>I had a quick question. You said that you add it in gently: do you add it in the middle of training, or do you start training afterward? Do you re-train after you add this node?>>Let's hold that for three more slides. If you still have that
question, ask me again; hopefully this will become clear. So one consideration was that, to be efficient, we want to amortize the forward-prop and backward-prop cost; that's a common running theme, and without it you're not really efficient in GPU hours. So I want to try all these operations at once. Say I have K operations, where K can be 10 or 20. I want to decide which of these K operations I should pick here, and which of the same K operations I should pick there, all in parallel. I can add them all in by the same stop-gradient, stop-forward trick. One of the things we do is impose an L1 sparsity penalty on
all the operations. Then we pick the operation with the most weight amongst the K: there's an L1 regularizer over this group, plus an L1 regularizer over that group, and we select greedily. So this is again like gradient boosting: greedy selection. Say we decide to add something; it then becomes part of the model. That happens in the middle stage, where we remove the special connections and do the gentle introduction. The model then moves on to the third stage, where it actually gets trained fully. Depending on the signal we derive from that, we decide whether it goes back to being a parent model. These are the parents.
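The stop-forward and stop-gradient connections used during candidate evaluation can be demonstrated with a scalar toy example. This is a hand-derived sketch under my own names, not the project's code (in a real framework you would write a custom autograd function): the merged forward output equals B's output exactly, while the candidate still receives a gradient that leaks nothing back into the rest of the graph.

```python
# Stop-gradient: forward = identity, backward = 0 (the "Zener diode" link
# feeding the candidate). Stop-forward: forward = 0, backward = identity
# (the link merging the candidate's output back in).
sg = lambda x: x          # forward pass of stop-gradient
sg_grad = lambda g: 0.0   # backward pass of stop-gradient
sf = lambda x: 0.0        # forward pass of stop-forward
sf_grad = lambda g: g     # backward pass of stop-forward

x = 1.5                   # activation feeding both layer B and the candidate
w_b, w_c = 2.0, 0.3       # weight of B and of the candidate layer

# Forward: the merged output is exactly B's output; the candidate adds nothing.
b_out = w_b * x
cand_out = w_c * sg(x)
merged = b_out + sf(cand_out)
assert merged == b_out

# Backward, given an upstream gradient g = dLoss/dmerged:
g = 0.7
grad_b = g                            # B's gradient is untouched
grad_cand = sf_grad(g) * x            # the candidate still gets dLoss/dw_c ...
grad_x_via_cand = sg_grad(sf_grad(g) * w_c)  # ... but leaks nothing into x
assert grad_cand == g * x
assert grad_x_via_cand == 0.0
```

The assertions check the two properties from the slides: unchanged forward activations, and a usable gradient signal at the candidate only.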

All the candidates are evaluated on the parents. The parents produce children, and if the children are good and seem promising, they get to be parents, and we see how they do. The terminology may remind you of evolutionary search, but we're not really doing evolutionary search here, and it isn't RL either. One thing we do is warm-start: we keep all the model weights around, and only the newly added layers get fresh weights. They get updated by backprop, so everything is incrementally trained over time.>>[inaudible] add layers at the top?>>No, we can add layers anywhere. That's actually part of the decision, as you say. That's a good question. You flip a coin and pick one of the layers as input one; you flip the coin again and that gives you input two; then you pick where the output goes. So you are trying a lot of these combinations together, which is important because you don't know where you should be connecting. That's part of the search space. This slide just restates what I said: the whole thing is always happening asynchronously, in a distributed manner. You have a pool of parents in flight, and for every parent we are trying to figure out which candidates to add where. We train the ones we decide we like, and then they get to come back. Okay. Now the thing that
I haven't told you is: how do we decide we like a candidate? I told you the mechanism by which we evaluate a candidate, but how do we know which of these candidates, once added, are worth going back into the parent queue? Here's what we do, and it's nice because it matters in the real world: as you try things out asynchronously in all these queues and train them, you get to keep the results. So this is a scatter plot. Say you care about flops: this axis is loss and this one is cost. The cost could be flops, memory, latency, anything you want, or all three. Ideally, you want to find the Pareto frontier, so you know the best possible performance, the lowest loss, you can get for a given flops budget, and can just look it up. If tomorrow you get faster GPUs, or your serving needs change and you say, "Hey, 40 milliseconds is okay," in these scenarios I want to just pick that model out. I don't want to re-run my
architecture search and say, "Hey, give me a model which is good at 40 milliseconds." I already have a gallery of models along this approximate Pareto-optimal frontier. Note, I don't know where the true Pareto-optimal frontier is, but as I train, this scatter plot builds up, because these are all the child models coming out of the end of the third queue: I train them and just plot them here. So if you think about it, I know for a fact that
there are models, for example this one, where, if it were a parent, it is highly unlikely that a small modification, like adding a small candidate, would produce a model that lowers the Pareto frontier. Because I'm adding something, I will move this way in cost, and it's very unlikely I'd also get a huge drop in loss; I'm making a reasonable smoothness assumption that networks which look similar have similar performance. So there is really no reason to add such models to the parent queue, because all I really care about is down here: I want to lower this frontier as quickly as possible, but I don't know exactly where it is. It's quite possible that this one, even though it sits slightly above the current frontier, would, if I took it and tweaked it, drop way down and lower the frontier, and that's good news for me; it's an exploration-exploitation problem. So we explicitly maintain this
Pareto frontier during search, and we take a small tolerance region around the boundary, because we don't know which models may be promising. Based on their distance from the current estimate of the lower convex hull of the frontier, we explore: those models get added to the parent queue, and we see whether that pays off. Does that answer your question? Okay. So, results. Before I show you numbers, though: it's easy to show numbers, but there are a lot of caveats in this kind of research. Right now the community has a comparison, fairness, and reproducibility crisis. That's true across all of ML, but NAS makes it even harder. Why? Because everybody comes up with their own search space, so it's hard to compare the merits of search algorithms. Not only that, one of the things we don't address is
hyperparameter tuning. It is well known that if you are a competitor with a competing algorithm, then just because you have a better training subroutine, your accuracy may come out much higher. If your subroutine for training a network has all the tips and tricks (Cutout, CutMix-style augmentation, cosine learning-rate schedules, warm-up, and so on), that makes a huge difference; without them you will not even be on the leaderboard, even though you have a great search algorithm. Another thing that makes a huge difference:
what's your hardware? People report GPU hours, but a K80 GPU hour is not the same as a V100 GPU hour, or a TPU hour, or an FPGA hour. So again: what's your hardware?
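On the multi-objective point, the Pareto-frontier bookkeeping from the scatter-plot slides is cheap to implement. Here is a minimal toy sketch of my own, not the project's implementation: keep, for each cost budget, only the models no other model dominates.

```python
def pareto_frontier(points):
    """Given (cost, loss) pairs, keep the non-dominated ones: a point
    survives if no other point has both lower cost and lower loss."""
    frontier, best_loss = [], float("inf")
    for cost, loss in sorted(points):     # sweep in order of ascending cost
        if loss < best_loss:              # strictly beats every cheaper model
            frontier.append((cost, loss))
            best_loss = loss
    return frontier

# models found so far during search, as (flops, validation loss) pairs
models = [(1, 0.9), (2, 0.7), (2, 0.8), (3, 0.72), (4, 0.5), (5, 0.55)]
print(pareto_frontier(models))  # [(1, 0.9), (2, 0.7), (4, 0.5)]
```

The actual system additionally keeps a tolerance band around the estimated lower convex hull, so near-frontier models can still re-enter the parent queue for exploration.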

so what’s your hardware? We know we have stochasticity

in training on GPUs and like, how many random seeds? How many trials you did? So unless you are a

lab with lots of GPUs, this may be expensive for you

to do and report properly. This is also a multi-objective

optimization problem, and that it’s not

just about accuracy, it’s about accuracy at certain

flops or memory or latency. We know that like, you know, hey, if you by using very fancy

distributed training, I can train like a two gigabyte

model on ImageNet and be sort out, but that’s not very interesting. You are not going to use that two

gigabyte model in production, and that’s a really

uninteresting result. So there’s lots of work being done. Like there’s one NAS

bench which came out. This is from Google and the [inaudible] people whose link I was showing. They actually went and trained close to 1.8 million different architectures and put the results in a lookup table. So if you want to evaluate your search algorithm, as long as you adopt that same search space, you can just do lookups on your laptop. You can do NAS research on a laptop by downloading that lookup table, because it's only a few MB of text. Things like weight-sharing and DARTS-like search spaces can't really work there, though, including our incremental approach, but more benchmarks are coming soon. And the community, I'm pretty
sure, will rally around them. They will also bring NAS research to communities who cannot otherwise afford it; unless you're a big industry lab with something like 5,000 GPUs, you can't even touch this. So it will democratize the field a bit more. Okay, results. As I said, look at Zoph and Le: that was about 1,600 GPU days, converted into hours here. We do well on both of these fronts. This is not going to blow you out of the park, but the point is you should not be doing RL; this is not an RL problem, and we showed that using macro search. You will see that some of these numbers have a plus-minus and some don't, because there's also variance in how people report results. We actually run multiple trials with different random seeds, and there's a list of best practices one should follow when reporting results for any machine learning algorithm. Because we found that a lot of people take the best result found across all their experiments and report that. That's essentially saying: I threw a bunch of darts, one of them struck gold, and I'm going to report that one and put the trained model up on GitHub. That's not very honest. You should really report
variances and reproducibility. The results won't blow you away because, again, there's no easy way for me to show you the whole gallery of models we produce. We are just picking one of the points our algorithm produces and putting it in this table, so I can show some comparisons; but this is not an apples-to-apples comparison by any means, and there's a huge discussion behind that. One thing we were able to show is that if you transfer to ImageNet, via this process of picking a cell, stacking many cells, and then training, then even with macro search, with no knowledge of a backbone or cells or anything, you can do quite well, close to cell-based search. I'll skip the language modeling,
since we are out of time. Just to end: there are lots of open research problems. NAS has had really useful successes, but it is by no means a solved problem; if it were solved, we wouldn't have to write architectures by hand anymore. That said, as I always tell PhD students and potential PhD students: if your research agenda in academia suddenly becomes "let me hand-design a new architecture for object recognition," you should really rethink it, because right now nobody beats NAS on those kinds of tasks. We have an implementation on GitHub, but a lot of that implementation
is in TensorFlow, and I know TensorFlow is not the most popular right now. So we're redoing everything in PyTorch, and we're also making the code a bit more robust. So watch that GitHub link; we'll send out a blog post with all the right pointers. One ask we have, since you're from all over the company: can we get access to first-party datasets? We don't need to publish; the method is published, and we have a patent (a defensive patent) on it. We want interesting new datasets. They don't need to be vision or language; they can be anything. We would love to get our hands on them and work with you to see what we can do, just out of our own curiosity. Especially if you have production serving needs with constraints you want taken into account, like "give me the best model that uses less than three MB of GPU RAM," those kinds of scenarios, we would love to get our hands on that. I will end here so that you have a bit of a break.
