Hi, I'm Aurélien Géron, and today I'm going to explain how Synthetic Gradients can
dramatically speed up the training of deep neural networks, and often even improve their
performance significantly.
We will also see how they can help recurrent neural networks learn long-term patterns in
your data, and more.
Synthetic Gradients were introduced in a paper called "Decoupled Neural Interfaces using
Synthetic Gradients" published on Arxiv in 2016 by Max Jaderberg and other DeepMind
researchers.
As always, I'll put all the links in the video description below.
To explain Synthetic Gradients, let's start with a super quick refresher on Backpropagation.
Here's a simple feedforward neural network that we want to train using backpropagation.
Each training iteration has two phases.
First, the Forward phase: we send the inputs X to the first hidden layer, which computes
its outputs h1 using its parameters theta1, and so on up to the output layer, and finally
we compute the loss by comparing the network's outputs and the labels.
Then the Backward phase.
The algorithm first computes delta3, the gradients of the loss with respect
to h3, then these gradients are propagated backwards through the network, until we reach
the first hidden layer.
The final step of Backpropagation uses the gradients we have computed to tweak the parameters
in the direction that will reduce the loss.
This is the gradient descent step.
Okay, that's it for Backpropagation.
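For reference, here's what one such training iteration looks like in code. This is just a generic PyTorch sketch with made-up layer sizes and random data, not the exact network from the video:

```python
import torch
import torch.nn as nn

# One training iteration of plain backpropagation on a tiny feedforward net.
net = nn.Sequential(nn.Linear(784, 256), nn.ReLU(),   # hidden layer 1 (theta1)
                    nn.Linear(256, 256), nn.ReLU(),   # hidden layer 2 (theta2)
                    nn.Linear(256, 10))               # output layer   (theta3)
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(net.parameters(), lr=0.01)

x, y = torch.randn(32, 784), torch.randint(0, 10, (32,))

logits = net(x)            # forward phase: X -> h1 -> h2 -> h3
loss = loss_fn(logits, y)  # compare outputs and labels
loss.backward()            # backward phase: delta3 -> delta2 -> delta1
optimizer.step()           # gradient descent step
optimizer.zero_grad()
```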
Now suppose you want to speed up training.
You buy 3 GPU cards, and you split the neural network in three parts, with each part running
on a different GPU.
This is called model parallelism.
Unfortunately, because of how Backpropagation works, model parallelism is inefficient.
Indeed, to compute the loss, you first need to do a full forward pass sequentially.
Each GPU has to wait for the previous GPU to finish working on a training batch before
it can start working on it.
This is called the Forward Lock.
Notice that the model parameters cannot be updated before the loss is computed.
And this is called the Update Lock.
And finally, we cannot update a layer's parameters before the backward pass is complete,
at least down to the layer we want to update.
This is called the Backward lock.
The consequence of all these locks is that GPUs will spend most of their time waiting
for the other GPUs.
As a result, training on 3 GPUs using model parallelism is actually slower than training
on a single GPU.
So, the main idea behind Synthetic Gradients is to break these locks, in order to make
model parallelism actually work.
Let's see how.
First we send the inputs to the first hidden layer.
Then this layer uses its parameters theta1 to compute its outputs.
So far, nothing has changed.
But now we also send the outputs h1 to a magical little module M1, called a Synthetic Gradient
model.
We'll see how it works in a few minutes, but for now it's just a black box.
This model tries to predict what the gradients for the first hidden layer will be.
It outputs the synthetic gradients delta1 hat, which are an approximation of the true
gradients delta1.
Using these synthetic gradients, we can immediately perform a gradient descent step to update
the parameters theta1, no need to wait.
This hidden layer equipped with its Synthetic Gradient model is effectively decoupled from
the rest of the network.
This is called a Decoupled Neural Interface, or DNI.
In parallel, the second layer can do the same thing.
It uses a second Synthetic Gradient model M2 to predict what the gradients will be for
the second hidden layer.
And it performs a gradient descent step.
And so on up to the output layer.
This time instead of using a Synthetic Gradient model, we might as well compute the true gradients
directly and use these true gradients delta3 to update the parameters theta3.
And we are done!
Notice that we only did a forward pass, no backward pass.
So just like that, training could potentially be up to twice as fast.
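Here's a minimal sketch of this decoupled, forward-only training step. Everything here is an illustrative assumption: the layer sizes, the random data, the learning rate, and the choice of a single linear layer for each Synthetic Gradient model. Training M1 and M2 is left out, since we'll get to it in a few minutes.

```python
import torch
import torch.nn as nn

layer1, M1 = nn.Sequential(nn.Linear(784, 256), nn.ReLU()), nn.Linear(256, 256)
layer2, M2 = nn.Sequential(nn.Linear(256, 256), nn.ReLU()), nn.Linear(256, 256)
layer3 = nn.Linear(256, 10)
opts = {m: torch.optim.SGD(m.parameters(), lr=0.01) for m in (layer1, layer2, layer3)}
loss_fn = nn.CrossEntropyLoss()

x, y = torch.randn(32, 784), torch.randint(0, 10, (32,))

# Layer 1: forward, predict its gradients with M1, update theta1 immediately.
h1 = layer1(x)
delta1_hat = M1(h1.detach()).detach()          # synthetic gradients for layer 1
h1.backward(delta1_hat)                        # backpropagate them through layer 1 only
opts[layer1].step(); opts[layer1].zero_grad()  # no need to wait

# Layer 2: same thing, with M2.
h2 = layer2(h1.detach())
delta2_hat = M2(h2.detach()).detach()
h2.backward(delta2_hat)
opts[layer2].step(); opts[layer2].zero_grad()

# Output layer: the loss is available here, so use the true gradients delta3.
loss = loss_fn(layer3(h2.detach()), y)
loss.backward()
opts[layer3].step(); opts[layer3].zero_grad()
```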
Just to be clear, the Synthetic Gradient models are only used during training.
After training, we can use the neural network as usual, based on the trained parameters
theta1, theta2 and theta3.
Okay, now let's see how this technique enables model parallelism during training.
Once again, let's split the network into three parts, each running on a different GPU
card.
And the CPU will take care of loading the training instances and pushing them into a
training queue.
We start by loading the first training batch.
And while the first GPU is computing h1, and updating its parameters using synthetic gradients,
we can already load batch number 2 and push it into the queue.
Then while layer 2 takes care of batch number 1, layer 1 can already take care of batch
number 2.
No need to wait!
And so on, so you get the picture.
Now each layer is working in parallel on a different batch, so all GPUs are active and
spend much less time blocked waiting for the other GPUs to finish their jobs.
And we can continue like this until the end of training.
As you can imagine, this can dramatically reduce training time.
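To make the schedule concrete, here's a toy loop (no real GPUs involved) that simply prints which batch each GPU would be working on at each tick. After a short warm-up, all three GPUs are busy on every tick:

```python
# Conceptual pipeline schedule only: at tick t, GPU i works on batch t - i.
num_gpus, num_batches = 3, 6
for t in range(num_batches + num_gpus - 1):
    work = [f"GPU{i + 1}: batch {t - i + 1}"
            for i in range(num_gpus) if 0 <= t - i < num_batches]
    print(f"tick {t}:", ", ".join(work))
```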
However, every time we go from one layer to the next, we need to move a lot of data across
the GPU cards.
This can take a lot of time, and in practice the cost can far outweigh the benefits of this architecture.
But if you have a deep neural network composed of, say, 30 layers, then you can split it
into 3 parts of 10 layers each.
You can use Synthetic Gradient models at every hidden layer, or every few hidden layers,
or just at the interfaces between the GPU cards.
With so many layers, the time required to copy the data across GPU cards is now small
compared to the total computation time, so the GPU cards spend much less time waiting
for data, and you can hope to train your network close to 3 times faster than using regular
Backpropagation on a single GPU card.
So model parallelism actually works!
Great!
Now it's time to open the black boxes and see how the Synthetic Gradient models work.
Let's focus on a hidden layer i.
It has its own Synthetic Gradient model Mi which produces synthetic gradients delta i
hat, and these synthetic gradients can be used to update the hidden layer's parameters
without waiting for the true gradients to be computed, as we have just seen.
This model can simply be a small neural network.
For example, a single linear layer, with no activation function.
Or it could have a hidden layer or two.
We will simply train the Synthetic Gradient model Mi so that it gradually learns to correctly
predict the true gradients delta i.
For this, we can just train the Synthetic Gradient model normally, by minimizing a loss
function.
We can just use regular Backpropagation here, nothing fancy.
For example, we can minimize the distance between the synthetic gradients and the true
gradients (in other words, the L2 norm of their difference), or we can minimize the
square of that distance.
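For instance, here's what that could look like, assuming a 256-unit layer, a single linear layer for the Synthetic Gradient model, and a stand-in tensor for the true gradients:

```python
import torch
import torch.nn as nn

# A Synthetic Gradient model can be tiny: here, a single linear layer (no
# activation) mapping a layer's 256-dimensional outputs to predicted gradients
# of the same shape. delta_i_target stands in for the (approximate) true
# gradients; where it comes from is explained just below.
M_i = nn.Linear(256, 256)

h_i = torch.randn(32, 256)             # outputs of hidden layer i (batch of 32)
delta_i_target = torch.randn(32, 256)  # stand-in for the true gradients delta_i

delta_i_hat = M_i(h_i)
loss_M = ((delta_i_hat - delta_i_target) ** 2).sum(dim=1).mean()  # squared L2 distance
loss_M.backward()                      # plain backpropagation trains M_i
```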
But this raises the question: how do we compute the true gradients delta i?
If we need to wait for the loss function to be computed and for the true gradients to
flow backward through the network, then we have somewhat defeated the purpose of synthetic
gradients.
Fortunately, there's a neat trick to avoid this.
We simply wait for the next layer to compute its synthetic gradients delta i+1 hat, and
then we backpropagate these synthetic gradients through layer i+1.
This does not really give us the true gradients delta i, but hopefully something pretty close.
Of course if the next layer happens to be the output layer, then we might as well compute
the true gradients and Backpropagate them.
Over time, the Synthetic Gradient models will get better and better at predicting the true
gradients, which is useful both for updating the parameters correctly and for providing
accurate gradients to train the Synthetic Gradient models in lower layers.
And that's it: you now know what synthetic gradients are, how they work, and how they can
speed up neural network training.
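Before we move on, here's a minimal sketch tying these pieces together, with made-up sizes: layer i updates itself immediately using its synthetic gradients, and M_i is later trained against a target obtained by backpropagating the next layer's synthetic gradients through layer i+1.

```python
import torch
import torch.nn as nn

layer_i    = nn.Sequential(nn.Linear(256, 256), nn.ReLU())
layer_next = nn.Sequential(nn.Linear(256, 256), nn.ReLU())
M_i, M_next = nn.Linear(256, 256), nn.Linear(256, 256)
opt_layer = torch.optim.SGD(layer_i.parameters(), lr=0.01)
opt_M_i   = torch.optim.SGD(M_i.parameters(), lr=0.01)

h_prev = torch.randn(32, 256)           # outputs arriving from layer i-1

# Layer i: forward, predict its gradients, update immediately (no waiting).
h_i = layer_i(h_prev)
delta_i_hat = M_i(h_i.detach())
h_i.backward(delta_i_hat.detach())
opt_layer.step(); opt_layer.zero_grad()

# Later: layer i+1 runs and produces its own synthetic gradients;
# backpropagating them through layer i+1 yields an estimate of delta_i.
h_i_in = h_i.detach().requires_grad_()
h_next = layer_next(h_i_in)
delta_next_hat = M_next(h_next.detach()).detach()
h_next.backward(delta_next_hat)         # (layer i+1's own update is omitted here)
delta_i_target = h_i_in.grad

# Train M_i by regressing its prediction towards this estimate.
loss_M = ((delta_i_hat - delta_i_target) ** 2).mean()
loss_M.backward()
opt_M_i.step(); opt_M_i.zero_grad()
```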
But there are a few more important things to mention.
Firstly, Synthetic Gradients can be used on pretty much any type of network, including
convolutional neural networks such as this one.
Just add Synthetic Gradient models after some hidden layers, and that's about it.
Each Synthetic Gradient model's outputs must have the same shape as its inputs, that is,
the same shape as the outputs of the layer it is attached to.
For example, M1's outputs must have the same shape as the outputs of this convolutional
layer.
Suppose it's a convolutional layer with 5 feature maps of size 400x200; then that's exactly
the shape that M1 must output: a 5x400x200 array.
In practice, you can use a shallow convolutional neural network that preserves the shape of
its inputs, so for example a couple of convolutional layers with zero padding and stride 1
would do just fine.
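For example, for the 5x400x200 case above, the Synthetic Gradient model could look like this (the number of intermediate filters and the 3x3 kernels are arbitrary choices):

```python
import torch.nn as nn

# Two small convolutions with stride 1 and zero padding, so the output keeps
# the 5x400x200 shape of its input.
M1 = nn.Sequential(
    nn.Conv2d(5, 16, kernel_size=3, stride=1, padding=1),
    nn.ReLU(),
    nn.Conv2d(16, 5, kernel_size=3, stride=1, padding=1),
)
```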
Here's another important point.
Until now, the input of each Synthetic Gradient model Mi was only the output of the corresponding
layer, hi.
But it is perfectly legal to provide additional information to the Synthetic Gradient model,
so that it can make better predictions.
For example, we can give it the labels of the current batch.
This is called a conditional Decoupled Neural Interface, or cDNI.
In the paper, the authors show that cDNI consistently performs better than regular DNI, so it should
probably be your default choice.
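Here's one possible way to build such a conditional Synthetic Gradient model, assuming a 256-unit layer and 10 classes, with the labels fed in as a one-hot vector concatenated to the layer's outputs (the paper may condition differently; this is just a sketch):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConditionalSG(nn.Module):
    def __init__(self, dim=256, num_classes=10):
        super().__init__()
        self.num_classes = num_classes
        self.linear = nn.Linear(dim + num_classes, dim)

    def forward(self, h, y):
        # Concatenate the layer's outputs with the one-hot labels of the batch.
        y_onehot = F.one_hot(y, self.num_classes).float()
        return self.linear(torch.cat([h, y_onehot], dim=1))

# Usage: delta_hat = ConditionalSG()(h, y), with h of shape (batch, 256) and
# integer labels y of shape (batch,).
```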
So in the paper, they experimented with the MNIST dataset of handwritten digits, using
various architectures and training methods.
In particular, they used this fully connected network with 3 to 6 hidden layers of 256 neurons
each.
They used Batch normalization and the ReLU activation function at each hidden layer.
And here is a graph presented in Figure 2 in the paper.
It shows the learning curves for 3 to 6 hidden layers and for various training methods.
For example, when trained using regular Backpropagation, the network reaches below 2% error on the
test set, and it gets better when you add more layers.
Using Synthetic Gradient models at each hidden layer, the final performance of the 3-layer
network ends up being better than before, but it takes time to train the synthetic models,
so overall training takes a bit longer than with plain Backpropagation.
When you add more layers, the network's performance actually decreases, and training
time increases.
That's not great.
Note that each synthetic gradient model is actually composed of two hidden layers of
1024 neurons each, and one output layer of 256 neurons.
They also used batch normalization and the ReLU activation function in the hidden layers.
Finally, they tried training the network using conditional DNI.
The network gets better when you add more layers, and with 6 layers it actually reaches
the best performance overall.
Moreover, as you can see, this is the fastest learning architecture.
It reaches less than 2% error in just a few thousand iterations.
Surprisingly, they used very simple synthetic gradient models, without any hidden layers
here.
I am curious to know why they did not use the same synthetic models for DNI and cDNI,
because it feels like we are comparing apples and oranges.
Anyway, it clearly demonstrates that cDNI performs much better than Backpropagation
on this task, both in terms of final accuracy and training speed.
There are many more results in the paper, if you're interested, in particular great
results with Convolutional Neural Nets.
Another great application of Synthetic Gradients is in Recurrent Neural Networks.
At each time step t, a recurrent layer takes the inputs Xt, as well as its own outputs
from the previous time step h_t-1, and it produces the output h_t.
It is convenient to represent RNNs by unrolling them through time, across the horizontal axis,
like this.
First the recurrent layer takes the inputs at time t=0, and it has no previous outputs.
It then outputs h_t=0.
And at the next time step, it takes the inputs X_t=1 and the previous outputs h_t=0.
To be clear, these two boxes represent the same recurrent layer at two points in time.
Then it outputs h_t=1.
And we could go on and on and on…
However, during training, we have to stop at some point, or else we will run out of memory.
We can then compute the loss based on the outputs produced so far.
And we can perform Backpropagation.
And finally we can update the parameters of the recurrent layer.
This technique is called Truncated Backpropagation through time.
It works well, but it has its limits.
In particular, since we only computed the loss on a few outputs, we know nothing about
the future losses.
So in practice, this means that the network cannot learn long-term patterns.
So let's see how Synthetic Gradients can help solve this problem.
Instead of stopping at time step t=3, let's unroll the network for just one additional
time step.
But instead of using its outputs to compute the loss, we send them to a Synthetic Gradient
model.
It estimates the gradients for that time step, delta_t=4_hat.
And we backpropagate these gradients through the layer to get an estimate of delta_t=3.
We can then perform regular Backpropagation through time, by mixing the true gradients
and the estimated future gradients.
Finally, once we have all the gradients we need, we can update the parameters of the
recurrent layer by performing a gradient descent step.
We must not touch the last unrolled cell, because this would change its output h_t=4,
and we are going to need it in a minute to train the Synthetic Gradient model.
So by using Synthetic Gradients in a recurrent neural network like this, we can capture
long-term patterns in the data even if we only unroll the network through a few time steps.
Now, let's see how we can train the Synthetic Gradient model.
For this, we will need to run the network on the next few time steps, so let's move
forward in time.
Okay, let's clean up a bit and push this to the left to make more space.
Okay, now we run the RNN on the next few time steps.
Okay, we compute the loss.
We add an extra time step and we use the Synthetic Gradient model to estimate the gradients for
that time step.
And just like earlier, we Backpropagate these synthetic gradients and we mix them with true
gradients.
And now this process gives us something pretty close to the true gradients for time step
4, and we can use these gradients to train the Synthetic Gradient model.
Next, we can use the gradients we computed to update the RNN's parameters.
And boom!
Of course we could repeat this process many times, and both the RNN and the Synthetic
Gradient model would get better and better.
It does add some complexity, but hopefully the main Deep Learning libraries will soon hide
this complexity from us.
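In the meantime, here's a rough PyTorch sketch of the idea, simplified so that the Synthetic Gradient model predicts the gradients of the future losses directly from a segment's last hidden state, rather than after an extra unrolled step. The plain RNN cell, the `readout` layer, the linear model M, the `segment` helper and all the sizes are illustrative assumptions:

```python
import torch
import torch.nn as nn

cell, readout = nn.RNNCell(10, 32), nn.Linear(32, 10)
M = nn.Linear(32, 32)                       # predicts gradients coming from the future
opt   = torch.optim.SGD(list(cell.parameters()) + list(readout.parameters()), lr=0.01)
opt_M = torch.optim.SGD(M.parameters(), lr=0.01)
loss_fn = nn.MSELoss()

def segment(h_in, xs, ys):
    """Unroll len(xs) steps, mix true and synthetic gradients, update the RNN,
    and return (detached final state, M's prediction, gradient w.r.t. h_in)."""
    h_in = h_in.detach().requires_grad_()   # so we can read the gradient flowing into it
    h, loss = h_in, 0.0
    for x, y in zip(xs, ys):
        h = cell(x, h)
        loss = loss + loss_fn(readout(h), y)
    delta_hat = M(h.detach())               # estimated gradient of all future losses w.r.t. h
    loss.backward(retain_graph=True)        # true gradients from this segment
    h.backward(delta_hat.detach())          # mixed with the estimated future gradients
    opt.step(); opt.zero_grad()
    return h.detach(), delta_hat, h_in.grad.detach()

xs = [torch.randn(8, 10) for _ in range(8)]   # 8 time steps, batch of 8
ys = [torch.randn(8, 10) for _ in range(8)]
h0 = torch.zeros(8, 32)

h1, delta_hat_1, _ = segment(h0, xs[:4], ys[:4])
_, _, grad_into_h1 = segment(h1, xs[4:], ys[4:])

# The gradient that actually flowed back into h1 during the second segment is
# (approximately) what M should have predicted at the first boundary.
loss_M = ((delta_hat_1 - grad_into_h1) ** 2).mean()
loss_M.backward()
opt_M.step(); opt_M.zero_grad()
```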
And if you need some motivation, here are some amazing results.
This graph is a simplified version of Figure 4 in the paper, and it comes from DeepMind's
great blog post about Synthetic Gradients, which I highly encourage you to read (the
link is in the video description below).
It shows the performance of various RNNs on the Penn Treebank task, which is a language
modelling task.
The horizontal axis shows training time, and the vertical axis shows the model's error,
measured in bits per character (BPC).
The three dashed lines are the learning curves of a regular RNN using Backpropagation through
time, unrolled through 8, 20 or 40 time steps.
So the more you unroll the RNN, the longer it takes to train, and the more data it requires,
but also the better the performance it eventually reaches.
Now compare these three dashed lines to the solid line on the left: it shows the learning
curve of an RNN trained using Backpropagation through time unrolled through just 8 time
steps, but this time using synthetic gradients.
As you can see, the model reaches the lowest error, even better than the model unrolled
through 40 time steps, and it takes roughly half as much time and data to train.
That's really impressive!
Okay next!
Yet another really interesting idea in the paper aims to break the forward lock.
Recall that the Forward lock is the fact that each layer has to wait for the lower layers
to finish before it can compute its own outputs.
It may sound impossible to break this lock, but it is in fact quite simple: you can just
equip any layer you want with a Synthetic Input model.
For example, let's add a Synthetic Input model I3 to layer 3, which is the output layer.
It allows us to skip the hidden layers 1 and 2 by computing h2_hat, an approximation of
h2, the inputs of layer 3.
We can just feed h2_hat directly to the output layer.
And ta-da!
We've just broken the forward lock.
As you might guess, once we eventually get the output of hidden layer 2, we can use it to
train the Synthetic Input model.
This is really the exact same idea as earlier, but going forwards rather than backwards.
In fact, we can even use the same trick as earlier to go even faster.
Instead of letting the signal propagate through the whole network to compute h2, we can
just use the previous layer's Synthetic Input model, feed its output to hidden layer 2,
and this will give us something hopefully close enough to h2 to train I3, the Synthetic
Input model of layer 3.
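Here's a small sketch of such a Synthetic Input model, assuming it maps the raw inputs X directly to an approximation of h2 (all sizes are made up):

```python
import torch
import torch.nn as nn

layer3 = nn.Linear(256, 10)
I3 = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 256))
opt_I3 = torch.optim.SGD(I3.parameters(), lr=0.01)

x = torch.randn(32, 784)
h2_hat = I3(x)              # skip hidden layers 1 and 2 entirely
logits = layer3(h2_hat)     # layer 3 can start working right away

# Later, when h2 (or a good approximation of it) finally arrives, use it to
# train the Synthetic Input model.
h2 = torch.randn(32, 256)   # stand-in for the real h2 delivered by layer 2
loss_I3 = ((I3(x) - h2) ** 2).mean()
loss_I3.backward()
opt_I3.step(); opt_I3.zero_grad()
```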
To conclude, let's look at the data flow of a fully Decoupled Neural Interface that
uses both synthetic inputs and synthetic gradients.
First, the Synthetic Input model receives the next training batch and computes an approximation
of the layer's inputs, h_i-1_hat.
Then, the hidden layer computes its outputs h_i and feeds them simultaneously to the next
layer and to its own Synthetic Gradient model, which produces the synthetic gradients delta_i_hat.
These synthetic gradients are backpropagated through the hidden layer, which gives a reasonably
good approximation of the true gradients for the previous layer.
The gradients delta_i-1 are sent back to the previous layer, which will use them
to update its own Synthetic Gradient model.
And immediately after that, we can update the layer's parameters using the Synthetic
Gradients delta_i_hat.
At some point we receive the outputs of the previous layer, h_i-1, and we will use them
to train the Synthetic Input model.
And lastly, we receive the gradients from the next layer, and we use them to train the
Synthetic Gradient model.
And that's it!
The DNI is ready to handle the next training batch.
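To wrap up, here's a rough sketch of such a fully decoupled layer, following the data flow we just described. The module choices, sizes, learning rates and the `DecoupledLayer` class itself are assumptions, and the plumbing between layers is left out:

```python
import torch
import torch.nn as nn

class DecoupledLayer:
    def __init__(self, dim=256, in_dim=784):
        self.layer = nn.Sequential(nn.Linear(dim, dim), nn.ReLU())
        # Synthetic input model: approximates h_i-1 from the raw training batch.
        self.I = nn.Sequential(nn.Linear(in_dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        # Synthetic gradient model: predicts this layer's gradients.
        self.M = nn.Linear(dim, dim)
        self.opt   = torch.optim.SGD(self.layer.parameters(), lr=0.01)
        self.opt_I = torch.optim.SGD(self.I.parameters(), lr=0.01)
        self.opt_M = torch.optim.SGD(self.M.parameters(), lr=0.01)

    def step(self, x):
        """Handle one batch without waiting for the rest of the network."""
        h_in_hat = self.I(x).detach().requires_grad_()  # approximate inputs h_i-1_hat
        h = self.layer(h_in_hat)                        # outputs h_i
        delta_hat = self.M(h.detach())                  # synthetic gradients delta_i_hat
        h.backward(delta_hat.detach())                  # backprop them through the layer...
        delta_prev = h_in_hat.grad                      # ...to get delta_i-1 for the previous layer
        self.opt.step(); self.opt.zero_grad()           # update the layer's parameters right away
        return h.detach(), delta_hat, delta_prev

    def learn_models(self, x, delta_hat, h_in_true, delta_true):
        """Later, when the real inputs and gradients arrive (as plain tensors),
        train the synthetic input and synthetic gradient models."""
        loss_I = ((self.I(x) - h_in_true) ** 2).mean()
        loss_M = ((delta_hat - delta_true) ** 2).mean()
        (loss_I + loss_M).backward()
        self.opt_I.step(); self.opt_I.zero_grad()
        self.opt_M.step(); self.opt_M.zero_grad()
```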
If you want to learn more about Synthetic Gradients, I encourage you to read the paper
itself, as it covers a few more topics, such as implementation details, or how Synthetic
Gradients can help two Recurrent Nets communicate efficiently when they don't tick at the
same rate, and so on.
Also check out the links in the video description, there are several interesting blog posts and
implementations, and I might add my own implementation at one point.
If you want to learn more about Deep Learning, check out my book Hands-On Machine Learning
with Scikit-Learn and TensorFlow.
In particular, there's a whole chapter on running TensorFlow across multiple GPUs and
servers.
There's also a German version and a French version, and I believe a Chinese version should
be out in the next few weeks.
And that's all I had for today!!
I hope you enjoyed this video and that you found it useful.
If you did, please, like, share, comment, subscribe, and you can also follow me on Twitter
if you're into that.
See you next time and I wish you a very Happy New Year!