Vanilla Neural Networks

Introduction

In this post, I walk through the inner workings of a vanilla neural network. In particular, I describe the forward pass of input data, as well as the back-propagation that occurs during training.

The Overall Framework

Neural nets are a supervised learning framework that learns a function mapping input vectors, $x$, to output vectors, $y$. Vanilla neural nets take the simplest approach possible: they first multiply the input by a matrix, $w$, add a bias vector, $b$, then pass the result through a non-linearity, like the sigmoid function, $\sigma(z)=\frac{1}{1+e^{-z}}$. The expression for such a neural network is:

\begin{align*} \begin{split} \boldsymbol{z} = \boldsymbol{w} \boldsymbol{x} + \boldsymbol{b} \\ \boldsymbol{y} = \sigma(\boldsymbol{z}) \end{split} \end{align*}

Aside from the obvious ability to express non-linear functions, the $\sigma$ function allows neural nets to be stacked on top of each other. That is, the output of one neural network can be fed into another as input. This stacking is what enables deep learning: composing layers dramatically enlarges the space of functions the network can represent, giving it enormous expressiveness. Note that this expressiveness cannot be attained without the non-linearity, as a product of multiple matrices mathematically reduces to a single matrix.
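To make this stacking concrete, here's a minimal NumPy sketch of two such layers chained together (the layer sizes, random initialization, and variable names are my own illustrative choices, not from the post):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical sizes: 3 inputs -> 4 hidden units -> 2 outputs.
rng = np.random.default_rng(0)
w1, b1 = rng.standard_normal((4, 3)), rng.standard_normal(4)
w2, b2 = rng.standard_normal((2, 4)), rng.standard_normal(2)

x = rng.standard_normal(3)   # an input vector
y1 = sigmoid(w1 @ x + b1)    # first net: z = w x + b, y = sigma(z)
y2 = sigmoid(w2 @ y1 + b2)   # second net consumes the first's output
```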

Training a Neural Network

The crucial question now is, How do we get the right weights and biases? In other words, how should the initially random entries of the $\boldsymbol{w}$ matrices and $\boldsymbol{b}$ vectors be modified so that our neural net becomes a good approximation of the mapping from input to output? In this section, we'll mathematically analyze neural nets to figure out how we can coax these weights toward better values.

The Loss Function

First, we need to define what it means for a neural net to be a good approximation of the mapping from input to output, given a number of training points. Intuitively, this mapping should be the one that best fits the training data. That is, given the input vectors of the training data, the neural net should give back vectors similar to the corresponding desired output vectors.

We'll define a metric of how far our net's outputs, $\boldsymbol{y}$, are from the desired outputs, $\boldsymbol{y^{*}}$, via a loss function. A common choice is the quadratic loss function, defined as follows:

\begin{equation*} L(\boldsymbol{y} | \boldsymbol{w}, \boldsymbol{b}) = \frac{1}{2n} \sum_{\boldsymbol{x}} \lVert \boldsymbol{y(x)} - \boldsymbol{y}^{*} \rVert ^2 \end{equation*}

Here, $n$ is the number of training points we're working with, $\boldsymbol{y}$ is the final output of the neural net, and $\boldsymbol{y}^{*}$ is the desired output. The quadratic loss function is just the average of the squared Euclidean distances between desired and actual outputs (the factor of $\frac{1}{2}$ is unimportant; it's only there to ease later calculations). Our goal is to decrease the value of this loss function as much as possible. We'll do this with what's called gradient descent.
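For concreteness, here's how this loss might be written in NumPy, assuming outputs and targets are lists holding the net's output vectors and the corresponding desired output vectors (the names are mine, purely for illustration):

```python
import numpy as np

def quadratic_loss(outputs, targets):
    """Average squared Euclidean distance between actual and desired
    outputs, with the conventional factor of 1/2."""
    n = len(outputs)
    return sum(0.5 * np.sum((y - y_star) ** 2)
               for y, y_star in zip(outputs, targets)) / n
```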

Gradient Descent

The idea behind gradient descent is to iteratively decrease our loss by nudging all the weights and biases toward better values. On each iteration, our goal is to achieve a negative $\Delta L$. Calculus tells us that

\begin{align*} \begin{split} \Delta L \approx \nabla_{\boldsymbol{w}} L \cdot \Delta \boldsymbol{w} + \nabla_{\boldsymbol{b}} L \cdot \Delta \boldsymbol{b} \end{split} \end{align*}

We can see from this expression that, if we let $\Delta \boldsymbol{w} = -\eta \nabla_{\boldsymbol{w}} L$ and $\Delta \boldsymbol{b} = -\eta \nabla_{\boldsymbol{b}} L$, then $\Delta L \approx -\eta \left( \lVert \nabla_{\boldsymbol{w}} L \rVert^2 + \lVert \nabla_{\boldsymbol{b}} L \rVert^2 \right)$, which is guaranteed to be non-positive, and consequently gives a lower value of $L$. Note that the learning rate $\eta$ is any positive number small enough to justify the approximation made in the above equation. Thus, each iteration, we'll simply update the weights and biases according to

\begin{align*} \begin{split} \boldsymbol{w} \leftarrow \boldsymbol{w} - \eta \nabla_{\boldsymbol{w}} L \\ \boldsymbol{b} \leftarrow \boldsymbol{b} - \eta \nabla_{\boldsymbol{b}} L \\ \end{split} \end{align*}
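In code, one iteration is nothing more than the following sketch, where grad_w and grad_b are assumed to already hold the gradients of $L$ (how to compute them is the subject of the next section):

```python
def gradient_descent_step(w, b, grad_w, grad_b, eta=0.01):
    """One update: nudge the parameters against the gradient of L.
    grad_w and grad_b come from back-propagation (next section);
    eta is a small, hand-picked learning rate."""
    return w - eta * grad_w, b - eta * grad_b
```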

Therefore, we must simply follow the above update rules for some number of iterations. The question is, How do we find $\nabla_{\boldsymbol{w}} L$ and $\nabla_{\boldsymbol{b}} L$?

Back-Propagation and the Chain Rule

Imagine we had a neural network consisting of $10$ layers. Let the superscript of the weights and biases indicate the index of the layer they belong to. We've just passed in a training input, so we know the $\boldsymbol{y}$ values for each layer, as well as the inputs to the non-linearities, $\boldsymbol{z}$, at each layer. And, of course, we know $\boldsymbol{y}^{*}$, the final desired output. To find $\nabla_{\boldsymbol{w}^{(10)}} L = \frac{\partial L}{\partial \boldsymbol{w}^{(10)}}$ and $\nabla_{{\boldsymbol{b}}^{(10)}} L = \frac{\partial L}{\partial \boldsymbol{b}^{(10)}}$, all we have to do is apply the chain rule to obtain

\begin{align*} \begin{split} \frac{\partial L}{\partial \boldsymbol{w}^{(10)}} = \boldsymbol{\delta}^{(10)} \frac{\partial \boldsymbol{y}^{(10)}}{\partial \boldsymbol{z}^{(10)}} \frac{\partial \boldsymbol{z}^{(10)}}{\partial \boldsymbol{w}^{(10)}} \\ \frac{\partial L}{\partial \boldsymbol{b}^{(10)}} = \boldsymbol{\delta}^{(10)} \frac{\partial \boldsymbol{y}^{(10)}}{\partial \boldsymbol{z}^{(10)}} \frac{\partial \boldsymbol{z}^{(10)}}{\partial \boldsymbol{b}^{(10)}} \\ \boldsymbol{\delta}^{(10)} = \frac{\partial L}{\partial \boldsymbol{y}^{(10)}} \end{split} \end{align*}

In the case of our sigmoid activation function, we know this amounts to

\begin{align*} \begin{split} \frac{\partial L}{\partial \boldsymbol{w}^{(10)}} = \boldsymbol{\delta}^{(10)} \sigma'(\boldsymbol{z}^{(10)}) \boldsymbol{y}^{(9)} = \boldsymbol{\delta}^{(10)} \left( \sigma(\boldsymbol{z}^{(10)})(1 - \sigma(\boldsymbol{z}^{(10)})) \right) \boldsymbol{y}^{(9)} \\ \frac{\partial L}{\partial \boldsymbol{b}^{(10)}} = \boldsymbol{\delta}^{(10)} \left( \sigma(\boldsymbol{z}^{(10)})(1 - \sigma(\boldsymbol{z}^{(10)})) \right) \\ \boldsymbol{\delta}^{(10)} = \boldsymbol{y}^{(10)} - \boldsymbol{y}^{*} \end{split} \end{align*}
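For a single training point (dropping the $\frac{1}{n}$ averaging), these layer-$10$ calculations might look like the following NumPy sketch; note that, written out in vectorized form, the $\boldsymbol{w}$-gradient becomes an outer product with layer $9$'s output (the function and variable names are my own):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def layer10_grads(z10, y9, y10, y_star):
    """Layer-10 gradients for a single training point.
    z10: layer 10's pre-activation, y9: layer 9's output,
    y10: final output, y_star: desired output."""
    delta10 = y10 - y_star                         # dL/dy^(10)
    sig_prime = sigmoid(z10) * (1.0 - sigmoid(z10))
    grad_b10 = delta10 * sig_prime                 # dL/db^(10)
    grad_w10 = np.outer(grad_b10, y9)              # dL/dw^(10)
    return grad_w10, grad_b10
```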

We now have the $\nabla_{\boldsymbol{w}^{(10)}} L$ and $\nabla_{{\boldsymbol{b}}^{(10)}} L$ terms to use in our updates for layer $10$. Next, we must find the terms $\nabla_{\boldsymbol{w}^{(9)}} L$ and $\nabla_{\boldsymbol{b}^{(9)}} L$. To do this, we again use the chain rule. The expression for layer $9$ is

\begin{align*} \begin{split} \frac{\partial L}{\partial \boldsymbol{w}^{(9)}} = \boldsymbol{\delta}^{(9)} \left( \sigma(\boldsymbol{z}^{(9)})(1 - \sigma(\boldsymbol{z}^{(9)})) \right) \boldsymbol{y}^{(8)} \\ \frac{\partial L}{\partial \boldsymbol{b}^{(9)}} = \boldsymbol{\delta}^{(9)} \left( \sigma(\boldsymbol{z}^{(9)})(1 - \sigma(\boldsymbol{z}^{(9)})) \right) \\ \boldsymbol{\delta}^{(9)} = \frac{\partial L}{\partial \boldsymbol{y}^{(9)}} = \frac{\partial L}{\partial \boldsymbol{y}^{(10)}} \frac{\partial \boldsymbol{y}^{(10)}}{\partial \boldsymbol{z}^{(10)}} \frac{\partial \boldsymbol{z}^{(10)}}{\partial \boldsymbol{y}^{(9)}} = \boldsymbol{\delta}^{(10)} \left( \sigma(\boldsymbol{z}^{(10)})(1 - \sigma(\boldsymbol{z}^{(10)})) \right) \boldsymbol{w^{(10)}} \end{split} \end{align*}

Notice that layer $9$'s expression is identical to the one for layer $10$, except that the $\boldsymbol{\delta}$ is updated. In fact, $\boldsymbol{\delta}^{(9)}$ relies on $\boldsymbol{\delta}^{(10)}$. By updating $\boldsymbol{\delta}$ for each preceding layer, we can efficiently compute each layer's $\nabla \boldsymbol{w}$ and $\nabla \boldsymbol{b}$ values. This process of storing and reusing the $\boldsymbol{\delta}$ values is called back-propagation. The following are the calculations required to update layer $8$.

\begin{align*} \begin{split} \frac{\partial L}{\partial \boldsymbol{w}^{(8)}} = \boldsymbol{\delta}^{(8)} \left( \sigma(\boldsymbol{z}^{(8)})(1 - \sigma(\boldsymbol{z}^{(8)})) \right) \boldsymbol{y}^{(7)} \\ \frac{\partial L}{\partial \boldsymbol{b}^{(8)}} = \boldsymbol{\delta}^{(8)} \left( \sigma(\boldsymbol{z}^{(8)})(1 - \sigma(\boldsymbol{z}^{(8)})) \right) \\ \boldsymbol{\delta}^{(8)} = \frac{\partial L}{\partial \boldsymbol{y}^{(8)}} = \frac{\partial L}{\partial \boldsymbol{y}^{(9)}} \frac{\partial \boldsymbol{y}^{(9)}}{\partial \boldsymbol{z}^{(9)}} \frac{\partial \boldsymbol{z}^{(9)}}{\partial \boldsymbol{y}^{(8)}} = \boldsymbol{\delta}^{(9)} \left( \sigma(\boldsymbol{z}^{(9)})(1 - \sigma(\boldsymbol{z}^{(9)})) \right) \boldsymbol{w^{(9)}} \end{split} \end{align*}

As expected, the $\boldsymbol{\delta}$ value for layer $8$ is computed using the $\boldsymbol{\delta}$ value for layer $9$. On a quick side note, be aware that most literature uses $\delta = \frac{\partial L}{\partial z}$ instead of $\delta = \frac{\partial L}{\partial y}$, since it's a little more computationally efficient. However, the derivations via the chain rule are nearly identical.
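Here's a hedged NumPy sketch of the full backward pass, with the $\boldsymbol{\delta}$ recursion made explicit (the vectorized forms, such as the outer products and the transpose in the $\boldsymbol{\delta}$ update, are my own rendering of the equations above):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop(ws, bs, x, y_star):
    """Back-propagation for a stack of sigmoid layers.
    ws, bs: lists of weight matrices and bias vectors, one per layer."""
    # Forward pass: record each layer's z and y for reuse below.
    ys, zs = [x], []
    for w, b in zip(ws, bs):
        zs.append(w @ ys[-1] + b)
        ys.append(sigmoid(zs[-1]))

    grads_w, grads_b = [None] * len(ws), [None] * len(ws)
    delta = ys[-1] - y_star                        # delta for the last layer
    for l in reversed(range(len(ws))):
        sig_prime = sigmoid(zs[l]) * (1.0 - sigmoid(zs[l]))
        grads_b[l] = delta * sig_prime             # dL/db for layer l
        grads_w[l] = np.outer(grads_b[l], ys[l])   # ys[l] is the layer's input
        delta = ws[l].T @ grads_b[l]               # delta for the preceding layer
    return grads_w, grads_b
```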

Conclusion

The basic ideas behind neural nets are:

  • Feed training samples through the layers of matrix multiplications, vector additions, and non-linearities.
  • Use back-propagation to calculate the gradient of the loss function with respect to each of the parameters.
  • Use the calculated gradients, along with a small learning rate $\eta$, to update each parameter; a compact sketch combining these three steps follows below.
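Here is that end-to-end sketch, training a single sigmoid layer on made-up toy data (everything here, from the data to the learning rate, is an illustrative choice of mine, not from the post):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy data: map 2-d inputs to 1-d targets (made-up numbers).
rng = np.random.default_rng(1)
xs = [rng.standard_normal(2) for _ in range(50)]
ys_star = [sigmoid(np.array([x[0] - x[1]])) for x in xs]

# A single sigmoid layer, trained by the three steps above.
w, b, eta = rng.standard_normal((1, 2)), np.zeros(1), 0.5
for epoch in range(1000):
    grad_w, grad_b = np.zeros_like(w), np.zeros_like(b)
    for x, y_star in zip(xs, ys_star):
        z = w @ x + b                            # forward pass
        y = sigmoid(z)
        delta = y - y_star                       # back-propagation (one layer)
        sig_prime = sigmoid(z) * (1 - sigmoid(z))
        grad_b += delta * sig_prime
        grad_w += np.outer(delta * sig_prime, x)
    w -= eta * grad_w / len(xs)                  # gradient-descent update
    b -= eta * grad_b / len(xs)
```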

I plan to build off this post by exploring other architectures of neural nets: convolutional nets, recurrent nets, long short-term memory units, etc. Meanwhile, I'll be working on a generic neural net implementation that incorporates not only feed-forward layers (as we've seen in this post), but convolutional, recurrent, and other neural layers. Here's the code for both Python and C++ neural network implementations.