The learning rate $$\eta$$ is one of the hyperparameters we need to tune when training neural networks. It sets the step size of gradient descent and therefore controls how fast we approach a minimum of the error function. If $$\eta$$ is too small, learning takes too long, which is especially a problem in deep networks that already suffer from long training times. But a learning rate that is too high causes problems as well: we might overshoot the minimum and oscillate around in the error landscape. Here, we want to analyse the effect of the learning rate on a simple example. For this, we use the following network, which consists of only one input and one sigmoid neuron.

To train the network, we use the quadratic error function between the expected output $$T^{\mu}$$ and the output of the network $$y(x^{\mu})$$, summed over all training instances $$\mu$$ (with $$M$$ training instances in total):

$$\label{eq:NNLearningRate_ErrorFunction} E(w, b) = \sum_{\mu=1}^{M} \left( T^{\mu} - y(x^{\mu}) \right)^2 = \sum_{\mu=1}^{M} \left( T^{\mu} - f(w \cdot x^{\mu} + b) \right)^2.$$

We use a small training dataset consisting of only four scalar input values and their targets:

\begin{align*} x^1 &= -1, &T^1 &= 0 \\ x^2 &= 0, &T^2 &= 1 \\ x^3 &= 1, &T^3 &= 0 \\ x^4 &= 2, &T^4 &= 0 \end{align*}
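To make the error function concrete, here is a minimal sketch that evaluates $$E(w, b)$$ on the four training points. It assumes that $$f$$ is the logistic sigmoid $$f(z) = 1 / (1 + e^{-z})$$, since the text only says "sigmoid neuron". The names `X`, `T`, `sigmoid` and `error` are chosen for illustration.

```python
import math

# Training set from the text: inputs x^mu and targets T^mu.
X = [-1.0, 0.0, 1.0, 2.0]
T = [0.0, 1.0, 0.0, 0.0]


def sigmoid(z):
    # Logistic sigmoid (assumed activation f), written so that
    # math.exp never receives a large positive argument.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def error(w, b):
    # Quadratic error summed over all training instances.
    return sum((t - sigmoid(w * x + b)) ** 2 for x, t in zip(X, T))


# Error at the point (w, b) = (-2, -2).
print(error(-2.0, -2.0))
```

At $$(w, b) = (0, 0)$$ the neuron outputs $$0.5$$ for every input, so each of the four terms contributes $$0.25$$ and the error is exactly $$1$$.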

Using gradient descent, the update rules for the weight $$w$$ and bias $$b$$ are defined as follows

\begin{align} \begin{split} w(t+1) &= w(t) - \eta \cdot \frac{\partial E(w,b)}{\partial w} \\ b(t+1) &= b(t) - \eta \cdot \frac{\partial E(w,b)}{\partial b} \end{split} \label{eq:NNLearningRate_UpdateRules} \end{align}

and we apply them as batch updates, i.e. all four training samples are used in each iteration. The following animation shows the error surface produced by \eqref{eq:NNLearningRate_ErrorFunction} and a trajectory from the starting point $$(w(0), b(0)) = (-2, -2)$$ with the goal of reaching the minimum. You can change the value of the learning rate $$\eta$$ and see how it changes the trajectory. Note also the plot below the error surface, which shows the error $$E(w(t), b(t))$$ over the iterations $$t$$. Ideally, this function decreases until we reach the minimum. Unfortunately, this does not always work out: if the learning rate is too high, the trajectory can overshoot the minimum and the error oscillates instead of decreasing.
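The experiment can also be reproduced without the animation. The sketch below again assumes that $$f$$ is the logistic sigmoid, for which $$f'(z) = f(z) \left( 1 - f(z) \right)$$. Under that assumption, the partial derivatives of the error function are

\begin{align*} \frac{\partial E(w,b)}{\partial w} &= -2 \sum_{\mu=1}^{M} \left( T^{\mu} - y(x^{\mu}) \right) y(x^{\mu}) \left( 1 - y(x^{\mu}) \right) x^{\mu} \\ \frac{\partial E(w,b)}{\partial b} &= -2 \sum_{\mu=1}^{M} \left( T^{\mu} - y(x^{\mu}) \right) y(x^{\mu}) \left( 1 - y(x^{\mu}) \right) \end{align*}

and each batch iteration takes one step against this gradient. The function names, the number of steps and the three learning rates in the loop are illustrative choices, not values taken from the animation.

```python
import math

# Training set from the text: inputs x^mu and targets T^mu.
X = [-1.0, 0.0, 1.0, 2.0]
T = [0.0, 1.0, 0.0, 0.0]


def sigmoid(z):
    # Logistic sigmoid (assumed activation f), overflow-safe form.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def error(w, b):
    # Quadratic error summed over all training instances.
    return sum((t - sigmoid(w * x + b)) ** 2 for x, t in zip(X, T))


def gradient(w, b):
    # Partial derivatives of the error with respect to w and b.
    dw = db = 0.0
    for x, t in zip(X, T):
        y = sigmoid(w * x + b)
        common = -2.0 * (t - y) * y * (1.0 - y)
        dw += common * x
        db += common
    return dw, db


def gradient_descent(eta, steps=100, w=-2.0, b=-2.0):
    # Batch gradient descent from (w(0), b(0)) = (-2, -2).
    # Returns the final parameters and the error after every iteration.
    errors = [error(w, b)]
    for _ in range(steps):
        dw, db = gradient(w, b)
        w, b = w - eta * dw, b - eta * db  # step against the gradient
        errors.append(error(w, b))
    return w, b, errors


# Hypothetical learning rates: small, moderate and large.
for eta in (0.01, 0.5, 5.0):
    w, b, errors = gradient_descent(eta)
    print(f"eta={eta}: final error {errors[-1]:.4f}, "
          f"lowest error seen {min(errors):.4f}")
```

Plotting the returned `errors` list over the iteration index gives the same kind of curve as the plot below the error surface. Comparing the final error with the lowest error seen is a quick check for whether a run overshot a better point it had already visited.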
