Title: Highway Networks
Authors: Rupesh Kumar Srivastava, Klaus Greff, Jürgen Schmidhuber
Published: 3rd May 2015 (Sunday) @ 01:56:57
Link: http://arxiv.org/abs/1505.00387v2

Abstract

There is plenty of theoretical and empirical evidence that depth of neural networks is a crucial ingredient for their success. However, network training becomes more difficult with increasing depth and training of very deep networks remains an open problem. In this extended abstract, we introduce a new architecture designed to ease gradient-based training of very deep networks. We refer to networks with this architecture as highway networks, since they allow unimpeded information flow across several layers on “information highways”. The architecture is characterized by the use of gating units which learn to regulate the flow of information through a network. Highway networks with hundreds of layers can be trained directly using stochastic gradient descent and with a variety of activation functions, opening up the possibility of studying extremely deep and efficient architectures.


It is worth noting that Srivastava, Greff and Schmidhuber’s paper was first uploaded to the arXiv on 3rd May 2015 (the v2 linked above is dated 3rd November 2015), whilst the ResNet paper, Deep Residual Learning for Image Recognition, which contains essentially the same idea applied to a deep CNN, was uploaded (v1) on 10th December 2015, as far as I can tell.


Highway networks enable training of very deep networks by allowing the ‘unimpeded information flow across several layers on information highways’ through the use of trainable gating units, inspired by LSTMs, which learn to regulate the flow of information through the network.

Motivation for Highway Networks

The authors cite the contemporary use of deeper networks to make advances in e.g. computer vision, such as AlexNet’s deeper architecture with smaller convolutional filters in each layer. They mention theoretical results from Montúfar et al. (2014), On the Number of Linear Regions of Deep Neural Networks [1], and others on the increased efficiency of deep nets (as opposed to wide ones, for example), whilst highlighting the difficulty of training deep nets, which has led researchers to use e.g. initialization schemes, multi-stage training or temporary companion losses.

Method

A vanilla fully-connected network applies an affine transformation (a linear transformation of the input vector via matrix multiplication, plus the addition of a bias vector) and follows this with a non-linear activation function $H$. Omitting the layer index and biases for clarity,

$$y = H(x, W_H)$$

Highway networks additionally define two nonlinear transforms, $T(x, W_T)$ and $C(x, W_C)$, such that

$$y = H(x, W_H) \cdot T(x, W_T) + x \cdot C(x, W_C)$$

where:

  • $T(x, W_T)$ is the transform gate
  • $C(x, W_C)$ is the carry gate

since they express how much of the output is produced by transforming the input and carrying it, respectively.

The authors set $C = 1 - T$ (analogous to a GRU [2]), giving

$$y = H(x, W_H) \cdot T(x, W_T) + x \cdot (1 - T(x, W_T))$$

So in the extremes we can carry the input through unchanged or entirely transform it:

$$y = \begin{cases} x, & \text{if } T(x, W_T) = 0 \\ H(x, W_H), & \text{if } T(x, W_T) = 1 \end{cases}$$

Thus, depending on the output of the transform gates, a highway layer can smoothly vary its behavior between that of a plain layer and that of a layer which simply passes its inputs through.

The Jacobian of the layer transform varies between the identity matrix and the derivative of the transform:

$$\frac{dy}{dx} = \begin{cases} I, & \text{if } T(x, W_T) = 0 \\ H'(x, W_H), & \text{if } T(x, W_T) = 1 \end{cases}$$
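As a quick sanity check (not from the paper), the sketch below evaluates the Jacobian of a single highway step numerically with torch.autograd.functional.jacobian, using tanh as the transform non-linearity and a hand-set gate bias to push $T$ towards 0 or 1; all names here are illustrative.

import torch
from torch.autograd.functional import jacobian

def highway_step(x, W_H, W_T, b_T):
    """y = H(x) * T(x) + x * (1 - T(x)), with H = tanh and T = sigmoid."""
    H = torch.tanh(x @ W_H)
    T = torch.sigmoid(x @ W_T + b_T)
    return H * T + x * (1.0 - T)

torch.manual_seed(0)
n = 4
x = torch.randn(n)
W_H, W_T = torch.randn(n, n), torch.randn(n, n)

# Strongly negative gate bias: T ≈ 0, so the Jacobian is ≈ the identity matrix.
J_carry = jacobian(lambda v: highway_step(v, W_H, W_T, torch.full((n,), -20.0)), x)
print(torch.allclose(J_carry, torch.eye(n), atol=1e-4))  # expect True

# Strongly positive gate bias: T ≈ 1, so the Jacobian is ≈ dH/dx.
J_trans = jacobian(lambda v: highway_step(v, W_H, W_T, torch.full((n,), 20.0)), x)
J_H = jacobian(lambda v: torch.tanh(v @ W_H), x)
print(torch.allclose(J_trans, J_H, atol=1e-4))  # expect True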

NB Since gated residual (“skip”) connections are used, the dimensionality of $x$, $y$, $H(x, W_H)$ and $T(x, W_T)$ must be the same. Plain FC layers can be used to match dimensionality between differently-sized highway blocks.
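A minimal sketch of that remark (illustrative only; highway_layer and project are made-up names, and the widths 50 and 71 are simply borrowed from the experiments summarised below): a plain nn.Linear changes the width between two highway blocks, while each block itself keeps its input and output widths equal.

import torch
import torch.nn as nn

def highway_layer(x, H, T):
    """One highway step: y = H(x) * T(x) + x * (1 - T(x))."""
    t = torch.sigmoid(T(x))
    return torch.relu(H(x)) * t + x * (1.0 - t)

x = torch.randn(8, 50)                         # a batch of 8 vectors of width 50
H1, T1 = nn.Linear(50, 50), nn.Linear(50, 50)
y = highway_layer(x, H1, T1)                   # highway block keeps width 50

project = nn.Linear(50, 71)                    # plain FC layer changes the width
z = torch.relu(project(y))

H2, T2 = nn.Linear(71, 71), nn.Linear(71, 71)
out = highway_layer(z, H2, T2)                 # highway block at width 71
print(out.shape)                               # torch.Size([8, 71])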

Highway Networks - Abridged Results and Conclusions

  • Their setup proved effective in their experiments, allowing training of deep (10-, 20-, 50- and 100-layer) networks with negative transform-gate bias initialization, a range of activation functions and various weight initialization distributions.
  • They used 50 units per highway layer and 71 per plain layer (to roughly equate the number of parameters per layer).
  • Plain nets outperformed highway nets at depth 10, but at greater depths (20+) highway nets outperformed plain nets.
  • After hyperparameter tuning, optimization was also faster for highway nets at greater depths.

[Figure: cross-entropy error plots comparing the optimization of plain networks and highway networks of varying depth]

PyTorch Implementation

A single highway block (layer) can be implemented in PyTorch as follows.

import torch.nn as nn


class Highway(nn.Module):
    def __init__(self, in_size, out_size):
        super(Highway, self).__init__()
        # The transform H and the transform gate T; in_size must equal out_size,
        # since the carry path adds the raw input to the gated transform.
        self.H = nn.Linear(in_size, out_size)
        self.H.bias.data.zero_()
        self.T = nn.Linear(in_size, out_size)
        self.T.bias.data.fill_(-1)  # negative bias: gates initially favour carrying
        self.relu = nn.ReLU()
        self.sigmoid = nn.Sigmoid()

    def forward(self, inputs):
        H = self.relu(self.H(inputs))      # H(x, W_H)
        T = self.sigmoid(self.T(inputs))   # T(x, W_T)
        return H * T + inputs * (1.0 - T)  # y = H*T + x*(1 - T)

This code is adapted from tacotron.py in github.com/r9y9/tacotron_pytorch.
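As a hypothetical usage sketch (not from the paper or from the tacotron code), a few of these blocks can be stacked between plain input and output layers; the 50-unit width is only chosen to echo the experiments above.

import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(784, 50),   # plain layer projects the input to the highway width
    nn.ReLU(),
    Highway(50, 50),      # in_size must equal out_size inside a block
    Highway(50, 50),
    Highway(50, 50),
    nn.Linear(50, 10),    # plain output layer
)

x = torch.randn(32, 784)  # e.g. a batch of flattened 28x28 inputs
logits = model(x)
print(logits.shape)       # torch.Size([32, 10])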

Footnotes

  1. This is a beautiful paper that shows geometrically how the number of linear regions into which a deep network divides its input space grows exponentially with the depth of the network but only polynomially with the number of units in a layer.

  2. In a GRU, the processes of adding new input and forgetting the old hidden state are coupled, with the new state being a convex combination of the previous state and the new candidate state; a tiny sketch of this coupling is given after these footnotes. See for example http://www.wildml.com/2015/10/recurrent-neural-network-tutorial-part-4-implementing-a-grulstm-rnn-with-python-and-theano/ which writes: “The input and forget gates [of an LSTM] are coupled by an update gate [in the GRU]”. NB The GRU architecture is attributed to Cho et al. (2014), Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation.
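The toy snippet below (purely illustrative, following the Cho et al. formulation rather than code from either paper) shows the single update gate z interpolating between the old hidden state and the new candidate, mirroring the T and C = 1 − T coupling in a highway layer.

import torch

h_prev = torch.randn(50)              # previous hidden state
h_cand = torch.tanh(torch.randn(50))  # new candidate state (its own gating elided)
z = torch.sigmoid(torch.randn(50))    # update gate

# One gate does both jobs: keep part of the old state and admit the rest from the candidate.
h_new = z * h_prev + (1.0 - z) * h_cand
print(h_new.shape)                    # torch.Size([50])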