In Understanding the Difficulty of Training Deep Feedforward Neural Networks, Xavier Glorot and Yoshua Bengio investigate the effect of different activation functions (the sigmoid, hyperbolic tangent (tanh), and softsign) on the saturation of activations in a 4-layer MLP toy network trained for image classification.
- They point out that the softmax + negative log-likelihood loss induces a steeper loss landscape than the quadratic (MSE) loss (which was apparently the standard choice for classification problems before this; they cite Rumelhart et al., 1986). A rough gradient sketch of why is given after this list.
- They show that the sigmoid activations of the last hidden layer (Layer 4) all saturate at zero (there is essentially no variance) pretty much immediately when training their toy network: this is a problem because the model is relying on the biases of the output (softmax) layer to perform the classification, ignoring the features coming from the hidden layers below.
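As a rough sketch of the intuition behind the first point (a standard textbook derivation, not lifted from the paper itself): with a sigmoid output unit and quadratic cost, the gradient with respect to the pre-activation carries a sigma-prime factor that vanishes whenever the unit saturates, even if the prediction is badly wrong; with softmax + negative log-likelihood that factor cancels, so confidently wrong predictions still receive a large gradient.

```latex
% Quadratic (MSE) cost with a sigmoid output: the sigma'(z) factor
% shrinks the gradient whenever the unit saturates.
\[
L_{\mathrm{MSE}} = \tfrac{1}{2}\bigl(\sigma(z) - y\bigr)^2,
\qquad
\frac{\partial L_{\mathrm{MSE}}}{\partial z} = \bigl(\sigma(z) - y\bigr)\,\sigma'(z).
\]

% Softmax + negative log-likelihood: the gradient w.r.t. the logits is
% simply (prediction - target), with no vanishing factor.
\[
L_{\mathrm{NLL}} = -\sum_{k} y_k \log p_k,
\qquad
p_k = \frac{e^{z_k}}{\sum_j e^{z_j}},
\qquad
\frac{\partial L_{\mathrm{NLL}}}{\partial z_k} = p_k - y_k.
\]
```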
Fan-in and Fan-out
In deep learning, in the context of initialization, I often encounter "fan_in" and "fan_out" parameters. I recently discovered this comes from electronic engineering (and something to do with gates).
Could you please explain what these parameters are in the context of initialising layers in a neural network, and explain why they are named as such - making reference to the origin of these terms from electronics
ChatGPT's response:
In the context of initializing layers in a neural network, the terms "fan_in" and "fan_out" refer to the number of inputs and outputs, respectively, of a particular layer. These terms were indeed borrowed from electronic engineering, specifically from the field of digital logic gates.
In electronics, a gate is a fundamental building block of digital circuits that performs a specific logical operation. Gates have input and output lines, and the terms "fan-in" and "fan-out" describe a gate's connectivity: the "fan-in" of a gate is the number of input lines connected to it, while the "fan-out" is the number of gate inputs that its output can drive.
When these concepts were borrowed for use in deep learning, they were applied to describe the connectivity between layers in a neural network. In a neural network, each layer consists of multiple artificial neurons (also known as nodes or units) that take inputs and produce outputs. The "fan-in" of a layer refers to the number of connections from the previous layer to the neurons in the current layer. Conversely, the "fan-out" of a layer represents the number of connections from the neurons in the current layer to the next layer.
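As a concrete illustration (using PyTorch purely as an example framework, not something the response above assumes): a `Linear` layer stores its weight as `(out_features, in_features)`, so fan-in and fan-out can be read straight off the weight shape.

```python
import torch.nn as nn

# Hypothetical layer: 256 incoming connections, 128 output units.
layer = nn.Linear(in_features=256, out_features=128)

# PyTorch stores Linear weights as (out_features, in_features),
# so fan_out is the first dimension and fan_in the second.
fan_out, fan_in = layer.weight.shape
print(fan_in)   # 256 -> connections feeding into each unit of this layer
print(fan_out)  # 128 -> connections leaving this layer for the next one
```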
In the context of weight initialization in deep learning, the terms "fan-in" and "fan-out" are commonly used to determine the appropriate scale of random weights. Initializing the weights randomly is crucial for training neural networks effectively. The magnitude of the weights at initialization should be carefully chosen to ensure proper learning dynamics and prevent issues like vanishing or exploding gradients.
By considering the "fan-in" and "fan-out" of a layer, you can adjust the initialization scheme accordingly. For example, one popular initialization method is Xavier initialization (also known as Glorot initialization), which scales the randomly initialized weights based on the "fan-in" and "fan-out" values of the layer. This helps to maintain the variance of activations and gradients throughout the network, promoting stable and efficient training.
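A minimal sketch of that scaling rule (assuming PyTorch again; the sizes are made up and `xavier_uniform` here is my own helper, not the library function):

```python
import math
import torch

def xavier_uniform(fan_in: int, fan_out: int) -> torch.Tensor:
    # Glorot/Xavier uniform: W ~ U(-a, a) with a = sqrt(6 / (fan_in + fan_out)),
    # which gives Var(W) = 2 / (fan_in + fan_out) and keeps activation and
    # gradient variances roughly constant from layer to layer.
    a = math.sqrt(6.0 / (fan_in + fan_out))
    return torch.empty(fan_out, fan_in).uniform_(-a, a)

W = xavier_uniform(fan_in=256, fan_out=128)
print(W.std())  # roughly sqrt(2 / 384) ≈ 0.072

# PyTorch ships the same rule as torch.nn.init.xavier_uniform_(tensor),
# applied in place to an existing weight tensor.
```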
In summary, the terms "fan-in" and "fan-out" used in deep learning initialization come from electronic engineering and describe the number of inputs and outputs, respectively, of a layer in a neural network. These terms were borrowed due to the analogy between the connectivity in electronic circuitry (gates) and neural networks (layers of artificial neurons), and have since become standard terminology in the field.