In the paper, the authors investigate the question - why do deep ensembles work better than single deep neural networks?
In their investigation, the authors find:

- Different snapshots of the same model (i.e., the same model after 1, 10, 100 epochs of training) exhibit functional similarity. Hence, their ensemble is less likely to explore the different modes of local minima in the optimization space.
- Different solutions of the same model (i.e., the model trained with a different random initialization each time) exhibit functional dissimilarity. Hence, their ensemble is more likely to explore the different modes of local minima in the optimization space.
Inspired by their findings, in this article, we present several different insights that are useful for understanding the dynamics of deep neural networks in general.
## Revisiting the Optimization Landscape of Neural Networks
Training a neural network is stochastic, i.e., each time you train the same network, it may not reach exactly the same solution as before. Neural networks are optimized using gradient-based learning, and the optimization problem is almost always non-convex. Expressed with Greek letters, the optimization problem looks like so -

$$ \min_{\theta} \frac{1}{m} \sum_{i=1}^{m} \ell(h_\theta(x_i), y_i) $$

where,

- $$ \theta $$ is the set of trainable parameters of the network,
- $$ m $$ is the number of training examples,
- $$ h_\theta $$ is the model (neural network) parameterized over $$ \theta $$,
- $$ \ell(h_\theta(x), y) $$ is the loss incurred by the model on a training example $$ (x, y) $$.
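As a small illustration (our own sketch, not code from the paper), this objective for a classifier amounts to the average cross-entropy over the $$ m $$ training examples:

```
import numpy as np

# probs: (m, num_classes) softmax outputs h_theta(x_i); labels: (m,) integer class ids y_i
def empirical_risk(probs, labels):
    # average cross-entropy loss over the m training examples
    per_example_loss = -np.log(probs[np.arange(len(labels)), labels] + 1e-12)
    return np.mean(per_example_loss)
```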
Consider the figure below, which shows a sample non-convex loss landscape (typical of neural networks). As we can see, there are multiple local minima. A trained neural network reaches only one of these local minima, and the same network can end up in a different local minimum each time it is trained with a different random initialization, which leads to high variance in its predictions. We can also see that these local minima lie at roughly the same level in the loss landscape, which further suggests that whichever of them a network ends up in, it will yield more or less the same kind of performance.

## Mitigating the High Variance of a Single Model With Ensembling

To cover these local minima better, we often train several versions of the same model, each with a different initialization. During inference, we collect the predictions from each of these different solutions and average them. This works quite well in practice, and the process is referred to as ensembling. Ensembling also helps to reduce the high variance that comes from the predictions of the individual models (the same network trained multiple times with different random initializations).

In order to understand why ensembles work well, we need to figure out the ingredients that make them cover the loss landscape better. Neural networks are parameterized functions, as we saw earlier. Each time we train a network, we end up in a different region of the parameter space, leading to a different optimum. The more diverse this space, the better the coverage of the different optima. So, how do we quantify this diversity? To investigate this systematically, the authors do the following (among other things):

- They measure the cosine similarity of the weights from different runs of the same network. Cosine similarity is a widely used metric for measuring the similarity between two vectors: it captures their orientation rather than their magnitude (refer to the figure below). Formally speaking, it is the dot product of the two vectors divided by the product of their respective norms. The authors use it to examine the similarity of different trajectories (weights of the same model trained with different initializations). See the snippets after this list for how we compute it.
- They measure the extent to which the predictions from different runs disagree with each other. The authors want to see if models trained with different initializations fail on the same subset (or the complete set) of the test dataset. If a model trained with different inits produces different predictions on the test dataset, we can say that its predictions are a function of its initialization.
- Also, the examples that tend to confuse the model across different initializations can be called intrinsically hard examples. To find these, we first compare the [confusion matrix](https://wandb.ai/wandb/plots/reports/Confusion-Matrix--VmlldzozMDg1NTM) epoch-wise, i.e., confusion matrices across individual epochs from the same init. This is followed by a solution-wise comparison, i.e., confusion matrices from the different solutions (inits) of the same model.

Practically, we can measure the weight-space similarity by training the same model with different initializations, grabbing the trainable weights (ignoring the biases), flattening the weights from each layer, and concatenating them into a single vector. We then apply the cosine similarity formula (NumPy implementation) to each pair of models:

```
import numpy as np

# compute cosine similarity of the flattened weight vectors from two runs
cos_sim = np.dot(weights1, weights2) / (np.linalg.norm(weights1) * np.linalg.norm(weights2))
```
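For completeness, here is a minimal sketch (our own, not from the paper) of how the trainable weights of two models trained from different initializations might be flattened with `tf.keras` before applying the formula above; `model_a`, `model_b`, and the `flatten_weights` helper are hypothetical names:

```
import numpy as np
import tensorflow as tf

def flatten_weights(model):
    # collect kernels only (skip biases), flatten each, and concatenate into one vector
    kernels = [tf.reshape(w, [-1]).numpy()
               for w in model.trainable_weights if "bias" not in w.name]
    return np.concatenate(kernels)

# model_a and model_b: same architecture, different random initializations
weights1 = flatten_weights(model_a)
weights2 = flatten_weights(model_b)
cos_sim = np.dot(weights1, weights2) / (np.linalg.norm(weights1) * np.linalg.norm(weights2))
```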
Practically, to compute the dissimilarity in predictions, count the number of test examples on which the predictions from two runs agree, normalize by dividing that count by the total number of test data points, and subtract the result from 1.

```
import numpy as np

# compute prediction dissimilarity between two runs (CIFAR-10 test set has 10,000 examples)
dissimilarity_score = 1 - np.sum(np.equal(preds1, preds2)) / 10000
```

Before we dive deep into the experiments mentioned above, it is essential to review our experimental setup.

## Experimental Setup

- Dataset used (primarily): CIFAR-10
- Architectures: SmallCNN, MediumCNN (channels \[32, 64, 128, 128\]), and ResNet20v1
- Dropout: 0.1 (only applicable when using SmallCNN and MediumCNN)
- Learning rate schedule: start at $$ 1.6 \times 10^{-3} $$ and halve it every 10 epochs
- Data augmentation: only when using ResNet20v1

Note: We did not exactly follow what is specified in section 3 of the paper; there are minor differences between our experimental setup and what the authors followed. For convenience, below we show what the learning rate schedule looks like and the data augmentation pipeline we followed.

```
import tensorflow as tf

def augment(image, label):
    image = tf.image.resize_with_crop_or_pad(image, 40, 40)  # Add 8 pixels of padding
    image = tf.image.random_crop(image, size=[32, 32, 3])  # Random crop back to 32x32
    image = tf.image.random_brightness(image, max_delta=0.5)  # Random brightness
    image = tf.clip_by_value(image, 0., 1.)
    return image, label
```

We used [Google Colab](https://colab.research.google.com/) for running all of our experiments.

## Dissecting the Weight Space of a (Deep) Ensemble

Going back to our experiments, we are going to present each of them in two different flavors. For each of the experiments quantifying diversity (cosine similarity, prediction disagreement), we:

- Take different snapshots of a model from the same training run and perform the experiment.
- Train the model multiple times with different random initializations and perform the experiment.

Note: By snapshots, we refer to models taken from epoch 0, epoch 1, and so on of the same training run (same initialization).

### Cosine Similarity Between the Weights (Snapshots)

#### Observations

- The functions (different checkpoints of the same model) in the same trajectory are similar, and this holds for all variants (small, medium, and large) of the model.
- The weights of the different snapshots of the same model show an increasingly high degree of similarity with each other as training approaches convergence. Thus, there is not much change in the weight space once the trajectory has settled into a loss-landscape valley.
- The checkpoints from the later stages of training differ the most from the initial stage of training, followed by mild similarity (whitish region).

### Cosine Similarity Between the Weights (Different Inits)

#### Observations

- The models trained with different initializations (different trajectories) are entirely dissimilar. This holds for all three variants of the model.
- Thus, the initialization decides the region of the weight space the model will explore.

### Disagreement Between Predictions (Snapshots)

#### Observations

- The functions (different checkpoints of the same model) in the same trajectory tend to disagree less about their predictions, further confirming that functions in the same trajectory are similar.
- From the prediction dissimilarity plot, we can see that the different snapshots of the same model show an increasingly high degree of similarity with each other as training approaches convergence (increasing epoch). Thus, one can say that many examples are functionally mapped ($$ x \rightarrow y $$) once the trajectory has settled into a loss-landscape valley.
- We also observe high dissimilarity in predictions between the checkpoints from the later stages of training and the very initial stage of training.

### Disagreement Between Predictions (Different Inits)

#### Observations

- The predictions of the same model trained on the same dataset with the same hyperparameters but different initializations disagree.
- Obviously, there is a subset of examples that the models trained along different trajectories will agree upon.
- There must also be a subset of intrinsically hard examples that the models trained along different trajectories will misclassify similarly. We shall investigate this in the next section.

## Intrinsic Hardness as a Function of Initialization

Below we see that the set of examples that confuses a model changes epoch-wise as optimization proceeds. We further see that this set varies when we train the model with a different initialization. We could not list the results from all the different initializations due to space constraints, but feel free to check them out [here](https://wandb.ai/authors/loss-landscape). This suggests that the definition of intrinsically hard examples is relative to how a model is initialized for training. It may also further suggest that the images that cause the top losses during training (epoch-wise) are not the same when we change the initialization of a model.

Note: You can click on the little button located at the top-left corner and play with the slider to see how the confusion matrices change with epochs. The idea of creating an epoch-wise callback is borrowed from [this tutorial](https://www.tensorflow.org/tensorboard/r2/image_summaries).

## Different Initializations and Their Paths to Optimization

We talked about different initializations of the same model and observed the functional dissimilarity between them. To spice it up, let's try to visualize the paths of the different trajectories. The authors do so by taking three (for simplicity) different trajectories (inits) of the same model. They then take the softmax outputs from different checkpoints along the individual training trajectories and append them to an array. The shape of the array is (num\_of\_trajectories, num\_of\_epochs, num\_of\_test\_examples, num\_classes). The predictions from all the solutions and their individual epochs are appended to a single array because they belong to the same "space". We then apply a two-component t-SNE to reduce this higher-dimensional space to a two-dimensional one.
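Here is a minimal sketch of this projection (our own illustration; the array name `softmax_outputs` is assumed), using scikit-learn's t-SNE:

```
import numpy as np
from sklearn.manifold import TSNE

# softmax_outputs: (num_of_trajectories, num_of_epochs, num_of_test_examples, num_classes)
num_traj, num_epochs, num_test, num_classes = softmax_outputs.shape

# each (trajectory, epoch) checkpoint becomes one point in prediction space
points = softmax_outputs.reshape(num_traj * num_epochs, num_test * num_classes)

# two-component t-SNE projection of the checkpoints
embedding = TSNE(n_components=2).fit_transform(points)  # shape: (num_traj * num_epochs, 2)
```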
Below is the result of this experiment for the Small- and Medium-sized CNNs. And wow! In our opinion, and also from the plots shown below, it is evident that the models with different initializations follow different trajectories. As a run approaches convergence, its checkpoints tend to cluster together around a valley in this space. Even though the models reach similar accuracy, we can clearly see evidence of multiple minima that lie on the same plane.

## Accuracy as a Function of Ensemble Size

Another interesting question the authors explore is - how does ensemble size affect the overall test accuracy? Below we can see that as we keep increasing the ensemble size, the model performance improves. For SmallCNN, the improvement plateaus after a certain point. We think this might be because a small-capacity model does not produce an optimal solution over the training dataset: ensembling predictions does help improve model performance, but after reaching peak performance, the uncertainty from multiple suboptimal models outweighs the benefit of ensembling. Overall, this suggests that an ensemble performs better because it is able to cover the optimization landscape better than a single model, and that indeed seems to be the case.
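Concretely, the effect of ensemble size can be measured by averaging the softmax predictions of a growing set of independently trained models. Below is a rough sketch (our own; `models`, `x_test`, and `y_test` are assumed to already exist):

```
import numpy as np

# models: trained models of the same architecture, each from a different random init
# x_test: test images; y_test: integer labels of shape (num_test,)
probs = np.stack([model.predict(x_test) for model in models])  # (num_models, num_test, num_classes)

for k in range(1, len(models) + 1):
    ensemble_probs = probs[:k].mean(axis=0)  # average the softmax outputs of the first k models
    accuracy = np.mean(np.argmax(ensemble_probs, axis=1) == y_test)
    print(f"Ensemble size {k}: test accuracy = {accuracy:.4f}")
```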
> Although this behavior is interesting, for deployment-related situations, using a large ensemble of very heavy models might not be practically feasible.

## Perturbing an Already Optimized Solution Space

In addition to the experiments based on the checkpoints along a trajectory, the authors also explore the subspace along an individual trajectory. The subspace along a trajectory is a set of functions (solutions) that live in the function space around the explored solution and that could be reached when retraining with the same initialization. The authors use a representative set of subspace sampling methods, including:

- Diagonal Gaussian approximation
- Low-rank covariance matrix Gaussian approximation
- Random subspace approximation

The authors construct their subspace around an optimized weight-space solution $$ \theta $$ (the weights and biases of a trained neural network). Using the t-SNE plot setup described above, they show that the constructed subspace lies in the same valley as the optimized solution, while a different solution lies in a different valley. The authors validate two hypotheses:

- Ensembling the solutions obtained by sub-sampling around an optimized solution provides benefits in terms of model performance. But...
- The relative benefit of simple ensembling (shown above) is higher, as it averages predictions over more diverse solutions.

The plot below summarizes these findings.

## Conclusion

The paper we discussed in this article uses simple experiments to give us an excellent understanding of why (deep) ensembles are so powerful at covering the optimization landscape. Below we leave you with a couple of amazing papers in case you are interested in knowing more about different aspects of deep neural networks -

## Acknowledgements

Thanks to Balaji Lakshminarayanan for providing feedback on the initial draft of the article and rectifying our mistake on the tSNE projections. We hope you have enjoyed reading this article. For any feedback, reach out to us on Twitter: [@RisingSayak](https://twitter.com/RisingSayak) and [@ayushthakur0](https://twitter.com/ayushthakur0).