Date: 4th October 2023 Speaker: Simone Scardapane Talk recording available: https://youtu.be/IHAizmmakqU
Lecture Notes
- Architectural bias - given a budget of parameters / compute, some types of neural network tend to work better than others
- Modular sparsity: give the network a set of modules and let it choose which components to use at training time
- Types of sparsity
- input sparsity
- layer sparsity
- block sparsity (feature sparsity)
- Motivations to enforce modular sparsity:
- Computational cost can be independent of # of parameters
- Dynamic cost for inference (e.g. elastic demand)
- Encapsulation (e.g. better interpretability, model re-use)
- Very good fit for multi-modal and multi-task problems (intuition: use different parts of the network)
- fundamentally a discrete problem
- Avoid Reinforcement Learning (RL) - Simone doesn't like RL
Differentiable Sampling: The Gumbel-Softmax Trick
- The simplest way of taking a discrete decision while remaining compatible with backprop
- Models a random variable that takes one of k categories (e.g. patches, layers, blocks)
- Support: the one-hot vectors z ∈ {e_1, …, e_k}
- Parameters: class probabilities π = (π_1, …, π_k), with π_i ≥ 0 and Σ_i π_i = 1
- Distribution: p(z = e_i) = π_i (categorical distribution; a code sketch follows below)
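A minimal sketch of the trick (my own illustration in PyTorch, not code from the talk): perturb the scores with Gumbel(0, 1) noise and take a softmax with temperature τ, so the sample is approximately one-hot yet differentiable.

```python
# Minimal Gumbel-Softmax sampler (illustrative sketch, assumes PyTorch).
import torch

def sample_gumbel_softmax(logits: torch.Tensor, tau: float = 1.0) -> torch.Tensor:
    """Differentiable, approximately one-hot sample over k categories."""
    # Gumbel(0, 1) noise: g = -log(-log(u)), with u ~ Uniform(0, 1)
    u = torch.rand_like(logits)
    g = -torch.log(-torch.log(u + 1e-20) + 1e-20)
    # Softmax of the perturbed scores; tau -> 0 approaches a hard one-hot vector
    return torch.softmax((logits + g) / tau, dim=-1)

s = torch.tensor([1.3, 0.2, 0.7], requires_grad=True)
z = sample_gumbel_softmax(s, tau=0.5)  # close to one-hot, still differentiable in s
```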
Formalising the problem
- k choices
- a neural network produces scores, s = [1.3, 0.2, 0.7]
- argmax (gradient is almost always zero!) → z = [1, 0, 0]
- → z is passed to the downstream model (see the straight-through sketch after this list)
- Note: for k=2 we take binary decisions e.g. keep or discard a patch/layer/module
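One common way around the zero gradient of argmax is the straight-through Gumbel-Softmax: the forward pass uses the hard one-hot decision, the backward pass uses the soft relaxation. A hypothetical sketch (module structure and names are my own, not from the talk), using PyTorch's built-in F.gumbel_softmax:

```python
# Straight-through selection of one of k candidate modules (illustrative sketch).
import torch
import torch.nn as nn
import torch.nn.functional as F

class Selector(nn.Module):
    def __init__(self, dim: int, k: int = 3):
        super().__init__()
        self.router = nn.Linear(dim, k)  # produces the scores s
        self.branches = nn.ModuleList(nn.Linear(dim, dim) for _ in range(k))

    def forward(self, x):
        s = self.router(x)  # e.g. [1.3, 0.2, 0.7]
        # hard=True: forward pass uses the argmax (z = [1, 0, 0]),
        # backward pass uses the soft Gumbel-Softmax gradients
        z = F.gumbel_softmax(s, tau=1.0, hard=True)
        outs = torch.stack([m(x) for m in self.branches], dim=-1)  # (batch, dim, k)
        return (outs * z.unsqueeze(-2)).sum(dim=-1)  # only the chosen branch survives

x = torch.randn(4, 16)
y = Selector(16)(x)
```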
Early Exits
- how to train early-exit models
- how to decide when to early exit
- Inception (from Google) had a similar mechanism
- Confident Adaptive LM
- Allow the model to select which modules to use according to loss minimisation - Differentiable Branching In Deep Networks for Fast Inference (a rough sketch follows below)
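A rough sketch of the early-exit idea (the toy architecture, confidence threshold and loss weighting are my assumptions, not details from the talk): attach an auxiliary classifier after each block, train all exits jointly, and at inference leave at the first exit whose softmax confidence clears a threshold.

```python
# Early-exit network: joint training of all exits, confidence-based exit at inference.
import torch
import torch.nn as nn
import torch.nn.functional as F

class EarlyExitNet(nn.Module):
    def __init__(self, dim=32, n_blocks=3, n_classes=10):
        super().__init__()
        self.blocks = nn.ModuleList(nn.Linear(dim, dim) for _ in range(n_blocks))
        self.exits = nn.ModuleList(nn.Linear(dim, n_classes) for _ in range(n_blocks))

    def forward(self, x):
        # Training: return every exit's logits so the loss can be
        # loss = sum_i CE(logits_i, y), possibly weighted per exit.
        all_logits = []
        for block, head in zip(self.blocks, self.exits):
            x = torch.relu(block(x))
            all_logits.append(head(x))
        return all_logits

    @torch.no_grad()
    def predict(self, x, threshold=0.9):
        # Inference: stop at the first exit that is confident enough (batch size 1).
        for block, head in zip(self.blocks, self.exits):
            x = torch.relu(block(x))
            probs = F.softmax(head(x), dim=-1)
            if probs.max() >= threshold:
                return probs.argmax(dim=-1)
        return probs.argmax(dim=-1)  # fall back to the final exit
```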
Mixture of {experts, adapters}
- Can multiply parameters for the same computational cost
- The model only activates certain modules depending on the tokens received as input
- Can be parallelised if the experts are distributed
- May provide re-use of components and interpretability
- Good fit for multi-modal / multi-task problems
- Routing can very easily collapse (e.g. most tokens end up routed to the same few experts)
- Implementation is tricky (sparse MoE); see the sketch below
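A rough sketch of a top-1 sparse MoE layer (linear experts and a Switch-Transformer-style load-balancing penalty to discourage the routing collapse mentioned above; the exact loss form and all names are my assumptions, not code from the talk):

```python
# Top-1 sparse mixture-of-experts layer with an auxiliary load-balancing loss.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Top1MoE(nn.Module):
    def __init__(self, dim=64, n_experts=4):
        super().__init__()
        self.router = nn.Linear(dim, n_experts)
        self.experts = nn.ModuleList(nn.Linear(dim, dim) for _ in range(n_experts))

    def forward(self, tokens):  # tokens: (n_tokens, dim)
        probs = F.softmax(self.router(tokens), dim=-1)  # routing probabilities
        top_p, top_idx = probs.max(dim=-1)              # chosen expert per token
        out = torch.zeros_like(tokens)
        for i, expert in enumerate(self.experts):
            mask = top_idx == i
            if mask.any():
                # Scale by the routing probability so the router receives gradients
                out[mask] = top_p[mask].unsqueeze(-1) * expert(tokens[mask])
        # Balance term: fraction of tokens per expert times mean routing probability
        frac = F.one_hot(top_idx, len(self.experts)).float().mean(dim=0)
        aux_loss = (frac * probs.mean(dim=0)).sum() * len(self.experts)
        return out, aux_loss

tokens = torch.randn(8, 64)
out, aux = Top1MoE()(tokens)  # add aux (weighted) to the task loss during training
```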
Further Reading
- Neural Networks, Types, and Functional Programming - colah's blog
- Learning with Fenchel-Young Losses
- Categorical Reparameterization with Gumbel-Softmax
- Confident Adaptive Language Modeling
- Why should we add early exits to neural networks?
- Differentiable Branching In Deep Networks for Fast Inference
- LIMoE: Learning Multiple Modalities with One Sparse Mixture-of-Experts Model
- AdaViT: Adaptive Vision Transformers for Efficient Image Recognition
- On the Representation Collapse of Sparse Mixture of Experts
- A Review of Sparse Expert Models in Deep Learning
Talk Summary
Title: Designing efficient and modular neural networks
Abstract:
"As neural networks grow in size and complexity, it is of paramount importance to imagine novel ways to design them in order to improve efficiency, power usage, and accuracy. In particular, there is a growing interest in making neural networks more modular and dynamic. In the majority of cases, this reduces to the problem of taking discrete decisions in a differentiable way, e.g., routing tokens in a mixture-of-experts, deactivating components with conditional computation techniques, or merging and reassembling separate components from different networks. In this talk, I will provide an overview to the problem of having discrete modules inside neural networks, common solutions and algorithms (e.g., Gumbel-Softmax tricks, REINFORCE, …) and code examples to show how to implement them. We will conclude by pointing out to interesting research directions along these lines."
Bio:
"Simone Scardapane is a tenure-track assistant professor at Sapienza University of Rome. His research is focused on graph neural networks, explainability, continual learning and, more recently, modular and efficient deep networks. He has published more than 100 papers on these topics in top-tier journals and conferences. Currently, he is an associate editor for the IEEE Transactions on Neural Networks and Learning Systems (IEEE), Neural Networks (Elsevier), Industrial Artificial Intelligence (Springer), and Cognitive Computation (Springer). He is a member of multiple groups and societies, including the ELLIS society, the IEEE Task Force on Reservoir Computing, the 'Machine learning in geodesy' joint study group of the International Association of Geodesy, and the Statistical Pattern Recognition Techniques TC of the International Association for Pattern Recognition."