Date: 4th October 2023 Speaker: Simone Scardapane Talk recording available: https://youtu.be/IHAizmmakqU
Lecture Notes
- Architectural bias - given a budget of parameters / compute, some types of neural network tend to work better than others
- Modular sparsity: give the network a set of modules and let it choose which components to use at training time
- Types of sparsity
- input sparsity
- layer sparsity
- block sparsity (feature sparsity)
- Motivations to enforce modular sparsity:
- Computational cost can be independent of # of parameters
- Dynamic cost for inference (e.g. elastic demand)
- Encapsulation (e.g. better interpretability, model re-use)
- Very good fit for multi-modal and multi-task problems (intuition: use different parts of the network)
- fundamentally a discrete problem
- Avoid Reinforcement Learning (RL) - Simone doesn't like RL
Differentiable Sampling: The Gumbel-Softmax Trick
- The simplest way of taking a discrete decision while remaining compatible with backprop
- Models a random variable that takes one of k categories (e.g. patches, layers, blocks)
- Support: the one-hot vectors z ∈ {e_1, …, e_k}
- Parameters: class probabilities π = (π_1, …, π_k), with π_i ≥ 0 and Σ_i π_i = 1
- Distribution: p(z = e_i) = π_i (categorical distribution; a code sketch follows below)
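A minimal sketch of the trick (my own illustration in PyTorch, not code from the talk): perturb the scores with Gumbel(0, 1) noise and take a softmax with temperature τ, so the sample is approximately one-hot yet differentiable.

```python
# Minimal Gumbel-Softmax sampler (illustrative sketch, assumes PyTorch).
import torch

def sample_gumbel_softmax(logits: torch.Tensor, tau: float = 1.0) -> torch.Tensor:
    """Differentiable, approximately one-hot sample over k categories."""
    # Gumbel(0, 1) noise: g = -log(-log(u)), with u ~ Uniform(0, 1)
    u = torch.rand_like(logits)
    g = -torch.log(-torch.log(u + 1e-20) + 1e-20)
    # Softmax of the perturbed scores; tau -> 0 approaches a hard one-hot vector
    return torch.softmax((logits + g) / tau, dim=-1)

s = torch.tensor([1.3, 0.2, 0.7], requires_grad=True)
z = sample_gumbel_softmax(s, tau=0.5)  # close to one-hot, still differentiable in s
```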
Formalising the problem
- k choices
- a neural network produces scores, s = [1.3, 0.2, 0.7]
- argmax (gradient is almost always zero!) → z = [1, 0, 0]
- → z is passed to the downstream model (see the straight-through sketch after this list)
- Note: for k=2 we take binary decisions e.g. keep or discard a patch/layer/module
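One common way around the zero gradient of argmax is the straight-through Gumbel-Softmax: the forward pass uses the hard one-hot decision, the backward pass uses the soft relaxation. A hypothetical sketch (module structure and names are my own, not from the talk), using PyTorch's built-in F.gumbel_softmax:

```python
# Straight-through selection of one of k candidate modules (illustrative sketch).
import torch
import torch.nn as nn
import torch.nn.functional as F

class Selector(nn.Module):
    def __init__(self, dim: int, k: int = 3):
        super().__init__()
        self.router = nn.Linear(dim, k)  # produces the scores s
        self.branches = nn.ModuleList(nn.Linear(dim, dim) for _ in range(k))

    def forward(self, x):
        s = self.router(x)  # e.g. [1.3, 0.2, 0.7]
        # hard=True: forward pass uses the argmax (z = [1, 0, 0]),
        # backward pass uses the soft Gumbel-Softmax gradients
        z = F.gumbel_softmax(s, tau=1.0, hard=True)
        outs = torch.stack([m(x) for m in self.branches], dim=-1)  # (batch, dim, k)
        return (outs * z.unsqueeze(-2)).sum(dim=-1)  # only the chosen branch survives

x = torch.randn(4, 16)
y = Selector(16)(x)
```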
Early Exits
- how to train early-exit models
- how to decide when to early exit
- Inception (from Google) had a similar mechanism
- Confident Adaptive LM
- Allow the model to select which modules to use according to loss minimisation - Differentiable Branching In Deep Networks for Fast Inference (a rough sketch follows below)
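A rough sketch of the early-exit idea (the toy architecture, confidence threshold and loss weighting are my assumptions, not details from the talk): attach an auxiliary classifier after each block, train all exits jointly, and at inference leave at the first exit whose softmax confidence clears a threshold.

```python
# Early-exit network: joint training of all exits, confidence-based exit at inference.
import torch
import torch.nn as nn
import torch.nn.functional as F

class EarlyExitNet(nn.Module):
    def __init__(self, dim=32, n_blocks=3, n_classes=10):
        super().__init__()
        self.blocks = nn.ModuleList(nn.Linear(dim, dim) for _ in range(n_blocks))
        self.exits = nn.ModuleList(nn.Linear(dim, n_classes) for _ in range(n_blocks))

    def forward(self, x):
        # Training: return every exit's logits so the loss can be
        # loss = sum_i CE(logits_i, y), possibly weighted per exit.
        all_logits = []
        for block, head in zip(self.blocks, self.exits):
            x = torch.relu(block(x))
            all_logits.append(head(x))
        return all_logits

    @torch.no_grad()
    def predict(self, x, threshold=0.9):
        # Inference: stop at the first exit that is confident enough (batch size 1).
        for block, head in zip(self.blocks, self.exits):
            x = torch.relu(block(x))
            probs = F.softmax(head(x), dim=-1)
            if probs.max() >= threshold:
                return probs.argmax(dim=-1)
        return probs.argmax(dim=-1)  # fall back to the final exit
```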
Mixture of {experts, adapters}
- Can multiply parameters for the same computational cost
- The model only activates certain modules depending on the tokens received as input
- Can be parallelised if the experts are distributed
- May provide re-use of components and interpretability
- Good fit for multi-modal / multi-task problems
- Routing can very easily collapse (e.g. most tokens end up routed to the same few experts)
- Implementation is tricky (sparse MoE); see the sketch below
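A rough sketch of a top-1 sparse MoE layer (linear experts and a Switch-Transformer-style load-balancing penalty to discourage the routing collapse mentioned above; the exact loss form and all names are my assumptions, not code from the talk):

```python
# Top-1 sparse mixture-of-experts layer with an auxiliary load-balancing loss.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Top1MoE(nn.Module):
    def __init__(self, dim=64, n_experts=4):
        super().__init__()
        self.router = nn.Linear(dim, n_experts)
        self.experts = nn.ModuleList(nn.Linear(dim, dim) for _ in range(n_experts))

    def forward(self, tokens):  # tokens: (n_tokens, dim)
        probs = F.softmax(self.router(tokens), dim=-1)  # routing probabilities
        top_p, top_idx = probs.max(dim=-1)              # chosen expert per token
        out = torch.zeros_like(tokens)
        for i, expert in enumerate(self.experts):
            mask = top_idx == i
            if mask.any():
                # Scale by the routing probability so the router receives gradients
                out[mask] = top_p[mask].unsqueeze(-1) * expert(tokens[mask])
        # Balance term: fraction of tokens per expert times mean routing probability
        frac = F.one_hot(top_idx, len(self.experts)).float().mean(dim=0)
        aux_loss = (frac * probs.mean(dim=0)).sum() * len(self.experts)
        return out, aux_loss

tokens = torch.randn(8, 64)
out, aux = Top1MoE()(tokens)  # add aux (weighted) to the task loss during training
```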
Further Reading
- Neural Networks, Types, and Functional Programming - colah's blog
- Learning with Fenchel-Young Losses
- Categorical Reparameterization with Gumbel-Softmax
- Confident Adaptive Language Modeling
- Why should we add early exits to neural networks?
- Differentiable Branching In Deep Networks for Fast Inference
- LIMoE: Learning Multiple Modalities with One Sparse Mixture-of-Experts Model
- AdaViT: Adaptive Vision Transformers for Efficient Image Recognition
- On the Representation Collapse of Sparse Mixture of Experts
- A Review of Sparse Expert Models in Deep Learning
Talk Summary
Title: Designing efficient and modular neural networks
Abstract:
"As neural networks grow in size and complexity, it is of paramount importance to imagine novel ways to design them in order to improve efficiency, power usage, and accuracy. In particular, there is a growing interest in making neural networks more modular and dynamic. In the majority of cases, this reduces to the problem of taking discrete decisions in a differentiable way, e.g., routing tokens in a mixture-of-experts, deactivating components with conditional computation techniques, or merging and reassembling separate components from different networks. In this talk, I will provide an overview to the problem of having discrete modules inside neural networks, common solutions and algorithms (e.g., Gumbel-Softmax tricks, REINFORCE, …) and code examples to show how to implement them. We will conclude by pointing out to interesting research directions along these lines."
Bio:
"Simone Scardapane is a tenure-track assistant professor at Sapienza University of Rome. His research is focused on graph neural networks, explainability, continual learning and, more recently, modular and efficient deep networks. He has published more than 100 papers on these topics in top-tier journals and conferences. Currently, he is an associate editor for the IEEE Transactions on Neural Networks and Learning Systems (IEEE), Neural Networks (Elsevier), Industrial Artificial Intelligence (Springer), and Cognitive Computation (Springer). He is a member of multiple groups and societies, including the ELLIS society, the IEEE Task Force on Reservoir Computing, the 'Machine learning in geodesy' joint study group of the International Association of Geodesy, and the Statistical Pattern Recognition Techniques TC of the International Association for Pattern Recognition."