2021-05-26
Łukasz Kaiser presented research he co-authored in 2020 on how to make Transformer architectures more efficient. These are some minimal notes from that talk, which was given at Pi Campus, Rome on 26th May 2021 and is available here. The main reference for the talk is Reformer: The Efficient Transformer, which has a write-up on the Google AI Blog.
Pi Campus Summary / Teaser
Transformer models have been used in a variety of fields and yield great results on many NLP tasks.
But between BERT, GPT-3, and many other variants, they can be inefficient and hard to apply.
Łukasz will introduce a new, efficient variant of the Transformer. He'll take us through the main methods needed for efficiency and show how they address the main problems, high memory use and poor performance on long sequences, that limited the use of some Transformers before. He will finish with the new applications that this opens up.
Notes
- RNNs were used for MT, but have drawbacks:
  - slow: the input is processed sequentially to build the hidden state
  - gradients propagate back, but only up to a point
- Attention
  - encoder: attends to everything
  - decoder: attends to the left (the past) and to the encoder output
  - everything attends to everything, so processing is parallel and fast (minimal sketch below)
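A minimal sketch of these two attention patterns, assuming nothing beyond standard scaled dot-product attention (the code is my illustration, not from the talk):

```python
import numpy as np

def attention(q, k, v, mask=None):
    """Scaled dot-product attention; mask=True marks positions to hide."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    if mask is not None:
        scores = np.where(mask, -1e9, scores)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

n, d = 6, 16
x = np.random.randn(n, d)

enc_out = attention(x, x, x)                        # encoder: attend to everything
causal = np.triu(np.ones((n, n), dtype=bool), k=1)  # hide the future
dec_self = attention(x, x, x, mask=causal)          # decoder: attend to the left (the past)
dec_cross = attention(x, enc_out, enc_out)          # decoder: attend to the encoder output
```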
MT results (WMT-14) were good: 29.1 BLEU on EN-DE and 41.8 BLEU on EN-FR; this outperformed LSTMs/GRUs.
Transformer drawbacks:
- everything attends to everything, giving quadratic complexity in the sequence length; a problem for e.g. paragraphs of text or books (see the memory sketch below)
- memory: with 12 GB of RAM and a sequence length of 384, the maximum batch size is zero (a square attention matrix must be created)
- 175B params (GPT-3)
- 13B params (a smaller GPT-3 variant)
- 1.3B params (a smaller GPT-3 variant)
More parameters seem to be better, even when the parameter count gets very large, and larger models are also better at one-shot and few-shot learning.
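To make the quadratic attention cost concrete, here is a back-of-the-envelope sketch of the memory taken by a single attention matrix (float32, one head, batch size 1; these assumptions are mine, not from the talk):

```python
# Memory for one L x L float32 attention matrix (single head, batch size 1).
def attn_matrix_gb(seq_len, bytes_per_float=4):
    return seq_len * seq_len * bytes_per_float / 1e9

for L in (384, 4_000, 64_000):
    print(f"L = {L:>6}: {attn_matrix_gb(L):8.3f} GB")
# The matrix itself is tiny at 384 tokens (the 12 GB limit above also includes
# the rest of the model), but it grows quadratically: ~16 GB per head at 64K tokens.
```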
Efficiency Challenges
- memory
  - reversible residual layers, as in RevNet [Gomez+ 17]
  - train efficiently with memory swapping to CPU and quantization
- time
  - introduce fast attention with LSH
  - standard layers activate all weights for every token
  - introduce sparse layers that allow selective activation
Memory Efficiency
- consider a sequence of 1M tokens
- the input embeddings alone are already a tensor that takes ~2 GB
- each layer's activations are another ~2 GB; with 12x attention + 12x FFN layers that is ~50 GB
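A back-of-the-envelope check of those numbers, assuming an embedding size of 512 and float32 activations (the embedding size is my assumption; it was not captured in these notes):

```python
# Rough activation-memory arithmetic for the 1M-token example.
tokens = 1_000_000
d_model = 512                 # assumed embedding size
bytes_per_float = 4           # float32

per_layer = tokens * d_model * bytes_per_float   # activations of one layer
layers = 12 + 12                                 # 12x attention + 12x FFN
total = per_layer * layers

print(f"per layer: {per_layer / 1e9:.1f} GB")    # ~2.0 GB
print(f"total:     {total / 1e9:.1f} GB")        # ~49 GB
```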
Solution: Reversible Networks
The Transformer already has residual connections, which is what makes the reversible trick applicable.
With RevNets no caching of activations is needed: the inputs of each layer can be recomputed from its outputs during the backward pass.
The Reversible Transformer works on par with the standard Transformer.
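A minimal sketch of the reversible residual idea from RevNet, which the Reformer reuses; F and G stand in for the attention and feed-forward sub-layers (toy functions here):

```python
import numpy as np

def F(x):                      # stand-in for the attention sub-layer
    return np.tanh(x)

def G(x):                      # stand-in for the feed-forward sub-layer
    return np.maximum(x, 0.0)

def reversible_forward(x1, x2):
    y1 = x1 + F(x2)
    y2 = x2 + G(y1)
    return y1, y2

def reversible_inverse(y1, y2):
    # Recompute the inputs exactly from the outputs -- nothing needs to be cached.
    x2 = y2 - G(y1)
    x1 = y1 - F(x2)
    return x1, x2

x1, x2 = np.random.randn(4, 8), np.random.randn(4, 8)
y1, y2 = reversible_forward(x1, x2)
r1, r2 = reversible_inverse(y1, y2)
assert np.allclose(x1, r1) and np.allclose(x2, r2)
```

During the backward pass each block recomputes its inputs this way, so activation memory no longer grows with the number of layers.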
Time Complexity
Attention is quadratic, but the attention weights are also (approximately) sparse: each query mostly attends to a few keys.
Leverage this to limit the number of keys each query attends to.
Use Locality-Sensitive Hashing (LSH) to find approximate nearest neighbours without computing all pairwise similarities.
Idea (sketch)
- draw random lines (hyperplanes)
- consider as potential neighbours only the points that fall on the same side of all the lines
- in most cases, the true nearest neighbours are in this partition of the space
Note: this can fail; it is a probabilistic algorithm.
So the lines need to be redrawn several times; empirically ~8 hash rounds work well (8-hash LSH attention).
This speeds up training for sequences above ~4,000 tokens.
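A simplified random-hyperplane sketch of this idea (the Reformer itself uses a random-rotation scheme over shared query/key vectors; this only shows the geometric intuition):

```python
import numpy as np

def lsh_bucket(vectors, n_planes=8, seed=0):
    """Hash vectors into buckets using random hyperplanes (the "random lines").

    Two vectors share a bucket only if they fall on the same side of every
    hyperplane, which nearby vectors usually do.
    """
    rng = np.random.default_rng(seed)
    planes = rng.standard_normal((vectors.shape[-1], n_planes))
    signs = (vectors @ planes) > 0                          # side of each hyperplane
    return signs.astype(int) @ (1 << np.arange(n_planes))   # pack signs into a bucket id

points = np.random.randn(16, 64)
print(lsh_bucket(points))   # one integer bucket id per point
```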
LSH Attention
- bucket queries and keys with LSH
- sort the positions by LSH bucket
- chunk the sorted sequence to parallelise (for GPUs)
- attend only within the same bucket
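A toy end-to-end version of that pipeline, with many simplifications relative to the paper (one hash round, queries reused as keys, no causal masking, no attention across neighbouring chunks):

```python
import numpy as np

def toy_lsh_attention(q, v, n_planes=4, chunk_size=8, seed=0):
    """Bucket -> sort -> chunk -> attend only within the same bucket."""
    n, d = q.shape
    planes = np.random.default_rng(seed).standard_normal((d, n_planes))
    buckets = ((q @ planes) > 0).astype(int) @ (1 << np.arange(n_planes))

    order = np.argsort(buckets, kind="stable")   # sort positions by bucket
    q_s, v_s, b_s = q[order], v[order], buckets[order]

    out_sorted = np.zeros_like(v_s)
    for start in range(0, n, chunk_size):        # chunks run in parallel on real hardware
        sl = slice(start, start + chunk_size)
        qc, vc, bc = q_s[sl], v_s[sl], b_s[sl]
        scores = qc @ qc.T / np.sqrt(d)
        # hide pairs that share a chunk but not a bucket
        scores = np.where(bc[:, None] == bc[None, :], scores, -1e9)
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)
        out_sorted[sl] = weights @ vc

    out = np.empty_like(out_sorted)
    out[order] = out_sorted                      # undo the sort
    return out

x = np.random.randn(32, 16)
print(toy_lsh_attention(x, x).shape)   # (32, 16)
```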
Sparsity
Standard FFN layer: dense, so every weight is used for every token.
Sparse version: keep only one row/column from each block of the weight matrix.
The ReLU (often) increases sparsity further.
How to decide which columns/rows to keep: use a low-rank controller matrix (rank e.g. 32).
The selection is trained with a Straight-Through Gumbel-Softmax (one choice per block).
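A rough sketch of how such a per-block selection could look. The actual implementation had not been published at the time of the talk, so the shapes, names, and rank-32 controller below are illustrative assumptions, not the real code:

```python
import numpy as np

def sparse_ffn_select(x, C1, C2, block_size, rng):
    """Pick one active FFN column per block of `block_size` columns.

    C1 (d_model x r) and C2 (r x d_ff) form a low-rank controller that scores
    every column; a Gumbel-softmax per block turns the scores into a hard choice
    (straight-through would pass gradients through the soft version in training).
    """
    logits = (x @ C1) @ C2                              # (d_ff,) per-column scores
    logits = logits.reshape(-1, block_size)             # (n_blocks, block_size)
    gumbel = -np.log(-np.log(rng.uniform(1e-9, 1.0, size=logits.shape)))
    z = logits + gumbel
    soft = np.exp(z - z.max(axis=-1, keepdims=True))
    soft /= soft.sum(axis=-1, keepdims=True)            # Gumbel-softmax per block
    hard = np.eye(block_size)[soft.argmax(-1)]          # hard one-hot choice per block
    return hard.reshape(-1)                             # 0/1 mask over all d_ff columns

rng = np.random.default_rng(0)
d_model, d_ff, rank, block = 64, 256, 32, 8
x = rng.standard_normal(d_model)                        # one token's activations
C1 = rng.standard_normal((d_model, rank)) * 0.1         # low-rank controller, rank 32
C2 = rng.standard_normal((rank, d_ff)) * 0.1
mask = sparse_ffn_select(x, C1, C2, block, rng)
print(int(mask.sum()), "of", d_ff, "FFN columns active")  # 32 of 256
```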
Sparsifying Dense QKV Layers in Attention
Perform a local convolution instead of a fully dense Q/K/V projection.
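The talk did not go into detail here; purely to illustrate what a local convolution over the sequence looks like (each position only mixes with a few neighbours, in contrast to a dense projection), here is a depthwise 1D convolution with an arbitrary kernel size of 3:

```python
import numpy as np

def local_conv1d(x, kernel):
    """Depthwise 1D convolution over the sequence dimension.

    x: (seq_len, d_model); kernel: (k, d_model). Each feature channel only
    sees its own channel at a few neighbouring positions.
    """
    k = kernel.shape[0]
    pad = k // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))
    return np.stack([(xp[i:i + k] * kernel).sum(axis=0) for i in range(x.shape[0])])

x = np.random.randn(10, 64)
print(local_conv1d(x, np.random.randn(3, 64) * 0.1).shape)   # (10, 64)
```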
Sparsified Transformer results soon to be published (as of 2021-05-26)
Future
- Efficient transformers for all lengths
- Decoding fast enough even on CPUs
- Fine-tuning possible for everyone
Main References
- Nikita Kitaev, Łukasz Kaiser, Anselm Levskaya (2020). Reformer: The Efficient Transformer. https://arxiv.org/abs/2001.04451
- Nikita Kitaev and Łukasz Kaiser (2020). Reformer: The Efficient Transformer. Google AI Blog, 16 January 2020. https://ai.googleblog.com/2020/01/reformer-efficient-transformer.html
Audience Questions
Question: Do you see any techniques that will supersede transformers?
Response:
- Fourier transforms
- GLOM, an evolution of Capsule networks
Question: tensor2tensor has been superseded by Trax. How does Trax take over from such frameworks?
Response: Pre-trained models are hard to release and maintain; Hugging Face readily provides pre-trained models.
Question: FNet seems promising. What do you think?
Response: It does. It remains to be seen if it will perform well across as-yet untested domains, but it very well might.