2021-05-26

Łukasz Kaiser presented research he co-authored in 2020 on how to make Transformer architectures more efficient. These are some minimal notes from that talk, which was given at Pi Campus, Rome on 26th May 2021 and is available here. The main reference for the talk is the paper Reformer: The Efficient Transformer, which has a write-up on the Google AI blog.

Pi Campus Summary / Teaser

Transformer models have been used in a variety of fields and yield great results on many NLP tasks.

But between BERT, GPT-3, and many other variants, they can be inefficient and hard to apply.

Lukasz will introduce a new efficient variant of the Transformer. He’ll take us through the main methods needed for efficiency and show how they address the main problems of high memory use and poor performance on long sequences that limited the use of some Transformers before. He will finish with the new applications this opens up.

Notes

  • RNNs were used for MT, but have drawbacks
    • slow: sequential processing of the input to build up the hidden state
    • gradients propagate, but only up to a point (long-range dependencies are hard to learn)
  • Attention
    • encoder: attend to everything
    • decoder: attend to the left (the past) and to the encoder output
    • everything attends to everything: parallel processing, so it is fast

MT results on WMT-14 were good: 29.1 BLEU on EN-DE and 41.8 BLEU on EN-FR, outperforming LSTM/GRU baselines.

Transformer drawbacks:

  • everything attends to everything → quadratic complexity in the sequence length; a problem for e.g. paragraphs of text or whole books
  • Memory: with 12 GB of GPU RAM at a sequence length of 384, the maximum batch size is 0 (a square length × length attention matrix must be created)
  • 175B params (GPT-3)
  • 13B params (a smaller GPT-3 variant)
  • 1.3B params (a smaller GPT-3 variant)

More parameters seem to be better, even when the parameter count gets very large; larger models are also better at one-shot and few-shot learning.

Efficiency Challenges

  • memory
    • reversible residual layers as in RevNet [Gomez+ 17]
    • efficiently train with memory swapping to CPU and quantization
  • time
    • introduce fast attention with LSH
  • every token activates all of the model's weights
    • sparse layers that allow selective activation

Memory Efficiency

  • Example: a sequence of 1M tokens
  • Input embeddings: already a tensor that takes ~2 GB
  • Each layer's activations are another ~2 GB; caching 12× attention + 12× feed-forward sublayers for backprop gives ~50 GB (see the arithmetic sketch below)
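
A quick back-of-the-envelope check of these numbers (a minimal sketch; the 512-dimensional embedding size and float32 activations are assumptions, chosen because they reproduce the ~2 GB and ~50 GB figures in the notes):

```python
# Rough activation-memory arithmetic for a 1M-token sequence.
# Assumptions (not stated explicitly in the talk notes): d_model = 512, float32.

tokens = 1_000_000            # sequence length
d_model = 512                 # assumed embedding / hidden size
bytes_per_float = 4           # float32

embedding_gb = tokens * d_model * bytes_per_float / 1e9
print(f"input embedding tensor: ~{embedding_gb:.1f} GB")              # ~2.0 GB

sublayers = 12 + 12           # 12 attention + 12 feed-forward sublayers
activations_gb = sublayers * embedding_gb
print(f"activations cached for backprop: ~{activations_gb:.0f} GB")   # ~49 GB
```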

Solution: Reversible Networks

The Transformer already has residual connections.

RevNets make these residual blocks reversible, so activations do not need to be cached: they can be recomputed exactly from the layer outputs during the backward pass.

The Reversible Transformer works on par with the standard Transformer.
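
A minimal NumPy sketch of the reversible residual idea from RevNets. The toy F and G below are stand-ins for the attention and feed-forward sublayers; the point is only that inputs can be reconstructed from outputs, so activations need not be stored:

```python
import numpy as np

# Reversible residual block (RevNet-style). The activations are split into
# two halves (x1, x2); because the block is invertible, (x1, x2) can be
# recomputed from (y1, y2) in the backward pass instead of being cached.

def forward(x1, x2, F, G):
    y1 = x1 + F(x2)
    y2 = x2 + G(y1)
    return y1, y2

def inverse(y1, y2, F, G):
    x2 = y2 - G(y1)
    x1 = y1 - F(x2)
    return x1, x2

# Toy stand-ins for the attention and feed-forward sublayers.
rng = np.random.default_rng(0)
W_f = rng.normal(size=(16, 16))
W_g = rng.normal(size=(16, 16))
F = lambda h: np.tanh(h @ W_f)
G = lambda h: np.tanh(h @ W_g)

x1, x2 = rng.normal(size=(4, 16)), rng.normal(size=(4, 16))
y1, y2 = forward(x1, x2, F, G)
r1, r2 = inverse(y1, y2, F, G)
assert np.allclose(x1, r1) and np.allclose(x2, r2)  # inputs recovered exactly
```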

Time Complexity

Attention is quadratic in the sequence length, but the attention weights are typically sparse: each query attends strongly to only a few keys.

Leverage this to limit the number of keys that a query attends to.

Use Locality-Sensitive Hashing (LSH) to find approximate nearest neighbours without computing all pairwise similarities.

Idea (sketch)

  • draw random lines
  • consider only potential neighbours that fall on the same side of all the lines
  • in most cases, the nearest neighbours are in this partition of the space

Notes: this may fail; it is a probabilistic algorithm.

Need to redraw the lines and repeat; empirically ~8 hash rounds work well (8-hash LSH attention).

Speeds up training for sequences above ~4000 tokens
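
A minimal sketch of the "random lines" idea using random hyperplanes (the dimensions, counts, and 4-bit hash are arbitrary illustration values; the talk suggests ~8 hash rounds in practice):

```python
import numpy as np

# Locality-sensitive hashing with random hyperplanes: vectors that fall on
# the same side of every random hyperplane ("line") share a bucket, so near
# neighbours usually end up in the same bucket without computing any
# pairwise distances. In practice the hashing is redrawn and repeated.

def lsh_buckets(x, n_hyperplanes, seed=0):
    rng = np.random.default_rng(seed)
    planes = rng.normal(size=(x.shape[-1], n_hyperplanes))
    sides = (x @ planes) > 0                                     # side of each hyperplane
    return sides.astype(int) @ (1 << np.arange(n_hyperplanes))   # pack bits -> bucket id

vecs = np.random.default_rng(1).normal(size=(8, 64))
print(lsh_buckets(vecs, n_hyperplanes=4))  # one bucket id (0..15) per vector
```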

LSH Attention

LSH bucketing

Sort by LSH bucket

Chunk sorted sequence to parallelise (for GPUs)

Attend within same bucket
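
A simplified end-to-end sketch of these four steps. It uses shared query/key vectors, a single hash round, no chunk overlap, and no causal masking, so it illustrates the bucket–sort–chunk–attend flow rather than Reformer's exact algorithm:

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, n_hyperplanes, chunk = 64, 32, 4, 8
x = rng.normal(size=(seq_len, d_model))   # shared query/key vectors
v = rng.normal(size=(seq_len, d_model))   # values

# 1) LSH bucketing via random hyperplanes.
planes = rng.normal(size=(d_model, n_hyperplanes))
buckets = ((x @ planes) > 0).astype(int) @ (1 << np.arange(n_hyperplanes))

# 2) Sort positions by bucket so similar vectors become adjacent.
order = np.argsort(buckets, kind="stable")
xs, vs, bs = x[order], v[order], buckets[order]

# 3) Chunk the sorted sequence so chunks can be processed in parallel.
out_sorted = np.zeros_like(x)
for start in range(0, seq_len, chunk):
    sl = slice(start, start + chunk)
    q, k, val, b = xs[sl], xs[sl], vs[sl], bs[sl]
    scores = q @ k.T / np.sqrt(d_model)
    # 4) Attend only within the same bucket: mask out other buckets.
    scores = np.where(b[:, None] == b[None, :], scores, -1e9)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    out_sorted[sl] = weights @ val

# Undo the sort so outputs line up with the original token order.
out = np.empty_like(out_sorted)
out[order] = out_sorted
print(out.shape)  # (64, 32)
```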

Sparsity

Standard FFN Layer

Sparse version: keep only one row/column from each block of the weight matrix.

Sparsity is (often) further increased by the ReLU activation.

How to decide which columns/rows to keep: a low-rank controller (e.g. rank 32) predicts the choice.

The hard selection is trained with a Straight-Through Gumbel-Softmax (one per block); see the sketch below.
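
A simplified NumPy sketch of this block-sparse feed-forward idea. The shapes, the rank-32 controller, and the forward-only straight-through step are illustrative assumptions, not the exact published layer:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff, block, rank = 64, 256, 16, 32
n_blocks = d_ff // block

W1 = rng.normal(size=(d_model, d_ff)) / np.sqrt(d_model)   # dense FFN weights
W2 = rng.normal(size=(d_ff, d_model)) / np.sqrt(d_ff)
# Low-rank controller (d_model -> rank -> d_ff) that scores units per block.
C1 = rng.normal(size=(d_model, rank)) / np.sqrt(d_model)
C2 = rng.normal(size=(rank, d_ff)) / np.sqrt(rank)

def sparse_ffn(x, temperature=1.0):
    logits = (x @ C1 @ C2).reshape(len(x), n_blocks, block)
    # Gumbel-softmax: a soft, differentiable choice of one unit per block.
    gumbel = -np.log(-np.log(rng.uniform(size=logits.shape)))
    soft = np.exp((logits + gumbel) / temperature)
    soft /= soft.sum(axis=-1, keepdims=True)
    # Straight-through: the forward pass uses the hard one-hot choice
    # (a real implementation routes gradients through the soft probabilities).
    hard = (soft == soft.max(axis=-1, keepdims=True)).astype(float)
    mask = hard.reshape(len(x), d_ff)
    h = np.maximum(x @ W1, 0.0) * mask   # ReLU, then keep one unit per block
    return h @ W2

x = rng.normal(size=(4, d_model))
print(sparse_ffn(x).shape)  # (4, 64)
```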

Sparsifying Dense QKV Layers in Attention

Idea: use a local convolution instead of a fully dense QKV projection (see the sketch below).
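
A minimal sketch of what a "local convolution" in place of a dense QKV projection could look like: a depthwise, causal 1-D convolution over the sequence produces Q, K and V. This is one reading of the note; the published sparse QKV layer combines additional components and may differ:

```python
import numpy as np

def local_conv_qkv(x, kernels):
    # x: (seq_len, d_model); kernels: (3, kernel_size, d_model), one per Q/K/V.
    seq_len, d_model = x.shape
    k = kernels.shape[1]
    padded = np.pad(x, ((k - 1, 0), (0, 0)))  # causal (left) padding
    # Local windows of the sequence: (seq_len, kernel_size, d_model).
    windows = np.stack([padded[i:i + seq_len] for i in range(k)], axis=1)
    # Depthwise convolution: each channel only mixes a short window of itself,
    # which is far cheaper than a full d_model x d_model dense projection.
    q, key, v = (np.einsum("lkd,kd->ld", windows, w) for w in kernels)
    return q, key, v

rng = np.random.default_rng(0)
x = rng.normal(size=(32, 16))
q, key, v = local_conv_qkv(x, rng.normal(size=(3, 4, 16)))
print(q.shape, key.shape, v.shape)  # (32, 16) each
```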

Sparsified Transformer results soon to be published (as of 2021-05-26)

Future

  • Efficient transformers for all lengths
  • Decoding fast enough even on CPUs
  • Fine-tuning possible for everyone

Main References

  • Kitaev, Kaiser, Levskaya (2020). Reformer: The Efficient Transformer. ICLR 2020.
  • Gomez, Ren, Urtasun, Grosse (2017). The Reversible Residual Network: Backpropagation Without Storing Activations. NeurIPS 2017.

Audience Questions

Question: Do you see any techniques that will supersede transformers?

Response:

Question: tensor2tensor has been superseded by Trax. How does Trax take over from these frameworks?

Response: Pre-trained models are hard to release and maintain; Hugging Face readily provides pre-trained models.

Question: FNet seems promising. What do you think?

Response: It does. It remains to be seen if it will perform well across as-yet untested domains, but it very well might.