2021-05-26

Łukasz Kaiser presented research he co-authored in 2020 on how to make Transformer architectures more efficient. These are some minimal notes from that talk, which was given at Pi Campus, Rome on 26th May 2021 and is available here. The main reference for the talk is the paper Reformer: The Efficient Transformer, which has a write-up on the Google AI blog.

Pi Campus Summary / Teaser

Transformer models have been used in a variety of fields and yield great results on many NLP tasks.

But between BERT, GPT-3, and many other variants, they can be inefficient and hard to apply.

Lukasz will introduce a new efficient variant of the Transformer. He’ll take us through the main methods needed for efficiency and show how they address the main problems of high memory use and poor performance on long sequences that limited the use of some Transformers before. He will finish with the new applications this opens up.

Notes

  • RNNs were used for MT, but have drawbacks
    • slow: sequential processing of the input to build up the hidden state
    • gradients propagate, but only up to a point (long-range dependencies are hard to learn)
  • Attention
    • encoder: attend to everything
    • decoder: attend to the left (the past) and to the encoder output
    • everything attends to everything: parallel processing, so it is fast

MT results on WMT-14 were good: 29.1 BLEU on EN-DE and 41.8 BLEU on EN-FR, outperforming LSTM/GRU baselines.

Transformer drawbacks:

  • everything attends to everything → quadratic complexity in the sequence length; a problem for e.g. paragraphs of text or whole books
  • Memory: with 12 GB of GPU RAM at a sequence length of 384, the maximum batch size is 0 (a square length × length attention matrix must be created)
  • 175B params (GPT-3)
  • 13B params (a smaller GPT-3 variant)
  • 1.3B params (a smaller GPT-3 variant)

More parameters seem to be better, even when the parameter count gets very large; larger models are also better at one-shot and few-shot learning.

Efficiency Challenges

  • memory
    • reversible residual layers as in RevNet [Gomez+ 17]
    • efficiently train with memory swapping to CPU and quantization
  • time
    • introduce fast attention with LSH
  • every token activates all of the model's weights
    • sparse layers that allow selective activation

Memory Efficiency

  • Example: a sequence of 1M tokens
  • Input embeddings: already a tensor that takes ~2 GB
  • Each layer's activations are another ~2 GB; caching 12× attention + 12× feed-forward sublayers for backprop gives ~50 GB (see the arithmetic sketch below)
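
A quick back-of-the-envelope check of these numbers (a minimal sketch; the 512-dimensional embedding size and float32 activations are assumptions, chosen because they reproduce the ~2 GB and ~50 GB figures in the notes):

```python
# Rough activation-memory arithmetic for a 1M-token sequence.
# Assumptions (not stated explicitly in the talk notes): d_model = 512, float32.

tokens = 1_000_000            # sequence length
d_model = 512                 # assumed embedding / hidden size
bytes_per_float = 4           # float32

embedding_gb = tokens * d_model * bytes_per_float / 1e9
print(f"input embedding tensor: ~{embedding_gb:.1f} GB")              # ~2.0 GB

sublayers = 12 + 12           # 12 attention + 12 feed-forward sublayers
activations_gb = sublayers * embedding_gb
print(f"activations cached for backprop: ~{activations_gb:.0f} GB")   # ~49 GB
```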

Solution: Reversible Networks

The Transformer already has residual connections.

RevNets make these residual blocks reversible, so activations do not need to be cached: they can be recomputed exactly from the layer outputs during the backward pass.

The Reversible Transformer works on par with the standard Transformer.
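
A minimal NumPy sketch of the reversible residual idea from RevNets. The toy F and G below are stand-ins for the attention and feed-forward sublayers; the point is only that inputs can be reconstructed from outputs, so activations need not be stored:

```python
import numpy as np

# Reversible residual block (RevNet-style). The activations are split into
# two halves (x1, x2); because the block is invertible, (x1, x2) can be
# recomputed from (y1, y2) in the backward pass instead of being cached.

def forward(x1, x2, F, G):
    y1 = x1 + F(x2)
    y2 = x2 + G(y1)
    return y1, y2

def inverse(y1, y2, F, G):
    x2 = y2 - G(y1)
    x1 = y1 - F(x2)
    return x1, x2

# Toy stand-ins for the attention and feed-forward sublayers.
rng = np.random.default_rng(0)
W_f = rng.normal(size=(16, 16))
W_g = rng.normal(size=(16, 16))
F = lambda h: np.tanh(h @ W_f)
G = lambda h: np.tanh(h @ W_g)

x1, x2 = rng.normal(size=(4, 16)), rng.normal(size=(4, 16))
y1, y2 = forward(x1, x2, F, G)
r1, r2 = inverse(y1, y2, F, G)
assert np.allclose(x1, r1) and np.allclose(x2, r2)  # inputs recovered exactly
```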

Time Complexity

Attention is quadratic in the sequence length, but the attention weights are typically sparse: each query attends strongly to only a few keys.

Leverage this to limit the number of keys that a query attends to.

Use Locality-Sensitive Hashing (LSH) to find approximate nearest neighbours without computing all pairwise similarities.

Idea (sketch)

  • draw random lines
  • consider only potential neighbours that fall on the same side of all the lines
  • in most cases, the nearest neighbours are in this partition of the space

Notes: this may fail; it is a probabilistic algorithm.

Need to redraw the lines and repeat; empirically ~8 hash rounds work well (8-hash LSH attention).

Speeds up training for sequences above ~4000 tokens
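
A minimal sketch of the "random lines" idea using random hyperplanes (the dimensions, counts, and 4-bit hash are arbitrary illustration values; the talk suggests ~8 hash rounds in practice):

```python
import numpy as np

# Locality-sensitive hashing with random hyperplanes: vectors that fall on
# the same side of every random hyperplane ("line") share a bucket, so near
# neighbours usually end up in the same bucket without computing any
# pairwise distances. In practice the hashing is redrawn and repeated.

def lsh_buckets(x, n_hyperplanes, seed=0):
    rng = np.random.default_rng(seed)
    planes = rng.normal(size=(x.shape[-1], n_hyperplanes))
    sides = (x @ planes) > 0                                     # side of each hyperplane
    return sides.astype(int) @ (1 << np.arange(n_hyperplanes))   # pack bits -> bucket id

vecs = np.random.default_rng(1).normal(size=(8, 64))
print(lsh_buckets(vecs, n_hyperplanes=4))  # one bucket id (0..15) per vector
```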

LSH Attention

LSH bucketing

Sort by LSH bucket

Chunk sorted sequence to parallelise (for GPUs)

Attend within same bucket
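
A simplified end-to-end sketch of these four steps. It uses shared query/key vectors, a single hash round, no chunk overlap, and no causal masking, so it illustrates the bucket–sort–chunk–attend flow rather than Reformer's exact algorithm:

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, n_hyperplanes, chunk = 64, 32, 4, 8
x = rng.normal(size=(seq_len, d_model))   # shared query/key vectors
v = rng.normal(size=(seq_len, d_model))   # values

# 1) LSH bucketing via random hyperplanes.
planes = rng.normal(size=(d_model, n_hyperplanes))
buckets = ((x @ planes) > 0).astype(int) @ (1 << np.arange(n_hyperplanes))

# 2) Sort positions by bucket so similar vectors become adjacent.
order = np.argsort(buckets, kind="stable")
xs, vs, bs = x[order], v[order], buckets[order]

# 3) Chunk the sorted sequence so chunks can be processed in parallel.
out_sorted = np.zeros_like(x)
for start in range(0, seq_len, chunk):
    sl = slice(start, start + chunk)
    q, k, val, b = xs[sl], xs[sl], vs[sl], bs[sl]
    scores = q @ k.T / np.sqrt(d_model)
    # 4) Attend only within the same bucket: mask out other buckets.
    scores = np.where(b[:, None] == b[None, :], scores, -1e9)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    out_sorted[sl] = weights @ val

# Undo the sort so outputs line up with the original token order.
out = np.empty_like(out_sorted)
out[order] = out_sorted
print(out.shape)  # (64, 32)
```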

Sparsity

Standard FFN Layer

Sparse version: keep only one row/column from each block of the weight matrix.

Sparsity is (often) further increased by the ReLU activation.

How to decide which columns/rows to keep: a low-rank controller (e.g. rank 32) predicts the choice.

The hard selection is trained with a Straight-Through Gumbel-Softmax (one per block); see the sketch below.
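
A simplified NumPy sketch of this block-sparse feed-forward idea. The shapes, the rank-32 controller, and the forward-only straight-through step are illustrative assumptions, not the exact published layer:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff, block, rank = 64, 256, 16, 32
n_blocks = d_ff // block

W1 = rng.normal(size=(d_model, d_ff)) / np.sqrt(d_model)   # dense FFN weights
W2 = rng.normal(size=(d_ff, d_model)) / np.sqrt(d_ff)
# Low-rank controller (d_model -> rank -> d_ff) that scores units per block.
C1 = rng.normal(size=(d_model, rank)) / np.sqrt(d_model)
C2 = rng.normal(size=(rank, d_ff)) / np.sqrt(rank)

def sparse_ffn(x, temperature=1.0):
    logits = (x @ C1 @ C2).reshape(len(x), n_blocks, block)
    # Gumbel-softmax: a soft, differentiable choice of one unit per block.
    gumbel = -np.log(-np.log(rng.uniform(size=logits.shape)))
    soft = np.exp((logits + gumbel) / temperature)
    soft /= soft.sum(axis=-1, keepdims=True)
    # Straight-through: the forward pass uses the hard one-hot choice
    # (a real implementation routes gradients through the soft probabilities).
    hard = (soft == soft.max(axis=-1, keepdims=True)).astype(float)
    mask = hard.reshape(len(x), d_ff)
    h = np.maximum(x @ W1, 0.0) * mask   # ReLU, then keep one unit per block
    return h @ W2

x = rng.normal(size=(4, d_model))
print(sparse_ffn(x).shape)  # (4, 64)
```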

Sparsifying Dense QKV Layers in Attention

Idea: use a local convolution instead of a fully dense QKV projection (see the sketch below).
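
A minimal sketch of what a "local convolution" in place of a dense QKV projection could look like: a depthwise, causal 1-D convolution over the sequence produces Q, K and V. This is one reading of the note; the published sparse QKV layer combines additional components and may differ:

```python
import numpy as np

def local_conv_qkv(x, kernels):
    # x: (seq_len, d_model); kernels: (3, kernel_size, d_model), one per Q/K/V.
    seq_len, d_model = x.shape
    k = kernels.shape[1]
    padded = np.pad(x, ((k - 1, 0), (0, 0)))  # causal (left) padding
    # Local windows of the sequence: (seq_len, kernel_size, d_model).
    windows = np.stack([padded[i:i + seq_len] for i in range(k)], axis=1)
    # Depthwise convolution: each channel only mixes a short window of itself,
    # which is far cheaper than a full d_model x d_model dense projection.
    q, key, v = (np.einsum("lkd,kd->ld", windows, w) for w in kernels)
    return q, key, v

rng = np.random.default_rng(0)
x = rng.normal(size=(32, 16))
q, key, v = local_conv_qkv(x, rng.normal(size=(3, 4, 16)))
print(q.shape, key.shape, v.shape)  # (32, 16) each
```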

Sparsified Transformer results soon to be published (as of 2021-05-26)

Future

  • Efficient transformers for all lengths
  • Decoding fast enough even on CPUs
  • Fine-tuning possible for everyone

Main References

  • Kitaev, Kaiser, Levskaya (2020). Reformer: The Efficient Transformer. ICLR 2020.
  • Gomez, Ren, Urtasun, Grosse (2017). The Reversible Residual Network: Backpropagation Without Storing Activations. NeurIPS 2017.

Audience Questions

Question: Do you see any techniques that will supersede transformers?

Response:

Question: tensor2tensor has been superseded by Trax. How does Trax take over from these frameworks?

Response: Pre-trained models are hard to release and maintain; Hugging Face readily provides pre-trained models.

Question: FNet seems promising. What do you think?

Response: It does. It remains to be seen if it will perform well across as-yet untested domains, but it very well might.