I discussed the paper "Swin Transformer: Hierarchical Vision Transformer using Shifted Windows" by Ze Liu and colleagues, published at ICCV '21, at the PINLab Reading Group on 3rd November 2021.

The Swin Transformer modifies the base Transformer to enforce a form of local attention, rendering the computational complexity linear in the image size, and combines this with a hierarchical architecture that introduces inductive biases similar to those of CNNs, yielding a general-purpose computer vision backbone. Within its blocks, it uses "shifted windows": window partitionings that alternate between consecutive self-attention Transformer blocks, enabling information flow across window boundaries and (partially) compensating for the lack of global self-attention. Notably, it surpasses previous state-of-the-art results on dense prediction tasks, i.e. those that operate at the pixel level, such as semantic segmentation and object detection.
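As a minimal sketch of the core idea (using NumPy on an (H, W, C) feature map, rather than the authors' actual PyTorch implementation), the regular and shifted window partitionings can be illustrated as follows; the half-window cyclic shift is the trick that lets alternating blocks attend across the boundaries of the previous block's windows:

```python
import numpy as np

def window_partition(x, window_size):
    """Split an (H, W, C) feature map into non-overlapping
    (window_size, window_size, C) windows."""
    H, W, C = x.shape
    x = x.reshape(H // window_size, window_size,
                  W // window_size, window_size, C)
    # -> (num_windows, window_size, window_size, C)
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, window_size, window_size, C)

def shifted_window_partition(x, window_size):
    """Cyclically shift the map by half a window before partitioning,
    so that the new windows straddle the old window boundaries."""
    shift = window_size // 2
    x = np.roll(x, shift=(-shift, -shift), axis=(0, 1))
    return window_partition(x, window_size)

# Toy 8x8 single-channel feature map; 4x4 windows -> 4 windows.
x = np.arange(64, dtype=np.float32).reshape(8, 8, 1)
regular = window_partition(x, 4)        # shape (4, 4, 4, 1)
shifted = shifted_window_partition(x, 4)
```

Self-attention is then computed within each window independently, so the cost grows linearly with the number of windows (and hence the image size) instead of quadratically with the number of pixels, as it would under global self-attention.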

My slides are available for download here.

Shifted Window Transformer Teaser

Resources

Swin Transformer

Datasets

Baselines

State of the Art