Tensor Parallelism



Tensor parallelism is a technique for fitting a large model across multiple GPUs. For example, when multiplying the input tensor by the first weight tensor, the matrix multiplication is equivalent to splitting the weight tensor column-wise, multiplying the input by each column block separately, and then concatenating the separate outputs. Each GPU holds one column block and computes its own partial output; these outputs are then gathered from the GPUs and concatenated to get the final result, as shown below 👇

Image courtesy of Anton Lozkhov
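
The equivalence described above can be checked on a single machine. The following is a minimal, dependency-free sketch rather than how any serving backend implements it: the helper names (`matmul`, `split_columns`, `column_parallel_matmul`) are hypothetical, each "GPU" is simulated as a list entry, and the number of columns is assumed to divide evenly by the number of shards.

```python
def matmul(a, b):
    """Plain matrix product of two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def split_columns(w, num_shards):
    """Split w column-wise into num_shards contiguous blocks.

    Assumption for this sketch: the column count divides evenly.
    """
    n_cols = len(w[0])
    if n_cols % num_shards != 0:
        raise ValueError("number of columns must be divisible by num_shards")
    width = n_cols // num_shards
    return [[row[i * width:(i + 1) * width] for row in w] for i in range(num_shards)]


def column_parallel_matmul(x, w, num_shards):
    """Simulate column-wise tensor parallelism (each shard stands in for one GPU)."""
    shards = split_columns(w, num_shards)
    # Every simulated GPU multiplies the full input by its own column block.
    partials = [matmul(x, shard) for shard in shards]
    # Gather step: concatenate the partial outputs row by row.
    return [[value for part in partials for value in part[r]] for r in range(len(x))]


x = [[1, 2], [3, 4]]              # input, shape (2, 2)
w = [[1, 0, 2, 1], [0, 1, 1, 3]]  # weight, shape (2, 4)

y_reference = matmul(x, w)
y_parallel = column_parallel_matmul(x, w, num_shards=2)
print(y_parallel == y_reference)
```

In a real deployment the partial products run on separate devices and the concatenation is a collective communication step; the sketch only illustrates why the sharded result matches the unsharded one.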

Tensor parallelism only works for officially supported models; it will not work when falling back to the transformers implementation. You can get more information about unsupported models here.

You can learn more about tensor parallelism in the transformers docs.
