ViT (Vision Transformer) implementation practice

ViT employs the Transformer architecture, which can extract both global and local information from an image.
ViT consists of the following components (sketched in code after this list):

- Patch Embedding
- Transformer Encoder
- Linear layer
- Prediction head
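As a minimal sketch of how these components might compose, here is an illustrative PyTorch skeleton. The class name `MinimalViT` and all hyperparameter values are assumptions for illustration, and PyTorch's built-in `nn.TransformerEncoder` stands in for the Encoder Blocks detailed below:

```python
import torch
import torch.nn as nn

class MinimalViT(nn.Module):
    """Illustrative ViT skeleton mirroring the component list above."""
    def __init__(self, img_size=224, patch_size=16, in_chans=3,
                 embed_dim=768, depth=12, num_heads=12, num_classes=1000):
        super().__init__()
        num_patches = (img_size // patch_size) ** 2
        # Patch Embedding: a strided convolution projects each patch to a vector.
        self.patch_embed = nn.Conv2d(in_chans, embed_dim,
                                     kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, embed_dim))
        # Transformer Encoder: the built-in stands in for the Encoder Blocks
        # sketched later in this section.
        layer = nn.TransformerEncoderLayer(embed_dim, num_heads,
                                           batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.norm = nn.LayerNorm(embed_dim)
        # Linear layer / Prediction head: classify from the class token.
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, x):                                    # x: (B, C, H, W)
        x = self.patch_embed(x).flatten(2).transpose(1, 2)   # (B, N, D)
        cls = self.cls_token.expand(x.shape[0], -1, -1)
        x = torch.cat([cls, x], dim=1) + self.pos_embed      # prepend [CLS]
        x = self.encoder(x)
        return self.head(self.norm(x)[:, 0])                 # logits from [CLS]
```

Feeding a `(1, 3, 224, 224)` image through this model yields a `(1, 1000)` logit tensor.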
In the Transformer Encoder, the following components are used (see the sketch after this list):

- Encoder Block
- Layer Normalization
- Residual Connection
- Outputs
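Assuming the pre-norm arrangement common in ViT implementations (Layer Normalization before each sub-layer, with a Residual Connection around it), one possible sketch of the encoder follows. `EncoderBlock` and `TransformerEncoder` are illustrative names, and `nn.MultiheadAttention` is used here to keep the block short; a from-scratch attention version appears after the next list:

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """One Encoder Block: pre-norm attention and MLP, each inside a residual."""
    def __init__(self, dim=768, num_heads=12, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio),
            nn.GELU(),
            nn.Linear(dim * mlp_ratio, dim),
        )

    def forward(self, x):
        # Residual Connection around Multi-head Attention.
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        # Residual Connection around the MLP.
        return x + self.mlp(self.norm2(x))

class TransformerEncoder(nn.Module):
    """The encoder is a stack of Encoder Blocks plus a final LayerNorm."""
    def __init__(self, dim=768, depth=12, num_heads=12):
        super().__init__()
        self.blocks = nn.ModuleList(EncoderBlock(dim, num_heads)
                                    for _ in range(depth))
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):                 # x: (B, N, D)
        for block in self.blocks:
            x = block(x)
        return self.norm(x)               # outputs: (B, N, D)
```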
The Encoder Block consists of the following components:

- Multi-head Attention
- MLP
- Residual Connection
- Outputs

Multi-head Attention extracts both global and local information from the image via the attention map it computes. It operates on Query, Key, and Value tensors, each a linear projection of the input: attention scores are calculated between Queries and Keys, and those scores weight the Values to generate the output.
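A from-scratch sketch of that computation, assuming the usual scaled dot-product formulation (the class name and defaults here are illustrative):

```python
import math
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    """Illustrative multi-head self-attention with explicit Q, K, V."""
    def __init__(self, dim=768, num_heads=12):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.qkv = nn.Linear(dim, dim * 3)   # produce Q, K, V in one projection
        self.proj = nn.Linear(dim, dim)      # mix the heads back together

    def forward(self, x):                    # x: (B, N, D)
        B, N, D = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4) # each: (B, heads, N, head_dim)
        # Attention map: scaled dot products between Queries and Keys.
        attn = (q @ k.transpose(-2, -1)) / math.sqrt(self.head_dim)
        attn = attn.softmax(dim=-1)          # (B, heads, N, N)
        # Output: attention-weighted sum of the Values.
        out = (attn @ v).transpose(1, 2).reshape(B, N, D)
        return self.proj(out)
```

The single `qkv` projection produces all three tensors in one matrix multiply, and splitting `dim` across `num_heads` lets each head attend to a different subspace of the patch embeddings.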