ViT_practice

ViT (Vision Transformer) implementation practice. ViT employs the Transformer architecture, which can capture both the global and the local information of an image.

ViT consists of the following components (a minimal sketch of how they fit together follows the list):

    1. Patch Embedding
    2. Transformer Encoder
    3. Linear layer
    4. Prediction head
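
To make the pipeline concrete, here is a minimal PyTorch sketch of how these four components could be wired together. It is not the repository's actual code: class and parameter names such as `PatchEmbedding`, `img_size`, and `emb_dim` are illustrative assumptions, and PyTorch's built-in `nn.TransformerEncoder` stands in for the encoder detailed below.

```python
import torch
import torch.nn as nn


class PatchEmbedding(nn.Module):
    """Split the image into patches and linearly project each patch."""

    def __init__(self, img_size=224, patch_size=16, in_channels=3, emb_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # A strided convolution performs the patch split and projection in one step.
        self.proj = nn.Conv2d(in_channels, emb_dim,
                              kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, emb_dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches + 1, emb_dim))

    def forward(self, x):                        # x: (B, C, H, W)
        x = self.proj(x)                         # (B, emb_dim, H/ps, W/ps)
        x = x.flatten(2).transpose(1, 2)         # (B, num_patches, emb_dim)
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1)           # prepend a [CLS] token
        return x + self.pos_embed                # add positional embeddings


class ViT(nn.Module):
    def __init__(self, num_classes=10, emb_dim=768, depth=12, num_heads=12):
        super().__init__()
        self.patch_embed = PatchEmbedding(emb_dim=emb_dim)     # 1. Patch Embedding
        layer = nn.TransformerEncoderLayer(d_model=emb_dim, nhead=num_heads,
                                           batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)  # 2. Transformer Encoder
        self.norm = nn.LayerNorm(emb_dim)
        self.head = nn.Linear(emb_dim, num_classes)            # 3./4. Linear layer + prediction head

    def forward(self, x):
        tokens = self.patch_embed(x)
        tokens = self.encoder(tokens)
        cls = self.norm(tokens[:, 0])            # classify from the [CLS] token
        return self.head(cls)
```

With these illustrative defaults, `ViT()(torch.randn(2, 3, 224, 224))` returns logits of shape `(2, 10)`; the linear layer and prediction head operate on the [CLS] token's final embedding, as in the original ViT classification setup.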

The Transformer Encoder uses the following components (see the sketch after this list):

    1. Encoder Block
    2. Layer Normalization
    3. Residual Connection
    4. Outputs
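
The sketch below shows one way these pieces could be stacked. It is a hedged illustration rather than the repository's code: a pre-norm arrangement is assumed, and `nn.MultiheadAttention` stands in for the hand-written attention described in the next section.

```python
import torch.nn as nn


class EncoderBlock(nn.Module):
    """One encoder block: attention and MLP, each behind a layer norm
    and wrapped in a residual connection (pre-norm arrangement assumed)."""

    def __init__(self, emb_dim=768, num_heads=12, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(emb_dim)
        self.attn = nn.MultiheadAttention(emb_dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(emb_dim)
        self.mlp = nn.Sequential(
            nn.Linear(emb_dim, emb_dim * mlp_ratio),
            nn.GELU(),
            nn.Linear(emb_dim * mlp_ratio, emb_dim),
        )

    def forward(self, x):
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]  # residual connection 1
        x = x + self.mlp(self.norm2(x))                     # residual connection 2
        return x


class TransformerEncoder(nn.Module):
    """A stack of encoder blocks followed by a final layer normalization."""

    def __init__(self, emb_dim=768, depth=12, num_heads=12):
        super().__init__()
        self.blocks = nn.ModuleList(
            [EncoderBlock(emb_dim, num_heads) for _ in range(depth)])
        self.norm = nn.LayerNorm(emb_dim)

    def forward(self, x):                 # x: (B, num_tokens, emb_dim)
        for block in self.blocks:
            x = block(x)
        return self.norm(x)               # outputs: normalized token embeddings
```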

Encoder Block consists of the following components:

    1. Multi-head Attention
    2. MLP
    3. Residual Connection
    4. Outputs

Here, Multi-head Attention extracts both the global and the local information of the image through the attention map it produces. It works with Query, Key, and Value: the Query comes from the input tokens, the Key is compared against the Query to calculate the attention scores, and the Value is weighted by those scores to generate the output, as in the sketch below.
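
A from-scratch sketch of Multi-head Attention, to illustrate the Query/Key/Value mechanics described above. The implementation details (a fused `qkv` projection, the `1/sqrt(head_dim)` scaling) are common ViT conventions assumed here, not necessarily what this repository does.

```python
import torch.nn as nn
import torch.nn.functional as F


class MultiHeadAttention(nn.Module):
    """Scaled dot-product attention over several heads.
    Query, Key, and Value are linear projections of the input tokens."""

    def __init__(self, emb_dim=768, num_heads=12):
        super().__init__()
        assert emb_dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = emb_dim // num_heads
        self.qkv = nn.Linear(emb_dim, emb_dim * 3)   # one fused projection for Q, K, V
        self.out = nn.Linear(emb_dim, emb_dim)

    def forward(self, x):                            # x: (B, N, emb_dim)
        B, N, _ = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)         # each: (B, heads, N, head_dim)
        # Attention scores: similarity between every Query and every Key.
        scores = (q @ k.transpose(-2, -1)) / self.head_dim ** 0.5
        attn = F.softmax(scores, dim=-1)             # attention map, each row sums to 1
        # Each output token is an attention-weighted sum of the Values.
        out = (attn @ v).transpose(1, 2).reshape(B, N, -1)
        return self.out(out)
```

Row i of the attention map says how strongly token i attends to every other token, which is how both local and global context is mixed into each output token.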
