ViT (Vision Transformer) implementation practice

ViT employs the Transformer architecture, which can extract both global and local information from an image.
ViT consists of the following components (sketched in code after this list):

- Patch Embedding
- Transformer Encoder
- Linear layer
- Prediction head
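As a minimal sketch of how these components might compose, here is an illustrative PyTorch skeleton. The class name `MinimalViT` and all hyperparameter values are assumptions for illustration, and PyTorch's built-in `nn.TransformerEncoder` stands in for the Encoder Blocks detailed below:

```python
import torch
import torch.nn as nn

class MinimalViT(nn.Module):
    """Illustrative ViT skeleton mirroring the component list above."""
    def __init__(self, img_size=224, patch_size=16, in_chans=3,
                 embed_dim=768, depth=12, num_heads=12, num_classes=1000):
        super().__init__()
        num_patches = (img_size // patch_size) ** 2
        # Patch Embedding: a strided convolution projects each patch to a vector.
        self.patch_embed = nn.Conv2d(in_chans, embed_dim,
                                     kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, embed_dim))
        # Transformer Encoder: the built-in stands in for the Encoder Blocks
        # sketched later in this section.
        layer = nn.TransformerEncoderLayer(embed_dim, num_heads,
                                           batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.norm = nn.LayerNorm(embed_dim)
        # Linear layer / Prediction head: classify from the class token.
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, x):                                    # x: (B, C, H, W)
        x = self.patch_embed(x).flatten(2).transpose(1, 2)   # (B, N, D)
        cls = self.cls_token.expand(x.shape[0], -1, -1)
        x = torch.cat([cls, x], dim=1) + self.pos_embed      # prepend [CLS]
        x = self.encoder(x)
        return self.head(self.norm(x)[:, 0])                 # logits from [CLS]
```

Feeding a `(1, 3, 224, 224)` image through this model yields a `(1, 1000)` logit tensor.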
In the Transformer Encoder, the following components are used (see the sketch after this list):

- Encoder Block
- Layer Normalization
- Residual Connection
- Outputs
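Assuming the pre-norm arrangement common in ViT implementations (Layer Normalization before each sub-layer, with a Residual Connection around it), one possible sketch of the encoder follows. `EncoderBlock` and `TransformerEncoder` are illustrative names, and `nn.MultiheadAttention` is used here to keep the block short; a from-scratch attention version appears after the next list:

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """One Encoder Block: pre-norm attention and MLP, each inside a residual."""
    def __init__(self, dim=768, num_heads=12, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio),
            nn.GELU(),
            nn.Linear(dim * mlp_ratio, dim),
        )

    def forward(self, x):
        # Residual Connection around Multi-head Attention.
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        # Residual Connection around the MLP.
        return x + self.mlp(self.norm2(x))

class TransformerEncoder(nn.Module):
    """The encoder is a stack of Encoder Blocks plus a final LayerNorm."""
    def __init__(self, dim=768, depth=12, num_heads=12):
        super().__init__()
        self.blocks = nn.ModuleList(EncoderBlock(dim, num_heads)
                                    for _ in range(depth))
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):                 # x: (B, N, D)
        for block in self.blocks:
            x = block(x)
        return self.norm(x)               # outputs: (B, N, D)
```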
The Encoder Block consists of the following components:

- Multi-head Attention
- MLP
- Residual Connection
- Outputs

Multi-head Attention extracts both global and local information from the image via the attention map it computes. It operates on Query, Key, and Value tensors, each a linear projection of the input: attention scores are calculated between Queries and Keys, and those scores weight the Values to generate the output.
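A from-scratch sketch of that computation, assuming the usual scaled dot-product formulation (the class name and defaults here are illustrative):

```python
import math
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    """Illustrative multi-head self-attention with explicit Q, K, V."""
    def __init__(self, dim=768, num_heads=12):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.qkv = nn.Linear(dim, dim * 3)   # produce Q, K, V in one projection
        self.proj = nn.Linear(dim, dim)      # mix the heads back together

    def forward(self, x):                    # x: (B, N, D)
        B, N, D = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4) # each: (B, heads, N, head_dim)
        # Attention map: scaled dot products between Queries and Keys.
        attn = (q @ k.transpose(-2, -1)) / math.sqrt(self.head_dim)
        attn = attn.softmax(dim=-1)          # (B, heads, N, N)
        # Output: attention-weighted sum of the Values.
        out = (attn @ v).transpose(1, 2).reshape(B, N, D)
        return self.proj(out)
```

The single `qkv` projection produces all three tensors in one matrix multiply, and splitting `dim` across `num_heads` lets each head attend to a different subspace of the patch embeddings.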