Vision Transformer Quick Guide - Theory and Code in (almost) 15 min

Published On Jul 4, 2023

▬▬ Papers / Resources ▬▬▬
Colab Notebook: https://colab.research.google.com/dri...
ViT paper: https://arxiv.org/abs/2010.11929
Best Transformer intro: https://jalammar.github.io/illustrate...
CNNs vs ViT: https://arxiv.org/abs/2108.08810
CNNs vs ViT Blog: https://towardsdatascience.com/do-vis...
Swin Transformer: https://arxiv.org/abs/2103.14030
DeiT: https://arxiv.org/abs/2012.12877

▬▬ Support me if you like 🌟
►Link to this channel: https://bit.ly/3zEqL1W
►Support me on Patreon: https://bit.ly/2Wed242
►Buy me a coffee on Ko-Fi: https://bit.ly/3kJYEdl
►E-Mail: [email protected]

▬▬ Used Music ▬▬▬▬▬▬▬▬▬▬▬
Music from #Uppbeat (free for Creators!):
https://uppbeat.io/t/92elm/jasmine
License code: SMTWRWLNGHZHH0OC

▬▬ Used Icons ▬▬▬▬▬▬▬▬▬▬
All icons are from Flaticon: https://www.flaticon.com/authors/freepik

▬▬ Timestamps ▬▬▬▬▬▬▬▬▬▬▬
00:00 Introduction
00:16 ViT Intro
01:12 Input embeddings
01:50 Image patching
02:54 Einops reshaping
04:13 [CODE] Patching
05:35 CLS Token
06:40 Positional Embeddings
08:09 Transformer Encoder
08:30 Multi-head attention
08:50 [CODE] Multi-head attention
09:12 Layer Norm
09:30 [CODE] Layer Norm
09:55 Feed Forward Head
10:05 [CODE] Feed Forward Head
10:21 Residuals
10:45 [CODE] final ViT
13:10 CNN vs. ViT
14:45 ViT Variants

▬▬ My equipment 💻
- Microphone: https://amzn.to/3DVqB8H
- Microphone mount: https://amzn.to/3BWUcOJ
- Monitors: https://amzn.to/3G2Jjgr
- Monitor mount: https://amzn.to/3AWGIAY
- Height-adjustable table: https://amzn.to/3aUysXC
- Ergonomic chair: https://amzn.to/3phQg7r
- PC case: https://amzn.to/3jdlI2Y
- GPU: https://amzn.to/3AWyzwy
- Keyboard: https://amzn.to/2XskWHP
- Bluelight filter glasses: https://amzn.to/3pj0fK2
