George Hotz | Programming | can you multiply a matrix? (noob lesson) | geohot/tinygrad/tree/gemm
george hotz archive

Published on Jun 26, 2022

Date of stream 25 Jun 2022.
Live-stream chat added as Subtitles/CC - English (Twitch Chat).
Stream title: can you multiply a matrix? (noob lesson)

Source files:
- https://github.com/geohot/tinygrad/tr...
Follow for notifications:
-   / georgehotz  
Support George:
-   / georgehotz  
Programming playlist:
-    • George Hotz programming archive (twit...  
Compute computer
- Ubuntu 20.04.4 LTS
- AMD Ryzen 9 5950X
- 64GB RAM
- AMD Radeon RX 6900 XT
Streaming computer
- Apple MacBook M1
- LG UltraFine 5K
- Blue Yeti
- Apple Magic Keyboard
- HHKB
- tmux & Vim & Visual Studio Code with Vim Key Bindings and other
https://github.com/geohot/configuration

Chapters:
00:00:00 intro
00:01:10 quiet computer
00:02:10 no adderall joke
00:03:00 noob day
00:04:00 how to multiply a matrix
00:06:00 big matrix
00:07:10 j_blow raids George
00:07:45 how much compute is matrix multiplication
00:08:25 how to do matrix multiplication
00:09:50 FLOPS, time.monotonic
00:11:50 SI prefixes
00:14:00 hype titles, freedom units
00:15:00 CPU TFLOP/S, threadripper, ryzen
00:17:55 AMD Radeon RX 6900 XT
00:18:35 SGEMM, DGEMM, MADDNESS
00:20:10 github.com/dblalock/bolt
00:21:50 Theoretical GFLOPS
00:23:30 Same performance in C
00:26:30 multiply a matrix in C
00:28:05 timer in C
00:33:45 python,C performance, tiling
00:35:00 today's lesson (cache aware algorithm)
00:35:50 order of for loops
00:37:30 still slow
00:44:55 avx2 instructions c
00:49:30 FMA3, VFMADD
00:50:50 don't use strassen, cpu instructions, FMA
00:56:00 avx2 only about integers, we need FMA, thank you @paranon1
00:57:40 real
00:59:00 segmentation fault, align(64)
01:04:00 is that wrong?
01:09:00 still slow, threads
01:11:10 1 thread speed
01:14:30 visualizing what it is doing
01:15:20 __m256 init to 0, _mm256_fmadd_ps
01:23:04 time for printf's
01:26:00 short break, should we play wonderwall on a guitar
01:27:20 tweet about downsizing apartments
01:27:50 gdb
01:29:00 this is illegal, suing clang
01:30:29 not suing clang
01:31:30 that one is always 0 that can't be right
01:33:30 whiteboard missing
01:35:10 gemm tinygrad branch
01:37:25 internet broken
01:38:10 extract __m256, a bit faster
01:42:45 tracking down segmentation fault
01:43:55 data not aligned, dumbass
01:44:40 it's always your fault
01:45:10 good speed, alignment bytes
01:48:30 fan spinup
01:50:20 zen microarchitecture
01:54:30 something about this is slow
01:58:00 another way to do this
02:07:50 without and with -ffast-math
02:12:30 too early for optimization
02:22:50 visualizing
02:27:20 will work but stupid
02:32:10 number of ymm registers, ymm matmul
02:35:40 not getting the numpy performance
02:38:20 slower, second fma unit
02:39:40 it's faster now, don't trust -O3
02:44:40 lag on stream, turning off the dryer
02:47:15 hard to make faster
02:52:20 profile cache stalls x86
02:58:20 that loop looks fast
03:06:05 cpu cache sizes
03:15:50 cache coherence, how is it slower
03:22:38 short break
03:28:20 tweet about adderall, drug test, people without skills
03:32:20 zen microarchitecture, optimization
03:39:35 L1 only 32 kB
03:46:40 we are trying to do fast matrix multiply
03:54:40 openblas haswell gemm
03:59:45 online whiteboard
04:02:15 no sarcasm allowed, subscriber gets a timeout
04:02:50 removing code, _mm256_broadcast_ss
04:16:45 just persistent
04:19:00 whiteboard time, better understanding
04:25:40 don't want to reorder matrix
04:32:00 strassen = ban, wrong and slow
04:38:25 coherent meaning, access memory in order better
04:46:15 same number of fma as broadcasts
04:48:20 it's fast now
04:51:05 how to get the same fma adds
04:54:15 beating numpy
05:00:45 multithreading check, max clock, pragma
05:06:35 theoretical maximum on cpu
05:12:40 crushing numpy, real threads in C
05:22:10 double the speed, even more speed
05:24:10 overhead, semaphore
05:28:40 we cheated
05:29:30 no TFLOP
05:43:50 Alex is home, stupid question timeout
05:49:00 beautiful htop, throttling
05:51:40 theoretical maximum
05:53:40 cpu power draw
05:57:30 cpu temperature
06:01:00 disable throttling
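
For readers following the early chapters (how much compute is matrix multiplication; FLOPS, time.monotonic), here is a minimal sketch of the measurement idea. It is not code from the stream or from the tinygrad gemm branch: the function names matmul, matmul_flop and measure_gflops are hypothetical, and the pure-Python triple loop is only a stand-in for whatever kernel is being timed. It assumes the usual count of 2*N^3 floating-point operations for an N x N by N x N product (one multiply and one add per inner step).

```python
import time


def matmul(a, b):
    """Naive triple-loop product of two square matrices (lists of rows)."""
    n = len(a)
    c = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for k in range(n):
            aik = a[i][k]
            # Innermost loop walks a row of b and a row of c in order.
            for j in range(n):
                c[i][j] += aik * b[k][j]
    return c


def matmul_flop(n):
    """Each of the n*n outputs needs n multiplies and n adds: 2*n^3 in total."""
    return 2 * n ** 3


def measure_gflops(n):
    """Time one n x n matmul with time.monotonic and return GFLOP/s."""
    a = [[float(i + j) for j in range(n)] for i in range(n)]
    b = [[float(i - j) for j in range(n)] for i in range(n)]
    start = time.monotonic()
    matmul(a, b)
    # Guard against a zero reading on coarse clocks (illustrative choice).
    elapsed = max(time.monotonic() - start, 1e-9)
    return matmul_flop(n) / elapsed / 1e9


if __name__ == "__main__":
    print(f"{measure_gflops(64):.4f} GFLOP/s")
```

The number printed depends entirely on the machine and interpreter, so no expected output is given; the same FLOP count divided by elapsed time applies unchanged to a NumPy or C kernel.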

Official George Hotz communication channels:
- https://geohot.com
-   / georgehotz  
-   / georgehotz  
- https://github.com/geohot
-    / geohot  
-   / realgeorgehotz  

We archive George Hotz and comma.ai videos for fun.
Follow for notifications:
-   / geohotarchive  

Thank you for reading and using the SHOW MORE button.
We hope you enjoy watching George's videos as much as we do.
See you at the next video.
