Analyzing Deepseek's "undefined" NVIDIA PTX optimizations (with benchmarks!)

https://tube.ekaii.fr/videos/watch/dd2c3220-703b-4d2a-9a7d-2dc119f63914

Two days ago, Deepseek surprised everyone with an "undefined-behavior" PTX optimization that speeds up particular ML workloads in a kernel for NVIDIA Hopper GPUs. Let's reverse engineer the hack, implement it ourselves, and benchmark the speedup on an H100.

Link to my test code: https://github.com/LaurieWired/BenchmarkCustomPTX

Timestamps
00:00 CUDA vs PTX vs SASS
02:12 Global Memory Target
03:27 Custom PTX Walkthrough
06:40 NVIDIA ISA Reference
07:42 Example Implementation
10:38 H100 Benchmark
11:46 SASS (Machine) Code

Follow LaurieWired on Social Media:
►https://linktr.ee/lauriewired

Mirrored from https://www.youtube.com/watch?v=iEda8_Mvvo4

Sat, 20 Jun 2026 18:55:18 GMT

All rights reserved, unless otherwise specified in the terms specified at https://tube.ekaii.fr/about and potential licenses granted by each content's rightholder.