<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
    <channel>
        <title>Analyzing Deepseek's "undefined" NVIDIA PTX optimizations (with benchmarks!)</title>
        <link>https://tube.ekaii.fr/videos/watch/dd2c3220-703b-4d2a-9a7d-2dc119f63914</link>
        <description>Two days ago, Deepseek surprised everyone with an "undefined-behavior" PTX optimization speeding up particular ML workloads on a Hopper NVIDIA GPU Kernel. Let's reverse engineer the hack, implement it ourselves, and benchmark the speedup on an H100. -- Link to my test code: https://github.com/LaurieWired/BenchmarkCustomPTX -- Timestamps 00:00 CUDA vs PTX vs SASS 02:12 Global Memory Target 03:27 Custom PTX Walkthrough 06:40 NVIDIA ISA Reference 07:42 Example Implementation 10:38 H100 Benchmark 11:46 SASS (Machine) Code Follow LaurieWired on Social Media: ►https://linktr.ee/lauriewired -- mirrored from https://www.youtube.com/watch?v=iEda8_Mvvo4</description>
        <lastBuildDate>Sat, 20 Jun 2026 18:55:18 GMT</lastBuildDate>
        <docs>https://validator.w3.org/feed/docs/rss2.html</docs>
        <generator>PeerTube - https://tube.ekaii.fr</generator>
        <image>
            <title>Analyzing Deepseek's "undefined" NVIDIA PTX optimizations (with benchmarks!)</title>
            <url>https://tube.ekaii.fr/client/assets/images/icons/icon-1500x1500.png</url>
            <link>https://tube.ekaii.fr/videos/watch/dd2c3220-703b-4d2a-9a7d-2dc119f63914</link>
        </image>
        <copyright>All rights reserved, unless otherwise specified in the terms at https://tube.ekaii.fr/about or in licenses granted by each content's rightholder.</copyright>
        <atom:link href="https://tube.ekaii.fr/feeds/video-comments.xml?videoId=dd2c3220-703b-4d2a-9a7d-2dc119f63914" rel="self" type="application/rss+xml"/>
    </channel>
</rss>