AI Dynamics

Global AI News Aggregator

COMPUTING

  • Day 83 GPU Programming: DeepSeek Multi-Head Latent Attention Optimization

    Day 83/365 of GPU Programming: Looking at DeepSeek's Multi-Head Latent Attention today. The last part of the AMD challenge series is to optimize an MLA decode kernel for MI355X, where the absorbed Q and compressed KV cache are given and your task is to do the attention computation. A resource that really helped me internalize what MLA does was @rasbt's incredible visual guide to attention variants in LLMs (luckily he posted it last week!), which covers everything from MHA to GQA to MLA to SWA, et cetera. If there's one place to get a visual intuition for recent attention mechanisms, it's this blog post. @jbhuang0604's video on MQA, GQA, MLA, and DSA was the best conceptual intro I found on the topic, progressively building up the ideas from first principles. The Welch Labs analysis of MLA is a great watch as well, with a beautiful visualization of the changes DeepSeek made for MLA. I tried out a few kernels once I had a basic understanding of MLA, and I think I'm slowly getting more comfortable with at least analyzing kernels.

    levi (@levidiamode), Day 82/365 of GPU Programming: Taking a closer look at Mixture of Experts today so I can write better MoE kernels; specifically, to optimize an MXFP4 MoE fused kernel for the GPU Mode challenge. I haven't had much prior exposure to MoEs, so there were lots of new concepts to learn today. Luckily I found the best intro to MoEs thanks to @MaartenGr's visual overview of the topic. I then watched @tatsu_hashimoto's amazing Stanford CS336 lecture on MoEs, which added deeper context around why MoEs are gaining popularity: FLOPs, OLMoE, infra complexity, routing functions (mind-blown this works so well…), expert sizes, training objectives, top-k routing, and DeepSeek variations. Once I had a basic understanding I started playing around with some AITER kernels, but progress there is TBD. Also had a nice chat with @juscallmevyom (who was kind enough to reach out!) about the AMD kernels and the challenge of materialization overhead.
— https://nitter.net/levidiamode/status/2037297869518950430#m

    → View original post on X — @levidiamode, 2026-03-27 22:49 UTC
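    The absorbed-Q formulation described above means decode attention can score directly against the compressed latent cache. A minimal NumPy sketch of that decode step (function and variable names are hypothetical, shapes are simplified, and MLA's decoupled RoPE branch is omitted):

```python
import numpy as np

def mla_decode(q_abs, c_kv, w_uv):
    """Toy MLA decode step for one new token (illustration only).

    q_abs: (n_heads, d_c)         queries with W_UK already absorbed, so
                                  scores are taken against the latent cache
    c_kv:  (seq_len, d_c)         compressed KV cache, shared by all heads
    w_uv:  (n_heads, d_c, d_head) per-head value up-projection
    """
    scores = q_abs @ c_kv.T / np.sqrt(q_abs.shape[-1])  # (n_heads, seq_len)
    scores -= scores.max(axis=-1, keepdims=True)        # numerically stable
    p = np.exp(scores)
    p /= p.sum(axis=-1, keepdims=True)                  # softmax over cache
    latent = p @ c_kv                                   # (n_heads, d_c)
    return np.einsum('hc,hcd->hd', latent, w_uv)        # (n_heads, d_head)

out = mla_decode(np.ones((4, 8)), np.ones((10, 8)), np.ones((4, 8, 16)))
print(out.shape)  # (4, 16)
```

    The memory win is that `c_kv` stores a single d_c-dimensional latent per token regardless of head count; a real MI355X kernel would additionally fuse these steps and handle the RoPE branch.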

  • Why I Shill Droid 24/7: Eight Major Accomplishments Today

    Here's why I shill Droid 24/7. Today Droid single-handedly:

    1. Published a REAP of GLM-5 in FP8 (there's a reason no one else has done it: DSA is still very new): huggingface.co/0xSero/GLM-5-…
    2. Found and fixed an upstream issue with vLLM + DSA + Hopper where GLM-5's KV cache would need to recompute and spend 20x the time needed.
    3. Created multiple working quantisations on its own: it tried exl3 and autoround, but both failed (autoround 3-bit doesn't work on Ampere), so it resorted to GGUF: huggingface.co/0xSero/GLM-5-…
    4. Implemented github.com/0xSero/turboquant within 24 hours of the research paper coming out, and tested it across 5090s, 3090s, H100s, and B200s.
    5. Has been distilling larger models into LoRAs to help me test arxiv.org/abs/2505.21835, and it got an 80% prune to be semi-coherent again.
    6. Helped me find research papers and clean up slop with the human-writing skill.
    7. Got BYOK working in Cursor with Anthropic, ZAI, Kimi, MiniMax, and OpenAI: github.com/0xSero/factory-cu…
    8. Helped me implement the dynamic loading from blog.comfy.org/p/dynamic-vra… It only works on a tiny model, but still.

    I only have to check in on it every 30-45 minutes (and I am talking all 8 of my sessions); the thing will run for 16 hours with basically zero prep, all while I am mostly focused on my actual job and tweeting 24/7. Keep in mind each one of these experiments is running on a different server with different constraints; I don't understand how I can get such good results here.

    I love novelty, which is why I jump around and talk about all these different tools. I have used all of these harnesses and messed around with every feature. I keep coming back to this one, and I keep shilling it because I sincerely wish others get to experience this.

    → View original post on X — @nathanlands, 2026-03-27 19:36 UTC

  • MiniMax AI 2.5 Cloud Release: Developers Share First Test Ideas

    Since @MiniMax_AI 2.5 is available on our cloud, devs, we have a question for you: what's the first thing you'd test with the upgraded algorithm?

    → View original post on X — @sambanovaai,

  • T-Mobile Network Planning for Major Events and Emergency Communications

    Behind every major event is long-term network planning. Automation and AI-powered optimization help manage changing demand while strengthening everyday connectivity across key venues and transit hubs. More: https://t-mobile.com/news/network/t-mobile-bay-area-emergency-communications-big-game @TMobileBuiness Partner

    → View original post on X — @haroldsinnott,

  • Multimodal AI: The Future of Human-Computer Interaction According to Stanford

    At @NVIDIAGTC, @StanfordHAI's James @Landay said we're amid a major shift in human-computer interaction. Current text/voice AI is "just a blip" and he envisions a future of multimodal agents that anticipate user needs through voice, gesture, and context: nvidia.com/en-us/on-demand/s…

    → View original post on X — @stanfordhai, 2026-03-27 16:05 UTC

  • Mojo Kernels: Reducing conv2d Code from 870 to 130 Lines

    130 lines instead of 870. That's the difference between our conv2d implementation on Blackwell and CUTLASS's. We broke kernels into three swappable pieces: one for moving data, one for coordinating the pipeline, one for compute. When you need a new kernel, you only change the piece that actually needs to change. Part 3 of our Structured Mojo Kernels series walks through the details: modular.com/blog/structured-…

    → View original post on X — @jeremyphoward, 2026-03-27 15:00 UTC
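    The three-piece split described above can be sketched in a few lines. This is plain Python, not Modular's actual Mojo code, and every name here is hypothetical; it only illustrates the decomposition, with each piece exposed behind a narrow interface:

```python
import numpy as np

def tile_loader(inp, tile, i):
    """Data movement: fetch one input tile (stands in for global-to-shared copies)."""
    return inp[i * tile:(i + 1) * tile]

def compute_stage(tile_data, weight):
    """Compute: the per-tile math; swapping this piece changes the kernel's op."""
    return tile_data * weight

def pipeline(inp, weight, tile):
    """Coordination: iterates tiles, wiring loader and compute together."""
    n = len(inp) // tile
    return np.concatenate([compute_stage(tile_loader(inp, tile, i), weight)
                           for i in range(n)])

print(pipeline(np.arange(8.0), 2.0, 4))  # [ 0.  2.  4.  6.  8. 10. 12. 14.]
```

    Because the coordinator only talks to the other two pieces through their call signatures, a new kernel swaps just the piece that differs and reuses the rest, which is where the line-count savings come from.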

  • SambaNova RDU: Dataflow Architecture for Efficient AI Processing

    Want faster, more efficient AI? It starts with dataflow—the natural way AI models process data. Our RDU is built for exactly that. See how it works: https://
    sambanova.ai/products/dataf
    low-architecture?utm_source=x&utm_medium=organic&utm_campaign=developer

    → View original post on X — @sambanovaai,
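    As a toy illustration of the dataflow idea (ordinary Python generators, not SambaNova's actual RDU programming model), each stage streams values to the next as soon as they are ready, instead of materializing a full intermediate tensor between kernel launches:

```python
# Hypothetical sketch: a three-stage "model" as a streaming dataflow pipeline.
def scale(xs, a):
    for x in xs:
        yield a * x        # stage 1: elementwise scale

def add_bias(xs, b):
    for x in xs:
        yield x + b        # stage 2: elementwise bias

def relu(xs):
    for x in xs:
        yield max(x, 0.0)  # stage 3: activation

inputs = iter([-2.0, -1.0, 0.5, 3.0])
stream = relu(add_bias(scale(inputs, 2.0), 1.0))
# each element flows through all three stages before the next is read
outputs = list(stream)
print(outputs)  # [0.0, 0.0, 2.0, 7.0]
```

    On real dataflow hardware the stages are spatial pipeline segments rather than Python generators, but the scheduling idea is the same: producers feed consumers directly.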

  • Tenstorrent Unveils New Cluster with 1TB VRAM and 3TB DDR5

    New Tenstorrent cluster, hot from the kitchen:
    > 1TB of VRAM
    > 3TB DDR5 RAM
    > 32TB SSD storage
    New product; will share more later. P.S. Can you find the cat in the picture?

    → View original post on X — @tenstorrent, 2026-03-27 14:40 UTC

  • Living Brain Cells Play DOOM: Cortical Labs Advances Neuromorphic Computing

    Living Brain Cells Play DOOM: Cortical Labs Pushes Neuromorphic Computing Forward
    by @IntEngineering #Innovation #EmergingTech #Technology #Tech

    → View original post on X — @ronald_vanloon,

  • Insurers Leverage Cloud Computing for Efficiency and Flexibility

    Modern insurance runs in the cloud. Outsourcing cloud-computing storage allows insurers like Stuttgarter Lebensversicherung a.G. to benefit from the flexibility & efficiency of the cloud without placing strain on limited internal IT resources. http://2.sas.com/6011B6nkDB

    → View original post on X — @sassoftware,