AI Dynamics

Global AI News Aggregator

@id_aa_carmack

  • Bfloat16 Precision Gaps in Large Scatter Plots Beyond Origin

    Making a scatter plot of 400_000 data points, some of the plots had odd gaps in coverage. It took me a little while to realize that it was only when the data was farther from the origin — it was the raw bfloat16 precision. Everything looks great from -1 to 1, but as you go past 2 and 4, the coverage gaps get larger. My intuition didn't have it being quite so "discretely countable" at those modest numeric values. Float32 for comparison.

    → View original post on X — @id_aa_carmack, 2026-04-09 23:01 UTC
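
The gaps follow from the format itself: bfloat16 keeps 8 significant bits (7 stored), so each power-of-two interval holds only 128 distinct values and the step size doubles at 1, 2, 4, and so on. The sketch below is an illustration, not the code behind the plots. It emulates bfloat16 by rounding the float32 bit pattern with Python's struct module, and the helper names `to_bfloat16` and `spacing_at` are made up for this example.

```python
import struct

def to_bfloat16(x: float) -> float:
    """Round a finite float to the nearest bfloat16 value (ties to even)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]  # float32 bit pattern
    # bfloat16 is the top 16 bits of float32: sign, 8 exponent bits, 7 mantissa bits.
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    bits = (((bits + rounding_bias) >> 16) << 16) & 0xFFFFFFFF
    return struct.unpack("<f", struct.pack("<I", bits))[0]

def spacing_at(x: float) -> float:
    """Gap between bfloat16(x) and the next representable value above it (x > 0)."""
    lo = to_bfloat16(x)
    bits = struct.unpack("<I", struct.pack("<f", lo))[0]
    hi = struct.unpack("<f", struct.pack("<I", bits + 0x10000))[0]
    return hi - lo

# Print the step size at the start of several power-of-two intervals.
for x in (0.5, 1.0, 2.0, 4.0):
    print(f"spacing just above {x}: {spacing_at(x)}")
```

Between 4 and 8 the step is four times the step just above 1, which is consistent with the coverage gaps growing as the data moves away from the origin.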

  • Improving Judgments Through Pairwise Comparisons and ELO Ranking

    So many judging tasks could be improved by aggregating partial orderings, and in the limit, just ordering pairs. The annual Libertarian Futurist Society novel awards discussion is starting, and while I would like to participate on some level, there is no way I have time to read an entire slate of novels. However, I will likely read at least two from the list, and I could give a relative assessment. This cries out for the use of something like ELO ranking, as in chess competition, perhaps with some suggestions to get sufficient coverage. Peer and out-of-chain employee performance calibrations could probably also benefit from a greater quantity of sparse pairwise comparisons.

    → View original post on X — @id_aa_carmack, 2026-04-06 19:36 UTC
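
As a minimal sketch of the idea, the snippet below rates items from sparse (winner, loser) pairs using the standard Elo update. The K-factor, the starting rating, the number of passes over the data, and the sample judgments are all assumptions chosen for illustration. Nothing here comes from an actual awards process.

```python
from collections import defaultdict

def expected_score(r_a: float, r_b: float) -> float:
    """Standard Elo expected score of A against B (400-point logistic scale)."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def elo_from_pairs(comparisons, k=32.0, start=1500.0, passes=20):
    """Rate items from (winner, loser) pairs.

    Replaying the pairs several times is a simple way to let sparse data
    settle; it is an illustrative choice, not part of classic Elo.
    """
    ratings = defaultdict(lambda: start)
    for _ in range(passes):
        for winner, loser in comparisons:
            e_win = expected_score(ratings[winner], ratings[loser])
            delta = k * (1.0 - e_win)  # the winner scored 1, expected e_win
            ratings[winner] += delta
            ratings[loser] -= delta
    return dict(ratings)

# Hypothetical judgments: each judge compared only the two novels they read.
pairs = [("A", "B"), ("B", "C"), ("A", "C"), ("C", "D")]
ratings = elo_from_pairs(pairs)
ranking = sorted(ratings, key=ratings.get, reverse=True)
print(ranking)
```

No single judge ranked all four items, yet the pairwise results chain together into a full ordering. Getting "sufficient coverage" corresponds to making sure the comparison graph is connected.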

  • GPU Power Draw as True Utilization Metric in Data Centers

    Without getting all the way down to performance counters, GPU power from nvidia-smi is a better indicator of true utilization than job scheduling or “gpu busy”. I would love to see animated “heat maps” of the big data centers, with each pixel being an individual GPU’s power draw. I am confident that inference and frontier training at the big labs is highly efficient, but I wonder how many GPUs would be dark due to scheduling and inefficient research code. With a little calibration for base load and peak, just the power bill for the datacenter would be a pretty good first order indicator of utilization.

    → View original post on X — @id_aa_carmack, 2026-04-02 14:49 UTC
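
A rough sketch of the calibration idea: read per-GPU power with nvidia-smi's `--query-gpu` interface and rescale it between an idle and a peak wattage. The query fields `index`, `power.draw`, and `power.limit` are real nvidia-smi fields. The function names and the idle and peak figures in the usage line are hypothetical and would need to be measured for a given GPU model.

```python
import subprocess

QUERY = ["nvidia-smi", "--query-gpu=index,power.draw,power.limit",
         "--format=csv,noheader,nounits"]

def parse_power(csv_text: str):
    """Parse 'index, draw_watts, limit_watts' rows into (index, draw, limit)."""
    rows = []
    for line in csv_text.strip().splitlines():
        fields = [f.strip() for f in line.split(",")]
        if len(fields) != 3:
            continue
        try:
            rows.append((int(fields[0]), float(fields[1]), float(fields[2])))
        except ValueError:
            continue  # skip rows where a field is not numeric, e.g. "[N/A]"
    return rows

def power_utilization(draw: float, idle_watts: float, peak_watts: float) -> float:
    """Map a power reading onto 0..1 after calibrating for base load and peak."""
    span = peak_watts - idle_watts
    if span <= 0:
        raise ValueError("peak_watts must exceed idle_watts")
    return min(1.0, max(0.0, (draw - idle_watts) / span))

def read_gpus():
    """Run nvidia-smi on this machine and return the parsed rows."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return parse_power(out)

# Parsing a captured sample, so the example runs without a GPU.
sample = "0, 312.45, 700.00\n1, [N/A], 700.00\n"
print(parse_power(sample))
print(power_utilization(385.0, idle_watts=70.0, peak_watts=700.0))
```

One value per GPU per sampling interval is all the "heat map" described above would need as pixel data.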

  • SIGReg and LeWM Experiments: Challenges with Value Function Generalization

    Adding a little bit of SIGReg to prefinal activations did coerce them into independent Gaussian, but it hurt generalization on value functions. Training a full LeWM ahead of time also resulted in worse value function estimation. I’m not giving up yet, but my first few attempts

    → View original post on X — @id_aa_carmack,

  • LeJEPA: Scalable Self-Supervised Learning Without Heuristics Analysis

    nitter.net/ID_AA_Carmack/status/2…

    John Carmack (@ID_AA_Carmack)

    #PaperADay 10 LeJEPA: Provable and Scalable Self-Supervised Learning Without the Heuristics arxiv.org/pdf/2511.08544

    The comments on #PaperADay 3 recommended this paper as the state of the art JEPA paper, and it does look much better! They acknowledge that much of the prior JEPA research is ad-hoc and full of heuristics, but here they make strong theoretical claims of optimality and provide proofs (which I did not read).

    The first claim is that isotropic Gaussian is the unique optimal embedding distribution for both linear and nonlinear probing, minimizing worst-case risk across downstream tasks. I would have taken that on faith with just a “sounds good to me”, but they go into it with details and examples.

    Actually getting an isotropic Gaussian in high dimensions is easier said than done. They present Sketched Isotropic Gaussian Regularization (SIGReg) as a well behaved loss function to achieve this after analyzing a number of different statistical tests, and they claim it beats the curse of dimensionality with linear scalability.

    The final loss is just a blend factor to weight the JEPA prediction loss against the SIGReg isotropy loss. This is the one tunable hyperparameter for LeJEPA.

    Despite the P in JEPA, they don’t use predictor networks here, they just directly compare view embeddings for the JEPA loss. Predictor networks could still be useful for video sequences, especially when conditioned with action information for agents / robots.

    Each training image is augmented to produce 2 global views and 6 local views with different spatial scales but the same set of color and geometric transformations. The loss is the average MSE between the average of the global view embeddings and each of the local view embeddings.

    I don’t have a good feel for the tradeoffs in their view transforms, which still seem very much in the ad-hoc space, but they will determine the nature of what gets filtered out of the representation. Learning what doesn’t matter is critical, but the specification of “matters” is only implicit in the view transformations.

    LeJEPA itself is architecture independent – anything that digests a batch of samples from a dataset into vectors can be used: vision transformers, MLPs, ConvNets, etc. The specific augmentations for views would be input modality specific, but the LeJEPA algorithm could work on audio, images, video, or other things.

    They show that the LeJEPA loss on a large foundation model is very indicative of downstream task performance, both directly, and with a heuristic to improve the predictive power of the loss further. They also show that it can be used to train from scratch on small datasets with as few as 1000 samples and achieve better results than probing a conventional general foundation model.

    I was pleased to see sample code blocks in the paper instead of Greek-laden pseudocode, as well as a GitHub repo.

    Appendix D has interesting details on generating good coverage of unit hyperspheres with low discrepancy samples by transforming Sobol sequences, but this is only for their theoretical analysis, and they show you are better off just making new random hypervectors every batch, with even 16 random vectors outperforming a fixed set of thousands.

    Some questions:

    In the discussion of non-linear probing, only kNN and kernel methods are mentioned, presumably for their theoretical analysis tractability, but would an MLP generally perform better?

    A JEPA embedding is not fully reversible like NICE or a RevNet, so how does it react to inputs that are far outside the training set? Will novel inputs map to unique embeddings, or could they be collapsed onto the codes from the training set?

    How would the embeddings evolve in a continuous learning environment, as novel inputs are added to the training mix?

    Can a JEPA be overtrained – is lower training loss always better, or would there be an optimal early stopping point?

    https://nitter.net/ID_AA_Carmack/status/2014883608037556431#m

    → View original post on X — @id_aa_carmack, 2026-03-31 18:24 UTC
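
The prediction term described in the review (average MSE between the mean of the global-view embeddings and each local-view embedding) is small enough to sketch in plain Python. This is an illustration of that one sentence, not the paper's code. The function names are made up, the SIGReg value is treated as an opaque number supplied by the caller, and the convex `(1 - lam) / lam` form of the blend is an assumption about how the single blend factor is applied.

```python
def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def prediction_loss(global_embs, local_embs):
    """Average MSE between the mean global embedding and each local embedding."""
    center = mean_vector(global_embs)
    return sum(mse(center, z) for z in local_embs) / len(local_embs)

def blended_loss(pred, sigreg, lam=0.05):
    """Weight the prediction term against the isotropy term.

    The convex-combination form is assumed for this sketch; sigreg is
    whatever value an isotropy regularizer returned for the batch.
    """
    return (1.0 - lam) * pred + lam * sigreg

# Tiny hand-made embeddings: 2 "global" views and 2 "local" views of one image.
global_views = [[1.0, 1.0], [3.0, 3.0]]
local_views = [[2.0, 2.0], [4.0, 2.0]]
pred = prediction_loss(global_views, local_views)
print(pred, blended_loss(pred, sigreg=0.0))
```

Without the isotropy term this loss is minimized by mapping every view to the same vector, which is why it has to be blended with a regularizer that spreads the embeddings out.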

  • LeWorldModel: Stable JEPA Architecture for Offline Robotics World Models

    Paper review: LeWorldModel: Stable End-to-End Joint-Embedding Predictive Architecture from Pixels arxiv.org/pdf/2603.19312

    Nice clean GitHub: github.com/lucas-maes/le-wm

    This is the application of the LeJEPA results to world models, trained offline on experience from three different robotics style tests with one to two million steps in each dataset. Re-states the benefits of the SIGReg loss relative to prior world model approaches.

    Uses ImageNet standard 224×224 RGB pixel input images with an unmodified ViT-Tiny vision transformer from HuggingFace to generate latents. One extra post-projection step is needed to give SIGReg the necessary freedom to perturb the latents into independent Gaussians, since ViT ends with a layernorm’d layer. Also tested with ResNet-18, which still performed well, but slightly worse.

    Uses a 192 dimensional latent. Performance slightly dropped when doubling the latent size to 384; it would be nice to know if it was stable there, or if it continued worsening with excessive latents. There is a relationship between batch size and SIGReg; the larger latent might have improved performance if the batch size were increased.

    The predictor is implemented as a ViT-S backbone – Why a vision transformer when the latent is flat? Uses a history of 3 sets of latents for two of the benchmarks and 1 for the other. Performance was markedly better with the “small” ViT model than the “tiny”, but the larger “base” model degraded notably, which is interesting. Dropout of 0.1 on the predictor significantly improved performance. 0.2 was still better than 0.0, but 0.5 was worse.

    Trained with a batch of 128 x 4 trajectories. I wish their training loss graphs were more zoomed in with grid lines.

    Performs planning at test time instead of building a policy by training in imagination like Dreamer / Diamond. Rolls out 300 initially random sets of actions up to a planning horizon H of 5 (at frame-skip 5). Iterates up to 30 times using the Cross Entropy Method (CEM). The main paper body mentions using a Model Predictive Control (MPC) strategy, where only the first K planned actions are executed before replanning, but appendix D says they execute all 5 planned actions.

    After training, they probe the latent space to demonstrate that it does capture and represent physically meaningful quantities. They also implement a decoder from the latent space back to pixels – not used by the algorithms, but helpful to see what things the latent space is actually representing. They tested incorporating the reconstruction loss into training, but it hurt performance somewhat.

    They wound up with a 0.1 lambda for SIGReg, as opposed to 0.05 in the LeJEPA paper. They use 1024 SIGReg projections, but observe that the number has negligible impact.

    I like the JEPA framework, but so far my attempts to use it on Atari games with value functions have not matched my other efforts.

    Lucas Maes (@lucasmaes_)

    JEPA are finally easy to train end-to-end without any tricks! Excited to introduce LeWorldModel: a stable, end-to-end JEPA that learns world models directly from pixels, no heuristics. 15M params, 1 GPU, and full planning <1 second. 📑: le-wm.github.io

    https://nitter.net/lucasmaes_/status/2036080584569618741#m

    → View original post on X — @id_aa_carmack, 2026-03-31 18:24 UTC
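
For readers unfamiliar with the Cross Entropy Method, here is a toy sketch using the numbers quoted in the review (300 candidate action sequences, horizon 5, up to 30 iterations). Everything else is an assumption for illustration: the Gaussian sampling distribution, the 10% elite fraction, the fixed seed, and above all `rollout_cost`, a one-line stand-in for rolling a learned world model forward in latent space.

```python
import random

def cem_plan(cost_fn, horizon=5, samples=300, iters=30, elite_frac=0.1, seed=0):
    """Search for a low-cost action sequence with the Cross Entropy Method."""
    rng = random.Random(seed)
    mean = [0.0] * horizon
    std = [1.0] * horizon
    n_elite = max(2, int(samples * elite_frac))
    for _ in range(iters):
        # Sample candidate action sequences from the current Gaussian.
        candidates = [[rng.gauss(m, s) for m, s in zip(mean, std)]
                      for _ in range(samples)]
        # Keep the lowest-cost candidates and refit the Gaussian to them.
        candidates.sort(key=cost_fn)
        elites = candidates[:n_elite]
        columns = list(zip(*elites))
        mean = [sum(col) / n_elite for col in columns]
        std = [max(1e-3, (sum((v - m) ** 2 for v in col) / n_elite) ** 0.5)
               for col, m in zip(columns, mean)]
    return mean

def rollout_cost(actions, start=0.0, goal=3.0):
    """Toy dynamics: each action is added to a 1-D state; cost is distance to goal."""
    state = start
    for a in actions:
        state += a
    return (state - goal) ** 2

plan = cem_plan(rollout_cost)
print(plan, rollout_cost(plan))
```

Under an MPC strategy only the first few entries of `plan` would be executed before planning again from the new state, whereas executing the whole plan corresponds to what the review reports from appendix D.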

  • Multiple Losses and Shared Computation in Backpropagation Analysis

    That would only be true if the losses didn’t share any computation. I would think the far more common case of multiple losses would be regularizations on a shared set of layers, in which case splitting the loss backwards would still give the same result, but be nearly twice as

    → View original post on X — @id_aa_carmack,
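
The post is cut off, so reading its last clause as a statement about cost is an assumption. Under that reading, the point can be shown with a hand-written toy: one shared layer `h = w * x`, a task loss `(h - y)^2`, and a regularizer `reg * h^2`. All names and values below are hypothetical. Because the gradient is linear in the losses, backpropagating the sum once gives the same result as backpropagating each loss separately, but the separate version walks the shared layer twice.

```python
shared_backward_calls = 0

def shared_forward(w, x):
    """The shared computation that both losses depend on."""
    return w * x

def shared_backward(grad_h, x):
    """Backward pass through the shared layer: dL/dw = dL/dh * x."""
    global shared_backward_calls
    shared_backward_calls += 1
    return grad_h * x

def grads(w, x, y, reg):
    """Return dL/dw computed from the summed loss and from the split losses."""
    h = shared_forward(w, x)
    d_task = 2.0 * (h - y)   # d/dh of (h - y)^2
    d_reg = 2.0 * reg * h    # d/dh of reg * h^2
    combined = shared_backward(d_task + d_reg, x)                    # one pass
    split = shared_backward(d_task, x) + shared_backward(d_reg, x)   # two passes
    return combined, split

combined, split = grads(w=0.5, x=2.0, y=3.0, reg=0.1)
print(combined, split, shared_backward_calls)
```

In a real network the shared layers usually dominate the backward cost, so the split version approaches double the work while producing the same gradient.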

  • Gratitude for Meta’s Open Source PyTorch Contributions

    Just because a corporation receives the gift doesn’t mean everyone else doesn’t. I am thankful for all the open source code in PyTorch that Meta has released.

    → View original post on X — @id_aa_carmack,

  • Open Source as Gift: AI Training Magnifies Value

    I know there is some overlap between open source and anti-AI activists, but I have a hard time reconciling it. My million+ open source LOC were always intended as a gift to the world. Yes, I would make arguments about how it would strengthen our communities, and the GPL would prevent outright exploitation by our competitors, but those were to allay fears of my partners to allow me to make the gift. AI training on the code magnifies the value of the gift. I am enthusiastic about it! Some people do look at open source as a tool for social change, career advancement, or reputation building, but those are all downstream of the gift.

    Rich Whitehouse (@DickWhitehouse)

    Genuinely devastating take to see from someone who popularized the GPL across so many communities. Fails to appreciate the social and cultural importance of the license.

    https://nitter.net/DickWhitehouse/status/2032241405276668188#m

    → View original post on X — @id_aa_carmack, 2026-03-13 14:15 UTC

  • Open Source Code and AI Training: A Matter of Legitimacy

    It is absurd to have a problem with AI learning from code you have open sourced. If github trained models on the contents of your private repos, that would be a violation.

    → View original post on X — @id_aa_carmack,