John Carmack (@ID_AA_Carmack) #PaperADay 10

LeJEPA: Provable and Scalable Self-Supervised Learning Without the Heuristics
arxiv.org/pdf/2511.08544

The comments on #PaperADay 3 recommended this paper as the state-of-the-art JEPA paper, and it does look much better! They acknowledge that much of the prior JEPA research is ad hoc and full of heuristics, but here they make strong theoretical claims of optimality and provide proofs (which I did not read).

The first claim is that the isotropic Gaussian is the unique optimal embedding distribution for both linear and nonlinear probing, minimizing worst-case risk across downstream tasks. I would have taken that on faith with just a "sounds good to me", but they go into it with details and examples.

Actually getting an isotropic Gaussian in high dimensions is easier said than done. They present Sketched Isotropic Gaussian Regularization (SIGReg) as a well-behaved loss function to achieve this, after analyzing a number of different statistical tests, and they claim it beats the curse of dimensionality with linear scalability. The final loss is just a blend factor weighting the JEPA prediction loss against the SIGReg isotropy loss; this blend factor is the one tunable hyperparameter for LeJEPA.

Despite the P in JEPA, they don't use predictor networks here; they just directly compare view embeddings for the JEPA loss. Predictor networks could still be useful for video sequences, especially when conditioned with action information for agents / robots.

Each training image is augmented to produce 2 global views and 6 local views with different spatial scales but the same set of color and geometric transformations. The loss is the average MSE between the average of the global view embeddings and each of the local view embeddings.
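The view loss and the blend described above can be sketched in a few lines. This is a minimal numpy illustration, not the paper's implementation: the function names are mine, the `lam` default is illustrative, and normalization details are assumptions.

```python
import numpy as np

def jepa_prediction_loss(global_embs, local_embs):
    """Predictor-free JEPA loss sketch: average MSE between the mean of
    the global-view embeddings and each local-view embedding.
    global_embs: (2, dim), local_embs: (6, dim) per the paper's setup."""
    target = global_embs.mean(axis=0)      # average of the global views
    diffs = local_embs - target            # broadcasts over local views
    return float((diffs ** 2).mean())

def lejepa_loss(global_embs, local_embs, sigreg_loss, lam=0.05):
    """Blend of the prediction loss and the SIGReg isotropy loss.
    lam is the single tunable hyperparameter; 0.05 is an illustrative
    placeholder, not a value from the paper."""
    return jepa_prediction_loss(global_embs, local_embs) + lam * sigreg_loss
```

With identical global and local embeddings the prediction term vanishes, leaving only the weighted isotropy term.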
I don't have a good feel for the tradeoffs in their view transforms, which still seem very much in ad-hoc territory, but they determine what gets filtered out of the representation. Learning what doesn't matter is critical, but the specification of "matters" is only implicit in the view transformations.

LeJEPA itself is architecture independent: anything that digests a batch of samples from a dataset into vectors can be used, whether vision transformers, MLPs, or ConvNets. The specific augmentations for views would be input-modality specific, but the LeJEPA algorithm could work on audio, images, video, or other things.

They show that the LeJEPA loss on a large foundation model is very indicative of downstream task performance, both directly and with a heuristic to improve the predictive power of the loss further. They also show that it can be used to train from scratch on small datasets with as few as 1000 samples and achieve better results than probing a conventional general foundation model.

I was pleased to see sample code blocks in the paper instead of Greek-laden pseudocode, as well as a GitHub repo. Appendix D has interesting details on generating good coverage of unit hyperspheres with low-discrepancy samples by transforming Sobol sequences, but this is only for their theoretical analysis; they show you are better off just drawing new random hypervectors every batch, with even 16 random vectors outperforming a fixed set of thousands.

Some questions: In the discussion of nonlinear probing, only kNN and kernel methods are mentioned, presumably for their theoretical tractability, but would an MLP generally perform better? A JEPA embedding is not fully reversible like NICE or a RevNet, so how does it react to inputs far outside the training set? Will novel inputs map to unique embeddings, or could they be collapsed onto the codes from the training set?
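The sketching idea — fresh random unit directions each batch, with the embedding's 1-D projections pushed toward a standard normal — can be illustrated as below. Note the penalty here is a simple moment-matching stand-in of my own, not the paper's actual 1-D statistical test, and the function names are hypothetical.

```python
import numpy as np

def random_directions(dim, num=16, rng=None):
    """Fresh random unit vectors each batch; the paper reports that even
    16 resampled directions outperform a fixed set of thousands."""
    rng = rng if rng is not None else np.random.default_rng()
    v = rng.standard_normal((num, dim))
    return v / np.linalg.norm(v, axis=1, keepdims=True)

def isotropy_penalty(embs, dirs):
    """Project embeddings onto each direction and penalize deviation of
    the 1-D marginals from N(0, 1). Matching mean and variance is a
    simplified stand-in for the statistical test SIGReg actually uses."""
    proj = embs @ dirs.T                       # (batch, num_dirs)
    mean_pen = (proj.mean(axis=0) ** 2).mean()
    var_pen = ((proj.var(axis=0) - 1.0) ** 2).mean()
    return float(mean_pen + var_pen)
```

Because each batch sees a new random sketch, no single set of directions has to cover the hypersphere on its own; coverage accumulates over training steps, which is presumably why resampling beats a fixed low-discrepancy set.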
How would the embeddings evolve in a continual learning setting, as novel inputs are added to the training mix? Can a JEPA be overtrained? Is lower training loss always better, or would there be an optimal early stopping point?

https://nitter.net/ID_AA_Carmack/status/2014883608037556431#m
@id_aa_carmack, 2026-03-31 18:24 UTC