Same problem with patch tool usage in two separate languages and two separate sessions. Companies get one chance to impress these days, and I think Gemini 3 just blew it…
@alexjc
-

Gemini 3 Review: Fast but Practically Unusable
Gemini 3 review: it's fast, it's not dumb, but it's completely unusable in practice. It will get lost after a few edits then completely trash the file: issuing patch commands that include line numbers at best, and at worst it will discard most of the lines!
-
Tokenizer Approaches Impact LLM Performance on HellaSwag Benchmarks
If you measure downstream performance on HellaSwag rather than speedrun-equivalent loss, then different tokenizer approaches come out on top… The first run I did was much better on common-sense downstream, trained in equivalent time or better.
-
Tokenizer Training and Data Filtering Compliance Standards
Well, the tokenizer I used was trained on large quantities of data — I filtered the tokens based on yet more data from FineWeb. Question is if that's acceptable according to your rules…
-
Retokenization and Language Knowledge in Model Training
The biggest question is whether you allow re-tokenization, and whether that should be done with the same data as the training itself. Right now there is knowledge about the language built into the existing tokens, and changing that is against the rules and/or unfavorable.
-

Plaintiff Lawyers Mishandle Copyright Claims and Cloud Data Evidence
The plaintiff lawyers mis-pleaded the important copyright claims and had to withdraw them. They also "forgot" to provide the most damning evidence for reputational harm. I talked to the legal team; they had no clue about cloud copies of data even the day before court. I'm not saying
-
Question Phrasing Impact on AI Technical Answers
I love how both the answers about thermal throttling and speculative decoding are correct based on how you phrased the question!
-

Language Models Struggle With High School Math Fundamentals
Language models perform poorly on high-school math? You don't want to hear this, but the problems started in grade school. The moment we (collectively) found it acceptable that mid-tier models could score only 75%-85% on a GSM test set of 1.32k straightforward problems…
-

Faster Coding Model Pricing Efficiency Questions
The speed of a faster coding model is worth it, but it seems mis-priced. C1 gobbles through files and reasons more; expect extra feedback to reach a similar place as a slower model does with less of everything. Intuitively it feels more expensive "the fast way" with current pricing.
-
Chat Modes, Code Review, and Conversation Summarization Evolution
Not a single feature, but evolution of:
1) Making the code changes directly in the files consistently (chat modes) and marking the diffs in a nicely reviewable way.
2) Long chat summarization combined with third-party model capabilities to handle ongoing conversations so you