Synthetic Data Generation for AI Model Training and Benchmarks

By @jxmnop – 08 August 2025 22h47

oh i mean it’s probably synthetic data that was generated to convey skills useful for certain benchmarks. not trying to imply that they’re training on the benchmarks. saw no evidence of that and i doubt it

→ View original post on X — @jxmnop

8 August 2025

Extracting Training Data Directly from Large Language Models

By @jxmnop – 08 August 2025 21h21

FUTURE WORK – direct extraction: we're working on directly extracting training data from models using RL and other methods. we'll be presenting our first work on this at COLM, and expect more in this space. we may be able to directly extract data from the 120b model... one day


Deduplicating Redundant AI-Generated Output Data

By @jxmnop – 08 August 2025 21h21

FUTURE WORK – deduplication: even though i varied the random seed and used temperature, a lot of the outputs are highly redundant. it would be prudent to deduplicate; i bet there are only 100k or fewer mostly-unique examples here
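A minimal sketch of what such a deduplication pass might look like, assuming the samples are plain strings. This is not the author's method: the word-shingle Jaccard similarity, the `0.8` threshold, and all function names here are illustrative assumptions.

```python
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return re.sub(r"\s+", " ", text.lower()).strip()


def shingles(text: str, n: int = 3) -> frozenset:
    """Return the set of overlapping n-word tuples in the normalized text."""
    words = normalize(text).split()
    if len(words) < n:
        # Short texts get a single shingle holding all their words.
        return frozenset([tuple(words)])
    return frozenset(tuple(words[i:i + n]) for i in range(len(words) - n + 1))


def jaccard(a: frozenset, b: frozenset) -> float:
    """Jaccard similarity: size of intersection over size of union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def deduplicate(samples, threshold: float = 0.8):
    """Greedily keep a sample only if it is not too similar to any kept one.

    `threshold` is an assumed cutoff, not a value from the original post.
    """
    kept, kept_shingles = [], []
    for sample in samples:
        sh = shingles(sample)
        if any(jaccard(sh, other) >= threshold for other in kept_shingles):
            continue  # near-duplicate of something already kept
        kept.append(sample)
        kept_shingles.append(sh)
    return kept
```

The greedy comparison is quadratic in the number of kept samples, so at the scale discussed here an approximate scheme such as MinHash with locality-sensitive hashing would likely be needed instead.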


Describing Text Distribution Differences Between Language Models

By @jxmnop – 08 August 2025 21h21

FUTURE WORK – describing differences: @ZhongRuiqi has some incredible work on methods for describing the difference between two text distributions *in natural language*. we could compare outputs of 20b to the 120b model, or LLAMA, or GPT-5…


GPT-OSS 20B Samples Dataset Released on Hugging Face

By @jxmnop – 08 August 2025 21h21

if you want to try the data, here you go, it's on huggingface: http://huggingface.co/datasets/jxm/gpt-oss20b-samples… let me know what you find!


Model Unicode Skills and Physics Limitations Explored

By @jxmnop – 08 August 2025 21h21

i also learned a lot from this one. the model is *really* good at using unicode… but might be bad at physics. what in the world is a 'superhalo function'?


Understanding Constant Codeswitching in Language Models

By @jxmnop – 08 August 2025 21h21

what are some explanations for constant codeswitching?
1. OpenAI has figured out RL. the models no longer speak english
2. data corruption issues via OCR or synthetic training
3. somehow i forced the model to output too many tokens and they gradually shift out of distribution


AI Models Generate Creative Screenplay Examples

By @jxmnop – 08 August 2025 21h21

there are a small number of creative outputs interspersed throughout. here's one example where the model starts writing a sketch for a norwegian screenplay


OCR Conjecture: Evidence OpenAI Scanned Books for Training

By @jxmnop – 08 August 2025 21h21

the OCR conjecture: some examples include artifacts such as OCRV ROOT, which indicate the training data may have been
reading between the lines: OpenAI is scanning books
(for some reason the model loves mentioning how many deaf people live in Malaysia)


Multilingual Reasoning Chains in Neural Language Models

By @jxmnop – 08 August 2025 21h21

what you can't see from the map is many of the chains start in English but slowly descend into Neuralese. the reasoning chains happily alternate between Arabic, Russian, Thai, Korean, Chinese, and Ukrainian, then usually make their way back to English (but not always)
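One rough way to quantify this kind of alternation is to count how often a chain changes writing system. The sketch below is an assumption-laden illustration, not the analysis used in the post: it infers each letter's script from the first word of its Unicode character name, and the function names are hypothetical. Script detection cannot separate languages that share a script, so Russian and Ukrainian both show up as `CYRILLIC`.

```python
import unicodedata


def script_of(ch: str):
    """Approximate a letter's script by the first word of its Unicode name."""
    if not ch.isalpha():
        return None  # skip digits, punctuation, whitespace, combining marks
    name = unicodedata.name(ch, "")
    return name.split(" ")[0] if name else None


def script_runs(text: str):
    """Collapse the text into consecutive (script, letter_count) runs."""
    runs = []
    for ch in text:
        script = script_of(ch)
        if script is None:
            continue
        if runs and runs[-1][0] == script:
            runs[-1][1] += 1
        else:
            runs.append([script, 1])
    return [(script, count) for script, count in runs]


def count_switches(text: str) -> int:
    """Number of times the script changes from one run to the next."""
    return max(len(script_runs(text)) - 1, 0)


if __name__ == "__main__":
    # A chain that leaves Latin script twice and then returns to it.
    print(script_runs("hello привет 中文 back"))
```

A high switch count with a final non-Latin run would flag chains that never make it back to English.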
