I’m expecting to have more time to travel post-singularity
@jxmnop
-
Token Sampling by Average Frequency with Single Token Prompting
i sample tokens based on average frequency and prompt with 1 token
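
a minimal sketch of what frequency-weighted single-token prompting could look like. this is not the actual sampling code: the frequency table `toy_counts` and the helper `sample_prompt_token` are made up for illustration, and the real setup would use the model's tokenizer and real frequency statistics.

```python
import random

def sample_prompt_token(token_counts, rng):
    """Pick one token with probability proportional to its frequency."""
    tokens = list(token_counts)
    weights = [token_counts[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]

# toy frequency table (hypothetical numbers, not real corpus statistics)
toy_counts = {"the": 50, "def": 20, "import": 20, "Once": 10}

rng = random.Random(0)  # fixed seed so the run is repeatable
prompts = [sample_prompt_token(toy_counts, rng) for _ in range(5)]
# each entry in `prompts` is a one-token prompt to hand to the model
```

frequent tokens get picked more often, so the prompts roughly follow the token distribution instead of being uniform over the vocabulary.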
-
Synthetic Data Generation for AI Model Training and Benchmarks
oh i mean it’s probably synthetic data that was generated to convey skills useful for certain benchmarks. not trying to imply that they’re training on the benchmarks. saw no evidence of that and i doubt it
-
Extracting Training Data Directly from Large Language Models
FUTURE WORK – direct extraction: we're working on directly extracting training data from models using RL and other methods. we'll be presenting our first work on this at COLM, and expect more in this space. we may be able to directly extract data from the 120b model.. one day
-
Deduplicating Redundant AI-Generated Output Data
FUTURE WORK – deduplication: even though i varied the random seed and used temperature, a lot of the outputs are highly redundant. it would be prudent to deduplicate; i bet there are only 100k or fewer mostly-unique examples here
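
a rough sketch of one way the deduplication could be done, assuming near-duplicates are defined by jaccard similarity over word 3-grams. the helper names and the 0.8 threshold are assumptions for illustration, not something that has been run on this data.

```python
import re

def normalize(text):
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return re.sub(r"\s+", " ", text.lower()).strip()

def shingles(text, n=3):
    """Set of word n-grams; short texts fall back to a single shingle."""
    words = normalize(text).split()
    if len(words) < n:
        return {tuple(words)}
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard(a, b):
    """Overlap between two shingle sets, from 0.0 (disjoint) to 1.0 (identical)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def dedupe(samples, threshold=0.8):
    """Greedy pass: keep a sample only if it is not too similar to one already kept."""
    kept, kept_shingles = [], []
    for s in samples:
        sh = shingles(s)
        if all(jaccard(sh, other) < threshold for other in kept_shingles):
            kept.append(s)
            kept_shingles.append(sh)
    return kept
```

the greedy pass compares every sample against everything kept so far, so it is quadratic in the worst case. at the scale of a full samples dump, an approximate method such as MinHash with locality-sensitive hashing would likely be needed instead.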
-
Describing Text Distribution Differences Between Language Models
FUTURE WORK – describing differences: @ZhongRuiqi has some incredible work on methods for describing the difference between two text distributions *in natural language*. we could compare outputs of 20b to the 120b model, or LLAMA, or GPT-5…
-

GPT-OSS 20B Samples Dataset Released on Hugging Face
if you want to try the data, here you go, it's on huggingface: http://huggingface.co/datasets/jxm/gpt-oss20b-samples
let me know what you find!
-

Model Unicode Skills and Physics Limitations Explored
i also learned a lot from this one. the model is *really* good at using unicode… but might be bad at physics. what in the world is a 'superhalo function'?
-
Understanding Constant Codeswitching in Language Models
what are some explanations for constant codeswitching?
1. OpenAI has figured out RL. the models no longer speak english
2. data corruption issues via OCR or synthetic training
3. somehow i forced the model to output too many tokens and they gradually shift out of distribution
-

AI Models Generate Creative Screenplay Examples
there are a small number of creative outputs interspersed throughout. here's one example where the model starts writing a sketch for a norwegian screenplay