the OCR conjecture: some examples include artifacts such as OCRV ROOT, which indicate the training data may have been reading between the lines: OpenAI is scanning books (for some reason the model loves mentioning how many deaf people live in Malaysia)
OCR Conjecture: Evidence OpenAI Scanned Books for Training
By
–
