Since the model cannot explicitly distinguish between synthetic and non-synthetic data during training, the best approach is to tackle the problem at the root: ensuring that the model generating the synthetic data does not produce hallucinated content.
Preventing Hallucinations in Synthetic Data Generation Models