"The paraphrasing can lead to semantic drift, which is the reason why human annotation is necessary in this process." — Hongyuan Adam Lu et al., Figure 1 caption, arXiv:2604.02176
You've probably noticed that asking an AI the same question in two different ways gives two very different answers, and the worse-phrased version isn't always the one you'd expect to fail. Prior research confirmed this variance (Cao et al., 2024) but couldn't explain *which* paraphrase to prefer or why. Without a principled selector, prompt engineering becomes trial-and-error guesswork. The same blind spot applies to fine-tuning: if your training sentences happen to use rare or unusual phrasing, you're paying compute for worse results.
The core idea is that a sentence's 'frequency' — how often its words appear together in natural text — predicts how well an LLM will handle it. The authors approximate sentence-level frequency as the geometric mean of its word-level frequencies, computed from any public corpus (no access to the LLM's training data required). Given two paraphrases with the same meaning, you pick the higher-frequency one. At inference time, a paraphraser (GPT-4o-mini in their experiments) rewrites your input toward common phrasing before passing it to the target LLM. For fine-tuning, Textual Frequency Distillation (TFD) refines the frequency estimate by querying the target model for story completions; Curriculum Textual Frequency Training (CTFT) then fine-tunes by showing rare examples first and common ones last — the opposite of the typical easy-to-hard order.
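The frequency estimator and selection rule above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the toy frequency table, the out-of-vocabulary fallback value, and the whitespace tokenizer are all assumptions; in practice you would use a real unigram wordlist built from a public corpus.

```python
import math

# Toy unigram relative-frequency table -- purely illustrative values.
# A real experiment would load a public wordlist instead.
FREQ = {
    "use": 1e-3, "utilize": 1e-6, "this": 5e-3, "tool": 1e-4,
}
OOV = 1e-8  # fallback for out-of-vocabulary words (an assumed smoothing choice)

def sentence_frequency(sentence: str) -> float:
    """Geometric mean of word-level frequencies (computed in log space)."""
    words = sentence.lower().split()
    log_sum = sum(math.log(FREQ.get(w, OOV)) for w in words)
    return math.exp(log_sum / len(words))

def pick_paraphrase(candidates: list[str]) -> str:
    """Among meaning-equivalent paraphrases, keep the highest-frequency one."""
    return max(candidates, key=sentence_frequency)

def ctft_order(examples: list[str]) -> list[str]:
    """Rare-first curriculum: ascending sentence frequency, common ones last."""
    return sorted(examples, key=sentence_frequency)

print(pick_paraphrase(["utilize this tool", "use this tool"]))
```

With the toy table, "use this tool" wins because every word is more frequent than its counterpart in "utilize this tool"; `ctft_order` then gives the rare-first training order the paper's CTFT recipe calls for.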
If you write prompts professionally — for production RAG pipelines, LLM fine-tuning jobs, or prompt engineering at a product company — this paper gives you a testable, actionable hypothesis you can validate in a day. It's also worth reading if you work on MT or math reasoning evaluation and have noticed inexplicable variance across prompt phrasings. Not useful yet if you need a drop-in production library: modules 3-5 are unfinished, there is no end-to-end run command, and multi-day compute is likely required for full reproduction.
Worth a read and a small experiment if you run prompts at any volume: the zero-retraining paraphraser idea is testable this week with nothing but an LLM API key and a frequency wordlist. The CTFT curriculum ordering is more speculative outside of MT, since it was tested only on a 526-pair translation dataset. Treat the 'law' framing skeptically; opentrain.ai explicitly flags the benchmark signals as too thin for confident reproduction, and Appendix F of the paper acknowledges the theoretical proof is incomplete.