"The paraphrasing can lead to semantic drift, which is the reason why human annotation is necessary in this process." — Hongyuan Adam Lu et al., Figure 1 caption, arXiv:2604.02176
You've probably noticed that asking an AI the same question in two different ways gives two very different answers, and the worse-phrased version isn't always the one you'd expect to fail. Prior research confirmed this variance (Cao et al., 2024) but couldn't explain *which* paraphrase to prefer or why. Without a principled selector, prompt engineering becomes trial-and-error guesswork. The same blind spot applies to fine-tuning: if your training sentences happen to use rare or unusual phrasing, you're paying compute for worse results.
The core idea is that a sentence's 'frequency' — how often its words appear together in natural text — predicts how well an LLM will handle it. The authors approximate sentence-level frequency as the geometric mean of its word-level frequencies, computed from any public corpus (no access to the LLM's training data required). Given two paraphrases with the same meaning, you pick the higher-frequency one. At inference time, a paraphraser (GPT-4o-mini in their experiments) rewrites your input toward common phrasing before passing it to the target LLM. For fine-tuning, Textual Frequency Distillation (TFD) refines the frequency estimate by querying the target model for story completions; Curriculum Textual Frequency Training (CTFT) then fine-tunes by showing rare examples first and common ones last — the opposite of the typical easy-to-hard order.
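The frequency estimator and selection rule above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the toy frequency table, the out-of-vocabulary fallback value, and the whitespace tokenizer are all assumptions; in practice you would use a real unigram wordlist built from a public corpus.

```python
import math

# Toy unigram relative-frequency table -- purely illustrative values.
# A real experiment would load a public wordlist instead.
FREQ = {
    "use": 1e-3, "utilize": 1e-6, "this": 5e-3, "tool": 1e-4,
}
OOV = 1e-8  # fallback for out-of-vocabulary words (an assumed smoothing choice)

def sentence_frequency(sentence: str) -> float:
    """Geometric mean of word-level frequencies (computed in log space)."""
    words = sentence.lower().split()
    log_sum = sum(math.log(FREQ.get(w, OOV)) for w in words)
    return math.exp(log_sum / len(words))

def pick_paraphrase(candidates: list[str]) -> str:
    """Among meaning-equivalent paraphrases, keep the highest-frequency one."""
    return max(candidates, key=sentence_frequency)

def ctft_order(examples: list[str]) -> list[str]:
    """Rare-first curriculum: ascending sentence frequency, common ones last."""
    return sorted(examples, key=sentence_frequency)

print(pick_paraphrase(["utilize this tool", "use this tool"]))
```

With the toy table, "use this tool" wins because every word is more frequent than its counterpart in "utilize this tool"; `ctft_order` then gives the rare-first training order the paper's CTFT recipe calls for.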
If you write prompts professionally — for production RAG pipelines, LLM fine-tuning jobs, or prompt engineering at a product company — this paper gives you a testable, actionable hypothesis you can validate in a day. It's also worth reading if you work on MT or math reasoning evaluation and have noticed inexplicable variance across prompt phrasings. Not useful yet if you need a drop-in production library: modules 3-5 are unfinished, there is no end-to-end run command, and multi-day compute is likely required for full reproduction.
Worth a read and a small experiment if you run prompts at any volume: the zero-retraining paraphraser idea is testable this week with nothing but an LLM API key and a frequency wordlist. The CTFT curriculum ordering is more speculative outside of MT, since it was tested only on a 526-pair translation dataset. Treat the 'law' framing skeptically; opentrain.ai explicitly flags the benchmark signals as too thin for confident reproduction, and Appendix F of the paper acknowledges the theoretical proof is incomplete.