“Frontier LMs are not simply getting better at insight anticipation via scaling.” — Joy He-Yueya et al., GIANTS paper, Stanford University (arXiv:2604.09793)
You know that feeling when you read two papers and immediately see the experiment neither team ran? GIANTS targets exactly that gap: no formal benchmark exists for testing whether an AI model can anticipate a downstream scientific contribution from its two parent papers. Without a held-out test set, researchers building synthesis tools have no standard way to measure how good their models are at this task. Current evaluation setups are either human-judged (expensive and slow) or based on surface-level similarity metrics that miss scientific meaning.
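The task format described above is simple enough to sketch as data. The field names below are illustrative assumptions, not the released dataset's schema:

```python
# Hypothetical shape of one GiantsBench example. Field names are assumed
# for illustration; they are not taken from the released benchmark.
example = {
    "parent_a": "Summary of the first foundational paper.",
    "parent_b": "Summary of the second foundational paper.",
    "target_insight": "One sentence stating the downstream paper's core contribution.",
}

# A model under evaluation maps (parent_a, parent_b) to a predicted
# one-sentence insight, which is then scored against target_insight
# by an LLM judge rather than by surface-level string similarity.
def predict_insight(parent_a: str, parent_b: str) -> str:
    """Placeholder for the model being benchmarked."""
    return "A one-sentence prediction of the downstream contribution."

prediction = predict_insight(example["parent_a"], example["parent_b"])
```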
You give the model two parent-paper summaries — brief descriptions of two foundational papers a researcher might build on. The model outputs one sentence predicting the core contribution of the downstream paper that cited both. GIANTS-4B was built in two stages: supervised fine-tuning on 10,335 labeled examples to teach the model what a good insight looks like, then GRPO reinforcement learning in which Gemini-2.5-Flash scores each generated insight against the real downstream contribution and returns that score as a reward. Think of it like a student who reads past exam answers first (SFT), then takes practice tests where a teacher grades the responses and gives immediate feedback (RL). The ground-truth labels in GiantsBench come from citation graphs: for each pair of parent papers, the downstream paper citing both is located, and Gemini-3-Pro rewrites its contribution into a standalone sentence.
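The GRPO-with-similarity-reward loop can be sketched as follows. This is a minimal sketch, not the paper's code: the judge (Gemini-2.5-Flash in the paper) is stubbed with crude token overlap, and all function names and the 0–10 reward scale are assumptions.

```python
import statistics

def judge_similarity(predicted: str, reference: str) -> float:
    """Stand-in for the LLM judge. In the paper, Gemini-2.5-Flash scores
    a generated insight against the real downstream contribution; here we
    fake that with token overlap on a 0-10 scale."""
    pred_tokens = set(predicted.lower().split())
    ref_tokens = set(reference.lower().split())
    if not ref_tokens:
        return 0.0
    return 10.0 * len(pred_tokens & ref_tokens) / len(ref_tokens)

def grpo_advantages(rewards: list[float]) -> list[float]:
    """GRPO's key move: normalize rewards within a group of samples drawn
    for the same prompt, so advantage = (r - group mean) / group std.
    No separate value network is needed."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # avoid division by zero
    return [(r - mean) / std for r in rewards]

# One GRPO step for a single prompt (a pair of parent-paper summaries):
# sample a group of candidate insights, score each, normalize.
reference = "introduces a held-out benchmark for insight anticipation"
group = [
    "proposes a benchmark for anticipating insights",
    "trains a larger language model",
    "introduces a held-out benchmark for insight anticipation",
]
rewards = [judge_similarity(g, reference) for g in group]
advantages = grpo_advantages(rewards)
# The policy gradient then up-weights samples with positive advantage
# and down-weights the rest.
```

The group-relative normalization is what makes a raw 0–10 judge score usable as a reward: only how a sample compares to its siblings for the same prompt matters, not the judge's absolute calibration.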
If you're an NLP or AI-for-science researcher evaluating multi-document synthesis models, GiantsBench is immediately usable without touching the training pipeline. If you're studying RL fine-tuning for specialized text generation, the GRPO-with-similarity-reward approach is a concrete reproducible recipe. This is not useful yet if you need a deployed research assistant: the oracle-parent assumption means you still have to hand-pick the two input papers, which is the hard part of real research workflows.
Worth exploring now if you are benchmarking multi-document synthesis or studying RL fine-tuning for generation tasks — GiantsBench is the only published held-out benchmark for this specific prediction task, and the GRPO recipe is clean and reproducible. Not worth adopting as a production component: the oracle-parent assumption removes the hardest real-world step, ground truth is Gemini-generated rather than human-validated, and an absolute score of 5.97/10 shows the task remains hard for all systems including this one. The benchmark is the durable contribution; the model is a proof of concept.