R&D advanced 3 min read May 2, 2026

4B model beats Gemini-3-Pro at predicting research insights

“A 4B open model scores 5.97 vs Gemini-3-Pro's 4.43 at predicting scientific breakthroughs — and the benchmark dataset hit 17,800 downloads in 3 weeks.”

Source · giants-insights.github.io

“"Frontier LMs are not simply getting better at insight anticipation via scaling." — Joy He-Yueya et al., GIANTS paper, Stanford University (arXiv:2604.09793)”

You know that feeling when you read two papers and immediately see the experiment neither team ran? GIANTS targets exactly that gap: no formal benchmark exists for testing whether an AI model can anticipate a downstream scientific contribution from its two parent papers. Without a held-out test set, researchers building synthesis tools have no standard way to measure how good their models are at this task. Current evaluation setups are either human-judged (expensive and slow) or based on surface-level similarity metrics that miss scientific meaning.

ai-for-science · research-paper · nlp · reinforcement-learning · benchmark · scientific-discovery · llm

You give the model two parent-paper summaries — brief descriptions of two foundational papers a researcher might build on. The model outputs one sentence predicting the core contribution of the downstream paper that cited both. GIANTS-4B was built in two stages: supervised fine-tuning on 10,335 labeled examples to teach the model what a good insight looks like, then GRPO reinforcement learning where Gemini-2.5-Flash scores each generated insight against the real downstream contribution and returns that score as a reward. Think of it like a student who reads past exam answers first (SFT), then takes practice tests where a teacher grades the responses and gives immediate feedback (RL). The ground-truth labels in GiantsBench come from citation graphs: for each downstream paper that cites both parents, Gemini-3-Pro rewrites its stated contribution into a standalone target sentence.
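To make the RL stage concrete, here is a minimal sketch of a GRPO-style similarity reward, under stated assumptions: the real system calls Gemini-2.5-Flash as the judge, so `judge_score` below is a hypothetical offline stand-in using token overlap on a 0-10 scale, and `grpo_advantages` shows only the group-relative reward normalization step of GRPO, not the full policy update.

```python
import statistics

def judge_score(predicted: str, reference: str) -> float:
    """Stand-in for the LLM judge: crude token-overlap similarity on a 0-10 scale.
    The paper's pipeline uses Gemini-2.5-Flash here instead."""
    pred = set(predicted.lower().split())
    ref = set(reference.lower().split())
    if not pred or not ref:
        return 0.0
    return 10.0 * len(pred & ref) / len(pred | ref)

def grpo_advantages(rewards: list[float]) -> list[float]:
    """GRPO normalizes rewards within each sampled group: subtract the group
    mean and divide by the group std, so a candidate's advantage reflects how
    it ranks against its sibling samples for the same prompt."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0
    return [(r - mean) / std for r in rewards]

# One step for a single prompt: sample several insights, score each against
# the ground-truth downstream contribution, convert scores to advantages.
reference = "a held-out benchmark for predicting downstream paper contributions"
candidates = [
    "a benchmark for predicting downstream contributions from parent papers",
    "a survey of citation graphs",
]
rewards = [judge_score(c, reference) for c in candidates]
advantages = grpo_advantages(rewards)
```

The group-relative normalization is what lets GRPO skip a learned value model: each sampled insight is judged only against its siblings, so a noisy absolute judge score still yields a usable ranking signal.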

01. GiantsBench (17,839 examples): gives you a held-out benchmark to score any multi-document synthesis model against consistent ground truth — no annotation effort required on your end
02. GRPO RL with similarity reward: shows that a 4B open-weight model trained with this recipe outperforms Gemini-3-Pro on the task, giving you a reproducible fine-tuning template for specialized generation tasks
03. Three-judge evaluation (Gemini-3-Pro, Qwen3-14B, SciJudge-30B): cross-validates results across three independent judges so the reported 5.97/10 score is not an artifact of a single biased evaluator
04. Best@k inference scaling: lets you sample k insight candidates at inference time and select the top-scored one, trading compute for higher output quality without any retraining
05. Temporal train/test split: test examples post-date training data (pre-July 2023 train, post-July 2023 test), so the benchmark measures generalization rather than pattern matching on seen data
06. Apache-2.0 license: lets you use the code and model in commercial products or derivative research without restriction
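The Best@k knob from the list above amounts to: sample k candidates, score each, keep the winner. A minimal sketch, assuming hypothetical `generate` and `score` stand-ins for the model's sampler and the LLM judge (neither name is from the paper):

```python
import random

def generate(prompt: str, seed: int) -> str:
    """Hypothetical stand-in for sampling one insight from the model."""
    rng = random.Random(seed)  # deterministic per seed, for reproducibility
    return f"candidate insight {rng.randint(0, 999)} for: {prompt}"

def score(insight: str) -> float:
    """Hypothetical stand-in for the judge's 0-10 quality score."""
    return float(len(insight) % 10)

def best_at_k(prompt: str, k: int) -> str:
    """Best@k: spend k forward passes to keep the single highest-scored
    insight — compute traded for quality, no retraining needed."""
    candidates = [generate(prompt, seed=i) for i in range(k)]
    return max(candidates, key=score)

pick = best_at_k("parent papers A + B", k=8)
```

Note the tradeoff this design bakes in: quality gains depend entirely on the judge's ranking being meaningful, so a biased scorer turns Best@k into selecting for the bias — one reason the paper cross-validates with three judges.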
Who it’s for

If you're an NLP or AI-for-science researcher evaluating multi-document synthesis models, GiantsBench is immediately usable without touching the training pipeline. If you're studying RL fine-tuning for specialized text generation, the GRPO-with-similarity-reward approach is a concrete reproducible recipe. This is not useful yet if you need a deployed research assistant: the oracle-parent assumption means you still have to hand-pick the two input papers, which is the hard part of real research workflows.

Worth exploring

Worth exploring now if you are benchmarking multi-document synthesis or studying RL fine-tuning for generation tasks — GiantsBench is the only published held-out benchmark for this specific prediction task, and the GRPO recipe is clean and reproducible. Not worth adopting as a production component: the oracle-parent assumption removes the hardest real-world step, ground truth is Gemini-generated rather than human-validated, and an absolute score of 5.97/10 shows the task remains hard for all systems including this one. The benchmark is the durable contribution; the model is a proof of concept.
