“"Any long-term task performed by a single agent is unreliable." — Ruofeng Yang, Yongcan Li, Shuai Li (ARIS paper §1, arxiv.org/abs/2605.03042)”
You know that feeling when an AI produces a confident, well-structured result that turns out to be wrong in ways you can't see until something downstream breaks? In autonomous ML research pipelines, this failure compounds quietly: the same model writes the experiment, interprets the results, and drafts the paper — so its errors are internally consistent and pass the review step because the reviewer shares the same blind spots. Existing systems like AI Scientist use the same model family for both generation and review, meaning the reviewer inherits the same biases as the writer. The failure mode isn't a crash — it's a polished artifact where the numbers look plausible but the underlying evidence doesn't support them.
ARIS splits every research task between an executor (the AI doing the work) and a reviewer (an AI from a completely different provider that reads only the finished artifacts, never the executor's reasoning). By default, Claude Code executes — writing code, running experiments, drafting LaTeX — while GPT-5.4 reviews from a fresh context thread with no prior conversation history. The loop runs up to 4 rounds or until the reviewer score tops 6/10; each round the executor addresses the reviewer's action items and can run new GPU experiments if the reviewer asks for more evidence. Three audit stages run in sequence: first checking that experimental code produced the numbers it claims, then mapping each result to a supported/partial/invalidated verdict, then having a zero-context fresh reviewer cross-check every quantitative statement in the manuscript against raw result files and the claim ledger.
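To make the control flow concrete, here is a minimal Python sketch of that loop under stated assumptions: the function names, the `Review` class, and the audit-stage labels are hypothetical stand-ins rather than ARIS's actual API, and only the round cap, the 6/10 score gate, and the three audit stages come from the description above.

```python
from dataclasses import dataclass, field

MAX_ROUNDS = 4   # the loop runs at most 4 rounds
SCORE_GATE = 6   # iterate until the reviewer score tops 6/10

@dataclass
class Review:
    score: int
    action_items: list = field(default_factory=list)

def run_executor(task, action_items=None):
    """Stand-in for the executor (e.g. Claude Code): writes code, runs
    experiments, drafts LaTeX, and reworks anything the reviewer flagged."""
    return {"task": task, "addressed": list(action_items or [])}

def review_artifacts(artifacts):
    """Stand-in for the cross-provider reviewer: a fresh context that sees
    only the finished artifacts, never the executor's reasoning."""
    return Review(score=7)  # placeholder verdict for this sketch

def research_loop(task):
    artifacts = run_executor(task)
    for _ in range(MAX_ROUNDS):
        review = review_artifacts(artifacts)
        if review.score > SCORE_GATE:
            break
        # The executor addresses the action items and may run new GPU
        # experiments if the reviewer asked for more evidence.
        artifacts = run_executor(task, review.action_items)

    # Three audit stages run in sequence once the loop stops:
    # 1) check that the experimental code produced the numbers it claims,
    # 2) map each result to a supported / partial / invalidated verdict,
    # 3) have a zero-context reviewer cross-check every quantitative
    #    statement in the manuscript against raw results and the ledger.
    audit_stages = [
        "code_reproduces_claimed_numbers",
        "claim_ledger_verdicts",
        "zero_context_manuscript_check",
    ]
    return artifacts, audit_stages

if __name__ == "__main__":
    print(research_loop("toy experiment"))
```

The key design choice the sketch highlights is separation of information, not just of models: the reviewer never sees the executor's chain of reasoning, only its artifacts, so agreement has to come from the evidence itself.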
If you run multi-day ML experiments and write papers about them, ARIS gives you a structured way to have an AI from a different provider challenge every claim before submission, specifically targeting the failure mode where a same-model review loop cannot catch its own errors. It is also useful if you have ever submitted a paper only to have reviewers find numeric inconsistencies your collaborators also missed. It is not yet suitable if you need guaranteed correctness (the audit stack is explicitly advisory, not formal verification) or if your code is confidential (repository-level review means handing your repository to a second model provider).
Worth exploring if you run multi-session ML research pipelines and want a documented claim-audit layer rather than a single-pass self-review. The 8.8k GitHub stars and v0.4.4 release signal real community traction, but the paper's authors explicitly flag the absence of controlled evaluation as the primary limitation: the claim that cross-family review outperforms same-family review has not yet been tested in a compute-matched benchmark. Treat it as a well-designed beta-stage system, not a guarantee of research quality.