GitHub Repos · advanced · 3 min read · May 5, 2026

Agent skills that evolve collectively: 88% gain in 6 rounds

“One task went 28.3% → 100% after six rounds of skill evolution driven by eight users sharing sessions — no labels, no manual rewrites.”

Source · huggingface.co

“AI agent skills that evolve from every real interaction — just talk. Across sessions, agents, devices, and users.” — SkillClaw README (source: raw.githubusercontent.com/AMAP-ML/SkillClaw/main/README.md, verified 2026-05-04)

You know the feeling: you ship an agent with a polished skill library, and two weeks later users keep hitting the same broken workflow. Fixing it means someone manually reading session logs and rewriting SKILL.md files by hand. Every user who hits the failure rediscovers it independently, and the knowledge evaporates when they close the tab. The paper's abstract states it directly: "similar workflows, tool usage patterns, and failure modes are repeatedly rediscovered across users, preventing the system from improving with experience." The problem compounds in multi-user deployments, where different users hold complementary signals — one user knows when a skill fails, another knows a workaround — but nothing aggregates those signals into an update.

llm-agents · skill-evolution · multi-agent · open-source · python · research-paper · agentic-ai

A lightweight proxy sits between your agent and its LLM API, intercepting every call to /v1/chat/completions without changing your agent's behavior — you reroute to port 30000 and logging starts transparently. Every session gets written to shared storage (S3, OSS, or local). Each night, an Evolve Server reads those logs and runs an Autonomous Evolver — itself an LLM — that scans for recurring patterns across users. For each candidate change, the Evolver picks exactly one of three actions: Refine the existing skill, Create a new skill for an uncovered subprocess, or Skip when the evidence is too thin. Candidate updates then race against the current skill in a live validation environment: only the version that performs better gets deployed, giving you a monotonic improvement guarantee by construction.
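The validation step at the end of that pipeline can be sketched as a simple gate. This is an illustrative sketch under assumptions, not SkillClaw's actual code: `deploy_gate` and its score inputs are hypothetical names, and the real system races the two skill versions against live tasks rather than comparing precomputed scores.

```python
def deploy_gate(current_score: float, candidate_score: float) -> str:
    """Ship the candidate skill only if it strictly beats the current one.

    Because a tie or a loss keeps the incumbent, deployed skill quality
    can never regress: the monotonic-improvement guarantee is structural,
    not statistical.
    """
    return "candidate" if candidate_score > current_score else "current"


# The nightly evolve pass then reduces to: score both versions in the
# validation environment and keep whichever one the gate picks.
winner = deploy_gate(current_score=0.62, candidate_score=0.81)
print(winner)  # "candidate" wins here and would be deployed overnight
```

The point of the gate is that the Evolver's LLM can be wrong cheaply: a bad candidate simply loses the race and never reaches users.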

1. Zero-config session capture — the client proxy intercepts your existing agent API calls with no code changes to your agent; reroute to port 30000 and logging starts automatically across every connected user
2. Three-decision Evolver (Refine / Create / Skip) — the LLM evolver applies a conservative Skip when session evidence is thin, which prevents skill degradation in low-traffic or ambiguous situations
3. Nighttime validation gate — every candidate update competes against the current skill version in a live idle-environment test before deployment; only the winner ships, giving you a monotonic improvement guarantee
4. Cross-user knowledge transfer — a fix contributed by one user's session propagates to all users in the shared group overnight without any of them needing to take action
5. Dual engine modes — run the Evolve Server in workflow mode (fixed 3-stage pipeline: Summarize → Aggregate → Execute) for predictable, debuggable behavior, or agent mode (OpenClaw-driven) for open-ended reasoning on complex skill updates
6. Native integration with 10+ agent platforms — Hermes, Codex, Claude Code, OpenClaw, QwenPaw, and IronClaw are all auto-configured by a single `skillclaw setup` command
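The Refine / Create / Skip decision described above can be caricatured as a decision rule. In SkillClaw the choice is made by the LLM evolver itself, so the function name and the evidence threshold below are purely illustrative assumptions about its conservative structure:

```python
def choose_action(n_supporting_sessions: int,
                  matches_existing_skill: bool,
                  min_evidence: int = 3) -> str:
    """Toy stand-in for the LLM evolver's three-way decision.

    Skip when evidence is thin (this is what prevents degradation in
    low-traffic or ambiguous settings); otherwise Refine a skill that
    already covers the pattern, or Create one for an uncovered subprocess.
    """
    if n_supporting_sessions < min_evidence:
        return "Skip"
    return "Refine" if matches_existing_skill else "Create"
```

The conservative default is the interesting part: with only one or two supporting sessions the evolver declines to act, which is what keeps a single noisy session from eroding a working skill.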
Who it’s for

If you run LLM agents in production on OpenClaw, Hermes, or any OpenAI-compatible API and have spent time manually patching skill files after watching users hit the same failures repeatedly, SkillClaw is built for exactly that situation. It is also immediately relevant for AI platform teams deploying shared agents across dozens of engineers who want collective improvement without building a labeling pipeline. It is not ready for you yet if you need multi-GPU or fully on-device inference — the Evolve Server requires API access to a frontier model, and the paper only validates behavior at 8-user, 6-day scale.

Worth exploring

Worth a pilot if you are already on an OpenClaw-compatible platform with more than a handful of active users — the client proxy installs in 30 minutes and adds no visible latency to existing setups. Be cautious about the benchmark numbers: the 88.41% relative improvement in Creative Synthesis comes from a 6-day experiment with 8 users on WildClawBench, a benchmark built by InternLM (an Alibaba-affiliated team in the same ecosystem as the paper's authors), which the paper does not flag. The authors themselves describe the work as being at a "small-scale testing stage." Run your own before/after comparison on internal tasks before treating the published numbers as guarantees.
