GitHub Repos · advanced · 3 min read · May 5, 2026

Agent skills that evolve collectively: 88% gain in 6 rounds

“One task went 28.3% → 100% after six rounds of skill evolution driven by eight users sharing sessions — no labels, no manual rewrites.”

Source · huggingface.co

“AI agent skills that evolve from every real interaction — just talk. Across sessions, agents, devices, and users.” — SkillClaw README (source: raw.githubusercontent.com/AMAP-ML/SkillClaw/main/README.md, verified 2026-05-04)

You know the feeling: you ship an agent with a polished skill library, and two weeks later users keep hitting the same broken workflow. Fixing it means someone manually reading session logs and rewriting SKILL.md files by hand. Every user who hits the failure rediscovers it independently, and the knowledge evaporates when they close the tab. The paper's abstract states it directly: "similar workflows, tool usage patterns, and failure modes are repeatedly rediscovered across users, preventing the system from improving with experience." The problem compounds in multi-user deployments, where different users hold complementary signals — one user knows when a skill fails, another knows a workaround — but nothing aggregates those signals into an update.

llm-agents · skill-evolution · multi-agent · open-source · python · research-paper · agentic-ai

A lightweight proxy sits between your agent and its LLM API, intercepting every call to /v1/chat/completions without changing your agent's behavior — you reroute to port 30000 and logging starts transparently. Every session gets written to shared storage (S3, OSS, or local). Each night, an Evolve Server reads those logs and runs an Autonomous Evolver — itself an LLM — that scans for recurring patterns across users. For each candidate change, the Evolver picks exactly one of three actions: Refine the existing skill, Create a new skill for an uncovered subprocess, or Skip when the evidence is too thin. Candidate updates then race against the current skill in a live validation environment: only the version that performs better gets deployed, giving you a monotonic improvement guarantee by construction.
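The validation step at the end of that pipeline can be sketched as a simple gate. This is an illustrative sketch under assumptions, not SkillClaw's actual code: `deploy_gate` and its score inputs are hypothetical names, and the real system races the two skill versions against live tasks rather than comparing precomputed scores.

```python
def deploy_gate(current_score: float, candidate_score: float) -> str:
    """Ship the candidate skill only if it strictly beats the current one.

    Because a tie or a loss keeps the incumbent, deployed skill quality
    can never regress: the monotonic-improvement guarantee is structural,
    not statistical.
    """
    return "candidate" if candidate_score > current_score else "current"


# The nightly evolve pass then reduces to: score both versions in the
# validation environment and keep whichever one the gate picks.
winner = deploy_gate(current_score=0.62, candidate_score=0.81)
print(winner)  # "candidate" wins here and would be deployed overnight
```

The point of the gate is that the Evolver's LLM can be wrong cheaply: a bad candidate simply loses the race and never reaches users.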

1. Zero-config session capture — the client proxy intercepts your existing agent API calls with no code changes to your agent; reroute to port 30000 and logging starts automatically across every connected user
2. Three-decision Evolver (Refine / Create / Skip) — the LLM evolver applies a conservative Skip when session evidence is thin, which prevents skill degradation in low-traffic or ambiguous situations
3. Nighttime validation gate — every candidate update competes against the current skill version in a live idle-environment test before deployment; only the winner ships, giving you a monotonic improvement guarantee
4. Cross-user knowledge transfer — a fix contributed by one user's session propagates to all users in the shared group overnight without any of them needing to take action
5. Dual engine modes — run the Evolve Server in workflow mode (fixed 3-stage pipeline: Summarize → Aggregate → Execute) for predictable, debuggable behavior, or agent mode (OpenClaw-driven) for open-ended reasoning on complex skill updates
6. Native integration with 10+ agent platforms — Hermes, Codex, Claude Code, OpenClaw, QwenPaw, and IronClaw are all auto-configured by a single `skillclaw setup` command
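The Refine / Create / Skip decision described above can be caricatured as a decision rule. In SkillClaw the choice is made by the LLM evolver itself, so the function name and the evidence threshold below are purely illustrative assumptions about its conservative structure:

```python
def choose_action(n_supporting_sessions: int,
                  matches_existing_skill: bool,
                  min_evidence: int = 3) -> str:
    """Toy stand-in for the LLM evolver's three-way decision.

    Skip when evidence is thin (this is what prevents degradation in
    low-traffic or ambiguous settings); otherwise Refine a skill that
    already covers the pattern, or Create one for an uncovered subprocess.
    """
    if n_supporting_sessions < min_evidence:
        return "Skip"
    return "Refine" if matches_existing_skill else "Create"
```

The conservative default is the interesting part: with only one or two supporting sessions the evolver declines to act, which is what keeps a single noisy session from eroding a working skill.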
Who it’s for

If you run LLM agents in production on OpenClaw, Hermes, or any OpenAI-compatible API and have spent time manually patching skill files after watching users hit the same failures repeatedly, SkillClaw is built for exactly that situation. It is also immediately relevant for AI platform teams deploying shared agents across dozens of engineers who want collective improvement without building a labeling pipeline. It is not ready for you yet if you need multi-GPU or fully on-device inference — the Evolve Server requires API access to a frontier model, and the paper only validates behavior at 8-user, 6-day scale.

Worth exploring

Worth a pilot if you are already on an OpenClaw-compatible platform with more than a handful of active users — the client proxy installs in 30 minutes and adds no visible latency to existing setups. Be cautious about the benchmark numbers: the 88.41% relative improvement in Creative Synthesis comes from a 6-day experiment with 8 users on WildClawBench, a benchmark built by InternLM (an Alibaba-affiliated team in the same ecosystem as the paper's authors), which the paper does not flag. The authors themselves describe the work as being at a "small-scale testing stage." Run your own before/after comparison on internal tasks before treating the published numbers as guarantees.
