“"AI agent skills that evolve from every real interaction — just talk. Across sessions, agents, devices, and users." — SkillClaw README (source: raw.githubusercontent.com/AMAP-ML/SkillClaw/main/README.md, verified 2026-05-04)”
You know the feeling: you ship an agent with a polished skill library, and two weeks later users keep hitting the same broken workflow, while fixing it means someone manually reading session logs and rewriting SKILL.md files by hand. Every user who hits the failure rediscovers it independently, and the knowledge evaporates when they close the tab. The paper's abstract states it directly: 'similar workflows, tool usage patterns, and failure modes are repeatedly rediscovered across users, preventing the system from improving with experience.' The problem compounds in multi-user deployments, where different users hold complementary signals (one knows when a skill fails, another knows a workaround) but nothing aggregates those signals into an update.
A lightweight proxy sits between your agent and its LLM API, intercepting every call to /v1/chat/completions without changing your agent's behavior — you reroute to port 30000 and logging starts transparently. Every session gets written to shared storage (S3, OSS, or local). Each night, an Evolve Server reads those logs and runs an Autonomous Evolver — itself an LLM — that scans for recurring patterns across users. For each candidate change, the Evolver picks exactly one of three actions: Refine the existing skill, Create a new skill for an uncovered subprocess, or Skip when the evidence is too thin. Candidate updates then race against the current skill in a live validation environment: only the version that performs better gets deployed, giving you a monotonic improvement guarantee by construction.
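The client-side change really is that small: your agent keeps speaking the OpenAI wire protocol, and you only repoint its base URL at the proxy. A minimal sketch (the localhost host and `/v1` path here are assumptions; only the port number comes from the setup described above):

```python
from openai import OpenAI

# Hypothetical reroute: the SkillClaw proxy listens on port 30000,
# forwards /v1/chat/completions upstream, and logs the session as a
# side effect. The agent code itself does not change.
client = OpenAI(base_url="http://localhost:30000/v1", api_key="YOUR_KEY")
```

The nightly server side can be pictured as a loop over pooled logs with a three-way decision and a deploy gate. This is a hedged sketch of the control flow as described above, not SkillClaw's actual implementation; `evolver.propose`, `validate`, and the candidate fields are hypothetical names:

```python
import json
from enum import Enum
from pathlib import Path

class Action(Enum):
    REFINE = "refine"  # patch an existing skill
    CREATE = "create"  # new skill for an uncovered subprocess
    SKIP = "skip"      # evidence too thin; change nothing

def nightly_evolve(log_dir: Path, skills: dict, evolver, validate) -> dict:
    """One evolve pass: scan pooled session logs for recurring patterns,
    let the Evolver LLM pick exactly one action per candidate, and deploy
    a change only if it beats the incumbent in live validation."""
    sessions = [json.loads(p.read_text()) for p in log_dir.glob("*.json")]
    for cand in evolver.propose(sessions, skills):  # hypothetical API
        if cand.action is Action.SKIP:
            continue  # not enough corroborating sessions
        # Race the candidate against the incumbent on live validation tasks.
        if validate(cand.skill_md) > validate(skills.get(cand.name, "")):
            skills[cand.name] = cand.skill_md  # strict win: deploy
        # Otherwise keep the current skill, so quality never regresses.
    return skills
```

The strict greater-than comparison is what makes the improvement monotonic by construction: a tie or a loss leaves the incumbent in place.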
If you run LLM agents in production on OpenClaw, Hermes, or any OpenAI-compatible API and have spent time manually patching skill files after watching users hit the same failures repeatedly, SkillClaw is built for exactly that situation. It is also immediately relevant for AI platform teams deploying shared agents across dozens of engineers who want collective improvement without building a labeling pipeline. It is not ready for you yet if you need multi-GPU or fully on-device inference: the Evolve Server requires API access to a frontier model, and the paper only validates behavior at 8-user, 6-day scale.
Worth a pilot if you are already on an OpenClaw-compatible platform with more than a handful of active users: the client proxy installs in 30 minutes and adds no visible latency to existing setups. Be cautious about the benchmark numbers: the 88.41% relative improvement in Creative Synthesis comes from a 6-day experiment with 8 users on WildClawBench, a benchmark built by InternLM (an Alibaba-affiliated team in the same ecosystem as the paper's authors), a relationship the paper does not flag. The authors themselves describe the work as being at a 'small-scale testing stage.' Run your own before/after comparison on internal tasks before treating the published numbers as guarantees.
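A minimal sketch of that before/after check, assuming you already have a fixed list of internal tasks and your own `run_task(task) -> bool` harness (both hypothetical names; nothing here comes from SkillClaw):

```python
import statistics

def pass_rate(tasks, run_task) -> float:
    """Fraction of internal tasks the agent completes successfully."""
    return statistics.mean(1.0 if run_task(t) else 0.0 for t in tasks)

# Hypothetical workflow:
# 1. baseline = pass_rate(internal_tasks, run_task)  # skills frozen
# 2. enable SkillClaw and let it evolve on real traffic for a week
# 3. evolved = pass_rate(internal_tasks, run_task)   # same task list
# print(f"relative improvement: {(evolved - baseline) / baseline:.1%}")
```

Use the same task list on both sides; if the relative gain on your own workload is nowhere near the published figures, trust your number.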