"Love the idea. It saved me a ton of time updating my vector embeddings" — renning22 (Hacker News, https://news.ycombinator.com/item?id=43772582)
You know that feeling when your nightly batch job re-embeds your entire document corpus at 3am, but a critical file was updated at 3:01am and your AI assistant won't know until tomorrow night? Standard ETL pipelines treat every update as "reprocess everything": there is no row-level concept of what actually changed. Worse, when you update your chunking strategy or swap embedding models, you manually invalidate the entire index and wait hours for a full re-run. CocoIndex replaces this model by computing a per-row fingerprint derived from the source bytes and the transformation code's version, and skipping execution entirely for rows whose fingerprint is unchanged.
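The skip logic can be sketched in a few lines of plain Python. This is an illustration of the mechanism, not CocoIndex's API: the names `fingerprint` and `rows_to_process` are hypothetical, and the exact way CocoIndex combines the two hashes is not documented here.

```python
import hashlib

def fingerprint(source_bytes: bytes, code_version: str) -> str:
    """Derive a per-row fingerprint from the row's content and the
    transformation code's version, so a change to either invalidates it."""
    h = hashlib.sha256()
    h.update(source_bytes)
    h.update(code_version.encode())
    return h.hexdigest()

def rows_to_process(rows: dict[str, bytes], seen: dict[str, str],
                    code_version: str) -> list[str]:
    """Return only the row ids whose fingerprint differs from the one
    recorded on the previous run; everything else is skipped."""
    return [
        row_id for row_id, data in rows.items()
        if seen.get(row_id) != fingerprint(data, code_version)
    ]
```

Note the two invalidation paths this gives you for free: editing one source row re-queues only that row, while bumping the code version re-queues every row, with no manual cache flush in either case.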
You define your transformation as a standard Python async function decorated with @coco.fn; CocoIndex runs it once per input row and stores the output against a cache key derived from that same fingerprint of source bytes and transformation code version. On every subsequent run, it checks each row's key against the cache: unchanged rows skip execution entirely, and their cached outputs flow directly to your target store. When you update a transformation function, CocoIndex walks the lineage graph forward from that function and re-runs only the rows whose pipeline passes through it. The Rust engine handles parallelization, retries, dead-letter queues, and failure isolation; you write plain Python without touching thread management.
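The forward walk over the lineage graph is a plain reachability traversal. Here is a minimal sketch with a hypothetical four-step pipeline; the graph shape and the name `affected_steps` are illustrative assumptions, not CocoIndex's internal representation:

```python
from collections import deque

def affected_steps(edges: dict[str, list[str]], changed: str) -> set[str]:
    """Breadth-first walk forward from a changed transformation,
    collecting every downstream step that must re-run."""
    out, queue = {changed}, deque([changed])
    while queue:
        node = queue.popleft()
        for child in edges.get(node, []):
            if child not in out:
                out.add(child)
                queue.append(child)
    return out

# Hypothetical pipeline: parse -> chunk -> embed -> export
pipeline = {"parse": ["chunk"], "chunk": ["embed"], "embed": ["export"]}
```

So editing the chunking step invalidates chunking, embedding, and export, but never the upstream parse results: `affected_steps(pipeline, "chunk")` returns `{"chunk", "embed", "export"}`.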
If you build RAG pipelines, semantic search backends, or AI agent infrastructure, and your current approach reprocesses the entire corpus on every update cycle (whether on a schedule or on every code change), CocoIndex targets your exact problem. You need to be comfortable with Python's async/await, and you must provision a persistent backing database (Postgres is what the official examples show) for the control-plane state. CocoIndex is not the right fit if you need a built-in query or retrieval layer; the team removed the built-in query handler in recent releases, so you bring your own retrieval logic.
CocoIndex hit v1.0.3 on May 5, 2026 — its first stable 1.x release after 14 months of 0.1.x iterations — and GitHub selected it alongside Pandas and Apache Airflow for its Secure Open Source Fund in February 2026. With 9,446 stars, 67 contributors, and daily pushes as of May 10, 2026, the maintenance signals are strong. Run a POC against your actual corpus before committing: the persistent control plane adds real operational overhead, the async-only API raises the Python skill floor, and the '10×' savings figure from the README carries no independent benchmark backing.