GitHub Repos · Intermediate · 3 min read · May 10, 2026 · Updated May 11, 2026

CocoIndex: Re-index Only Changed Rows, Skip 99.9% of Re-runs

“Your RAG pipeline re-processes 100% of your corpus on every run — CocoIndex cuts that to the 0.1% that actually changed, and it just hit v1.0 stable.”

Source · github.com

“Love the idea. it saved me a ton of time updating my vector embeddings” — renning22 (Hacker News, https://news.ycombinator.com/item?id=43772582)

You know that feeling when your nightly batch job re-embeds your entire document corpus at 3am, but a critical file was updated at 3:01am and your AI assistant won't know until tomorrow night? Standard ETL pipelines treat every update as "reprocess everything"; there is no row-level concept of what actually changed. Worse, when you update your chunking strategy or swap embedding models, you manually invalidate the entire index and wait hours for a full re-run. CocoIndex replaces this by computing a per-row cache key from hash(source_bytes) and hash(code_version), and skipping execution entirely for rows whose key is unchanged.

ai · etl · python · rust · rag · open-source · data-pipeline

You define your transformation as a standard Python async function decorated with @coco.fn; CocoIndex runs it once per input row and stores the output against a cache key built from hash(source_bytes) XOR hash(transformation_code_version). On every subsequent run, it checks each row's key against the cache: unchanged rows skip execution entirely and their cached outputs flow directly to your target store. When you update a transformation function, CocoIndex walks the lineage graph forward from that function and re-runs only the rows whose pipeline passes through it. The Rust engine handles parallelization, retries, dead-letter queues, and failure isolation; you write plain Python without touching thread management.

01 Row-level memoization — when 10 rows change in a 10,000-row corpus, only those 10 rows re-execute; you skip paying to re-embed the other 9,990.
02 Code-version cache invalidation — updating a chunking function or swapping an embedding model triggers re-runs only for rows that pass through that function; rows untouched by the change stay cached even across code deploys.
03 Rust engine, Python API — you write standard async Python functions; the Rust core handles parallelization, retries, dead-letter queues, and failure isolation without any configuration from you.
04 Full data lineage tracking — every output chunk traces back to its exact source byte, giving you a complete audit trail of why your index contains what it contains.
05 S3 and local filesystem connectors — ingest from local folders or Amazon S3 with built-in incremental change detection; no manual diffing required.
06 Multi-target write support — define one pipeline and write to vector databases, relational databases, graph databases, or feature stores using the same declarative API.
07 CocoInsight observability dashboard — real-time view of index freshness and cache hit rates per pipeline run without instrumenting your own metrics.
Who it’s for

If you build RAG pipelines, semantic search backends, or AI agent infrastructure and your current approach re-processes the entire corpus on every update cycle — by schedule or on code change — CocoIndex targets your exact problem. You need comfortable Python async/await skills and must provision a persistent backing database (Postgres is shown in the official examples) for the control plane state. CocoIndex is not the right fit if you need a built-in query or retrieval layer; the team removed the built-in query handler in recent releases, so you bring your own retrieval logic.
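The async skill floor is modest in practice: a transformation is just a coroutine, and fanning out over rows is ordinary asyncio code. The names below (`embed`, `run_all`) are assumptions for illustration, not CocoIndex's API; the engine would do this scheduling for you.

```python
import asyncio

async def embed(text: str) -> list:
    # Stand-in for an async embedding call (e.g., an HTTP request).
    await asyncio.sleep(0)
    return [float(len(text))]

async def run_all(rows: list) -> list:
    # asyncio.gather is the plain-Python equivalent of the engine
    # running one coroutine per row concurrently.
    return await asyncio.gather(*(embed(r) for r in rows))
```

If you can write and `await` functions like these, you clear the bar; the harder operational lift is the Postgres control plane, not the Python.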

Worth exploring

CocoIndex hit v1.0.3 on May 5, 2026 — its first stable 1.x release after 14 months of 0.1.x iterations — and GitHub selected it alongside Pandas and Apache Airflow for its Secure Open Source Fund in February 2026. With 9,446 stars, 67 contributors, and daily pushes as of May 10, 2026, the maintenance signals are strong. Run a POC against your actual corpus before committing: the persistent control plane adds real operational overhead, the async-only API raises the Python skill floor, and the "10×" savings figure in the README is the project's own claim, with no independent benchmark behind it.
