GitHub Repos · Intermediate · 3 min read · May 10, 2026 · Updated May 11, 2026

CocoIndex: Re-index Only Changed Rows, Skip 99.9% of Re-runs

“Your RAG pipeline re-processes 100% of your corpus on every run — CocoIndex cuts that to the 0.1% that actually changed, and it just hit v1.0 stable.”

Source · github.com

“Love the idea. it saved me a ton of time updating my vector embeddings” — renning22 (Hacker News, https://news.ycombinator.com/item?id=43772582)

You know that feeling when your nightly batch job re-embeds your entire document corpus at 3am, but a critical file was updated at 3:01am and your AI assistant won't know until tomorrow night? Standard ETL pipelines treat every update as "reprocess everything"; there is no row-level concept of what actually changed. Worse, when you update your chunking strategy or swap embedding models, you manually invalidate the entire index and wait hours for a full re-run. CocoIndex replaces this by computing a per-row cache key from hash(source_bytes) and hash(code_version), and skipping execution entirely for rows whose key is unchanged.

ai · etl · python · rust · rag · open-source · data-pipeline

You define your transformation as a standard Python async function decorated with @coco.fn; CocoIndex runs it once per input row and stores the output against a cache key built from hash(source_bytes) XOR hash(transformation_code_version). On every subsequent run, it checks each row's key against the cache: unchanged rows skip execution entirely and their cached outputs flow directly to your target store. When you update a transformation function, CocoIndex walks the lineage graph forward from that function and re-runs only the rows whose pipeline passes through it. The Rust engine handles parallelization, retries, dead-letter queues, and failure isolation; you write plain Python without touching thread management.

01 Row-level memoization — when 10 rows change in a 10,000-row corpus, only those 10 rows re-execute; you skip paying to re-embed the other 9,990.
02 Code-version cache invalidation — updating a chunking function or swapping an embedding model triggers re-runs only for rows that pass through that function; rows untouched by the change stay cached even across code deploys.
03 Rust engine, Python API — you write standard async Python functions; the Rust core handles parallelization, retries, dead-letter queues, and failure isolation without any configuration from you.
04 Full data lineage tracking — every output chunk traces back to its exact source byte, giving you a complete audit trail of why your index contains what it contains.
05 S3 and local filesystem connectors — ingest from local folders or Amazon S3 with built-in incremental change detection; no manual diffing required.
06 Multi-target write support — define one pipeline and write to vector databases, relational databases, graph databases, or feature stores using the same declarative API.
07 CocoInsight observability dashboard — real-time view of index freshness and cache hit rates per pipeline run without instrumenting your own metrics.
Who it’s for

If you build RAG pipelines, semantic search backends, or AI agent infrastructure and your current approach re-processes the entire corpus on every update cycle — by schedule or on code change — CocoIndex targets your exact problem. You need comfortable Python async/await skills and must provision a persistent backing database (Postgres is shown in the official examples) for the control plane state. CocoIndex is not the right fit if you need a built-in query or retrieval layer; the team removed the built-in query handler in recent releases, so you bring your own retrieval logic.
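The async skill floor is modest in practice: a transformation is just a coroutine, and fanning out over rows is ordinary asyncio code. The names below (`embed`, `run_all`) are assumptions for illustration, not CocoIndex's API; the engine would do this scheduling for you.

```python
import asyncio

async def embed(text: str) -> list:
    # Stand-in for an async embedding call (e.g., an HTTP request).
    await asyncio.sleep(0)
    return [float(len(text))]

async def run_all(rows: list) -> list:
    # asyncio.gather is the plain-Python equivalent of the engine
    # running one coroutine per row concurrently.
    return await asyncio.gather(*(embed(r) for r in rows))
```

If you can write and `await` functions like these, you clear the bar; the harder operational lift is the Postgres control plane, not the Python.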

Worth exploring

CocoIndex hit v1.0.3 on May 5, 2026 — its first stable 1.x release after 14 months of 0.1.x iterations — and GitHub selected it alongside Pandas and Apache Airflow for its Secure Open Source Fund in February 2026. With 9,446 stars, 67 contributors, and daily pushes as of May 10, 2026, the maintenance signals are strong. Run a POC against your actual corpus before committing: the persistent control plane adds real operational overhead, the async-only API raises the Python skill floor, and the "10×" savings figure in the README is the project's own claim, with no independent benchmark behind it.
