“Off-the-shelf CLIP failed on DoorDash's e-commerce queries. They built their own model with 32M labels and deployed it to 100% of traffic.”
DoorDash built DashCLIP, a multimodal embedding system that aligns product images, text descriptions, and user queries in a shared vector space — trained on 32 million query-product pairs. Off-the-shelf models like CLIP and BLIP fail on short e-commerce queries because they lack domain specificity. DashCLIP solves this with a two-stage training pipeline: continual pretraining on 400K catalog items, then query-product alignment using a custom Query-Catalog Contrastive loss. In online experiments, click-through and conversion rates improved enough to justify rolling DashCLIP out to 100% of traffic.
You know that feeling when a customer searches 'healthy snack' and your search returns nothing because your products are labeled 'organic granola bar'? Traditional search relies on keyword matching and engagement history — it can't understand that a photo of chips and the word 'crunchy' describe the same thing. Off-the-shelf vision-language models like CLIP work great on general images but fail on e-commerce because they don't understand product categories, aisle layouts, or shopping intent. Before DashCLIP: you'd get zero results for semantically relevant queries. After: the system retrieves the right products even when the words don't match.
Think of DashCLIP like a universal translator that converts product images, product text, and search queries into the same language — vectors. Stage 1: Take a pretrained vision-language model (BLIP-14M) and continue training it on 400K DoorDash product images and titles. This teaches the model what grocery products look like. Stage 2: Train a separate query encoder that maps user searches into the same vector space as products. The key innovation is the Query-Catalog Contrastive loss — it pulls relevant query-product pairs closer together while pushing irrelevant pairs apart. You use 700K human labels to fine-tune GPT, which then generates 32M labeled pairs for training. At inference, encode the query, find nearest product vectors, and rank.
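The Query-Catalog Contrastive loss described above follows the standard symmetric InfoNCE pattern used by CLIP-style models. Here is a minimal NumPy sketch under that assumption — the function name and the temperature value are illustrative, not taken from the paper:

```python
import numpy as np

def query_catalog_contrastive_loss(query_emb, product_emb, temperature=0.07):
    """Symmetric InfoNCE-style loss: row i of query_emb and row i of
    product_emb form a relevant (positive) pair; every other row in the
    batch serves as an in-batch negative."""
    # L2-normalize so dot products are cosine similarities
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    p = product_emb / np.linalg.norm(product_emb, axis=1, keepdims=True)
    logits = (q @ p.T) / temperature        # (batch, batch) similarity matrix
    targets = np.arange(len(q))             # positives sit on the diagonal

    def cross_entropy(l):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[targets, targets].mean()

    # Pull matched pairs together and push mismatched pairs apart, in both
    # directions: query -> product and product -> query.
    return (cross_entropy(logits) + cross_entropy(logits.T)) / 2
```

Perfectly aligned query/product embeddings drive this loss toward zero, while mismatched pairs keep it high. The same cosine-similarity matrix (`q @ p.T`) is what inference reuses: encode the query once, then rank products by nearest vectors.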
If you're building or improving search/recommendations for an e-commerce platform, marketplace, or any product catalog — this shows you how to move beyond keyword matching to semantic understanding. Also relevant if you're evaluating whether to fine-tune off-the-shelf vision-language models vs build from scratch. Not useful if you don't have a product catalog or if your search problem is purely t...
The architecture is production-proven and the paper provides enough detail to replicate. The key insight — that off-the-shelf models fail on domain-specific e-commerce queries — is broadly applicable. The LLM-augmented labeling approach (700K human → 32M GPT-generated) is a practical pattern you can borrow. The one caveat: DoorDash has significant ML infrastructure; you'll need to adapt this to your scale. If you're doing e-commerce search, this is worth studying closely.
This page gives you the hook. The full Snaplyze digest goes deeper so you can move from curiosity to decision with less noise.
Read the full digest for the deeper breakdown, Easy Mode, Pro Mode, and practical playbooks you can actually use.
Install Snaplyze