R&D · Beginner · 3 min read · Jun 22, 2026 · Updated Jun 23, 2026

NVIDIA's 21M Indian Personas Run on 15-Year-Old Census Data

“21 million Indian AI personas, but the real demographic variety is 3 million records, and every one is calibrated to a 2011 India that is well over 200 million people smaller than today's.”

Source · huggingface.co

“Despite the improved data diversity and fidelity to India's population, the dataset is still limited by data availability, current staleness of data, and reasonable model complexity. This results in some necessary independence assumptions; for instance, that occupations are ind...”

You know that feeling when you fine-tune a model for Indian users and it confidently assumes everyone speaks English fluently, lives in a metro city, and has a tech job? Generic synthetic persona datasets pull from English-web demographics as the default, which bakes a Western population distribution into your Hindi chatbot's priors. Collecting real Indian user data raises privacy and regulatory hurdles. The result: your model has Hindi vocabulary but American demographic assumptions behind every generated response.

synthetic-data · india · hindi · llm-training · demographics · open-dataset · nvidia

NVIDIA first built a probabilistic graphical model (PGM) calibrated to India's 2011 Census tables, so that when it samples age, sex, religion, education level, occupation, and district, the resulting combination matches India's 2011 population distribution rather than an LLM's internal biases. Those demographic bundles then pass to GPT-OSS-120B, which writes a natural-language persona narrative for each bundle. Generation runs seven times per record (once each for general, professional, linguistic, culinary, sports, arts, and travel personas), so each of the 3 million demographic profiles becomes seven different character descriptions, which is where the 21 million headline figure comes from. NeMo Data Designer orchestrates the pipeline using Jinja templating and Pydantic validation. Output lands in Parquet files that you can load with the Hugging Face datasets library in two lines.
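The two-stage flow described above (sample a demographic bundle from calibrated tables, then fan it out into seven persona prompts) can be sketched in a few lines. This is only an illustration under stated assumptions: the probability tables are made-up toy numbers rather than real Census figures, the two-variable structure and the prompt wording are hypothetical, `string.Template` and a dataclass stand in for the Jinja templating and Pydantic validation the pipeline reportedly uses, and no LLM is called.

```python
import random
from dataclasses import dataclass
from string import Template

# The seven persona types named in the article.
PERSONA_TYPES = ["general", "professional", "linguistic", "culinary",
                 "sports", "arts", "travel"]

# HYPOTHETICAL toy tables: made-up weights for illustration only,
# NOT real 2011 Census figures.
SEX_WEIGHTS = {"female": 0.48, "male": 0.52}
EDUCATION_GIVEN_SEX = {
    "female": {"no formal schooling": 0.40, "secondary": 0.45, "graduate": 0.15},
    "male": {"no formal schooling": 0.25, "secondary": 0.55, "graduate": 0.20},
}

# Hypothetical prompt wording; the real templates are not shown in the source.
PROMPT = Template(
    "Write a $persona_type persona for a $sex person in India "
    "whose education level is: $education."
)


@dataclass(frozen=True)
class DemographicBundle:
    sex: str
    education: str


def weighted_choice(rng, weights):
    """Pick one key of `weights` with probability proportional to its value."""
    labels = list(weights)
    return rng.choices(labels, weights=[weights[k] for k in labels], k=1)[0]


def sample_bundle(rng):
    """Draw the parent variable first, then the child conditioned on it."""
    sex = weighted_choice(rng, SEX_WEIGHTS)
    education = weighted_choice(rng, EDUCATION_GIVEN_SEX[sex])
    return DemographicBundle(sex=sex, education=education)


def build_prompts(bundle):
    """One prompt per persona type, all sharing the same demographics."""
    return {
        persona_type: PROMPT.substitute(
            persona_type=persona_type, sex=bundle.sex, education=bundle.education
        )
        for persona_type in PERSONA_TYPES
    }


rng = random.Random(0)  # fixed seed so the sketch is reproducible
bundle = sample_bundle(rng)
prompts = build_prompts(bundle)
```

The point of the design is that demographics are fixed before any text is generated, so the language model only writes the narrative and never chooses who the persona is.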

01. Census-grounded demographic sampling: age, sex, religion, education, occupation, and district match India's population distribution from the 2011 Census, not a web-scraped approximation of who writes things online.
02. 7 persona types per record: each demographic bundle generates general, professional, linguistic, culinary, sports, arts, and travel personas, giving your fine-tuning pipeline text variety without re-sampling demographics.
03. ~2,900 occupational categories: the 26 broad census buckets are expanded via the National Classification of Occupations-2004, so your model learns about beedi rollers and handloom weavers, not just 'labourer'.
04. Three script variants: English (en_IN), Hindi Devanagari (hi_Deva_IN), and Hindi Latin (hi_Latn_IN), each at approximately 1 million rows, letting you pick the script your target application needs.
05. All 36 states and union territories and 640 districts: geographic granularity gives personas accurate regional context, so a coastal Kerala persona reads differently from a Rajasthan desert town persona.
06. ~560k unique names from Electoral Rolls: names reflect India's actual linguistic name diversity rather than being generated by an LLM trained on English baby-name lists.
07. CC BY 4.0 license: free for commercial use as long as you give attribution, and simpler than alternatives like BhashaKritika, which uses the more restrictive Krutrim Community License.
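The headline number can be reconciled with the figures in the list above using simple arithmetic. The sketch below assumes that the three script variants, at roughly 1 million rows each, together make up the 3 million records; the source states both figures but does not spell out that relationship.

```python
# Figures taken from the article; the row count per script is approximate.
ROWS_PER_SCRIPT = 1_000_000
SCRIPTS = ["en_IN", "hi_Deva_IN", "hi_Latn_IN"]
PERSONAS_PER_RECORD = 7

# Assumption: total records = rows per script * number of script variants.
records = ROWS_PER_SCRIPT * len(SCRIPTS)
personas = records * PERSONAS_PER_RECORD

print(f"{records:,} records -> {personas:,} persona texts")
# prints "3,000,000 records -> 21,000,000 persona texts"
```

In other words, the 21 million counts persona texts, while the demographic variety is bounded by the roughly 3 million sampled records.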
Who it’s for

If you are training or fine-tuning a Hindi or Indian-English language model and need demographically calibrated synthetic personas as conditioning input, this is for you. It also works for building evaluation sets for Indian-language chatbots where you want to test against statistically realistic user profiles. It is not useful if your target population speaks Tamil, Telugu, Bengali, Marathi, or any of the 17 other scheduled Indian languages absent from this dataset.
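For the evaluation-set use case, one simple option is a uniform random sample of persona records, which keeps group shares close to those of the full dataset in expectation. The sketch below is a minimal illustration with toy records; the `state` field name and the counts are hypothetical and may not match the dataset's actual columns.

```python
import random
from collections import Counter


def draw_eval_set(records, n, seed=0):
    """Uniform sample without replacement, reproducible via the seed."""
    rng = random.Random(seed)
    return rng.sample(records, min(n, len(records)))


def shares(records, key):
    """Fraction of records falling into each value of `key`."""
    counts = Counter(record[key] for record in records)
    total = sum(counts.values())
    return {value: count / total for value, count in counts.items()}


# HYPOTHETICAL toy records; field name and proportions are made up.
toy_records = (
    [{"state": "Uttar Pradesh"}] * 600
    + [{"state": "Rajasthan"}] * 300
    + [{"state": "Kerala"}] * 100
)

eval_set = draw_eval_set(toy_records, 200)
full_shares = shares(toy_records, "state")
eval_shares = shares(eval_set, "state")
```

Comparing `eval_shares` with `full_shares` is a quick check that the sample has not drifted far from the source distribution. Sampling an equal number of records per state would instead over-represent small states, which is a different evaluation goal.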

Worth exploring

Loading the English subset takes two lines of Python: no account, no API key, and Parquet works with any data tool. Census-grounded demographic sampling at this scale appears to be unique among open-source datasets for India. The hard limits to know before committing: 21 of India's 22 scheduled languages are absent (only Hindi is covered), and the 2011 Census backbone means urbanization and smartphone penetration patterns are 15 years out of date, with no stated update plan.
