NVIDIA's 21M Indian Personas Run on 15-Year-Old Census Data

What problem does it solve?

“Despite the improved data diversity and fidelity to India's population, the dataset is still limited by data availability, current staleness of data, and reasonable model complexity. This results in some necessary independence assumptions; for instance, that occupations are ind...”

You know that feeling when you fine-tune a model for Indian users and it confidently assumes everyone speaks English fluently, lives in a metro city, and has a tech job? Generic synthetic persona datasets pull from English-web demographics as the default, which bakes a Western population distribution into your Hindi chatbot's priors. Collecting real Indian user data raises privacy and regulatory hurdles. The result: your model has Hindi vocabulary but American demographic assumptions behind every generated response.

Tags: synthetic-data, india, hindi, llm-training, demographics, open-dataset, nvidia

How it works

NVIDIA first built a Probabilistic Graphical Model calibrated to India's 2011 Census tables — so when it samples age, sex, religion, education level, occupation, and district, the resulting combination matches India's actual population distribution rather than an LLM's internal biases. Those demographic bundles then pass to GPT-OSS-120B, which writes a natural-language persona narrative for each bundle. This process runs seven times per record — once each for general, professional, linguistic, culinary, sports, arts, and travel personas — so each of the 3 million demographic profiles becomes seven different character descriptions. NeMo Data Designer orchestrates the pipeline using Jinja templating and Pydantic validation. Output lands in Parquet files you load with the HuggingFace datasets library in two lines.
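The two stages described above can be sketched in a few lines. This is a toy illustration, not NVIDIA's pipeline: the probability tables are made-up numbers rather than census figures, the graph has only three variables, plain f-strings stand in for Jinja templates, a dataclass stands in for Pydantic validation, and no LLM is called. All names (`Demographics`, `sample_demographics`, `build_prompts`) are hypothetical.

```python
import random
from dataclasses import dataclass

# Toy conditional probability tables. The numbers are invented for
# illustration and are NOT taken from the 2011 Census.
P_STATE = {"Kerala": 0.4, "Rajasthan": 0.6}
P_EDUCATION_GIVEN_STATE = {
    "Kerala": {"primary": 0.2, "secondary": 0.5, "graduate": 0.3},
    "Rajasthan": {"primary": 0.5, "secondary": 0.35, "graduate": 0.15},
}
P_OCCUPATION_GIVEN_EDUCATION = {
    "primary": {"handloom weaver": 0.6, "shopkeeper": 0.4},
    "secondary": {"shopkeeper": 0.5, "clerk": 0.5},
    "graduate": {"clerk": 0.3, "teacher": 0.7},
}

# The seven persona types named in the article.
PERSONA_TYPES = [
    "general", "professional", "linguistic", "culinary",
    "sports", "arts", "travel",
]


@dataclass(frozen=True)
class Demographics:
    """One sampled demographic bundle (a stand-in for a validated record)."""
    state: str
    education: str
    occupation: str


def draw(table, rng):
    """Pick one key from a {value: probability} table."""
    values = list(table)
    weights = [table[v] for v in values]
    return rng.choices(values, weights=weights, k=1)[0]


def sample_demographics(rng):
    """Ancestral sampling: draw each variable conditioned on its parent."""
    state = draw(P_STATE, rng)
    education = draw(P_EDUCATION_GIVEN_STATE[state], rng)
    occupation = draw(P_OCCUPATION_GIVEN_EDUCATION[education], rng)
    return Demographics(state, education, occupation)


def build_prompts(demo):
    """Render one prompt per persona type for a single bundle."""
    return {
        kind: (
            f"Write the {kind} persona for a {demo.occupation} from "
            f"{demo.state} with {demo.education} education."
        )
        for kind in PERSONA_TYPES
    }


rng = random.Random(0)
bundle = sample_demographics(rng)
prompts = build_prompts(bundle)
print(bundle)
print(len(prompts), "prompts for one demographic bundle")
```

Because the demographics are sampled before any text is written, the language model only fills in narrative detail; the population mix is fixed by the tables, not by the model's priors.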

Key takeaways

01. Census-grounded demographic sampling: age, sex, religion, education, occupation, and district match India's actual population distribution from the 2011 Census, not a web-scraped approximation of who writes things online

02. 7 persona types per record: each demographic bundle generates general, professional, linguistic, culinary, sports, arts, and travel personas, giving your fine-tuning pipeline text variety without re-sampling demographics

03. ~2,900 occupational categories: the 26 broad census buckets are expanded via the National Classification of Occupations-2004, so your model learns about beedi rollers and handloom weavers, not just 'labourer'

04. Three script variants: English (en_IN), Hindi Devanagari (hi_Deva_IN), and Hindi Latin (hi_Latn_IN), each at approximately 1 million rows, letting you pick the script your target application needs

05. All 36 states and union territories, and 640 districts: geographic granularity gives personas accurate regional context, so a coastal Kerala persona reads differently from a Rajasthan desert town persona

06. ~560k unique names from Electoral Rolls: names reflect India's actual linguistic name diversity rather than being generated by an LLM trained on English baby-name lists

07. CC BY 4.0 license: free for commercial use as long as you give attribution, and simpler than alternatives like BhashaKritika, which uses the more restrictive Krutrim Community License

Should you care?

Who it’s for

If you are training or fine-tuning a Hindi or Indian-English language model and need demographically calibrated synthetic personas as conditioning input, this is for you. It also works for building evaluation sets for Indian-language chatbots where you want to test against statistically realistic user profiles. Not useful if your target population speaks Tamil, Telugu, Bengali, Marathi, or any of the 17 other scheduled Indian languages absent from this dataset.
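For the evaluation-set use case, one simple option is to draw a stratified sample so each region keeps its share of the pool. The sketch below is an assumption-laden illustration: the rows are hypothetical stand-ins, the `state` field name is assumed rather than taken from the dataset schema, and `stratified_sample` is a helper written for this example.

```python
import random
from collections import Counter, defaultdict


def stratified_sample(rows, key, n, seed=0):
    """Draw roughly n rows so each value of `key` keeps its share of the pool.

    Quotas are rounded per group, so the total can differ slightly from n.
    """
    rng = random.Random(seed)
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    sample = []
    for members in groups.values():
        quota = max(1, round(n * len(members) / len(rows)))
        sample.extend(rng.sample(members, min(quota, len(members))))
    return sample


# Hypothetical stand-in rows; real rows would come from the dataset itself.
pool = (
    [{"state": "Uttar Pradesh", "persona": f"persona {i}"} for i in range(60)]
    + [{"state": "Kerala", "persona": f"persona {i}"} for i in range(60, 90)]
    + [{"state": "Goa", "persona": f"persona {i}"} for i in range(90, 100)]
)

eval_set = stratified_sample(pool, key="state", n=20)
counts = Counter(row["state"] for row in eval_set)
print(counts)
```

With this toy pool the 60/30/10 split carries over to the sample, which is the property you want when the eval set should mirror the population rather than over-represent the largest group.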

Worth exploring

You can evaluate the English subset in an afternoon: loading it takes two lines of Python with no account and no API key, and Parquet works with any data tool. The census-grounded demographic sampling is the only approach like it in open source for India at this scale. The hard limit to know before committing: 21 of India's 22 scheduled languages are absent (only Hindi is covered), and the 2011 Census backbone means urbanization and smartphone penetration patterns are 15 years out of date, with no stated update plan.
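A minimal loading sketch follows. The article does not give the repository name, so `REPO_ID` below is an assumption, as is the choice to pass the script variant as a `split`; check the dataset card before relying on either. `load_variant` is a hypothetical wrapper, and the only library call used is `datasets.load_dataset`.

```python
# Assumed identifiers: neither the repository ID nor the split layout is
# confirmed by this article. Verify both against the dataset card.
REPO_ID = "nvidia/Nemotron-Personas-India"
VARIANTS = ("en_IN", "hi_Deva_IN", "hi_Latn_IN")


def load_variant(variant="en_IN"):
    """Load one script variant, rejecting names the article does not list."""
    if variant not in VARIANTS:
        raise ValueError(f"unknown variant {variant!r}; expected one of {VARIANTS}")
    # Imported lazily so the name check works without the library installed.
    from datasets import load_dataset  # pip install datasets
    return load_dataset(REPO_ID, split=variant)


# Usage (downloads data, so it is left commented out here):
# personas = load_variant("hi_Deva_IN")
# print(personas[0])
```

Stripped of the validation wrapper, this is the two-line version: one import and one `load_dataset` call.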
