225 GB of Japanese TV Reactions: 3.3M Downloads, One Dev

What problem does it solve

“All past logs from the legacy Niconico Jikkyo up to December 15, 2020 (originally distributed by Nekopanda), all logs from the new Niconico Jikkyo, and starting from June 10, 2024, current-day logs from NX-Jikkyo (an alternate comment server for Jikkyo) collected every 5 minute...”

You are building a Japanese NLP model or research project and need informal, real-time Japanese text — not formal news or structured Wikipedia content, but the shorthand, slang, and reaction language actual Japanese internet users type in the moment. No equivalent public corpus covers this style of Japanese at this scale, and the only platform that generated it (NicoNico Jikkyo) discontinued its original service in 2020, with the replacement restricting archive access to a 3-week window. Without this archive, 15 years of real-time reaction text simply would not exist in any accessible form.

japanese-nlp, dataset, text-corpus, huggingface, niconico, social-media, nlp

How it works

The corpus has two origins stitched together. A previous archivist, Nekopanda, bulk-exported all NicoNico Jikkyo logs from 2009 through December 15, 2020, before the original service shut down. tsukumijima continues the archive with JKCommentCrawler, a Python tool. The dataset also includes the logs of the new NicoNico Jikkyo, and since June 10, 2024 the crawler has polled NX-Jikkyo (an alternative comment server) every 5 minutes and appended new comment batches to the dataset. Each batch is committed to HuggingFace's Git LFS storage as .nicojk files organized by channel ID and date. A companion REST API at jikkyo.tsukumijima.net uses this HuggingFace dataset as its live data source, so every API query counts as a dataset download hit.
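To make the .nicojk records more concrete, here is a minimal parsing sketch. It assumes each comment is stored as one XML-style `<chat>` element, as in Niconico-style comment logs; the sample line and the attribute names (`date`, `date_usec`, `mail`, `premium`, `anonymity`) are assumptions for illustration and should be checked against real files before use.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

# Hypothetical sample record. The attribute names are assumptions modelled on
# Niconico-style <chat> elements, not copied from the dataset itself.
SAMPLE_LINE = (
    '<chat thread="1700000000" no="42" vpos="12345" date="1700000123" '
    'date_usec="456789" mail="184" user_id="abc" premium="1" anonymity="1">'
    'wwww</chat>'
)


def parse_chat_line(line: str) -> dict:
    """Parse one XML-style comment line into a flat dict."""
    elem = ET.fromstring(line)
    attrs = elem.attrib
    seconds = int(attrs["date"])                # UNIX timestamp (seconds)
    micros = int(attrs.get("date_usec", "0"))   # sub-second part, if present
    posted_at = datetime.fromtimestamp(seconds, tz=timezone.utc).replace(
        microsecond=micros
    )
    return {
        "text": elem.text or "",
        "posted_at": posted_at,
        # Display commands are treated as a space-separated string.
        "commands": attrs.get("mail", "").split(),
        "premium": attrs.get("premium") == "1",
        "anonymous": attrs.get("anonymity") == "1",
    }


record = parse_chat_line(SAMPLE_LINE)
print(record["posted_at"].isoformat(), record["text"])
```

Keeping the display commands as a separate list makes it easy to drop them before tokenization while still retaining them for analysis.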

Key takeaways

01. 15-year comment history (2009–present): you get the only public corpus of real-time Japanese TV reactions spanning this range, covering 90+ channels from NHK General to satellite broadcasters.

02. 5-minute live ingestion: you get new NX-Jikkyo comments within 5 minutes of broadcast, automatically appended by JKCommentCrawler with no manual steps on your end.

03. Per-comment metadata fields: each record carries an exact UNIX timestamp plus microseconds, user premium status, an anonymity flag, and the display command string (the mail field), giving you richer context than bare comment text.

04. Configurable builder params: you filter loads by channel_id, year, and number_of_files directly in the dataset load call, avoiding a full pull of the 225 GB corpus when you only need a slice.

05. REST API access without downloading: jikkyo.tsukumijima.net lets you query any 3-day window by channel and time range in JSON or XML, with no local dataset download required.

06. MIT license: permissive terms with no use restrictions, so you can include this in commercial products or LLM training pipelines without negotiating anything. The one condition is that the copyright and license notice stay with redistributed copies.

07. Dual-provenance continuity: the corpus bridges the 2020 platform shutdown (Nekopanda's original dump) and the 2024 cyberattack gap (NX-Jikkyo ingestion started June 10, 2024), giving you an unbroken record across two infrastructure crises.
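The builder parameters named in the takeaways (channel_id, year, number_of_files) can be sketched as a slice load. This is a rough sketch, not the dataset's documented usage: the repository ID and the `jk1` channel ID format are assumptions, and because the dataset uses a legacy Python loading script, the call may also need `trust_remote_code=True` and an older release of the `datasets` library.

```python
# Assumed repository ID; confirm it on the dataset page before use.
DATASET_REPO = "KakologArchives/KakologArchives"


def build_load_kwargs(channel_id: str, year: int, number_of_files: int) -> dict:
    """Collect the builder parameters for a partial load."""
    if number_of_files < 1:
        raise ValueError("number_of_files must be at least 1")
    return {
        "path": DATASET_REPO,
        "channel_id": channel_id,            # e.g. "jk1" (assumed ID format)
        "year": year,                        # restrict to one calendar year
        "number_of_files": number_of_files,  # cap how many log files are fetched
    }


def load_slice(channel_id: str, year: int, number_of_files: int):
    """Load a slice of the corpus. Needs network access and `datasets`."""
    from datasets import load_dataset  # imported lazily; optional dependency

    return load_dataset(**build_load_kwargs(channel_id, year, number_of_files))


kwargs = build_load_kwargs("jk1", 2023, 5)
print(kwargs)
```

Calling `load_slice("jk1", 2023, 5)` would then fetch only the requested files, assuming the loading script accepts these keyword arguments as described.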

Should you care?

Who it’s for

If you are building Japanese NLP models and need informal, colloquial training text — not news articles or Wikipedia — this is the largest free source of that style. If you are researching Japanese internet culture, public opinion on TV events, or temporal shifts in online language, no comparable dataset exists. This is not useful if you need labeled or annotated data (there are no sentiment labels, categories, or named entity tags), if you need formal Japanese text, or if your pipeline cannot handle 225 GB of raw, unfiltered content with embedded ASCII art.

Worth exploring

Yes, if your use case is informal Japanese text: this is the only public corpus at this scale for this domain, MIT-licensed, and actively maintained with a working companion API. The main risks are the broken HuggingFace dataset viewer (legacy Python script format, no in-browser preview), single-maintainer dependency, and the preprocessing burden from unfiltered ASCII art and display command strings. The REST API access path works in production today and sidesteps the broken viewer and the large download, though it depends on the same maintainer and returns the same unfiltered comments.
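For the REST API path, a query URL could be built along these lines. The endpoint path and the parameter names (`starttime`, `endtime`, `format`) are assumptions to verify against the API's own documentation; only the host, the 3-day window limit, and the JSON/XML formats come from the description above.

```python
from urllib.parse import urlencode

# Assumed endpoint path; only the host name comes from the article.
API_BASE = "https://jikkyo.tsukumijima.net/api/kakolog"
MAX_WINDOW_SECONDS = 3 * 24 * 60 * 60  # the 3-day query window


def build_query_url(channel_id: str, start_ts: int, end_ts: int, fmt: str = "json") -> str:
    """Build a query URL for one channel and one time range (UNIX seconds)."""
    if fmt not in ("json", "xml"):
        raise ValueError("fmt must be 'json' or 'xml'")
    if end_ts <= start_ts:
        raise ValueError("end_ts must be later than start_ts")
    if end_ts - start_ts > MAX_WINDOW_SECONDS:
        raise ValueError("time range exceeds the 3-day window")
    query = urlencode({"starttime": start_ts, "endtime": end_ts, "format": fmt})
    return f"{API_BASE}/{channel_id}?{query}"


# One hour of comments for a hypothetical channel ID.
url = build_query_url("jk1", 1700000000, 1700003600)
print(url)
```

Validating the window client-side avoids sending requests the server would be expected to reject; longer ranges can be split into consecutive 3-day queries.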
