R&D · intermediate · 3 min read · May 24, 2026

225 GB of Japanese TV Reactions: 3.3M Downloads, One Dev

“3.3 million monthly downloads — and almost none of that is researchers: the maintainer's own API server is downloading the data every 5 minutes and each call counts.”

Source · huggingface.co

“All past logs from the legacy Niconico Jikkyo up to December 15, 2020 (originally distributed by Nekopanda), all logs from the new Niconico Jikkyo, and starting from June 10, 2024, current-day logs from NX-Jikkyo (an alternate comment server for Jikkyo) collected every 5 minute...”

You are building a Japanese NLP model or research project and need informal, real-time Japanese text — not formal news or structured Wikipedia content, but the shorthand, slang, and reaction language actual Japanese internet users type in the moment. No equivalent public corpus covers this style of Japanese at this scale, and the only platform that generated it (NicoNico Jikkyo) discontinued its original service in 2020, with the replacement restricting archive access to a 3-week window. Without this archive, 15 years of real-time reaction text simply would not exist in any accessible form.

japanese-nlp · dataset · text-corpus · huggingface · niconico · social-media · nlp

The corpus is stitched together from successive sources. A previous archivist, Nekopanda, bulk-exported all Niconico Jikkyo logs from 2009 through December 15, 2020, before the original service shut down. Logs from the relaunched Niconico Jikkyo cover the period after that, and since June 10, 2024, tsukumijima's JKCommentCrawler, a Python tool, has polled NX-Jikkyo (an alternative comment server) every 5 minutes and appended new comment batches to the dataset. Each batch is committed to Hugging Face's Git LFS storage as .nicojk files organized by channel ID and date. A companion REST API at jikkyo.tsukumijima.net uses this Hugging Face dataset as its live data source, so traffic from the API counts toward the dataset's download total.
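As a rough illustration of what reading one record from a .nicojk file might look like, the sketch below parses a single comment into a typed object. It assumes each comment is stored as one XML `<chat>` element and that the attributes are named `date`, `date_usec`, `premium`, `anonymity`, and `mail`; those names, the element layout, and the sample line are assumptions for illustration, so check them against a real file before relying on them.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class Comment:
    posted_at: datetime  # UTC timestamp with microsecond precision
    premium: bool        # user premium status
    anonymous: bool      # anonymity flag
    mail: str            # display command string
    text: str            # raw comment body, unfiltered


def parse_chat_line(line: str) -> Comment:
    """Parse one <chat> element (assumed layout) into a Comment."""
    elem = ET.fromstring(line)
    seconds = int(elem.get("date", "0"))
    micros = int(elem.get("date_usec", "0"))
    posted_at = datetime.fromtimestamp(seconds, tz=timezone.utc).replace(
        microsecond=micros
    )
    return Comment(
        posted_at=posted_at,
        premium=elem.get("premium") == "1",
        anonymous=elem.get("anonymity") == "1",
        mail=elem.get("mail", ""),
        text=elem.text or "",
    )


# Hypothetical sample record, written by hand for this sketch.
sample = (
    '<chat thread="1" no="1" vpos="100" date="1718000000" '
    'date_usec="123456" mail="184" user_id="x" premium="1" '
    'anonymity="1">wwww</chat>'
)
comment = parse_chat_line(sample)
```

Keeping the display command string in its own field, separate from the comment body, makes it easier to drop or keep it later when cleaning the text for training.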

01
15-year comment history (2009–present) — you get the only public corpus of real-time Japanese TV reactions spanning this range, covering 90+ channels from NHK General to satellite broadcasters
02
5-minute live ingestion — you get new NX-Jikkyo comments within 5 minutes of broadcast, automatically appended by JKCommentCrawler with no manual steps on your end
03
Per-comment metadata fields — each record carries exact UNIX timestamp plus microseconds, user premium status, anonymity flag, and display command string (the mail field), giving you richer context than bare comment text
04
Configurable builder params: you filter loads by channel_id, year, and number_of_files directly in the dataset load call, avoiding a full pull of the 225 GB corpus when you only need a slice
05
REST API access without downloading — jikkyo.tsukumijima.net lets you query any 3-day window by channel and time range in JSON or XML with no local dataset download required
06
MIT license: permissive terms that allow use in commercial products or LLM training pipelines without negotiating anything, provided you keep the copyright and license notice with any copies you redistribute
07
Dual-provenance continuity — the corpus bridges the 2020 platform shutdown (Nekopanda's original dump) and the 2024 cyberattack gap (NX-Jikkyo ingestion started June 10, 2024), giving you an unbroken record across two infrastructure crises
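The builder parameters in item 04 could be passed along these lines. This is a minimal sketch, not the maintainer's documented usage: `build_load_kwargs` and `load_slice` are hypothetical helpers, the `jk1` channel ID format is an assumption, and `repo_id` is left as an argument because the text above does not name the repository. Only `channel_id`, `year`, and `number_of_files` come from the description itself.

```python
from typing import Any, Dict, Optional


def build_load_kwargs(
    channel_id: Optional[str] = None,
    year: Optional[int] = None,
    number_of_files: Optional[int] = None,
) -> Dict[str, Any]:
    """Collect only the builder parameters that were actually set."""
    if number_of_files is not None and number_of_files < 1:
        raise ValueError("number_of_files must be a positive integer")
    candidates = {
        "channel_id": channel_id,
        "year": year,
        "number_of_files": number_of_files,
    }
    return {key: value for key, value in candidates.items() if value is not None}


def load_slice(repo_id: str, **builder_params: Any):
    """Load a slice of the dataset; repo_id is whatever the dataset page shows."""
    # Imported lazily so the helper above works without the library installed.
    from datasets import load_dataset

    return load_dataset(repo_id, **build_load_kwargs(**builder_params))


# Example: one channel, one year, capped at ten files (values are illustrative).
kwargs = build_load_kwargs(channel_id="jk1", year=2023, number_of_files=10)
```

Because the dataset relies on a legacy Python loading script, recent versions of the `datasets` library may refuse to run it or may require an explicit opt-in; check the library's current behavior before wiring this into a pipeline.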
Who it’s for

If you are building Japanese NLP models and need informal, colloquial training text — not news articles or Wikipedia — this is the largest free source of that style. If you are researching Japanese internet culture, public opinion on TV events, or temporal shifts in online language, no comparable dataset exists. This is not useful if you need labeled or annotated data (there are no sentiment labels, categories, or named entity tags), if you need formal Japanese text, or if your pipeline cannot handle 225 GB of raw, unfiltered content with embedded ASCII art.

Worth exploring

Yes, if your use case is informal Japanese text: this is the only public corpus at this scale for this domain, MIT-licensed, and actively maintained with a working companion API. The main risks are the broken Hugging Face dataset viewer (legacy Python script format, no in-browser preview), single-maintainer dependency, and the preprocessing burden from unfiltered ASCII art and display command strings. The REST API path works in production today and avoids the viewer and bulk-download problems, though it still depends on the same single maintainer.
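For the REST API path, a request URL could be assembled roughly as follows. The host and the 3-day window come from the text above; the `/api/kakolog/<channel>` path, the `starttime`, `endtime`, and `format` parameter names, and the `jk1` channel ID are assumptions made for this sketch and should be checked against the API's own documentation.

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import urlencode

# Host is from the article; the path below is an assumption.
API_BASE = "https://jikkyo.tsukumijima.net/api/kakolog"
MAX_WINDOW = timedelta(days=3)  # window limit stated in the article


def build_query_url(
    channel_id: str, start: datetime, end: datetime, fmt: str = "json"
) -> str:
    """Build a query URL for one channel and time range (assumed parameters)."""
    if fmt not in ("json", "xml"):
        raise ValueError("fmt must be 'json' or 'xml'")
    if start.tzinfo is None or end.tzinfo is None:
        raise ValueError("use timezone-aware datetimes")
    if end <= start:
        raise ValueError("end must be after start")
    if end - start > MAX_WINDOW:
        raise ValueError("requested window exceeds 3 days")
    params = {
        "starttime": int(start.timestamp()),
        "endtime": int(end.timestamp()),
        "format": fmt,
    }
    return f"{API_BASE}/{channel_id}?{urlencode(params)}"


# Example: the first hour of June 10, 2024 (UTC) on a hypothetical channel ID.
start = datetime(2024, 6, 10, 0, 0, tzinfo=timezone.utc)
url = build_query_url("jk1", start, start + timedelta(hours=1))
```

Validating the window locally avoids sending requests the server would reject, and ranges longer than 3 days can be covered by splitting them into consecutive windows.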
