“All past logs from the legacy Niconico Jikkyo up to December 15, 2020 (originally distributed by Nekopanda), all logs from the new Niconico Jikkyo, and starting from June 10, 2024, current-day logs from NX-Jikkyo (an alternate comment server for Jikkyo) collected every 5 minute...”
You are building a Japanese NLP model or research project and need informal, real-time Japanese text: not formal news or structured Wikipedia content, but the shorthand, slang, and reaction language that Japanese internet users actually type in the moment. No equivalent public corpus covers this style of Japanese at this scale. The only platform that generated it (Niconico Jikkyo) discontinued its original service in 2020, and the replacement restricts archive access to a 3-week window. Without this archive, 15 years of real-time reaction text would not exist in any accessible form.
The corpus has two origins stitched together. An earlier archivist, Nekopanda, bulk-exported all Niconico Jikkyo logs from 2009 through December 15, 2020, before the original service shut down. Since then, tsukumijima has been running JKCommentCrawler, a Python tool that polls NX-Jikkyo (an alternative comment server) every 5 minutes and appends new comment batches to the dataset. Each batch is committed to Hugging Face's Git LFS storage as .nicojk files organized by channel ID and date. A companion REST API at jikkyo.tsukumijima.net uses this Hugging Face dataset as its live data source, so every API query counts as a dataset download hit.
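To make the file layout concrete, here is a minimal Python sketch of reading comments out of one .nicojk file. It rests on assumptions this article does not confirm: that each comment is stored as a `<chat ...>text</chat>` element, that the `date` attribute holds a Unix timestamp in seconds, and that `mail` carries the space-separated display commands. The function name `iter_comments` and the sample string are hypothetical. Check a real file before relying on any of it.

```python
import re
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

# Assumption: comments are stored as <chat ...>text</chat> elements.
# DOTALL lets one comment span several lines (e.g. multi-line ASCII art).
CHAT_RE = re.compile(r"<chat\b[^>]*>.*?</chat>", re.DOTALL)


def iter_comments(raw):
    """Yield one dict per well-formed <chat> element found in `raw`."""
    for match in CHAT_RE.finditer(raw):
        try:
            elem = ET.fromstring(match.group(0))
        except ET.ParseError:
            continue  # skip malformed elements instead of failing the file
        date_attr = elem.get("date", "")
        if not date_attr.isdigit():
            continue  # assumed Unix seconds; skip anything else
        yield {
            "time": datetime.fromtimestamp(int(date_attr), tz=timezone.utc),
            "vpos": elem.get("vpos"),
            # Assumption: `mail` holds space-separated display commands.
            "commands": (elem.get("mail") or "").split(),
            "text": elem.text or "",
        }


# Hypothetical sample data, not taken from the real dataset.
SAMPLE = (
    '<chat thread="1" no="1" vpos="100" date="1718000000" '
    'mail="184 red" user_id="x">きたああああ</chat>\n'
    '<chat date="1718000060" vpos="6000">line1\nline2</chat>\n'
    "<chat broken"
)

for comment in iter_comments(SAMPLE):
    print(comment["time"].isoformat(), comment["commands"], repr(comment["text"]))
```

Keeping the display commands in their own field, separate from the comment text, makes it easy to drop them before training while still being able to inspect them later.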
If you are building Japanese NLP models and need informal, colloquial training text — not news articles or Wikipedia — this is the largest free source of that style. If you are researching Japanese internet culture, public opinion on TV events, or temporal shifts in online language, no comparable dataset exists. This is not useful if you need labeled or annotated data (there are no sentiment labels, categories, or named entity tags), if you need formal Japanese text, or if your pipeline cannot handle 225 GB of raw, unfiltered content with embedded ASCII art.
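If embedded ASCII art is the blocker, a rough pre-filter may be enough to get started. The sketch below is a heuristic, not something the dataset or its tooling provides: it flags a comment when it spans more than a couple of lines, or when it is long and made up mostly of symbols rather than letters and digits. The function name `looks_like_ascii_art` and all three thresholds are hypothetical and would need tuning against real data.

```python
import unicodedata


def looks_like_ascii_art(comment, min_ratio=0.5, max_lines=2, min_len=10):
    """Heuristically flag comments that are probably ASCII art.

    All thresholds are illustrative guesses, not tuned values.
    """
    # Multi-line comments are treated as art outright.
    if len(comment.splitlines()) > max_lines:
        return True
    chars = [c for c in comment if not c.isspace()]
    # Short reactions like "え！？" are kept even if mostly punctuation.
    if len(chars) < min_len:
        return False
    # Count characters in Unicode letter (L*) or number (N*) categories;
    # kana and kanji fall under letters, box-drawing glyphs do not.
    wordlike = sum(1 for c in chars if unicodedata.category(c)[0] in ("L", "N"))
    return wordlike / len(chars) < min_ratio


samples = [
    "きたああああ",
    "今日の放送は面白かったですね",
    "┏━━━━━━━━━━┓",
    "　 ∧＿∧\n（ ´∀｀）\n（　　　）",
]
kept = [s for s in samples if not looks_like_ascii_art(s)]
print(kept)
```

A filter like this trades recall for simplicity: single-line art built from kana or Latin letters will slip through, and long emoticon-heavy reactions may be dropped, so spot-check its output before applying it to the full 225 GB.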
Yes, if your use case is informal Japanese text: this is the only public corpus at this scale for this domain, it is MIT-licensed, and it is actively maintained with a working companion API. The main risks are the broken Hugging Face dataset viewer (legacy Python script format, no in-browser preview), the dependency on a single maintainer, and the preprocessing burden from unfiltered ASCII art and display command strings. The REST API access path works in production today without any of those caveats.