Wikistats 2 builds on the success of Wikistats, the project started more than 15 years ago by Erik Zachte. Wikistats has been the canonical source of statistics about the reach and impact of the Wikimedia movement for many years. It offered a quantitative mirror to the Wikimedia communities to reflect on their growth, gaps and strategic opportunities. It also provided one of the earliest public data sources for the study of large-scale peer production communities, and as such has been cited nearly a thousand times in the literature.
As detailed in Wikistats 2’s documentation, there are several noticeable changes in the new site’s design, but the biggest changes come on the backend. In this post, we’ll detail what changes you’ll see, and explain how to access the data programmatically.
What’s new? Pretty much everything … but the data!
The data-processing pipeline for the new Wikistats has been rebuilt from scratch. It uses open source distributed-computing technologies such as Hadoop, Spark, Sqoop, and Hive to ingest and enhance project data, and loads a prepared version of the whole history of every project into Druid, a fast analytics server. Druid then serves sliced and diced subsets of the data through the Analytics Query Service, the MediaWiki external API for analytics data.
A brand new front-end has also been designed and built on top of the new API. The dashboard brings together a lot of information, providing an easy way to get an overview of any project at a glance. More details can be found in the three sections of the dashboard, which are labeled Contributing, Reading and Content. The Contributing section is about edits and editors, the Reading section is about visited articles and unique devices, and the Content section contains article-level statistics.
You may notice that the data in Wikistats 2.0 is the same data that existed in Wikistats. For this alpha release, we decided to replicate the existing metrics. In doing so we had two goals in mind: we wanted to test the new dashboard against a time-tested one, and we wanted to provide existing Wikistats users with statistics that closely match those they are familiar with. We succeeded relatively well at replicating the existing statistics.
How to access the data programmatically
You can access the same data that powers the new user interface by querying a RESTful API. The full documentation is available on this page, but we’ll walk you through some examples.
Let’s get the number of edits made every day in October 2017 for Wikipedia in Spanish:
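As a rough sketch of what such a request can look like, the Python helper below assembles a URL for this query. The base URL, the order of the path segments, the `all-editor-types` and `all-page-types` values, and the `edits_url` helper itself are assumptions made for illustration, so check them against the API documentation. The same goes for whether the end date is inclusive.

```python
from urllib.parse import quote

# Assumed base URL of the Analytics Query Service; not stated in this post.
BASE_URL = "https://wikimedia.org/api/rest_v1/metrics"

# Filter values named in this post, plus assumed "all-..." values for no filtering.
EDITOR_TYPES = {"anonymous", "group-bot", "name-bot", "user", "all-editor-types"}
PAGE_TYPES = {"content", "non-content", "all-page-types"}


def edits_url(project, start, end, editor_type="all-editor-types",
              page_type="all-page-types", granularity="daily"):
    """Build an edits-count URL (assumed layout); dates are YYYYMMDD strings."""
    if editor_type not in EDITOR_TYPES:
        raise ValueError(f"unknown editor type: {editor_type}")
    if page_type not in PAGE_TYPES:
        raise ValueError(f"unknown page type: {page_type}")
    parts = ["edits", "aggregate", project, editor_type, page_type,
             granularity, start, end]
    # Percent-encode each segment so unexpected characters cannot break the path.
    return BASE_URL + "/" + "/".join(quote(p, safe="") for p in parts)


# Daily edits on Spanish Wikipedia for October 2017 (illustrative date bounds).
url = edits_url("es.wikipedia", "20171001", "20171101")
print(url)
```

Passing `editor_type="user"` or `page_type="content"` to the same helper narrows the count to one of the categories.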
There are two parameters in the above URL telling us about editor-types and page-types. The editor-types parameter lets you filter by anonymous users (anonymous), registered users declared as bots (group-bot), registered users not declared as bots but that we suspect are bots nonetheless (name-bot), and registered users that we think are legitimate humans (user). The page-types parameter distinguishes content from non-content pages. Content pages are located in the main namespace, while non-content pages are talk pages and pages in other special namespaces.
A second example: we want to find the number of human editors who made more than 100 edits in a month, for each month between January and July 2015, on the Commons project:
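A minimal Python sketch of how this request could be built follows. The base URL, the path layout, the `commons.wikimedia` project name, the date bounds and the `editors_url` helper are assumptions for illustration rather than a statement of the exact URL, so please verify them against the API documentation.

```python
from urllib.parse import quote

# Assumed base URL of the Analytics Query Service; not stated in this post.
BASE_URL = "https://wikimedia.org/api/rest_v1/metrics"


def editors_url(project, start, end, editor_type="user",
                page_type="all-page-types", activity_level="100..-edits",
                granularity="monthly"):
    """Build an editors-count URL (assumed layout); dates are YYYYMMDD strings."""
    parts = ["editors", "aggregate", project, editor_type, page_type,
             activity_level, granularity, start, end]
    # Dots and hyphens are left as-is by quote(), so "100..-edits" is unchanged.
    return BASE_URL + "/" + "/".join(quote(p, safe="") for p in parts)


# Human editors with 100 or more edits per month on Commons, January to July 2015.
url = editors_url("commons.wikimedia", "20150101", "20150801")
print(url)
```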
This request introduces a new parameter, named activity-level. It is defined for requests on editors and edited-pages and lets you filter for specific levels of activity (1..4-edits, 5..24-edits, 25..99-edits, 100..-edits, or all-activity-levels for no filtering).
And a last one, just for fun! Let’s say we want to find the number of pages visited by regular users (not bots) between December 2016 and January 2017 on the English-language edition of Wikipedia. You can see how to add dates below:
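Here is a hedged Python sketch of this last query, treating "pages visited" as the pageviews metric. The endpoint path, the `all-access` and `user` values, the YYYYMMDDHH date format, and the `pageviews_url` and `fetch_json` helpers are all assumptions for illustration, so check the API documentation before relying on them.

```python
import json
from datetime import date
from urllib.parse import quote
from urllib.request import Request, urlopen

# Assumed base URL of the Analytics Query Service; not stated in this post.
BASE_URL = "https://wikimedia.org/api/rest_v1/metrics"


def pageviews_url(project, start, end, access="all-access", agent="user",
                  granularity="monthly"):
    """Build a pageviews URL (assumed layout) from two datetime.date bounds."""
    def stamp(day):
        # Assumed timestamp format: YYYYMMDD followed by a two-digit hour.
        return day.strftime("%Y%m%d") + "00"

    parts = ["pageviews", "aggregate", project, access, agent,
             granularity, stamp(start), stamp(end)]
    return BASE_URL + "/" + "/".join(quote(p, safe="") for p in parts)


def fetch_json(url):
    """Fetch a URL and decode its JSON body (needs network access)."""
    request = Request(url, headers={"User-Agent": "wikistats-example/0.1"})
    with urlopen(request) as response:
        return json.load(response)


# Views by regular users on English Wikipedia, December 2016 to January 2017.
url = pageviews_url("en.wikipedia", date(2016, 12, 1), date(2017, 1, 31))
print(url)
```

Calling `fetch_json(url)` would then return the decoded response. It is left out of the sketch so that the example runs without network access.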
That’s it! Please let us know what you like or dislike about the new dashboard, and don’t hesitate to file bugs. This will help us move this alpha version to the beta stage.
Joseph Allemandou, Senior Software Engineer, Analytics
Wikimedia Foundation