Analyze how often, and in what way, a word or a regular expression is mentioned in Wikipedia pages across languages
The proposed scripts allow you to study the mentions of a certain word across languages using Wikipedia articles.
The data used are available on the Wikimedia Downloads page. Once on the site, you can choose the language and the date you are interested in (a dump contains Wikipedia data up to that date). For example, to get all the Italian articles, follow the link itwiki for a given date and download the dump itwiki-DATE-pages-articles-multistream.xml.bz2, where DATE is the date you are interested in. The dump contains articles, templates, media/file descriptions, and primary meta-pages. This data will be the corpus for your analysis.
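As a minimal sketch, the dump file name and its download URL can be assembled from a language code and a date. The helper names `dump_filename` and `dump_url` are hypothetical (they are not part of the scripts in this repository), and the URL layout assumes the usual `<wiki>/<date>/` directory structure of Wikimedia Downloads. The date in the usage note is only a placeholder.

```python
# Assumed base URL of Wikimedia Downloads.
DUMPS_ROOT = "https://dumps.wikimedia.org"


def dump_filename(lang: str, date: str) -> str:
    """Hypothetical helper: name of the multistream articles dump.

    `lang` is a Wikipedia language code such as "it" or "pt";
    `date` is the dump date written as YYYYMMDD.
    """
    return f"{lang}wiki-{date}-pages-articles-multistream.xml.bz2"


def dump_url(lang: str, date: str) -> str:
    """Hypothetical helper: full URL, assuming the <wiki>/<date>/ layout."""
    return f"{DUMPS_ROOT}/{lang}wiki/{date}/{dump_filename(lang, date)}"
```

Calling `dump_url("it", "20240101")` would give the address of the Italian dump for that (placeholder) date, provided a dump for that date actually exists on the server.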
If the analysis should take other factors into account, such as the page views of the articles, this link provides the page views for the whole Wikipedia corpus for each month since 2011. The documentation for this data is provided here.
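The following sketch shows one way the monthly page view counts could be read into a dictionary. It is not the code in pageviews.py: the function name `monthly_views` is hypothetical, and the line layout (`project title monthly_total ...`, separated by whitespace) is an assumption that should be checked against the documentation of the page view files you download.

```python
from collections import defaultdict
from typing import Dict, Iterable


def monthly_views(lines: Iterable[str], project: str) -> Dict[str, int]:
    """Hypothetical helper: sum the monthly views per page for one project.

    Assumes each line looks like "<project> <title> <monthly_total> ...".
    Lines that do not fit this layout are skipped.
    """
    views: Dict[str, int] = defaultdict(int)
    for line in lines:
        parts = line.split()
        if len(parts) < 3 or parts[0] != project:
            continue  # wrong project or malformed line
        try:
            count = int(parts[2])
        except ValueError:
            continue  # third field is not a number
        views[parts[1]] += count
    return dict(views)
```

Because the function only needs an iterable of strings, it can be fed an open text file directly, which avoids loading a whole month of data into memory at once.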
- wiki_parser.py: This script provides the code to parse the downloaded data. Detailed documentation is provided for each function. It gives as output:
  - The .json files contained in the Corpus directory, related to the example.
  - The
- helpers_parser.py: Gathers support functions for parsing the XML files.
- pageviews.py: Defines functions used to carry out analysis related to the page views of the articles of interest.
- across_languages.py: Contains functions to make comparisons across languages.
- plots.py: Gathers functions to draw plots.
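To give an idea of what the parsing step involves, here is a minimal sketch that streams a compressed dump and counts the matches of a regular expression in each page. It is an illustration under stated assumptions, not the implementation in wiki_parser.py: the names `local_name` and `count_mentions` are hypothetical, and the code assumes the standard MediaWiki export layout with `page`, `title` and `text` elements.

```python
import bz2
import re
import xml.etree.ElementTree as ET
from typing import BinaryIO, Dict, Optional, Pattern


def local_name(tag: str) -> str:
    """Strip the XML namespace, e.g. '{http://...}page' -> 'page'."""
    return tag.rsplit("}", 1)[-1]


def count_mentions(stream: BinaryIO, pattern: Pattern[str]) -> Dict[str, int]:
    """Hypothetical helper: map page title -> number of regex matches.

    Pages without any match are left out of the result.
    """
    mentions: Dict[str, int] = {}
    title: Optional[str] = None
    text = ""
    # iterparse reads the file incrementally instead of loading it whole.
    for _, elem in ET.iterparse(stream, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            n = sum(1 for _ in pattern.finditer(text))
            if n and title is not None:
                mentions[title] = n
            title, text = None, ""
            elem.clear()  # drop the page's children to limit memory use
    return mentions
```

A call could look like `count_mentions(bz2.open(path, "rb"), re.compile(r"Matteo Renzi"))`, where `path` points to a downloaded dump. Storing the resulting dictionary with `json.dump` would produce a file comparable in spirit to the .json output mentioned above, although the actual format used by the scripts may differ.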
Remark: Because the Notebook contains interactive plots, open this link to read it correctly.
The goal of the Notebook is to show, by example, how to use the implemented code, and to carry out a small analysis with 'Matteo Renzi' as the object of interest. We proceed with the following steps:
- Find all articles in Italian and Portuguese that mention Matteo Renzi.
- Rank them by how frequently they were viewed in November.
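The two steps above can be sketched as a join between the mention counts and the page view counts, followed by a sort. The function name `rank_by_views` is hypothetical and the input dictionaries are assumed to have the shapes `title -> mentions` and `title -> views`. The sketch also assumes that page view files write titles with underscores where the dump uses spaces.

```python
from typing import Dict, List, Tuple


def rank_by_views(
    mentions: Dict[str, int], views: Dict[str, int]
) -> List[Tuple[str, int, int]]:
    """Hypothetical helper: return (title, views, mentions), most viewed first.

    Articles missing from `views` are given 0 views.
    """
    rows = [
        # Assumption: view files use "_" where dump titles use " ".
        (title, views.get(title.replace(" ", "_"), 0), n)
        for title, n in mentions.items()
    ]
    return sorted(rows, key=lambda row: row[1], reverse=True)
```

Keeping the mention count in each row makes it easy to see, in the same table, whether heavily viewed articles are also the ones that mention the term most often.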
Then, to explore the data a bit further:
- Explore the differences between IT and PT in terms of numbers and plots. Are there distinct differences between the languages in terms of what kinds of articles mention Renzi? What's the distribution of number of Renzi mentions per article in IT vs. PT?
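For the last question, a simple starting point is to tabulate how many articles contain exactly k mentions in each language. The sketch below is not taken from across_languages.py: the names `mention_distribution` and `summarize` are hypothetical, and each input is assumed to be a `title -> mentions` dictionary for one language.

```python
from collections import Counter
from statistics import mean, median
from typing import Dict


def mention_distribution(mentions: Dict[str, int]) -> Counter:
    """Hypothetical helper: map k -> number of articles with k mentions."""
    return Counter(mentions.values())


def summarize(mentions: Dict[str, int]) -> Dict[str, float]:
    """Hypothetical helper: basic summary statistics for one language."""
    counts = list(mentions.values())
    if not counts:
        return {"articles": 0, "total_mentions": 0, "mean": 0, "median": 0}
    return {
        "articles": len(counts),
        "total_mentions": sum(counts),
        "mean": mean(counts),
        "median": median(counts),
    }
```

Running both functions on the Italian and the Portuguese dictionaries gives two distributions that can be compared directly or passed to a plotting function.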