We examine the Wikipedia archive to see which pages are the most linked to by other Wikipedia pages. At a high level, this is done by putting the pages into a directed graph and seeing which nodes have the most incoming edges.
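As a minimal sketch of that idea, using networkx (which the analysis step below already depends on) and a few made-up placeholder pages:

```python
import networkx as nx

# Toy digraph: an edge A -> B means "page A links to page B".
g = nx.DiGraph()
g.add_edges_from([
    ("Canada", "United States"),
    ("Mexico", "United States"),
    ("France", "Germany"),
])

# The most linked-to pages are simply the nodes with the highest in-degree.
print(sorted(g.in_degree(), key=lambda pair: pair[1], reverse=True))
# [('United States', 2), ('Germany', 1), ('Canada', 0), ('Mexico', 0), ('France', 0)]
```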
Rank | Page | Links |
---|---|---|
1 | United States | 403095 |
2 | Animal | 179816 |
3 | France | 149450 |
4 | Arthropod | 134051 |
5 | India | 131814 |
6 | World War II | 128344 |
7 | Insect | 127139 |
8 | Germany | 126493 |
9 | Canada | 123462 |
10 | United Kingdom | 116515 |
- Find the top 1000 in results/linkedto_top_thousand.txt.
- Find the top 1000000 in results/linkedto_top_million.txt.
- These steps may take a while, as the data is quite large.
- Ideally you will have around 100GB of hard drive space free. The analysis peaks at around 130GB of RAM, so the more memory you have available, the better.
Download dataset
- Download the Wikipedia dataset in XML format (enwiki-20170820-pages-articles.xml.bz2) from the Wikimedia dumps site (dumps.wikimedia.org).
- Note: you will need about 12GB free on your disk to complete this step.
Uncompress the data.
bunzip2 enwiki-20170820-pages-articles.xml.bz2
- Note: the unzipped file is 59GB.
Run parseScript.sh to extract just the page titles and links (a rough sketch of this kind of extraction is shown below).
./parseScript.sh enwiki-20170820-pages-articles.xml graph_input.txt
- The output is ~10GB.
- Links to other namespaces (File:, Category:, Template:, and so on) are removed, because they are not article pages.
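parseScript.sh itself is not reproduced here. The sketch below is a hypothetical stand-in that pulls page titles and [[wikilinks]] out of the uncompressed XML dump with plain regular expressions and skips namespaced links; the tab-separated output format is an assumption, and the real graph_input.txt may differ.

```python
import re
import sys

# Hypothetical stand-in for parseScript.sh.
# Assumption: output is one "source<TAB>target" record per link.
TITLE_RE = re.compile(r"<title>(.*?)</title>")
LINK_RE = re.compile(r"\[\[([^\]|#]+)")

def extract(xml_path, out_path):
    current_title = None
    with open(xml_path, encoding="utf-8") as src, open(out_path, "w", encoding="utf-8") as dst:
        for line in src:
            m = TITLE_RE.search(line)
            if m:
                # Remember which page the following links belong to.
                current_title = m.group(1)
                continue
            if current_title is None:
                continue
            for target in LINK_RE.findall(line):
                # Skip namespaced links (File:, Category:, Template:, ...),
                # which are not article pages.
                if ":" in target:
                    continue
                dst.write(f"{current_title}\t{target.strip()}\n")

if __name__ == "__main__":
    extract(sys.argv[1], sys.argv[2])
```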
Analyze the data
- This requires a great deal of RAM (over 16GB) or a ton of swap space.
- Install the dependencies (we only need networkx):
pip3 install networkx
- Run the command
./analysis.py graph_input.txt results/linkedto.txt
- The output of the analysis is written to results/linkedto.txt (a sketch of the kind of in-degree counting analysis.py performs is shown after this list).
- This thing is particularly memory hungry, and some care could be taken to improve the performance. The analysis peaked at around 130GB of RAM. Thankfully I have access to a 500GB RAM machine; doing this with swap files would be painful!
- There are still some namespaces remaining in the results. I removed the obvious ones, but some remain because of typos or because the namespace is too small to be worth filtering.
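analysis.py is not shown here. The sketch below is a minimal version of the in-degree counting it presumably performs, assuming graph_input.txt holds one tab-separated source/target pair per line and that the output is a ranked list; both formats are assumptions.

```python
#!/usr/bin/env python3
import sys
import networkx as nx

def main(graph_input, results_path):
    g = nx.DiGraph()
    with open(graph_input, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip("\n").split("\t")
            if len(parts) == 2:
                # An edge source -> target means "source links to target".
                g.add_edge(parts[0], parts[1])

    # Rank pages by number of incoming links (in-degree).
    ranked = sorted(g.in_degree(), key=lambda pair: pair[1], reverse=True)

    with open(results_path, "w", encoding="utf-8") as out:
        for rank, (page, links) in enumerate(ranked, start=1):
            out.write(f"{rank}\t{page}\t{links}\n")

if __name__ == "__main__":
    main(sys.argv[1], sys.argv[2])
```

Building the full DiGraph in memory is what drives the RAM usage; counting in-degrees with a plain dict, without ever materialising the edges, would be one way to shrink the footprint.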
Would be nice to have!