mazjindeel/wikipedia

Find the most interconnected Wikipedia pages

Maz Jindeel - 10/18

What are the most linked to Wikipedia pages?

We examine the Wikipedia archive to see which pages are linked to most often by other Wikipedia pages. At a high level, this is done by loading the pages into a digraph (one node per page, one edge per link) and finding the nodes with the most incoming edges.
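To make the idea concrete, here is a minimal sketch of counting incoming edges on a toy digraph. It uses only the standard library rather than networkx, and the link pairs are invented for illustration; it is not the code in analysis.py.

```python
from collections import defaultdict

# Toy (source page, target page) pairs. Illustrative data only.
links = [
    ("France", "United States"),
    ("Canada", "United States"),
    ("Canada", "United States"),  # repeated link, should count once
    ("Canada", "France"),
    ("Insect", "Animal"),
    ("Arthropod", "Animal"),
    ("Insect", "Arthropod"),
    ("India", "United States"),
]

# Adjacency sets: a repeated (source, target) pair is stored only once,
# which matches the behaviour of a simple directed graph.
successors = defaultdict(set)
for source, target in links:
    successors[source].add(target)

# The in-degree of a page is the number of distinct pages linking to it.
in_degree = defaultdict(int)
for source, targets in successors.items():
    for target in targets:
        in_degree[target] += 1

# Sort by descending in-degree, breaking ties alphabetically.
ranked = sorted(in_degree.items(), key=lambda item: (-item[1], item[0]))
for rank, (page, count) in enumerate(ranked[:3], start=1):
    print(rank, page, count)
```

With networkx, the same ranking would come from sorting the (node, degree) pairs of `DiGraph.in_degree()`.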

Results

Most Linked To Pages

Rank  Page            Links
1     United States   403095
2     Animal          179816
3     France          149450
4     Arthropod       134051
5     India           131814
6     World War II    128344
7     Insect          127139
8     Germany         126493
9     Canada          123462
10    United Kingdom  116515

Instructions:

  • These steps may take a while, as the data is quite large.
  • Ideally you will have around 100GB of hard drive space free. The analysis peaks at around 130GB of RAM, so the more memory you have available, the better.
  1. Download the dataset.

    • Download the Wikipedia dataset in XML here.
    • Note: you will need about 12GB free on your disk to complete this step.
  2. Uncompress the data.

    • bunzip2 enwiki-20170820-pages-articles.xml.bz2
    • Note: the uncompressed file is 59GB.
  3. Run parseScript.sh to extract just the titles and links.

    • ./parseScript.sh enwiki-20170820-pages-articles.xml graph_input.txt
    • The output is ~10GB.
    • Links to namespaces are removed, because they are not considered pages.
  4. Analyze the data.

    • This requires a great deal of RAM (over 16GB; the author's run peaked at around 130GB) or a large amount of swap space.
    • Install the dependencies (only networkx is needed): pip3 install networkx
    • Run the command ./analysis.py graph_input.txt results/linkedto.txt
    • See results/linkedto.txt for the output of the analysis.
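For readers curious what "extract just titles and links" involves in step 3, the sketch below pulls link targets out of a piece of wikitext and drops namespace links. It is a Python illustration, not a port of parseScript.sh: the regular expression, the `extract_links` helper, and the namespace list are all assumptions made for the example, and the list is only a small subset of Wikipedia's namespaces.

```python
import re

# Captures the target of [[Target]], [[Target|label]] and [[Target#Section]].
WIKILINK = re.compile(r"\[\[([^\[\]|#]+)")

# Illustrative subset of namespace prefixes (an assumption, not the
# list used by parseScript.sh).
NAMESPACES = {"file", "image", "category", "template", "wikipedia", "help", "portal"}


def extract_links(wikitext):
    """Return the page titles linked from wikitext, skipping namespace links."""
    links = []
    for match in WIKILINK.finditer(wikitext):
        # A leading colon, as in [[:Category:Insects]], is link syntax,
        # not part of the title.
        target = match.group(1).strip().lstrip(":")
        prefix, sep, _ = target.partition(":")
        if sep and prefix.strip().lower() in NAMESPACES:
            continue  # namespace link, not an article
        if target:
            links.append(target)
    return links


sample = (
    "[[France]] borders [[Germany|the Federal Republic]]. "
    "[[File:Map.png|thumb]] See [[World War II#Aftermath]]."
)
print(extract_links(sample))
```

A fixed prefix list like this one misses misspelled or rarely used namespaces, so some of them can still end up in the output.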

Future work / TODO

Performance:

  • The analysis is particularly memory hungry, and some care could be taken to improve its performance. This project peaked at around 130GB of RAM. Thankfully I have access to a machine with 500GB of RAM; doing this with swap files would be painful!
  • There are still some namespaces remaining in the results. I removed the obvious ones, but some remain, either because of typos or because the namespace is too small to be worth filtering.
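One possible direction for the memory problem, offered as a sketch rather than a tested replacement for analysis.py: since only incoming-link counts are needed, the input can be streamed line by line into a counter instead of being loaded into a full graph. The input format assumed here (one page per line, title first, then its link targets, all tab-separated) is hypothetical and may not match graph_input.txt; `top_linked` is an illustrative name.

```python
import io
from collections import Counter


def top_linked(lines, n=10):
    """Stream 'title<TAB>target<TAB>target...' lines and return the n
    most linked-to targets as (page, count) pairs."""
    counts = Counter()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        targets = fields[1:]  # fields[0] is the linking page's title
        # set() so a page that links to the same target twice counts once.
        counts.update({target for target in targets if target})
    return counts.most_common(n)


# Toy input in the assumed format. Illustrative data only.
toy_input = io.StringIO(
    "France\tUnited States\tGermany\tUnited States\n"
    "Canada\tUnited States\tFrance\n"
    "India\tUnited States\tGermany\n"
)
result = top_linked(toy_input, n=2)
print(result)
```

Memory use then grows with the number of distinct link targets rather than with the number of edges. On a real file the call would look like `top_linked(open("graph_input.txt"))`, assuming the file follows the format above.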

Would be nice to have!
