6 Using Wikidata as KB

Diego Moussallem edited this page Jun 1, 2018 · 16 revisions

Like Wikipedia, Wikidata covers many languages. Unlike DBpedia, however, it assigns each resource a single, language-independent identifier. For example, Barack Obama has the identifier Q76 (https://www.wikidata.org/entity/Q76) regardless of the language in which the resource is described. What differs per language are the labels, so after downloading the data we remove the labels in the languages we do not need. The steps for creating the Wikidata index are as follows:

1) Download the data from:

https://dumps.wikimedia.org/wikidatawiki/entities

We recommend downloading the file in N-Triples (nt) format.

2) Remove the language-tagged labels you do not need. The example below deletes every line that carries one of the listed language tags and keeps English as the preferred language.

sed '/@de/d;/@fr/d;/@it/d;/@eo/d;/@pl/d;/@ru/d;/@ja/d;/@zh/d;/@es/d;/@nl/d;/@af/d;/@an/d;/@ar/d;/@arz/d;/@ast/d;/@az/d;/@bar/d;/@be/d;/@bg/d;/@br/d;/@bs/d;/@ca/d;/@cdo/d;/@cs/d;/@cv/d;/@cy/d;/@da/d;/@diq/d;/@dsb/d;/@el/d;/@et/d;/@eu/d;/@ext/d;/@fa/d;/@fi/d;/@fo/d;/@fy/d;/@ga/d;/@gd/d;/@gl/d;/@gu/d;/@gv/d;/@he/d;/@hi/d;/@hr/d;/@hsb/d;/@ht/d;/@hu/d;/@hy/d;/@ia/d;/@id/d;/@ilo/d;/@io/d;/@is/d;/@jv/d;/@ka/d;/@km/d;/@kn/d;/@ko/d;/@ku/d;/@kw/d;/@la/d;/@lb/d;/@lij/d;/@ln/d;/@lt/d;/@lv/d;/@ml/d;/@mn/d;/@mr/d;/@ms/d;/@mt/d;/@nds-nl/d;/@nn/d;/@nrm/d;/@oc/d;/@os/d;/@pms/d;/@pnb/d;/@pt/d;/@qu/d;/@rm/d;/@rmy/d;/@ro/d;/@scn/d;/@sco/d;/@sh/d;/@sk/d;/@sl/d;/@so/d;/@sq/d;/@sr/d;/@su/d;/@sv/d;/@ta/d;/@tet/d;/@tg/d;/@th/d;/@tl/d;/@tpi/d;/@tr/d;/@tt/d;/@uk/d;/@ur/d;/@vec/d;/@vi/d;/@war/d;/@xal/d;/@yi/d;/@yo/d;/@zea/d;/@nb/d;/@pt-br/d;/@yue/d;/@ang/d;/@bn/d;/@nap/d;/@be-tarask/d;/@nan/d;/@nov/d;/@pa/d;/@ie/d;/@stq/d;/@hak/d;/@li/d;/@am/d;/@ba/d;/@uz/d;/@kk/d;/@sc/d;/@en-gb/d;/@en-ca/d;/@mzn/d;/@ne/d;/@gom/d;/@gsw/d;/@ceb/d;/@lmo/d;/@bho/d;/@te/d;/@sw/d;/@si/d;/@gom-latn/d;/@gom-deva/d' downloaded-data.nt > wikidata-en.nt

Our index creator only handles Turtle (ttl) files, so the best option is to convert the file from nt to ttl with the following command: rapper -g file.nt -o turtle > file.ttl. Alternatively, change our index creator (https://github.com/dice-group/AGDISTIS/blob/master/src/main/java/org/aksw/agdistis/util/TripleIndexCreator.java#L140) to accept N-Triples files. Which option you choose is up to you and depends on how powerful your machine is.
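The sed filter in step 2 matches a tag such as @de anywhere on a line, so it can also drop triples whose literal merely contains that substring (an e-mail address, for instance), and it only removes the languages that are listed. The following is a minimal sketch of an alternative, not part of AGDISTIS: it inspects only the language tag at the end of each triple and keeps a line unless that tag is something other than en. The function name filter_english is hypothetical, and the sketch assumes one triple per line ending in a space and a dot, as in N-Triples dumps.

```shell
# Keep every triple except those whose object literal carries a
# language tag other than "en" (regional variants such as en-gb are dropped too).
filter_english() {
  awk '
    # A language-tagged literal ends the line as: "..."@tag .
    match($0, /"@[A-Za-z]+(-[A-Za-z0-9]+)* *\. *$/) {
      tag = substr($0, RSTART + 2, RLENGTH - 2)  # text after the "@ marker
      sub(/ *\. *$/, "", tag)                    # strip the trailing " ."
      if (tag != "en") next                      # skip non-English labels
    }
    { print }                                    # untagged triples pass through
  '
}
```

It reads from standard input and writes to standard output, for example: filter_english < downloaded-data.nt > wikidata-en.nt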

3) Run the index creator Java class using the values from the properties file below.

You can also download the pre-built index directly from our server and run it locally, but first you need to follow these steps.

  1. Wikidata does not use rdf:type as the property to indicate a type; it uses http://www.wikidata.org/prop/direct/P31. Thus, you need to change the predicate used for the type search in https://github.com/dice-group/AGDISTIS/blob/master/src/main/java/org/aksw/agdistis/algorithm/DomainWhiteLister.java#L41 before using the whitelist parameter.

  2. You need to include the types that you want to find. For example, Person is http://www.wikidata.org/entity/Q5 and Organization is http://www.wikidata.org/entity/Q43229. For Location there are a variety of types, but you can start with http://www.wikidata.org/entity/Q515 and http://www.wikidata.org/entity/Q6256.

  3. You can run it using the following command: mvn clean package tomcat:run -DskipTests
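As an illustration of item 2, the sketch below writes the four type URIs named above into a whitelist file. It assumes that the whitelist holds one type URI per line; check this against the whiteList.txt shipped with your AGDISTIS checkout. The function name write_whitelist and the target path are hypothetical.

```shell
# Write a starter whitelist with the Person, Organization and Location
# types mentioned above (assumed format: one URI per line).
write_whitelist() {
  cat > "$1" <<'EOF'
http://www.wikidata.org/entity/Q5
http://www.wikidata.org/entity/Q43229
http://www.wikidata.org/entity/Q515
http://www.wikidata.org/entity/Q6256
EOF
}
```

For example, write_whitelist whiteList.txt creates the file in the current directory; copy it to the location that the whiteList property points to.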

Get the data via: wget http://hobbitdata.informatik.uni-leipzig.de/agdistis/wikidata/new_index_wikidata_en.zip

The agdistis.properties file follows.

index=index_wikidata_en
index2=index_bycontext

#used to prune edges
nodeType=http://www.wikidata.org/entity/
edgeType=http://www.wikidata.org/prop/direct/
baseURI =http://www.wikidata.org
#SPARQL endpoint to retrieve domain and range information
endpoint=https://query.wikidata.org/
#this is the trigram distance between words, default = 3
ngramDistance=3
#exploration depth of semantic disambiguation graph
maxDepth=2
#threshold for cutting off similar strings
threshholdTrigram=0.87
#heuristicExpansionOn controls whether simple co-occurrence resolution is done or not, e.g., Barack => Barack Obama if both are in the same text
heuristicExpansionOn=true
#list of entity domains and corporationAffixes
whiteList=/config/whiteList.txt
corporationAffixes=/config/corporationAffixes.txt

#Activate popularity
popularity=false

#Choose a graph-based algorithm "hits" or "pagerank"
algorithm=hits

#Enable search by context
context=false

#Enable search by acronym
acronym=false

#Enable to find common entities
commonEntities=false

# IMPORTANT for creating your own index

folderWithTTLFiles=/Users/diegomoussallem/Desktop/AGDISTIS-WIKIDATA/wikidata/
surfaceFormTSV=

You can test your running AGDISTIS via:

curl --data-urlencode "text='<entity>Barack Obama</entity>.'" -d type='agdistis' http://localhost:8080/AGDISTIS
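For repeated tests, the request above can be wrapped in two small helpers. This is only a sketch: the names mark and agdistis_link and the AGDISTIS_URL environment variable are hypothetical conveniences, and the request itself is the same curl call as above.

```shell
# Wrap a mention in the <entity> tags used by the request above.
mark() {
  printf '<entity>%s</entity>' "$1"
}

# Send an annotated text to a running AGDISTIS instance.
# AGDISTIS_URL is a hypothetical override; the default is the local URL used above.
agdistis_link() {
  curl -s --data-urlencode "text='$1'" -d type='agdistis' \
    "${AGDISTIS_URL:-http://localhost:8080/AGDISTIS}"
}
```

With the server running, a call could look like: agdistis_link "$(mark 'Barack Obama') visited $(mark 'Leipzig')."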

Pruning the Wikidata

The Wikidata dump does not include the class hierarchy of its ontology, which affects the performance of graph-based disambiguation algorithms during linking. A typical problem: in the dump, Leipzig University is only a university and not an organization, so you have to include all sub-classes of ORG in MAG's whiteList=/config/whiteList.txt to link it correctly. To save our users this effort, we queried the list ourselves. The list containing all PER, ORG and LOC classes and sub-classes can be found at https://hobbitdata.informatik.uni-leipzig.de/agdistis/wikidata/wikidata-per-loc-org.tsv.
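If you need sub-classes for a type that the list does not cover, they can be retrieved from the public Wikidata query service. The sketch below is one possible way to do this and not necessarily how the list above was produced. It uses the Wikidata property P279 (subclass of) with a transitive path; the function names are hypothetical, and large hierarchies may run into the query service's time limit.

```shell
# Print a SPARQL query that selects a class and all of its transitive sub-classes.
# $1 is a Wikidata item ID such as Q43229 (organization).
subclass_query() {
  printf 'SELECT DISTINCT ?class WHERE { ?class wdt:P279* wd:%s . }\n' "$1"
}

# Fetch the class URIs, one per line (needs network access).
# The service returns TSV with a header row and URIs in angle brackets;
# sed removes the header and the brackets.
fetch_subclasses() {
  curl -sG 'https://query.wikidata.org/sparql' \
    -H 'Accept: text/tab-separated-values' \
    --data-urlencode "query=$(subclass_query "$1")" |
    sed '1d; s/^<//; s/>$//'
}
```

For example, fetch_subclasses Q43229 >> whiteList.txt would append the organization classes to a whitelist file, assuming the whitelist holds one URI per line.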

Disclaimer

The dumps differ substantially from the online version. In the dumps, predicates are also represented as entities, which changes the topology of the Wikidata graph and may therefore affect the results of a disambiguation process.