This repository contains all the code used for the WebIsALOD paper.
Hypernymy relations are an important asset in many applications, and a central ingredient of Semantic Web ontologies. The IsA database is a large collection of such hypernymy relations extracted from the Common Crawl. In this paper, we introduce WebIsALOD, a Linked Open Data version of the IsA database, containing 11.7M hypernymy relations, each provided with rich provenance information. As the original dataset contained more than 80% wrong, noisy extractions, we run a machine learning algorithm to assign confidence scores to the individual statements.
All files whose names start with a number are used to generate the CSV files, the mappings, and the N-Quads. The files starting with mTurk are HTML surveys used to generate the ground truth. Files named "webisa_{threshold}_sample_results" contain the samples for the corresponding thresholds, together with the majority vote and the answer of each worker. webisa_1_sentence_results.csv contains the results of the mapping to Wikipedia pages and categories.
Most of the CSV files are structured as follows:
- id
- instance
- class
- frequency
- pidspread
- pldspread
- ipremod
- ilemma
- ipostmod
- cpremod
- clemma
- cpostmod
- pids
- plds
- provids
- majority voting
- yes (counts)
- uncertain (counts)
- no (counts)
- mapping instance to dbpedia page (json array)
- mapping instance to dbpedia category (json array)
- mapping class to dbpedia page (json array)
- mapping class to dbpedia category (json array)
- mapping instance to yago (string)
- mapping class to yago (string)
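
The column list above can be turned into a small reader. The following is a minimal sketch, not the code used in the repository: it assumes the files are comma-separated, have no header row, and follow exactly the column order listed above. The snake_case column names, the `read_rows` helper, and the sample row are made up for illustration.

```python
import csv
import io
import json

# Assumed column order, taken from the list above.
# The snake_case names are illustrative, not taken from the files.
COLUMNS = [
    "id", "instance", "class", "frequency", "pidspread", "pldspread",
    "ipremod", "ilemma", "ipostmod", "cpremod", "clemma", "cpostmod",
    "pids", "plds", "provids", "majority_voting",
    "yes", "uncertain", "no",
    "instance_dbpedia_page", "instance_dbpedia_category",
    "class_dbpedia_page", "class_dbpedia_category",
    "instance_yago", "class_yago",
]
# The four DBpedia mapping columns are described as JSON arrays.
JSON_COLUMNS = COLUMNS[19:23]
# Columns treated as integers here (an assumption about their content).
COUNT_COLUMNS = ("frequency", "pidspread", "pldspread", "yes", "uncertain", "no")


def read_rows(handle, delimiter=","):
    """Yield one dict per CSV line, decoding JSON arrays and integer counts."""
    for values in csv.reader(handle, delimiter=delimiter):
        row = dict(zip(COLUMNS, values))
        for name in JSON_COLUMNS:
            # Empty or missing cells become an empty list.
            row[name] = json.loads(row[name]) if row.get(name) else []
        for name in COUNT_COLUMNS:
            # Leave non-numeric or missing cells untouched.
            if row.get(name, "").strip().isdigit():
                row[name] = int(row[name])
        yield row


# Made-up sample row, only to show the call; real values will differ.
sample_values = [
    "1", "apple", "fruit", "12", "3", "5",
    "", "apple", "", "", "fruit", "",
    "", "", "", "yes", "3", "0", "0",
    '["http://dbpedia.org/resource/Apple"]', "[]",
    '["http://dbpedia.org/resource/Fruit"]', "[]",
    "", "",
]
buffer = io.StringIO()
csv.writer(buffer).writerow(sample_values)
buffer.seek(0)

rows = list(read_rows(buffer))
print(rows[0]["instance"], rows[0]["class"], rows[0]["instance_dbpedia_page"])
```

If a file does have a header row or uses a different delimiter, skip the first line or pass another `delimiter`. Files that only share part of this layout would need a shorter `COLUMNS` list.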