The scripts in this folder create the Wikidata-derived gazetteers that are used to link Quick's Chronology to Wikidata.
Run the code in the following order (from the `wikidata/` directory):
- Extract all location entries from a Wikidata dump (uncomment the `Parse all WikiData` section, i.e. rows 372-400; warning: this step takes about 2 full days). See here for more information:

  ```bash
  python entity_extraction.py
  ```

- Create the Wikidata gazetteers. See here for more information:

  ```bash
  python create_gazetteers.py
  ```

- Expand the alternate names. See here for more information:

  ```bash
  python extend_altnames.py
  ```
The following sections provide further information on each of the steps.
Script `entity_extraction.py` extracts locations from Wikidata (and their relevant properties). This script is partially based on https://akbaritabar.netlify.app/how_to_use_a_wikidata_dump. It assumes that you have already downloaded a full Wikidata dump from here (as described in the resources readme); we assume the downloaded `.bz2` file is stored under `station-to-station/resources/wikidata/`.
Before running the script, you will have to uncomment the `Parse all WikiData` section (rows 372-400). Beware that this step will take about 2 full days.
The output is in the form of `.csv` files that will be created under `../resources/wikidata/extracted/`, each containing 5,000 rows corresponding to geographical entities extracted from Wikidata, with the following fields (corresponding to Wikidata properties, e.g. `P7959` for historical county; a description of each can be found as comments in the code):

```
'wikidata_id', 'english_label', 'instance_of', 'description_set', 'alias_dict', 'nativelabel', 'population_dict', 'area', 'hcounties', 'date_opening', 'date_closing', 'inception_date', 'dissolved_date', 'follows', 'replaces', 'adm_regions', 'countries', 'continents', 'capital_of', 'borders', 'near_water', 'latitude', 'longitude', 'wikititle', 'geonamesIDs', 'toIDs', 'vchIDs', 'vob_placeIDs', 'vob_unitIDs', 'epns', 'os_grid_ref', 'connectswith', 'street_address', 'adjacent_stations', 'ukrailcode', 'connectline', 'connectservice', 'getty', 'heritage_designation', 'ownedby', 'postal_code', 'street_located'
```
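The chunked output described above can be sketched as follows. This is a minimal illustration only: `write_in_chunks`, the file-naming scheme, and the reduced field list are hypothetical, and the Wikidata parsing itself is omitted.

```python
import csv
import os

# Reduced field list for illustration; the real files carry all fields listed above.
FIELDS = ["wikidata_id", "english_label", "latitude", "longitude"]

def write_in_chunks(rows, out_dir, chunk_size=5000):
    """Write dict rows to numbered CSV files of at most chunk_size rows each."""
    os.makedirs(out_dir, exist_ok=True)
    for i in range(0, len(rows), chunk_size):
        path = os.path.join(out_dir, f"entities_{i // chunk_size}.csv")
        with open(path, "w", newline="", encoding="utf-8") as f:
            writer = csv.DictWriter(f, fieldnames=FIELDS)
            writer.writeheader()
            writer.writerows(rows[i:i + chunk_size])
```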
The `feature_exploration.ipynb` notebook allows you to explore Wikidata entries and their features for specific Wikidata records. It is not part of the pipeline.
Run `create_gazetteers.py` to create the Wikidata gazetteers. This will produce three different gazetteers:
- Approximate UK gazetteer (point i)
- GB gazetteer (point ii)
- GB stations gazetteer (point iii)
The following subsections describe how they are created.
In this step, we create an approximate subset of those entities that are in the UK today, to have a more manageable dataset. At this stage we favour recall (we want to make sure all relevant entities are there, at the expense of precision; we will favour precision at a later point). We perform this filtering as follows: we keep Wikidata entities whose coordinates fall within a very approximate bounding box enclosing the UK.

The result is stored as `../processed/wikidata/uk_approx_gazetteer.csv`.
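The recall-oriented bounding-box filter can be sketched as below, assuming the entities sit in a pandas DataFrame with `latitude` and `longitude` columns. The box coordinates here are illustrative, not the exact values used in `create_gazetteers.py`.

```python
import pandas as pd

# Very approximate bounding box enclosing the UK (illustrative values).
LAT_MIN, LAT_MAX = 49.0, 61.0
LON_MIN, LON_MAX = -9.0, 2.5

def filter_uk_approx(df: pd.DataFrame) -> pd.DataFrame:
    """Keep rows whose coordinates fall inside the approximate UK box."""
    mask = (
        df["latitude"].between(LAT_MIN, LAT_MAX)
        & df["longitude"].between(LON_MIN, LON_MAX)
    )
    return df[mask]
```

Being a box rather than a shape, this keeps some non-UK entities (e.g. parts of Ireland and northern France), which is intentional at this recall-first stage.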
This step creates a strict GB gazetteer using a GB shapefile (the resources readme describes how to obtain it), by filtering out all locations that are not contained within the polygons described in the shapefile.
The result is stored as `../processed/wikidata/gb_gazetteer.csv`.
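The core of this step is a point-in-polygon test against the shapefile polygons. The real script would use a geospatial library to read the shapefile; as a self-contained illustration of the underlying test only, here is a pure-Python ray-casting check against a single polygon given as `(lon, lat)` vertices (the function name and representation are assumptions for this sketch).

```python
def point_in_polygon(lon, lat, polygon):
    """Ray-casting test: is (lon, lat) inside the polygon?

    `polygon` is a list of (lon, lat) vertices. An odd number of edge
    crossings along a horizontal ray means the point is inside.
    """
    inside = False
    j = len(polygon) - 1
    for i in range(len(polygon)):
        xi, yi = polygon[i]
        xj, yj = polygon[j]
        # Does the edge (i, j) straddle the point's latitude, and does the
        # crossing lie to the right of the point's longitude?
        if ((yi > lat) != (yj > lat)) and (
            lon < (xj - xi) * (lat - yi) / (yj - yi) + xi
        ):
            inside = not inside
        j = i
    return inside
```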
In this step, we create a further subset of those entries in the UK that are either instances of station-related classes (manually specified; see the full list here) or whose English label contains the words `station`, `stop`, or `halt`, not preceded by typical non-railway modifiers such as 'police', 'signal', 'power', 'lifeboat', 'pumping', or 'transmitting'.

The result is stored as `../processed/wikidata/gb_stations_gazetteer.csv`.
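The label-based half of this filter can be sketched with a regular expression, one negative lookbehind per excluded modifier (the function name is hypothetical, and the class-based half of the filter is not shown):

```python
import re

# Modifiers that signal a non-railway "station" (from the description above).
NON_RAILWAY = ["police", "signal", "power", "lifeboat", "pumping", "transmitting"]

# Match 'station', 'stop', or 'halt' unless immediately preceded by one of
# the excluded modifier words.
PATTERN = re.compile(
    r"(?<!" + r" )(?<!".join(NON_RAILWAY) + r" )\b(station|stop|halt)\b",
    re.IGNORECASE,
)

def looks_like_railway_station(label: str) -> bool:
    """True if the label contains a station keyword with no excluded modifier."""
    return bool(PATTERN.search(label))
```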
Run `extend_altnames.py` to extend the altnames of the GB gazetteer and the GB stations gazetteer, and to create altname-centric gazetteers (i.e. instead of WikidataID-centric).
The altnames come from the following sources:
- Wikidata `alias_dict`, `english_label`, and `nativelabel` fields.
- Geonames alternate names.
- WikiGazetteer alternate names.
This process results in two different dataframes:

- `../processed/wikidata/altname_gb_gazetteer.tsv` is the expanded altname-centric version of `gb_gazetteer`,
- `../processed/wikidata/altname_gb_stations_gazetteer.tsv` is the altname-centric version of `gb_stations_gazetteer`.
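The reshaping from a WikidataID-centric to an altname-centric table can be sketched as follows. The column names and the `(altname, source)` list representation are assumptions for this illustration, not the real script's internals.

```python
import pandas as pd

def to_altname_centric(df: pd.DataFrame) -> pd.DataFrame:
    """Emit one row per (wkid, altname, source) from a WikidataID-centric
    frame whose 'altnames' column holds a list of (altname, source) pairs."""
    records = []
    for _, row in df.iterrows():
        for altname, source in row["altnames"]:
            records.append(
                {
                    "wkid": row["wkid"],
                    "altname": altname,
                    "source": source,
                    "lat": row["lat"],
                    "lon": row["lon"],
                }
            )
    return pd.DataFrame(records)
```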
See some rows of the GB stations altname-centric gazetteer:

|    | wkid | altname | source | lat | lon |
|---|---|---|---|---|---|
| 29 | Q23070582 | Conway Marsh railway station | english_label | 53.286861 | -3.85256 |
| 30 | Q23070582 | Conway Morfa railway station | wikigaz | 53.286861 | -3.85256 |
| 31 | Q2178092 | Deganwy Railway Station | geonames | 53.295000 | -3.83300 |
| 32 | Q2178092 | Deganwy station | wikigaz | 53.295000 | -3.83300 |