A compilation of gazetteers for processing named entities in unstructured and semi-structured text.
These gazetteers are a work in progress and are subject to ongoing change, addition, deletion, and improvement.
The gazetteers are a collection of text files. Each text file represents a class of object, person, or thing (generally the classification is quite loose).
Each line in a file represents a type of thing; synonyms appear on the same line, separated by commas (`,`).
The format of these text files is designed to be readily ingested by the Baleen Entity Extraction Text Processor. For more information see the related documentation.
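The line format described above (one entry per line, synonyms separated by commas) can be read with a few lines of Python. This is a minimal sketch, not the loader Baleen itself uses; the function names and the sample terms are hypothetical and chosen only for illustration.

```python
def parse_gazetteer_line(line):
    """Split one gazetteer line into a list of synonym terms."""
    # Synonyms share a line and are separated by commas.
    terms = [term.strip() for term in line.split(",")]
    # Drop empty fragments left by trailing or doubled commas.
    return [term for term in terms if term]


def parse_gazetteer(text):
    """Return one list of synonyms per non-empty line of gazetteer text."""
    entries = []
    for line in text.splitlines():
        terms = parse_gazetteer_line(line)
        if terms:
            entries.append(terms)
    return entries


# Hypothetical sample content, not taken from any file in this repository.
sample = (
    "United Nations, UN\n"
    "World Health Organization, WHO, World Health Organisation\n"
)
entries = parse_gazetteer(sample)
print(entries)
```

To read a real file, pass its contents in, for example `parse_gazetteer(open(path, encoding="utf-8").read())`. The UTF-8 encoding is an assumption.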
NOTE: These are largely simple flat lists. Synonyms are not comprehensively noted, and there is likely to be duplication. Do not rely on the way terms are grouped in the gazetteer files for classification, de-duplication or synonym relationships.
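Because duplication is likely, it can help to check which terms appear on more than one line before using a file. The sketch below assumes the entries have already been parsed into lists of synonyms; the function name and sample data are hypothetical, and the case-insensitive comparison is one possible choice rather than a rule of the format.

```python
from collections import defaultdict


def find_duplicate_terms(entries):
    """Return terms that occur in more than one entry.

    The result maps each case-folded term to the sorted indices of the
    entries (lines) that contain it.
    """
    seen = defaultdict(set)
    for index, terms in enumerate(entries):
        for term in terms:
            # Compare case-insensitively so "UN" and "un" count as one term.
            seen[term.casefold()].add(index)
    return {term: sorted(idx) for term, idx in seen.items() if len(idx) > 1}


# Hypothetical entries, one list of synonyms per gazetteer line.
entries = [["United Nations", "UN"], ["UN", "U.N."], ["World Bank"]]
duplicates = find_duplicate_terms(entries)
print(duplicates)
```

Here the term `un` is reported for entries 0 and 1. Whether such lines should be merged is a judgement call the check does not make for you.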
Gazetteers are grouped in folders which indicate their source (see below). Within each source, the files are grouped by the entity types described by the Baleen Type System.
The gazetteers are built from a range of sources. See below.
We have included the code (usually Python notebooks, but also some .xls files) that we use to build and cleanse the data to produce these files.
The Scraper Chrome extension is very useful and intuitive.
Gazetteers are grouped in folders which indicate their source. The file name preserves part of the original URL. In many cases the licence of the data is inherited from the source. These are described below.
source | folder | licence | description |
---|---|---|---|
Aleph Insights | ./gazetteers/source_aleph | CC BY-SA licence | Produced by us, often manually. |
Wikipedia | ./gazetteers/source_wikipedia | CC BY-SA licence | Scraped or manually pulled from various Wikipedia pages. The file name preserves the page name, so you can trace the source. Last capture: 20 Oct 2016. |
GDELT | ./gazetteers/source_gdelt | GDELT BY | GDELT terms and taxonomies for organisations etc. Edited text files. Last capture: 24 Oct 2016. |
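
Since the licence depends on the source folder, it can be useful to list the gazetteer files grouped by source before using them. This is a rough sketch under two assumptions that the table does not confirm: that the source folders all match `source_*` under `./gazetteers`, and that gazetteer files use a `.txt` extension. The function name is hypothetical.

```python
from pathlib import Path


def list_gazetteers_by_source(root):
    """Group gazetteer file paths by the source_* folder that contains them."""
    by_source = {}
    for source_dir in sorted(Path(root).glob("source_*")):
        if not source_dir.is_dir():
            continue
        # Files sit in entity-type subfolders, so search recursively.
        # The ".txt" extension is an assumption; adjust it if needed.
        files = sorted(p for p in source_dir.rglob("*.txt") if p.is_file())
        by_source[source_dir.name] = files
    return by_source


if __name__ == "__main__":
    for source, files in list_gazetteers_by_source("./gazetteers").items():
        print(source, len(files))
```

Each key of the result (for example `source_wikipedia`) can then be matched against the licence column in the table above.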
Make sure you respect the specific licences relating to the gazetteers which come from specific sources. See the table above.
The Aleph Insights produced gazetteers are provided under the CC BY-SA licence.