rdeoliveira/geocorpora

Overview

This website contains two corpora of geographic referring expressions, such as "the south-east coast of Aberdeenshire":

  • A text-only corpus [xlsx]: Semantically annotated expressions of various languages, domains and audiences.
  • A data-and-text corpus [zip]: Semantically annotated expressions of the meteorological domain, aligned with the GIS data representations of what the expressions refer to.

Text-only corpus

This data set is an Excel spreadsheet. It contains 671 referring expressions such as:

all North Negro tributaries

Each row contains one corpus instance (one expression) and the following annotation tags:

  • Domain: The domain or topic of the document from which the expression originated. It can be river, route or weather.
  • Audience: The target audience of the document from which the expression originated. It can be canoeing, driving, fishing or general.
  • Language: The language of the document from which the expression originated. It can be English, Portuguese or Spanish.
  • Country: The country where the document was published. It can be Brazil, Canada, Canada & USA, Colombia, UK or USA.
  • Frames of Reference: The semantic concepts that the expression uses. For example, if "north" appears in the expression, the frame of reference Direction (DIR) is annotated for it. The full names of the frame abbreviations are given in the legend tab.
  • Frame count: The number of different semantic frames used in the expression.

The sheet also contains a stats tab, showing basic statistics of frame of reference use within the corpus.
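As a rough illustration of how the annotation columns relate to each other, the sketch below derives a frame count and a per-frame usage tally from the Frames of Reference cell. It rests on assumptions that are not confirmed by the spreadsheet: the column name "Expression", the comma separator between frame abbreviations, and the tag PLACEHOLDER are all hypothetical, and the second row's annotation is made up for the example. Only the first expression and the DIR abbreviation come from the description above. Loading the real sheet (for instance with pandas.read_excel) is left out on purpose.

```python
from collections import Counter

# Hypothetical rows. "Expression" is an assumed column name; the tag
# PLACEHOLDER and the second row's annotation are invented for illustration.
rows = [
    {"Expression": "all North Negro tributaries", "Frames of Reference": "DIR"},
    {"Expression": "the south-east coast of Aberdeenshire",
     "Frames of Reference": "DIR, PLACEHOLDER"},
]


def parse_frames(cell, sep=","):
    """Split a Frames of Reference cell into a list of frame abbreviations.

    The separator is an assumption; adjust it to match the actual sheet.
    """
    return [tag.strip() for tag in cell.split(sep) if tag.strip()]


def frame_count(cell, sep=","):
    """Number of different frames in one cell (duplicates count once)."""
    return len(set(parse_frames(cell, sep)))


def frame_usage(rows, column="Frames of Reference"):
    """Count in how many expressions each frame abbreviation occurs."""
    usage = Counter()
    for row in rows:
        usage.update(set(parse_frames(row[column])))
    return usage


usage = frame_usage(rows)
```

A tally like `usage` is the kind of figure the stats tab reports, although the tab's exact contents may differ from this sketch.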

Read this article for a contextualised explanation of the data:

  • Rodrigo de Oliveira, Somayajulu Sripada and Ehud Reiter (2015). Designing an Algorithm for Generating Named Spatial References. ENLG 2015, 127. [pdf]

Data-and-text corpus

This data set contains several files in 2 formats:

  • GeoJSON: Each file represents a binary weather forecast scenario (places are either dry or experiencing precipitation) and contains 1 or more geographic expressions. Each expression is semantically annotated with frames of reference and aligned with specific regions of the represented geography (the central region of Scotland).
  • PDF: Essentially the same data as above, but for human reading.

You can view the GeoJSON files in any Leaflet-based web map such as http://geojson.io/. Simply paste the contents of a file into the designated area and the data should be plotted automatically.

The GeoJSON files use the standard GeoJSON format (as described here) and have the following structure:

features
  feature
    geometry
      coordinates
    properties
      marker-symbol
      expression
      marker-color
      labels
full-text

Where:

  • A feature is a subregion of the region.
  • Each feature has a geometry, which is essentially an array of lon-lat (not lat-lon!) coordinates.
  • Each feature that represents a subregion for which an expression exists in the text has:
    • A specific number (marker-symbol).
    • A specific colour (marker-color).
    • An expression in English (expression).
    • A set of semantic tags (labels).
  • full-text is the original text from which the expressions originate.

features always also contains an array of coordinates representing the subregion that the text does not refer to: the distractors.
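The structure above can be explored with a few lines of Python. The following is a minimal sketch, not the authors' tooling: the sample document is a made-up placeholder, and it assumes that geometries are polygons, that labels is a list of strings, that marker-symbol is a number, and that distractor features simply lack an expression property. None of these details are confirmed by the files themselves, so check a real file before relying on them.

```python
import json

# Made-up sample following the structure described above. The coordinates,
# text, colour and the shape of "labels" are placeholders, not corpus data.
SAMPLE = """
{
  "type": "FeatureCollection",
  "full-text": "Placeholder forecast text.",
  "features": [
    {
      "type": "Feature",
      "geometry": {
        "type": "Polygon",
        "coordinates": [[[-4.0, 56.0], [-3.5, 56.0], [-3.5, 56.5], [-4.0, 56.0]]]
      },
      "properties": {
        "marker-symbol": 1,
        "marker-color": "#ff0000",
        "expression": "placeholder expression",
        "labels": ["DIR"]
      }
    },
    {
      "type": "Feature",
      "geometry": {
        "type": "Polygon",
        "coordinates": [[[-3.5, 56.0], [-3.0, 56.0], [-3.0, 56.5], [-3.5, 56.0]]]
      },
      "properties": {}
    }
  ]
}
"""


def split_features(doc):
    """Separate features that carry an expression from the distractors.

    Assumption: a distractor has no "expression" property.
    """
    referred, distractors = [], []
    for feature in doc.get("features", []):
        props = feature.get("properties") or {}
        if props.get("expression"):
            referred.append(feature)
        else:
            distractors.append(feature)
    return referred, distractors


def to_lat_lon(coords):
    """Recursively swap GeoJSON lon-lat positions into lat-lon pairs."""
    if coords and isinstance(coords[0], (int, float)):
        return [coords[1], coords[0]]
    return [to_lat_lon(item) for item in coords]


doc = json.loads(SAMPLE)
referred, distractors = split_features(doc)
```

The `to_lat_lon` helper is only needed when passing coordinates to a tool that expects lat-lon order; GeoJSON itself stores lon-lat, as noted above.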

Read this article for a contextualised explanation of the data:

  • Rodrigo de Oliveira, Somayajulu Sripada and Ehud Reiter (2016). Absolute and Relative Properties in Geographic Referring Expressions. The 9th International Natural Language Generation Conference (INLG 2016). [pdf]

Please contact [email protected] should you have any questions.
