Nutigeodb format

Jaak Laineste edited this page Nov 30, 2017 · 1 revision

About

Nutigeodb files are used for offline geocoding and reverse geocoding. At a high level, they are SQLite database files whose structure is designed to give reasonably fast geocoding performance and compact file sizes.

Entity types

A Nutigeodb database can store various geographic entities, including countries, states, streets, and addresses. Each entity may have multiple entity tags associated with it:

| Type | Id |
|---|---|
| Country | 1 |
| Region | 2 |
| County | 3 |
| Locality | 4 |
| Neighbourhood | 5 |
| Street | 6 |
| Postcode | 7 |
| Name | 8 |
| Housenumber | 9 |

Database structure

This section lists the tables used by the format.

Metadata

The metadata table stores both metadata and information related to conversion or storage. It contains two fields:

| Field | Type |
|---|---|
| name | TEXT |
| value | TEXT |

The following table lists the supported 'name' values:

| Name | Description |
|---|---|
| version | Version of the database, currently 1 |
| rank_scale | The scale of the entity ranks, described below |
| translation_table | Token translation table, comma-separated key-value pairs ('A:a,B:b') |
| bounds | WGS84 bounds encoded as MIN_LON,MIN_LAT,MAX_LON,MAX_LAT |
| origin | WGS84 origin point for geometry (encoded as LON,LAT); geometry is stored relative to this point |
| encoding_precision | The multiplier used when storing coordinates |
| quadindex_level | The last zoom level stored in the quadindex, described below in more detail |
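As a rough illustration of how these values might be read, the sketch below loads the name/value pairs into a dictionary and parses the comma-separated fields. It assumes the table is literally named `metadata`; the helper names and all sample values are made up for the example and are not taken from a real Nutigeodb file.

```python
import sqlite3

def read_metadata(conn):
    """Return the metadata table as a dict of name -> value strings."""
    # Assumption: the table is called 'metadata'.
    return dict(conn.execute("SELECT name, value FROM metadata").fetchall())

def parse_translation_table(text):
    """Parse 'A:a,B:b' into {'A': 'a', 'B': 'b'}."""
    table = {}
    for pair in text.split(","):
        if not pair:
            continue
        key, _, value = pair.partition(":")
        table[key] = value
    return table

def parse_floats(text):
    """Parse a comma-separated list of numbers such as bounds or origin."""
    return [float(part) for part in text.split(",")]

# Demo with an in-memory database and invented sample values.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE metadata (name TEXT, value TEXT)")
conn.executemany("INSERT INTO metadata VALUES (?, ?)", [
    ("version", "1"),
    ("translation_table", "A:a,B:b"),
    ("bounds", "21.5,57.5,28.5,59.9"),
    ("origin", "25.0,58.5"),
    ("encoding_precision", "1000000"),
])

meta = read_metadata(conn)
translation = parse_translation_table(meta["translation_table"])
min_lon, min_lat, max_lon, max_lat = parse_floats(meta["bounds"])
origin_lon, origin_lat = parse_floats(meta["origin"])
precision = float(meta["encoding_precision"])
```

The parser does not handle translation keys or values that themselves contain ',' or ':'; how (or whether) the format escapes those is not covered on this page.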

Entities

The entities table stores geographic entities such as addresses and streets. It contains the following fields:

| Field | Type | Description |
|---|---|---|
| id | INTEGER | Unique id for the entity (note: not the OSM id!) |
| type | INTEGER | Entity type id, as listed under Entity types (for example 1=country, 2=region, 3=county; 8 is used for POIs and 9 for addresses) |
| features | BLOB | List of features (ids, geometry) encoded with a custom TinyWKB-like encoder, described below |
| housenumbers | BLOB | NULL for non-addresses; for addresses, a string of house numbers separated by \| |
| quadindex | INTEGER | Special quadtree node id for reverse geocoding, described below |
| rank | INTEGER | Relative rank (importance) of the entity. The scale is stored as 'rank_scale' in the metadata table |

One database row may include multiple addresses. In that case the number of features must equal the number of housenumbers.
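A minimal sketch of how a reader might split the housenumbers column and check it against the feature count follows. The function names are hypothetical, and decoding the BLOB as UTF-8 is an assumption, since this page does not state the text encoding of the column.

```python
def split_housenumbers(value):
    """Split the housenumbers column into a list; NULL (None) yields []."""
    if value is None:
        return []
    # Assumption: BLOB content is UTF-8 text.
    text = value.decode("utf-8") if isinstance(value, bytes) else value
    return text.split("|")

def check_row(housenumbers, feature_count):
    """Return the house numbers, verifying they match the feature count."""
    numbers = split_housenumbers(housenumbers)
    if numbers and len(numbers) != feature_count:
        raise ValueError(
            f"{len(numbers)} housenumbers but {feature_count} features")
    return numbers

numbers = check_row(b"1|3|5A", 3)
```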

The names of entity components are stored in the entitynames table, which contains pairs of (entity_id, name_id) values.

Names

The following table structure is used to store all entity names (and localized versions):

| Field | Type | Description |
|---|---|---|
| id | INTEGER | Unique id for the name |
| lang | TEXT | Locality/language of the name, can be NULL |
| name | TEXT | Name string |
| type | INTEGER | Entity types this name refers to |

The relation between tokens (defined next) and names is stored in the nametokens table, which contains (name_id, token_id) pairs.

Tokens

Tokens are sequences of characters used for resolving names. Tokens are normalized (converted to lower case, with diacritical marks dropped) and do not contain any separators or whitespace symbols. The following table structure is used for tokens:

| Field | Type | Description |
|---|---|---|
| id | INTEGER | Unique id for the token |
| token | TEXT | Token string value |
| idf | REAL | Token Inverse Document Frequency |
| typemask | INTEGER | The bitmask of entity types this token refers to |
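To show how the tables described so far relate to each other, here is a sketch that resolves a normalized token to entities by joining tokens, nametokens, names, entitynames and entities. The table names are assumed to be the lowercase section headings, the DDL is a guess reconstructed from the field lists, and the rows are invented test data. A real geocoder would also use idf, typemask and multi-token matching, which are left out here.

```python
import sqlite3

# Assumed schema: reconstructed from the field tables; the real DDL may differ.
SCHEMA = """
CREATE TABLE entities (id INTEGER PRIMARY KEY, type INTEGER, features BLOB,
                       housenumbers BLOB, quadindex INTEGER, rank INTEGER);
CREATE TABLE entitynames (entity_id INTEGER, name_id INTEGER);
CREATE TABLE names (id INTEGER PRIMARY KEY, lang TEXT, name TEXT, type INTEGER);
CREATE TABLE nametokens (name_id INTEGER, token_id INTEGER);
CREATE TABLE tokens (id INTEGER PRIMARY KEY, token TEXT, idf REAL,
                     typemask INTEGER);
"""

# token -> names containing it -> entities carrying those names
LOOKUP = """
SELECT e.id, n.name
FROM tokens t
JOIN nametokens nt ON nt.token_id = t.id
JOIN names n ON n.id = nt.name_id
JOIN entitynames en ON en.name_id = n.id
JOIN entities e ON e.id = en.entity_id
WHERE t.token = ?
ORDER BY e.rank DESC, e.id
"""

def find_entities(conn, token):
    """Return (entity_id, name) rows whose name contains the given token."""
    return conn.execute(LOOKUP, (token,)).fetchall()

# Invented sample data: a street (type 6) and a locality (type 4).
conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.executemany("INSERT INTO entities (id, type, rank) VALUES (?, ?, ?)",
                 [(1, 6, 10), (2, 4, 50)])
conn.executemany("INSERT INTO names (id, name, type) VALUES (?, ?, ?)",
                 [(1, "Tartu maantee", 6), (2, "Tartu", 4)])
conn.executemany("INSERT INTO entitynames VALUES (?, ?)", [(1, 1), (2, 2)])
conn.executemany("INSERT INTO tokens (id, token) VALUES (?, ?)",
                 [(1, "tartu"), (2, "maantee")])
conn.executemany("INSERT INTO nametokens VALUES (?, ?)",
                 [(1, 1), (1, 2), (2, 1)])

matches = find_entities(conn, "tartu")
```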

Here the idf field is calculated as log(totalTokenCount / thisTokenCount), where totalTokenCount is the count of all tokens in all names and thisTokenCount is the number of occurrences of this token in all names. Note that the stored token is the normalized token (converted to lowercase, with symbols translated according to the translation_table stored in metadata).
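The normalization and idf formula above can be sketched as follows. Several details are assumptions, because this page does not specify them: names are split into tokens on whitespace, the translation table is applied per character before lowercasing, diacritics are dropped via Unicode NFKD decomposition, and the logarithm is the natural one. The function names are hypothetical.

```python
import math
import unicodedata
from collections import Counter

def normalize_token(token, translation=None):
    """Apply translation pairs, lowercase, and drop diacritical marks."""
    translation = translation or {}
    translated = "".join(translation.get(ch, ch) for ch in token)
    decomposed = unicodedata.normalize("NFKD", translated.lower())
    # Combining characters are the diacritical marks split off by NFKD.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def compute_idf(names, translation=None):
    """Return {token: log(totalTokenCount / thisTokenCount)}."""
    counts = Counter(
        normalize_token(word, translation)
        for name in names
        for word in name.split()  # assumption: whitespace tokenization
    )
    total = sum(counts.values())
    return {token: math.log(total / count) for token, count in counts.items()}

# 5 tokens in total: tartu x2, maantee x2, parnu x1
idf = compute_idf(["Tartu maantee", "Pärnu maantee", "Tartu"])
```

With this formula, rare tokens receive a higher idf than common ones, so they carry more weight when names are matched.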

Feature encoding

Features (geometry with optional metadata, similar to GeoJSON) are encoded as bytestreams using base-128 varint encoding, similar to Google protobuf. Unicode strings are converted to UTF-8 bytestrings and stored as an explicit byte length (a varint) followed by the UTF-8 bytes.
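For readers unfamiliar with protobuf-style varints, the sketch below shows the usual scheme: seven payload bits per byte, least significant group first, with the high bit set on every byte except the last. It covers non-negative integers and length-prefixed strings only. The function names are illustrative, and whether Nutigeodb matches protobuf byte for byte is an assumption based on the description above.

```python
def encode_varint(value):
    """Encode a non-negative integer as a protobuf-style varint."""
    if value < 0:
        raise ValueError("this sketch handles non-negative integers only")
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)         # last byte, high bit clear
            return bytes(out)

def decode_varint(data, pos=0):
    """Decode a varint at data[pos:]; return (value, next_pos)."""
    result = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7

def encode_string(text):
    """Encode a string as varint byte length followed by UTF-8 bytes."""
    raw = text.encode("utf-8")
    return encode_varint(len(raw)) + raw

def decode_string(data, pos=0):
    """Decode a length-prefixed UTF-8 string; return (text, next_pos)."""
    length, pos = decode_varint(data, pos)
    return data[pos:pos + length].decode("utf-8"), pos + length

encoded = encode_string("Tõnu")  # 'õ' takes two bytes, so the length is 5
```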

Feature collections are encoded as follows:

| Field | Type | Description |
|---|---|---|
| n | varint | Number of features |
| features | Feature*n | List of features |

Each feature is encoded as follows:

| Field | Type | Description |
|---|---|---|
| id | varint | Delta-encoded relative to previous id |
| geometry | Geometry | Geometry of the feature |
| n | varint | Number of properties |
| properties | Property*n | List of properties |

Geometry is encoded as follows:

| Field | Type | Description |
|---|---|---|
| type | varint | 1=Point, 2=MultiPoint, 3=LineString, 4=MultiLineString, 5=Polygon, 6=MultiPolygon, 7=Collection |
| coords/rings | ... | Encoding depends on type |

The encoding of coordinates or rings is 'natural': for list types, the number of elements is stored first as a varint, followed by the list of elements. All coordinates are stored as integers: each component is first multiplied by the value of encoding_precision (stored in the metadata table) and then delta-encoded relative to the previous coordinate.
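The quantization and delta steps can be illustrated on plain integer pairs, without the byte-level encoding. This sketch assumes that the metadata origin is subtracted before scaling, that values are rounded to the nearest integer, and that the first coordinate is a delta from the origin. None of these details, nor how negative deltas are mapped onto varints, is specified on this page, and the function names are hypothetical.

```python
def quantize_and_delta(coords, precision, origin=(0.0, 0.0)):
    """Turn (lon, lat) pairs into integer deltas from the previous point."""
    prev_x, prev_y = 0, 0
    deltas = []
    for lon, lat in coords:
        # Assumption: coordinates are taken relative to the origin first.
        x = round((lon - origin[0]) * precision)
        y = round((lat - origin[1]) * precision)
        deltas.append((x - prev_x, y - prev_y))
        prev_x, prev_y = x, y
    return deltas

def undelta_and_scale(deltas, precision, origin=(0.0, 0.0)):
    """Inverse of quantize_and_delta, up to quantization error."""
    x = y = 0
    coords = []
    for dx, dy in deltas:
        x += dx
        y += dy
        coords.append((origin[0] + x / precision, origin[1] + y / precision))
    return coords

line = [(24.5, 59.25), (24.5005, 59.2498)]
deltas = quantize_and_delta(line, precision=10000, origin=(24.0, 59.0))
restored = undelta_and_scale(deltas, precision=10000, origin=(24.0, 59.0))
```

Because neighbouring vertices are usually close together, the deltas after the first point are small numbers, which varints store in one or two bytes.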

Properties are encoded as pairs of (name, value), where name is a string and each value is encoded as follows:

| Field | Type | Description |
|---|---|---|
| type | varint | 0=Null, 1=Bool, 2=Int, 3=Float, 4=String |
| data | ... | Encoding depends on type |

Booleans are stored as varints containing either 0 or 1, and Ints are stored as varints. Floats are stored using big-endian 32-bit IEEE 754 encoding. Strings use the length-prefixed UTF-8 encoding described at the start of this section.
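A decoder for a single property could look roughly like the sketch below. It follows the type tags listed above, but treats Ints as non-negative varints, because the handling of negative values is not described here. All function names are hypothetical, and the byte strings in the demo are hand-built examples rather than data from a real file.

```python
import struct

def read_varint(data, pos):
    """Decode a base-128 varint; return (value, next_pos)."""
    result = shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7

def read_string(data, pos):
    """Decode a varint length followed by that many UTF-8 bytes."""
    length, pos = read_varint(data, pos)
    return data[pos:pos + length].decode("utf-8"), pos + length

def read_value(data, pos):
    """Decode a typed value: 0=Null, 1=Bool, 2=Int, 3=Float, 4=String."""
    kind, pos = read_varint(data, pos)
    if kind == 0:
        return None, pos
    if kind == 1:
        flag, pos = read_varint(data, pos)
        return bool(flag), pos
    if kind == 2:
        return read_varint(data, pos)  # assumption: non-negative only
    if kind == 3:
        (number,) = struct.unpack_from(">f", data, pos)  # big-endian float32
        return number, pos + 4
    if kind == 4:
        return read_string(data, pos)
    raise ValueError(f"unknown value type {kind}")

def read_property(data, pos=0):
    """Decode a (name, value) pair; return ((name, value), next_pos)."""
    name, pos = read_string(data, pos)
    value, pos = read_value(data, pos)
    return (name, value), pos

string_prop = read_property(b"\x04name\x04\x03Bar")
float_prop = read_property(b"\x01h\x03" + struct.pack(">f", 1.5))
bool_prop = read_property(b"\x01b\x01\x01")
```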

Quadindex encoding

The geocoding database stores a compact 64-bit spatial index for fast reverse geocoding requests. Space is represented internally as a quadtree down to a fixed level (stored as quadindex_level in metadata). Each quadtree node is represented as a 64-bit integer as follows:

(((y << zoom) | x) << 5) | zoom
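The expression packs zoom into the lowest 5 bits, x into the next zoom bits, and y above that. A small sketch of packing and unpacking follows. The function names are illustrative, and x and y are assumed to lie in the range 0 to 2^zoom - 1, which the formula needs in order to be reversible.

```python
def encode_quadindex(x, y, zoom):
    """Pack a quadtree node (x, y, zoom) into a single integer."""
    return (((y << zoom) | x) << 5) | zoom

def decode_quadindex(index):
    """Unpack an integer produced by encode_quadindex into (x, y, zoom)."""
    zoom = index & 0x1F               # lowest 5 bits
    xy = index >> 5
    x = xy & ((1 << zoom) - 1)        # next `zoom` bits
    y = xy >> zoom                    # remaining high bits
    return x, y, zoom

index = encode_quadindex(3, 2, 2)
```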

Each geometry is assigned to the smallest node that fully covers it. To find the geometry nearest to a location point, we first find the smallest node that contains the point and then move upwards through the parent nodes until we reach the root node. At each node visited, we query the database using the quadindex of that node.
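The assignment and the upward walk can be sketched on node coordinates alone. The code assumes the usual quadtree convention that the parent of node (x, y) at level zoom is (x >> 1, y >> 1) at level zoom - 1. How a WGS84 position is mapped to x and y is not described on this page, so the sketch starts from node coordinates, and the function names are hypothetical.

```python
def quadindex(x, y, zoom):
    """Pack a quadtree node into the integer form given above."""
    return (((y << zoom) | x) << 5) | zoom

def covering_node(x0, y0, x1, y1, level):
    """Smallest node containing both corner nodes, given at `level`."""
    while (x0, y0) != (x1, y1):
        # Move both corners to their parents until they coincide.
        x0, y0, x1, y1 = x0 >> 1, y0 >> 1, x1 >> 1, y1 >> 1
        level -= 1
    return x0, y0, level

def ancestor_indices(x, y, zoom):
    """Quadindex values from node (x, y, zoom) up to the root, inclusive."""
    indices = []
    while True:
        indices.append(quadindex(x, y, zoom))
        if zoom == 0:
            return indices
        x, y, zoom = x >> 1, y >> 1, zoom - 1

node = covering_node(4, 2, 5, 3, 3)   # corners of a geometry's bounding box
lookups = ancestor_indices(5, 3, 3)   # node containing the query point
```

Each value in `lookups` could then be used in a query along the lines of `SELECT id, features FROM entities WHERE quadindex = ?`, with the candidates collected from all levels compared by distance to the query point.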