Nutigeodb format

About

Nutigeodb files are used for offline geocoding and reverse geocoding. At a high level, they are SQLite database files with a carefully designed structure that provides reasonably fast geocoding and compact file sizes.

Entity types

A Nutigeodb database can store various geographic entities, including countries, states, streets and addresses. Each entity may have multiple entity tags associated with it. The tag types and their ids are:

Type	Id
Country	1
Region	2
County	3
Locality	4
Neighbourhood	5
Street	6
Postcode	7
Name	8
Housenumber	9

Database structure

This section lists the tables used by the format.

Metadata

The metadata table stores both metadata and conversion- or storage-related information. The table contains two fields:

Field	Type
name	TEXT
value	TEXT

The following table lists the supported 'name' values:

Name	Description
version	Version of the database, currently 1
rank_scale	The scale of the entity ranks, described below
translation_table	Token translation table, comma-separated key-value pairs ('A:a,B:b')
bounds	WGS84 bounds encoded as MIN_LON,MIN_LAT,MAX_LON,MAX_LAT
origin	WGS84 origin point for geometry (encoded as LON,LAT), geometry is stored relative to this
encoding_precision	The multiplier used when storing coordinates
quadindex_level	The last zoom level stored in the quadindex. Described below in more detail.
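
As a minimal sketch of how these values might be read, the snippet below loads the name/value pairs with Python's standard sqlite3 module and parses a few of them. The table name `metadata` and all sample values are assumptions made for illustration; only the field names and value formats come from the tables above.

```python
import sqlite3


def read_metadata(conn):
    """Return the metadata table as a dict of name -> value strings."""
    # Assumption: the table is literally named "metadata".
    return dict(conn.execute("SELECT name, value FROM metadata"))


def parse_translation_table(value):
    """Parse 'A:a,B:b' into {'A': 'a', 'B': 'b'}."""
    # Caveat: this simple split cannot represent ':' or ',' as keys.
    pairs = (item.split(":", 1) for item in value.split(",") if item)
    return {key: replacement for key, replacement in pairs}


def parse_floats(value):
    """Parse comma-separated numbers such as bounds or origin."""
    return [float(part) for part in value.split(",")]


# Build a tiny in-memory database with made-up values for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE metadata (name TEXT, value TEXT)")
conn.executemany(
    "INSERT INTO metadata VALUES (?, ?)",
    [
        ("version", "1"),
        ("translation_table", "A:a,B:b"),
        ("bounds", "21.5,57.5,28.5,59.9"),
        ("origin", "25.0,58.7"),
    ],
)

meta = read_metadata(conn)
translation = parse_translation_table(meta["translation_table"])
min_lon, min_lat, max_lon, max_lat = parse_floats(meta["bounds"])
origin_lon, origin_lat = parse_floats(meta["origin"])
```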

Entities

The entities table stores geographic entities, including addresses, streets and so on. It contains the following fields:

Field	Type	Description
id	INTEGER	Unique id for the entity (note: not OSM_id!)
type	INTEGER	Type (1=country, 2=region, 3=county, 4=locality, 5=neighbourhood, 6=street, 8=POI, 9=address)
features	BLOB	List of features (ids, geometry) encoded with custom TinyWKB-like encoder, described below
housenumbers	BLOB	NULL for non-addresses, string with | as a separator for addresses
quadindex	INTEGER	Special quadtree node id for reverse geocoding. Described below.
rank	INTEGER	Relative rank (importance) of the entity. The scale is stored as 'rank_scale' in the metadata table

One database row may include multiple addresses. In that case the number of features and housenumbers must be equal.

The names of entity components are stored in the entitynames table, which contains pairs of (entity_id, name_id) values.

Names

The following table structure is used to store all entity names (and localized versions):

Field	Type	Description
id	INTEGER	Unique id for the name
lang	TEXT	Locale/language of the name, can be NULL
name	TEXT	Name string
type	INTEGER	Entity types this name refers to

The relation between tokens (defined next) and names is stored in the nametokens table, which contains (name_id, token_id) pairs.
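
To illustrate how the two link tables connect a token to its entities, here is a hedged sketch of a lookup query. Only `entitynames` and `nametokens` are named in the text. The table names `entities`, `names` and `tokens` are assumptions, the schema is reduced to the columns used, and the sample rows are made up.

```python
import sqlite3

# Assumed table names: entities, names, tokens. Columns are a reduced subset.
SCHEMA = """
CREATE TABLE entities (id INTEGER PRIMARY KEY, type INTEGER, rank INTEGER);
CREATE TABLE names (id INTEGER PRIMARY KEY, lang TEXT, name TEXT, type INTEGER);
CREATE TABLE tokens (id INTEGER PRIMARY KEY, token TEXT, idf REAL, typemask INTEGER);
CREATE TABLE entitynames (entity_id INTEGER, name_id INTEGER);
CREATE TABLE nametokens (name_id INTEGER, token_id INTEGER);
"""

# Walk token -> name -> entity through the two link tables.
LOOKUP = """
SELECT DISTINCT e.id, n.name, e.rank
FROM tokens t
JOIN nametokens nt ON nt.token_id = t.id
JOIN names n ON n.id = nt.name_id
JOIN entitynames en ON en.name_id = n.id
JOIN entities e ON e.id = en.entity_id
WHERE t.token = ?
ORDER BY e.rank DESC
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
# Made-up sample data: one street entity named "Main Street".
conn.execute("INSERT INTO entities VALUES (10, 6, 5)")
conn.execute("INSERT INTO names VALUES (1, NULL, 'Main Street', 6)")
conn.execute("INSERT INTO tokens VALUES (100, 'main', 0.5, 0)")
conn.execute("INSERT INTO tokens VALUES (101, 'street', 0.1, 0)")
conn.execute("INSERT INTO entitynames VALUES (10, 1)")
conn.executemany("INSERT INTO nametokens VALUES (?, ?)", [(1, 100), (1, 101)])

# The token passed in must already be normalized (lower case, no diacritics).
matches = conn.execute(LOOKUP, ("main",)).fetchall()
```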

Tokens

Tokens are sequences of characters used for resolving names. Tokens are normalized (converted to lower case, with diacritical marks dropped) and do not contain any separators or whitespace symbols. The following table structure is used for tokens:

Field	Type	Description
id	INTEGER	Unique id for the token
token	TEXT	Token string value
idf	REAL	Token Inverse Document Frequency
typemask	INTEGER	The bitmask of entity types this token refers to

Here the idf field is calculated as log(totalTokenCount / thisTokenCount), where totalTokenCount is the count of all tokens in all names and thisTokenCount is the number of occurrences of this token in all names. Note that the token stored is the normalized token (converted to lower case, with symbols translated according to the translation_table stored in metadata).
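
The idf formula can be sketched as follows. The natural logarithm is an assumption (the text only says "log"), and the function name and sample names are hypothetical.

```python
import math
from collections import Counter


def compute_idf(names_tokens):
    """Map each token to log(totalTokenCount / thisTokenCount).

    names_tokens is a list of names, each given as a list of normalized tokens.
    """
    counts = Counter(token for tokens in names_tokens for token in tokens)
    total = sum(counts.values())  # count of all tokens in all names
    # Assumption: natural logarithm; the base is not stated in the format text.
    return {token: math.log(total / count) for token, count in counts.items()}


names = [["main", "street"], ["high", "street"], ["main", "square"]]
idf = compute_idf(names)
```

Rare tokens such as "high" receive a larger idf than common ones such as "street", so they weigh more when matching names.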

Feature encoding

Features (geometry with optional metadata, similar to GeoJSON) are encoded as bytestreams using a base-128 varint encoding similar to that of Google protobuf. Unicode strings are converted to UTF-8 bytestrings and stored as an explicit UTF-8 length (a varint) followed by the UTF-8 bytes.
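
A minimal sketch of these two primitives follows, assuming the varints use the protobuf-style base-128 layout (7 payload bits per byte, least significant group first, high bit set on every byte except the last). The function names are hypothetical, and only non-negative values are handled.

```python
def encode_varint(value):
    """Encode a non-negative integer as a base-128 varint."""
    if value < 0:
        raise ValueError("this sketch only handles non-negative values")
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def decode_varint(data, pos=0):
    """Decode a varint at data[pos:], returning (value, next_pos)."""
    result = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def encode_string(text):
    """Encode a string as varint byte length followed by UTF-8 bytes."""
    raw = text.encode("utf-8")
    return encode_varint(len(raw)) + raw


def decode_string(data, pos=0):
    """Decode a length-prefixed UTF-8 string, returning (text, next_pos)."""
    length, pos = decode_varint(data, pos)
    return data[pos:pos + length].decode("utf-8"), pos + length
```

Note that the length prefix counts UTF-8 bytes, not characters, so a name with diacritics has a prefix larger than its character count.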

Feature collections are encoded as follows:

Field	Type	Description
n	varint	Number of features
features	Feature*n	List of features

Each feature is encoded as follows:

Field	Type	Description
id	varint	Delta-encoded relative to previous id
geometry	Geometry	Geometry of the feature
n	varint	Number of properties
properties	Property*n	List of properties

Geometry is encoded as follows:

Field	Type	Description
type	varint	1=Point, 2=MultiPoint, 3=LineString, 4=MultiLineString, 5=Polygon, 6=MultiPolygon, 7=Collection
coords/rings	...	Encoding depends on type

The encoding of coordinates or rings is 'natural': for list types, the number of elements is stored first as a varint, followed by the elements themselves. All coordinates are stored as integers: each component is first multiplied by the value of encoding_precision (stored in the metadata table) and then delta-encoded relative to the previous coordinate.
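
The quantization and delta steps can be sketched as below. Several details are assumptions, not confirmed by the text: that the metadata origin is subtracted before scaling, that values are rounded to the nearest integer, and that the first coordinate is a delta from (0, 0). How negative deltas are mapped to varints is not specified here, so the sketch stops at the integer deltas.

```python
def delta_encode(coords, origin, precision):
    """Return integer (dx, dy) deltas for a list of (lon, lat) pairs."""
    deltas = []
    prev_x, prev_y = 0, 0  # assumption: the first delta is relative to (0, 0)
    for lon, lat in coords:
        # Assumption: origin is subtracted first, then the result is scaled.
        x = round((lon - origin[0]) * precision)
        y = round((lat - origin[1]) * precision)
        deltas.append((x - prev_x, y - prev_y))
        prev_x, prev_y = x, y
    return deltas


def delta_decode(deltas, origin, precision):
    """Invert delta_encode, returning approximate (lon, lat) pairs."""
    coords = []
    x, y = 0, 0
    for dx, dy in deltas:
        x += dx
        y += dy
        coords.append((origin[0] + x / precision, origin[1] + y / precision))
    return coords


origin = (25.0, 58.0)
precision = 1000000
line = [(25.001, 58.002), (25.0015, 58.0015)]
deltas = delta_encode(line, origin, precision)
restored = delta_decode(deltas, origin, precision)
```

Because neighbouring vertices are usually close together, the deltas stay small and each fits in one or two varint bytes.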

Properties are encoded as pairs of (name, value), where name is a string and each value is encoded as follows:

Field	Type	Description
type	varint	0=Null, 1=Bool, 2=Int, 3=Float, 4=String
data	...	Encoding depends on type

Booleans are stored as varints containing either 0 or 1. Ints are stored as varints. Floats are stored using big-endian 32-bit IEEE 754 encoding. Strings use the length-prefixed UTF-8 encoding described above.

Quadindex encoding

The geocoding database stores a compact 64-bit spatial index for fast reverse geocoding requests. The space is represented internally as a quadtree up to a fixed level (stored as quadindex_level in metadata). Each quadtree node is represented as a 64-bit integer as follows:

(((y << zoom) | x) << 5) | zoom
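
The expression above packs the zoom level into the lowest 5 bits, the x tile index into the next zoom bits, and the y tile index above that. A small sketch of packing and unpacking, with hypothetical function names:

```python
def encode_quadindex(x, y, zoom):
    """Pack tile coordinates and zoom level into a single node id."""
    return (((y << zoom) | x) << 5) | zoom


def decode_quadindex(node_id):
    """Unpack a node id into (x, y, zoom)."""
    zoom = node_id & 0x1F            # lowest 5 bits hold the zoom level
    packed = node_id >> 5
    x = packed & ((1 << zoom) - 1)   # next `zoom` bits hold x
    y = packed >> zoom               # remaining bits hold y
    return x, y, zoom


node = encode_quadindex(3, 5, 4)
```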

Each geometry is assigned to the smallest node that fully covers the geometry. To query the nearest geometry for a given location point, we first find the smallest node that contains the point and then move upwards through the parent nodes until we reach the root node. At each node encountered, we query the database using that node's quadindex.
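
The upward walk can be sketched as follows, starting from an already known tile (x, y, zoom). The sketch assumes the usual quadtree convention that the parent of tile (x, y) at level zoom is (x >> 1, y >> 1) at level zoom - 1. How a location point maps to tile coordinates is not covered here, and the function names are hypothetical.

```python
def encode_quadindex(x, y, zoom):
    """Pack tile coordinates and zoom level into a single node id."""
    return (((y << zoom) | x) << 5) | zoom


def ancestor_chain(x, y, zoom):
    """Yield node ids from tile (x, y, zoom) up to and including the root."""
    while True:
        yield encode_quadindex(x, y, zoom)
        if zoom == 0:
            return
        # Assumption: standard quadtree parent, halving both tile indices.
        x, y, zoom = x >> 1, y >> 1, zoom - 1


chain = list(ancestor_chain(3, 5, 3))
```

Each id in the chain would then be matched against the quadindex column of the entities table, so a lookup starting at the deepest level costs at most quadindex_level + 1 indexed queries.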
