Data structures

Directory and file structure

TiedotDatabase         # A database "TiedotDatabase"
├── CollectionA        # A collection called "CollectionA"
│   ├── Book!Author!Name   # An index on path "Book" -> "Author" -> "Name"
│   │   ├── 0                  # Index data partition 0
│   │   └── 1                  # Index data partition 1
│   ├── dat_0              # Document data partition 0
│   ├── dat_1              # Document data partition 1
│   ├── id_0               # Document ID lookup table for partition 0
│   └── id_1               # Document ID lookup table for partition 1
├── CollectionB        # Another collection called "CollectionB"
│   ├── Day!Temperature!High
│   │   ├── 0
│   │   └── 1
│   ├── dat_0
│   ├── dat_1
│   ├── id_0
│   └── id_1
└── number_of_partitions

Data file structure

Collection data file contains document data. Every document has a binary header and UTF-8 text content. The file has an initial size (32MB) and will grow beyond the initial size (by 32MB incrementally) to fit more documents.

New documents are inserted to end-of-data position, and they are left with room for future updates and size growth. Every document is assigned to a randomly generated, practically unique document ID, which also decides into which partition the document goes.

Updating document usually happens in-place, however if there is not pre-allocated enough room for the updated version, the document has to be deleted and re-inserted; document ID remains the same.

Deleted documents are marked as deleted, the wasted space is recovered in the next scrub operation.

Document format on disk

Type	Size (bytes)	Description
Byte (8 bit signed integer)	1	Validity	0 - deleted, 1 - valid
Signed 64-bit integer	10	Allocated room	How much room is left for the document
Char Array	Size of document content	Document content	Encoded in UTF-8
Char Array	Allocated room - size of document	Padding (UTF-8 spaces)	Room for future updates, for the document to grow its size

Index hash table file structure

Hash table file contains binary content; it implements a static hash table made of hash buckets and integer entries.

Every bucket has a fixed number of entries. When a bucket becomes full, a new bucket is chained to it in order to store more entries. Every entry has an integer key and value.

An entry key may have multiple values assigned to it, however the combination of entry key and value must be unique across the entire hash table.

Bucket format on disk

Type	Size (bytes)	Description
Signed 64-bit integer	10	Next chained bucket number	When a bucket is the last in its chain, this number is 0.
Bucket Entry	21 * number of entries per bucket	Bucket entries	See "Bucket entry format"

Bucket entry format

Type	Size (bytes)	Description
Byte (signed 8-bit integer)	1	Validity	0 - deleted, 1 - valid
Signed 64-bit integer	10	Key	Entry key
Signed 64-bit integer	10	Value	Entry value

Tiedot

Provide feedback

Saved searches

Use saved searches to filter your results more quickly