DynamoDB

The database used by the Hedy website is Amazon's DynamoDB. DynamoDB is a NoSQL database, which works a bit differently than the SQL databases you might be familiar with.

In large installations, NoSQL databases like DynamoDB scale better and are more reliable than SQL databases typically are. In our case, we use it because at small installations it is nearly free 😌, and if you want to store additional fields on a record, you can do so without performing any database maintenance operations.

Modeling data

When you define a table, you don't have to specify which columns exist or what type each column has. A DynamoDB table is schemaless, which means you can store as many or as few fields on every row as you want, and they can have any type, with only one exception.

The exception is the primary key of the table, which consists of 1 or 2 fields. The types of the primary key fields must be declared up front, and values for the key fields must exist in every row (for more information on keys, see the section "Table keys" below).

Fields in DynamoDB can have the following types:

  • string
  • number (int or float)
  • binary data (bytes in Python)
  • boolean
  • list (a list of values of any supported type)
  • dict (a mapping of string to any value of supported type)
  • set of numbers, set of strings, or set of binary values (no duplicates)

Even though non-key fields can be of any type, and you can mix types (one row could have a field public set to True and another row could have public set to 1, or level could be the number 1 in one table but the string "1" in another), it's usually a good idea to stick to a consistent set of data types. If you want to change the type of a column, you will have to update your code to write the new type, and to support reading both the old and the new type until all old data has been rewritten to the new type.
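
For illustration, here is a sketch of a single row containing each of these types, written through the Hedy abstraction layer. EXAMPLE_TABLE is a hypothetical table object and the field names are made up:

# Hypothetical table and fields, used only to illustrate the supported types.
EXAMPLE_TABLE.put({
    "id": "abc123",                   # string (also the partition key here)
    "level": 5,                       # number
    "thumbnail": b"\x89PNG...",       # binary data (bytes)
    "public": True,                   # boolean
    "tags": ["loops", "print"],       # list of values
    "settings": {"language": "nl"},   # dict of string -> value
    "scores": {80, 95},               # set of numbers (no duplicates)
})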

Table keys

When you create a table, you declare one of your fields to be the partition key, and you can optionally declare another field to be the sort key. The combination of these fields must be unique for every row. The fields that are designated as keys must have type string, number or binary data.

The roles are as follows:

  • Partition key: every table must have a partition key, and every query to the table must also include a value for the partition key (see the Query operation below). Partition keys are typically values like unique identifiers.
  • Sort key: a sort key is optional: a table does not need to have a sort key, and even if it has one, you don't have to provide a value for it when querying the table. When you retrieve multiple rows from a table, they will always be ordered by the sort key, in either ascending or descending order. When you query a table, you can restrict the query by using a condition like == (equals) on the sort key, but you can also use a condition like >= or <. Sort keys are typically used for data with an inherent sort order, like timestamps or sequential numbers.

The combination of partition key and sort key (if you are using both) must be unique. If you are not using a sort key, then the partition key must be unique by itself.

Here are some examples of tables in Hedy:

The users table

username (PK)   email              created         password          ...more data...
hedy            [email protected]    1681239902406   $2b$09$riZem.ck   ...
felienne        [email protected]    1637851898173   $2b$09$xOqiciz    ...
student123      [email protected]    1700126054884   $2b$09$9z9d0a0    ...

Because this table only has a partition key (username) and no sort key, every query to this table will retrieve at most 1 row: you can query it by giving a username, and it will return the data for that username or tell you there is no such user. The table behaves like a Python dictionary.
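
For example, looking up a user through the Hedy abstraction layer is a single get() call. This is a sketch; that a missing user comes back as None is an assumption about the abstraction layer, not something stated here:

user = USERS.get({"username": "hedy"})             # the full row for 'hedy'
missing = USERS.get({"username": "no-such-user"})  # assumed to return None if the user doesn't exist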

Now let's look at a table that also has a sort key.

The quiz-stats table

This table also uses a sort key in addition to a partition key:

id#level (PK)   week (SK)   id           level   finished   scores
hedy#1          2023-31     hedy         1       1          [ 60 ]
felienne#3      2022-08     felienne     3       0
felienne#3      2022-09     felienne     3       1          [ 95 ]
felienne#3      2022-10     felienne     3       1          [ 100 ]
student123#9    2023-23     student123   9       1          [ 40, 80 ]

This table has a partition key on the column id#level and a sort key on the column week. You can see a number of interesting things about this table:

  • The partition key column is a combination of two other fields: the column is called id#level, and it consists of the values of the individual columns id and level, combined with a separator. We can only have a single partition key: to query on the values of 2 columns at the same time (what would have been an AND in an SQL query), we combine the values of those 2 columns into 1 column, so that we can query on that one partition key. There is nothing magical about this column: the Python code that inserts rows into this table builds the value for the id#level column as it's inserting (see the sketch after this list). In fact, DynamoDB doesn't care whether the actual id and level fields are even in each row; our code just puts them there because it's convenient to have the original values as well.
  • Values with the same partition key are sorted by the value of the sort key (week). Since the sort key values are strings, but represent numbers, we have to take care to:
    • Put the most significant component of the number first (if we use <week>-<year>, 23-2000 will incorrectly sort after 22-2023).
    • Pad the numbers with 0 on the left (if we don't pad, 2022-9 will incorrectly sort after 2022-10).
  • Field values don't have to be primitives. The scores column contains a set of numbers, representing all the scores that user achieved on their quizzes during that week.
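
Here is a sketch of how such a row might be written, with the combined id#level value built in Python just before inserting. The table object name QUIZ_STATS and the use of put() are assumptions; the field values are illustrative:

username, level, week = "felienne", 3, "2022-09"
QUIZ_STATS.put({
    "id#level": f"{username}#{level}",  # combined partition key, built by our own code
    "week": week,                       # sort key: year first, week number zero-padded
    "id": username,                     # original values kept as well, for convenience
    "level": level,
    "finished": 1,
    "scores": {95},
})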

Because we have a sort key, we now have the option to query this data in multiple ways:

  • Given a user, level and week, we can query for the single row that has the user's scores for that combination.
  • Given just a user and a level, we can query for all of their quiz scores on that level over all time: this query will return all rows for a given id#level value, in ascending or descending order of week.
  • We can also do partial queries over the week data of each user. For example, we can add conditions like starts_with(week, '2022-') to get all rows in a particular year, or week <= '2022-40', reverse=True to get all quiz statistics just before a particular week, most recent ones first.
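
Here are sketches of the first two queries through get_many(), assuming the table object is called QUIZ_STATS; how sort key conditions like starts_with or <= are spelled exactly is defined in website/dynamo.py and not shown here:

# One row: a single user, level and week (the full primary key).
QUIZ_STATS.get_many({"id#level": "felienne#3", "week": "2022-09"})

# All weeks for one user and level; descending order via reverse=True,
# as used in the text above.
QUIZ_STATS.get_many({"id#level": "felienne#3"}, reverse=True)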

Designing tables

When you are designing a new table, the key schema is the most important decision you have to make. You can't change the keys after you have created the table, so take a moment to think about this.

In order to pick effective keys, it's useful to think about the queries you're going to do against the table:

  • What pages are going to retrieve rows from this table?
  • What information will those pages have access to in order to query the tables? Think about things like "current username", "current level", "ID argument passed in the URL bar", etc.
  • We like our page loads to be fast, but the more data a page needs to read from the database, the slower it will be. If you can keep the expected number of rows to read under 10 (or at least under 100), your page will probably load just fine.
  • See if you can find a sort key that will make it possible to retrieve multiple rows of useful data at the same time, given a partition key. If you can do this, maybe you don't even need to add an index! (See "Indexes" below).
  • Bonus points if the sort key has structure that makes it possible to filter and sort on it usefully as well!
  • (Advanced) Strict normalization of data is not necessarily a goal like it is in SQL databases: if you can save effort at read time by writing duplicate copies under slightly different keys at write time, that might be worth the trade-off (but if you do this, make sure that reads are much more common than writes, and that the risk of inconsistency between the different copies of the data is not a huge problem).
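
As a sketch of where these considerations end up, this is roughly what declaring the quiz-stats table could look like. The constructor and parameter names are assumptions, not the verified API of website/dynamo.py; check database.py for real examples:

# Hypothetical declaration; parameter names are assumed.
QUIZ_STATS = dynamo.Table(
    storage,
    "quiz-stats",
    partition_key="id#level",  # available on every page that needs this data (username + level)
    sort_key="week",           # inherent sort order, enables range queries per user and level
)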

Operations

DynamoDB doesn't work by executing SQL statements. Instead, you can perform a set of network API calls against the database to write or read data. In Hedy, we have an abstraction layer that we program against, which will perform those API calls for you.

Below is a list of the most important API calls, and the equivalents in the Hedy code base:

  • PutItem (Hedy: TABLE.put(data) / TABLE.create(data)): Writes a row to the table. Completely overwrites all fields in the row if a row with the same key already exists in the table.
  • UpdateItem (Hedy: TABLE.update(key, updates)): Updates some fields of a single row in the table, leaving the rest of the row unaffected if it already exists. Updates can overwrite fields with new values, but can also do in-place updates like incrementing a field, or adding an element to a list or set. Returns the new values. (DynamoDB can also do conditional updates; those are not supported in our abstraction layer just yet.)
  • DeleteItem (Hedy: TABLE.delete(key)): Deletes a row from the table.
  • GetItem (Hedy: TABLE.get(key)): Returns a single row given a full primary key (partition key + sort key).
  • BatchGetItem (Hedy: TABLE.batch_get(keys)): Retrieves multiple rows in parallel. keys must be a list of key dicts, or a dictionary mapping a string identifier to a key dict; the return value will be in the same format.
  • Query (Hedy: TABLE.get_many(key)): key must contain the partition key, and may contain an equality or comparison condition on the sort key. Returns one or more rows, depending on the query and the table schema.
  • Query (Hedy: TABLE.get_page(key, page_size)): Performs a get_many() call aware of both server-side and client-side filtering, and repeats it as many times as necessary to come up with exactly page_size items. Responses will have a "next page" token and a "previous page" token for browsing.
  • Scan (Hedy: TABLE.scan()): Returns all items from the table. There is no filtering.
  • Query + DeleteItem (Hedy: TABLE.del_many(key)): First queries all items for a given key, then deletes them all one by one.
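
For example, a typical create-update-read-delete sequence against the users table looks roughly like this (a sketch; the field values are made up):

USERS.create({"username": "newuser", "email": "user@example.com", "created": 1700000000000})
USERS.update({"username": "newuser"}, {"email": "other@example.com"})  # only touches 'email'
row = USERS.get({"username": "newuser"})                               # lookup by full primary key
USERS.delete({"username": "newuser"})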

All DynamoDB operations are paginated: you don't necessarily get all records in one call; every call will return at most 1 MB of data. Every call that can return more than one row takes an optional pagination_token parameter, and every return object will have a next_page field if there are more pages to be retrieved. You can also use the ScanIterator or QueryIterator objects to have the pagination performed for you automatically.
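
A minimal manual pagination sketch based on the description above; the exact attribute through which next_page is exposed may differ, and ScanIterator/QueryIterator hide this loop for you entirely:

token = None
while True:
    page = QUIZ_STATS.get_many({"id#level": "hedy#1"}, pagination_token=token)
    for row in page:
        ...                    # process each row
    token = page.next_page     # assumed attribute, per the description above
    if not token:
        break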

The DynamoDB abstraction layer we use in Hedy has local emulation of the database. When Hedy is running on your own computer, the data is stored in and read from dev_database.json. When Hedy is deployed to Heroku, it runs against real DynamoDB tables. You can work locally, but when you make changes to the database definition in code, you have to make sure the real alpha and production databases have been updated before you deploy.

Indexes

You can only query a table by searching with a literal partition key value, and optionally a condition on the sort key. But what if you want to search using a non-key value?

The answer is that you can add an index on other fields. You can think of indexes as an additional copy of the table's data, except using a different partition key and sort key. Whenever you add a row to the base table, the new row is automatically copied to the index table as well.

This analogy is very literal: every index stores an additional copy of the table data. With one index, the table data takes up 2x the disk space, with 2 indexes it takes up 3x the disk space, etc. If the table can grow very large and you want to avoid storing all table data multiple times, you can do one of the following things:

  • Store only the base table keys in the index. That way, you can perform a search in the index, and then use the keys you found to look up the full rows in the base table.
  • Store only select fields from the base table that you know you will need on the page where you query the index. When you create the index, you specify what fields to copy over.

When you create an index, you should also add the information for the index to the database.py file in Hedy. The get_many() operation will automatically query from the index if you pass a key that is declared as indexed. For example:

USERS.get_many({ "username": "hedy" })       # Will query the base table
USERS.get_many({ "email": "[email protected]" }) # Will automatically use the email-index 

There is one difference between indexes and the base table: even if the index only has a partition key and no sort key, get_many() can still return more than one item. It will return all items that have that value in the indexed field. The records will be returned in no particular order; if you want to guarantee an order, add a sort key.

Just like regular table keys, the fields in an index must be of type string, number or binary data. In particular: boolean is not a valid index key type. If you have a boolean field in your data that you would like to add an index on (for example, public), store 0/1 instead of False/True.
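
For example, for a public flag stored as 0/1 (a sketch: that an index on public has been declared in database.py, and the adventure id used here, are assumptions):

adventure_id = "abc123"                                  # some existing adventure id (made up)
ADVENTURES.update({"id": adventure_id}, {"public": 1})   # write 1 instead of True
public_adventures = ADVENTURES.get_many({"public": 1})   # get_many() picks the index automatically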

Filtering data with DynamoDB

Data from DynamoDB gets to the user via the following pipeline:

[Diagram: the DynamoDB pipeline: query, read from disk (at most 1 MB), server-side filter, download, client-side filter, user]

Note the 1 MB limit on reading from disk: if there is more data to read than this, you will get a pagination token that you can use in a new query to retrieve the next 1 MB of data, and so on. The reason this limit applies at the read stage, rather than the download stage, is that DynamoDB operations are designed to complete quickly, within a predictable time: if the server might have to read arbitrarily much data from disk before filling up a hypothetical 1 MB download limit, the completion time would not be predictable.

To filter through a large table efficiently you need indexes, so that you can find the interesting records without reading the entire table. You can fill the index in two ways:

  • All columns: the entire index will be a complete copy of all the data in the base table, just with different partition and sort keys. We pay for storing an additional copy of all data. Use this if the primary table isn't too large, and you know you can find the interesting records with one indexed query.
  • Keys only: the index will contain only the partition and sort keys of the primary table. You would use the index to find the keys of the interesting records, and then use batch_get to retrieve the full records from the primary table in parallel (sketched below). Use this if the primary table is so large, or you want so many indexes, that storing additional copies of all data is expensive. Also use this if you want to combine the results from two different indexes before fetching the final list of records. Downloading only the key data means less data is downloaded unnecessarily.
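
A sketch of the keys-only pattern, assuming a hypothetical keys-only index on public for the ADVENTURES table:

# 1. Use the index to find only the keys of the interesting records.
keys = [{"id": row["id"]} for row in ADVENTURES.get_many({"public": 1})]

# 2. Fetch the full records from the base table in parallel.
full_rows = ADVENTURES.batch_get(keys)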

After reading from disk, a Filter can be applied. This removes rows from the result set that don't have certain fields set; filter fields do not have to be part of the partition or sort key, they can be any fields. We have already paid the time and money for those rows to be read from disk, but we can save the time and money of also sending them across the internet. Filtering only ever shrinks the result set, so we can receive less than 1 MB of data if we use filtering. The Limit parameter also applies at the disk-read stage, so when we use filtering we can ultimately receive fewer rows from the database than the requested Limit. This is something to keep in mind if you want to show a page of, say, exactly 50 rows to the user: you can send a query with limit=50 to the database, but combined with server-side filtering and potentially client-side filtering, you may end up with fewer than 50 rows, and you will need to query again.
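
In Hedy, the get_page() helper from the operations overview deals with this re-querying for you; here is a sketch of requesting one page of exactly 50 rows (the table object name is an assumption):

# Repeats get_many() calls until 50 rows survive all filtering, and returns
# "next page" / "previous page" tokens for browsing.
page = QUIZ_STATS.get_page({"id#level": "hedy#1"}, page_size=50)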

Differences with SQL databases

Some things that you can do with SQL databases are not possible with DynamoDB. The most important ones are:

  • Change the primary key on a table: in an SQL database, you can change the primary key fields, as long as the new field is still unique. In DynamoDB, you have to create a new table with the new key fields, and copy all data from the old table to the new table.
  • Update multiple records in one API call: DynamoDB only allows updating one record per API call. If you want to do a mass update across a number of records, you will have to write a for loop.
  • Joins: you can only query one table at a time. If you want to combine data from multiple tables, you have to manage the individual queries yourself and combine the data in Python.
  • Accidentally using unindexed operations: DynamoDB operations have limits on them, which means you may see incomplete or missing data if you didn't prepare for this. When filtering through a large data set, you should use indexes so the server can efficiently find the data it needs to return. You should do the same on an SQL server, but if you don't, the SQL server will happily do a table scan to get you the data you asked for anyway. When you do the same in DynamoDB, on the other hand, you will see extremely slow response times and/or incomplete data. This is DynamoDB's way of telling you that you should add an index ☺️.
  • Combining arbitrary filters: an SQL server will let you AND and OR arbitrary expressions together, without you giving much thought to how those queries are executed and whether they are efficient. DynamoDB will not let you do that: you either need to accept that the query is inefficient, effectively downloading a large part of the table to filter it away in Python code, or make sure you have the right indexes set up beforehand.
  • You cannot easily add a multi-column index after the fact: the only way to index multiple columns at a time is to add a column that combines the data of those columns, for example id#level. If you discover that you need this after having created the table, you will need to do what is called a "backfill": write a script that goes over all rows in the table and sets the correct value for this column.

Worth mentioning:

  • Transactions: DynamoDB does support a limited form of transactions, but this is not currently implemented in our abstraction layer.

How do I do a data migration?

When you need to make backwards-incompatible changes to a table (such as changing the key schema, or changing the type of a column), you have to create a new table and copy the data over from the old table.

There are 4 ways you can approach this:

  1. Create the new table, deploy the code that uses the new table. Don't copy the data; you lose whatever's currently in the table.
  2. Create the new table, deploy the code that uses the new table. As quickly as you can, run a script that copies data from the old table to the new one; there will be a period of time during which users may see a weird state of the website in which their data is gone.
  3. Create the new table, take down the website while you run the migration script, then deploy the new code that uses the new table, and bring the website back up; users will not see a weird state, they just won't see the website at all.
  4. Do a multi-step deployment in which the old and new table are in use all the time; in this approach, there is no weird state, no downtime, and no time pressure to do any type of migration scripting quickly, but it takes more time to complete the process overall.

A safe multi-step deployment that includes data migration looks like this:

  1. Create the new table
  2. Update the code to read from both the new and old table, but write only to the new table and delete rows from the old table
  3. Deploy this code
  4. Run the migration script that copies or moves data from the old to the new table
  5. Update the code to remove all references to the old table
  6. Deploy this code
  7. Delete the old table from the database

At every point of this process, the code and database are in a consistent state with each other, and you can proceed through these steps at any desired pace.

(This process of safe multi-step deployments is not specific to DynamoDB: it applies any time you need to make a change to an external system and a change to your code, and they need to be deployed in lock-step. The same thing would happen if you were going to change the type of a column in an SQL database, for example.)

Writing a migration script

  • It's easiest if a data migration script is idempotent. That means we can run it as often as we want without changing the outcome. For every record, the migration script should either perform the migration steps that are necessary, or do nothing. It should not depend on the migration not having been carried out yet, for example. If you need bookkeeping to do this successfully, add some new columns to record whether records have already been migrated or not.
  • Reads are a lot cheaper and faster than writes. A data migration script should try to perform only the writes that are necessary (for example, rewriting every record with data that's already in there is expensive, slow, and unnecessary).

Here is an example migration script we used recently. This migrates data in-place, so it doesn't copy to a new table. Use it as inspiration:

# Goal of this migration script: turn the 'public' field of adventures from False/True into 0/1, so we can index on it
from website.dynamo import ScanIterator
from website.database import ADVENTURES

def main():
    scanned = 0
    migrated = 0

    # Scan through the entire table
    for record in ScanIterator(ADVENTURES):
        scanned += 1

        # Only migrate records that actually have 'public' set. If it's not set, it can't be the wrong type.
        pub = record.get('public', None)

        # Need to use `is` to literally compare to booleans (`==` doesn't work because `0 == False` and `1 == True`)
        if pub is True or pub is False:
            new_value = 1 if pub else 0
            ADVENTURES.update({"id": record["id"]}, {"public": new_value})

            # Keep a counter for information
            migrated += 1

    print(f'Migrated {migrated} out of {scanned} records')

if __name__ == "__main__":
    main()