-
Notifications
You must be signed in to change notification settings - Fork 289
DynamoDB
The database being used by the Hedy website is Amazon's DynamoDB. DynamoDB is a NoSQL database, which work a bit differently than SQL databases you might be familiar with.
In large installations, NoSQL databases like DynamoDB scale better and are more reliable than SQL databases typically are. In our case, we use it because at small installations it is nearly free 😌, and if you want to store additional fields on a record, you can do so without performing any database maintenance operations.
When you define a table, you don't have to specify what columns exist for a table and what type each column is. A DynamoDB table is schemaless, which means you can store as many or as few fields on every row as you want, and they can have any type, with only one exception.
The exception is the primary key of the table, which consists of 1 or 2 fields in the table. The types of the primary key fields must be predeclared, and values for the key fields must exist in every row (more information on keys, see the section "Table keys" below).
Fields in DynamoDB can have the following types:
string
- number (
int
orfloat
) - binary data (
bytes
in Python) boolean
-
list
(a list of values of any supported type) -
dict
(a mapping of string to any value of supported type) -
set
of all numbers,set
of all strings,set
of all binary values (no duplicates)
Even though non-key fields can be of any type, and you can mix types (one row could have a field public
set to True
, and another row could have the field public
set to 1
, or level
could be the number 1
in one table but the string "1"
in another), it's usually just a good idea to stick to a consistent set of data types.
When you create a table, you declare at least one of your fields to be the partition key, and you can optionally declare one of the fields to be the sort key. The combination of these fields must be unique for every row. The types of the fields that are designated as keys must be one of string
, number
or binary data
.
The roles are as follows:
-
Partition key: every table must have a partition key, and every query to the table must also include a value for the partition key (see the
Query
operation below). Partition keys are typically values like unique identifiers. -
Sort key: a sort key is optional: tables may not have a sort key, and even if they have a sort key, you don't have to provide a sort key when querying the table. When you retrieve multiple elements from a table, they will always be ordered by the sort key, in either ascending or descending order. When you query a table, you can restrict the query by use a condition like
==
(equals) on a sort key, but you can also use a condition like>=
or<
. Sort keys are typically used for data with an ordering, like timestamps.
The combination of partition key and sort key (if you are using both), must be unique. If you are not using a sort key, then the partition key must be unique by itself.
Here are some examples of tables in Hedy:
username (PK) | created | password | ...more data.... | |
---|---|---|---|---|
hedy | [email protected] | 1681239902406 | $2b$09$riZem.ck | ... |
felienne | [email protected] | 1637851898173 | $2b$09$xOqiciz | ... |
student123 | [email protected] | 1700126054884 | $2b$09$9z9d0a0 | ... |
Because this table only has a partition key (username
) and no sort key, every query to this table will retrieve at most 1 row: you can query it by giving a username
, and it will return the data for that username
or tell you there is no such user. The table behaves like a Python dictionary.
Now let's look at a table that also has a sort key.
This table also uses a sort key in addition to a partition key:
id#level (PK) | week (SK) | id | level | finished | scores |
---|---|---|---|---|---|
hedy#1 | 2023-31 | hedy | 1 | 1 | [ 60 ] |
felienne#3 | 2022-08 | felienne | 3 | 0 | |
felienne#3 | 2022-09 | felienne | 3 | 1 | [ 95 ] |
felienne#3 | 2022-10 | felienne | 3 | 1 | [ 100 ] |
student123#9 | 2023-23 | student123 | 9 | 1 | [ 40, 80 ] |
This table has a partition key on the column id#level
and a sort key on the column week
. You can see a number of interesting things on this table:
- The primary key column is a combination of two other fields: the column is called
id#level
, and it consists of the values of the individual columnsid
andlevel
, combined with a separator. We can only have a single partition key: to query on the values of 2 columns at the same time (using what would have been anAND
in an SQL query), we combine the values of those 2 columns into 1 column, so that we can query on that one partition key. There is nothing magical about this column: the Python code that inserts rows into this table builds the value for theid#level
column as it's inserting. In fact, DynamoDB doesn't care whether the actualid
andlevel
fields are even in each row; our code just puts them there because it's convenient to have the original values as well. - Values with the same partition key are sorted by the value of the sort key (
week
). Since the sort key values arestring
s, but represent numbers, we have to take care to pad them with0
on the left (if we don't,2022-9
would incorrectly sort after2022-10
). - Field values don't have to be primitives. The
scores
column contain a set of numbers, representing all the scores that user has scored on all their quizzes during that week.
Because we have a sort key, we now have the option to query this data in multiple ways:
- Given a user, level and week, we can query for the single row that has the user's scores for that combination.
- Given just a user and a level, we can query for all of their test scores on that level over all time: this query will return all rows for a given
id#level
value, in ascending order ofweek
. - We can also do partial queries over the week data of each user. For example, we can add conditions like
starts_with(week, '2022-')
to get all rows in a particular year, orweek >= '2022-40'
to get all quiz statistics since a particular week.
When you are designing a new table, the key schema is the most important decision you have to make. You can't change the keys after you have created the table, so take a moment to think about this.
In order to pick effective keys, it's useful to think about the queries you're going to do against the table:
- What pages are going to retrieve rows from this table?
- What information will those pages have access to in order to query the tables? Think about things like "current username", "current level", "ID argument passed in the URL bar", etc.
- We like our page loads to be fast, but the more data the page needs to read from the database, the slower it will be. If you can keep your expected rows to read under 10 (or at least under a 100), your page will probably load just fine.
- See if you can find a sort key that will make it possible to retrieve multiple rows of useful data at the same time, given a partition key. If you can do this, maybe you don't even need to add an index! (See "Indexes" below).
- Bonus points if the sort key has structure that makes it possible to filter and sort on it usefully as well!
- (Advanced) Strict normalization of data is not necessarily a goal like it is in SQL databases: if you can save effort at read time by doing duplicate writes under slightly different keys at write time, that might be worth the trade-off (but if you do this, make sure that reads are much more common than writes and the risks of potential inconsistency between different copies of the data are not a huge problem.
DynamoDB doesn't work by executing SQL statements. Instead, you can perform a set of network API calls against the database to write or read data. In Hedy, we have an abstraction layer that we program against, which will perform those API calls for you.
Below is a list of the most important API calls, and the equivalents in the Hedy code base:
DynamoDB API | Hedy API | Description |
---|---|---|
PutItem |
TABLE.put(data) / TABLE.create(data)
|
Write a row to the table. Will completely overwrite all fields in the row if a row with the same key already exists in the table. |
UpdateItem | TABLE.update(key, updates) |
Updates some fields of a single row in the table, leaving the rest of the row unaffected if it already exists. Updates can overwrite fields with new values, but can also do in-place updates like increment a field, or add an element to a list or set. Returns the new values. (DynamoDB can also do conditional updates, those are not supported in our abstraction layer just yet). |
DeleteItem | TABLE.delete(key) |
Delete a row from the table. |
GetItem | TABLE.get(key) |
Return a single row given a full primary key (partition key + sort key) |
BatchGetItem | TABLE.batch_get(keys) |
Retrieve multiple rows in parallel. keys must be a list of key dicts, or a dictionary mapping a string identifier to a key dict; the return value will be in the same format. |
Query | TABLE.get_many(key) |
key must contain the partition key, and may contain an equality or comparison condition on the sort key. Returns one or more rows, depending on the query and the table schema. |
Scan | TABLE.scan() |
Return all items from the table. There is no filtering. |
TABLE.del_many(key) |
First query all items for a given key, then delete them all one by one. |
All DynamoDB operations are paginated: you don't necessarily get all records in one call: every call will return up to 1 MB of data. Every call that can return more than one row takes an optional pagination_token
parameter, and every return object will have a next_page
field if there are more pages to be retrieved. You can also use the ScanIterator
or QueryIterator
objects to have the pagination automatically performed for you.
(TODO)
(TODO)
(TODO)