DynamoDB v1 Approach Proposal #246
-
After implementing v0 of the dynamodb data adapter, some quirks of ddb make a simple implementation challenging with hyper. The primary challenge in creating a ddb adapter that works well for hyper is the listDocuments specification in the data port: the decision about how to handle primary key sorting affects how the rest of the functions are implemented. This is a proposal for how hyper could handle ddb keys to get the benefits of simple key sorting and scalability, while ensuring composite key uniqueness, and while using hyper's current port interface without the dev having to dive into ddb internals.

Assuming we want the documents to be sortable without using a GSI (index search does not seem to be supported in listDocs anyway), we'll need to find a way to make the primary key sortable, which is challenging given hyper's data spec combined with ddb's key internals. ddb has two primary key variants: a single partition key, or a composite of a partition key and a sort key. With a composite key, it is the combination that is unique (more on that below).

The partition key can stand alone as the primary key, but it is not "sortable" for hyper's listDocs spec unless there is also a sort key. It would be a ddb antipattern to simply make the partition key a single value, ex: "hyper", or a "table" value, ex: "users", "products", because the partition key requires high cardinality to decide which partition to place a record in. If a disproportionate number of records share one partition key, ddb performance will suffer because certain partitions will be used more frequently than others. This means that to make a document "sortable" we'll have to manage some high-cardinality value as the partition key, and assign the regular ID as the sort key.
This comes with the caveat that none of the other data adapters use this forced composite key method for sorting, so hyper would need to manage the partition key in a way that is consistent, predictable, and ensures high cardinality of the partition key, ideally without storing metadata about the partition key elsewhere.

A proposed approach is to make the partition key a function of the sort key, provided we can ensure the output of the function has high cardinality across different inputs. A cryptographic hash algorithm would work well for this purpose: crypto hash algs by design produce outputs with incredibly high cardinality. We would set the sort key as the data port's ID, and run the ID through a cryptographic hash algorithm. md5, deprecated for cryptographic purposes because it is too fast, is perfect for us here, because less computational time is a good thing in this case.

Composite key uniqueness is ensured for any records created through hyper because crypto hash algorithms are deterministic, so the sort key ID will always produce the same partition key. Meaning if someone tried to insert a duplicate ID (sort key), it would also produce the same hash (partition key), and the insert would fail. This negates the possibility of accidentally creating duplicates of the same composite key (unless the dev inserts into the table outside of hyper).

Downside: we abstract away some of the power and functionality of ddb, which is designed to use the partition and sort keys in very particular ways for powerful querying and scalability of a particular data set. This is inevitable to a certain degree anyway.
-
The value of hyper is to support multiple adapters, not to tailor itself to one single service, and there will no doubt be tradeoffs. Just like the tradeoff of using a high-level language vs machine language in programming, we are sacrificing performance for flexibility. In cases where maximum performance is needed, the need for flexibility must be discarded. But a large number of use cases do not need maximum performance; they need reasonable performance. If we can leverage DDB to provide reasonable performance and meet our need for flexibility, it will be a win for a solid managed multi-tenant system. If we can not, we may need to remove DDB from the data adapter family.
-
Thanks for putting this together Casey. Let me try to summarize to make sure I understand:
Did I get that right? If so, a couple things: the sort key in Dynamo probably won't be able to be fully leveraged in hyper anyway. I think your proposal works and can/should be done, but I wouldn't stress about not being able to leverage Dynamo's bells and whistles, to Tom's point above. Dynamo is fast, but most apps just want good-enough performance. If Dynamo through hyper can achieve good-enough performance, then the consistent api hyper provides and the increase in developer productivity is a great tradeoff in my opinion.
-
This is an update to the proposal. There is no benefit to having the sort key in the proposal above: the sort key only sorts within a partition, and since each partition key is unique, the sort key would only ever sort one item per partition, so its benefits are lost.

The updated proposal is to have only a partition key, no sort key, and to sort the listDocuments response in the adapter.

The partition key will remain a fast crypto hash of the hyper user's id, though, because I have seen in multiple places that auto-incrementing keys are a ddb anti-pattern for a partition key. This is because ddb uses an internal hash to decide which partition to place an item in based on its key. That hash is proprietary and subject to change, so it is possible that close key values (as with auto-incrementing) may be placed into the same partition and cause a hot partition. Ultimately, our use of a crypto hash alg reduces the possibility of creating hot partitions simply because a dev chose a key that doesn't work well as a partition key. I'll use sha-1 for the crypto hash instead of md5 since it is slightly faster: https://medium.com/@chris_72272/what-is-the-fastest-node-js-hashing-algorithm-c15c1a0e164e
-
Slight update to the previous answer. The format implemented on the adapter for partition keys is as follows: id##sha1(id). After the ##, the sha1 is in hex format. Since ## will never appear in the hex character set, the last ## in the string is always the delimiter, and everything before it is the id, regardless of the id's charset. This slight update retains the uniqueness of the hash and allows the id to be easily recovered in the adapter by finding the last instance of "##" in the string. Also, if the hyper user wants to look at their ddb table on aws, they can see the id they care about at the beginning of the partition key.