DynamoDB v1 Approach Proposal #246
-
After implementing v0 of the dynamodb data adapter, some quirks of ddb make a simple implementation challenging with hyper. The primary challenge in creating a ddb adapter that works well for hyper is the listDocuments specification in the data port: the decision about how to handle primary key sorting affects how the rest of the functions are implemented. This is a proposal for how hyper could handle ddb keys to get the benefits of simple key sorting and scalability, while ensuring composite key uniqueness, and while using hyper's current port interface without the dev having to dive into ddb internals.

Assuming we want the documents to be sortable without using a GSI (index search does not seem to be supported in listDocs anyway), we'll need to find a way to make the primary key sortable, which is challenging given hyper's data spec combined with ddb's key internals. ddb has two primary key variants: a single partition key, or a composite of a partition key and a sort key. With a composite key, it is the combination that is unique (more on that below).

The partition key can stand alone as the primary key, but it is not "sortable" for hyper's listDocs spec unless there is also a sort key. It would be a ddb antipattern to simply make the partition key a single value, ex: "hyper", or a "table" value, ex: "users", "products", because the partition key requires high cardinality to decide which partition to place a record in. If a disproportionate number of records share one partition key, ddb performance will suffer because certain partitions will be used more frequently than others. This means that to make a document "sortable" we'll have to manage some high-cardinality value as the partition key, and assign the regular ID as the sort key.
This comes with the caveat that none of the other data adapters use this forced composite key method for sorting, so hyper would need to manage the partition key in a way that is consistent, predictable, and ensures high cardinality of the partition key, ideally without storing metadata about the partition key elsewhere.

A proposed approach is to make the partition key a function of the sort key, provided we can ensure the output of the function has high cardinality across different inputs. A cryptographic hash algorithm would work well for this purpose: crypto hash algs by design produce outputs with incredibly high cardinality. We would set the sort key as the data port's ID, and run the ID through a cryptographic hash algorithm. md5, deprecated for cryptographic purposes because it is too fast, is perfect for us here, because less computational time is a good thing in this case.

Composite key uniqueness is ensured for any records created through hyper because crypto hash algorithms are deterministic, so the sort key ID will always produce the same partition key. Meaning if someone tried to insert a duplicate ID (sort key), it would also produce the same hash (partition key), and the insert would fail. This negates the possibility of accidentally creating duplicates of the same composite key (unless the dev inserts into the table outside of hyper).

Downside: we abstract away some of the power and functionality of ddb, which is designed to use the partition and sort keys in very particular ways for powerful querying and scalability of a particular data set. This is inevitable to a certain degree anyway.
-
The value of hyper is to support multiple adapters, not to tailor itself to one single service, and there will no doubt be tradeoffs. Just like the tradeoff of using a high-level language vs machine language in programming, we are sacrificing performance for flexibility. In cases where maximum performance is needed, the need for flexibility must be discarded. But a large number of use cases do not need maximum performance; they need reasonable performance. If we can leverage DDB to provide reasonable performance and meet our need for flexibility, it will be a win for a solid managed multi-tenant system. If we can not, we may need to remove DDB from the data adapter family.
-
Thanks for putting this together Casey. Let me try to summarize to make sure I understand:
Did I get that right? If so, a couple things: the sort key in Dynamo probably won't be able to be fully leveraged in hyper anyway. I think your proposal works and can/should be done, but I wouldn't stress about not being able to leverage Dynamo's bells and whistles, to Tom's point above. Dynamo is fast, but most apps just want good-enough performance. If Dynamo through hyper can achieve good-enough performance, then the consistent api hyper provides and the increase in developer productivity is a great tradeoff in my opinion.
-
This is an update to the proposal. There is no benefit to having the sort key in the proposal above: the sort key only sorts within a partition, and since each partition key is unique, the sort key would only ever sort one item per partition, so its benefits are lost.

The updated proposal is to have only a partition key, no sort key, and to sort the listDocuments response in the adapter.

The partition key will remain a fast crypto hash of the hyper user's id, though, because I have seen in multiple places that auto-incrementing keys are a ddb anti-pattern for a partition key. This is because ddb uses an internal hash to decide which partition to place an item in based on its key. That hash is proprietary and subject to change, so it is possible that close key values (as with auto-incrementing) may be placed into the same partition and cause a hot partition. Ultimately, our use of a crypto hash alg reduces the possibility of creating hot partitions simply because a dev chose a key that doesn't work well as a partition key. I'll use sha-1 for the crypto hash instead of md5 since it is slightly faster: https://medium.com/@chris_72272/what-is-the-fastest-node-js-hashing-algorithm-c15c1a0e164e
-
Slight update to the previous answer. The format implemented on the adapter for partition keys is as follows: id##sha1(id). After the ##, the sha1 is in hex format. Since ## will never appear in the hex character set, the last ## in the string is always the delimiter, and everything before it is the id, regardless of the id's charset. This slight update retains the uniqueness of the hash and allows the id to be easily recovered in the adapter by finding the last instance of "##" in the string. Also, if the hyper user wants to look at their ddb table on aws, they can see the id they care about at the beginning of the partition key.