Author: Marko Sluga
Ask any database engineer if they would like an infinitely scalable, millisecond-latency, ACID-compliant datastore that’s fully managed and serverless to boot. They’ll tell you that’s just a pipe dream. You tell them it’s DynamoDB. Yes, ACID (Atomic, Consistent, Isolated, Durable) transactions have been fully supported in DynamoDB for a while now, which means we have a database that can fully support even the most demanding OLTP needs.
That sounds amazing! Let me get my SQL workbench ready.
Whoa there! SQL workbench? No, no – DynamoDB is a NoSQL database.
NoSQL you say? What does that really mean?
Well, NoSQL covers databases that support more than just SQL (NoSQL stands for “Not only SQL”), but that doesn’t mean SQL is natively supported on all of them. DynamoDB, for instance, supports simple HTTP operations like PUT, GET, UPDATE, and DELETE, which are extended at the API level with more database-like operations such as scans, queries, and batch reads and writes. This means the full set of SQL operations you might find in a traditional RDBMS is not going to be available in DynamoDB. Well, at least not on its own, but we’ll discuss that at a future date.
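To make that concrete, here’s a minimal sketch of those item-level operations using boto3, the AWS SDK for Python. The “Products” table and its attributes are made-up examples, and for now the sketch assumes a table keyed only by a product_name partition key:

```python
# A minimal sketch of item-level access with boto3 (the AWS SDK for Python).
# The "Products" table and its attributes are hypothetical examples.
import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("Products")

# Write a single item (the PutItem operation under the hood).
table.put_item(Item={"product_name": "toilet paper roll", "stock": 100})

# Read it back by its primary key (GetItem).
response = table.get_item(Key={"product_name": "toilet paper roll"})
print(response.get("Item"))
```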
Ok, then. Since it is addressable via the API, we can just build our requests via the CLI or SDK and start pumping data in and reading data out. No considerations, right?
Whoa again! No considerations? What about PIE?
Yeah, we can have some pie later, but now I need to finish this query so I can load this 100 million row table and click…
Stop. PIE, not pie – PIE as in the PIE theorem: Pattern flexibility, Infinite scale, and Efficiency. You can’t have all three at the same time. If you want pattern flexibility and want to just push data in and out with no considerations, then you need to sacrifice infinite scale or efficiency. DynamoDB is all about efficiency and infinite scale. So, when we design our tables, we need to have the query in mind, because access is not going to be equally efficient across all attributes in a table. Let me explain:
First of all, data distribution is crucial for performance. The data in DynamoDB is partitioned by calculating a hash of the partition key (that’s why it used to be called the hash key) and then stored on the back-end partition responsible for that hash range. We need to select the primary key in a way that spreads reads and writes evenly across the partitions, so we don’t run into hot keys and hot objects. A hot object is an item whose primary key is so popular that it consumes a disproportionate share of the reads and writes. We need to make sure we keep a good eye on the application’s operations against DynamoDB over time. After all, understanding hot object requests might also require keeping older statistics, like utilization graphs and request logs captured from our application or from DynamoDB Streams.
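To picture what’s going on, here’s a conceptual sketch of hash partitioning. DynamoDB doesn’t publish its internal hash function or partition counts, so the MD5 hash and the fixed partition count below are stand-ins, purely to show how each partition key lands deterministically on one partition:

```python
# Conceptual sketch of hash partitioning. DynamoDB does not publish its
# internal hash function; MD5 here is purely a stand-in to show how a
# partition key maps deterministically to one back-end partition.
import hashlib

NUM_PARTITIONS = 4  # illustrative; real partition counts grow with data and throughput

def partition_for(partition_key: str) -> int:
    digest = hashlib.md5(partition_key.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_PARTITIONS

for key in ["toilet paper roll", "paper towels", "hand soap"]:
    print(f"{key!r} -> partition {partition_for(key)}")
```

If every request carries the same partition key, every request hashes to the same partition, and that’s exactly the hot-object problem described above.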
All right, then I’ll set the primary key to a random 16-digit UUID. That will distribute the load so randomly that…
Yeah, it’s not that simple. What you want is a key that has good distribution but is also known to the application, because you will be querying the database and only the primary key is fully indexed. That means using something like a username, customer name, or product name, where there are many users, many customers, or many products that will actually be used in queries. Think of an ecommerce site – when you shop for toilet paper online, as we all do now that it’s just not available in stores, do you search for “BN3sdk8vasf009” or do you search for “toilet paper roll”?
Toilet paper roll.
Then use that as your primary key.
But what about the color? I only buy pink toilet paper.
You can add a second attribute to the index via a sort key. You could use color as the sort key, which makes your searches much faster because you can query a subset of the items that share the same partition key. You can accelerate queries further by adding local and global secondary indexes: a local secondary index gives you an alternative sort key, whereas a global one gives you an alternative primary key. But be careful – secondary indexes consume additional resources as well.
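Here’s a rough sketch of how such a table might be defined and queried in boto3 – the table, attribute, and index names are made up for illustration:

```python
# A sketch of a table with a composite primary key (partition + sort key)
# and a global secondary index. The table, attribute, and index names are
# hypothetical examples.
import boto3
from boto3.dynamodb.conditions import Key

client = boto3.client("dynamodb")

client.create_table(
    TableName="Products",
    AttributeDefinitions=[
        {"AttributeName": "product_name", "AttributeType": "S"},
        {"AttributeName": "color", "AttributeType": "S"},
        {"AttributeName": "brand", "AttributeType": "S"},
    ],
    KeySchema=[
        {"AttributeName": "product_name", "KeyType": "HASH"},  # partition key
        {"AttributeName": "color", "KeyType": "RANGE"},        # sort key
    ],
    # A global secondary index declares an alternative partition key.
    # (A local secondary index would instead keep product_name as the
    # partition key and declare an alternative sort key.)
    GlobalSecondaryIndexes=[
        {
            "IndexName": "brand-index",
            "KeySchema": [{"AttributeName": "brand", "KeyType": "HASH"}],
            "Projection": {"ProjectionType": "ALL"},
        }
    ],
    BillingMode="PAY_PER_REQUEST",  # on-demand; provisioned WCUs/RCUs are the alternative
)
client.get_waiter("table_exists").wait(TableName="Products")

# Querying on both keys narrows the search to pink toilet paper only.
table = boto3.resource("dynamodb").Table("Products")
response = table.query(
    KeyConditionExpression=Key("product_name").eq("toilet paper roll")
    & Key("color").eq("pink")
)
print(response["Items"])
```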
This sounds super complicated.
Well, once you take into consideration that a table supports thousands, hundreds of thousands, millions, or even tens of millions of operations per second at millisecond latency, it becomes worth figuring out the indexes versus the queries. Because at the end of the day, what are the most important factors for any application? Cost and performance. If it can do the same job faster and at a lower cost, it makes a lot of sense.
And did I mention the free tier? 25 GB of storage. 25 provisioned Write Capacity Units (WCUs). 25 provisioned Read Capacity Units (RCUs). Enough to handle up to 200M requests per month. For free. Forever.
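If you want to sanity-check that 200M figure, here’s the back-of-envelope math, assuming items within the size limits (1 KB per write, 4 KB per read) and eventually consistent reads, which cost half a read capacity unit each:

```python
# Back-of-envelope check of the "200M requests per month" free-tier claim.
# Assumes eventually consistent reads (half an RCU each, so each RCU buys
# two reads per second) and items within the 1 KB write / 4 KB read limits.
wcus, rcus = 25, 25
writes_per_sec = wcus                # 1 write/sec per WCU
reads_per_sec = rcus * 2             # 2 eventually consistent reads/sec per RCU
seconds_per_month = 30 * 24 * 3600   # 2,592,000 seconds in a 30-day month

total = (writes_per_sec + reads_per_sec) * seconds_per_month
print(f"{total:,} requests per month")  # 194,400,000 – roughly 200M
```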
Now that’s my kind of database!