💾 DynamoDB Essentials: Everything You Need to Know


Hi Reader 👋🏽

This newsletter is all about DynamoDB - one of the most famous AWS flagship services around.

💡 A guessing question to get you started 🤔
What was DynamoDB's peak number of requests per second during Amazon Prime Day 2022? Scroll down to find the answer. You'll be amazed.

Let's quickly go over the topics we'll cover in this issue.

  • Introduction
  • The differences between SQL & NoSQL
  • Key Concepts & Data Types
  • Primary Key Structure
  • Scans vs. Queries
  • Secondary Indexes
  • Write Operations
  • Capacity Modes
  • Global Tables
  • Backups
  • Streams

That's a lot, but it's worth going deep on this great service, which is used by a lot of companies.

Let's go! 🚀

Introduction

DynamoDB is a fully managed NoSQL database that is able to handle any scale. Additionally, it offers great features to integrate natively with other services. As it's not your common NoSQL storage but comes with a long list of unique points, it's important to understand its internals beforehand.

SQL vs NoSQL

DynamoDB is a NoSQL database, which means it is not explicitly enforcing a schema. You can add or remove fields with every write operation.

This does not mean that there is no schema to follow. It's just not enforced on the database level itself. Your application still implicitly expects that your data is structured in a certain way.

Changing a complex implicit NoSQL schema can be even more difficult than migrating a schema in SQL.

Key Concepts & Data Types

DynamoDB's internals are built around the following major concepts: tables, items, attributes, and types.

  • Table: a table is a collection of items, and it can store many of them. It's similar to a table in SQL.
  • Item: an item is comparable to a row in a SQL table. It's a document in JSON format that can have several attributes.
  • Attribute: a field within an item, comparable to a simple key in a JSON object. Each attribute has a type.
  • Type: DynamoDB allows for different types, including string, number, binary, boolean, null, list, map, and set.

With these types, DynamoDB enforces its own JSON format. This means each attribute value is wrapped in its type identifier.

At the application level, you don't need to work with these nested (and rather complex) fields, as there are packages for all the popular languages that automatically map from JSON to DynamoDB JSON and vice versa.
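
To make the wrapping concrete, here is a minimal Python sketch of how a plain value could be converted to DynamoDB JSON. The helpers `to_dynamodb_json` and `item_to_dynamodb_json` and the sample `order` item are hypothetical and written only for illustration; in a real application you'd rely on your SDK's built-in mapping instead. The type identifiers themselves (S, N, BOOL, NULL, B, SS, L, M) are the ones DynamoDB uses.

```python
def to_dynamodb_json(value):
    """Wrap a plain Python value in its DynamoDB type identifier (illustrative only)."""
    if value is None:
        return {"NULL": True}
    if isinstance(value, bool):  # checked before numbers, because bool is a subclass of int
        return {"BOOL": value}
    if isinstance(value, (int, float)):
        return {"N": str(value)}  # numbers are sent as strings
    if isinstance(value, str):
        return {"S": value}
    if isinstance(value, bytes):
        return {"B": value}
    if isinstance(value, set):  # assumption: this sketch only handles sets of strings
        return {"SS": sorted(value)}
    if isinstance(value, list):
        return {"L": [to_dynamodb_json(v) for v in value]}
    if isinstance(value, dict):
        return {"M": {k: to_dynamodb_json(v) for k, v in value.items()}}
    raise TypeError(f"Unsupported type: {type(value).__name__}")


def item_to_dynamodb_json(item):
    """An item maps attribute names to typed values; the top level itself is not wrapped."""
    return {name: to_dynamodb_json(value) for name, value in item.items()}


# Hypothetical order item, used only to show the shape of the result.
order = {"orderId": "order-1", "quantity": 3, "paid": True, "tags": ["gift", "express"]}
print(item_to_dynamodb_json(order))
```

Note how even a simple number ends up as a nested object with a string value, which is why most applications leave this mapping to a library.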

One last important fact: each item in DynamoDB can be up to 400 KB in size, which is a lot and won't be reached in most applications. But as you're often aiming for a single-table design (storing related data together, often denormalized into larger items, instead of creating multiple tables that depend on each other), it's possible to exceed this.

Primary Key Structure

A primary key identifies an item uniquely. In DynamoDB, this key can be either a simple one or a composite.

  • Simple Primary Key: one attribute defines your primary key. This attribute is necessary for all queries on your table. In this case, the primary key is also your partition (or hash) key. This means the key decides which internal partition will actually store your item.
  • Composite Primary Key: a composite key is built of two parts - a partition key and a sort key. Together they have to be unique, but multiple items can share the same partition key value or the same sort key value.

Composite keys extend the options to query for items, which we'll look at in the next section.

Scans vs. Queries

There are two ways to retrieve items: scans and queries.

  • A scan does not have any requirements. You can create a filter for any given attribute on your table (with the known comparators like equals, less than, greater than, ...). The problem: a scan simply iterates over all your items and only returns those that match. As you pay for each iterated item and not only the retrieved items, this is expensive and slow.
  • A query always requires the partition key. If you're using a composite key, you can additionally provide a comparator for the sort (or range) key. Queries are fast and cheap, as they won't iterate over the whole table and you're only billed for the items that match the key condition.

Scans are always the last resort and should be avoided wherever possible.
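
The difference in work can be sketched with a small in-memory model. This is not how DynamoDB is implemented; `TinyTable` and the attribute names `customerId` (partition key) and `orderDate` (sort key) are assumptions made up for this example. The point is only to compare how many items each access path has to read.

```python
import bisect
from collections import defaultdict


class TinyTable:
    """In-memory stand-in for a table with a composite key (illustrative only)."""

    def __init__(self):
        # partition key value -> items of that partition, kept ordered by sort key
        self.partitions = defaultdict(list)

    def put(self, item):
        items = self.partitions[item["customerId"]]
        items.append(item)
        items.sort(key=lambda i: i["orderDate"])

    def scan(self, predicate):
        """Read every item in the table and keep only the matches."""
        read, matches = 0, []
        for items in self.partitions.values():
            for item in items:
                read += 1
                if predicate(item):
                    matches.append(item)
        return matches, read

    def query(self, customer_id, sk_from="", sk_to="\uffff"):
        """Go straight to one partition and slice the requested sort key range."""
        items = self.partitions.get(customer_id, [])
        keys = [i["orderDate"] for i in items]
        lo = bisect.bisect_left(keys, sk_from)
        hi = bisect.bisect_right(keys, sk_to)
        matches = items[lo:hi]
        return matches, len(matches)


table = TinyTable()
for c in range(100):
    for day in ("2023-01-05", "2023-02-10", "2023-03-15"):
        table.put({"customerId": f"c{c}", "orderDate": day, "total": c})

scan_matches, scan_read = table.scan(
    lambda i: i["customerId"] == "c7" and i["orderDate"] >= "2023-02-01"
)
query_matches, query_read = table.query("c7", sk_from="2023-02-01")
print(f"scan read {scan_read} items, query read {query_read} items")
```

Both calls return the same two orders, but the scan has to touch all 300 items to find them.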

💡 Looking back at the previous section about primary keys: if your partition key is not well distributed across partitions (e.g. a single partition receives a huge percentage of your items), this can lead to hot partitions.

Hot partitions have a negative impact on overall performance, as those partitions receive more read and/or write operations than other partitions (more on those in the Capacity Modes section).

Why is this a problem? Because your read & write capacity units are distributed across all partitions. This means one hot partition can lead to throttles way before you reach your overall capacity.
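
A quick back-of-the-envelope sketch of that effect, under a deliberately simplified assumption: the table's capacity is split evenly across partitions and there is no bursting or rebalancing. The function name, partition names, and traffic numbers are hypothetical.

```python
def throttled_requests(requests_per_partition, total_capacity):
    """Return, per partition, how many requests exceed an evenly split capacity share."""
    share = total_capacity / len(requests_per_partition)
    return {
        partition: max(0, requests - share)
        for partition, requests in requests_per_partition.items()
    }


# Hypothetical traffic per second: one hot partition, two quiet ones.
traffic = {"partition-a": 150, "partition-b": 20, "partition-c": 20}
result = throttled_requests(traffic, total_capacity=300)

print(f"total traffic: {sum(traffic.values())} of 300 units")
print(result)
```

In this model the table only uses 190 of its 300 units, yet 50 requests on the hot partition are rejected because its share is just 100.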

Secondary Indexes

We've seen that you always require the partition key to query for items. This requires a very well-planned schema where you know all your access patterns beforehand.

Often, this is difficult or simply not possible as requirements can change.

But you're not out of options and you don't have to fall back on scans, as DynamoDB also offers secondary indexes. With them, your query capabilities can be extended.

There are two different types of secondary indexes (SI):

  • Global (GSI) - creates another primary key that's independent of the original one. You can then also query on this new index. It can be created at any time, but its items are kept in their own partition space. This means DynamoDB internally replicates your table data (or the attributes you project) to offer this feature.
  • Local (LSI) - can only be created during table creation. Your data resides in the same partition space, as you have to re-use your partition key and can only define another sort (range) key.
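
Conceptually, both index types re-organize the same items under a different key. The following sketch models that idea in memory; `build_index`, the `orders` list, and all attribute names are assumptions for illustration and say nothing about DynamoDB's actual storage.

```python
from collections import defaultdict


def build_index(items, partition_key, sort_key):
    """Group items by an alternative partition key, ordered by the given sort key."""
    index = defaultdict(list)
    for item in items:
        # Items that lack one of the index key attributes are left out of the index.
        if partition_key in item and sort_key in item:
            index[item[partition_key]].append(item)
    for group in index.values():
        group.sort(key=lambda i: i[sort_key])
    return index


# Hypothetical base table keyed by customerId (partition key) and orderDate (sort key).
orders = [
    {"customerId": "c1", "orderDate": "2023-01-05", "status": "SHIPPED", "total": 40},
    {"customerId": "c1", "orderDate": "2023-02-10", "status": "OPEN", "total": 15},
    {"customerId": "c2", "orderDate": "2023-01-20", "status": "OPEN", "total": 99},
]

# GSI-like: a completely new partition key (status).
by_status = build_index(orders, "status", "orderDate")
# LSI-like: same partition key as the base table, different sort key (total).
by_customer_total = build_index(orders, "customerId", "total")

print([o["customerId"] for o in by_status["OPEN"]])
print([o["total"] for o in by_customer_total["c1"]])
```

The GSI-like index answers "all open orders by date" and the LSI-like index answers "orders of one customer by total", neither of which the base key supports with a query.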

💡 A Small Dive Into How Partitions Work In DynamoDB: a table is divided into multiple partitions, and each partition is stored on a different server. When an item is added to the table, it is assigned to a partition based on its partition key value. All items with the same partition key value are stored in the same partition and therefore on the same server. This allows DynamoDB to distribute the data across multiple servers, which helps to scale the table as the size of the data grows. If you're interested in more detail about partitions, check out this amazing article by Alex DeBrie.
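
The assignment step can be sketched as hashing the partition key and mapping the hash to a partition. DynamoDB's real hash function and partition management are internal; the MD5 hash, the modulo step, and the `partition_for` helper below are stand-ins chosen only for this illustration.

```python
import hashlib
from collections import Counter


def partition_for(partition_key, partition_count):
    """Map a partition key value to a partition number (illustrative stand-in)."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % partition_count


# Many distinct key values spread across the partitions ...
spread = Counter(partition_for(f"order-{n}", 4) for n in range(1000))
# ... while one repeated key value always lands on the same partition.
skewed = Counter(partition_for("tenant-1", 4) for _ in range(1000))

print("distinct keys:", dict(spread))
print("single key:   ", dict(skewed))
```

This is also why a partition key with few distinct values tends to produce hot partitions: the hash can only spread what the key values allow it to.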

Write Operations

If you want to write data to DynamoDB, you pass Expression Attribute Names and Expression Attribute Values. This is rather unintuitive at first, but you'll get used to it. It's also possible to use another abstraction layer like DynamoDBMapper (similar libraries exist for most popular languages) that makes this easier with typed classes for your database items.

But let's walk through how a typical update request looks with names and values in the CLI.

The expression attribute names (which just act as variables) start with a # and the expression attribute values with a :.

  • We first apply the key definition: a field that's specified by #pk with a value that's specified by :pk.
  • Then we define that we want to update a field whose name is specified by #quantity and whose value is specified by :quantity.
  • Our primary key's name (#pk) is mapped to orderId and our target field's name is mapped to quantity.

Then we map the values in the same way, only in the --expression-attribute-values block.
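
The original command isn't reproduced here, so the following is only one possible reading of the steps above, written as the request parameters an SDK call would take (they correspond to the CLI flags --key, --condition-expression, --update-expression, --expression-attribute-names and --expression-attribute-values). The table name, the concrete values, and the choice to put `#pk = :pk` into a condition expression are assumptions. The small `placeholders` helper is hypothetical too; it just checks that every placeholder used in an expression is defined and vice versa.

```python
import re

# Hypothetical request: set the quantity of one order.
update_params = {
    "TableName": "orders",  # assumed table name
    "Key": {"orderId": {"S": "order-1"}},
    # Assumption: the key check is expressed as a condition on the update.
    "ConditionExpression": "#pk = :pk",
    "UpdateExpression": "SET #quantity = :quantity",
    "ExpressionAttributeNames": {"#pk": "orderId", "#quantity": "quantity"},
    "ExpressionAttributeValues": {":pk": {"S": "order-1"}, ":quantity": {"N": "5"}},
}


def placeholders(expressions, prefix):
    """Collect all placeholders with the given prefix ('#' or ':') used in the expressions."""
    found = set()
    for expression in expressions:
        found.update(re.findall(re.escape(prefix) + r"\w+", expression))
    return found


expressions = [update_params["ConditionExpression"], update_params["UpdateExpression"]]
# Placeholders that are used but not defined, or defined but never used.
unresolved_names = placeholders(expressions, "#") ^ set(update_params["ExpressionAttributeNames"])
unresolved_values = placeholders(expressions, ":") ^ set(update_params["ExpressionAttributeValues"])
print(unresolved_names, unresolved_values)
```

Both sets should be empty for a consistent request; a placeholder that is defined but never referenced is a common cause of validation errors.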

Capacity Modes

DynamoDB offers two different capacity modes: On-Demand and Provisioned. With on-demand, there's no need to specify how many reads and writes you'll need per second, as it scales immediately. Provisioned capacity requires you to know your traffic patterns, because you define a steady level of available read and write capacity (which can also be scaled based on CloudWatch metrics, but much more slowly).

In general, these capacity modes define two things.

  1. Your bill: on-demand capacity is more expensive.
  2. The possibility that your requests get throttled. Throttling means your requests are rejected by DynamoDB with a throttling error such as ProvisionedThroughputExceededException or ThrottlingException.

DynamoDB charges you based on Read Capacity Units (RCU) and Write Capacity Units (WCU).

One read capacity unit covers one strongly consistent read per second, or two eventually consistent reads per second, for an item of up to 4 KB. If the item is larger than 4 KB, you will consume more RCUs, e.g. a 5 KB item consumes 2 RCUs per strongly consistent read.
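
As a small worked example, the rule can be written down as a function. The name `read_capacity_units` and its parameters are made up for this sketch, and the rounding of the final result is a simplifying assumption.

```python
import math


def read_capacity_units(item_size_kb, reads_per_second, strongly_consistent=True):
    """Estimate the RCUs needed for a steady rate of reads of same-sized items."""
    units_per_read = math.ceil(item_size_kb / 4)  # every started 4 KB block counts
    if not strongly_consistent:
        units_per_read = units_per_read / 2  # eventually consistent reads cost half
    return math.ceil(units_per_read * reads_per_second)


print(read_capacity_units(5, 1))                              # one 5 KB strongly consistent read
print(read_capacity_units(4, 2, strongly_consistent=False))   # two 4 KB eventually consistent reads
print(read_capacity_units(5, 10, strongly_consistent=False))  # ten 5 KB eventually consistent reads
```

The first call reproduces the 5 KB example from the text, and the second shows that two eventually consistent reads of up to 4 KB fit into a single RCU.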

The On-Demand capacity mode doesn't require you to define any WCU or RCU. This is a good choice if you fulfill at least one of the following conditions:

  • Traffic patterns are unknown and vary greatly
  • You don't want to monitor and manage write & read access
  • It's a table for an application under development, saving you upfront costs

On-Demand will cover almost any load, as the service limits are immense.

Provisioned capacity is up to 7 times cheaper than on-demand, but requires you to define RCUs and WCUs. Those will be billed regardless of whether you actually use them.

Use this mode for predictable traffic. It doesn't need to be steady, as you can scale RCUs and WCUs with auto-scaling policies.

When to use what - A Summary To Remember 📚

  • Variable, unpredictable traffic → On-Demand
  • Variable, predictable traffic → Provisioned with Auto Scaling
  • Steady, predictable traffic → Provisioned with Reserved Capacity

Our Suggestion: Don't overthink this from the beginning. Use provisioned capacity with low RCUs and WCUs while you stay within the Free Tier limits (25 RCUs and 25 WCUs). Afterward, choose on-demand.

Global Tables

With DynamoDB's global table feature, you can synchronize tables across regions easily, increasing resiliency and following the patterns of the AWS Well-Architected Framework.

Data is not only backed up to another region but is also synchronized bi-directionally. Regardless of the write region, each region within the global table definition will receive all updates.

Backups

DynamoDB offers a fully-managed backup solution. Complicated processes of backing up or restoring data are a thing of the past.

  • On-Demand Backups: Trigger backups manually or via a scheduled event
  • Continuous Backups: Point-in-time recovery. Your backup will be done automatically and you can restore data to any point within the last 35 days.
  • Exporting Backups to S3: Export all of your data to S3. You can restore it by importing the backup to a new table.

In general, DynamoDB differentiates between backups managed through the AWS Backup service and backups created with DynamoDB's own backup functionality.

Streams

With DynamoDB streams, you can invoke Lambda functions for item operations in DynamoDB. For example, you could send a confirmation email to the user whenever a new order is saved.

You can activate streams in the DynamoDB console by going to the tab Exports and Streams.
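
A Lambda function for the order example could look roughly like the sketch below. It assumes the stream is configured to include new item images and that order items carry `orderId` and `email` string attributes; those attribute names, the `send_confirmation_email` placeholder, and the sample event are hypothetical.

```python
def send_confirmation_email(email, order_id):
    """Placeholder for the real email integration (hypothetical)."""
    print(f"Sending confirmation for {order_id} to {email}")


def handler(event, context):
    """Send a confirmation for every newly inserted order in the stream batch."""
    sent = []
    for record in event.get("Records", []):
        if record.get("eventName") != "INSERT":
            continue  # ignore updates (MODIFY) and deletions (REMOVE)
        # The new item arrives in DynamoDB JSON, so values are wrapped in their type.
        new_image = record["dynamodb"]["NewImage"]
        order_id = new_image["orderId"]["S"]
        email = new_image["email"]["S"]
        send_confirmation_email(email, order_id)
        sent.append(order_id)
    return {"confirmationsSent": sent}


# Trimmed-down sample event for local experiments (real events carry more fields).
sample_event = {
    "Records": [
        {
            "eventName": "INSERT",
            "dynamodb": {
                "NewImage": {"orderId": {"S": "order-1"}, "email": {"S": "jane@example.com"}}
            },
        },
        {
            "eventName": "MODIFY",
            "dynamodb": {
                "NewImage": {"orderId": {"S": "order-2"}, "email": {"S": "joe@example.com"}}
            },
        },
    ]
}
print(handler(sample_event, None))
```

Only the inserted order triggers an email here; the modified one is skipped.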


That's not everything about DynamoDB, but these are the most important facts. It's also a great service to get started with in combination with Lambda, as you can get things up and running really quickly.

Don't hesitate to get hands-on and start building 🏗

We wish you a nice rest of the week! 🌟

Tobi & Sandro

🕵️‍♀️ P.S.: The answer to the intro question is 105.2 million requests per second. DynamoDB reached this while still maintaining single-digit millisecond response times! 🔥

