💾 DynamoDB Essentials: Everything You Need to Know


Hi Reader 👋🏽

This newsletter is all about DynamoDB, one of the best-known AWS flagship services.

💡 A guessing question to get you started 🤔
What was DynamoDB's peak number of requests per second during Amazon Prime Day 2022? Scroll down to find the answer. You'll be amazed.

Let's quickly go over the topics we'll cover in this issue.

  • Introduction
  • The differences between SQL & NoSQL
  • Key Concepts & Data Types
  • Primary Key Structure
  • Scans vs. Queries
  • Secondary Indexes
  • Write Operations
  • Capacity Modes
  • Global Tables
  • Backups
  • Streams

That's a lot, but it's worth going deep on this great service, which is used by a lot of companies.

Let's go! 🚀

Introduction

DynamoDB is a fully-managed NoSQL database that is built to handle workloads at virtually any scale. Additionally, it offers great features to integrate natively with other services. As it's not your common NoSQL storage but comes with a long list of unique characteristics, it's important to understand its internals beforehand.

SQL vs NoSQL

DynamoDB is a NoSQL database, which means it does not enforce a schema beyond the primary key attributes. You can add or remove fields with every write operation.

This does not mean that there is no schema to follow. It's just not enforced on the database level itself. Your application still implicitly expects your data to be structured in a certain way.

Changing a complex implicit NoSQL schema can be even more difficult than migrating a schema in SQL.

Key Concepts & Data Types

DynamoDB's data model is built around the following major concepts: tables, items, attributes, and types.

  • Table: a table is a collection of items, and DynamoDB can store many of them. It's similar to tables in SQL.
  • Item: an item is the equivalent of a row in the table. It's a JSON-like document that can have several attributes.
  • Attribute: a field within an item, comparable to a simple key in a JSON object. Each attribute has a type.
  • Type: DynamoDB supports different types, including string, number, binary, boolean, null, list, map, and set.

With these types, DynamoDB enforces its own JSON format. This means each attribute value is wrapped in its type identifier, for example {"S": "hello"} for a string or {"N": "42"} for a number.

At the application level, you don't need to work with these nested (rather complex) fields, as there are packages for all the popular languages that automatically map from JSON to DynamoDB JSON and vice versa.
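
To make the wrapping more tangible, here is a minimal Python sketch of what such a mapping layer does. The function name to_dynamodb_json and the sample order are made up for illustration, and only a subset of the types is handled; in practice you would rely on your SDK's own mapper (boto3, for instance, ships a TypeSerializer) rather than writing this yourself.

```python
from decimal import Decimal


def to_dynamodb_json(value):
    """Wrap a plain Python value in its DynamoDB type identifier (simplified sketch)."""
    if value is None:
        return {"NULL": True}
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return {"BOOL": value}
    if isinstance(value, (int, float, Decimal)):
        return {"N": str(value)}  # numbers are transported as strings
    if isinstance(value, str):
        return {"S": value}
    if isinstance(value, bytes):
        return {"B": value}
    if isinstance(value, (list, tuple)):
        return {"L": [to_dynamodb_json(v) for v in value]}
    if isinstance(value, dict):
        return {"M": {k: to_dynamodb_json(v) for k, v in value.items()}}
    if isinstance(value, (set, frozenset)) and value and all(isinstance(v, str) for v in value):
        return {"SS": sorted(value)}  # string set; other set types are omitted here
    raise TypeError(f"unsupported type in this sketch: {type(value).__name__}")


# Hypothetical order item as your application would see it
order = {"orderId": "o-1001", "quantity": 3, "paid": True, "tags": ["gift", "express"]}

# The top-level item is a map of attribute name -> typed value
item = to_dynamodb_json(order)["M"]
print(item)
```

Note how the number 3 ends up as the string "3" inside the N wrapper. That detail is easy to miss when you read raw DynamoDB JSON for the first time.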

One last important fact: each item in DynamoDB can be up to 400 KB in size, which is a lot and won't be reached in most applications. But as you're often aiming for a single-table design (keeping all related information together, often in larger denormalized items, instead of creating multiple tables that depend on each other), it's possible to exceed this.

Primary Key Structure

A primary key identifies an item uniquely. In DynamoDB, this key can be either simple or composite.

  • Simple Primary Key: one attribute defines your primary key. This attribute is necessary for all queries on your table. In this case, the primary key is also your partition (or hash) key. This means the key decides which internal partition will actually store your item.
  • Composite Primary Key: a composite key is built of two parts - a partition key and a sort key. Together they have to be unique, but several items can share the same partition key value or the same sort key value.

Composite keys extend the options to query for items, which we'll look at in the next section.

Scans vs. Queries

There are two ways to retrieve items: scans and queries.

  • A scan does not have any requirements. You can filter on any attribute of your table (with the familiar comparators like equals, less than, greater than, ...). The problem: a scan simply iterates over all your items and only returns the matches. As you pay for every item that is read and not only for the items that are returned, this is expensive and slow.
  • A query always requires the partition key. If you're using a composite key, you can additionally provide a comparator for the sort (range) key. Queries are fast and cheap, as they don't iterate over the whole table and you're only billed for the data that matches the key condition.

Scans are the last resort and should be avoided wherever possible.
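
The difference can be sketched with a tiny in-memory model in Python. This is only an illustration under simplifying assumptions, not how DynamoDB is implemented: the items, the attribute names (customerId as partition key, orderDate as sort key), and the functions scan and query are all made up.

```python
from collections import defaultdict

# Hypothetical items: customerId is the partition key, orderDate the sort key
ITEMS = [
    {"customerId": "c-1", "orderDate": "2023-01-05", "total": 40},
    {"customerId": "c-1", "orderDate": "2023-02-11", "total": 15},
    {"customerId": "c-2", "orderDate": "2023-01-20", "total": 99},
    {"customerId": "c-3", "orderDate": "2023-03-02", "total": 15},
]

# Items grouped by partition key and ordered by sort key
PARTITIONS = defaultdict(list)
for entry in sorted(ITEMS, key=lambda i: i["orderDate"]):
    PARTITIONS[entry["customerId"]].append(entry)


def scan(predicate):
    """Read every item, then filter. Returns (matches, number of items read)."""
    matches = [i for i in ITEMS if predicate(i)]
    return matches, len(ITEMS)


def query(customer_id, date_from=None):
    """Go straight to one partition and narrow by sort key.

    A real sorted index would seek directly to date_from; the list
    comprehension stands in for that, so only the matches count as read.
    """
    candidates = PARTITIONS.get(customer_id, [])
    matches = [i for i in candidates if date_from is None or i["orderDate"] >= date_from]
    return matches, len(matches)


scan_matches, scan_read = scan(lambda i: i["total"] == 15)
query_matches, query_read = query("c-1", date_from="2023-02-01")
print(f"scan:  {len(scan_matches)} matches, {scan_read} items read")
print(f"query: {len(query_matches)} matches, {query_read} items read")
```

In this toy table the scan reads all four items to return two, while the query touches only the one item it returns. On a table with millions of items, that gap is what shows up on your bill.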

💡 Looking at our previous paragraph about primary keys: if your partition key is not well-distributed across partitions (e.g. a single partition receives a huge percentage of your items), this can lead to hot partitions.

Hot partitions have a negative impact on the general performance, as those partitions receive more read and/or write operations (more on those later in the capacity section) than other partitions.

Why is this a problem? Because your read & write capacity units are distributed across all partitions. This means one hot partition can lead to throttling well before you reach your overall capacity.
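
A quick back-of-the-envelope calculation in Python shows the effect. It assumes a strictly even split of capacity across partitions and one WCU per write; the function name and the numbers are invented for illustration, and the real service has additional mechanisms that this simple model ignores.

```python
def throttled_writes(provisioned_wcu, partition_count, writes_per_partition):
    """Writes per second rejected on each partition, assuming an even capacity split."""
    per_partition_limit = provisioned_wcu / partition_count
    return [max(0, writes - per_partition_limit) for writes in writes_per_partition]


# 1,000 WCUs over 4 partitions -> 250 WCUs per partition (assumed even split)
traffic = [600, 50, 50, 50]  # 750 writes per second in total, below the 1,000 provisioned
rejected = throttled_writes(1000, 4, traffic)
print(rejected)
```

Although the table as a whole is only at 75% of its provisioned write capacity, the first partition exceeds its share by 350 writes per second in this model.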

Secondary Indexes

We've seen that you always require the partition key to query for items. This requires a very well-planned schema where you know all your query capabilities beforehand.

Often, this is difficult or simply not possible as requirements can change.

But you're not out of options and you don't have to fall back on scans, as DynamoDB also offers secondary indexes. With them, your query capabilities can be extended.

There are two different types of secondary indexes (SI):

  • Global (GSI) - creates another primary key that's independent of the original one, which you can then query as well. It can be created at any time, but its items are kept in their own partition space. This means DynamoDB internally replicates your table's data (or the attributes you choose to project) into the index to offer this feature.
  • Local (LSI) - can only be created during table creation. Your data resides in the same partition space, as you have to re-use your partition key and can only define another range key.
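
As a sketch of how the two index types differ in their definition, here are table parameters written as a plain Python dictionary in the shape the CreateTable API expects. The table, the index names, and the attributes (Orders, byTotal, byStatus, ...) are hypothetical.

```python
# Hypothetical table definition in the shape of the CreateTable parameters
table_definition = {
    "TableName": "Orders",
    "BillingMode": "PAY_PER_REQUEST",
    "AttributeDefinitions": [
        {"AttributeName": "customerId", "AttributeType": "S"},
        {"AttributeName": "orderDate", "AttributeType": "S"},
        {"AttributeName": "status", "AttributeType": "S"},
        {"AttributeName": "total", "AttributeType": "N"},
    ],
    # Composite primary key: partition (HASH) key + sort (RANGE) key
    "KeySchema": [
        {"AttributeName": "customerId", "KeyType": "HASH"},
        {"AttributeName": "orderDate", "KeyType": "RANGE"},
    ],
    # LSI: same partition key as the table, only the range key differs
    "LocalSecondaryIndexes": [{
        "IndexName": "byTotal",
        "KeySchema": [
            {"AttributeName": "customerId", "KeyType": "HASH"},
            {"AttributeName": "total", "KeyType": "RANGE"},
        ],
        "Projection": {"ProjectionType": "ALL"},
    }],
    # GSI: a completely independent key
    "GlobalSecondaryIndexes": [{
        "IndexName": "byStatus",
        "KeySchema": [
            {"AttributeName": "status", "KeyType": "HASH"},
            {"AttributeName": "orderDate", "KeyType": "RANGE"},
        ],
        "Projection": {"ProjectionType": "KEYS_ONLY"},
    }],
}


def partition_key(key_schema):
    """Return the name of the HASH key in a key schema."""
    return next(k["AttributeName"] for k in key_schema if k["KeyType"] == "HASH")


table_pk = partition_key(table_definition["KeySchema"])
lsi_pk = partition_key(table_definition["LocalSecondaryIndexes"][0]["KeySchema"])
gsi_pk = partition_key(table_definition["GlobalSecondaryIndexes"][0]["KeySchema"])
print(table_pk, lsi_pk, gsi_pk)
```

With boto3 installed and credentials configured, a dictionary like this could be passed on as boto3.client("dynamodb").create_table(**table_definition). The point here is only the shape: the LSI shares customerId with the table, while the GSI introduces status as a new partition key.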

💡 A Small Dive Into How Partitions Work In DynamoDB: a table is divided into multiple partitions, and each partition is stored on a different server. When an item is added to the table, it is assigned to a partition based on its partition key value. All items with the same partition key value are stored in the same partition and therefore on the same server. This allows DynamoDB to distribute the data across multiple servers, which helps to scale the table as the size of the data grows. If you're interested in more detail about partitions, check out this amazing article by Alex DeBrie.
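
The idea of assigning items to partitions by key can be sketched in a few lines of Python. DynamoDB's actual hash function and partition layout are internal, so MD5 with a modulo and the helper name partition_for are purely stand-ins for illustration.

```python
import hashlib
from collections import Counter


def partition_for(partition_key: str, partition_count: int) -> int:
    """Map a partition key to a partition number (illustrative hash, not DynamoDB's)."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % partition_count


# The same key always lands on the same partition
first = partition_for("customer-42", 4)
second = partition_for("customer-42", 4)

# Many distinct keys spread out across the partitions
keys = [f"customer-{n}" for n in range(1000)]
spread = Counter(partition_for(k, 4) for k in keys)
print(first == second, dict(spread))
```

With many distinct key values the items spread across all partitions. With only a handful of distinct values, most of the data would pile up in a few of them.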

Write Operations

When you write to DynamoDB with expressions, for example to update an item, you pass expression attribute names and expression attribute values. This is rather unintuitive at first, but you'll get used to it. It's also possible to use an abstraction layer such as DynamoDBMapper for Java (comparable libraries exist for other popular languages), which makes this easier with typed classes for your database items.

But let's have a look at how a normal update looks with names and values in the CLI:

Expression attribute names and values simply act as variables: names start with # and values with :.

  • We first apply the key definition: a field that's specified by #pk with a value that's specified by :pk.
  • Then we define that we want to update a field whose name is specified by #quantity and whose value is specified by :quantity.
  • Our primary key's name (#pk) is mapped to orderId and our target field's name (#quantity) is mapped to quantity.

Then we map the values in the same way, only in the --expression-attribute-values block.
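
To see how the placeholders fit together, here is a Python sketch of an update request in the shape of the UpdateItem parameters. The table name, the item values, and the helper resolve are hypothetical, and using #pk inside a condition expression is an assumption about how the key definition is applied.

```python
# Hypothetical UpdateItem parameters using expression attribute names and values
update_request = {
    "TableName": "orders",
    "Key": {"orderId": {"S": "o-1001"}},
    "UpdateExpression": "SET #quantity = :quantity",
    # Assumption: the key placeholder is used as a condition on the existing item
    "ConditionExpression": "#pk = :pk",
    "ExpressionAttributeNames": {"#pk": "orderId", "#quantity": "quantity"},
    "ExpressionAttributeValues": {":pk": {"S": "o-1001"}, ":quantity": {"N": "5"}},
}


def resolve(expression, names, values):
    """Substitute the placeholders to show what the expression means (illustration only)."""
    for placeholder, name in names.items():
        expression = expression.replace(placeholder, name)
    for placeholder, typed_value in values.items():
        (raw_value,) = typed_value.values()  # unwrap {"N": "5"} -> "5"
        expression = expression.replace(placeholder, raw_value)
    return expression


names = update_request["ExpressionAttributeNames"]
values = update_request["ExpressionAttributeValues"]
resolved_update = resolve(update_request["UpdateExpression"], names, values)
resolved_condition = resolve(update_request["ConditionExpression"], names, values)
print(resolved_update)
print(resolved_condition)
```

DynamoDB does not perform a textual substitution like this, but reading the resolved strings ("SET quantity = 5" and "orderId = o-1001") is a handy way to check that every # and : placeholder has a mapping.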

Capacity Modes

DynamoDB offers two different capacity modes: On-Demand and Provisioned. For on-demand, there's no need to know how many reads and writes you'll need per second, as it scales immediately. Provisioned capacity requires you to know your traffic patterns, as you configure a steady level of available read and write capacity (this level can also be scaled based on CloudWatch metrics, but much more slowly).

In general, these capacity modes define two things.

  1. Your bill: on-demand capacity is more expensive per request.
  2. The likelihood that your requests get throttled. Throttling means your requests are rejected by DynamoDB, with a ProvisionedThroughputExceededException in provisioned mode or a ThrottlingException in on-demand mode.

DynamoDB charges you based on Read Capacity Units (RCU) and Write Capacity Units (WCU).

One read capacity unit refers to one strongly consistent read or two eventually consistent reads per second. This read can be for an item with a size of up to 4 KB. If the item is larger than 4 KB, you will consume more RCUs, as the size is rounded up to the next multiple of 4 KB - e.g. 5 KB will consume 2 RCUs.
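
The rule above translates into a small calculation. This Python sketch (the function name is made up) rounds the item size up to the next 4 KB block and halves the result for eventually consistent reads.

```python
import math


def read_capacity_units(item_size_kb: float, strongly_consistent: bool = True) -> float:
    """RCUs consumed by reading one item per second, following the 4 KB rule."""
    blocks = math.ceil(item_size_kb / 4)  # every started 4 KB block counts
    return blocks if strongly_consistent else blocks / 2


rcu_5kb_strong = read_capacity_units(5)
rcu_5kb_eventual = read_capacity_units(5, strongly_consistent=False)
rcu_4kb_strong = read_capacity_units(4)
print(rcu_5kb_strong, rcu_5kb_eventual, rcu_4kb_strong)
```

So a 5 KB item costs 2 RCUs when read with strong consistency and 1 RCU with eventual consistency, while an item of exactly 4 KB still fits into a single RCU.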

The On-Demand capacity mode doesn't require you to define any WCU or RCU. This is a good choice if you fulfill at least one of the following conditions:

  • Traffic patterns are unknown and vary greatly
  • You don't want to monitor and manage write & read access
  • It's a table for an application under development, saving you upfront costs

On-Demand will cover almost any load, as the service limits are immense.

Provisioned capacity is up to 7 times cheaper than on-demand, but requires you to define RCUs and WCUs. Those will be billed regardless of whether you actually use them.

Use this mode for predictable traffic. It doesn't need to be steady, as you can scale RCUs and WCUs with auto-scaling policies.

When to use what - A Summary To Remember 📚

  • Variable, unpredictable traffic → On-Demand
  • Variable, predictable traffic → Provisioned with Auto Scaling
  • Steady, predictable traffic → Provisioned with Reserved Capacity

Our Suggestion: Don't overthink this from the beginning. Use provisioned capacity with low RCUs and WCUs until you reach the Free Tier limits (25 provisioned RCUs and 25 provisioned WCUs). Afterward, choose on-demand.

Global Tables

With DynamoDB's global table feature, you can synchronize tables across regions easily, increasing resiliency and following the patterns of the AWS Well-Architected Framework.

Data is not only backed up to another region but is also synchronized bi-directionally. Regardless of the region a write goes to, each region within the global table definition will receive all updates.

Backups

DynamoDB offers a fully-managed backup solution. Complicated processes of backing up or restoring data are a thing of the past.

  • On-Demand Backups: Trigger backups manually or via a scheduled event.
  • Continuous Backups: Point-in-time recovery. Your backups are taken automatically and you can restore your data to any point in time within the last 35 days.
  • Exporting Backups to S3: Export all of your data to S3. You can restore it by importing the backup into a new table.

In general, DynamoDB differentiates between backups managed through the AWS Backup service and backups created with DynamoDB's own built-in backup functionality.

Streams

With DynamoDB streams, you can invoke Lambda functions in response to item operations (inserts, updates, and deletes) in DynamoDB. As an example, you could send a confirmation email to the user when a new order is saved.

You can activate streams in the DynamoDB console by going to the tab Exports and Streams.
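
A Lambda function for the order example could look roughly like the following Python sketch. The attribute names (orderId, email) and the send_confirmation_email stub are hypothetical, and the handler assumes the stream is configured to include the new item image.

```python
def send_confirmation_email(order_id, email):
    """Hypothetical stand-in; a real function might call an email service here."""
    print(f"confirmation for {order_id} sent to {email}")


def handler(event, context):
    """Process a batch of DynamoDB stream records and react to newly saved orders."""
    confirmed = []
    for record in event.get("Records", []):
        if record.get("eventName") != "INSERT":  # ignore MODIFY and REMOVE events
            continue
        new_image = record["dynamodb"]["NewImage"]  # the item in DynamoDB JSON
        order_id = new_image["orderId"]["S"]
        send_confirmation_email(order_id, new_image["email"]["S"])
        confirmed.append(order_id)
    return {"confirmed": confirmed}


# Hand-written sample event for local testing
sample_event = {
    "Records": [
        {
            "eventName": "INSERT",
            "dynamodb": {
                "NewImage": {"orderId": {"S": "o-1001"}, "email": {"S": "jane@example.com"}}
            },
        },
        {
            "eventName": "REMOVE",
            "dynamodb": {"OldImage": {"orderId": {"S": "o-0999"}}},
        },
    ]
}

result = handler(sample_event, None)
print(result)
```

Records arrive in DynamoDB JSON, so values are read through their type wrapper (["S"] for strings). Only the INSERT record triggers an email in this sketch; the REMOVE record is skipped.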


That's not everything there is to know about DynamoDB, but these are the most important facts. It's also a great service to start with in combination with Lambda, as you can get things up and running really quickly.

Don't hesitate to get hands-on and start building 🏗

We wish you a nice rest of the week! 🌟

Tobi & Sandro

🕵️‍♀️ P.S.: The answer to the intro question is 105.2 million requests per second. DynamoDB reached this while still maintaining single-digit millisecond response times! 🔥

