---
title: "Making the Slow Explicit: Dynamodb vs SQL"
date: 2023-02-26T15:51:19-05:00
toc: false
images:
tags:
- dev
- web
---
SQL databases like MySQL, MariaDB, and PostgreSQL are highly performant and can
scale well. However, in practice it's not rare for people to run into
performance issues with these databases and turn to NoSQL solutions like
DynamoDB. Proponents of DynamoDB, like Alex DeBrie, the author of ["The DynamoDB Book"](https://www.dynamodbbook.com/),
point to a few things to explain this difference: the HTTP-based APIs of NoSQL
databases are more efficient than the TCP connections SQL databases use, table
joins are slow, and SQL databases are designed to save disk space while NoSQL
databases take advantage of large modern disks.[^1]

[^1]: I don't have my copy of the book handy, so I wrote these arguments from
    memory. I'm confident that I remember them correctly, but apologies if I
    misremembered some details.

These claims don't make a lot of sense to me though. HTTP runs over TCP, so
it's not going to be magically faster. Table joins do make queries complex, but
they are a common feature that SQL engines are designed to optimize. And I
don't understand the point about SQL databases being designed to save space.
While disk capacities have skyrocketed, even the fastest disks are extremely
slow compared to how fast CPUs can crunch numbers. A single cache miss can
stall a CPU core for hundreds of cycles, so it's critical to fit data in
cache. That means making your data take up as little space as possible.
Perhaps Alex is talking about data normalization, which is a property of
database schemas and not the database itself, but normalization is not about
saving space either; it's about keeping a single source of truth for
everything. I feel like at the end of the day, these arguments just boil down
to "SQL is old and ugly, NoSQL is new and fresh".

That being said, I think there is still the undeniable truth that in practice
people do hit performance issues with SQL databases far more often than they
hit performance issues with NoSQL databases like DynamoDB. And I think I know
why: it's because DynamoDB makes what is slow explicit.

Look at these two SQL queries. Can you spot the performance difference between
them?
```SQL
SELECT * FROM users WHERE user_id = ?;
SELECT * FROM users WHERE group_id = ?;
```
It's a trick question: of course you can't! Not without looking at the table
schema to check whether there are indexes on `user_id` or `group_id`. And for
anything more complex, you'd likely have to run `EXPLAIN ...` to make sure the
database will actually execute the query the way you think it will.
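
To make this concrete, here's a minimal sketch using Python's built-in
`sqlite3` (the schema is made up for illustration, and the exact plan text
varies by SQLite version):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE users (user_id INTEGER PRIMARY KEY, group_id INTEGER)")

# user_id is the primary key, so this lookup can use an index.
for row in con.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM users WHERE user_id = ?", (1,)
):
    print(row)  # e.g. (..., 'SEARCH users USING INTEGER PRIMARY KEY (rowid=?)')

# group_id has no index, so the same-looking query walks the whole table.
for row in con.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM users WHERE group_id = ?", (1,)
):
    print(row)  # e.g. (..., 'SCAN users')
```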
I think this makes it easy to write bad queries. Look at [Jesse Skinner's article](https://www.codingwithjesse.com/blog/debugging-a-slow-web-app/)
about the time he found a web app where all the `SELECT` queries used `LIKE`
instead of `=`, which meant the queries were not using indexes at all! While
it's easy to think that the developer who used `LIKE` everywhere was just a
bad developer, I think the realization we need to come to is that it is too
easy to make these mistakes. The same `SELECT` query could be looking up a
single item by its primary key, or it could be doing a slow table scan. The
same syntax could return a single result, or it could return a million
results. If you make a mistake, there is no indication that you made one until
your application has been live for months or even years and your database has
grown to a size where these queries are choking.
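
Here's a sketch of that exact trap, again with SQLite and a made-up schema:
even with an index in place, a `LIKE` with a leading wildcard quietly falls
back to scanning every row, and the only hint is in the query plan:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE users (name TEXT)")
con.execute("CREATE INDEX idx_users_name ON users (name)")

# Equality can use the index...
for row in con.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM users WHERE name = 'kaan'"
):
    print(row)  # e.g. (..., 'SEARCH users USING COVERING INDEX idx_users_name (name=?)')

# ...but a leading-wildcard LIKE cannot, despite looking nearly identical.
for row in con.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM users WHERE name LIKE '%kaan%'"
):
    print(row)  # e.g. (..., 'SCAN users')
```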
On one hand, I think this speaks to how performant SQL databases are. You can
write garbage queries and still get decent performance until your tables grow
to hundreds of thousands of rows! But at the same time, I think this is
exactly why DynamoDB ends up being more scalable in production: bad queries
are explicit.

With DynamoDB, if you want to get just one item by its unique key, you use a
`Get` operation that makes this explicit. If you select items based on a key
condition, that's an explicit `Query` operation, and it will only return a
small number of results and require you to paginate with a cursor, again
making it explicit that you could be querying for many items. And a query
never falls back to scanning the entire table; you have to use a `Scan`
operation for that, which makes it explicit that you are doing something
wrong.
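
Here's what that looks like as a minimal sketch with boto3 (the table, key,
and index names are all hypothetical); note that the slow path is a separate
API call you can grep for in code review, not a slightly different `WHERE`
clause:

```python
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("users")  # hypothetical table

# Get: exactly one item, addressed by its full primary key.
item = table.get_item(Key={"user_id": "123"}).get("Item")

# Query: items matching a key condition, returned one page at a time.
# Assumes a (hypothetical) global secondary index keyed on group_id.
kwargs = {
    "IndexName": "by_group_id",
    "KeyConditionExpression": Key("group_id").eq("admins"),
}
page = table.query(**kwargs)
items = page["Items"]
while "LastEvaluatedKey" in page:  # more results: you must follow the cursor
    page = table.query(ExclusiveStartKey=page["LastEvaluatedKey"], **kwargs)
    items += page["Items"]

# Scan: reads the entire table. The cost is impossible to miss.
everything = table.scan()
```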
Rather than any magic about table joins or differences in connection types, I
think this is really the biggest difference in what makes DynamoDB more
scalable. It's not because DynamoDB is magic; it's because it makes bad
patterns more visible. I think it's critical that our tools are explicit, and
even painful, when we use them in bad patterns, because we will accidentally
follow bad patterns if it's easy to do so.

I want to add, though, that DynamoDB is not perfect in this regard either. I
see this particularly with filters. It's easy to see why Amazon added filters,
but it's not rare for people to use them without understanding how they work
and end up making mistakes (for example, [here](https://stackoverflow.com/questions/64814040/dynamodb-scan-filter-not-returning-results-for-some-requests)).
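
The usual surprise, sketched below with boto3 (names hypothetical), is that
DynamoDB applies the filter *after* reading items and after `Limit` is
applied, so a filtered page can come back empty even when plenty of matching
items exist further along in the table:

```python
import boto3
from boto3.dynamodb.conditions import Attr

table = boto3.resource("dynamodb").Table("users")  # hypothetical table

# Reads at most 10 items, THEN filters them. If none of those 10 happen to
# match, this page is empty, even if the table is full of matches.
page = table.scan(FilterExpression=Attr("active").eq(True), Limit=10)
print(page["Items"])  # possibly []

# You still pay to read every item the scan touches; the filter only trims
# what gets sent back. Finding all matches means following LastEvaluatedKey
# all the way to the end of the table.
```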