diff --git a/content/posts/2023.02.26.making-the-slow-explicit-dynamodb-sql.md b/content/posts/2023.02.26.making-the-slow-explicit-dynamodb-sql.md
new file mode 100644
index 0000000..ae8a251
--- /dev/null
+++ b/content/posts/2023.02.26.making-the-slow-explicit-dynamodb-sql.md
@@ -0,0 +1,90 @@
+---
+title: "Making the Slow Explicit: DynamoDB vs SQL"
+date: 2023-02-26T15:51:19-05:00
+toc: false
+images:
+tags:
+  - dev
+  - web
+---
+
+SQL databases like MySQL, MariaDB, and PostgreSQL are highly performant and
+can scale well. However, in practice it's not rare that people run into
+performance issues with these databases and run to NoSQL solutions like
+DynamoDB.
+
+Proponents of DynamoDB like Alex DeBrie, the author of ["The DynamoDB Book"](https://www.dynamodbbook.com/),
+point to a few explanations for this difference: the HTTP-based APIs of NoSQL
+databases are more efficient than the TCP connections SQL databases use, table
+joins are slow, and SQL databases are designed to save disk space while NoSQL
+databases take advantage of large modern disks.[^1]
+
+[^1]: I don't have my copy of the book handy, so I wrote these arguments from
+    memory. I'm fairly confident I remember them correctly, but apologies if I
+    misremembered some details.
+
+These claims don't make a lot of sense to me though. HTTP runs over TCP; it's
+not going to be magically faster. Table joins do make queries complex, but
+they are a common feature that SQL engines are designed to optimize. And I
+don't understand the point about SQL databases being designed to save space.
+While disk capacities have skyrocketed, even the fastest disks are extremely
+slow compared to how fast CPUs can crunch numbers. A cache miss that goes to
+main memory can stall a CPU core for hundreds of cycles, and a read that has
+to go all the way to disk costs millions, so it's critical to fit data in
+cache. That means making your data take up as little space as possible.
+Perhaps Alex is talking about data normalization, which is a property of
+database schemas and not the database itself, but normalization is not about
+saving space either; it's about keeping a single source of truth for
+everything. I feel like at the end of the day, these arguments just boil down
+to "SQL is old and ugly, NoSQL is new and fresh".
+
+That being said, there is still the undeniable truth that in practice people
+hit performance issues with SQL databases far more often than they hit them
+with NoSQL databases like DynamoDB. And I think I know why: it's because
+DynamoDB makes what is slow explicit.
+
+Look at these two SQL queries. Can you spot the performance difference
+between them?
+
+```SQL
+SELECT * FROM users WHERE user_id = ?;
+SELECT * FROM users WHERE group_id = ?;
+```
+
+It's a trick question: of course you can't! Not without looking at the table
+schema to check if there are indexes on `user_id` or `group_id`. And you'd
+likely have to run `EXPLAIN ...` if the query was more complex to make sure
+the database will actually execute it the way you think it will.
+
+I think this makes it easy to write bad queries. Look at [Jesse Skinner's article](https://www.codingwithjesse.com/blog/debugging-a-slow-web-app/)
+about the time he found a web app where all the `SELECT` queries were using
+`LIKE` instead of `=`, which meant the queries were not using indexes at all
+(a `LIKE` with a leading wildcard can't use an ordinary B-tree index). While
+it's easy to think that the developer who used `LIKE` everywhere was just a
+bad developer, I think the realization we need to come to is that it is too
+easy to make these mistakes.
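+
+To actually see the difference, you have to ask the database for its query
+plan. Here's a minimal sketch, assuming a hypothetical MySQL `users` table
+where `user_id` is the primary key and `group_id` is unindexed; the exact
+`EXPLAIN` output varies by engine and version:
+
+```SQL
+-- Hypothetical table: primary key on user_id, no index on group_id.
+CREATE TABLE users (
+    user_id  BIGINT PRIMARY KEY,
+    group_id BIGINT NOT NULL,
+    name     VARCHAR(255) NOT NULL
+);
+
+-- Point lookup on the primary key: MySQL reports access type "const"
+-- and examines a single row.
+EXPLAIN SELECT * FROM users WHERE user_id = 42;
+
+-- Identical shape, but group_id has no index: access type "ALL",
+-- a full table scan that examines every row.
+EXPLAIN SELECT * FROM users WHERE group_id = 7;
+
+-- A leading-wildcard LIKE can't use a B-tree index on name even if
+-- one exists, so this scans every row too.
+EXPLAIN SELECT * FROM users WHERE name LIKE '%jesse%';
+```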
+
+The same `SELECT` query could be looking up a single item by its primary key,
+or it could be doing a slow table scan. The same syntax could return you a
+single result, or it could return you a million results. If you make a
+mistake, there is no indication of it until your application has been live
+for months or even years and your database has grown to a size where these
+queries are choking.
+
+On one hand, I think this speaks to how performant SQL databases are. You can
+write garbage queries and still get decent performance until your tables grow
+to hundreds of thousands of rows! But at the same time, I think this is
+exactly why DynamoDB ends up being more scalable in production: bad queries
+are explicit.
+
+With DynamoDB, if you want to get just one item by its unique key, you use a
+`GetItem` operation that makes this explicit. If you select items based on a
+key condition, that's an explicit `Query` operation. And a query returns at
+most 1 MB of results per call and requires you to paginate with a cursor for
+the rest, again making it explicit that you could be querying for many items!
+A query never falls back to scanning an entire table; you need a `Scan`
+operation for that, which makes it explicit that you are doing something
+wrong.
+
+Rather than any magic about table joins or differences in connection types, I
+think this is the biggest difference in what makes DynamoDB more scalable.
+It's not that DynamoDB is magic; it's that it makes bad patterns visible. I
+think it's critical that our tools be explicit, and even painful, about bad
+patterns, because we will follow bad patterns by accident if it's easy to do
+so.
+
+I want to add, though, that DynamoDB is not perfect in this regard either. I
+particularly see this with filters. It's easy to see why Amazon added
+filters, but a filter expression is applied *after* the items are read, so a
+filtered `Query` or `Scan` still reads, and pays for, everything it touches,
+and can even return empty pages while matching items exist further along.
+It's not rare that people use filters without understanding this and end up
+making mistakes (for example, [here](https://stackoverflow.com/questions/64814040/dynamodb-scan-filter-not-returning-results-for-some-requests)).
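+
+To make the contrast concrete, here is a rough sketch of these operations
+using Python and boto3. The table name, key names, and values are all
+hypothetical; the point is that every access pattern is a separate,
+explicitly named call:
+
+```python
+import boto3
+
+dynamodb = boto3.client("dynamodb")
+TABLE = "users"  # hypothetical: partition key group_id, sort key user_id
+
+# GetItem: one item, by its full primary key. Explicitly a point lookup.
+item = dynamodb.get_item(
+    TableName=TABLE,
+    Key={"group_id": {"S": "g-1"}, "user_id": {"S": "u-42"}},
+)
+
+# Query: items sharing a partition key. Explicitly paginated; each call
+# returns at most 1 MB and hands back a cursor for the rest.
+cursor = None
+while True:
+    kwargs = {
+        "TableName": TABLE,
+        "KeyConditionExpression": "group_id = :g",
+        "ExpressionAttributeValues": {":g": {"S": "g-1"}},
+    }
+    if cursor:
+        kwargs["ExclusiveStartKey"] = cursor
+    page = dynamodb.query(**kwargs)
+    for item in page["Items"]:
+        ...  # process each item
+    cursor = page.get("LastEvaluatedKey")
+    if cursor is None:
+        break
+
+# Scan: the whole table. You cannot stumble into this; you have to ask.
+page = dynamodb.scan(TableName=TABLE)
+
+# The filter caveat: the FilterExpression runs after each 1 MB page is
+# read, so this can return an empty Items list even when matching items
+# exist further along; you still have to follow LastEvaluatedKey.
+page = dynamodb.scan(
+    TableName=TABLE,
+    FilterExpression="account_type = :t",
+    ExpressionAttributeValues={":t": {"S": "admin"}},
+)
+```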