Add post DynamoDB making the slow explicit
	
		
			
	
		
	
	
		
	
		
			All checks were successful
		
		
	
	
		
			
				
	
				ci/woodpecker/push/woodpecker Pipeline was successful
				
			
		
		
	
	
				
					
				
			
		
			This commit is contained in:
		
							parent
							
								
									d0a964fc60
								
							
						
					
					
						commit
						acf5d30862
					
				|  | @ -0,0 +1,90 @@ | |||
| --- | ||||
| title: "Making the Slow Explicit: DynamoDB vs SQL" | ||||
| date: 2023-02-26T15:51:19-05:00 | ||||
| toc: false | ||||
| images: | ||||
| tags: | ||||
|   - dev | ||||
|   - web | ||||
| --- | ||||
| 
 | ||||
| SQL databases like MySQL, MariaDB, and PostgreSQL are highly performant and can | ||||
| scale well. In practice, however, people often run into performance issues with | ||||
| these databases and turn to NoSQL solutions like DynamoDB. | ||||
| 
 | ||||
| Proponents of DynamoDB like Alex DeBrie, the author of ["The DynamoDB Book"](https://www.dynamodbbook.com/), | ||||
| point to a few reasons for this difference: the HTTP-based APIs of NoSQL databases are more efficient than the TCP connections used by SQL databases, | ||||
| table joins are slow, and SQL databases are designed to save disk space while NoSQL databases take advantage of large modern disks.[^1] | ||||
| 
 | ||||
| [^1]: I don't have my copy of the book handy, so I wrote these arguments from | ||||
|     memory. I'm confident that I remember them correctly, but apologies if I | ||||
|     misremembered some details. | ||||
| 
 | ||||
| These claims don't make a lot of sense to me though. HTTP runs over TCP, so it's | ||||
| not going to be magically faster. Table joins do make queries complex, but they | ||||
| are a common feature that SQL engines are designed to optimize. And I don't | ||||
| understand the point about SQL databases being designed to save space. While | ||||
| disk capacities have skyrocketed, even the fastest disks are extremely slow | ||||
| compared to how fast CPUs can crunch numbers. A single cache miss can stall a | ||||
| CPU core for hundreds of cycles, and a read that goes all the way to disk costs | ||||
| orders of magnitude more, so it's critical to fit data in cache and memory. That | ||||
| means making your data take up as little space as possible. Perhaps Alex is | ||||
| talking about data normalization, which is a property of database schemas and not | ||||
| of the database itself. But normalization is not about saving space either; it's | ||||
| about keeping a single source of truth for everything. I feel like at the end of | ||||
| the day, these arguments just boil down to "SQL is old and ugly, NoSQL is new | ||||
| and fresh". | ||||
| 
 | ||||
| That being said, I think there is still the undeniable truth that people in | ||||
| practice do hit performance issues with SQL databases far more often than they | ||||
| hit performance issues with NoSQL databases like DynamoDB. And I think I know | ||||
| why: it's because DynamoDB makes what is slow explicit. | ||||
| 
 | ||||
| Look at these two SQL queries. Can you spot the performance difference between | ||||
| them? | ||||
| 
 | ||||
| ```SQL | ||||
| SELECT * FROM users WHERE user_id = ?; | ||||
| SELECT * FROM users WHERE group_id = ?; | ||||
| ``` | ||||
| 
 | ||||
| It's a trick question, of course you can't! Not without looking at the table | ||||
| schema to check if there are indexes on `user_id` or `group_id`. And if the | ||||
| query were more complex, you'd likely have to run `EXPLAIN ...` to make sure the | ||||
| database will actually execute it the way you think it will. | ||||
| 
 | ||||
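To make this concrete, here is a minimal sketch using Python's built-in `sqlite3` module. The `users` schema, the `name` column, and the index name are assumptions made up for illustration; other engines word their plans differently, but the idea carries over. The same-looking query is planned as a key lookup or as a full scan depending only on the schema.

```python
import sqlite3

# Hypothetical schema: user_id is the primary key, group_id has no index yet.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE users (user_id INTEGER PRIMARY KEY, group_id INTEGER, name TEXT)"
)


def plan(sql, params=()):
    """Return SQLite's description of how it intends to run the query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The human-readable detail is the last column of each plan row.
    return " | ".join(row[-1] for row in rows)


by_user = plan("SELECT * FROM users WHERE user_id = ?", (1,))
by_group = plan("SELECT * FROM users WHERE group_id = ?", (1,))
print(by_user)   # a SEARCH on the primary key
print(by_group)  # a SCAN over the whole table

# Adding an index changes the plan without changing the query text at all.
conn.execute("CREATE INDEX idx_users_group ON users (group_id)")
by_group_indexed = plan("SELECT * FROM users WHERE group_id = ?", (1,))
print(by_group_indexed)  # now a SEARCH using idx_users_group
```

Nothing in the query text changed between the second and third plan, which is exactly why the difference is invisible when reading application code.
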
| I think this makes it easy to write bad queries. Look at [Jesse Skinner's article](https://www.codingwithjesse.com/blog/debugging-a-slow-web-app/) | ||||
| about the time he found a web app where all the `SELECT` queries were using `LIKE` instead of `=`, | ||||
| which meant that the queries were not using indexes at all! While it's easy to | ||||
| think that the developer who made the mistake of using `LIKE` everywhere was just | ||||
| a bad developer, I think the realization we need to come to is that it is too easy to make these mistakes. | ||||
| The same `SELECT` query could be looking up a single item by its primary key, | ||||
| or it could be doing a slow table scan. The same syntax could return you a single result, or it could return you a million results. | ||||
| If you make a mistake, there is no indication of it until | ||||
| your application has been live for months or even years and your database has grown to a size | ||||
| where these queries are now choking. | ||||
| 
 | ||||
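The `LIKE` versus `=` trap can be reproduced in a few lines. This is a sketch against SQLite with its default settings (case-insensitive `LIKE`, an ordinary index on the column); the table, column, and index names are made up, and other databases have their own rules for when `LIKE` can use an index.

```python
import sqlite3

# Hypothetical table with an ordinary index on the email column.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (user_id INTEGER PRIMARY KEY, email TEXT, name TEXT)")
conn.execute("CREATE INDEX idx_users_email ON users (email)")


def plan_detail(sql):
    """Return the detail column of SQLite's query plan as one string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[-1] for row in rows)


# Same column, same value, no wildcards: only the operator differs.
eq_plan = plan_detail("SELECT * FROM users WHERE email = 'a@example.com'")
like_plan = plan_detail("SELECT * FROM users WHERE email LIKE 'a@example.com'")

print(eq_plan)    # a SEARCH that uses idx_users_email
print(like_plan)  # a SCAN, because the default LIKE cannot use this index
```

Both queries return the same rows on a small table, so the mistake stays hidden until the table is large enough for the scan to hurt.
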
| On one hand, I think this speaks to how performant SQL databases are. You | ||||
| can write garbage queries and still get decent performance until your tables | ||||
| grow to hundreds of thousands of rows! But at the same time I think this is | ||||
| exactly why DynamoDB ends up being more scalable in production: because bad | ||||
| queries are explicit. | ||||
| 
 | ||||
| With DynamoDB, if you want to get just one item by its unique key, then you use | ||||
| a `GetItem` operation that makes this explicit. If you select items based on a | ||||
| key condition, that's an explicit `Query` operation. A `Query` returns only a | ||||
| limited page of results (at most 1 MB of data per call) and requires you to | ||||
| paginate with a cursor, again making it explicit that you could be querying for | ||||
| many items! And a query never falls back to scanning an entire table. For that | ||||
| you have to use a `Scan` operation, which makes it explicit that you are doing | ||||
| something wrong. | ||||
| 
 | ||||
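The shape of that API can be sketched with a toy in-memory table. To be clear, `ToyTable` and its methods are hypothetical and are not the DynamoDB API or any SDK; the sketch only mirrors the idea that each access pattern has its own operation, and that a query hands back one page plus a cursor.

```python
from dataclasses import dataclass, field


@dataclass
class ToyTable:
    """Hypothetical in-memory stand-in; NOT the real DynamoDB API."""

    page_size: int = 2
    # partition key -> {sort key -> item}
    partitions: dict = field(default_factory=dict)

    def put_item(self, pk, sk, item):
        self.partitions.setdefault(pk, {})[sk] = item

    def get_item(self, pk, sk):
        """Return exactly one item (or None) by its full key."""
        return self.partitions.get(pk, {}).get(sk)

    def query(self, pk, cursor=None):
        """Return one page from a single partition, plus a cursor if more remain."""
        keys = sorted(self.partitions.get(pk, {}))
        if cursor is not None:
            keys = [k for k in keys if k > cursor]
        page = keys[: self.page_size]
        next_cursor = page[-1] if len(keys) > self.page_size else None
        return [self.partitions[pk][k] for k in page], next_cursor

    def scan(self):
        """Walk every item in every partition; the name says what it costs."""
        for pk in sorted(self.partitions):
            for sk in sorted(self.partitions[pk]):
                yield self.partitions[pk][sk]


table = ToyTable()
for user_id, group in [(1, "a"), (2, "a"), (3, "a"), (4, "b")]:
    table.put_item(group, user_id, {"user_id": user_id, "group_id": group})

one = table.get_item("a", 2)  # single-item lookup

# Querying a group forces the caller to loop over pages.
members, cursor = [], None
while True:
    page, cursor = table.query("a", cursor)
    members.extend(page)
    if cursor is None:
        break

everything = list(table.scan())  # the explicit "read it all" escape hatch
```

The pagination loop is the point: the caller cannot forget that a query might return many items, because the code to handle that case has to be written.
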
| Rather than any magic about table joins or differences in connection types, I | ||||
| think this is really the biggest difference in what makes DynamoDB more | ||||
| scalable. It's not because DynamoDB is magic, it's because it makes bad patterns | ||||
| more visible. I think it's critical that we make our tools explicit, and even | ||||
| painful, when they are used in bad patterns, because we will accidentally follow | ||||
| bad patterns if it's easy to do so. | ||||
| 
 | ||||
| I want to add though, DynamoDB is not perfect in this regard either. I | ||||
| particularly see this with filters. It's easy to see why Amazon added filters, | ||||
| but people often use them without understanding how they work: a filter is | ||||
| applied after the items have been read, so a page can come back with few or no | ||||
| results even though matching items remain further on. This leads to mistakes (for example, [here](https://stackoverflow.com/questions/64814040/dynamodb-scan-filter-not-returning-results-for-some-requests)). | ||||