Database Isolation Levels - explained as if you were building a Snowman
Chronicles of making Databases Fun
Hi there! 👋
It’s Saturday and what better way to spend 3 mins of your weekend than by refreshing your knowledge on DB Transaction Isolation Levels, right? RIGHT? 🤓
I know, I know, who cares about that? It’s a boring and hard-to-grasp topic. And you have your fancy ORM that does it for you anyway … Which all makes sense until you learn that there are some weird scenarios like “Lost Updates” and “Dirty Reads”. Ouch!
“Well, no problem”, you say, “I’ll just tighten my Transaction Isolation Level. I don’t even need to understand how it works, because the higher the better”. And that’s true - a higher isolation level does give you stronger guarantees; until you see the performance hit :)
Here’s what I want to do - I want to give you a frame in which to think about DB Transaction Isolation Levels, but through the lens of something that you can easily remember. Through the lens of Building a Snowman :)
As usual, we go with infographic first and then with more detailed text.
I hope that was amusing enough :) Because Isolation Levels are actually really cool concepts to grasp. They do have their own set of intricacies, but, for starters, you should at least be aware of what they are.
Before I give you more context, I just want to let you know of the results of my 15-sec google search on Default Isolation levels in each DB:
MySQL - REPEATABLE READ
PostgreSQL - READ COMMITTED
SQL Server - READ COMMITTED
Oracle - READ COMMITTED
I want to emphasize that I literally did a quick search for each and this is what came up as a result. As usual, your mileage on your production database may vary, so do make sure to check it yourself.
So what is this all about? Well, it’s really simple - Transaction Isolation Levels are all about what happens when multiple users try to read & write data in the same row at the same time. That’s concurrency. Isolation levels answer the question of “what will happen and how do I resolve the conflicts”. You can also think of it as two users trying to edit the same Excel row at the same time. That’d be concurrency as well.
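The “Lost Update” mentioned earlier is the classic example of what can go wrong with zero isolation. Here’s a toy sketch in Python - no real database involved, just a dict standing in for a row, and made-up names throughout:

```python
# Toy model of a "lost update": two clients do read-modify-write on the
# same row without any isolation in between.
row = {"balance": 100}

# Both clients read the current value first...
seen_by_a = row["balance"]   # A sees 100
seen_by_b = row["balance"]   # B sees 100

# ...then each writes back its own computed result.
row["balance"] = seen_by_a + 10   # A deposits 10 -> writes 110
row["balance"] = seen_by_b + 20   # B deposits 20 -> writes 120, clobbering A

print(row["balance"])  # 120, not the 130 we'd expect - A's deposit is lost
```

A real database prevents this by making B’s read-modify-write wait for (or conflict with) A’s, depending on the isolation level in effect.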
And here’s what happens under each level:
Under READ UNCOMMITTED, everything is fully transparent and everybody sees everything. If one user starts a transaction and starts editing the data, all other users will see all their changes even though they weren’t committed. This is fine in cases where you know that concurrency is highly unlikely to happen (e.g. you have a single writer and multiple readers, or whatever).
The problem arises if a user reads something that another user later decides to roll back. We call this scenario a “dirty read” and, depending on the context, the consequences could range from “meh” to “well, we’re in deep shit now”.
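A dirty read is easiest to see in a toy simulation - again just a Python dict standing in for a row, since most databases (rightly) make this hard to reproduce for real:

```python
# Toy simulation of a dirty read under READ UNCOMMITTED. A real database
# manages this state internally; the dict here just stands in for a row.
row = {"balance": 100}

# Writer starts a "transaction": it remembers the old value so it can roll back.
before = row["balance"]
row["balance"] = 999          # uncommitted change, visible to everyone

# A reader comes along mid-transaction and sees the uncommitted value.
dirty_value = row["balance"]  # 999 - a dirty read

# The writer changes its mind and rolls back.
row["balance"] = before

print(dirty_value, row["balance"])  # 999 100 - the reader acted on data that never "existed"
```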
Under READ COMMITTED, the database ensures that you only see data that has been committed. This is actually pretty good for most scenarios, except - and this is amusing - except if you wanted to ensure that you don’t see different data within the same transaction.
It’s a hard-to-grasp scenario so I’ll give you an example - a long-running analytics query; or a machine learning thingy, whatever. Under READ COMMITTED, you would see some data that has been committed to the DB. All good. But imagine your transaction is a long-running one (e.g. it lasts for several minutes). What could happen is that somebody modifies the data AGAIN, and if you re-read it you’d see a different value. This problem of possibly seeing DIFFERENT data across two SELECT statements in a single transaction is known as a “non-repeatable read”, and it’s not ideal for all scenarios. And that’s what gets solved with REPEATABLE READ.
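You can sketch the “same query, different answers” effect with SQLite from Python. SQLite doesn’t literally offer a READ COMMITTED level, but two reads in autocommit mode behave the same way: each statement sees whatever happens to be committed at the moment it runs. Table and file names below are made up for the demo:

```python
import os
import sqlite3
import tempfile

# Two sessions (connections) to the same database file.
path = os.path.join(tempfile.mkdtemp(), "demo.db")
a = sqlite3.connect(path, isolation_level=None)  # autocommit mode
b = sqlite3.connect(path, isolation_level=None)

a.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER)")
a.execute("INSERT INTO accounts VALUES (1, 100)")

# Session A reads the row...
first = a.execute("SELECT balance FROM accounts WHERE id = 1").fetchone()[0]

# ...session B commits a change in between...
b.execute("UPDATE accounts SET balance = 200 WHERE id = 1")

# ...and session A's second, identical read sees the new value.
second = a.execute("SELECT balance FROM accounts WHERE id = 1").fetchone()[0]

print(first, second)  # 100 200 - same query, different answers
```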
Under REPEATABLE READ, your transaction is guaranteed to see the EXACT SAME data all the time, for as long as the transaction takes. Even if the data changes in the meantime, you are still guaranteed that, in a single long-running transaction, you keep seeing the exact same values.
Frankly speaking, this is probably OK for 99% of the scenarios. The remaining 1% of cases would be if you want to ensure that no new rows are added or removed while your transaction is running (the so-called “phantom reads” - think of a query that depends on the exact number of rows). For such cases, you want to go with the SERIALIZABLE level.
Under the SERIALIZABLE level, the DB ensures that concurrent transactions behave as if they were executed serially (i.e. one after another). Think of it as physically locking the database while you are doing stuff in it. It can be super-slow because there’s little to no concurrency, but it’s effective if you need to ensure the data stays exactly as your transaction sees it.
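SQLite actually does the “lock the database” version of this for writers: a transaction takes the write lock for its whole duration, and a second writer has to wait or give up. A small sketch (names made up; a real server would queue or abort the conflicting transaction instead):

```python
import os
import sqlite3
import tempfile

path = os.path.join(tempfile.mkdtemp(), "demo.db")
a = sqlite3.connect(path, isolation_level=None)
b = sqlite3.connect(path, isolation_level=None, timeout=0.1)  # give up fast

a.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER)")
a.execute("INSERT INTO accounts VALUES (1, 100)")

a.execute("BEGIN IMMEDIATE")  # grab the write lock up front

try:
    b.execute("BEGIN IMMEDIATE")  # second writer can't get in
    blocked = False
except sqlite3.OperationalError:  # "database is locked"
    blocked = True

a.execute("COMMIT")
print(blocked)  # True - writers are forced to run one after another
```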
And that would be it, really :) As you can see - it’s not that hard, right? READ COMMITTED would serve most of the use-cases, but if you need to level-up the game - you can always upgrade to REPEATABLE READ.
Finally, based on your use-case, you could also consider DOWNGRADING the isolation level. For example, if you are 100% sure that two transactions will never touch the same row, or that they’d operate on different columns - you can consider easing the isolation and effectively improving performance :)
Hope you enjoyed this article! If you did, I’d appreciate if you share it with others as well :)
You can also subscribe if you haven’t already :) I publish at least twice a week, but usually way more often:
Thanks for reading!