The Jackpot That Went Wrong
A case for Serializability
This morning I came across several posts on LinkedIn about a jackpot gone wrong, and it got me thinking about how this could have happened.
Let’s put our tech brains to use and reproduce this scenario at a very small scale. We have a bunch of customers who decided to play & we can choose only 2 of them for the Jackpot.
1. Create a table bets:
CREATE TABLE bets (
    name VARCHAR(255),
    bet DECIMAL(10, 2),
    jackpot DECIMAL(15, 2)
);
2. Four people decided to play. So let’s put their names in the pot. No one has won the jackpot yet.
INSERT INTO bets (name, bet, jackpot) VALUES
    ('Alice', 1000, NULL),
    ('Bob', 1000, NULL),
    ('Charlie', 1000, NULL),
    ('Diana', 1000, NULL);
3. To decide on the jackpot, our best developer wrote some code which mistakenly spawns multiple threads. [In the real world, this may be deliberate to go through a large volume of participants.]
We are using the READ COMMITTED isolation level for this. It is the default in many databases, including Postgres, so you may not need to do anything to get it.
3 sessions run the code block below, each selecting a different name: one used Alice, another Bob, and yet another Charlie.
BEGIN TRANSACTION ISOLATION LEVEL READ COMMITTED;

-- Count the number of people who have won the jackpot
WITH jackpot_winners AS (
    SELECT COUNT(*) AS winner_count
    FROM bets
    WHERE jackpot IS NOT NULL
)
-- Conditionally update the row for 'Alice' if there are fewer than 2 winners
UPDATE bets
SET jackpot = 10000
WHERE name = 'Alice'
  AND jackpot IS NULL
  AND (SELECT winner_count FROM jackpot_winners) < 2;
No one has committed yet, but 3 sessions have been able to run this code block.
Now all 3 commit & we run a query to see how many people won.
Surprisingly, even though each session updated only because it saw fewer than 2 winners, we now have 3 winners.
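The winners can be listed with a simple query; since the sessions used Alice, Bob and Charlie, all three rows come back:

```sql
-- List everyone who was awarded the jackpot
SELECT name, jackpot
FROM bets
WHERE jackpot IS NOT NULL;
-- 3 rows: Alice, Bob and Charlie each got 10000
```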
What went wrong?
READ COMMITTED only guarantees that each statement sees data committed before that statement began. It never re-checks your conditions at commit time, so each session counted zero winners, passed the < 2 check, and committed.
4. Let’s now run this with the modified code block below after emptying & refilling the pot.
Serializable is the default isolation level in CockroachDB. So, there is no change needed, and the developers don’t have to think about the isolation level.
BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE;

-- Count the number of people who have won the jackpot
WITH jackpot_winners AS (
    SELECT COUNT(*) AS winner_count
    FROM bets
    WHERE jackpot IS NOT NULL
)
-- Conditionally update the row for 'Charlie' if there are fewer than 2 winners
UPDATE bets
SET jackpot = 10000
WHERE name = 'Charlie'
  AND jackpot IS NULL
  AND (SELECT winner_count FROM jackpot_winners) < 2;
This time, all the statements execute just as they did in the first case, but only 1 session will be able to commit. The rest will need to retry.
This preserves the correctness of the system.
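In Postgres, for example, the losing sessions fail at COMMIT instead of silently succeeding. The exact message varies by database, but it looks roughly like this:

```sql
COMMIT;
-- ERROR:  could not serialize access due to read/write dependencies among transactions
-- HINT:  The transaction might succeed if retried.
```

The application is expected to catch this error and re-run the whole transaction, which will then see the committed winners and make the right decision.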
What would have happened if the transactions were making changes to the same row under read committed isolation level in this case?
In the scenario above, we assumed that each thread picks a different record, which is what let us observe the incorrect behavior. Such bugs usually go unnoticed; for instance, nothing would have gone wrong here if every thread had happened to pick the same record.
The second session would wait for the first to commit, because the first session holds a lock on the row. Once the first session commits, the second session makes no changes, since the jackpot IS NULL condition no longer matches. And in this particular case, the outcome would not be any different even if it wrote the data again.
However, this may not hold in your application, say where each thread updates the jackpot with a different amount and what matters is the total value of the jackpot rather than the number of winners.
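If you do have to stay on READ COMMITTED, the usual escape hatch is to serialize the writers yourself with explicit row locks. Here is a sketch (a workaround of my own, not part of the original code):

```sql
BEGIN TRANSACTION ISOLATION LEVEL READ COMMITTED;

-- Lock the candidate rows up front; concurrent sessions block here
-- until the lock holder commits, then re-evaluate against fresh data.
SELECT * FROM bets WHERE jackpot IS NULL FOR UPDATE;

UPDATE bets
SET jackpot = 10000
WHERE name = 'Alice'
  AND jackpot IS NULL
  AND (SELECT COUNT(*) FROM bets WHERE jackpot IS NOT NULL) < 2;

COMMIT;
```

This works in Postgres because, after waiting on the lock, the UPDATE runs against a fresh snapshot that includes the other session's commit. But correctness now depends on every code path remembering to take the lock, which is exactly the kind of burden serializable isolation removes.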
Serializable isolation just makes life easy and guarantees correctness under all conditions. Consider it before reaching for the lower isolation levels, which may be more performant but can compromise correctness.
Remember: premature optimization is the root of all evil.