What is SKIP LOCKED for in PostgreSQL 9.5?

Most work queue implementations in SQL are wrong

The majority of PostgreSQL-based implementations of work queues I’ve seen in applications, on Stack Overflow, etc. have been buggy in one of a few ways:

  • It fails to consider that statements don’t execute atomically, tries to use subqueries and/or writeable CTEs as if the whole statement is a single atomic unit, and as a result hands out the same work queue entry to multiple workers when run concurrently;
  • It marks a job as done as soon as it hands it out, with no recovery mechanism for when a worker takes a job and then fails, so the job is just lost; or
  • It thinks it’s handing jobs out concurrently, but in practice all but one worker are blocked on a row lock so all workers take turns getting jobs.

The few exceptions I’ve seen generally use PostgreSQL’s advisory locking features or various time-and-expiry based methods of queue cleanup and recovery. That includes the popular off-the-shelf ones that are known to work well; they do, just at a performance cost.

“Find me the next unclaimed row” shouldn’t be hard, though, surely?

The hard part of the problem boils down to:

“How do I find the first row (by some given ordering) in a queue table that nobody else has claimed and claim it for myself? It needs to automatically revert to being unclaimed again if I crash or exit for any reason. Many other workers will be doing the same thing at the same time. It is vital that each item get processed exactly once; none may be skipped and none may be processed more than once.”

This is harder than you’d think because SQL statements do not execute atomically. A subquery might run before the outer query, depending on how the planner/optimizer does things. Many of the race conditions that can affect series of statements can also affect single statements with CTEs and subqueries, but the window in which they occur is narrower because the statement-parts usually run closer together. So lots of code that looks right proves not to be; it’s just right 99.95% of the time, or it’s always right until the day your business gets a big surge and concurrency goes up. Sometimes that’s good enough. Often it isn’t.

Additionally, PostgreSQL uses MVCC with snapshots to control row visibility. The main consequence is that you can’t “see” a row another transaction inserted until it commits. Even then PostgreSQL only lets that row “appear” to existing transactions at certain limited points, including the start of a new statement. PostgreSQL makes a few exceptions to this rule, most notably with UNIQUE indexes, but in general you can’t see uncommitted rows. PostgreSQL does not support a dirty-read (READ UNCOMMITTED) mode that would permit this.

That means that tricks like:

UPDATE queue
SET is_done = 't'
WHERE itemno = (
  SELECT itemno
  FROM queue
  WHERE NOT is_done
  ORDER BY itemno
  FOR UPDATE
  LIMIT 1
)
RETURNING itemno

look good, but don’t work.

If two such statements are started at exactly the same moment, the sub-SELECTs for each will be processed first. (It’s not that simple, but we can pretend for this purpose. Don’t rely on subqueries executing in any particular order for correctness.) Each one scans for a row with is_done = 'f'. Both find the same row and attempt to lock it. One succeeds, one waits on the other one’s lock. Whoops, your “concurrent” queue is serialized. If the first xact to get the lock rolls back, the second acquires the lock and processes the same row.

If the first xact commits, the second actually finds zero rows to UPDATE and returns nothing. PostgreSQL doesn’t re-execute the SELECT part; it just notices that the row was modified by another transaction while it was waiting for the lock and re-evaluates the WHERE clause for that row. Since is_done = 't' now, the row no longer matches the condition and is excluded. The subquery returns zero rows, so it evaluates to NULL, and no itemno is = NULL because nothing is equal to NULL, so the UPDATE does nothing.

If we remove the FOR UPDATE in the subquery we get a different bug. Now both subqueries can find the same itemno, and neither waits on a lock. Both will return it. Both UPDATEs will race to lock the row. One will succeed, the other will wait for the first to commit or roll back. If the first rolls back we’re fine, the second can continue. But if the first commits, the second will also mark the already-completed row as is_done = 't' and return it! The outer UPDATEs don’t have any test for is_done, since the inner subquery does that… so they’ll both happily update the same row.

If you protect against that by adding a second WHERE is_done = 'f' to the UPDATE itself, you get straight back to the first situation where one waits for the other and then returns zero rows.

You can create variants on this ad nauseam, but they won’t work. Some of them will merrily return duplicate rows during concurrent execution. Some will not return duplicate rows but will not be usefully concurrent. Some will return duplicate rows and still wait for each other.

Things that don’t work

Solutions that rely on updating the row to set a ‘claimed’ flag or active worker ID or similar need another update to mark the row as completed after it’s claimed, which has a performance cost. They need a way to determine that a worker is never going to finish a row, be certain that the worker isn’t still trying, and then assign the row to another worker. Additionally, they still can’t actually assign new work items concurrently; they just try to make the assignment step very fast. These approaches don’t scale to very high queue throughput.
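
For illustration only, a minimal sketch of that claimed-flag pattern might look like the following, using the itemno/is_done queue from the earlier example. The claimed_by and claimed_at columns, the worker name and the five-minute timeout are all hypothetical additions, not something from the original schema:

-- Claim the oldest unclaimed item for this worker.
-- Note: concurrent claimers still serialize on the row lock, and whoever
-- loses the race gets zero rows back and has to retry.
UPDATE queue
SET claimed_by = 'worker-42', claimed_at = now()
WHERE itemno = (
  SELECT itemno
  FROM queue
  WHERE NOT is_done AND claimed_by IS NULL
  ORDER BY itemno
  FOR UPDATE
  LIMIT 1
)
RETURNING itemno;

-- A second UPDATE is needed later to mark the item finished
-- (42 stands in for whatever itemno the claim returned):
UPDATE queue SET is_done = 't' WHERE itemno = 42;

-- And a periodic sweeper has to guess when a claim is dead:
UPDATE queue
SET claimed_by = NULL, claimed_at = NULL
WHERE NOT is_done AND claimed_at < now() - interval '5 minutes';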

Solutions that rely on locking the row while the worker proceeds need to hold an open transaction, and they usually look concurrent, but aren’t. Everything except the worker that’s currently processing will just be waiting on the row lock held by the active worker. You might have 100 workers, but 99 of them are waiting for work.

Solutions that use SERIALIZABLE isolation can be effective, but they fall down badly as concurrency goes up. The serializable conflict rate increases and they start generating a lot of database bloat and doing a lot of wasted work. They don’t tend to scale, but work well for cases where there’s not much concurrency.
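
As a rough sketch (not from the original post), such a worker might run everything in a SERIALIZABLE transaction and retry from the top whenever it hits a serialization error:

BEGIN ISOLATION LEVEL SERIALIZABLE;

UPDATE queue
SET is_done = 't'
WHERE itemno = (
  SELECT itemno
  FROM queue
  WHERE NOT is_done
  ORDER BY itemno
  LIMIT 1
)
RETURNING itemno;

-- ... do the work for the returned itemno here ...

-- Any statement (or the COMMIT) may fail with SQLSTATE 40001;
-- if so, ROLLBACK and retry the whole transaction from BEGIN.
COMMIT;

Under load most workers end up retrying against each other, which is exactly the wasted work and bloat described above.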

Solutions that use advisory locking can work well within limits. Instead of using tuple locks they call pg_try_advisory_xact_lock(...) in a loop, or use it with a LIMIT clause, to attempt to grab the first unlocked row. It works, but it requires that users go way outside the normal SQL programming model. They can’t use their queue table’s normal keys; they have to map them to either a 64-bit integer or two 32-bit integers. That namespace is shared across the whole database, it’s not per-table and it can’t be isolated per-application. Multiple apps that all want to use advisory locking on the same DB will tend to upset each other.
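
A minimal sketch of the LIMIT-based variant, assuming the queue key fits in a 64-bit integer (and remembering that the planner is free to evaluate the lock function on more rows than you might expect):

SELECT itemno
FROM queue
WHERE NOT is_done
  AND pg_try_advisory_xact_lock(itemno)
ORDER BY itemno
LIMIT 1;

The advisory lock is transaction-scoped, so the item is released automatically if the worker’s transaction aborts; note that nothing here updates the row, so marking it done still needs a separate statement.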

How SKIP LOCKED helps

SKIP LOCKED tries to make this easier by letting you use normal SQL to write efficient, safe queue systems. You don’t need to import a large and complex 3rd party app or library to implement a queue, and you don’t need to deal with the key mapping and namespace issues with advisory locking.

Given a trivial queue:

CREATE TABLE queue(
  itemid INTEGER PRIMARY KEY,
  is_done BOOLEAN NOT NULL DEFAULT 'f'
);

INSERT INTO queue(itemid)
SELECT x FROM generate_series(1,20) x;

an application can grab a single queue item safely while holding an open transaction with:

DELETE FROM queue
WHERE itemid = (
  SELECT itemid
  FROM queue
  ORDER BY itemid
  FOR UPDATE SKIP LOCKED
  LIMIT 1
)
RETURNING *;

This:

  • Scans the queue table in itemid order
  • Tries to acquire a lock on each row. If it fails to acquire the lock, it ignores the row as if it wasn’t in the table at all and carries on.
  • Stops scanning once it’s locked one item
  • Returns the itemid of the locked item
  • Looks up the found itemid in the index to get its physical location
  • Marks the tuple as deleted (but this doesn’t take effect until commit)

The open transaction holds a lock on the row now. Other transactions will skip it so long as this transaction keeps on running. If this transaction aborts, the row becomes unlocked and the deletion is undone by the abort so another transaction can grab the row. If this transaction commits, the deletion is committed along with the other changes in the xact, so other xacts won’t see the queue item anymore.

This is ideal whenever your work queue processing makes changes to the same database since it’s 100% atomic.
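
If you would rather keep completed rows than delete them, a variant of the same approach (my sketch, using the is_done column from the table above) marks the claimed item done instead:

UPDATE queue
SET is_done = 't'
WHERE itemid = (
  SELECT itemid
  FROM queue
  WHERE NOT is_done
  ORDER BY itemid
  FOR UPDATE SKIP LOCKED
  LIMIT 1
)
RETURNING itemid;

This is essentially the broken query from earlier, fixed by SKIP LOCKED: a competing worker now skips the locked row and claims the next one instead of blocking on it and then returning nothing.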

Example

A somewhat contrived usage example might be processing of a queue of pending balance transfers:

BEGIN;

DELETE FROM pending_balance_transfers
WHERE itemid = (
  SELECT itemid FROM pending_balance_transfers 
  ORDER BY itemid 
  FOR UPDATE SKIP LOCKED 
  LIMIT 1
)
RETURNING itemid, from_customer, to_customer, amount;

-- Do its processing and database updates in the same transaction
-- now, e.g.: if the queue contained "transfer balance of $100 from customer A to customer B"
-- we might:

UPDATE bank_balances
SET balance = balance - 100
WHERE customerid = 'customer_a';

UPDATE bank_balances
SET balance = balance + 100
WHERE customerid = 'customer_b';

-- and when it commits, the queue entry is marked completed atomically with
-- the work that was queued being committed.

COMMIT;

Downsides?

There are downsides to using SKIP LOCKED to implement a queue.

Each transaction scans the table and skips over locked rows, so with high numbers of active workers it can end up doing a bit of work to acquire a new item. It’s not just popping items off a stack. The query will probably have to walk an index with an index scan, fetching each candidate item from the heap and checking the lock status. With any reasonable queue this will all be in memory but it’s still a fair bit of churn.

(You can slightly reduce this by abusing ctid lookups instead of doing a second index lookup to find the row to update or delete, but that requires caution and I’m not going to go into details on it here).

A queue implemented in the RDBMS will never match the performance of a fast dedicated queueing system, even one that makes the same atomicity and durability guarantees as PostgreSQL. Using SKIP LOCKED is better than existing in-database approaches, but you’ll still go faster using a dedicated and highly optimised external queueing engine.

Using SKIP LOCKED of course takes you away from ANSI SQL. ORM users will need to use native queries and developers will in general have to be aware that they can’t expect to just run the same SQL on other platforms. (As if you ever can in reality). That said, Oracle has the same syntax with what looks like similar or the same semantics. Microsoft SQL Server has the READPAST locking hint. DB2 has SKIP LOCKED DATA. As far as I can tell MySQL has no equivalent yet. Still, three out of four ain’t bad.

What about external systems and 2PC?

When you want to do something outside the database the queue resides in as part of your queue processing you need two-phase commit to do it reliably. SKIP LOCKED can’t help you with the atomicity aspect there.

It still can help you acquire new work items efficiently, safely and concurrently. Items will stay locked while an xact is prepared after PREPARE TRANSACTION has run, will become available again if a prepared xact does a ROLLBACK PREPARED and will be removed when COMMIT PREPARED runs.
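
A rough sketch of that flow (the transaction identifier is arbitrary, and the server must have max_prepared_transactions set above zero):

BEGIN;

DELETE FROM queue
WHERE itemid = (
  SELECT itemid
  FROM queue
  ORDER BY itemid
  FOR UPDATE SKIP LOCKED
  LIMIT 1
)
RETURNING itemid;

-- Hand the transaction over to the resolver instead of committing directly.
PREPARE TRANSACTION 'queue-item-42';

-- ... perform the external side effect (send to the other system) ...

-- Then, depending on whether the external work succeeded:
COMMIT PREPARED 'queue-item-42';
-- or: ROLLBACK PREPARED 'queue-item-42';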

25 Comments

  • Create a sequence used for handing out queue records (“getpk”)

    CREATE TABLE queue(
    itemid INTEGER PRIMARY KEY,
    initialitemid INTEGER,
    is_done BOOLEAN NOT NULL DEFAULT 'f'
    );

    Insert queue items using nextval('pk');

    SELECT nextval('getpk') to get a queue record, and flag a status;
    flag is_done when the job is complete.

    When a job is detected to have died (e.g. taken too long), select a new value into itemid with nextval('pk').

    Jobs either in progress or dead will have an itemid < currval('getpk').

    Keep requesting jobs until currval('getpk') = currval('pk').

    • // start transaction
      BEGIN

      // consume item from postgres
      DELETE FROM queue …

      // publish item to queue
      rabbitmq.sendToQueue(…)

      // end transaction
      COMMIT
      ——

      If the COMMIT throws an error, I’ve already done a send to rabbitmq that I can’t undo. Any advice on how to do proper two-phase locking here to guard against this scenario?

      • You answered it yourself. Use two-phase commit in this case, since you’re doing a transaction with side-effects.

        You’d PREPARE TRANSACTION, send to rabbitmq, and COMMIT PREPARED.

        You’ll need a transaction resolver of some sort. Something that can recover from crashes by scanning pg_prepared_xacts, determining if the txn was already sent to rabbitmq or not, and committing or rolling it back as appropriate. This may require some bookkeeping on the rabbitmq side so you can look up whether a given task has already been queued. Or do 2PC there too. If both do 2PC you can always resolve transactions because you can enforce the rule that you always commit on the rabbitmq side first, so you always know if the txn was not sent to rabbitmq yet, sent but not committed in rabbitmq, or committed in rabbitmq but not Pg.

        There’s a ton of literature on linking transactional systems together, handling transactions with side effects reliably, etc. Plenty of tools too. Look up XA, Java’s JTA, Microsoft’s MSDTC, etc.
