Postgres as a workflow engine

April 12, 2026

Every time a team needs to run some background jobs, someone immediately suggests spinning up Kafka, RabbitMQ, or NATS.

Here's the thing. Unless you are actually pushing tens of thousands of messages per second, or you have a hard requirement for event sourcing, adding a dedicated queue system is usually a mistake. It’s another piece of infrastructure to monitor, another set of networking rules to debug, and another place for state to get out of sync with your primary database.

You probably just need Postgres.

Specifically, you need a mechanism that lets multiple workers safely poll the same table without running into deadlocks or stepping on each other's toes. To prove it, I built a tiny interactive sandbox that spins up a real WebAssembly-compiled PostgreSQL instance right here in your browser, running three parallel workers against it.


If you start the engine, you’ll see the workers independently polling the job table. They don’t block each other, and they don’t accidentally pick up the same row. You can even click "Crash" to simulate a worker dying mid-job.

The secret sauce here isn't a complex distributed lock. It's four words appended to a standard query: FOR UPDATE SKIP LOCKED.

The magic words

Let's say you have a jobs table with a status column. Your worker wants to pick up the next pending job. A naive approach looks like this:

```sql
-- Don't do this
BEGIN;
SELECT id FROM jobs WHERE status = 'pending' LIMIT 1;
-- If two workers run this at the same time, they get the same ID.
UPDATE jobs SET status = 'running' WHERE id = ?;
COMMIT;
```

To fix the race condition, you might try locking the row:

```sql
-- Still don't do this
BEGIN;
SELECT id FROM jobs WHERE status = 'pending' LIMIT 1 FOR UPDATE;
-- Worker 1 locks the row.
-- Worker 2 waits for Worker 1 to finish.
UPDATE jobs SET status = 'running' WHERE id = ?;
COMMIT;
```

This prevents the race condition, but it completely destroys your throughput. Worker 2 is just sitting there, twiddling its thumbs, waiting for Worker 1 to release the lock so it can fetch the next row.

This is where Postgres shines:

```sql
-- This is the way
BEGIN;
SELECT id FROM jobs WHERE status = 'pending' LIMIT 1 FOR UPDATE SKIP LOCKED;
UPDATE jobs SET status = 'running' WHERE id = ? RETURNING id;
COMMIT;
```

SKIP LOCKED does exactly what it says on the tin. If Worker 1 has locked row 1, Worker 2's query simply skips row 1 and immediately returns row 2. It grabs the lock for row 2, and goes to work. Worker 3 grabs row 3, and so on. No waiting, no race conditions, high concurrency.
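In practice you can collapse the claim into a single statement, so a worker never sits between the SELECT and the UPDATE with an open transaction. Here's a sketch; the `created_at` ordering and the `payload` column are assumptions about your schema, not requirements:

```sql
-- Claim the oldest pending job in one statement.
-- The subquery's row lock is taken with SKIP LOCKED, so concurrent
-- workers each land on a different row.
UPDATE jobs
SET status = 'running'
WHERE id = (
  SELECT id
  FROM jobs
  WHERE status = 'pending'
  ORDER BY created_at
  LIMIT 1
  FOR UPDATE SKIP LOCKED
)
RETURNING id, payload;
```

If the statement returns no row, there's nothing to do: the worker sleeps briefly and polls again.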

Handling failure

In the toy above, try clicking "Crash" on one of the workers while it's processing a job.

When a worker process dies in a real system, you have a problem: the job is marked as running, but no one is actually doing the work. It's essentially a ghost job.

A robust Postgres workflow engine handles this trivially without complex dead-letter queues. You just add a heartbeat_at timestamp column to your jobs table.

  1. When a worker locks a job, it updates heartbeat_at = NOW().
  2. Every few seconds while working, the worker updates heartbeat_at = NOW().
  3. A separate cron job (or just a regular query before fetching new jobs) looks for stranded jobs:
```sql
UPDATE jobs
SET status = 'pending'
WHERE status = 'running'
  AND heartbeat_at < NOW() - INTERVAL '5 minutes';
```

If a worker crashes, its heartbeat stops. Five minutes later, the job is automatically returned to the pending state, ready for another healthy worker to pick it up.
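One refinement worth considering: a job that reliably crashes its worker will bounce back to pending forever. A sketch of a retry cap, assuming you add an `attempts` counter column (not part of the table described above):

```sql
-- Requeue stalled jobs, but give up after three attempts.
UPDATE jobs
SET status   = CASE WHEN attempts >= 3 THEN 'failed' ELSE 'pending' END,
    attempts = attempts + 1
WHERE status = 'running'
  AND heartbeat_at < NOW() - INTERVAL '5 minutes';
```

Jobs that land in 'failed' are your dead-letter queue, and it's just another WHERE clause away.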

Closing notes

  • Put the jobs in Postgres first. SKIP LOCKED gets you a lot further than most people expect before you need a dedicated broker.
  • Polling is fine when the thing you're polling is cheap, local, and easy to reason about. A boring worker loop is usually a feature, not a bug.
  • Keep job creation in the same transaction as the state change that made the job necessary. If the write rolls back, the work should disappear with it.
  • Reach for Kafka, NATS, or RabbitMQ when you actually need their shape of power: fan-out, replay, strict stream semantics, or much higher sustained throughput.
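The transactional-enqueue point above is worth making concrete. A sketch, where the `orders` table and the `kind`/`payload` job columns are illustrative assumptions:

```sql
BEGIN;
-- The state change that makes the work necessary...
INSERT INTO orders (customer_id, total) VALUES (42, 99.00);
-- ...and the job that does the work, in the same transaction.
INSERT INTO jobs (kind, payload, status)
VALUES ('send_receipt', jsonb_build_object('customer_id', 42), 'pending');
COMMIT;
```

If the order insert rolls back, the job row vanishes with it. With an external broker you'd need an outbox pattern to get the same guarantee.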

Most teams don't have a queue problem. They have a too-many-moving-parts problem. A jobs table and a few workers won't solve everything, but it's a much better default than volunteering for another distributed system on day one.