Atomic payment confirmation: closing a race condition in production

How a four-way write contention against a single orders row stranded a customer's USDT on-chain, and the SELECT FOR UPDATE pattern that fixed it. From the LeoHydra repo, migration 031.

When I launched the Leo Hydra Studio commerce site in late 2025, I had one persistent worry. USDT lands on-chain, and on-chain is immutable. If our application logic isn't airtight around payment confirmation, money can get stranded. Paid but un-credited, or canceled but already paid. The wallet has the funds. The database thinks the order is dead. The customer is upset. There's no automatic path to reconcile.

Six months and three security audit rounds later, that fear materialized exactly once. It produced one of the more interesting race conditions I've debugged in a while. This post walks through how we found it and how we fixed it. The fix is migration 031_confirm_payment_atomic_lock_and_buffer.sql in the repo, if you're tracking along.

The setup

LeoHydra accepts two checkout paths. Bank transfer (customer initiates, admin confirms manually in the panel), and USDT-on-Polygon (customer sends USDT to a hot wallet, then a Vercel Cron scanner walks Transfer events on the receiving wallet every two minutes, matches them against initiated payments rows by amount, and calls confirm_payment_atomic(...) when there's a match).

When confirm_payment_atomic is invoked, it does three writes:

  1. UPDATE payments SET status = 'confirmed' WHERE id = $1 AND status = 'initiated'
  2. INSERT INTO crypto_payments (...) VALUES (...)
  3. UPDATE orders SET status = 'paid' WHERE id = $2 AND status = 'pending_payment'

Each write has a status guard. If either row has already moved on, the function refuses to act on it. That's idempotent by design.

But there's a problem.

The race

Three other paths update the same orders row, and none of them touch payments first.

The customer-facing status poll (/api/orders/[id]/status) has an auto-expire branch. If the order is past its deadline, the poll flips status to 'expired'. The admin panel has a cancel endpoint that does the same on demand for bank orders. And pg_cron runs expire_orders_and_payments() once a minute, which cascades stale 'pending_payment' orders to 'expired'.

None of these writers read or update payments, so the status guard on the payments row inside confirm_payment_atomic doesn't protect against them. They just take the orders row and run.

Here's the timing that produced the stranded transfer:

T = 0      A user sends USDT on Polygon.
T = 1 s    The customer-facing status poll fires (third tab? phone reload?).
           Poll sees expires_at + buffer < now() and triggers auto-expire.
           Executes:
             UPDATE orders SET status='expired'
              WHERE id=$1 AND status='pending_payment'
           UPDATE succeeds. orders.status = 'expired'.

T = 2 s    Polygon block containing the user's transfer is mined.

T = 8 s    The Cron scanner runs. Sees the new Transfer event,
           matches it to an initiated payment row, calls
           confirm_payment_atomic().
              Step 1: UPDATE payments WHERE status='initiated'
                      → 1 row updated   ✓
              Step 2: INSERT INTO crypto_payments
                      → success         ✓
              Step 3: UPDATE orders WHERE status='pending_payment'
                      → 0 rows updated  ✗
              RAISES EXCEPTION. Whole transaction rolls back.

Net result:
  • The crypto_payments row is gone (rollback).
  • The payment is back to 'initiated' for a future retry.
  • The on-chain transfer? Still happened. USDT is in our hot wallet.
  • The order? Still 'expired'.
  • The customer's funds are stranded. Reconciliation is manual.

The window is small. A few hundred milliseconds in most cases. On the open internet, that means it fires occasionally, especially when something else is slow. When it fires, it strands money, which is the worst possible failure mode for a payment system.

The fix

The standard SQL move for this shape of problem is to take a pessimistic row lock on the order before doing any work, and structure the lock so all the conflicting writers serialize through it.

CREATE OR REPLACE FUNCTION confirm_payment_atomic(...)
RETURNS void
LANGUAGE plpgsql SECURITY DEFINER
AS $$
DECLARE
  v_order_id UUID;
BEGIN
  -- 0) Resolve the order from the payment, then SELECT … FOR UPDATE.
  -- This takes a row-level pessimistic lock on the orders row that's
  -- held until the transaction commits or rolls back. Any other writer
  -- against the same row blocks here.
  SELECT order_id INTO v_order_id FROM public.payments WHERE id = p_payment_id;
  IF v_order_id IS NULL THEN
    RAISE EXCEPTION 'Payment not found';
  END IF;
  PERFORM 1 FROM public.orders WHERE id = v_order_id FOR UPDATE;
 
  -- 1) Update payment (status-guarded, idempotent)
  UPDATE public.payments
  SET status = 'confirmed', confirmed_at = now()
  WHERE id = p_payment_id AND status = 'initiated';
 
  -- 2) Insert crypto_payments row
  INSERT INTO public.crypto_payments (...) VALUES (...);
 
  -- 3) Update order (also status-guarded; we hold the lock, no contenders)
  UPDATE public.orders
  SET status = 'paid', paid_at = now()
  WHERE id = v_order_id AND status = 'pending_payment';
 
  ...
END;
$$;

The new line is PERFORM 1 FROM public.orders WHERE id = v_order_id FOR UPDATE;. Once that runs, any other transaction trying to update the same orders row (auto-expire, admin cancel, cron expire) blocks at its own write until our transaction commits. On commit, those queued writers wake up and re-evaluate their WHERE status = 'pending_payment' clauses against the post-commit state. They find 'paid', match zero rows, and no-op.

The on-chain payment is credited. The crypto_payments row is recorded. The order is marked paid. Atomically.

If we acquire the lock but the order was already in a terminal state before we arrived (an admin canceled it 200 ms before our function fired, say), Step 3's status-guarded UPDATE matches zero rows. The function raises, the whole transaction rolls back, and the on-chain transfer sits in the hot wallet for manual reconciliation. That's the right outcome. We don't pretend a stranded transfer didn't happen. We surface it for human triage instead of swallowing it silently.

Aligning the deadline buffer

There was one more wrinkle. lib/verify-payment.ts has a 60-second grace window, named EXPIRY_BLOCK_TIMESTAMP_BUFFER_MS = 60_000, to absorb Polygon block-time variance and modest clock drift. A transfer whose block timestamp lands within that buffer should still be accepted.

But expire_orders_and_payments() was checking strictly: expires_at < now(). So the cron could flip an order to 'expired' during the 60-second buffer, breaking the same race in reverse. A transfer that verify-payment would have accepted gets pre-empted by an over-eager cron, and we're back to stranded funds.

Fix: align the buffer everywhere.

CREATE OR REPLACE FUNCTION expire_orders_and_payments()
RETURNS void LANGUAGE plpgsql AS $$
BEGIN
  UPDATE public.payments
  SET status = 'expired'
  WHERE status = 'initiated'
    AND order_id IN (
      SELECT id FROM public.orders
       WHERE status = 'pending_payment'
         AND expires_at + interval '60 seconds' < now()
    );
 
  UPDATE public.orders
  SET status = 'expired'
  WHERE status = 'pending_payment'
    AND expires_at + interval '60 seconds' < now();
END;
$$;

Now all three deadline-enforcement sites agree on when an order is dead: verify-payment, the status-route auto-expire, and the cron. No more disagreement. The same constant governs all three.

The takeaway

The race here wasn't subtle in retrospect. Four writers, one row, no coordination. Of course there's a race. But it's the kind of race that's invisible during normal testing because you'd never hit the millisecond-level timing manually. You need either chaos engineering, a stranded transfer in production, or someone deliberately drawing all the writers on paper to find it.

Three lessons I'm carrying forward.

First, idempotent status guards aren't enough on their own. If multiple transactions write to the same row, you need a lock at the entry of the critical section. Status guards are a great second line of defense, but they don't serialize against writers that don't read the row before writing.

Second, all deadline checks must agree. If verify-payment has a 60-second grace, every cron, trigger, and route that enforces the same deadline needs the same grace. Otherwise you have an off-by-one (or off-by-60-seconds) bug waiting.

Third, audit any place where you have multiple writers to the same row. I found this by drawing the writers on a napkin: confirm-payment, status-route auto-expire, admin cancel, cron expire. Four writers, one row, no coordination. The napkin was the bug.

The whole fix is migration 031 in the LeoHydra codebase. Six lines of net new code, twenty lines of comment explaining why. The cheapest fixes are the ones that look obvious in retrospect.