Testing Background Jobs With Realistic Retry Behavior

Background job tests should cover duplicate delivery, partial failure, timeout recovery, and observable retry state.

Background jobs rarely fail like demos fail. A demo job either works or throws. A production job sends the email, then crashes before marking the row complete. A webhook arrives twice. A payment provider times out after accepting the charge. A retry runs after the customer changed the underlying record.

Retry tests should treat those conditions as ordinary, not exotic.

Test the queue contract, not the queue brand

Most teams do not need to reimplement their queue in tests. They need to test the assumptions their handler makes:

- the same job may run more than once
- attempts may be separated by minutes or hours
- downstream services may complete work and still time out
- job input may be stale by the next attempt
- operators need enough state to decide whether manual retry is safe

Those rules apply whether the queue is BullMQ, Sidekiq, SQS, Cloud Tasks, or a database table.

Run the same job twice

Start with idempotency. A side-effecting job should survive replay with the same input.

it("does not send the receipt email twice when the job is retried", async () => {
  const order = await createPaidOrder();

  await sendReceiptJob({ orderId: order.id });
  await sendReceiptJob({ orderId: order.id });

  expect(emailClient.sentMessages()).toHaveLength(1);
  expect(await receiptEventsFor(order.id)).toHaveLength(1);
});

The handler needs a stable operation key derived from the job input, such as receipt:${orderId}. A random key generated inside the worker changes on every attempt, so a retry can never recognize work that an earlier attempt already finished.
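One way to make that test pass is sketched below. It is a minimal illustration, not a prescribed design: FakeEmailClient, CompletedOperations, and makeSendReceiptJob are hypothetical names, the email field on the job payload is an assumption, and the in-memory set stands in for whatever durable store (for example a table with a unique constraint on the key) a real handler would use.

```typescript
// Hypothetical in-memory stand-ins for an email client and a dedup store.
class FakeEmailClient {
  private readonly messages: { to: string; operationKey: string }[] = [];

  async send(to: string, operationKey: string): Promise<void> {
    this.messages.push({ to, operationKey });
  }

  sentMessages(): { to: string; operationKey: string }[] {
    return [...this.messages];
  }
}

class CompletedOperations {
  private readonly keys = new Set<string>();

  has(operationKey: string): boolean {
    return this.keys.has(operationKey);
  }

  markDone(operationKey: string): void {
    this.keys.add(operationKey);
  }
}

function makeSendReceiptJob(
  emailClient: FakeEmailClient,
  completed: CompletedOperations
) {
  return async function sendReceiptJob(job: {
    orderId: string;
    email: string; // assumed payload field, for illustration only
  }): Promise<void> {
    // Derived from the input only, so every attempt computes the same key.
    const operationKey = `receipt:${job.orderId}`;

    if (completed.has(operationKey)) {
      return; // an earlier attempt already finished this operation
    }

    await emailClient.send(job.email, operationKey);
    completed.markDone(operationKey);
  };
}
```

This check-then-act shape covers replay after a completed run. It does not close the window where the email is sent and the worker crashes before markDone; closing that window usually needs the email provider itself to deduplicate on the same key.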

Force failure after the side effect

The most valuable test is often “side effect succeeded, bookkeeping failed.”

it("reuses an uploaded export after metadata write fails", async () => {
  storage.failAfterUploadOnce();

  await expect(exportReportJob({ reportId: "r_123" })).rejects.toThrow(
    "metadata write failed"
  );

  await exportReportJob({ reportId: "r_123" });

  expect(storage.uploadsFor("r_123")).toHaveLength(1);
  expect(await reportStatus("r_123")).toBe("ready");
});

Small fakes are enough. The fake storage client records uploads and can throw after a specific step. That gives you a repeatable version of the failure that usually appears only in logs.
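A fake along those lines might look like the sketch below. The class shape, the writeMetadata and hasMetadata methods, and the exportReport function are assumptions made for illustration; only failAfterUploadOnce, uploadsFor, and the error message come from the test above.

```typescript
// Hypothetical fake: records uploads and can fail the step after an upload.
class FakeStorage {
  private readonly uploads = new Map<string, string[]>();
  private readonly metadata = new Set<string>();
  private failNextMetadataWrite = false;

  // Arms a single failure: the next metadata write throws, later ones succeed.
  failAfterUploadOnce(): void {
    this.failNextMetadataWrite = true;
  }

  async upload(reportId: string, body: string): Promise<void> {
    const existing = this.uploads.get(reportId) ?? [];
    this.uploads.set(reportId, [...existing, body]);
  }

  async writeMetadata(reportId: string): Promise<void> {
    if (this.failNextMetadataWrite) {
      this.failNextMetadataWrite = false;
      throw new Error("metadata write failed");
    }
    this.metadata.add(reportId);
  }

  uploadsFor(reportId: string): string[] {
    return this.uploads.get(reportId) ?? [];
  }

  hasMetadata(reportId: string): boolean {
    return this.metadata.has(reportId);
  }
}

// Assumed handler shape: skip the upload when an earlier attempt already did it.
async function exportReport(
  storage: FakeStorage,
  reportId: string
): Promise<void> {
  if (storage.uploadsFor(reportId).length === 0) {
    await storage.upload(reportId, `report body for ${reportId}`);
  }
  await storage.writeMetadata(reportId);
}
```

The one-shot flag keeps the failure deterministic: the first attempt always fails at the same step, and the retry always gets a clean run.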

Treat timeouts as unknown, not failed

A timeout means the worker did not receive an answer. It does not mean the downstream system did nothing. For payments, provisioning, email sends, or third-party writes, the retry path should usually query by idempotency key before doing the side effect again.

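async function provisionAccountJob({ accountId }) {
  // Example key format; it must be identical on every attempt.
  const operationKey = `provision:${accountId}`;

  // Ask first: a timed-out attempt may still have created the account.
  const existing = await provider.findByIdempotencyKey(operationKey);
  if (existing) {
    return markProvisioned(accountId, existing.externalId);
  }

  const created = await provider.createAccount({ accountId, operationKey });
  return markProvisioned(accountId, created.externalId);
}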

The test should simulate the ambiguous case: the provider accepts the request, the client times out, and the retry observes the external record.
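A fake provider can reproduce that ambiguity by storing the record and then throwing. The sketch below assumes a hypothetical FakeProvider with a timeOutAfterAcceptingOnce switch and a provision:${accountId} key format; none of these names belong to a real provider SDK.

```typescript
type ExternalAccount = { externalId: string; operationKey: string };

// Hypothetical fake: accepts the request, then pretends the response was lost.
class FakeProvider {
  private readonly accounts = new Map<string, ExternalAccount>();
  private timeOutNextCreate = false;
  createCalls = 0;

  timeOutAfterAcceptingOnce(): void {
    this.timeOutNextCreate = true;
  }

  async findByIdempotencyKey(
    operationKey: string
  ): Promise<ExternalAccount | undefined> {
    return this.accounts.get(operationKey);
  }

  async createAccount(input: {
    accountId: string;
    operationKey: string;
  }): Promise<ExternalAccount> {
    this.createCalls += 1;
    const account: ExternalAccount = {
      externalId: `ext_${this.accounts.size + 1}`,
      operationKey: input.operationKey,
    };
    // The write lands before the simulated timeout, which is the ambiguous case.
    this.accounts.set(input.operationKey, account);
    if (this.timeOutNextCreate) {
      this.timeOutNextCreate = false;
      throw new Error("request timed out");
    }
    return account;
  }
}

// Assumed handler: look up by key before creating; record the external id.
async function provisionAccount(
  provider: FakeProvider,
  provisioned: Map<string, string>,
  accountId: string
): Promise<void> {
  const operationKey = `provision:${accountId}`;
  const existing = await provider.findByIdempotencyKey(operationKey);
  if (existing) {
    provisioned.set(accountId, existing.externalId);
    return;
  }
  const created = await provider.createAccount({ accountId, operationKey });
  provisioned.set(accountId, created.externalId);
}
```

A test built on this fake would run the handler twice, expect the first attempt to reject, and then assert that createCalls is still 1 and that the account is marked provisioned with the id from the first, timed-out request.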

Assert observable state

A retry system that only logs “failed” is hard to operate. Tests can verify that attempts and outcomes are recorded:

{
  "jobId": "job_48291",
  "operationKey": "receipt:ord_91",
  "accountId": "acct_7",
  "attempt": 2,
  "state": "retrying",
  "lastError": "SMTP timeout",
  "safeToRetry": true
}

Operators need to know which attempt failed, what side effect was attempted, and whether manual retry might duplicate work.
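To make a record like that assertable, the attempt bookkeeping can live in a small function that a test calls directly. The sketch below is one possible shape under stated assumptions: the JobAttemptRecord type, the "exhausted" state, the runAttempt helper, and the maxAttempts parameter are all hypothetical, and safeToRetry is simply carried through from whoever created the record rather than computed here.

```typescript
type JobState = "retrying" | "succeeded" | "exhausted";

interface JobAttemptRecord {
  jobId: string;
  operationKey: string;
  attempt: number; // number of the most recent attempt
  state: JobState;
  lastError: string | null;
  safeToRetry: boolean; // assumed to be set by the job's author, not derived
}

// Runs one attempt and returns the record an operator would see afterwards.
async function runAttempt(
  previous: JobAttemptRecord,
  handler: () => Promise<void>,
  maxAttempts: number
): Promise<JobAttemptRecord> {
  const attempt = previous.attempt + 1;
  try {
    await handler();
    return { ...previous, attempt, state: "succeeded", lastError: null };
  } catch (error) {
    const message = error instanceof Error ? error.message : String(error);
    return {
      ...previous,
      attempt,
      state: attempt >= maxAttempts ? "exhausted" : "retrying",
      lastError: message,
    };
  }
}
```

A test can then feed in a handler that throws "SMTP timeout" and assert on the returned attempt, state, and lastError fields instead of scraping logs.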

Keep the handler boring

The easiest job handlers to test are coordinators. They load state, call a domain operation, record the outcome, and exit. When business logic is buried inside queue glue, testing it means standing up the whole queue; when it lives in a plain function, a unit test can call it directly.
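As a rough illustration of that split, the sketch below separates a pure decision function from a thin handler. Every name in it (Order, decideReceipt, ReceiptDeps, sendReceiptHandler) is hypothetical, and the rule that refunded orders skip the receipt is an assumed example of stale input, not a rule from any particular system.

```typescript
type Order = { id: string; status: "paid" | "refunded"; email: string };

type ReceiptDecision =
  | { kind: "send"; to: string }
  | { kind: "skip"; reason: string };

// Domain operation: pure, no queue and no I/O, testable with plain values.
function decideReceipt(order: Order): ReceiptDecision {
  if (order.status !== "paid") {
    return { kind: "skip", reason: `order is ${order.status}` };
  }
  return { kind: "send", to: order.email };
}

interface ReceiptDeps {
  loadOrder(orderId: string): Promise<Order>;
  sendEmail(to: string, operationKey: string): Promise<void>;
  recordOutcome(operationKey: string, outcome: string): Promise<void>;
}

// Coordinator: load state, call the domain operation, record the outcome, exit.
async function sendReceiptHandler(
  deps: ReceiptDeps,
  job: { orderId: string }
): Promise<void> {
  const operationKey = `receipt:${job.orderId}`;

  // Load at attempt time; the order may have changed since the job was queued.
  const order = await deps.loadOrder(job.orderId);
  const decision = decideReceipt(order);

  if (decision.kind === "send") {
    await deps.sendEmail(decision.to, operationKey);
    await deps.recordOutcome(operationKey, "sent");
  } else {
    await deps.recordOutcome(operationKey, `skipped: ${decision.reason}`);
  }
}
```

With this shape, most business rules get covered by calling decideReceipt with plain objects, and the handler needs only a few tests with fake dependencies to confirm the wiring.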

Realistic retry tests are not about pessimism. They make the normal production edges visible before the first incident review.