
ADR-0007: Background-jobs runtime is pg-boss on the existing Postgres

Context

RFC-0002 compared three runtimes for scheduled + on-demand background work: BullMQ on Redis, pg-boss on the existing Postgres, and GitHub Actions cron. The current state was a code smell: bullmq ^5.76.2 and ioredis ^5.10.1 listed in package.json, zero imports in src/, and a Redis container in docker-compose.yml kept only to back that unused queue.

Workload at M1: ~24 scrape jobs/day; M2 estimate: ~120 jobs/day with future on-demand work (price-drop alerts in #14, image fetcher, LLM extraction in #21).

Decision

Use pg-boss on the existing Postgres for all scheduled and on-demand background work. Drop bullmq, ioredis, and the Redis container.

The runtime works under all three ADR-0006 hosting shapes (Cloudflare Cron Triggers + scheduled drain, Vercel Cron + drain, Hetzner systemd unit) — pg-boss is just SQL.
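
A minimal sketch of what the queue glue in src/lib/jobs.ts could look like. It assumes pg-boss's v9-style API, where boss.work hands one job at a time to the handler (v10 switched to batches); the 'scrape' queue name, the cron expression, and the scrapeRetailer body are illustrative, not decided here.

    // Hypothetical src/lib/jobs.ts; pg-boss v9-style API assumed throughout.
    import PgBoss from 'pg-boss';

    const boss = new PgBoss(process.env.DATABASE_URL!);

    // Hypothetical job body; unchanged if the glue later migrates to BullMQ.
    async function scrapeRetailer(data: unknown): Promise<void> {
      // fetch retailer pages, upsert listings keyed by (retailerId, url) ...
    }

    export async function startJobs(): Promise<void> {
      await boss.start();

      // Scheduled work: the cron state lives in Postgres, so this runs
      // unchanged under any of the ADR-0006 hosting shapes.
      await boss.schedule('scrape', '0 */6 * * *');

      // Scheduled and on-demand jobs share one worker.
      await boss.work('scrape', async (job) => {
        await scrapeRetailer(job.data);
      });
    }

    // App-triggered enqueue (e.g. a price-drop alert wanting a re-scrape).
    export async function enqueueScrape(retailerId: string, url: string): Promise<void> {
      await boss.send('scrape', { retailerId, url });
    }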

Consequences

Positive

  • No new stateful infra. We already operate Postgres; we don't add Redis as production-critical infrastructure.
  • Transactionally consistent with data writes. A scrape can UPDATE listings ... and mark its job complete in one transaction, eliminating "data written, job not marked complete" failure modes (see the sketch after this list).
  • Right-shaped for our scale. 24–120 jobs/day fits poll-based queueing trivially. Default 2s poll latency is invisible.
  • Cost. Zero. No Upstash dependency, no Redis container, one less prod secret (REDIS_URL drops from env-vars.md).
  • Migration target if scale grows — re-implementing the queue glue (src/lib/jobs.ts) for BullMQ is ~1 day of work; the job bodies don't change.
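
To make the transactional-consistency bullet concrete: pg-boss's own completion call (boss.complete) runs outside the data transaction, so a sketch of the one-transaction property drives the job row directly with SQL. The pgboss.job table and its state column come from pg-boss's default schema, but names shift between major versions, and the listings columns here are hypothetical; treat this as illustrative, not as the implementation.

    import { Pool } from 'pg';

    const pool = new Pool({ connectionString: process.env.DATABASE_URL });

    // Write scraped data and mark the job done atomically: either both land
    // or neither does, closing the "data written, job not marked complete"
    // window.
    async function upsertAndComplete(jobId: string, retailerId: string, url: string, price: number): Promise<void> {
      const client = await pool.connect();
      try {
        await client.query('BEGIN');
        // Our own table; column names are hypothetical.
        await client.query(
          'UPDATE listings SET price = $1 WHERE retailer_id = $2 AND url = $3',
          [price, retailerId, url],
        );
        // pg-boss's job table; verify names against the installed version.
        await client.query(
          "UPDATE pgboss.job SET state = 'completed' WHERE id = $1",
          [jobId],
        );
        await client.query('COMMIT');
      } catch (err) {
        await client.query('ROLLBACK');
        throw err;
      } finally {
        client.release();
      }
    }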

Negative

  • No shipped dashboard. We'll write a small /admin/jobs page reading pgboss.job (M2 polish) or live with psql queries until then (see the query sketch after this list).
  • Polling load on Postgres. Default 2s poll. Invisible at our scale; would matter at 1000s of jobs/sec.
  • Smaller ecosystem than BullMQ. Fewer Stack Overflow answers, fewer plugins. API surface is small (boss.send, boss.work, boss.schedule) — manageable.
  • Polling latency up to 2s from enqueue to start. Acceptable for daily scrapes and price-drop dispatch; would matter for a streaming pipeline (we don't have any).
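
Until the /admin/jobs page exists, the psql fallback from the first bullet can be as small as one grouped count. A sketch, assuming pg-boss's default schema (name and state are long-stable columns on pgboss.job):

    import { Pool } from 'pg';

    const pool = new Pool({ connectionString: process.env.DATABASE_URL });

    // Jobs per queue and state; roughly what an /admin/jobs page would render.
    const { rows } = await pool.query(
      'SELECT name, state, count(*)::int AS jobs FROM pgboss.job GROUP BY name, state ORDER BY name, state',
    );
    console.table(rows);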

Neutral

  • Queue contention on the same DB. At 1000s of jobs/sec this is real. Not real at 24–120/day. Migration trigger documented below.

Migration triggers — pre-recorded

Revisit (and probably switch to BullMQ) if any of these fire:

  1. Scrape volume or fanout grows by 10× (>1,200 jobs/day).
  2. A new feature wants sub-second per-listing handoff (e.g. #21 LLM-extraction-on-scrape with concurrency).
  3. A new feature wants user-triggered async work with sub-100ms latency expectations (none in M1/M2 scope).

We record these in advance so we don't drift past a threshold without noticing.

Alternatives considered

BullMQ on existing Redis — rejected for now

The pre-installed answer: mature, sub-second pub/sub latency, and @bull-board ships a UI. Loses on: it introduces Redis as production-critical infra for a workload that doesn't need it. Visibility-timeout / stalled-job semantics are subtle (they bite people who write long-running jobs without heartbeats). At 24 jobs/day this is a sledgehammer. Stays as the named migration target.

GitHub Actions scheduled workflows — rejected

Zero infra to operate; free on public repos. Loses on the deal-breaker: no queue. Price-drop alerts and image-fetcher are app-triggered, not cron-only. GH Actions also has cron drift, silent skip after 60-day inactivity, and whole-job retries only. Worth keeping as a belt-and-braces watcher (heartbeat cron alerting if no recent scrape ran) — separate ticket, not primary runtime.

Decisions on the open questions from RFC-0002

  • Semantics for scrape jobs: at-least-once. Listing upsert is keyed by (retailerId, url) (per Prisma models), so duplicated scrapes are idempotent at the DB level (see the upsert sketch after this list).
  • Job state retention: pg-boss defaults — 7 days for completed, 14 for failed. Tunable later.
  • Admin dashboard timing: M2 polish. Until then, psql queries against pgboss.job are documented in a runbook (filed as a separate ticket).
  • Drop deps timing: in #18's implementation PR — keeps the audit + RFCs on this commit, code change on its own commit.
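
The at-least-once choice leans entirely on that DB-level idempotency, so here is the shape of the upsert. A sketch assuming a Prisma Listing model with @@unique([retailerId, url]) (Prisma names the compound-key input retailerId_url by default); fields beyond the key are hypothetical.

    import { PrismaClient } from '@prisma/client';

    const prisma = new PrismaClient();

    // A duplicated scrape job re-applies the same row instead of inserting a
    // second one, which is what makes at-least-once delivery safe here.
    async function saveListing(retailerId: string, url: string, price: number, title: string): Promise<void> {
      await prisma.listing.upsert({
        where: { retailerId_url: { retailerId, url } },
        update: { price, title, scrapedAt: new Date() },   // scrapedAt is hypothetical
        create: { retailerId, url, price, title, scrapedAt: new Date() },
      });
    }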
