
RFC-0002: Background-jobs runtime

Decision recorded. pg-boss on the existing Postgres. Drops bullmq, ioredis, the Redis container. See ADR-0007 for the locked decision; this RFC remains as the comparative analysis that produced it.

Summary

961tech has scrape jobs, future price-drop alerts, and a future image fetcher. Today, scrapes run by hand via npm run scrape. We need a runtime for the scheduled and on-demand background work. Recommendation: pg-boss on the existing Postgres — no new infrastructure, transactionally consistent with the data scrapes operate on, sized correctly for the 24–120 jobs/day workload. Drop bullmq and ioredis from package.json once this RFC lands. BullMQ remains the right answer if scrape volume or fanout grows by 10× (revisit at M3+). GitHub Actions cron is cheaper still but has no answer for the non-cron work in the roadmap.

Motivation

The current state is a code smell:

  • package.json lists bullmq ^5.76.2 + ioredis ^5.10.1 (caret major).
  • src/ has zero imports of either.
  • docker-compose.yml already runs Redis 7-alpine (only ever for queue use).
  • #18 "implement scraper queue" is unscheduled.
  • Architecture → Deployment § Scraper workers names BullMQ as the leading candidate, but no decision is recorded.

The deps were a decision-by-reflex: "we need a queue → BullMQ is the queue → install it." That reflex deserves a second look at this scale (~24 jobs/day at M1, ~120/day at M2), because adding Redis as production-critical infrastructure is a non-trivial ops decision and we already have a battle-tested transactional store sitting right there.

The roadmap also includes non-cron work that's coming up:

  • Price-drop alert dispatch (#10 / M2) — triggered when ListingPrice insert detects a drop ≥ threshold.
  • Image fetcher — currently scrapers grab image URLs only; eventually we fetch the bytes and push them to a CDN. Triggered per-product on first sight.
  • LLM-assisted spec extraction (#21) — async per-listing.

A pure-cron solution (GitHub Actions) covers scrapes but doesn't cover any of these.

Proposal

Recommendation: pg-boss on the existing Postgres 17

Why pg-boss, not BullMQ.

| Dimension | pg-boss | BullMQ |
| --- | --- | --- |
| New infra to operate | None — adds 10ish tables in a pgboss schema on the Postgres we already run | Redis as critical infra (or Upstash REST + a Workers-compatible client if RFC-0001 lands on Cloudflare) |
| Transactional consistency with our data | Yes — a job can UPDATE listings ... and mark itself complete in one transaction | No — jobs ack in Redis, data writes go to Postgres separately |
| Cron / delayed / one-off / singleton | All four (boss.schedule, startAfter, singletonKey) | All four |
| Retries with backoff | First-class (retryLimit, retryDelay, retryBackoff: true) | First-class |
| Dead-letter | First-class (deadLetter: 'queue-name') | De facto via the failed set |
| Concurrency control per queue | teamSize × teamConcurrency | Worker(name, fn, { concurrency }) |
| Dashboard | None shipped — write a small admin page (or query pgboss.job directly) | @bull-board/api (Express/Fastify/Next adapter) |
| Polling vs pub/sub | Polling (default 2s newJobCheckInterval — tunable) | Redis pub/sub — sub-second latency |
| Latency from enqueue to start | Up to the poll interval | Sub-second |
| Observability story | SQL queries against pgboss.job — composable with the rest of our DB | @bull-board UI; metrics over events |

At our scale, pg-boss's downsides — smaller ecosystem, polling latency, no shipped dashboard — don't bite. The upside (no new stateful infra, transactional consistency, fewer failure modes) is real every day.
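For concreteness, a minimal sketch of the pg-boss calls behind those table rows. Option names shift slightly between majors (v10, for instance, wants an explicit createQueue before the first send), so treat this as a shape to validate against the version we pin; queue names and payloads are illustrative, not decided:

```ts
import PgBoss from 'pg-boss';

// One boss over the existing DATABASE_URL; start() creates the
// pgboss schema and its tables if they don't exist yet.
const boss = new PgBoss(process.env.DATABASE_URL!);
await boss.start();

// Cron: register a daily schedule (exact schedules are owned by #18).
await boss.schedule('scrape', '0 3 * * *');

// One-off with first-class retries, backoff, and a dead-letter queue.
await boss.send('scrape', { retailer: 'example', category: 'cpu' }, {
  retryLimit: 3,
  retryDelay: 60,            // seconds before the first retry
  retryBackoff: true,        // exponential from retryDelay
  deadLetter: 'scrape-dead', // where the job lands after retries exhaust
});

// Delayed + singleton: at most one pending alert per listing.
await boss.send('price-drop-alert', { listingId: 42 }, {
  startAfter: 30,             // seconds from now
  singletonKey: 'listing-42', // dedupe key while one is queued
});
```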

Data and code shape

```mermaid
sequenceDiagram
    participant Cron as Schedule (pg-boss cron)
    participant Worker as Worker process
    participant PG as Postgres
    participant Retailer as Retailer

    Cron->>PG: Insert job row in pgboss.job
    Worker->>PG: poll for ready job (fetchNextJob)
    PG-->>Worker: job for retailer X, category CPU
    Worker->>Retailer: fetch listings
    Retailer-->>Worker: HTML
    Worker->>PG: BEGIN; upsert listings; UPDATE pgboss.job state=completed; COMMIT
```

A single transaction at the end makes "data written, job not marked complete" impossible.
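pg-boss doesn't expose "complete this job inside my transaction" as a first-class API, so the final step in the diagram means going at pgboss.job with plain SQL. A sketch of that shape, assuming pg-boss's default table and column names (verify against the pinned version, and note that mixing this with boss.work's own auto-completion needs care); the listings columns are placeholders for the real Prisma schema:

```ts
import { Pool } from 'pg';

const pool = new Pool({ connectionString: process.env.DATABASE_URL });

// Hypothetical handler body: listings upserts and job completion commit together.
async function completeScrapeInOneTx(
  jobId: string,
  listings: { url: string; priceCents: number }[],
) {
  const client = await pool.connect();
  try {
    await client.query('BEGIN');
    for (const l of listings) {
      // Idempotent upsert keyed by (retailerId, url); see Open questions #1.
      await client.query(
        `INSERT INTO listings (retailer_id, url, price_cents)
         VALUES ($1, $2, $3)
         ON CONFLICT (retailer_id, url)
         DO UPDATE SET price_cents = EXCLUDED.price_cents`,
        [1, l.url, l.priceCents],
      );
    }
    // Mark the pg-boss job done in the same transaction.
    await client.query(
      `UPDATE pgboss.job SET state = 'completed', completedon = now() WHERE id = $1`,
      [jobId],
    );
    await client.query('COMMIT');
  } catch (err) {
    await client.query('ROLLBACK');
    throw err; // let pg-boss retry per the queue's retry policy
  } finally {
    client.release();
  }
}
```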

Worker process model

The worker is a separate long-running process from the Next.js server (tsx scripts/worker.ts), invoked under whatever process supervisor the chosen host provides:

  • If RFC-0001 lands on Vercel: worker runs on a tiny external host (Railway/Fly/$3 VPS), or — at our scale — Vercel Cron triggers a route handler that drains N jobs per invocation (see the sketch after this list).
  • If RFC-0001 lands on Cloudflare: Cloudflare Cron Triggers + a scheduled Worker handler that drains N jobs from pgboss.job per invocation. No long-running consumer at all. Polling latency is whatever the cron cadence is.
  • If RFC-0001 lands on Hetzner: systemd unit running tsx scripts/worker.ts 24/7, restart-on-failure. Standard.
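What "drains N jobs per invocation" could look like as a route handler, sketched against pg-boss's fetch/complete/fail calls. Argument shapes differ between pg-boss 9 and 10 (this uses the v10-style queue-name-first form), and runScrape plus the route path are placeholders:

```ts
// src/app/api/jobs/drain/route.ts (hypothetical path)
import { boss } from '@/lib/jobs';      // the shared instance from src/lib/jobs.ts
import { runScrape } from '@/scrapers'; // placeholder for the real job body

export async function GET() {
  // Pull up to 10 ready jobs without a long-running subscription.
  const jobs = (await boss.fetch('scrape', { batchSize: 10 })) ?? [];
  for (const job of jobs) {
    try {
      await runScrape(job.data);
      await boss.complete('scrape', job.id);
    } catch (err) {
      await boss.fail('scrape', job.id, err);
    }
  }
  return Response.json({ drained: jobs.length });
}
```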

pg-boss happens to be friendly to all three shapes — it's just SQL.

What we drop

  • bullmq ^5.76.2 from package.json
  • ioredis ^5.10.1 from package.json
  • The Redis container from docker-compose.yml (unless RFC-0001 needs Redis for a non-queue reason — it doesn't today)

The REDIS_URL env var (env-vars.md § REDIS_URL) drops with them. One less prod secret to manage.

What we add

  • pg-boss as a runtime dep
  • A src/lib/jobs.ts module that owns the boss instance, schedule registration, and queue handler registration
  • scripts/worker.ts as the long-running entry point (both files are sketched after this list)
  • A small src/app/admin/jobs/page.tsx (or similar) to read job state from pgboss.job (M2 polish; not blocking)
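One possible skeleton for those two files; everything beyond the names this RFC already fixes (src/lib/jobs.ts, scripts/worker.ts) is a placeholder:

```ts
// src/lib/jobs.ts: owns the boss singleton, schedule registration, and
// queue handler registration.
import PgBoss from 'pg-boss';

export const boss = new PgBoss(process.env.DATABASE_URL!);

export async function startJobs() {
  await boss.start();

  // Schedule registration (cron syntax and cadence are owned by #18).
  await boss.schedule('scrape-fanout', '0 3 * * *');

  // Queue handler registration. Handler signatures differ by major
  // version (v10 hands the worker an array of jobs).
  await boss.work('scrape-fanout', async () => {
    // enqueue one scrape:<retailer>:<category> job per pair
  });
}
```

And the entry point that hosts it:

```ts
// scripts/worker.ts: long-running entry, graceful SIGTERM.
import { boss, startJobs } from '../src/lib/jobs';

await startJobs();

// Graceful shutdown: stop polling and let in-flight handlers finish.
process.on('SIGTERM', async () => {
  await boss.stop();
  process.exit(0);
});
```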

Trade-offs

| Cost | What it buys |
| --- | --- |
| Polling load on Postgres. Default 2s poll; invisible at our scale, would matter at 1000s of jobs/sec. | No second stateful infra to operate. |
| No shipped dashboard. Have to write a small admin page, or live with SQL queries against pgboss.job (example below). | One less thing to operate / authenticate / patch. |
| Smaller ecosystem. Fewer Stack Overflow answers, fewer plugins. | API surface is small (boss.send, boss.work, boss.schedule) — manageable solo. |
| Polling latency. Up to 2s from enqueue to start. | Daily scrapes don't notice; user-triggered work (alert dispatch) tolerates 2s easily. |
| Queue contention on the same DB. Real at 1000s of jobs/sec; not at 24–120/day. | Revisit if alert dispatch volume changes the picture. |
| Migration cost if scale demands BullMQ later. Rewriting the queue glue (~src/lib/jobs.ts) — maybe a day's work. | The job bodies don't change. Keeps BullMQ as a real fallback, not a stranded option. |
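The ad-hoc "dashboard" until that admin page exists is just SQL. A sketch of the rollup query, assuming pg-boss's default column names (completed rows eventually move to pgboss.archive under retention, so a real query may need to union both tables):

```ts
import { Pool } from 'pg';

const pool = new Pool({ connectionString: process.env.DATABASE_URL });

// State rollup per queue: the SQL-first observability story from the table.
const { rows } = await pool.query(`
  SELECT name, state, count(*) AS jobs, max(completedon) AS last_completed
  FROM pgboss.job
  GROUP BY name, state
  ORDER BY name, state
`);
console.table(rows);
```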

Alternatives

BullMQ on the existing Redis

The pre-installed answer. Mature, the obvious "queue in Node" solution. Web UI via @bull-board/api. Sub-second latency.

  • Operational fit: producer in route handlers, consumer in a separate process. During deploys, BullMQ workers shut down gracefully on SIGTERM; in-flight jobs finish or are re-queued via stalled-job recovery (Redis visibility-timeout model). A sketch follows this list.
  • Cost: marginal. Redis container exists. Memory for ~120 jobs/day is kilobytes.
  • Where it loses: runs Redis as production-critical infra for a workload that doesn't need it. Stalled-job semantics are subtle (visibility-timeout model bites people who write long-running jobs without heartbeats). At 24 jobs/day this is a sledgehammer.
  • When this wins: scrape volume or fanout grows by 10×, or when #21 LLM extraction wants per-listing sub-second handoff and concurrency that pg-boss polling can't keep up with.
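For comparison, the BullMQ worker glue this RFC declines, roughly what the "day's work" rewrite in the trade-offs table would produce (queue name and connection details are illustrative):

```ts
import { Worker } from 'bullmq';

// Consumer process: per-queue concurrency, graceful shutdown on deploy.
const worker = new Worker(
  'scrape',
  async (job) => {
    // same job body as the pg-boss version; only the glue changes
  },
  { connection: { host: 'localhost', port: 6379 }, concurrency: 4 },
);

process.on('SIGTERM', async () => {
  await worker.close(); // waits for in-flight jobs before resolving
  process.exit(0);
});
```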

Migration from pg-boss to BullMQ is a bounded rewrite if it turns out we need it. Migration the other way is rare.

GitHub Actions scheduled workflows

.github/workflows/scrape.yml with schedule: cron: '0 3 * * *' running npm ci && npm run scrape. Zero infra to operate. Free on a public repo; ~$30/mo at M2 volume on a private repo.

  • Where it loses (the deal-breaker): no queue. No way to enqueue work dynamically from the app. Price-drop alert dispatch and image fetcher are app-triggered work — that's a queue, not a cron.
  • Other footguns: cron drifts 5–15 min under GH load; schedules are silently disabled after 60 days without repo activity; no in-flight visibility beyond the Actions log; whole-job retries only (can't retry per-listing).
  • Where it earns its keep: as a belt-and-braces watchdog alongside pg-boss — a daily GH Actions cron that hits a /api/jobs/heartbeat endpoint and alerts if no recent scrape ran (sketched below). That's a future "watcher" ticket, not a primary runtime.
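A sketch of what that heartbeat endpoint could return. The 26-hour window, the query, and everything beyond the route path the bullet names are assumptions, and since completed jobs archive out of pgboss.job, a production version likely checks pgboss.archive too:

```ts
// src/app/api/jobs/heartbeat/route.ts (hypothetical)
import { Pool } from 'pg';

const pool = new Pool({ connectionString: process.env.DATABASE_URL });

export async function GET() {
  // Most recent completed scrape still sitting in pgboss.job.
  const { rows } = await pool.query(`
    SELECT max(completedon) AS last
    FROM pgboss.job
    WHERE name = 'scrape' AND state = 'completed'
  `);
  const last: Date | null = rows[0]?.last ?? null;
  const fresh = last !== null && Date.now() - last.getTime() < 26 * 60 * 60 * 1000;
  // The GH Actions cron curls this; a non-200 fails the workflow and alerts.
  return Response.json({ last, fresh }, { status: fresh ? 200 : 503 });
}
```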

Open questions

  1. Semantics for scrape jobs — at-least-once or exactly-once? Recommendation: at-least-once is fine. Listing upsert is keyed by (retailerId, url) (see Prisma models § Listing), so a duplicated scrape is idempotent at the DB level (a sketch follows this list). Confirm with MASTER.
  2. Job state retention. How long do we keep completed/failed job rows? pg-boss defaults to 7 days for completed, 14 for failed. Worth confirming — at 120 jobs/day the table stays tiny anyway.
  3. Admin dashboard timing. The "small admin page reading pgboss.job" is M2 polish — is that acceptable, or does MASTER want job visibility from day one? If day one, we use psql queries documented in a runbook until the page exists.
  4. Drop deps now or after the implementation ticket? Recommendation: drop in #18's implementation PR, not in this RFC's commit. Keeps the audit and the RFC on the same branch they live on now.
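What the idempotency claim in question 1 rests on, as a Prisma sketch. It assumes a compound unique on (retailerId, url) with Prisma's default retailerId_url name; the title field and values are placeholders:

```ts
import { PrismaClient } from '@prisma/client';

const prisma = new PrismaClient();

// A redelivered scrape job replays the same upsert: the compound unique
// key turns the second delivery into a harmless update, not a duplicate row.
await prisma.listing.upsert({
  where: { retailerId_url: { retailerId: 1, url: 'https://retailer.example/cpu/123' } },
  create: { retailerId: 1, url: 'https://retailer.example/cpu/123', title: 'Example CPU' },
  update: { title: 'Example CPU' },
});
```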

Implementation plan

Once MASTER picks pg-boss, the implementation work for #18 is roughly:

  • Lock decision as an ADR (this landed as ADR-0007; see the note at the top of this RFC)
  • Add pg-boss to dependencies
  • Drop bullmq and ioredis from dependencies
  • Drop redis from docker-compose.yml
  • Drop REDIS_URL from .env.example and env-vars.md
  • Create src/lib/jobs.ts — boss singleton, schedule registration, queue handler registration
  • Create scripts/worker.ts — long-running entry, graceful SIGTERM
  • Migrate npm run scrape to enqueue a scrape:<retailer>:<category> job per retailer/category, with daily cron schedule
  • Update Architecture → Deployment § Scraper workers and Architecture → Ingest pipeline to match
  • M2 polish: write the /admin/jobs page reading pgboss.job

The migration from "manual npm run scrape" to "scheduled pg-boss job" is mechanical — the scrape bodies in src/scrapers/sites/*.ts don't change.

Out of scope

  • Cron syntax / specific schedules for individual jobs — owned by #18.
  • Per-job retry policies (how many attempts, how long the backoff) — owned by #18.
  • Drift / health alerting (parser returns 0 listings → Telegram alert) — separate ticket; depends on observability decision pulled along by RFC-0001.
  • LLM extraction queue specifics (#21) — pg-boss handles it; the LLM-call shape is a different RFC.
  • Replacing pg-boss with BullMQ — explicitly not this RFC. If scale demands it later, write a fresh RFC then.