# RFC-0002: Background-jobs runtime
- Status: Accepted — locked by ADR-0007 on 2026-04-28
- Author: MASTER (drafted by Claude as part of #34)
- Date: 2026-04-28
- Related: #18, ADR-0007, Architecture → Deployment, tech-stack reference, RFC-0001
**Decision recorded.** pg-boss on the existing Postgres. Drops `bullmq`, `ioredis`, and the Redis container. See ADR-0007 for the locked decision; this RFC remains as the comparative analysis that produced it.
## Summary
961tech has scrape jobs, future price-drop alerts, and a future image fetcher. Today, scrapes run by hand via `npm run scrape`. We need a runtime for the scheduled and on-demand background work. Recommendation: pg-boss on the existing Postgres — no new infrastructure, transactionally consistent with the data scrapes operate on, sized correctly for the 24–120 jobs/day workload. Drop `bullmq` and `ioredis` from `package.json` once this RFC lands. BullMQ remains the right answer if scrape volume or fanout grows by 10× (revisit at M3+). GitHub Actions cron is cheaper still but has no answer for the non-cron work in the roadmap.
## Motivation
The current state is a code smell:
- `package.json` lists `bullmq ^5.76.2` + `ioredis ^5.10.1` (caret major).
- `src/` has zero imports of either.
- `docker-compose.yml` already runs Redis 7-alpine (only ever for queue use).
- #18 "implement scraper queue" is unscheduled.
- Architecture → Deployment § Scraper workers names BullMQ as the leading candidate, but no decision is recorded.
The deps were a decision-by-reflex: "we need a queue → BullMQ is the queue → install it." That reflex deserves a second look at this scale (~24 jobs/day at M1, ~120/day at M2), because adding Redis as production-critical infrastructure is a non-trivial ops decision and we already have a battle-tested transactional store sitting right there.
The roadmap also includes non-cron work that's coming up:
- Price-drop alert dispatch (#10 / M2) — triggered when a `ListingPrice` insert detects a drop ≥ threshold.
- Image fetcher — currently scrapers grab image URLs only; eventually, fetch the bytes and push to a CDN. Triggered per-product on first sight.
- LLM-assisted spec extraction (#21) — async per-listing.
A pure-cron solution (GitHub Actions) covers scrapes but doesn't cover any of these.
## Proposal
### Recommendation: pg-boss on the existing Postgres 17
Why pg-boss, not BullMQ.
| Dimension | pg-boss | BullMQ |
|---|---|---|
| New infra to operate | None — adds ~10 tables in a `pgboss` schema on the Postgres we already run | Redis as critical infra (or Upstash REST + a Workers-compatible client if RFC-0001 lands on Cloudflare) |
| Transactional consistency with our data | Yes — a job can `UPDATE listings ...` and mark itself complete in one transaction | No — jobs ack in Redis, data writes go to Postgres separately |
| Cron / delayed / one-off / singleton | All four (`boss.schedule`, `startAfter`, `singletonKey`) | All four |
| Retries with backoff | First-class (`retryLimit`, `retryDelay`, `retryBackoff: true`) | First-class |
| Dead-letter | First-class (`deadLetter: 'queue-name'`) | De facto via the failed set |
| Concurrency control per queue | `teamSize` × `teamConcurrency` | `Worker(name, fn, { concurrency })` |
| Dashboard | None shipped — write a small admin page (or query `pgboss.job` directly) | `@bull-board/api` (Express/Fastify/Next adapter) |
| Polling vs pub/sub | Polling (default 2s `newJobCheckInterval` — tunable) | Redis pub/sub — sub-second latency |
| Latency from enqueue to start | Up to the poll interval | Sub-second |
| Observability story | SQL queries against `pgboss.job` — composable with the rest of our DB | `@bull-board` UI; metrics over events |
At our scale, pg-boss's downsides — smaller ecosystem, polling latency, no shipped dashboard — don't bite. The upside (no new stateful infra, transactional consistency, fewer failure modes) is real every day.
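To make the per-job options above concrete, a minimal sketch of the producer side using pg-boss's documented `send` options; the queue name, payload, and values are illustrative, not the final configuration:

```ts
import PgBoss from 'pg-boss';

const boss = new PgBoss(process.env.DATABASE_URL!);
await boss.start();

// One-off job exercising the knobs from the comparison table.
await boss.send(
  'image:fetch',                    // hypothetical queue name
  { productId: 'p_123' },           // hypothetical payload
  {
    retryLimit: 5,                  // attempts before the job fails for good
    retryDelay: 60,                 // seconds between attempts
    retryBackoff: true,             // exponential rather than fixed delay
    deadLetter: 'image:fetch:dead', // exhausted jobs land on this queue
    singletonKey: 'p_123',          // collapse duplicate enqueues per product
    startAfter: 30,                 // delayed start, in seconds
  }
);
```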
### Data and code shape
```mermaid
sequenceDiagram
    participant Cron as Schedule (pg-boss cron)
    participant Worker as Worker process
    participant PG as Postgres
    participant Retailer as Retailer
    Cron->>PG: Insert job row in pgboss.job
    Worker->>PG: poll for ready job (fetchNextJob)
    PG-->>Worker: job for retailer X, category CPU
    Worker->>Retailer: fetch listings
    Retailer-->>Worker: HTML
    Worker->>PG: BEGIN; upsert listings; UPDATE pgboss.job state=completed; COMMIT
```
The single transaction at the end makes the failure mode "data written, job not marked complete" impossible.
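A minimal sketch of that last step, assuming Prisma (per the Prisma models reference), Prisma's default compound-unique name for `(retailerId, url)`, and the direct `UPDATE pgboss.job` shown in the diagram; the listing fields are illustrative:

```ts
import { PrismaClient } from '@prisma/client';

const prisma = new PrismaClient();

// Hypothetical scraped-listing shape; the real fields live in the Prisma schema.
type ScrapedListing = { retailerId: string; url: string; title: string; priceCents: number };

async function persistAndComplete(jobId: string, listings: ScrapedListing[]) {
  await prisma.$transaction(async (tx) => {
    for (const l of listings) {
      // Upsert keyed by (retailerId, url), the idempotency key cited in Open questions.
      await tx.listing.upsert({
        where: { retailerId_url: { retailerId: l.retailerId, url: l.url } },
        update: { title: l.title, priceCents: l.priceCents },
        create: l,
      });
    }
    // Same transaction: flip the pg-boss job row to completed directly, as in
    // the diagram. If this COMMIT fails, neither the data nor the completion
    // persists, so the job is retried whole.
    await tx.$executeRaw`UPDATE pgboss.job SET state = 'completed' WHERE id = ${jobId}::uuid`;
  });
}
```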
### Worker process model
The worker is a long-running process separate from the Next.js server (`tsx scripts/worker.ts`), invoked under whatever process supervisor the chosen host provides:
- If RFC-0001 lands on Vercel: worker runs on a tiny external host (Railway/Fly/$3 VPS), or — at our scale — Vercel Cron triggers a route handler that drains N jobs per invocation.
- If RFC-0001 lands on Cloudflare: Cloudflare Cron Triggers + a scheduled Worker handler that drains N jobs from `pgboss.job` per invocation. No long-running consumer at all. Polling latency is whatever the cron cadence is.
- If RFC-0001 lands on Hetzner: systemd unit running `tsx scripts/worker.ts` 24/7, restart-on-failure. Standard.
pg-boss happens to be friendly to all three shapes — it's just SQL.
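For the Hetzner-style long-running shape, a sketch of the entry point, assuming pg-boss's v9-style single-job `work` handler (v10 hands the handler an array and expects `createQueue` up front) and a single illustrative queue name:

```ts
// scripts/worker.ts — long-running consumer (sketch).
import PgBoss from 'pg-boss';

async function main() {
  const boss = new PgBoss(process.env.DATABASE_URL!);
  boss.on('error', (err) => console.error('[worker] pg-boss error:', err));
  await boss.start();

  // One work() registration per queue; the plan's per-retailer queues
  // (scrape:<retailer>:<category>) would each get their own.
  await boss.work('scrape', async (job) => {
    console.log('[worker] scraping', job.data);
    // actual scrape body lives in src/scrapers/sites/*.ts
  });

  // Graceful SIGTERM: stop polling for new jobs, let in-flight ones finish.
  process.on('SIGTERM', async () => {
    await boss.stop({ graceful: true });
    process.exit(0);
  });
}

main().catch((err) => {
  console.error(err);
  process.exit(1);
});
```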
### What we drop
- `bullmq ^5.76.2` from `package.json`
- `ioredis ^5.10.1` from `package.json`
- The Redis container from `docker-compose.yml` (unless RFC-0001 needs Redis for a non-queue reason — it doesn't today)
The `REDIS_URL` env var (env-vars.md § REDIS_URL) drops with them. One less prod secret to manage.
### What we add
- `pg-boss` as a runtime dep
- A `src/lib/jobs.ts` module that owns the `boss` instance, schedule registration, and queue handler registration
- `scripts/worker.ts` as the long-running entry point
- A small `src/app/admin/jobs/page.tsx` (or similar) to read job state from `pgboss.job` (M2 polish; not blocking — a minimal read is sketched below)
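The admin page can be a plain read of `pgboss.job`. A hedged sketch as a Next.js server component; column names follow pg-boss v10's snake_case schema and differ in v9 (`createdon` and friends), so verify against the installed version:

```tsx
// src/app/admin/jobs/page.tsx — minimal job-state view (sketch).
import { PrismaClient } from '@prisma/client';

const prisma = new PrismaClient();

type JobRow = { id: string; name: string; state: string; created_on: Date };

export default async function JobsPage() {
  // Raw query because the pgboss schema isn't part of our Prisma models.
  const jobs = await prisma.$queryRaw<JobRow[]>`
    SELECT id, name, state, created_on
    FROM pgboss.job
    ORDER BY created_on DESC
    LIMIT 50`;
  return (
    <table>
      <tbody>
        {jobs.map((j) => (
          <tr key={j.id}>
            <td>{j.name}</td>
            <td>{j.state}</td>
            <td>{j.created_on.toISOString()}</td>
          </tr>
        ))}
      </tbody>
    </table>
  );
}
```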
## Trade-offs
| Cost | What it buys |
|---|---|
| Polling load on Postgres. Default 2s poll. Invisible at our scale; would matter at 1000s of jobs/sec. | No second stateful infra to operate. |
| No shipped dashboard. Have to write a small admin page (or live with SQL queries against `pgboss.job`). | One less thing to operate / authenticate / patch. |
| Smaller ecosystem. Fewer Stack Overflow answers, fewer plugins. | API surface is small (`boss.send`, `boss.work`, `boss.schedule`) — manageable solo. |
| Polling latency. Up to 2s from enqueue to start. | Daily scrapes don't notice; user-triggered work (alert dispatch) tolerates 2s easily. |
| Queue contention on the same DB. At 1000s of jobs/sec this is real. | Not real at 24–120/day. Revisit if alert dispatch volume changes the picture. |
| Migration cost if scale demands BullMQ later. Rewriting the queue glue (`src/lib/jobs.ts`) — maybe a day's work. The job bodies don't change. | Keeps BullMQ as a real fallback, not a stranded option. |
## Alternatives
### BullMQ on the existing Redis
The pre-installed answer. Mature, the obvious "queue in Node" solution. Web UI via `@bull-board/api`. Sub-second latency.
- Operational fit: producer in route handlers, consumer in a separate process. During deploys, BullMQ workers handle SIGTERM with a graceful shutdown; in-flight jobs finish or get re-queued via stalled-job recovery (Redis visibility timeout).
- Cost: marginal. Redis container exists. Memory for ~120 jobs/day is kilobytes.
- Where it loses: runs Redis as production-critical infra for a workload that doesn't need it. Stalled-job semantics are subtle (visibility-timeout model bites people who write long-running jobs without heartbeats). At 24 jobs/day this is a sledgehammer.
- When this wins: scrape volume or fanout grows by 10×, or when #21 LLM extraction wants per-listing sub-second handoff and concurrency that pg-boss polling can't keep up with.
Migration from pg-boss to BullMQ is a bounded rewrite if it turns out we need it. Migration the other way is rare.
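To size that "bounded rewrite": the BullMQ equivalent of the queue glue is a few lines of its standard API. Queue and job names mirror the plan below and are illustrative:

```ts
import { Queue, Worker } from 'bullmq';

const connection = { host: 'localhost', port: 6379 }; // the Redis this RFC drops

// Producer: retries with exponential backoff, per the comparison table.
const scrapeQueue = new Queue('scrape', { connection });
await scrapeQueue.add(
  'scrape:retailer-a:cpu',
  { retailer: 'retailer-a', category: 'cpu' },
  { attempts: 5, backoff: { type: 'exponential', delay: 60_000 } }
);

// Consumer: per-queue concurrency via Worker options.
const worker = new Worker('scrape', async (job) => {
  // same job body as the pg-boss version; only this glue changes
  console.log('scraping', job.data);
}, { connection, concurrency: 5 });
```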
### GitHub Actions scheduled workflows
`.github/workflows/scrape.yml` with `schedule: cron: '0 3 * * *'` running `npm ci && npm run scrape`. Zero infra to operate. Free on a public repo; ~$30/mo at M2 volume on a private repo.
- Where it loses (the deal-breaker): no queue. No way to enqueue work dynamically from the app. Price-drop alert dispatch and image fetcher are app-triggered work — that's a queue, not a cron.
- Other footguns: cron drift 5–15 min under GH load; silent skip if the default branch is inactive for 60 days; no in-flight visibility beyond the Actions log; whole-job retries only (can't retry per-listing).
- Where it earns its keep: as a belt-and-braces redundant trigger alongside pg-boss — a daily GH Actions cron that hits a `/api/jobs/heartbeat` endpoint, alerting if no recent scrape ran. That's a future "watcher" ticket, not a primary runtime (a sketch of the endpoint follows below).
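If that watcher ticket happens, the endpoint could stay tiny. A hypothetical sketch (route path from the bullet above; the 26-hour threshold and the snake_case column name are assumptions to verify):

```ts
// src/app/api/jobs/heartbeat/route.ts — hypothetical watcher endpoint (sketch).
import { PrismaClient } from '@prisma/client';

const prisma = new PrismaClient();

export async function GET() {
  // "Did any scrape job complete in the last 26 hours?" Note that pg-boss
  // archives completed rows after a retention window, so a real version may
  // need to widen retention or also check pgboss.archive.
  const [row] = await prisma.$queryRaw<{ recent: bigint }[]>`
    SELECT count(*) AS recent
    FROM pgboss.job
    WHERE name LIKE 'scrape:%'
      AND state = 'completed'
      AND completed_on > now() - interval '26 hours'`;
  const healthy = row.recent > 0n;
  return Response.json({ healthy }, { status: healthy ? 200 : 503 });
}
```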
## Open questions
- Semantics for scrape jobs — at-least-once or exactly-once? Recommendation: at-least-once is fine. Listing upsert is keyed by `(retailerId, url)` (see Prisma models § Listing), so a duplicated scrape is idempotent at the DB level. Confirm with MASTER.
- Job state retention. How long do we keep completed/failed job rows? pg-boss defaults to 7 days for completed, 14 for failed. Worth confirming — at 120 jobs/day the table stays tiny anyway.
- Admin dashboard timing. The "small admin page reading `pgboss.job`" is M2 polish — is that acceptable, or does MASTER want job visibility from day one? If day one, we use `psql` queries documented in a runbook until the page exists.
- Drop deps now or after the implementation ticket? Recommendation: drop in #18's implementation PR, not in this RFC's commit. Keeps the audit and the RFC on the same branch they live on now.
## Implementation plan
Once MASTER picks pg-boss, the implementation work for #18 is roughly:
- Lock decision as ADR-0005 (or whichever number is next)
- Add `pg-boss` to `dependencies`
- Drop `bullmq` and `ioredis` from `dependencies`
- Drop `redis` from `docker-compose.yml`
- Drop `REDIS_URL` from `.env.example` and env-vars.md
- Create `src/lib/jobs.ts` — `boss` singleton, schedule registration, queue handler registration
- Create `scripts/worker.ts` — long-running entry, graceful SIGTERM
- Migrate `npm run scrape` to enqueue a `scrape:<retailer>:<category>` job per retailer/category, with a daily cron schedule (sketched below)
- Update Architecture → Deployment § Scraper workers and Architecture → Ingest pipeline to match
- M2 polish: write the `/admin/jobs` page reading `pgboss.job`
The migration from manual `npm run scrape` to a scheduled pg-boss job is mechanical — the scrape bodies in `src/scrapers/sites/*.ts` don't change.
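The enqueue side of that migration is a loop over retailer/category pairs. A sketch, with a placeholder pair list; the real cron expressions are owned by #18:

```ts
import PgBoss from 'pg-boss';

// Placeholder pairs; the real list lives alongside the scrapers.
const PAIRS = [
  { retailer: 'retailer-a', category: 'cpu' },
  { retailer: 'retailer-a', category: 'gpu' },
  { retailer: 'retailer-b', category: 'cpu' },
];

export async function registerScrapeSchedules(boss: PgBoss) {
  for (const { retailer, category } of PAIRS) {
    // One cron-backed schedule per scrape:<retailer>:<category> queue.
    // Cron cadence here is illustrative (see Out of scope).
    await boss.schedule(
      `scrape:${retailer}:${category}`,
      '0 3 * * *',
      { retailer, category },
      { tz: 'UTC' }
    );
  }
}
```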
## Out of scope
- Cron syntax / specific schedules for individual jobs — owned by #18.
- Per-job retry policies (how many attempts, how long the backoff) — owned by #18.
- Drift / health alerting (parser returns 0 listings → Telegram alert) — separate ticket; depends on observability decision pulled along by RFC-0001.
- LLM extraction queue specifics (#21) — pg-boss handles it; the LLM-call shape is a different RFC.
- Replacing pg-boss with BullMQ — explicitly not this RFC. If scale demands it later, write a fresh RFC then.