
RFC-0002: Background-jobs runtime

Decision recorded. pg-boss on the existing Postgres. Drops bullmq, ioredis, the Redis container. See ADR-0007 for the locked decision; this RFC remains as the comparative analysis that produced it.

Summary

961tech has scrape jobs, future price-drop alerts, and a future image fetcher. Today, scrapes run by hand via npm run scrape. We need a runtime for the scheduled and on-demand background work. Recommendation: pg-boss on the existing Postgres — no new infrastructure, transactionally consistent with the data scrapes operate on, sized correctly for the 24–120 jobs/day workload. Drop bullmq and ioredis from package.json once this RFC lands. BullMQ remains the right answer if scrape volume or fanout grows by 10× (revisit at M3+). GitHub Actions cron is cheaper still but has no answer for the non-cron work in the roadmap.

Motivation

The current state is a code smell:

  • package.json lists bullmq ^5.76.2 + ioredis ^5.10.1 (caret major).
  • src/ has zero imports of either.
  • docker-compose.yml already runs Redis 7-alpine (only ever for queue use).
  • #18 "implement scraper queue" is unscheduled.
  • Architecture → Deployment § Scraper workers names BullMQ as the leading candidate, but no decision is recorded.

The deps were a decision-by-reflex: "we need a queue → BullMQ is the queue → install it." That reflex deserves a second look at this scale (~24 jobs/day at M1, ~120/day at M2), because adding Redis as production-critical infrastructure is a non-trivial ops decision and we already have a battle-tested transactional store sitting right there.

The roadmap also includes non-cron work that's coming up:

  • Price-drop alert dispatch (#10 / M2) — triggered when ListingPrice insert detects a drop ≥ threshold.
  • Image fetcher — currently scrapers grab image URLs only; eventually we fetch the bytes and push them to a CDN. Triggered per-product on first sight.
  • LLM-assisted spec extraction (#21) — async per-listing.

A pure-cron solution (GitHub Actions) covers scrapes but doesn't cover any of these.

Proposal

Recommendation: pg-boss on the existing Postgres 17

Why pg-boss, not BullMQ.

| Dimension | pg-boss | BullMQ |
| --- | --- | --- |
| New infra to operate | None — adds 10ish tables in a pgboss schema on the Postgres we already run | Redis as critical infra (or Upstash REST + a Workers-compatible client if RFC-0001 lands on Cloudflare) |
| Transactional consistency with our data | Yes — a job can UPDATE listings ... and mark itself complete in one transaction | No — jobs ack in Redis, data writes go to Postgres separately |
| Cron / delayed / one-off / singleton | All four (boss.schedule, startAfter, singletonKey) | All four |
| Retries with backoff | First-class (retryLimit, retryDelay, retryBackoff: true) | First-class |
| Dead-letter | First-class (deadLetter: 'queue-name') | De facto via the failed set |
| Concurrency control per queue | teamSize × teamConcurrency | Worker(name, fn, { concurrency }) |
| Dashboard | None shipped — write a small admin page (or query pgboss.job directly) | @bull-board/api (Express/Fastify/Next adapter) |
| Polling vs pub/sub | Polling (default 2s newJobCheckInterval — tunable) | Redis pub/sub — sub-second latency |
| Latency from enqueue to start | Up to the poll interval | Sub-second |
| Observability story | SQL queries against pgboss.job — composable with the rest of our DB | @bull-board UI; metrics over events |

At our scale, pg-boss's downsides — smaller ecosystem, polling latency, no shipped dashboard — don't bite. The upside (no new stateful infra, transactional consistency, fewer failure modes) is real every day.
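For concreteness, a minimal sketch of the pg-boss calls behind those table rows. Option names shift slightly between majors (v10, for instance, wants an explicit createQueue before the first send), so treat this as a shape to validate against the version we pin; queue names and payloads are illustrative, not decided:

```ts
import PgBoss from 'pg-boss';

// One boss over the existing DATABASE_URL; start() creates the
// pgboss schema and its tables if they don't exist yet.
const boss = new PgBoss(process.env.DATABASE_URL!);
await boss.start();

// Cron: register a daily schedule (exact schedules are owned by #18).
await boss.schedule('scrape', '0 3 * * *');

// One-off with first-class retries, backoff, and a dead-letter queue.
await boss.send('scrape', { retailer: 'example', category: 'cpu' }, {
  retryLimit: 3,
  retryDelay: 60,            // seconds before the first retry
  retryBackoff: true,        // exponential from retryDelay
  deadLetter: 'scrape-dead', // where the job lands after retries exhaust
});

// Delayed + singleton: at most one pending alert per listing.
await boss.send('price-drop-alert', { listingId: 42 }, {
  startAfter: 30,             // seconds from now
  singletonKey: 'listing-42', // dedupe key while one is queued
});
```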

Data and code shape

```mermaid
sequenceDiagram
    participant Cron as Schedule (pg-boss cron)
    participant Worker as Worker process
    participant PG as Postgres
    participant Retailer as Retailer

    Cron->>PG: Insert job row in pgboss.job
    Worker->>PG: poll for ready job (fetchNextJob)
    PG-->>Worker: job for retailer X, category CPU
    Worker->>Retailer: fetch listings
    Retailer-->>Worker: HTML
    Worker->>PG: BEGIN; upsert listings; UPDATE pgboss.job state=completed; COMMIT
```

A single transaction at the end makes "data written, job not marked complete" impossible.
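pg-boss doesn't expose "complete this job inside my transaction" as a first-class API, so the final step in the diagram means going at pgboss.job with plain SQL. A sketch of that shape, assuming pg-boss's default table and column names (verify against the pinned version, and note that mixing this with boss.work's own auto-completion needs care); the listings columns are placeholders for the real Prisma schema:

```ts
import { Pool } from 'pg';

const pool = new Pool({ connectionString: process.env.DATABASE_URL });

// Hypothetical handler body: listings upserts and job completion commit together.
async function completeScrapeInOneTx(
  jobId: string,
  listings: { url: string; priceCents: number }[],
) {
  const client = await pool.connect();
  try {
    await client.query('BEGIN');
    for (const l of listings) {
      // Idempotent upsert keyed by (retailerId, url); see Open questions #1.
      await client.query(
        `INSERT INTO listings (retailer_id, url, price_cents)
         VALUES ($1, $2, $3)
         ON CONFLICT (retailer_id, url)
         DO UPDATE SET price_cents = EXCLUDED.price_cents`,
        [1, l.url, l.priceCents],
      );
    }
    // Mark the pg-boss job done in the same transaction.
    await client.query(
      `UPDATE pgboss.job SET state = 'completed', completedon = now() WHERE id = $1`,
      [jobId],
    );
    await client.query('COMMIT');
  } catch (err) {
    await client.query('ROLLBACK');
    throw err; // let pg-boss retry per the queue's retry policy
  } finally {
    client.release();
  }
}
```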

Worker process model

The worker is a separate long-running process from the Next.js server (tsx scripts/worker.ts), invoked under whatever process supervisor the chosen host provides:

  • If RFC-0001 lands on Vercel: worker runs on a tiny external host (Railway/Fly/$3 VPS), or — at our scale — Vercel Cron triggers a route handler that drains N jobs per invocation (see the sketch after this list).
  • If RFC-0001 lands on Cloudflare: Cloudflare Cron Triggers + a scheduled Worker handler that drains N jobs from pgboss.job per invocation. No long-running consumer at all. Polling latency is whatever the cron cadence is.
  • If RFC-0001 lands on Hetzner: systemd unit running tsx scripts/worker.ts 24/7, restart-on-failure. Standard.
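What "drains N jobs per invocation" could look like as a route handler, sketched against pg-boss's fetch/complete/fail calls. Argument shapes differ between pg-boss 9 and 10 (this uses the v10-style queue-name-first form), and runScrape plus the route path are placeholders:

```ts
// src/app/api/jobs/drain/route.ts (hypothetical path)
import { boss } from '@/lib/jobs';      // the shared instance from src/lib/jobs.ts
import { runScrape } from '@/scrapers'; // placeholder for the real job body

export async function GET() {
  // Pull up to 10 ready jobs without a long-running subscription.
  const jobs = (await boss.fetch('scrape', { batchSize: 10 })) ?? [];
  for (const job of jobs) {
    try {
      await runScrape(job.data);
      await boss.complete('scrape', job.id);
    } catch (err) {
      await boss.fail('scrape', job.id, err);
    }
  }
  return Response.json({ drained: jobs.length });
}
```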

pg-boss happens to be friendly to all three shapes — it's just SQL.

What we drop

  • bullmq ^5.76.2 from package.json
  • ioredis ^5.10.1 from package.json
  • The Redis container from docker-compose.yml (unless RFC-0001 needs Redis for a non-queue reason — it doesn't today)

The REDIS_URL env var (env-vars.md § REDIS_URL) drops with them. One less prod secret to manage.

What we add

  • pg-boss as a runtime dep
  • A src/lib/jobs.ts module that owns the boss instance, schedule registration, and queue handler registration
  • scripts/worker.ts as the long-running entry point (both files are sketched after this list)
  • A small src/app/admin/jobs/page.tsx (or similar) to read job state from pgboss.job (M2 polish; not blocking)
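One possible skeleton for those two files; everything beyond the names this RFC already fixes (src/lib/jobs.ts, scripts/worker.ts) is a placeholder:

```ts
// src/lib/jobs.ts: owns the boss singleton, schedule registration, and
// queue handler registration.
import PgBoss from 'pg-boss';

export const boss = new PgBoss(process.env.DATABASE_URL!);

export async function startJobs() {
  await boss.start();

  // Schedule registration (cron syntax and cadence are owned by #18).
  await boss.schedule('scrape-fanout', '0 3 * * *');

  // Queue handler registration. Handler signatures differ by major
  // version (v10 hands the worker an array of jobs).
  await boss.work('scrape-fanout', async () => {
    // enqueue one scrape:<retailer>:<category> job per pair
  });
}
```

And the entry point that hosts it:

```ts
// scripts/worker.ts: long-running entry, graceful SIGTERM.
import { boss, startJobs } from '../src/lib/jobs';

await startJobs();

// Graceful shutdown: stop polling and let in-flight handlers finish.
process.on('SIGTERM', async () => {
  await boss.stop();
  process.exit(0);
});
```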

Trade-offs

| Cost | What it buys |
| --- | --- |
| Polling load on Postgres. Default 2s poll; invisible at our scale, would matter at 1000s of jobs/sec. | No second stateful infra to operate. |
| No shipped dashboard. Have to write a small admin page, or live with SQL queries against pgboss.job (example below). | One less thing to operate / authenticate / patch. |
| Smaller ecosystem. Fewer Stack Overflow answers, fewer plugins. | API surface is small (boss.send, boss.work, boss.schedule) — manageable solo. |
| Polling latency. Up to 2s from enqueue to start. | Daily scrapes don't notice; user-triggered work (alert dispatch) tolerates 2s easily. |
| Queue contention on the same DB. Real at 1000s of jobs/sec; not at 24–120/day. | Revisit if alert dispatch volume changes the picture. |
| Migration cost if scale demands BullMQ later. Rewriting the queue glue (~src/lib/jobs.ts) — maybe a day's work. | The job bodies don't change. Keeps BullMQ as a real fallback, not a stranded option. |
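The ad-hoc "dashboard" until that admin page exists is just SQL. A sketch of the rollup query, assuming pg-boss's default column names (completed rows eventually move to pgboss.archive under retention, so a real query may need to union both tables):

```ts
import { Pool } from 'pg';

const pool = new Pool({ connectionString: process.env.DATABASE_URL });

// State rollup per queue: the SQL-first observability story from the table.
const { rows } = await pool.query(`
  SELECT name, state, count(*) AS jobs, max(completedon) AS last_completed
  FROM pgboss.job
  GROUP BY name, state
  ORDER BY name, state
`);
console.table(rows);
```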

Alternatives

BullMQ on the existing Redis

The pre-installed answer. Mature, the obvious "queue in Node" solution. Web UI via @bull-board/api. Sub-second latency.

  • Operational fit: producer in route handlers, consumer in a separate process. During deploys, BullMQ workers shut down gracefully on SIGTERM; in-flight jobs finish or are re-queued via stalled-job recovery (Redis visibility-timeout model). A sketch follows this list.
  • Cost: marginal. Redis container exists. Memory for ~120 jobs/day is kilobytes.
  • Where it loses: runs Redis as production-critical infra for a workload that doesn't need it. Stalled-job semantics are subtle (visibility-timeout model bites people who write long-running jobs without heartbeats). At 24 jobs/day this is a sledgehammer.
  • When this wins: scrape volume or fanout grows by 10×, or when #21 LLM extraction wants per-listing sub-second handoff and concurrency that pg-boss polling can't keep up with.
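For comparison, the BullMQ worker glue this RFC declines, roughly what the "day's work" rewrite in the trade-offs table would produce (queue name and connection details are illustrative):

```ts
import { Worker } from 'bullmq';

// Consumer process: per-queue concurrency, graceful shutdown on deploy.
const worker = new Worker(
  'scrape',
  async (job) => {
    // same job body as the pg-boss version; only the glue changes
  },
  { connection: { host: 'localhost', port: 6379 }, concurrency: 4 },
);

process.on('SIGTERM', async () => {
  await worker.close(); // waits for in-flight jobs before resolving
  process.exit(0);
});
```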

Migration from pg-boss to BullMQ is a bounded rewrite if it turns out we need it. Migration the other way is rare.

GitHub Actions scheduled workflows

.github/workflows/scrape.yml with schedule: cron: '0 3 * * *' running npm ci && npm run scrape. Zero infra to operate. Free on a public repo; ~$30/mo at M2 volume on a private repo.

  • Where it loses (the deal-breaker): no queue. No way to enqueue work dynamically from the app. Price-drop alert dispatch and image fetcher are app-triggered work — that's a queue, not a cron.
  • Other footguns: cron drifts 5–15 min under GH load; schedules are silently disabled after 60 days without repo activity; no in-flight visibility beyond the Actions log; whole-job retries only (can't retry per-listing).
  • Where it earns its keep: as a belt-and-braces watchdog alongside pg-boss — a daily GH Actions cron that hits a /api/jobs/heartbeat endpoint and alerts if no recent scrape ran (sketched below). That's a future "watcher" ticket, not a primary runtime.
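A sketch of what that heartbeat endpoint could return. The 26-hour window, the query, and everything beyond the route path the bullet names are assumptions, and since completed jobs archive out of pgboss.job, a production version likely checks pgboss.archive too:

```ts
// src/app/api/jobs/heartbeat/route.ts (hypothetical)
import { Pool } from 'pg';

const pool = new Pool({ connectionString: process.env.DATABASE_URL });

export async function GET() {
  // Most recent completed scrape still sitting in pgboss.job.
  const { rows } = await pool.query(`
    SELECT max(completedon) AS last
    FROM pgboss.job
    WHERE name = 'scrape' AND state = 'completed'
  `);
  const last: Date | null = rows[0]?.last ?? null;
  const fresh = last !== null && Date.now() - last.getTime() < 26 * 60 * 60 * 1000;
  // The GH Actions cron curls this; a non-200 fails the workflow and alerts.
  return Response.json({ last, fresh }, { status: fresh ? 200 : 503 });
}
```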

Open questions

  1. Semantics for scrape jobs — at-least-once or exactly-once? Recommendation: at-least-once is fine. Listing upsert is keyed by (retailerId, url) (see Prisma models § Listing), so a duplicated scrape is idempotent at the DB level (a sketch follows this list). Confirm with MASTER.
  2. Job state retention. How long do we keep completed/failed job rows? pg-boss defaults to 7 days for completed, 14 for failed. Worth confirming — at 120 jobs/day the table stays tiny anyway.
  3. Admin dashboard timing. The "small admin page reading pgboss.job" is M2 polish — is that acceptable, or does MASTER want job visibility from day one? If day one, we use psql queries documented in a runbook until the page exists.
  4. Drop deps now or after the implementation ticket? Recommendation: drop in #18's implementation PR, not in this RFC's commit. Keeps the audit and the RFC on the same branch they live on now.
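What the idempotency claim in question 1 rests on, as a Prisma sketch. It assumes a compound unique on (retailerId, url) with Prisma's default retailerId_url name; the title field and values are placeholders:

```ts
import { PrismaClient } from '@prisma/client';

const prisma = new PrismaClient();

// A redelivered scrape job replays the same upsert: the compound unique
// key turns the second delivery into a harmless update, not a duplicate row.
await prisma.listing.upsert({
  where: { retailerId_url: { retailerId: 1, url: 'https://retailer.example/cpu/123' } },
  create: { retailerId: 1, url: 'https://retailer.example/cpu/123', title: 'Example CPU' },
  update: { title: 'Example CPU' },
});
```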

Implementation plan

Once MASTER picks pg-boss, the implementation work for #18 is roughly:

  • Lock decision as an ADR (this landed as ADR-0007; see the note at the top of this RFC)
  • Add pg-boss to dependencies
  • Drop bullmq and ioredis from dependencies
  • Drop redis from docker-compose.yml
  • Drop REDIS_URL from .env.example and env-vars.md
  • Create src/lib/jobs.ts — boss singleton, schedule registration, queue handler registration
  • Create scripts/worker.ts — long-running entry, graceful SIGTERM
  • Migrate npm run scrape to enqueue a scrape:<retailer>:<category> job per retailer/category, with daily cron schedule
  • Update Architecture → Deployment § Scraper workers and Architecture → Ingest pipeline to match
  • M2 polish: write the /admin/jobs page reading pgboss.job

The migration from "manual npm run scrape" to "scheduled pg-boss job" is mechanical — the scrape bodies in src/scrapers/sites/*.ts don't change.

Out of scope

  • Cron syntax / specific schedules for individual jobs — owned by #18.
  • Per-job retry policies (how many attempts, how long the backoff) — owned by #18.
  • Drift / health alerting (parser returns 0 listings → Telegram alert) — separate ticket; depends on observability decision pulled along by RFC-0001.
  • LLM extraction queue specifics (#21) — pg-boss handles it; the LLM-call shape is a different RFC.
  • Replacing pg-boss with BullMQ — explicitly not this RFC. If scale demands it later, write a fresh RFC then.