
Scraper failure runbook

A scraper fails when the retailer changes their HTML structure, blocks our user agent, or their server goes down. This runbook is the diagnosis tree.

Symptom: a retailer's listing count drops to zero (or near-zero)

Most often: the retailer changed their HTML class names. Sometimes: anti-bot challenge or geo-block.

Triage tree

1. Confirm scope — single retailer, or all of them?

docker exec 961tech-postgres psql -U postgres -d tech961 -c \
  "SELECT r.name, COUNT(l.id) FILTER (WHERE l.\"deletedAt\" IS NULL) AS live FROM \"Retailer\" r LEFT JOIN \"Listing\" l ON l.\"retailerId\"=r.id GROUP BY r.name ORDER BY live DESC;"
  • Single retailer at zero, others healthy → step 2 (HTML drift / block on that retailer).
  • All zero → step 5 (our infra problem).
  • Steady but no growth → step 3 (the scraper ran, found nothing new, soft-delete sweep is doing its job).

2. Reproduce — can you fetch the category page at all?

curl -A "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36" \
  "<retailer-base>/<category-path>" -o /tmp/page.html -w "%{http_code}\n"
| Result | Cause | Fix |
| --- | --- | --- |
| 200 + content | HTML drift | step 4 |
| 403 / 429 | UA block / rate-limit | step 6 |
| 503 + Cloudflare managed challenge | bot block | step 7 |
| Connection refused / timeout | retailer down | wait, retry in 30 min |
| 200 but empty body | JS-rendered page (SPA) | step 8 |
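The decision table above can be sketched as a small classifier. This is an illustrative sketch, not code from the repo — the names (`FetchDiagnosis`, `classifyFetch`) and the `cf-chl` challenge marker are assumptions:

```typescript
// Hypothetical classifier mirroring the triage table.
type FetchDiagnosis =
  | "html-drift"     // step 4
  | "ua-block"       // step 6
  | "bot-challenge"  // step 7
  | "retailer-down"  // wait, retry in 30 min
  | "js-rendered";   // step 8

function classifyFetch(status: number | null, body: string): FetchDiagnosis {
  if (status === null) return "retailer-down";            // connection refused / timeout
  if (status === 403 || status === 429) return "ua-block";
  if (status === 503 && body.includes("cf-chl")) return "bot-challenge"; // assumed Cloudflare marker
  if (status === 200 && body.trim().length === 0) return "js-rendered";
  return "html-drift"; // 200 with content, but selectors no longer match
}
```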

3. The scraper ran fine but produced no NEW listings

Expected behavior — the catalog is steady-state. The soft-delete sweep stamps deletedAt on anything not seen in 48 h. To verify nothing is wrong:

# Should be a small positive number for active retailers.
docker exec 961tech-postgres psql -U postgres -d tech961 -c \
  "SELECT r.name, COUNT(*) FILTER (WHERE l.\"lastSeenAt\" > NOW() - INTERVAL '6 hours') AS fresh_6h FROM \"Retailer\" r LEFT JOIN \"Listing\" l ON l.\"retailerId\"=r.id WHERE l.\"deletedAt\" IS NULL GROUP BY r.name;"

If fresh_6h is zero on a retailer that ran in the last hour, step 4.

4. HTML drift — selectors stopped matching

Compare the live HTML against the test fixture:

# Re-pull a known category page.
curl -A "Mozilla/5.0" "<retailer-base>/<category-path>" -o /tmp/live.html

# Diff against the fixture.
diff <(grep -E 'class=' /tmp/live.html | sort -u) \
     <(grep -E 'class=' tests/scrapers/fixtures/<retailer>/<category>.html | sort -u) | head

Open src/scrapers/sites/<retailer>.ts and find the selector that no longer matches. Update it. Run the unit test (it'll fail until you update the fixture too):

npx vitest run tests/scrapers/<retailer>.test.ts

Update the fixture by saving the live HTML under tests/scrapers/fixtures/<retailer>/<category>.html (trim it to ~10-20 listings; full pages bloat the repo).
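When hunting the drifted selector, a quick raw-HTML check can confirm which class name died before you touch the scraper. A hypothetical helper (class names must be plain, no regex metacharacters); nothing here is from src/scrapers:

```typescript
// Count whole-word occurrences of a CSS class in raw HTML.
function countClass(html: string, className: string): number {
  const re = new RegExp(`class="[^"]*\\b${className}\\b[^"]*"`, "g");
  return (html.match(re) ?? []).length;
}

// Compare the old selector's class against a candidate replacement.
function selectorDrift(html: string, oldClass: string, newClass: string): string {
  const oldHits = countClass(html, oldClass);
  const newHits = countClass(html, newClass);
  if (oldHits === 0 && newHits > 0) return `drift: "${oldClass}" -> "${newClass}" (${newHits} matches)`;
  if (oldHits > 0) return `no drift: "${oldClass}" still matches ${oldHits} nodes`;
  return "neither selector matches — inspect /tmp/live.html manually";
}
```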

5. All retailers at zero — our problem

Check the scrape script ran at all:

# Recent successful run? Each row carries a per-retailer log line.
tail -100 /var/log/961tech/scrape.log 2>/dev/null || \
  tail -100 .review-screenshots/scrape.log 2>/dev/null || \
  echo 'no log found — script may not have run'

If the script never ran, check the scheduler:
  • Cloudflare Cron Triggers (per ADR-0006) — check the Workers Cron Events panel
  • pg-boss schedule (per ADR-0007 + #67) — query pgboss.job WHERE name = 'scrape'

If the script ran but failed early — check IP_HASH_SECRET, DATABASE_URL, and npm run build output.

6. UA / IP block (403 or 429)

Try a different UA. If a polite UA gets through, update src/scrapers/core/http.ts. If even a real-browser UA gets blocked, the retailer has IP-banned us — escalate to MASTER for retailer outreach (#81).

Mitigation: send a Referer header, rotate the user agent across requests, and throttle to 1 request per 2 seconds.
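The rotation and throttling pieces of that mitigation can be sketched as pure helpers. The UA list and names are illustrative, not what src/scrapers/core/http.ts actually contains:

```typescript
// Hypothetical UA pool; rotate per request so no single UA dominates the logs.
const USER_AGENTS = [
  "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
  "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
];

function pickUserAgent(requestIndex: number): string {
  return USER_AGENTS[requestIndex % USER_AGENTS.length];
}

// 1 request per 2 seconds.
const MIN_GAP_MS = 2000;

// How long to sleep before the next request is allowed.
function throttleDelay(lastRequestAt: number, now: number): number {
  return Math.max(0, MIN_GAP_MS - (now - lastRequestAt));
}
```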

7. Cloudflare bot challenge

The retailer enabled Cloudflare's "I'm Under Attack" mode. Cannot be defeated by scraper headers alone.

Options:
  • Wait — these modes are usually temporary (24-72 hours)
  • Outreach — ask the retailer to allowlist 961tech (#81)
  • Defer — set the retailer to active = false until unblocked

Document it in the access-blocker register in docs/reference/retailers.md.

8. JS-rendered SPA

The retailer's catalog is rendered client-side. Cheerio (used by all current scrapers) sees an empty <div id="root"></div> in the HTML.

Options:
  • Look for an XHR API the SPA calls. Most SPAs use JSON endpoints; find them in the DevTools Network tab and fetch them directly.
  • Headless browser — Playwright / Puppeteer. A heavier dependency; use only when no API is available.
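Once DevTools reveals the SPA's JSON endpoint, the payload usually maps straight to listings. A hedged sketch — the field names (`items`, `products`, `title`, `name`, `price`) are common patterns, not a known schema; verify against the actual XHR response:

```typescript
// Shape we normalize each retailer's API payload into (illustrative).
interface ApiListing {
  id: string;
  title: string;
  price: number;
}

// Parse a raw JSON payload into listings, tolerating common field variants.
function parseListings(payload: string): ApiListing[] {
  const data = JSON.parse(payload);
  // Catalog APIs often nest results under "items" or "products"; adjust per retailer.
  const items: any[] = data.items ?? data.products ?? [];
  return items.map((it) => ({
    id: String(it.id),
    title: String(it.title ?? it.name ?? ""),
    price: Number(it.price ?? 0),
  }));
}
```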

Document the discovery pattern in the per-retailer file.

Post-fix

  • Re-run the scrape: npm run scrape
  • Confirm fresh listings landed: re-run the fresh_6h query from step 3
  • Update docs/reference/retailers.md if the diagnosis revealed a blocker class worth recording
  • File a follow-up issue with the new HTML pattern as a regression test fixture