Scraper failure runbook
A scraper fails when the retailer changes its HTML structure, blocks our user agent, or has a server outage. This runbook is the diagnosis tree.
Symptom: a retailer's listing count drops to zero (or near-zero)
Most often: the retailer changed their HTML class names. Sometimes: anti-bot challenge or geo-block.
Triage tree
1. Confirm scope — single retailer, or all of them?
docker exec 961tech-postgres psql -U postgres -d tech961 -c \
"SELECT r.name, COUNT(l.id) FILTER (WHERE l.\"deletedAt\" IS NULL) AS live FROM \"Retailer\" r LEFT JOIN \"Listing\" l ON l.\"retailerId\"=r.id GROUP BY r.name ORDER BY live DESC;"
- Single retailer at zero, others healthy → step 2 (HTML drift / block on that retailer).
- All zero → step 5 (our infra problem).
- Steady but no growth → step 3 (the scraper ran, found nothing new, soft-delete sweep is doing its job).
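The branching above can be sketched as a small pure function over the query's output rows. This is only an illustration: the `RetailerCount` shape, the `classifyScope` name, and the `nearZero` threshold are assumptions, not part of the existing codebase, and the sketch treats "one or more retailers at zero" the same as the single-retailer case.

```typescript
// Hypothetical row shape mirroring the two columns the query returns.
type RetailerCount = { name: string; live: number };

type Scope =
  | { kind: "all-zero"; nextStep: 5 }
  | { kind: "some-zero"; retailers: string[]; nextStep: 2 }
  | { kind: "steady"; nextStep: 3 };

// Decide which branch of the triage tree applies.
// `nearZero` is an assumed cutoff for "near-zero"; tune it per catalog size.
function classifyScope(rows: RetailerCount[], nearZero = 0): Scope {
  const down = rows.filter((r) => r.live <= nearZero).map((r) => r.name);
  if (rows.length > 0 && down.length === rows.length) {
    return { kind: "all-zero", nextStep: 5 };
  }
  if (down.length > 0) {
    return { kind: "some-zero", retailers: down, nextStep: 2 };
  }
  return { kind: "steady", nextStep: 3 };
}
```

For example, `classifyScope([{ name: "a", live: 0 }, { name: "b", live: 12 }])` reports retailer `a` and points at step 2.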
2. Reproduce — can you fetch the category page at all?
curl -A "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36" \
"<retailer-base>/<category-path>" -o /tmp/page.html -w "%{http_code}\n"
| Result | Cause | Fix |
|---|---|---|
| 200 + content | HTML drift | step 4 |
| 403 / 429 | UA block / rate-limit | step 6 |
| 503 + Cloudflare-managed-challenge | bot block | step 7 |
| Connection refused / timeout | retailer down | wait, retry in 30 min |
| 200 but empty body | JS-rendered page (SPA) | step 8 |
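The table can also be read as a decision function. The sketch below is a rough heuristic under stated assumptions: the `FetchOutcome` and `Diagnosis` names are hypothetical, a 503 is treated as a challenge only when the body mentions Cloudflare, and "empty body" is approximated as fewer than `minVisibleChars` characters of text once scripts, styles, and tags are stripped.

```typescript
// Outcome of one fetch attempt. `status` is null when the connection
// was refused or timed out before any HTTP response arrived.
type FetchOutcome = { status: number | null; body: string };

type Diagnosis =
  | "html-drift (step 4)"
  | "ua-block-or-rate-limit (step 6)"
  | "bot-challenge (step 7)"
  | "retailer-down (retry in 30 min)"
  | "js-rendered (step 8)"
  | "unclassified";

// Approximate what a static parser can read: drop scripts, styles, and tags.
function visibleText(html: string): string {
  return html
    .replace(/<script[\s\S]*?<\/script>/gi, "")
    .replace(/<style[\s\S]*?<\/style>/gi, "")
    .replace(/<[^>]+>/g, "")
    .trim();
}

// `minVisibleChars` is an assumed threshold, not a measured one.
function diagnose({ status, body }: FetchOutcome, minVisibleChars = 200): Diagnosis {
  if (status === null) return "retailer-down (retry in 30 min)";
  if (status === 403 || status === 429) return "ua-block-or-rate-limit (step 6)";
  // Assumption: only a 503 whose body mentions Cloudflare counts as a challenge.
  if (status === 503 && /cloudflare/i.test(body)) return "bot-challenge (step 7)";
  if (status === 200) {
    return visibleText(body).length < minVisibleChars
      ? "js-rendered (step 8)"
      : "html-drift (step 4)";
  }
  return "unclassified";
}
```

Anything the table does not cover (for example a 500, or a 503 without a Cloudflare marker) falls through to `"unclassified"` rather than being guessed at.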
3. The scraper ran fine but produced no NEW listings
Expected behavior: the catalog is in steady state. The soft-delete sweep sets deletedAt on any listing not seen in 48h. To verify nothing is wrong:
# Should be a small positive number for active retailers.
docker exec 961tech-postgres psql -U postgres -d tech961 -c \
"SELECT r.name, COUNT(*) FILTER (WHERE l.\"lastSeenAt\" > NOW() - INTERVAL '6 hours') AS fresh_6h FROM \"Retailer\" r LEFT JOIN \"Listing\" l ON l.\"retailerId\"=r.id AND l.\"deletedAt\" IS NULL GROUP BY r.name;"
If fresh_6h is zero for a retailer whose scraper ran in the last hour, go to step 4.
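To make the two time windows in this step concrete, here is a minimal sketch of the 48h sweep condition and the fresh_6h count as plain predicates. The `SeenListing` type and both function names are hypothetical; only the `lastSeenAt` and `deletedAt` columns and the 48h and 6h windows come from the runbook.

```typescript
const HOUR_MS = 60 * 60 * 1000;

// Hypothetical subset of the Listing columns used by the queries in this runbook.
type SeenListing = { lastSeenAt: Date; deletedAt: Date | null };

// True when the 48h sweep would soft-delete this listing.
function isSweepCandidate(l: SeenListing, now: Date, maxAgeHours = 48): boolean {
  return (
    l.deletedAt === null &&
    now.getTime() - l.lastSeenAt.getTime() > maxAgeHours * HOUR_MS
  );
}

// Mirrors the fresh_6h column: live listings seen within the window.
function countFresh(listings: SeenListing[], now: Date, windowHours = 6): number {
  return listings.filter(
    (l) =>
      l.deletedAt === null &&
      now.getTime() - l.lastSeenAt.getTime() < windowHours * HOUR_MS,
  ).length;
}
```

A listing last seen 10 hours ago is neither fresh nor a sweep candidate, which is why a steady catalog can show a small fresh_6h without anything being wrong.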
4. HTML drift — selectors stopped matching
Compare the live HTML against the test fixture:
# Re-pull a known category page.
curl -A "Mozilla/5.0" "<retailer-base>/<category-path>" -o /tmp/live.html
# Diff the distinct class attributes against the fixture.
diff <(grep -oE 'class="[^"]*"' /tmp/live.html | sort -u) \
<(grep -oE 'class="[^"]*"' tests/scrapers/fixtures/<retailer>/<category>.html | sort -u) | head
Open src/scrapers/sites/<retailer>.ts, find the selector that no longer matches, and update it. Then run the scraper's unit test; it will fail until you update the fixture too.
Update the fixture by saving the live HTML under tests/scrapers/fixtures/<retailer>/<category>.html (trim it to ~10-20 listings; full pages bloat the repo).
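The class comparison can also be done per token rather than per line, which is easier to read when a retailer renames one class inside a multi-class attribute. The following is a rough, dependency-free sketch; `classTokens` and `missingFromLive` are hypothetical helpers, and the regex is a heuristic that does not handle unquoted attribute values.

```typescript
// Collect every class token that appears in a quoted class attribute.
function classTokens(html: string): Set<string> {
  const tokens = new Set<string>();
  const attr = /\sclass\s*=\s*["']([^"']*)["']/gi;
  let m: RegExpExecArray | null;
  while ((m = attr.exec(html)) !== null) {
    for (const t of (m[1] ?? "").split(/\s+/)) {
      if (t) tokens.add(t);
    }
  }
  return tokens;
}

// Classes the fixture relies on that the live page no longer contains.
function missingFromLive(fixtureHtml: string, liveHtml: string): string[] {
  const live = classTokens(liveHtml);
  return Array.from(classTokens(fixtureHtml))
    .filter((t) => !live.has(t))
    .sort();
}
```

Any class returned by `missingFromLive` that also appears in a selector in the retailer's scraper file is a likely culprit for the drift.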
5. All retailers at zero — our problem
Check the scrape script ran at all:
# Recent successful run? Each row carries a per-retailer log line.
tail -100 /var/log/961tech/scrape.log 2>/dev/null || \
tail -100 .review-screenshots/scrape.log 2>/dev/null || \
echo 'no log found — script may not have run'
If the script never ran:
- Cloudflare Cron Triggers (per ADR-0006) — check Workers Cron Events panel
- pg-boss schedule (per ADR-0007 + #67) — query pgboss.job WHERE name = 'scrape'
If the script ran but failed early — check IP_HASH_SECRET, DATABASE_URL, and npm run build output.
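An early failure caused by missing configuration is easier to spot if the script checks its environment up front. A minimal sketch, assuming the two variable names above are the ones the scrape script reads; the `missingEnv` helper itself is hypothetical.

```typescript
// Names taken from this runbook; extend with whatever else the script reads.
const REQUIRED_ENV = ["IP_HASH_SECRET", "DATABASE_URL"];

// Return the required variables that are unset or blank.
function missingEnv(
  env: Record<string, string | undefined>,
  required: string[] = REQUIRED_ENV,
): string[] {
  return required.filter((name) => !(env[name] ?? "").trim());
}
```

In Node this could be called as `missingEnv(process.env)` at the top of the scrape entry point, logging the returned names and exiting non-zero when the list is not empty.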
6. UA / IP block (403 or 429)
Try a different UA. If a polite UA gets through, update src/scrapers/core/http.ts. If even a real-browser UA gets blocked, the retailer has IP-banned us — escalate to MASTER for retailer outreach (#81).
Mitigation: add a Referer header, randomize the UA across requests, and throttle to 1 request per 2 s.
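The three mitigations can be sketched as two small helpers. This does not reflect what src/scrapers/core/http.ts currently does; the UA strings are placeholders, and `buildHeaders` and `delayBeforeNext` are hypothetical names.

```typescript
// Placeholder UA strings; substitute the ones the HTTP client should send.
const USER_AGENTS = [
  "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
  "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15",
];

// Build per-request headers with a randomly chosen UA and a Referer.
// `rand` is injectable so the choice can be made deterministic in tests.
function buildHeaders(
  referer: string,
  rand: () => number = Math.random,
): Record<string, string> {
  const index = Math.floor(rand() * USER_AGENTS.length);
  const ua = USER_AGENTS[index] ?? "Mozilla/5.0";
  return { "User-Agent": ua, Referer: referer };
}

// Milliseconds to wait so consecutive requests are at least
// `minIntervalMs` apart (2000 ms is 1 request per 2 s).
function delayBeforeNext(
  lastRequestAt: number | null,
  now: number,
  minIntervalMs = 2000,
): number {
  if (lastRequestAt === null) return 0;
  return Math.max(0, lastRequestAt + minIntervalMs - now);
}
```

A caller would sleep for `delayBeforeNext(last, Date.now())` milliseconds before each request and record the send time afterwards.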
7. Cloudflare bot challenge
The retailer enabled Cloudflare's "I'm Under Attack" mode. It cannot be defeated by scraper headers alone.
Options:
- Wait — these modes are usually temporary (24-72 hours)
- Outreach — ask the retailer to allowlist 961tech (#81)
- Defer — tag the retailer as active = false until unblock
Document the block in the access-blocker register in docs/reference/retailers.md.
8. JS-rendered SPA
The retailer's catalog is rendered client-side. Cheerio (used by all current scrapers) sees an empty <div id="root"></div> in the HTML.
Options:
- Look for an XHR API the SPA calls. Most SPAs use JSON endpoints; intercept via DevTools Network tab and fetch directly.
- Headless browser: Playwright / Puppeteer. Heavier dep; use only when no API is available.
Document the discovery pattern in the per-retailer file.
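Once a JSON endpoint is found, the scraper reduces to mapping its payload to listings. The sketch below assumes a payload of the form `{ items: [{ name, price, link }] }`; that shape, the `ApiListing` type, and `parseListings` are all hypothetical, and the real field names come from whatever the Network tab shows for the retailer.

```typescript
// Hypothetical listing shape; adjust to the fields the scraper stores.
type ApiListing = { title: string; price: number; url: string };

// Defensively map an unknown JSON payload of the assumed form
// { items: [{ name, price, link }, ...] } to listings, skipping malformed rows.
function parseListings(payload: unknown): ApiListing[] {
  if (typeof payload !== "object" || payload === null) return [];
  const items = (payload as { items?: unknown }).items;
  if (!Array.isArray(items)) return [];
  const out: ApiListing[] = [];
  for (const raw of items) {
    if (typeof raw !== "object" || raw === null) continue;
    const { name, price, link } = raw as Record<string, unknown>;
    if (
      typeof name === "string" &&
      typeof price === "number" &&
      typeof link === "string"
    ) {
      out.push({ title: name, price, url: link });
    }
  }
  return out;
}
```

Skipping malformed rows rather than throwing keeps one bad item from zeroing the whole retailer, though a sudden drop in parsed rows is itself a sign that the endpoint's shape has changed.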
Post-fix
- Re-run the scrape: npm run scrape
- Confirm fresh listings landed: re-run the fresh_6h query from step 3
- Update docs/reference/retailers.md if the diagnosis revealed a blocker class worth recording
- File a follow-up issue with the new HTML pattern as a regression test fixture