# Ingest pipeline — scrape, match, persist
What this answers: what happens during a single scraper run, from trigger through to listings persisted with prices and (eventually) matched products.
## Sequence
```mermaid
sequenceDiagram
    participant Sched as Scheduler
    participant Run as Runner (run-scrapers.ts)
    participant Scr as Scraper (per retailer)
    participant R as Retailer Site
    participant M as Matcher (matching.ts)
    participant DB as Postgres
    Sched->>Run: trigger run(retailer?, category?)
    loop per retailer × category
        Run->>Scr: fetch CATEGORY_URLS[category]
        Scr->>R: GET category index page
        R-->>Scr: HTML
        Scr->>Scr: parseListings(html) → ScrapedListing[]
        Scr-->>Run: listings (title, url, price, inStock, imageUrl)
        loop per scraped listing
            Run->>M: match(category, titleRaw, retailer)
            M->>DB: SELECT Products WHERE category, brand fuzzy-match
            DB-->>M: candidates
            M-->>Run: { productId?, confidence }
            Run->>DB: UPSERT Listing (retailerId, url, titleRaw, productId, matchConfidence, lastSeenAt)
            Run->>DB: INSERT ListingPrice (listingId, priceUsd, inStock, scrapedAt)
            opt productId set + product missing image
                Run->>DB: UPDATE Product SET imageUrl = scraped image
            end
        end
    end
    Run-->>Sched: report (listings touched, matched, errors)
```
## Stage responsibilities
### Scheduler
Today: manual `npm run scrape`. After #18: BullMQ scheduled jobs, with a daily refresh per retailer.
### Runner
Single entry point at `scripts/run-scrapers.ts`. Iterates retailers and categories, dispatches to the per-retailer scraper, then matches and persists.
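The runner's control flow can be sketched as below. This is a minimal sketch with illustrative names (`Deps`, `runAll`); the actual `scripts/run-scrapers.ts` exports and report shape will differ.

```typescript
// Illustrative sketch of the runner loop — not the real run-scrapers.ts code.
interface ScrapedListing {
  title: string;
  url: string;
  price: number;
  inStock: boolean;
  imageUrl?: string;
}

// Dependencies injected so scrape/match/persist stay swappable in tests.
interface Deps {
  scrape: (retailer: string, category: string) => Promise<ScrapedListing[]>;
  match: (
    category: string,
    titleRaw: string,
    retailer: string,
  ) => Promise<{ productId?: string; confidence: number }>;
  persist: (
    retailer: string,
    listing: ScrapedListing,
    m: { productId?: string; confidence: number },
  ) => Promise<void>;
}

// Iterate retailers × categories; scrape, match, persist; collect a report.
async function runAll(retailers: string[], categories: string[], deps: Deps) {
  const report = { touched: 0, matched: 0, errors: 0 };
  for (const retailer of retailers) {
    for (const category of categories) {
      try {
        const listings = await deps.scrape(retailer, category);
        for (const l of listings) {
          const m = await deps.match(category, l.title, retailer);
          await deps.persist(retailer, l, m);
          report.touched++;
          if (m.productId) report.matched++;
        }
      } catch {
        report.errors++; // one failed retailer/category does not abort the run
      }
    }
  }
  return report;
}
```

The try/catch per retailer × category cell is what gives the "run continues for other retailers" behaviour listed under Failure modes.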
### Scraper (per retailer)
One module per retailer in `src/scrapers/sites/`: `pcandparts.ts`, `souq961.ts`, `macrotronics.ts`. Each exports `CATEGORY_URLS` and a parse function.
Shared infrastructure in `src/scrapers/core/`:

- `http.ts` — `fetchHtml` via undici, with retries and a custom user-agent
- `parse.ts` — cheerio loader
- `normalize.ts` — `normalizePrice`, image URL absolutising
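As an illustration of the normalisation step, helpers in the spirit of `normalize.ts` might look like this (a sketch only; the real function names, signatures, and edge-case handling may differ):

```typescript
// Illustrative sketch of price normalisation: strip currency symbols,
// thousands separators, and surrounding text, keeping the first
// number-looking token (e.g. "$1,299.99" → 1299.99).
function normalizePrice(raw: string): number | null {
  const m = raw.replace(/,/g, "").match(/\d+(?:\.\d+)?/);
  return m ? parseFloat(m[0]) : null;
}

// Absolutise a scraped image URL against the page it was found on.
function absolutiseUrl(src: string, pageUrl: string): string {
  return new URL(src, pageUrl).toString();
}
```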
For the how-to, see Guides → Writing a scraper.
### Matcher
`src/lib/matching.ts`. Category-specific. The CPU and GPU matchers are the most evolved; GPU has an AIB fallback (when the title doesn't lead with NVIDIA/AMD, look for a reference model token like "RTX 4070" — bug #1).
Lower-match-rate categories (Cooler/RAM/Storage/PSU) are pending #21 LLM-assisted spec extraction.
Returns `{ productId?: string, confidence: number }`. If confidence falls below the threshold, `productId` is left null and `matchStatus = 'unmatched'`.
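The AIB fallback described above can be sketched as a model-token scan. This is a hypothetical helper, not the actual `matching.ts` code; the real token list and normalisation are assumptions:

```typescript
// Hypothetical sketch of the GPU AIB fallback: when an AIB partner title
// doesn't lead with NVIDIA/AMD, scan it for a reference model token.
const GPU_MODEL_TOKEN = /\b(RTX|GTX|RX)\s?-?\s?(\d{3,4})(\s?(Ti|SUPER|XT|XTX))?\b/i;

function extractGpuModel(titleRaw: string): string | null {
  const m = titleRaw.match(GPU_MODEL_TOKEN);
  if (!m) return null;
  // Normalise to an uppercase key like "RTX 4070 TI" regardless of source
  // spacing and casing, so it can be compared against Product candidates.
  return `${m[1]} ${m[2]}${m[4] ? " " + m[4] : ""}`.toUpperCase();
}
```

A candidate `Product` whose model key equals the extracted token would then be scored even though the title led with the AIB brand (ASUS, MSI, Sapphire, …).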
### Persistence
Two writes per scraped listing:
- `Listing` upsert by `(retailerId, url)` — updates `titleRaw`, `productId`, `matchConfidence`, `lastSeenAt`. Never deletes.
- `ListingPrice` insert — append-only price snapshot. The "Active vs OutOfStock vs QuoteOnly" state machine derives from the latest `ListingPrice`.

If the matched `Product` has no `imageUrl` and the scraper found one, backfill `Product.imageUrl`. The first scraper to find a usable image wins.
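The writes above can be sketched as parameterised SQL. This assumes a unique index on `("retailerId", "url")` and Postgres-style `ON CONFLICT`; the actual schema, column names, and persistence layer may differ:

```typescript
// Sketch of the per-listing writes as SQL strings (hypothetical schema).

// Upsert the Listing row keyed on (retailerId, url); never deletes.
const UPSERT_LISTING = `
  INSERT INTO "Listing"
    ("retailerId", "url", "titleRaw", "productId", "matchConfidence", "lastSeenAt")
  VALUES ($1, $2, $3, $4, $5, now())
  ON CONFLICT ("retailerId", "url") DO UPDATE SET
    "titleRaw"        = EXCLUDED."titleRaw",
    "productId"       = EXCLUDED."productId",
    "matchConfidence" = EXCLUDED."matchConfidence",
    "lastSeenAt"      = now()
  RETURNING "id";
`;

// Append-only price snapshot; never updated in place.
const INSERT_PRICE = `
  INSERT INTO "ListingPrice" ("listingId", "priceUsd", "inStock", "scrapedAt")
  VALUES ($1, $2, $3, now());
`;

// Image backfill: the IS NULL guard makes the first usable image win.
const BACKFILL_IMAGE = `
  UPDATE "Product" SET "imageUrl" = $2
  WHERE "id" = $1 AND "imageUrl" IS NULL;
`;
```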
## Failure modes
| Failure | Effect | Mitigation |
|---|---|---|
| Retailer site down | Scraper run for that retailer fails | Run continues for other retailers; failed retailer retries on next schedule |
| Retailer changes HTML | Parser returns 0 listings | Drift alert (planned, #19 infra) |
| Matcher mis-matches | Listing linked to wrong Product | Operator override via UC-J (#16) |
| New listings overwhelm matcher | Slow run | Limit concurrency in runner; not currently a problem at 3-retailer scale |
## What this pipeline does not do
- No image hosting — scraped image URLs are stored as-is; `next/image` proxies/caches them. M2 may move to R2 for durability.
- No price normalisation across currencies — all retailers list in USD. If a Lebanese retailer ever switches to LBP, that's a parser change plus a normalisation step.
- No incremental updates — every run scrapes the full category index. Fine at current scale; will need pagination + delta detection if catalogs grow large.
See Listing lifecycle for the state machine governing each Listing produced by this pipeline.