
Ingest pipeline — scrape, match, persist

What this answers: what happens during a single scraper run, from trigger through to listings persisted with prices and (eventually) matched products.

Sequence

sequenceDiagram
    participant Sched as Scheduler
    participant Run as Runner (run-scrapers.ts)
    participant Scr as Scraper (per retailer)
    participant R as Retailer Site
    participant M as Matcher (matching.ts)
    participant DB as Postgres

    Sched->>Run: trigger run(retailer?, category?)
    loop per retailer × category
        Run->>Scr: fetch CATEGORY_URLS[category]
        Scr->>R: GET category index page
        R-->>Scr: HTML
        Scr->>Scr: parseListings(html) → ScrapedListing[]
        Scr-->>Run: listings (title, url, price, inStock, imageUrl)

        loop per scraped listing
            Run->>M: match(category, titleRaw, retailer)
            M->>DB: SELECT Products WHERE category, brand fuzzy-match
            DB-->>M: candidates
            M-->>Run: { productId?, confidence }

            Run->>DB: UPSERT Listing (retailerId, url, titleRaw, productId, matchConfidence, lastSeenAt)
            Run->>DB: INSERT ListingPrice (listingId, priceUsd, inStock, scrapedAt)

            opt productId set + product missing image
                Run->>DB: UPDATE Product SET imageUrl = scraped image
            end
        end
    end

    Run-->>Sched: report (listings touched, matched, errors)

Stage responsibilities

Scheduler

Today: manual npm run scrape. After #18: BullMQ scheduled jobs, daily refresh per retailer.
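A sketch of what the post-#18 setup could look like with BullMQ; the queue name, cron time, and payload shape are assumptions, not the actual configuration:

```ts
// Hypothetical BullMQ wiring for the daily per-retailer refresh (#18).
import { Queue, Worker } from 'bullmq';
import { run } from '../scripts/run-scrapers'; // assumed export

const connection = { host: 'localhost', port: 6379 };
const queue = new Queue('scrape', { connection });

// One repeatable job per retailer, refreshed daily at 04:00.
for (const retailer of ['pcandparts', 'souq961', 'macrotronics']) {
  await queue.add('daily-refresh', { retailer }, {
    repeat: { pattern: '0 4 * * *' }, // cron expression
  });
}

// The worker simply delegates to the runner's entry point.
new Worker('scrape', job => run(job.data.retailer), { connection });
```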

Runner

Single entry point at scripts/run-scrapers.ts. Iterates retailers and categories, dispatches to the per-retailer scraper, then matches and persists.
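A hedged sketch of that loop; the registry, persistence helpers, and exact names are assumptions rather than the real code:

```ts
// Plausible shape of the loop in scripts/run-scrapers.ts.
import { fetchHtml } from '../src/scrapers/core/http';
import { match } from '../src/lib/matching';
import { sites } from '../src/scrapers/sites';              // hypothetical registry
import { upsertListing, insertPrice } from '../src/lib/db'; // hypothetical helpers

export async function run(onlyRetailer?: string, onlyCategory?: string) {
  const report = { touched: 0, matched: 0, errors: 0 };

  for (const site of sites) {
    if (onlyRetailer && site.retailer !== onlyRetailer) continue;

    for (const [category, url] of Object.entries(site.CATEGORY_URLS)) {
      if (onlyCategory && category !== onlyCategory) continue;
      try {
        const html = await fetchHtml(url);
        for (const l of site.parseListings(html)) {
          const { productId, confidence } = await match(category, l.title, site.retailer);
          const listingId = await upsertListing({ ...l, retailer: site.retailer, productId, confidence });
          await insertPrice(listingId, l.price, l.inStock);
          report.touched += 1;
          if (productId) report.matched += 1;
        }
      } catch {
        report.errors += 1; // one bad retailer/category must not abort the run
      }
    }
  }
  return report;
}
```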

Scraper (per retailer)

One module per retailer in src/scrapers/sites/: pcandparts.ts, souq961.ts, macrotronics.ts. Each exports CATEGORY_URLS and a parse function.
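The rough shape of a retailer module, sketched below; only the exported contract (CATEGORY_URLS plus a parse function) comes from the prose, while the selectors, URLs, and helper names are placeholders:

```ts
// Hypothetical skeleton of e.g. src/scrapers/sites/pcandparts.ts.
import { load } from '../core/parse';                              // cheerio loader
import { normalizePrice, absolutiseUrl } from '../core/normalize'; // names assumed

const BASE = 'https://retailer.example'; // placeholder

export const CATEGORY_URLS: Record<string, string> = {
  gpu: `${BASE}/graphics-cards`,
  cpu: `${BASE}/processors`,
};

export interface ScrapedListing {
  title: string;
  url: string;
  price: number;
  inStock: boolean;
  imageUrl?: string;
}

export function parseListings(html: string): ScrapedListing[] {
  const $ = load(html);
  return $('.product-card').toArray().map(el => ({
    title: $(el).find('.product-title').text().trim(),      // selector is retailer-specific
    url: absolutiseUrl($(el).find('a').attr('href') ?? '', BASE),
    price: normalizePrice($(el).find('.price').text()),
    inStock: !$(el).find('.out-of-stock').length,
    imageUrl: absolutiseUrl($(el).find('img').attr('src') ?? '', BASE),
  }));
}
```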

Shared infrastructure in src/scrapers/core/:

  • http.ts — fetchHtml via undici, with retries and a custom user-agent
  • parse.ts — cheerio loader
  • normalize.ts — normalizePrice, image URL absolutising
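One plausible shape for normalizePrice (an assumption, not the actual implementation): strip currency symbols and thousands separators, and return NaN for non-price text such as quote-only listings.

```ts
// Hypothetical sketch of normalizePrice in src/scrapers/core/normalize.ts.
export function normalizePrice(raw: string): number {
  const cleaned = raw.replace(/[^0-9.,]/g, '').replace(/,/g, '');
  return cleaned ? Number.parseFloat(cleaned) : NaN;
}

normalizePrice('$1,299.99'); // 1299.99
normalizePrice('Call us');   // NaN (e.g. a quote-only listing)
```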

For the how-to, see Guides → Writing a scraper.

Matcher

src/lib/matching.ts. Matching is category-specific. The CPU and GPU matchers are the most evolved; the GPU matcher has an AIB fallback: when the title doesn't lead with NVIDIA/AMD, it looks for the reference-model token such as "RTX 4070" (bug #1).

Lower-match-rate categories (Cooler/RAM/Storage/PSU) are pending #21 LLM-assisted spec extraction.

Returns { productId?: string, confidence: number }. Confidence below threshold → productId left null, matchStatus = 'unmatched'.
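A hedged sketch of that contract plus the GPU AIB fallback described above; the regex, confidence value, and candidate shape are assumptions:

```ts
// Hypothetical GPU matcher with the AIB fallback.
export interface MatchResult {
  productId?: string;
  confidence: number; // 0..1; below threshold, the listing stays unmatched
}

// Reference-model token: "RTX 4070", "RX 7800 XT", "GTX 1660 SUPER", ...
const MODEL_TOKEN = /\b(RTX|GTX|RX|ARC)\s?\d{3,4}(\s?(TI|SUPER|XT|XTX))?\b/i;

export function matchGpu(
  titleRaw: string,
  candidates: { id: string; referenceModel: string }[],
): MatchResult {
  // AIB titles often lead with the board partner ("ASUS TUF Gaming ...
  // RTX 4070"), so look for the model token anywhere in the title.
  const m = titleRaw.match(MODEL_TOKEN);
  if (!m) return { confidence: 0 };

  const token = m[0].replace(/\s+/g, ' ').toUpperCase();
  const hit = candidates.find(c => c.referenceModel.toUpperCase() === token);
  return hit ? { productId: hit.id, confidence: 0.8 } : { confidence: 0 };
}
```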

Persistence

Two writes per scraped listing:

  1. Listing upsert by (retailerId, url) — updates titleRaw, productId, matchConfidence, lastSeenAt. Never deletes.
  2. ListingPrice insert — append-only price snapshot. The state-machine "Active vs OutOfStock vs QuoteOnly" derives from the latest ListingPrice.

If the matched Product has no imageUrl and the scraper found one, backfill Product.imageUrl. First scraper to find a usable image wins.
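The two writes plus the image backfill, sketched with node-postgres; the table and column names follow the prose, but the real code may well go through an ORM:

```ts
// Hypothetical persistence step for one scraped listing.
import { Pool } from 'pg';

const pool = new Pool(); // connection details from PG* env vars

async function persist(l: {
  retailerId: string; url: string; titleRaw: string;
  productId?: string; confidence: number;
  priceUsd: number; inStock: boolean; imageUrl?: string;
}) {
  // 1. Listing upsert, keyed by (retailerId, url). Never deletes.
  const { rows } = await pool.query(
    `INSERT INTO "Listing" ("retailerId", url, "titleRaw", "productId", "matchConfidence", "lastSeenAt")
     VALUES ($1, $2, $3, $4, $5, now())
     ON CONFLICT ("retailerId", url) DO UPDATE
       SET "titleRaw" = EXCLUDED."titleRaw",
           "productId" = EXCLUDED."productId",
           "matchConfidence" = EXCLUDED."matchConfidence",
           "lastSeenAt" = now()
     RETURNING id`,
    [l.retailerId, l.url, l.titleRaw, l.productId ?? null, l.confidence],
  );

  // 2. Append-only price snapshot; never updated in place.
  await pool.query(
    `INSERT INTO "ListingPrice" ("listingId", "priceUsd", "inStock", "scrapedAt")
     VALUES ($1, $2, $3, now())`,
    [rows[0].id, l.priceUsd, l.inStock],
  );

  // Backfill only when the product has no image yet: first image wins.
  if (l.productId && l.imageUrl) {
    await pool.query(
      `UPDATE "Product" SET "imageUrl" = $2 WHERE id = $1 AND "imageUrl" IS NULL`,
      [l.productId, l.imageUrl],
    );
  }
}
```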

Failure modes

| Failure | Effect | Mitigation |
| --- | --- | --- |
| Retailer site down | Scraper run for that retailer fails | Run continues for other retailers; the failed retailer retries on the next schedule |
| Retailer changes HTML | Parser returns 0 listings | Drift alert (planned, #19 infra) |
| Matcher mis-matches | Listing linked to the wrong Product | Operator override via UC-J (#16) |
| New listings overwhelm matcher | Slow run | Limit concurrency in the runner; not currently a problem at 3-retailer scale |
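That last mitigation could be a simple concurrency cap around match(); a sketch assuming the p-limit package (an assumption; the runner is sequential today):

```ts
// Hypothetical cap on in-flight match() calls.
import pLimit from 'p-limit';
import { match } from '../src/lib/matching';

const limit = pLimit(5); // at most 5 match() calls in flight

export function matchAll(category: string, retailer: string, titles: string[]) {
  return Promise.all(titles.map(t => limit(() => match(category, t, retailer))));
}
```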

What this pipeline does not do

  • No image hosting — scraped image URLs are stored as-is. next/image proxies/caches them. M2 may move to R2 for durability.
  • No price normalisation across currencies — all retailers list in USD. If a Lebanese retailer ever switches to LBP, that's a parser change + a normalisation step.
  • No incremental updates — every run scrapes the full category index. Fine at current scale; will need pagination + delta detection if catalogs grow large.

See Listing lifecycle for the state machine governing each Listing produced by this pipeline.