
Ingest pipeline — scrape, match, persist

What this answers: what happens during a single scraper run, from trigger through to listings persisted with prices and (eventually) matched products.

Sequence

sequenceDiagram
    participant Sched as Scheduler
    participant Run as Runner (run-scrapers.ts)
    participant Scr as Scraper (per retailer)
    participant R as Retailer Site
    participant M as Matcher (matching.ts)
    participant DB as Postgres

    Sched->>Run: trigger run(retailer?, category?)
    loop per retailer × category
        Run->>Scr: fetch CATEGORY_URLS[category]
        Scr->>R: GET category index page
        R-->>Scr: HTML
        Scr->>Scr: parseListings(html) → ScrapedListing[]
        Scr-->>Run: listings (title, url, price, inStock, imageUrl)

        loop per scraped listing
            Run->>M: match(category, titleRaw, retailer)
            M->>DB: SELECT Products WHERE category, brand fuzzy-match
            DB-->>M: candidates
            M-->>Run: { productId?, confidence }

            Run->>DB: UPSERT Listing (retailerId, url, titleRaw, productId, matchConfidence, lastSeenAt)
            Run->>DB: INSERT ListingPrice (listingId, priceUsd, inStock, scrapedAt)

            opt productId set + product missing image
                Run->>DB: UPDATE Product SET imageUrl = scraped image
            end
        end
    end

    Run-->>Sched: report (listings touched, matched, errors)

Stage responsibilities

Scheduler

Today: manual npm run scrape. After #18: BullMQ scheduled jobs, daily refresh per retailer.
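A sketch of what the post-#18 setup could look like with BullMQ; the queue name, cron time, and payload shape are assumptions, not the actual configuration:

```ts
// Hypothetical BullMQ wiring for the daily per-retailer refresh (#18).
import { Queue, Worker } from 'bullmq';
import { run } from '../scripts/run-scrapers'; // assumed export

const connection = { host: 'localhost', port: 6379 };
const queue = new Queue('scrape', { connection });

// One repeatable job per retailer, refreshed daily at 04:00.
for (const retailer of ['pcandparts', 'souq961', 'macrotronics']) {
  await queue.add('daily-refresh', { retailer }, {
    repeat: { pattern: '0 4 * * *' }, // cron expression
  });
}

// The worker simply delegates to the runner's entry point.
new Worker('scrape', job => run(job.data.retailer), { connection });
```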

Runner

Single entry point at scripts/run-scrapers.ts. Iterates retailers and categories, dispatches to the per-retailer scraper, then matches and persists.
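A hedged sketch of that loop; the registry, persistence helpers, and exact names are assumptions rather than the real code:

```ts
// Plausible shape of the loop in scripts/run-scrapers.ts.
import { fetchHtml } from '../src/scrapers/core/http';
import { match } from '../src/lib/matching';
import { sites } from '../src/scrapers/sites';              // hypothetical registry
import { upsertListing, insertPrice } from '../src/lib/db'; // hypothetical helpers

export async function run(onlyRetailer?: string, onlyCategory?: string) {
  const report = { touched: 0, matched: 0, errors: 0 };

  for (const site of sites) {
    if (onlyRetailer && site.retailer !== onlyRetailer) continue;

    for (const [category, url] of Object.entries(site.CATEGORY_URLS)) {
      if (onlyCategory && category !== onlyCategory) continue;
      try {
        const html = await fetchHtml(url);
        for (const l of site.parseListings(html)) {
          const { productId, confidence } = await match(category, l.title, site.retailer);
          const listingId = await upsertListing({ ...l, retailer: site.retailer, productId, confidence });
          await insertPrice(listingId, l.price, l.inStock);
          report.touched += 1;
          if (productId) report.matched += 1;
        }
      } catch {
        report.errors += 1; // one bad retailer/category must not abort the run
      }
    }
  }
  return report;
}
```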

Scraper (per retailer)

One module per retailer in src/scrapers/sites/: pcandparts.ts, souq961.ts, macrotronics.ts. Each exports CATEGORY_URLS and a parse function.
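The rough shape of a retailer module, sketched below; only the exported contract (CATEGORY_URLS plus a parse function) comes from the prose, while the selectors, URLs, and helper names are placeholders:

```ts
// Hypothetical skeleton of e.g. src/scrapers/sites/pcandparts.ts.
import { load } from '../core/parse';                              // cheerio loader
import { normalizePrice, absolutiseUrl } from '../core/normalize'; // names assumed

const BASE = 'https://retailer.example'; // placeholder

export const CATEGORY_URLS: Record<string, string> = {
  gpu: `${BASE}/graphics-cards`,
  cpu: `${BASE}/processors`,
};

export interface ScrapedListing {
  title: string;
  url: string;
  price: number;
  inStock: boolean;
  imageUrl?: string;
}

export function parseListings(html: string): ScrapedListing[] {
  const $ = load(html);
  return $('.product-card').toArray().map(el => ({
    title: $(el).find('.product-title').text().trim(),      // selector is retailer-specific
    url: absolutiseUrl($(el).find('a').attr('href') ?? '', BASE),
    price: normalizePrice($(el).find('.price').text()),
    inStock: !$(el).find('.out-of-stock').length,
    imageUrl: absolutiseUrl($(el).find('img').attr('src') ?? '', BASE),
  }));
}
```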

Shared infrastructure in src/scrapers/core/:

  • http.ts — fetchHtml via undici, with retries and a custom user-agent
  • parse.ts — cheerio loader
  • normalize.ts — normalizePrice, image URL absolutising
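One plausible shape for normalizePrice (an assumption, not the actual implementation): strip currency symbols and thousands separators, and return NaN for non-price text such as quote-only listings.

```ts
// Hypothetical sketch of normalizePrice in src/scrapers/core/normalize.ts.
export function normalizePrice(raw: string): number {
  const cleaned = raw.replace(/[^0-9.,]/g, '').replace(/,/g, '');
  return cleaned ? Number.parseFloat(cleaned) : NaN;
}

normalizePrice('$1,299.99'); // 1299.99
normalizePrice('Call us');   // NaN (e.g. a quote-only listing)
```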

For the how-to, see Guides → Writing a scraper.

Matcher

src/lib/matching.ts. Matching is category-specific. The CPU and GPU matchers are the most evolved; the GPU matcher has an AIB fallback: when the title doesn't lead with NVIDIA/AMD, it looks for the reference-model token such as "RTX 4070" (bug #1).

Lower-match-rate categories (Cooler/RAM/Storage/PSU) are pending #21 LLM-assisted spec extraction.

Returns { productId?: string, confidence: number }. Confidence below threshold → productId left null, matchStatus = 'unmatched'.
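A hedged sketch of that contract plus the GPU AIB fallback described above; the regex, confidence value, and candidate shape are assumptions:

```ts
// Hypothetical GPU matcher with the AIB fallback.
export interface MatchResult {
  productId?: string;
  confidence: number; // 0..1; below threshold, the listing stays unmatched
}

// Reference-model token: "RTX 4070", "RX 7800 XT", "GTX 1660 SUPER", ...
const MODEL_TOKEN = /\b(RTX|GTX|RX|ARC)\s?\d{3,4}(\s?(TI|SUPER|XT|XTX))?\b/i;

export function matchGpu(
  titleRaw: string,
  candidates: { id: string; referenceModel: string }[],
): MatchResult {
  // AIB titles often lead with the board partner ("ASUS TUF Gaming ...
  // RTX 4070"), so look for the model token anywhere in the title.
  const m = titleRaw.match(MODEL_TOKEN);
  if (!m) return { confidence: 0 };

  const token = m[0].replace(/\s+/g, ' ').toUpperCase();
  const hit = candidates.find(c => c.referenceModel.toUpperCase() === token);
  return hit ? { productId: hit.id, confidence: 0.8 } : { confidence: 0 };
}
```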

Persistence

Two writes per scraped listing:

  1. Listing upsert by (retailerId, url) — updates titleRaw, productId, matchConfidence, lastSeenAt. Never deletes.
  2. ListingPrice insert — append-only price snapshot. The state-machine "Active vs OutOfStock vs QuoteOnly" derives from the latest ListingPrice.

If the matched Product has no imageUrl and the scraper found one, backfill Product.imageUrl. First scraper to find a usable image wins.
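The two writes plus the image backfill, sketched with node-postgres; the table and column names follow the prose, but the real code may well go through an ORM:

```ts
// Hypothetical persistence step for one scraped listing.
import { Pool } from 'pg';

const pool = new Pool(); // connection details from PG* env vars

async function persist(l: {
  retailerId: string; url: string; titleRaw: string;
  productId?: string; confidence: number;
  priceUsd: number; inStock: boolean; imageUrl?: string;
}) {
  // 1. Listing upsert, keyed by (retailerId, url). Never deletes.
  const { rows } = await pool.query(
    `INSERT INTO "Listing" ("retailerId", url, "titleRaw", "productId", "matchConfidence", "lastSeenAt")
     VALUES ($1, $2, $3, $4, $5, now())
     ON CONFLICT ("retailerId", url) DO UPDATE
       SET "titleRaw" = EXCLUDED."titleRaw",
           "productId" = EXCLUDED."productId",
           "matchConfidence" = EXCLUDED."matchConfidence",
           "lastSeenAt" = now()
     RETURNING id`,
    [l.retailerId, l.url, l.titleRaw, l.productId ?? null, l.confidence],
  );

  // 2. Append-only price snapshot; never updated in place.
  await pool.query(
    `INSERT INTO "ListingPrice" ("listingId", "priceUsd", "inStock", "scrapedAt")
     VALUES ($1, $2, $3, now())`,
    [rows[0].id, l.priceUsd, l.inStock],
  );

  // Backfill only when the product has no image yet: first image wins.
  if (l.productId && l.imageUrl) {
    await pool.query(
      `UPDATE "Product" SET "imageUrl" = $2 WHERE id = $1 AND "imageUrl" IS NULL`,
      [l.productId, l.imageUrl],
    );
  }
}
```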

Failure modes

| Failure | Effect | Mitigation |
| --- | --- | --- |
| Retailer site down | Scraper run for that retailer fails | Run continues for other retailers; the failed retailer retries on the next schedule |
| Retailer changes HTML | Parser returns 0 listings | Drift alert (planned, #19 infra) |
| Matcher mis-matches | Listing linked to the wrong Product | Operator override via UC-J (#16) |
| New listings overwhelm matcher | Slow run | Limit concurrency in the runner; not currently a problem at 3-retailer scale |
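That last mitigation could be a simple concurrency cap around match(); a sketch assuming the p-limit package (an assumption; the runner is sequential today):

```ts
// Hypothetical cap on in-flight match() calls.
import pLimit from 'p-limit';
import { match } from '../src/lib/matching';

const limit = pLimit(5); // at most 5 match() calls in flight

export function matchAll(category: string, retailer: string, titles: string[]) {
  return Promise.all(titles.map(t => limit(() => match(category, t, retailer))));
}
```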

What this pipeline does not do

  • No image hosting — scraped image URLs are stored as-is. next/image proxies/caches them. M2 may move to R2 for durability.
  • No price normalisation across currencies — all retailers list in USD. If a Lebanese retailer ever switches to LBP, that's a parser change + a normalisation step.
  • No incremental updates — every run scrapes the full category index. Fine at current scale; will need pagination + delta detection if catalogs grow large.

See Listing lifecycle for the state machine governing each Listing produced by this pipeline.