Listing lifecycle

What this answers: what states does a single retailer Listing move through, from first scrape to disappearance?

State machine

stateDiagram-v2
    [*] --> Discovered : first scrape finds URL
    Discovered --> Unmatched : matcher couldn't resolve
    Discovered --> Matched : matched to canonical Product
    Unmatched --> Matched : later run matches (or operator override)
    Matched --> Active : has price + in stock
    Matched --> QuoteOnly : retailer lists no public price
    Active --> OutOfStock : stock indicator flips
    OutOfStock --> Active : retailer restocks
    Active --> QuoteOnly : retailer removes price
    QuoteOnly --> Active : retailer adds price
    Active --> Stale : not seen in N days
    OutOfStock --> Stale : not seen in N days
    QuoteOnly --> Stale : not seen in N days
    Stale --> Active : reappears
    Stale --> Removed : URL 404 confirmed
    Removed --> [*]
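
The diagram can also be written down as a lookup table, which makes illegal transitions easy to reject in code or tests. The following is a minimal sketch, not the project's actual implementation: the `ListingState` type, the `TRANSITIONS` map, and the `canTransition` helper are hypothetical names introduced here, and the edges are copied one for one from the diagram.

```typescript
// State names mirror the diagram labels (hypothetical type, for illustration).
type ListingState =
  | "Discovered"
  | "Unmatched"
  | "Matched"
  | "Active"
  | "QuoteOnly"
  | "OutOfStock"
  | "Stale"
  | "Removed";

// Allowed transitions, copied edge by edge from the state diagram.
const TRANSITIONS: Record<ListingState, readonly ListingState[]> = {
  Discovered: ["Unmatched", "Matched"],
  Unmatched: ["Matched"],
  Matched: ["Active", "QuoteOnly"],
  Active: ["OutOfStock", "QuoteOnly", "Stale"],
  OutOfStock: ["Active", "Stale"],
  QuoteOnly: ["Active", "Stale"],
  Stale: ["Active", "Removed"],
  Removed: [], // terminal state
};

// Hypothetical helper: true only if the diagram has an edge from -> to.
function canTransition(from: ListingState, to: ListingState): boolean {
  return TRANSITIONS[from].includes(to);
}
```

For example, `canTransition("Stale", "Removed")` is true, while `canTransition("Unmatched", "Active")` is false because an Unmatched Listing has to be Matched first.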

States explained

| State | Meaning | Driven by |
|---|---|---|
| Discovered | Scraper just inserted the row. No match attempted yet. | Scraper run |
| Unmatched | Matcher ran but couldn't resolve to a canonical Product. productId is null. | Matcher |
| Matched | Linked to a Product. productId set, matchConfidence > threshold. | Matcher |
| Active | Has a priceUsd and the latest ListingPrice.inStock = true. | Latest ListingPrice row |
| QuoteOnly | priceUsd is null on the latest ListingPrice. Common on 961Souq. See #3. | Latest ListingPrice row |
| OutOfStock | inStock = false on the latest ListingPrice. | Latest ListingPrice row |
| Stale | lastSeenAt is more than N days ago. Data is still shown but flagged as old. | Cron / staleness guard |
| Removed | Retailer URL no longer returns a product. Soft-deleted to preserve Click history integrity. | Scraper 404 detection |
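
The three price-driven states can be derived from the latest ListingPrice row alone. The sketch below illustrates that derivation under stated assumptions: `priceUsd` and `inStock` come from the table, while the `ListingPrice` interface shape, the `Availability` type, and the `deriveAvailability` function are hypothetical. Checking the null price before the stock flag is also an assumption, since the table does not say which wins when both apply.

```typescript
// Assumed shape of a ListingPrice row; only the two fields named in the table.
interface ListingPrice {
  priceUsd: number | null;
  inStock: boolean;
}

type Availability = "Active" | "QuoteOnly" | "OutOfStock";

// Hypothetical helper: derive the availability state from the latest price row.
function deriveAvailability(latest: ListingPrice): Availability {
  // Assumption: a missing public price takes precedence over the stock flag.
  if (latest.priceUsd === null) {
    return "QuoteOnly";
  }
  return latest.inStock ? "Active" : "OutOfStock";
}
```

Because the state is a pure function of the newest row, nothing on the Listing has to be updated when a price or stock indicator changes.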

Implementation notes

  • The Listing row's matchStatus field captures Discovered/Unmatched/Matched. The Active/QuoteOnly/OutOfStock distinctions are derived from the latest ListingPrice rather than stored on the Listing itself — this keeps the price history append-only.
  • Stale is a guard, not a stored state. It is computed at query time: a Listing that would otherwise be Active, OutOfStock, or QuoteOnly counts as stale when lastSeenAt < now() - 7 days (the N in the diagram). If the next scraper run sees the URL again, lastSeenAt updates and the Listing is no longer stale.
  • Removed should be a soft delete. A scraper run that gets a 404 on a previously known URL sets a removed: true flag (or, alternatively, pushes lastSeenAt far into the past and adds a tombstone marker). Hard-deleting the row would orphan Click rows and break click-history reports.
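
The staleness guard and the soft delete from the notes above might look like the following. This is a sketch under assumptions, not the actual code: `lastSeenAt` and the `removed` flag come from the notes, the 7-day window is the value quoted there, and the `Listing` interface shape plus the `isStale`, `markSeen`, and `markRemoved` helpers are hypothetical.

```typescript
const STALE_AFTER_DAYS = 7; // window quoted in the notes above
const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Assumed minimal shape of a Listing row for this sketch.
interface Listing {
  url: string;
  lastSeenAt: Date;
  removed: boolean;
}

// Query-time guard: nothing is stored, staleness is recomputed on every read.
function isStale(listing: Listing, now: Date = new Date()): boolean {
  if (listing.removed) {
    return false; // a tombstoned Listing is Removed, not Stale
  }
  const ageMs = now.getTime() - listing.lastSeenAt.getTime();
  return ageMs > STALE_AFTER_DAYS * MS_PER_DAY;
}

// A successful scrape refreshes lastSeenAt, which clears staleness implicitly.
function markSeen(listing: Listing, seenAt: Date): Listing {
  return { ...listing, lastSeenAt: seenAt };
}

// Soft delete: flip the flag and keep the row so Click rows stay linked.
function markRemoved(listing: Listing): Listing {
  return { ...listing, removed: true };
}
```

Both helpers return a copy rather than mutating the input, which keeps the sketch easy to test. A real implementation would more likely issue an UPDATE against the Listing table.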

Match rate concern

The M1 overall match rate is ~13.5%. CPU and GPU do much better (39-40%) thanks to category-specific matchers. The weaker categories (Cooler 6.4%, RAM 9.8%, Storage 3.3%, PSU 3.9%) leave most of their Listings in Unmatched indefinitely.
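
For reference, a per-category match rate of this kind is just matched Listings divided by total Listings in the category. The sketch below shows one way to compute it, assuming a Listing counts as matched when `productId` is non-null (as in the states table). The `ListingRow` shape, its `category` field, and the `matchRateByCategory` function are hypothetical and not taken from the codebase.

```typescript
// Assumed minimal row shape; `category` is a hypothetical field name.
interface ListingRow {
  category: string;
  productId: string | null; // null means the Listing is still Unmatched
}

// Returns category -> fraction of Listings with a non-null productId.
function matchRateByCategory(rows: ListingRow[]): Map<string, number> {
  const totals = new Map<string, { matched: number; total: number }>();
  for (const row of rows) {
    const entry = totals.get(row.category) ?? { matched: 0, total: 0 };
    entry.total += 1;
    if (row.productId !== null) {
      entry.matched += 1;
    }
    totals.set(row.category, entry);
  }

  const rates = new Map<string, number>();
  totals.forEach((entry, category) => {
    rates.set(category, entry.matched / entry.total);
  });
  return rates;
}
```

Multiplying the returned fraction by 100 gives a percentage comparable to the figures quoted above.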

The fix is LLM-assisted spec extraction (#21). Until then, those categories accumulate Unmatched Listings that never transition further.

See also Ingest pipeline for the matcher's role in the scrape flow.