Listing lifecycle
What this answers: what states does a single retailer Listing move through, from first scrape to disappearance?
State machine
stateDiagram-v2
[*] --> Discovered : first scrape finds URL
Discovered --> Unmatched : matcher couldn't resolve
Discovered --> Matched : matched to canonical Product
Unmatched --> Matched : later run matches (or operator override)
Matched --> Active : has price + in stock
Matched --> QuoteOnly : retailer lists no public price
Active --> OutOfStock : stock indicator flips
OutOfStock --> Active : retailer restocks
Active --> QuoteOnly : retailer removes price
QuoteOnly --> Active : retailer adds price
Active --> Stale : not seen in N days
OutOfStock --> Stale : not seen in N days
QuoteOnly --> Stale : not seen in N days
Stale --> Active : reappears
Stale --> Removed : URL 404 confirmed
Removed --> [*]
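The diagram can also be written down as a lookup table, which makes it easy to reject transitions the lifecycle does not allow. The following is a minimal sketch only: `ListingState`, `TRANSITIONS`, and `canTransition` are hypothetical names chosen for illustration and are not taken from the codebase.

```typescript
// Hypothetical names: nothing here is confirmed to exist in the codebase.
type ListingState =
  | "Discovered"
  | "Unmatched"
  | "Matched"
  | "Active"
  | "QuoteOnly"
  | "OutOfStock"
  | "Stale"
  | "Removed";

// Each key maps to the states reachable in one step, copied from the diagram.
const TRANSITIONS: Record<ListingState, readonly ListingState[]> = {
  Discovered: ["Unmatched", "Matched"],
  Unmatched: ["Matched"],
  Matched: ["Active", "QuoteOnly"],
  Active: ["OutOfStock", "QuoteOnly", "Stale"],
  OutOfStock: ["Active", "Stale"],
  QuoteOnly: ["Active", "Stale"],
  Stale: ["Active", "Removed"],
  Removed: [], // terminal state
};

// True when the diagram has an edge from `from` to `to`.
function canTransition(from: ListingState, to: ListingState): boolean {
  return TRANSITIONS[from].indexOf(to) !== -1;
}
```

For example, `canTransition("Stale", "Removed")` holds, while `canTransition("Unmatched", "Active")` does not, because an Unmatched Listing has to be matched first.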
States explained
| State | Meaning | Driven by |
|---|---|---|
| Discovered | Scraper just inserted the row. No match attempted yet. | Scraper run |
| Unmatched | Matcher ran but couldn't resolve to a canonical Product. `productId` is null. | Matcher |
| Matched | Linked to a Product. `productId` set, `matchConfidence` > threshold. | Matcher |
| Active | Has a `priceUsd` and the latest `ListingPrice.inStock = true`. | Latest ListingPrice row |
| QuoteOnly | `priceUsd` is null on the latest ListingPrice. Common on 961Souq. See #3. | Latest ListingPrice row |
| OutOfStock | `inStock = false` on latest ListingPrice. | Latest ListingPrice row |
| Stale | `lastSeenAt` is more than N days ago. Data is still shown but flagged as old. | Cron / staleness guard |
| Removed | Retailer URL no longer returns a product. Soft-deleted to preserve Click history integrity. | Scraper 404 detection |
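To make the table concrete, here is a hedged sketch of how the first six states could be derived from a Listing and its latest price row. The field names `matchStatus`, `priceUsd`, and `inStock` come from the table; the interfaces, the `deriveState` function, and the rule that a null price wins over the stock flag are assumptions made for illustration. Stale and Removed are left out of this sketch.

```typescript
// Hypothetical shapes: only the field names are taken from the table above.
type MatchStatus = "Discovered" | "Unmatched" | "Matched";
type DerivedState = MatchStatus | "Active" | "QuoteOnly" | "OutOfStock";

interface ListingRow {
  matchStatus: MatchStatus;
}

interface ListingPriceRow {
  priceUsd: number | null;
  inStock: boolean;
}

function deriveState(
  listing: ListingRow,
  latest: ListingPriceRow | null
): DerivedState {
  // Until the matcher links a Product, the stored match status is the state.
  if (listing.matchStatus !== "Matched") return listing.matchStatus;
  // Assumption: a matched Listing with no price row yet stays in Matched.
  if (latest === null) return "Matched";
  // Assumption: a missing price takes precedence over the stock flag.
  if (latest.priceUsd === null) return "QuoteOnly";
  return latest.inStock ? "Active" : "OutOfStock";
}
```

Because the state is computed from the newest price row rather than written to the Listing, a new scrape only ever appends a row; nothing has to be updated in place.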
Implementation notes
- The `Listing` row's `matchStatus` field captures Discovered/Unmatched/Matched. The Active/QuoteOnly/OutOfStock distinctions are derived from the latest `ListingPrice` rather than stored on the Listing itself; this keeps the price history append-only.
- Stale is a guard, not a stored state. It's a query-time computation: "Active but `lastSeenAt < now() - 7 days`." If the next scraper run sees the URL again, `lastSeenAt` updates and the Listing is no longer stale.
- Removed should be a soft delete. A scraper run that 404s on a previously-known URL flips a `removed: true` flag (or sets `lastSeenAt` to the far past plus a tombstone marker). Hard-deleting would orphan `Click` rows and break click history reports.
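The staleness guard and the soft delete from the notes above might look roughly like this. It is a sketch under stated assumptions: `isStale`, `markSeen`, `markRemoved`, and the `ListingRow` shape are hypothetical, the code uses the `removed: true` variant rather than the tombstone alternative, and it covers only the time comparison (the "Active" part of the guard is omitted).

```typescript
// Hypothetical names throughout; the 7-day window comes from the note above.
const STALE_AFTER_DAYS = 7;
const MS_PER_DAY = 24 * 60 * 60 * 1000;

interface ListingRow {
  url: string;
  lastSeenAt: Date;
  removed: boolean;
}

// Query-time check: nothing is written back to the row.
function isStale(
  listing: ListingRow,
  now: Date,
  days: number = STALE_AFTER_DAYS
): boolean {
  return now.getTime() - listing.lastSeenAt.getTime() > days * MS_PER_DAY;
}

// A successful scrape refreshes lastSeenAt, which clears staleness implicitly.
function markSeen(listing: ListingRow, now: Date): ListingRow {
  return { ...listing, lastSeenAt: now };
}

// Soft delete: flip the flag and keep the row so Click rows still resolve.
function markRemoved(listing: ListingRow): ListingRow {
  return { ...listing, removed: true };
}
```

Both helpers return a copy instead of mutating their argument, which keeps the sketch easy to test; a real implementation would presumably issue an update against the database instead.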
Match rate concern
The M1 overall match rate is ~13.5%. It is much higher for CPU/GPU (39-40%) thanks to category-specific matchers. Listings in the lower-rate categories (Cooler 6.4%, RAM 9.8%, Storage 3.3%, PSU 3.9%) mostly sit in Unmatched indefinitely.
The proposed fix is LLM-assisted spec extraction (#21). Until then, those categories accumulate Unmatched Listings that never transition further.
See also Ingest pipeline for the matcher's role in the scrape flow.