# Writing a scraper for a new retailer
End state: a new file under `src/scrapers/sites/`, a new row in the Retailer seed, the runner picks it up automatically, and `npm run scrape` brings the retailer's listings into the catalog.
## When to add a retailer
The bar: a Lebanese merchant with a public computer-parts catalog where prices are listed in USD or convertible to USD. Quote-only retailers (like 961Souq, partly) are still worth indexing — they show up in price comparison even without USD prices, and we surface that explicitly.
Target for M2 is 6-8 retailers (#20).
## 1. Survey the site
Before writing code, browse:
- The retailer's category pages (CPU, GPU, etc.). Are URLs predictable? E.g., `/collections/cpus`, `/category/processors`?
- A category index page. Open DevTools and inspect:
  - What HTML element wraps each product card?
  - How is the title rendered? Is it inside an `<a>`?
  - Where's the price? Is there always a price, or sometimes "Call for Price"?
  - How is stock indicated? Disabled "Add to cart" button? A badge?
  - Where's the product image? Is it in `src`, `data-src`, or `srcset`?
- A product detail page. Same questions but more detailed.
Take notes. The scraper module starts as a comment block describing what you found.
## 2. Pick the platform
Most Lebanese retailers run one of:
- Shopify (default theme or custom — Macrotronics is custom Shopify)
- WooCommerce with a popular theme (PCAndParts uses Flatsome)
- Custom (961Souq is bespoke)
Existing scrapers in `src/scrapers/sites/` cover one of each — read the closest one as a template.
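Platform detection can often be automated from page markup before you read anything by hand. A rough sketch of the idea (the marker strings are common conventions — Shopify assets come from `cdn.shopify.com`, WooCommerce sites expose `wp-content` paths and `woocommerce` classes — not guarantees):

```typescript
// Heuristic platform detection from raw HTML. These markers are common
// conventions, not guarantees; treat the result as a hint for which existing
// scraper to read as a template.
type Platform = 'shopify' | 'woocommerce' | 'custom';

function detectPlatform(html: string): Platform {
  if (html.includes('cdn.shopify.com') || html.includes('Shopify.theme')) {
    return 'shopify';
  }
  if (html.includes('wp-content') && html.includes('woocommerce')) {
    return 'woocommerce';
  }
  return 'custom';
}
```

Run it against a saved category page; the answer tells you which existing scraper module to open first.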
## 3. Add the retailer to the seed
Edit `prisma/seed.ts` (or wherever retailers are seeded). Add a row:
```ts
{
  id: 'retailer-newretailer', // stable ID, used by Click and Listing
  name: 'New Retailer Name',
  slug: 'newretailer',
  domain: 'newretailer.com',
  scraperId: 'newretailer', // matches the scraper module file name
  active: true,
}
```
Then re-seed: `npx prisma db seed`.
## 4. Write the scraper
Create `src/scrapers/sites/newretailer.ts`:
```ts
import { load } from '@/scrapers/core/parse';
import { normalizePrice } from '@/scrapers/core/normalize';
import { fetchHtml } from '@/scrapers/core/http';
import type { ScrapedListing } from './pcandparts';

export const RETAILER_ID = 'newretailer';

const BASE = 'https://newretailer.com';

export const CATEGORY_URLS = {
  CPU: `${BASE}/category/processors`,
  GPU: `${BASE}/category/graphics-cards`,
  // ... 8 total: CPU, GPU, MOTHERBOARD, COOLER, RAM, STORAGE, PSU, CASE
};

/** Resolve relative paths (images, links) against the site root. */
function absolutise(href: string | undefined): string | undefined {
  if (!href) return undefined;
  return href.startsWith('http') ? href : `${BASE}${href}`;
}

/**
 * Document what you found:
 * - product card selector
 * - title / URL location
 * - price location (and any "no price" handling)
 * - stock indicator
 * - image extraction priority
 */
function parseListings(html: string): ScrapedListing[] {
  const $ = load(html);
  const listings: ScrapedListing[] = [];

  $('.your.product-card-selector').each((_, el) => {
    const $el = $(el);
    const title = $el.find('a.title').text().trim();
    const linkHref = $el.find('a.title').attr('href');
    if (!title || !linkHref) return;

    const url = linkHref.startsWith('http') ? linkHref : `${BASE}${linkHref}`;
    const priceText = $el.find('.price').text().trim();
    const isSoldOut = $el.find('.sold-out-indicator').length > 0;
    const imageUrl = absolutise($el.find('img').attr('src'));

    listings.push({
      url,
      titleRaw: title,
      priceUsd: normalizePrice(priceText),
      inStock: !isSoldOut,
      imageUrl,
    });
  });

  return listings;
}

export async function scrape(category: keyof typeof CATEGORY_URLS): Promise<ScrapedListing[]> {
  const url = CATEGORY_URLS[category];
  const html = await fetchHtml(url);
  return parseListings(html);
}
```
Key things to match the existing pattern:

- Export `RETAILER_ID` matching the seed's `scraperId`
- Export `CATEGORY_URLS` with all 8 category keys
- Export `scrape(category)` returning `ScrapedListing[]`
- Use the shared `core/` helpers — don't fetch with native fetch; use `fetchHtml` (handles retries, UA, etc.)
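If you're unsure what the shared price normalizer has to deal with, here's a minimal sketch of the idea (the real `normalizePrice` lives in `src/scrapers/core/normalize.ts` and its behavior may differ):

```typescript
// Minimal sketch of USD price normalization: strip currency symbols and
// thousands separators, return null for unpriced listings ("Call for Price",
// empty strings). The real core/normalize helper likely covers more cases.
function normalizePriceSketch(text: string): number | null {
  const cleaned = text.replace(/[$,\s]/g, '');
  if (!cleaned || /call|quote/i.test(text)) return null;
  const value = Number.parseFloat(cleaned);
  return Number.isFinite(value) && value > 0 ? value : null;
}
```

The `null` return is what lets quote-only listings flow through the pipeline without a fake price.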
## 5. Register in the runner
The runner should auto-detect new modules in `src/scrapers/sites/` if you follow the convention. If not, update `scripts/run-scrapers.ts` to import the new module.
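Convention-based auto-detection can be as simple as mapping file names in `src/scrapers/sites/` to scraper IDs. A sketch of the idea (the real runner may work differently):

```typescript
// Sketch: derive scraper IDs from module file names, skipping test files and
// non-TypeScript files. A real runner would then dynamic-import each module
// and verify its exported RETAILER_ID matches the seed's scraperId.
function discoverScraperIds(fileNames: string[]): string[] {
  return fileNames
    .filter((f) => f.endsWith('.ts') && !f.endsWith('.test.ts'))
    .map((f) => f.replace(/\.ts$/, ''));
}
```

This is why the file name, `RETAILER_ID`, and the seed's `scraperId` all need to agree.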
## 6. Test it
Run `npm run scrape -- --retailer newretailer`. Expected: the terminal logs the URL fetched, the count of listings parsed, and the count of matches. Sample output:

```
[newretailer:CPU] fetched https://newretailer.com/category/processors → 47 listings
[newretailer:CPU] matched 12 / 47 (25.5%)
```
If you got 0 listings, your selectors are wrong — go back and inspect again.
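The match-rate line in the sample output is just matched/total counts; a sketch of a formatter that would produce it (names hypothetical — the real runner's logging may differ):

```typescript
// Sketch: produce the "[retailer:CATEGORY] matched X / Y (Z%)" log line,
// guarding against division by zero for empty categories.
function formatMatchLine(retailer: string, category: string, matched: number, total: number): string {
  const pct = total === 0 ? '0.0' : ((matched / total) * 100).toFixed(1);
  return `[${retailer}:${category}] matched ${matched} / ${total} (${pct}%)`;
}
```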
## 7. Verify in the UI
Open `http://localhost:3000/products?retailer=newretailer`. You should see real products with the retailer's name in the price comparison.
## 8. Add tests
Create `tests/scrapers/sites/newretailer.test.ts`. Use a saved HTML fixture (download a real category page, save it in `tests/fixtures/newretailer/cpu.html`) and assert:
- Parses the expected number of listings
- Title, URL, price, stock state, image URL all extracted
- Sold-out items are flagged correctly
## 9. Document gotchas
If the retailer has anything unusual (e.g., quote-only listings, weird URL patterns, anti-bot detection), add it to the JSDoc at the top of the scraper module.
## 10. Open the PR
Use a conventional commit message.
PR body checklist:

- Seed updated
- All 8 categories represented in `CATEGORY_URLS`
- `npm run scrape -- --retailer newretailer` runs end-to-end without errors
- At least one matched product per category in spot-check
- Test fixtures + specs added
- No retailer-specific logic leaked into shared `core/`
Reference issue: #20.
## Common failure modes
| Symptom | Likely cause | Fix |
|---|---|---|
| 0 listings parsed | Wrong CSS selector | Re-inspect the page; selectors may have changed since you wrote the scraper |
| 403 / anti-bot challenge | Retailer detected scraping | Add a custom User-Agent in `core/http.ts`; consider rate-limiting |
| Match rate near 0 | Title format diverges from canonical | Likely needs LLM-assisted extraction (#21) |
| Images missing | Lazy-loaded images use `data-src` | Check `srcset` and `data-src` before falling back to `src` |
| `priceUsd` always `null` | Prices in LBP, not USD | Parse the currency and convert, or surface as quote-only if the conversion isn't trustworthy |
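For the LBP row, a hedged sketch of what currency-aware parsing might look like (the rate constant and function names are placeholders, not project code):

```typescript
// Sketch: detect the currency before parsing. LBP-priced listings are either
// converted with a known rate or surfaced as quote-only (priceUsd: null).
// LBP_PER_USD is a placeholder; a real implementation would source the rate
// somewhere trustworthy or refuse to convert.
const LBP_PER_USD = 89_500; // placeholder rate

function parsePrice(text: string): { priceUsd: number | null; quoteOnly: boolean } {
  const amount = Number.parseFloat(text.replace(/[^0-9.]/g, ''));
  if (!Number.isFinite(amount) || amount <= 0) return { priceUsd: null, quoteOnly: true };
  if (/lbp|l\.l\./i.test(text)) {
    // Convert, rounding to cents. If the rate can't be trusted, return
    // { priceUsd: null, quoteOnly: true } here instead.
    return { priceUsd: Math.round((amount / LBP_PER_USD) * 100) / 100, quoteOnly: false };
  }
  return { priceUsd: amount, quoteOnly: false };
}
```

The quote-only path is the safe default: a listing with no USD price still shows up in comparison, per the policy in "When to add a retailer".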