Writing a scraper for a new retailer

End state: a new file under src/scrapers/sites/, a new row in the Retailer seed, the runner picks it up automatically, and npm run scrape brings the retailer's listings into the catalog.

When to add a retailer

The bar: a Lebanese merchant with a public computer-parts catalog where prices are listed in USD or convertible to USD. Quote-only retailers (961Souq, for example, is partly quote-only) are still worth indexing — they show up in price comparison even without USD prices, and we surface that explicitly.

Target for M2 is 6-8 retailers (#20).

1. Survey the site

Before writing code, browse:

  • The retailer's category pages (CPU, GPU, etc.). Are URLs predictable? E.g., /collections/cpus, /category/processors?
  • A category index page. Open DevTools and inspect:
      • What HTML element wraps each product card?
      • How is the title rendered? Is it inside an <a>?
      • Where's the price? Is there always a price, or sometimes "Call for Price"?
      • How is stock indicated? A disabled "Add to cart" button? A badge?
      • Where's the product image? Is it in src, data-src, or srcset?
  • A product detail page. Same questions, in more detail.

Take notes. The scraper module starts as a comment block describing what you found.

2. Pick the platform

Most Lebanese retailers run one of:

  • Shopify (default theme or custom — Macrotronics is custom Shopify)
  • WooCommerce with a popular theme (PCAndParts uses Flatsome)
  • Custom (961Souq is bespoke)

Existing scrapers in src/scrapers/sites/ cover one of each — read the closest as a template.

3. Add the retailer to the seed

Edit prisma/seed.ts (or wherever retailers are seeded). Add a row:

{
  id: 'retailer-newretailer',           // stable ID, used by Click and Listing
  name: 'New Retailer Name',
  slug: 'newretailer',
  domain: 'newretailer.com',
  scraperId: 'newretailer',             // matches the scraper module file name
  active: true,
}

Then re-seed: npx prisma db seed.
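If prisma/seed.ts talks to the Prisma client directly, an upsert keeps re-seeding idempotent: running npx prisma db seed twice updates the row instead of failing on the duplicate id. A sketch — assuming a Retailer model and a prisma client instance; mirror whatever the seed file actually does:

```typescript
// Sketch only — follow the existing pattern in prisma/seed.ts.
// Goes inside the seed's async main().
await prisma.retailer.upsert({
  where: { id: 'retailer-newretailer' },
  update: { active: true },               // keep existing rows in sync
  create: {
    id: 'retailer-newretailer',
    name: 'New Retailer Name',
    slug: 'newretailer',
    domain: 'newretailer.com',
    scraperId: 'newretailer',
    active: true,
  },
});
```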

4. Write the scraper

Create src/scrapers/sites/newretailer.ts:

import { load } from '@/scrapers/core/parse';
import { normalizePrice } from '@/scrapers/core/normalize';
import { fetchHtml } from '@/scrapers/core/http';
import type { ScrapedListing } from './pcandparts';

export const RETAILER_ID = 'newretailer';
const BASE = 'https://newretailer.com';

export const CATEGORY_URLS = {
  CPU: `${BASE}/category/processors`,
  GPU: `${BASE}/category/graphics-cards`,
  // ... 8 total: CPU, GPU, MOTHERBOARD, COOLER, RAM, STORAGE, PSU, CASE
};

/**
 * Document what you found:
 * - product card selector
 * - title / URL location
 * - price location (and any "no price" handling)
 * - stock indicator
 * - image extraction priority
 */
// Resolve relative URLs (e.g. image src) against the site root.
function absolutise(href: string | undefined): string | undefined {
  if (!href) return undefined;
  return href.startsWith('http') ? href : `${BASE}${href}`;
}

function parseListings(html: string): ScrapedListing[] {
  const $ = load(html);
  const listings: ScrapedListing[] = [];

  $('.your.product-card-selector').each((_, el) => {
    const $el = $(el);

    const title = $el.find('a.title').text().trim();
    const linkHref = $el.find('a.title').attr('href');
    if (!title || !linkHref) return;

    const url = linkHref.startsWith('http') ? linkHref : `${BASE}${linkHref}`;
    const priceText = $el.find('.price').text().trim();
    const isSoldOut = $el.find('.sold-out-indicator').length > 0;
    const imageUrl = absolutise($el.find('img').attr('src'));

    listings.push({
      url,
      titleRaw: title,
      priceUsd: normalizePrice(priceText),
      inStock: !isSoldOut,
      imageUrl,
    });
  });

  return listings;
}

export async function scrape(category: keyof typeof CATEGORY_URLS): Promise<ScrapedListing[]> {
  const url = CATEGORY_URLS[category];
  const html = await fetchHtml(url);
  return parseListings(html);
}

Key things to match the existing pattern:

  • Export RETAILER_ID matching the seed's scraperId
  • Export CATEGORY_URLS with all 8 category keys
  • Export scrape(category) returning ScrapedListing[]
  • Use the shared core/ helpers — don't fetch with native fetch; use fetchHtml (it handles retries, User-Agent, etc.)
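On the price side, the normalizer has to cope with strings like "$1,299.99" as well as "Call for Price". A minimal sketch of that behavior, using a hypothetical normalizePriceSketch — the real helper lives in src/scrapers/core/normalize.ts and may handle more formats:

```typescript
// Sketch of the behavior to expect from a price normalizer — not the
// actual core/normalize implementation.
function normalizePriceSketch(priceText: string): number | null {
  // Strip thousands separators, then grab the first decimal number.
  const match = priceText.replace(/,/g, '').match(/\d+(\.\d+)?/);
  if (!match) return null; // "Call for Price", "", etc.
  const value = parseFloat(match[0]);
  return value > 0 ? value : null;
}
```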

5. Register in the runner

The runner should auto-detect new modules in src/scrapers/sites/ if you follow the convention. If not, update scripts/run-scrapers.ts to import the new module.
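Under the convention, discovery can be as simple as scanning the directory and keeping modules that export RETAILER_ID and scrape. A sketch — the actual runner in scripts/run-scrapers.ts may work differently, and scraperIdFromFile / loadScrapers are illustrative names:

```typescript
import { readdirSync } from 'node:fs';
import * as path from 'node:path';

// By convention the file name is the scraperId (see step 3).
function scraperIdFromFile(file: string): string | null {
  if (!file.endsWith('.ts') || file.endsWith('.test.ts')) return null;
  return path.basename(file, '.ts');
}

// Import every conforming module under src/scrapers/sites/.
async function loadScrapers(dir: string) {
  const scrapers = [];
  for (const file of readdirSync(dir)) {
    if (scraperIdFromFile(file) === null) continue;
    const mod = await import(path.join(dir, file));
    // Only register modules that follow the convention from step 4.
    if (typeof mod.RETAILER_ID === 'string' && typeof mod.scrape === 'function') {
      scrapers.push(mod);
    }
  }
  return scrapers;
}
```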

6. Test it

npm run scrape -- --retailer newretailer --category CPU

Expected: terminal logs the URL fetched, count of listings parsed, count of matches. Sample output:

[newretailer:CPU] fetched https://newretailer.com/category/processors → 47 listings
[newretailer:CPU] matched 12 / 47 (25.5%)

If you get 0 listings, your selectors are wrong — go back to DevTools and inspect again.

7. Verify in the UI

npm run dev

Open http://localhost:3000/products?retailer=newretailer. You should see real products with the retailer's name in the price comparison.

8. Add tests

Create tests/scrapers/sites/newretailer.test.ts. Use a saved HTML fixture (download a real category page and save it as tests/fixtures/newretailer/cpu.html) and assert:

  • Parses the expected number of listings
  • Title, URL, price, stock state, image URL all extracted
  • Sold-out items are flagged correctly
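A sketch of what that spec could look like — assuming Vitest (swap in the repo's actual runner) and that parseListings is exported from the scraper module for testing (add export to it if needed):

```typescript
// Sketch only — adjust to the repo's actual test runner and exports.
import { readFileSync } from 'node:fs';
import { describe, it, expect } from 'vitest';
import { parseListings } from '@/scrapers/sites/newretailer';

const html = readFileSync('tests/fixtures/newretailer/cpu.html', 'utf8');

describe('newretailer CPU category parser', () => {
  it('parses the expected number of listings', () => {
    expect(parseListings(html)).toHaveLength(47); // count from your fixture
  });

  it('extracts all fields on the first listing', () => {
    const [first] = parseListings(html);
    expect(first.titleRaw).toBeTruthy();
    expect(first.url).toMatch(/^https:\/\/newretailer\.com\//);
    expect(first.imageUrl).toMatch(/^https?:\/\//);
    expect(typeof first.inStock).toBe('boolean');
  });

  it('flags sold-out items', () => {
    const soldOut = parseListings(html).filter((l) => !l.inStock);
    expect(soldOut.length).toBeGreaterThan(0); // fixture should include one
  });
});
```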

9. Document gotchas

If the retailer has anything unusual (e.g., quote-only listings, weird URL patterns, anti-bot detection), add it to the JSDoc at the top of the scraper module.

10. Open the PR

Conventional commit:

feat(scrapers): add NewRetailer (#20 — third of 5 new retailers)

PR body checklist:

  • Seed updated
  • All 8 categories represented in CATEGORY_URLS
  • npm run scrape -- --retailer newretailer runs end-to-end without errors
  • At least one matched product per category in spot-check
  • Test fixtures + specs added
  • No retailer-specific logic leaked into shared core/

Reference issue: #20.

Common failure modes

| Symptom | Likely cause | Fix |
| --- | --- | --- |
| 0 listings parsed | Wrong CSS selector | Re-inspect the page; selectors changed since you wrote the scraper |
| 403 / anti-bot challenge | Retailer detected scraping | Add a custom User-Agent in core/http.ts; consider rate-limiting |
| Match rate near 0 | Title format diverges from canonical | Likely needs LLM-assisted extraction (#21) |
| Images missing | Lazy-loaded images use data-src | Check srcset and data-src before falling back to src |
| priceUsd always null | Prices in LBP, not USD | Parse currency and convert; or surface as quote-only if conversion isn't trustworthy |
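For the lazy-loaded-image case, the extraction priority can live in a small helper. A sketch — the attribute names and their priority are assumptions; inspect the real markup first:

```typescript
// Lazy-loading themes often put the real URL in data-src or (data-)srcset
// and leave a placeholder in src. Priority here is an assumption.
type AttrGetter = (name: string) => string | undefined;

// srcset is "url1 300w, url2 600w" — take the first URL.
function pickFromSrcset(srcset: string): string {
  return srcset.split(',')[0].trim().split(/\s+/)[0];
}

function extractImageUrl(attr: AttrGetter): string | undefined {
  const srcset = attr('data-srcset') ?? attr('srcset');
  if (srcset) return pickFromSrcset(srcset);
  return attr('data-src') ?? attr('src');
}
```

With cheerio this would be called as extractImageUrl((name) => $el.find('img').attr(name)), passing the result through the same URL-absolutising step as the rest of the scraper.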