AI discoverability — what makes 961tech an LLM-citable surface

Reference for the surfaces 961tech ships so AI assistants cite us when a Lebanese user asks ChatGPT / Claude / Perplexity / Google AI Overviews / Apple Intelligence "where is the cheapest RTX 4070 in Beirut" or "what laptops are available under $800 in Lebanon." Produced for Foundation: AI discoverability (#47); pairs with RFC-0009 which carries the actual decisions.

1. Scope & method

What this is. A per-surface reference covering six surfaces — robots.txt, llms.txt, schema.org / JSON-LD, OpenGraph + Twitter Card, page-content shape, machine-readable feeds — with M1 / M2 / deferred status per recommendation. Grounded in current AI-assistant behavior verified against vendor documentation, not aspirational SEO folklore.

What this isn't. Not a Google-search SEO strategy — that's #38. Not a security or anti-scraping policy — that's #44. Not a KPI definition — that's #43. Not implementation — no app/robots.ts, no app/sitemap.ts, no JSON-LD components in this work; that's a follow-up code ticket.

Method. Facts verified against vendor crawler documentation and live probes of each site's robots.txt and /llms.txt (sources listed in §7).

Confidence taxonomy. Same three buckets as personas.md §1.3:

Mark Meaning
Vendor-stated Direct quote or paraphrase from the vendor's own documentation
Hypothesis Reasoned from vendor docs + observed behavior; defensible but not a vendor commitment
Untested Speculation included for completeness; flagged

Inline tags follow the persona-doc convention (silence = vendor-stated; (hypothesis) / (untested) otherwise).

Honest limits.

  • No major AI assistant publishes a complete grounding pipeline doc. OpenAI, Anthropic, Perplexity, Google, and Apple all document their crawlers; none documents exactly which signals (HTML text vs JSON-LD vs OpenGraph vs llms.txt) feed citations. Recommendations on content shape are Hypothesis-grade based on observed behavior + the structural fact that all crawlers ingest HTML as text.
  • "Honors robots.txt" is a vendor claim. Multiple 2024–2025 third-party audits found Perplexity and others fetching via undeclared user agents. We follow vendor-stated behavior for policy; we don't pretend it's enforced.
  • The genre has no consensus posture. Cross-aggregator survey (§5) shows everything from blanket-block (PCPartPicker, BuildMyPC) to fully-open (Newegg, LDLC, Pricena). There is no "industry baseline" we'd be deviating from.
  • llms.txt has zero documented LLM consumers as of April 2026. Producers exist (Anthropic, Stripe, Vercel, Supabase, Cloudflare); no major AI vendor has stated they read it from third-party sites.

2. Per-surface recommendations

2.1 robots.txt — AI crawler posture

The single highest-leverage surface. Three classes of crawler need separate decisions:

Class What it does Effect of blocking Examples
A. Training crawlers Bulk-crawl the open web; data feeds future model weights Opts you out of future training. Already-trained models unchanged. No effect on whether you're cited today. GPTBot, Google-Extended, Applebot-Extended, meta-externalagent (training portion), ClaudeBot
B. AI-search index crawlers Build a fresh index used to surface citations when a user runs an AI search query Blocking removes you from AI-assisted search results. This is the one that kills citations. OAI-SearchBot, Claude-SearchBot, PerplexityBot, meta-webindexer
C. On-demand assistant fetchers Fetch a specific URL right now because a user asked the assistant a question Blocking prevents the assistant from reading your page in response to a direct user question. Many bypass robots.txt anyway because the request is user-initiated. ChatGPT-User, Claude-User, Perplexity-User, meta-externalfetcher, DuckAssistBot

Per-bot reference (vendor-stated)

Vendor UA Class Honors robots.txt? Notes
OpenAI GPTBot Training Yes
OpenAI ChatGPT-User On-demand "Rules may not apply" — user-initiated
OpenAI OAI-SearchBot AI-search index Yes
OpenAI OAI-AdsBot Ads landing-page Yes Only fires if you submit ads
Anthropic ClaudeBot Training Yes Earlier framing was broader; current doc treats as training
Anthropic Claude-User On-demand Yes (Anthropic states all bots respect)
Anthropic Claude-SearchBot AI-search index Yes
Anthropic Claude-Web / anthropic-ai Legacy Unclear — not in current doc List defensively; harmless
Perplexity PerplexityBot AI-search index Yes Perplexity is a major shopping/comparison citation source
Perplexity Perplexity-User On-demand "Generally ignores" — user-initiated
Google Googlebot Search index (also feeds AI Overviews) Yes AI Overviews has no separate UA
Google Google-Extended Training opt-out (Gemini) Yes Does NOT affect Search ranking
Google GoogleOther R&D one-offs Yes
Google Google-CloudVertexBot Vertex AI agent build Yes Only fires if site owner builds an agent
Apple Applebot Search index (Spotlight, Siri, Safari Suggestions) Yes Data may feed Apple foundation models unless Applebot-Extended is disallowed
Apple Applebot-Extended Training opt-out Yes Does NOT crawl itself; governs reuse of Applebot data
Meta meta-externalagent Training + product indexing (bundled) Yes Blocking costs Meta AI indexing too
Meta meta-webindexer AI-search index Yes "Helps us cite and link to your content in Meta AI's responses"
Meta meta-externalfetcher On-demand / agentic "May bypass"
Meta facebookexternalhit Link previews "Might bypass" for security Drives WhatsApp/FB/IG share-card grounding
DuckDuckGo DuckAssistBot On-demand Yes (~72h propagation) "Explicitly NOT used to train AI models"
Microsoft Bingbot Search index (also feeds Copilot) Yes No separate Copilot training UA documented

961tech recommendation

961tech's strategic position: small Lebanese aggregator with near-zero brand recognition in MENA AI surfaces, competing for citation traffic against effectively-no-one (Pricena explicitly skips Lebanon per competitive-landscape.md §4.1). Citation traffic is the entire monetisation funnel for the AI-search era — every user who lands on us via "ChatGPT recommended 961tech" is a click-through to retailer affiliate.

Posture: Allow everything in M1. Block nothing. Preserve every citation pathway. If scraping abuse emerges later (which only happens once we're worth scraping — see #44), narrow then.

This is the opposite of PCPartPicker/BuildMyPC's blanket Cloudflare-managed block (5M+ MAU sites that can afford to assert rights), and lighter than Skroutz's tiered model (allow assistants on HTML pages, deny on parametric search). 961tech is too small for Skroutz's nuance to matter yet; the simpler "allow everything, hide the click-out redirector" pattern is the genre's plurality (5/12 peers do nothing about AI bots — Newegg, LDLC, Pricena, EG-PC, EGPrices) and lets us focus on being citable, not being protected.

Surface M1 M2 Deferred
Allow all training UAs (GPTBot, ClaudeBot, Google-Extended, Applebot-Extended, meta-externalagent) Ship
Allow all AI-search index UAs (OAI-SearchBot, Claude-SearchBot, PerplexityBot, meta-webindexer, Applebot) Ship
Allow all on-demand UAs (ChatGPT-User, Claude-User, Perplexity-User, DuckAssistBot, meta-externalfetcher, facebookexternalhit) Ship
Disallow: /api/go/ (the click-out redirector — universal pattern across peers) Ship
Sitemap: directive pointing at /sitemap.xml Ship
Crawl-delay: 60 for known-aggressive UAs M2 if observed
Cloudflare "Block AI training" managed rule Defer until specific abuse

Implementation note. robots.txt itself is one file at the site root. In Next.js 16, the canonical implementation is src/app/robots.ts exporting a MetadataRoute.Robots (verify against current Next.js 16 docs in node_modules/next/dist/docs/ per AGENTS.md before writing).
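Under those constraints, the M1 posture could be sketched as follows. This is a sketch, not the shipped file: the local Robots/RobotsRule types stand in for Next's MetadataRoute.Robots, and the exact field shape must be verified against the Next.js 16 docs per the note above.

```typescript
// Sketch of src/app/robots.ts for the M1 allow-everything posture.
// Local types approximate Next's MetadataRoute.Robots — verify the real
// shape against the Next.js 16 docs before shipping.
type RobotsRule = {
  userAgent: string | string[];
  allow?: string | string[];
  disallow?: string | string[];
};
type Robots = { rules: RobotsRule | RobotsRule[]; sitemap?: string };

export default function robots(): Robots {
  return {
    rules: [
      // One wildcard rule: allow every crawler class (training, AI-search
      // index, on-demand) and hide only the click-out redirector.
      { userAgent: "*", allow: "/", disallow: "/api/go/" },
    ],
    sitemap: "https://961tech.pages.dev/sitemap.xml",
  };
}
```

A single wildcard rule is deliberate: since every UA class is allowed, enumerating per-bot rules would add maintenance burden with no behavioral difference.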

2.2 llms.txt — curated markdown index for LLMs

What it is. A markdown file at /llms.txt with a curated index of the site's most-citable URLs. Proposed by Jeremy Howard (Answer.AI), 2024-09-03. Spec at llmstxt.org. Informal — not on any RFC/IETF/W3C track but stable since proposal. The Markdown structure is strict: H1 (project name, required), optional blockquote summary, optional non-heading prose, zero-or-more H2 sections each containing a markdown list of - [name](url): notes, plus a special ## Optional section for low-priority links.

What it isn't. Not a sitemap (no exhaustive URL list — curated, ~5KB). Not robots.txt (no access policy). Not for training (Howard explicitly notes inference-time grounding, not training).

Adoption (verified live 2026-04-28):

Site /llms.txt Notes
docs.anthropic.com 200 (134 KB) Massive Claude/Anthropic docs index
docs.stripe.com 200 (93 KB) Thorough docs index
vercel.com 200 (355 KB) Full docs tree
nextjs.org 200 (7.7 KB) Curated; also documents .md-suffix convention
supabase.com 200 (1.2 KB) Textbook curated TOC + /llms-full.txt companion
cloudflare.com 200 Marketing-oriented
hono.dev 200 + /llms-full.txt and /llms-small.txt variants
docs.perplexity.ai 200
docs.cursor.com 200
openai.com / platform.openai.com 404 OpenAI ships none
ai.google.dev 404 Google ships none
Aggregators
pcpartpicker.com 403 (bot block) None
geizhals.de 403 None
skroutz.gr 403 None
idealo.de 503 (Cloudflare) None
pricena.com 404 None
egprices.com 403 None
eg-pc.com 404 None
pcprices.vercel.app SPA fallback None
buildmypc.net 404 None
newegg.com 404 None

Genre adoption: zero. No PC-parts aggregator or general-comparison aggregator ships one as of April 2026. Strong in dev-tools/docs sites; absent in commerce/comparison.

Honest assessment. Hypothesis-grade on impact: no major AI vendor has published a doc stating they read /llms.txt from third-party sites. Anthropic publishes one for its own docs but doesn't say Claude consumes it externally. Documented consumers today are coding-assistant scaffolds (Cursor, Continue, Cline, Aider) that probe the file when a user names a library. Cost to ship is low (one curated markdown file, no JS, no schema); upside is asymmetric — we're a first-mover in the price-aggregator vertical, and any agentic assistant doing "compare GPU prices in Lebanon" via tool-use that does probe /llms.txt lands on a clean structured index of our value proposition.

961tech recommendation: ship one in M1. The ongoing cost is effectively zero (one curated ~5KB markdown file, no maintenance burden); the cost-to-ship is one PR. Curated index of: build flow, all-parts catalog, retailer audit, compatibility-rules reference, project repo. Skip product/listing pages (an exhaustive URL list defeats the curation purpose; those belong in sitemap.xml).

Suggested initial content (final wording in the implementation ticket):

# 961tech

> Lebanon-specific PC parts price comparison and compatibility-checked
> builder. Aggregates real-time prices from Lebanese retailers, normalises
> listings, and lets users build a PC with automatic compatibility checks.

961tech is a solo project covering the Lebanese PC-parts market — a market
no global aggregator (PCPartPicker, Geizhals, Idealo, Pricena) covers.
Prices in USD; Lebanon-only retailers; Lebanon-only delivery realities.
Source code is public.

## Browse
- [All parts](https://961tech.pages.dev/parts): Faceted catalog across CPU, GPU, motherboard, RAM, storage, PSU, case, cooler.
- [Build a PC](https://961tech.pages.dev/build): All-slots-at-once builder UI with live compatibility checks.
- [Retailer coverage](https://961tech.pages.dev/about/retailers): Per-retailer reference for the Lebanese tech-retail surface we index.

## Reference
- [Architecture overview](https://961tech.pages.dev/architecture/overview): How the system is built.
- [Compatibility rules](https://961tech.pages.dev/about/compatibility-rules): What we check and don't.
- [Glossary](https://961tech.pages.dev/glossary): Domain terms (Call For Price, Fresh USD, etc.).
- [Principles](https://961tech.pages.dev/principles): Engineering values that shape decisions.

## Project
- [GitHub repo](https://github.com/Amine32/961tech)
- [Public roadmap](https://github.com/users/Amine32/projects/2)

## Optional
- [ADRs](https://961tech.pages.dev/adr/): Locked architectural decisions.
- [RFCs](https://961tech.pages.dev/rfc/): Proposals under review.

Surface M1 M2 Deferred
/llms.txt curated index, ≤5KB Ship
.md shadow URLs (/foo/foo.md returning rendered MDX as text/markdown) for docs pages M2 candidate
/llms-full.txt (concatenated full docs) Defer until docs are stable
/llms-ctx.txt / /llms-ctx-full.txt (XML-wrapped for llms_txt2ctx CLI) Skip unless we see a documented consumer

2.3 Schema.org / JSON-LD

Verified against schema.org V30.0 (2026-03-19) and Google Search Central docs.

Per-page-type coverage

Page type Type(s) M1 M2 Deferred
Product detail Product + AggregateOffer (containing Offer[]) + Brand + BreadcrumbList Ship
Category listing BreadcrumbList + (optional) ItemList of Product references M1 (breadcrumb); ItemList M2
Retailer profile LocalBusiness (subtype of Organization) + PostalAddress Ship
Build detail (saved/shared) BreadcrumbList only M1 (breadcrumb)
Homepage WebSite + Organization (+ SearchAction if global search ships) Ship
Review / AggregateRating anywhere Defer permanently until 961tech has first-party reviews

Product properties (schema.org/Product)

Property Status Notes
name Required
image Required for merchant listing eligibility ≥1 URL
offers Required Use the AggregateOffer + nested Offer[] pattern below
description Recommended Plain text. Useful for AI grounding
brand Recommended {"@type": "Brand", "name": "..."}
sku Recommended 961tech internal ID
gtin / gtin8/12/13/14 Recommended Pass through if retailer publishes EAN/UPC
mpn Recommended (alt to gtin) Manufacturer Part Number — useful for PC parts where GTIN is patchy
category Recommended E.g. "Computers > Components > GPU"
additionalProperty Recommended for compat specs Array of PropertyValue for socket, tdp, vramGB, coreCount, etc.

Offer vs AggregateOffer — the aggregator decision

Schema.org explicitly endorses AggregateOffer for the multi-retailer case (https://schema.org/AggregateOffer): "When a single product is associated with multiple offers (for example, the same pair of shoes is offered by different merchants)."

Google's guidance differs: merchant-listing rich-result eligibility requires Offer, not AggregateOffer, because "the merchant has to be the seller of the product" (Google's merchant-listing docs). 961tech is an aggregator, not a merchant, so the merchant-listing rich result is unreachable regardless. We remain eligible for the lighter Product-snippet rich result.

Recommended pattern: emit BOTH shapes. Product.offers is an AggregateOffer with lowPrice/highPrice/offerCount/priceCurrency AND a nested offers: Offer[] array of individual retailer offers. This satisfies schema.org, satisfies Google's Product-snippet requirements, and gives AI assistants a clean structure they can quote ("$229–$275 across 3 Lebanese retailers").

{
  "@context": "https://schema.org",
  "@type": "Product",
  "name": "ASUS ROG STRIX RTX 4070 Super",
  "image": ["https://961tech.pages.dev/img/asus-rog-strix-rtx4070s.jpg"],
  "brand": { "@type": "Brand", "name": "ASUS" },
  "sku": "961-prd-rtx4070s-asus-strix",
  "mpn": "ROG-STRIX-RTX4070S-O12G-GAMING",
  "category": "Computers > Components > GPU",
  "additionalProperty": [
    { "@type": "PropertyValue", "name": "vramGB", "value": 12 },
    { "@type": "PropertyValue", "name": "tdpWatts", "value": 220 }
  ],
  "offers": {
    "@type": "AggregateOffer",
    "priceCurrency": "USD",
    "lowPrice": "799.00",
    "highPrice": "865.00",
    "offerCount": 3,
    "offers": [
      {
        "@type": "Offer",
        "url": "https://pcandparts.com/...",
        "price": "799.00",
        "priceCurrency": "USD",
        "availability": "https://schema.org/InStock",
        "itemCondition": "https://schema.org/NewCondition",
        "seller": { "@type": "Organization", "name": "PCAndParts" },
        "priceValidUntil": "2026-05-05"
      }
    ]
  }
}
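A minimal sketch of how the AggregateOffer wrapper could be derived from the individual retailer offers, matching the JSON-LD shape above. The RetailerOffer type and buildAggregateOffer name are hypothetical, not existing 961tech code.

```typescript
// Hypothetical helper: derive the AggregateOffer wrapper (lowPrice,
// highPrice, offerCount) from the per-retailer Offer[] array, keeping
// the nested offers for AI grounding.
type RetailerOffer = {
  "@type": "Offer";
  url: string;
  price: string; // schema.org accepts string prices; keep 2 decimals
  priceCurrency: "USD";
  availability: string;
  seller: { "@type": "Organization"; name: string };
};

function buildAggregateOffer(offers: RetailerOffer[]) {
  const prices = offers.map((o) => Number(o.price));
  return {
    "@type": "AggregateOffer" as const,
    priceCurrency: "USD" as const,
    lowPrice: Math.min(...prices).toFixed(2),
    highPrice: Math.max(...prices).toFixed(2),
    offerCount: offers.length,
    offers, // nested Offer[] preserves per-retailer detail
  };
}
```

Deriving the aggregate fields from the nested array (rather than storing them separately) guarantees lowPrice/highPrice/offerCount can never drift out of sync with the individual offers.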

Offer.availability — the "Call For Price" question

Canonical ItemAvailability enum (https://schema.org/ItemAvailability): BackOrder, Discontinued, InStock, InStoreOnly, LimitedAvailability, MadeToOrder, OnlineOnly, OutOfStock, PreOrder, PreSale, Reserved, SoldOut. There is no QuoteOnRequest.

961Souq has ~78% of CPU listings as "Call For Price" (retailers.md §2.2) — this is structural, not edge-case. Decision goes to RFC-0009; preferred path: emit Offer with availability: https://schema.org/MadeToOrder, omit price and priceCurrency, keep url and seller. Honest, valid schema.org, lets AI assistants surface "available, contact retailer for pricing." Disqualifies the listing from price-bearing rich results (correct — there's no price to show), keeps it in the JSON-LD payload for grounding.
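The preferred path can be sketched as a small mapping function, pending the RFC-0009 decision. The Listing shape and toOffer name are hypothetical; only the availability URLs and the omit-price rule come from the text above.

```typescript
// Sketch of the preferred "Call For Price" mapping (pending RFC-0009):
// priced listings get a price-bearing Offer; Call-For-Price listings get
// MadeToOrder with price/priceCurrency omitted entirely.
type Listing = {
  url: string;
  retailer: string;
  priceUsd?: number; // undefined => "Call For Price"
  inStock: boolean;
};

function toOffer(l: Listing) {
  const base = {
    "@type": "Offer" as const,
    url: l.url,
    seller: { "@type": "Organization" as const, name: l.retailer },
  };
  if (l.priceUsd === undefined) {
    // Call For Price: honest availability, no price fields at all.
    return { ...base, availability: "https://schema.org/MadeToOrder" };
  }
  return {
    ...base,
    price: l.priceUsd.toFixed(2),
    priceCurrency: "USD",
    availability: l.inStock
      ? "https://schema.org/InStock"
      : "https://schema.org/OutOfStock",
  };
}
```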

Review and AggregateRating — do not ship

Google's review-snippet policy (https://developers.google.com/search/docs/appearance/structured-data/review-snippet) explicitly forbids self-serving aggregation:

  • "Don't aggregate reviews or ratings from other websites."
  • "Don't rely on human editors to create, curate, or compile ratings information for local businesses."

961tech does not have first-party reviews in M1/M2. Faking AggregateRating (e.g. defaulting to 5 stars or aggregating retailer reviews) gets a manual action and is forbidden. Decision: omit Review and AggregateRating markup entirely until 961tech itself collects reviews (post-M3, gated on a real review submission flow).

LocalBusiness for retailer profiles (M2)

For retailer profile pages (/r/[slug] per #10 retailer profile pages):

{
  "@context": "https://schema.org",
  "@type": "LocalBusiness",
  "name": "PCAndParts",
  "url": "https://pcandparts.com",
  "address": {
    "@type": "PostalAddress",
    "streetAddress": "...",
    "addressLocality": "Beirut",
    "addressRegion": "Beirut",
    "addressCountry": "LB"
  },
  "telephone": "+96101...",
  "logo": "https://961tech.pages.dev/r/pcandparts/logo.png",
  "sameAs": [
    "https://www.facebook.com/pcandparts",
    "https://www.instagram.com/pcandparts"
  ],
  "priceRange": "$$",
  "openingHoursSpecification": [
    { "@type": "OpeningHoursSpecification", "dayOfWeek": ["Monday","Tuesday","Wednesday","Thursday","Friday"], "opens": "09:00", "closes": "19:00" }
  ]
}

Important nuance. This is 961tech describing a third-party retailer. Google may not award rich results on our domain (canonical authority belongs to the retailer's own site). Treat as AI-assistant grounding payload, not a Google ranking play.

BreadcrumbList

Standard format (https://schema.org/BreadcrumbList):

{
  "@context": "https://schema.org",
  "@type": "BreadcrumbList",
  "itemListElement": [
    { "@type": "ListItem", "position": 1, "name": "Components", "item": "https://961tech.pages.dev/parts" },
    { "@type": "ListItem", "position": 2, "name": "GPUs", "item": "https://961tech.pages.dev/parts/gpu" },
    { "@type": "ListItem", "position": 3, "name": "RTX 4070 Super" }
  ]
}

Last item omits item; position is 1-indexed.
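Both rules can be encoded in one small generator so no page ever emits them wrong. The Crumb type and breadcrumbJsonLd name are hypothetical, not existing 961tech code.

```typescript
// Hypothetical helper: build BreadcrumbList JSON-LD from an ordered crumb
// array. Positions are 1-indexed; a crumb without a URL (canonically the
// last one, the current page) omits the `item` property.
type Crumb = { name: string; url?: string };

function breadcrumbJsonLd(crumbs: Crumb[]) {
  return {
    "@context": "https://schema.org",
    "@type": "BreadcrumbList",
    itemListElement: crumbs.map((c, i) => ({
      "@type": "ListItem",
      position: i + 1, // 1-indexed per the spec
      name: c.name,
      ...(c.url !== undefined ? { item: c.url } : {}), // omit, not null
    })),
  };
}
```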

Format + placement

  • JSON-LD only. Google's stated preference. Microdata/RDFa accepted but inferior.
  • Server-rendered. AI crawlers (GPTBot, ClaudeBot, PerplexityBot) often skip JS. Render JSON-LD inside the server component (Next.js 16 app/.../page.tsx).
  • Placement: <script type="application/ld+json"> in <head> or end of <body> — both accepted by Google.
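One implementation detail worth pinning down in the follow-up ticket: serializing JSON-LD into a script tag needs `<` escaped, otherwise a product name containing `</script>` would terminate the tag early. A minimal sketch (jsonLdScriptBody is a hypothetical helper name; the escaping pattern itself is standard):

```typescript
// Serialize a JSON-LD object for embedding in a server-rendered
// <script type="application/ld+json"> tag. Escaping "<" as \u003c (still
// valid JSON) prevents any "</script>" in the data from breaking out.
function jsonLdScriptBody(data: unknown): string {
  return JSON.stringify(data).replace(/</g, "\\u003c");
}
```

In a server component this would feed something like `<script type="application/ld+json" dangerouslySetInnerHTML={{ __html: jsonLdScriptBody(productJsonLd) }} />` (shape to be confirmed against the Next.js 16 docs).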

Validation tooling:

  • Google Rich Results Test — Google-specific eligibility; will warn on AggregateOffer-only.
  • Schema Markup Validator — generic schema.org validation.

2.4 OpenGraph + Twitter Card

Universal across all pages — cheap, helps social previews, helps AI scrapers ground titles/descriptions/images consistently.

Minimum set for product detail:

<meta property="og:title" content="ASUS ROG STRIX RTX 4070 Super — Lebanese price comparison | 961tech" />
<meta property="og:description" content="Compare ASUS ROG STRIX RTX 4070 Super prices across 3 Lebanese retailers. $799–$865 USD. Real-time stock, last updated 2 hours ago." />
<meta property="og:type" content="product" />
<meta property="og:image" content="https://961tech.pages.dev/img/.../share.png" />
<meta property="og:url" content="https://961tech.pages.dev/p/..." />
<meta property="og:site_name" content="961tech" />
<meta property="og:locale" content="en_US" />
<meta name="twitter:card" content="summary_large_image" />

(og:locale = en_US for now per ADR-0004 English-only. When i18n revisits, this becomes per-page.)

Page types covered:

  • Product detail — full set including og:type: product and product-specific fields if Facebook Product Catalog is ever wired up
  • Build detail (saved/shared) — full set with og:type: article; image is the build's hero render once that ships (#9)
  • Retailer profile — full set with og:type: profile or business.business
  • Homepage + category landing — full set with og:type: website

Surface M1 M2 Deferred
OG title/description/url/image/site_name on every public page Ship
Twitter Card summary_large_image Ship
og:type: product on product detail Ship
Per-build social-share image render (#9) M2
Facebook Product Catalog feed Defer until paid ads

2.5 Page-content shape — citability beyond markup

This is where 961tech earns citation vs. just qualifying for it. Schema.org tells crawlers "this page is about X"; the first 500 tokens of prose tell the LLM why it should quote this page over the competing ten. Grounded in current AI-assistant behavior:

  • Crawlers ingest HTML as text. JSON-LD is text inside a <script> tag from their perspective; it helps because property names are self-explanatory, but the headline/lead/list structure of the visible page matters more for which sentence gets cited.
  • AI assistants prefer short, factual, time-stamped, attributable claims. (Hypothesis — observed pattern in citation outputs across Perplexity / Claude / Google AI Overview; no vendor doc states this explicitly.)
  • Lebanese-specific framing matters because the competing pages are global (PCPartPicker US prices, Geizhals EU prices, Pricena MENA-but-not-Lebanon); a query like "RTX 4070 in Beirut" gets unambiguously matched by a page that says "RTX 4070 in Lebanon ranges from $799 to $865 across 3 Beirut retailers as of 2026-04-28".

Patterns to ship

  1. First-paragraph-as-citation on every product detail page. First sentence answers what is this (canonical product name + brand + category). Second sentence answers what does it cost in Lebanon (price range + retailer count + currency). Third sentence answers where to buy it (retailer names, "in stock" affordances). All within ~500 visible tokens before any UI chrome. The product detail page redesign in #28 inherits this constraint.

  2. "As of <date>" stat block on the homepage. Visible prose, not just JSON-LD. Examples of the kind of cite-worthy assertion AI overview boxes quote:

    • "961tech tracks 1,759 SKUs across 3 Lebanese retailers as of 2026-04-28."
    • "Lebanese-market PC parts are predominantly USD-priced; RTX 4070 in Lebanon ranges from $799 to $865 across N listings on 2026-04-28."
    • "Macrotronics is the only major Lebanese retailer that displays prices VAT-inclusive; PCAndParts and 961Souq display VAT-exclusive."

Specific, verifiable, time-stamped, Lebanese-specific. These exist because they're the kind of factual claim an AI assistant grounds against — and no other page on the open web makes them. (See RFC-0009 decision 4 for whether the homepage actually surfaces this in M1.)

  3. Last updated <Nh ago> per listing row — visible, not buried in tooltip. Lets a user (and a citation engine) verify freshness without round-tripping. Stale-data warning when timestamp is >24h. Aligns with competitive-landscape.md §3.6 freshness pattern (Geizhals + PCPartPicker baseline) and counters Pricena's known "outdated price data" weakness.

  4. Lebanese-specific framing in prose — every category / retailer / build page mentions Lebanon explicitly in the first paragraph. Not "best PC parts comparison" but "Lebanon-specific PC parts price comparison covering Beirut retailers including PCAndParts, 961Souq, and Macrotronics." This is the phrase Perplexity/Claude/ChatGPT match against user queries containing "lebanon" / "beirut".

  5. Retailer attribution per price — "Source: PCAndParts, updated 2h ago" inline with each listing row, not in a tooltip. Per competitive-landscape.md §3.6.
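The first-paragraph pattern is mechanical enough to template. A sketch, assuming a hypothetical ProductSummary shape fed from canonical product + listing data (all names here are illustrative, not existing 961tech code):

```typescript
// Sketch: render the three-sentence first-paragraph citation pattern
// (what it is / what it costs in Lebanon / where to buy) from canonical
// product data. ProductSummary and its fields are hypothetical.
type ProductSummary = {
  name: string;
  brand: string;
  category: string;
  lowUsd: number;
  highUsd: number;
  retailers: string[];
  asOf: string; // ISO date of last scrape window
};

function firstParagraph(p: ProductSummary): string {
  const range =
    p.lowUsd === p.highUsd ? `$${p.lowUsd}` : `$${p.lowUsd} to $${p.highUsd}`;
  return (
    // Sentence 1: what is this (name + brand + category).
    `The ${p.brand} ${p.name} is a ${p.category} tracked by 961tech. ` +
    // Sentence 2: what does it cost in Lebanon (range + count + date).
    `In Lebanon it ranges from ${range} USD across ${p.retailers.length} retailers as of ${p.asOf}. ` +
    // Sentence 3: where to buy it.
    `Available from ${p.retailers.join(", ")}.`
  );
}
```

Templating it keeps the claim specific, time-stamped, and Lebanese-framed on every product page without relying on per-page editorial discipline.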

Surface M1 M2 Deferred
First-paragraph-as-citation on product detail Ship (constraint into #28)
Last updated <Nh ago> per listing row Ship
Lebanese-specific framing in prose on every category/landing Ship
Retailer attribution per price (visible) Ship
Homepage "as of <date>" stat block (RFC-0009 decision)
Per-product 1-paragraph editorial intro generated from canonical specs M2 candidate
Per-build summary prose ("This $1,200 1080p build covers...") M2 candidate (links to #9, #13)

2.6 Machine-readable feeds

sitemap.xml — universal

Genre baseline: every working aggregator surveyed has one. Newegg has 8 (including a ProductListKeywords_USA.xml for search terms). PCPrices' SPA-fallback /sitemap.xml is an unforced error.

Recommendation: ship one in M1. Single /sitemap.xml (Next.js 16 app/sitemap.ts) covering homepage, all category pages, all retailer profile pages once they exist, all product detail pages, all docs pages. Reference it from robots.txt.
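A sketch of that file under the same caveat as robots.ts: the local types approximate Next's MetadataRoute.Sitemap and must be verified against the Next.js 16 docs, and the slug arrays are placeholders standing in for the real catalog query.

```typescript
// Sketch of src/app/sitemap.ts. Local SitemapEntry type approximates
// Next's MetadataRoute.Sitemap — verify the real shape before shipping.
// Category slugs are placeholders; the shipped file would pull slugs
// for categories, products, retailers, and docs from the catalog.
type SitemapEntry = { url: string; lastModified?: Date };

const BASE = "https://961tech.pages.dev";

export default function sitemap(): SitemapEntry[] {
  const categories = ["cpu", "gpu", "motherboard"]; // placeholder slugs
  return [
    { url: `${BASE}/` },
    { url: `${BASE}/parts` },
    ...categories.map((c) => ({ url: `${BASE}/parts/${c}` })),
    // ...product detail, retailer profile, and docs pages appended
    // from the real catalog at build/request time.
  ];
}
```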

RSS / JSON Feed / public REST API

Genre survey: none of 12 peers expose RSS or any public machine-readable feed. Every /feed, /rss, /feed.xml, /rss.xml probed returned 404 / 403 / SPA HTML.

961tech recommendation:

  • No public API in M1/M2. Inviting machine-readable scraping of the entire catalog before we have a monetisation hedge is asymmetric — competitors (if any emerge) can pull our data wholesale. Scraping retailer feeds is our differentiation; we don't hand the same to a hypothetical follower.
  • RSS for price drops — natural fit once #14 price drop alerts ships. A /feed/price-drops.rss carrying the last 50 price drops is cheap, useful for power users (and bots), and doesn't expose the whole catalog. Defer to M2 alongside #14.
  • Per-product structured-data shadow URLs (/p/[slug].md returning the rendered product page as text/markdown) — interesting for llms.txt consumers but defer until we see a documented consumer.

Surface M1 M2 Deferred
/sitemap.xml Ship
Sitemap: reference in robots.txt Ship
/feed/price-drops.rss M2 (with #14)
/p/[slug].md shadow URLs Defer
Public REST API (/api/v1/products, etc.) Defer to M3+ once monetisation is settled
Facebook Product Catalog feed Defer to paid ads era

3. What AI assistants actually do — grounded in vendor docs

This section is all Hypothesis-grade unless explicitly tagged otherwise. No major AI assistant publishes a complete grounding pipeline. Recommendations on content shape follow from observed citation behavior + the structural fact that all crawlers ingest HTML as text.

Assistant Crawler stack What is documented What's Hypothesis
ChatGPT (browse) ChatGPT-User for live fetch; OAI-SearchBot for index; GPTBot for training Vendor-stated: respects robots.txt for GPTBot + OAI-SearchBot. ChatGPT-User "rules may not apply" because user-initiated. Citation source = the live-fetched page text including JSON-LD as text. No public statement that JSON-LD is parsed structurally.
Claude (web tool) Claude-User for live fetch; Claude-SearchBot for index; ClaudeBot for training Vendor-stated: all bots respect robots.txt. Same — text ingestion; structured-data parsing not documented.
Perplexity PerplexityBot for index; Perplexity-User for live fetch Vendor-stated: index respects robots.txt; user-initiated "generally ignores." Explicitly NOT used for training. Major shopping/comparison citation surface. (Hypothesis — observed citation pattern; no vendor commitment.)
Google AI Overviews Googlebot only — no separate UA Vendor-stated: AI Overviews layer on top of standard Search index. Google-Extended controls Gemini training but NOT AI Overviews. Schema.org rich-result eligibility transitively benefits AI Overview eligibility. (Hypothesis — Google doesn't publish AI Overview ranking signals.)
Apple Intelligence Applebot for index; Applebot-Extended for training opt-out Vendor-stated: data may feed Apple foundation models unless Applebot-Extended is disallowed. Lebanese iPhone share is meaningful; Siri/Spotlight grounding flows through Applebot. (Hypothesis — adoption data not public.)
Meta AI / WhatsApp AI meta-webindexer for AI search; meta-externalfetcher for on-demand; facebookexternalhit for link previews Vendor-stated: "allowing Meta-WebIndexer helps us cite and link to your content in Meta AI's responses." WhatsApp link previews via facebookexternalhit are a Lebanese-specific channel — Lebanese commerce is WhatsApp-heavy (personas.md §5.5). (Hypothesis on volume.)
DuckDuckGo (DuckAssist) DuckAssistBot Vendor-stated: respects robots.txt, ~72h propagation, NOT used for training. Smaller share; relevant for privacy-conscious cohort.

The honest summary. Every AI assistant's crawler stack is documented. Every assistant's grounding behavior (what makes a page get cited) is not. We optimize for the textually-obvious things — crawlers can read the page, the page is fast, the first paragraph is factual and time-stamped, JSON-LD is present and well-formed, and the URL is stable — and let the rest follow. Recommendations in §2 are calibrated to this honest uncertainty.

4. M1 / M2 / deferred summary

Single source-of-truth table. Every recommendation in §2 surfaced here.

M1 (this milestone — implementation ticket pending)

  • robots.txt allowing all AI UAs (training, AI-search index, on-demand) + disallowing /api/go/ + Sitemap: directive
  • /sitemap.xml covering homepage, category pages, product detail pages, docs pages
  • /llms.txt curated index (≤5KB)
  • Schema.org JSON-LD on product detail: Product + AggregateOffer (with nested Offer[]) + Brand + BreadcrumbList + additionalProperty for compat-relevant specs
  • Offer.availability mapping including MadeToOrder for "Call For Price" listings
  • Offer.priceValidUntil set to next-scrape-window-end
  • BreadcrumbList JSON-LD on category + product + build detail pages
  • OpenGraph + Twitter Card on every public page
  • First-paragraph-as-citation prose pattern on product detail (constraint into #28)
  • Last updated <Nh ago> visible per listing row
  • Lebanese-specific framing in prose on every category/landing
  • Retailer attribution per price (visible)

M2

  • LocalBusiness JSON-LD on retailer profile pages (depends on #10)
  • Homepage WebSite + Organization + SearchAction JSON-LD
  • ItemList JSON-LD on category listing pages
  • Per-build social-share image render (depends on #9)
  • /feed/price-drops.rss (depends on #14)
  • .md shadow URLs for docs pages (candidate, not committed)
  • Per-product 1-paragraph editorial intro from canonical specs (candidate)
  • Per-build summary prose (candidate, depends on #9, #13)

Deferred

  • Review / AggregateRating markup — until 961tech has first-party reviews (M3+, gated on review submission flow)
  • /llms-full.txt — until docs are stable
  • Public REST API — until monetisation is settled (M3+)
  • Facebook Product Catalog feed — paid ads era
  • Cloudflare "Block AI training" managed rule — until specific scraping abuse
  • Per-product /p/[slug].md shadow URLs — until we observe a documented consumer
  • Schema.org Dataset type for homepage stats — never (wrong type per Google docs)

5. Comparable aggregators' posture (April 2026 snapshot)

Cross-reference table for competitive-landscape.md §3.6. Verbatim from each site's live robots.txt (or Wayback for CF-walled sites). Fetched 2026-04-28.

Site AI training UAs AI-search/assistant UAs llms.txt Sitemap RSS
PCPartPicker All major training UAs blocked (CF-managed: GPTBot, ClaudeBot, CCBot, Google-Extended, Bytespider, Amazonbot, Applebot-Extended, meta-externalagent) + Content-Signal: search=yes,ai-train=no None blocked 404 yes (403 direct) none
Geizhals Only meta-externalagent blocked None blocked 403 yes none
Idealo Applebot-Extended blocked with surgical path-allowlist (/unternehmen, /legal/, /magazin); omgilibot same None of GPTBot/ClaudeBot in fetched range 503 (CF) not checked not checked
Skroutz All major training UAs blocked (ClaudeBot, anthropic-ai, CCBot, Bytespider, Amazonbot, PetalBot) Tiered: OAI-SearchBot, GPTBot, ChatGPT-User, Google-Extended, PerplexityBot get Allow:/$ + HTML-only allowlist (/c/*.html$, /s/*/*.html$); parameterized URLs disallowed 403 yes none
Pricena None None 404 yes (points at HTML page, not XML) none
EG-PC None None 404 yes none
EGPrices None None 403 yes none
PCPrices (SPA fallback — no real robots.txt) (SPA fallback) (SPA fallback) (SPA fallback) none
BuildMyPC All major training UAs blocked (CF-managed, byte-identical to PCPartPicker) None 404 yes none
Logical Increments (No real robots.txt) 403 unknown unknown
Newegg None — zero AI directives None 404 yes (8 sitemaps) none
LDLC None — zero AI directives None 404 7 sitemaps (one per locale) none real

Cross-cutting findings (informs RFC-0009 robots.txt decision):

  1. No consensus posture. Plurality (5/12) does nothing about AI bots. 2/12 blanket-block via Cloudflare-managed rule. Only Skroutz hand-tiers.
  2. Skroutz's tiered model is the most thoughtful — block training, allow AI-search/assistant on HTML-only, deny on parametric search. Worth revisiting once 961tech is large enough for nuance to matter.
  3. llms.txt adoption in genre is zero. First-mover opportunity (or tells us the format is genre-irrelevant — both possible).
  4. RSS / public feeds in genre is zero. Sitemaps carry the load; Newegg's 8-sitemap fan-out is the most ambitious.
  5. Universal pattern: hide the click-out redirector. Every aggregator disallows it (/redir/ Geizhals, /preisvergleich/Relocate/ Idealo, /partenaire/ LDLC, /m/...ajax/...storageApi Newegg). 961tech's /api/go/ follows the same logic.

6. Open questions

Surfaced for RFC-0009 decision; not resolved here.

  1. robots.txt posture conflict with #41 monetisation. Blocking AI training crawlers (GPTBot, ClaudeBot, Google-Extended, Applebot-Extended) closes a future B2B data-licensing revenue stream. Allowing them gives data away free for training. Today, neither path is monetised; the citation path is what matters. Surfaced as RFC-0009 Open Question.
  2. Schema.org "Call For Price" mapping. MadeToOrder is the closest canonical; LimitedAvailability is a fallback; omitting the Offer is the strict-honest path. ~78% of one retailer's CPU listings are in this state. RFC-0009 decision.
  3. Homepage "as of <date>" stat block — does it ship in M1, or wait for #28 page design? Cross-cuts page design. RFC-0009 decision.
  4. /sitemap.xml — should retailer profile pages be in M1 even though #10 hasn't shipped? Or do we ship sitemap.xml with what exists today (homepage + product pages + docs) and extend?
  5. Cloudflare Bot Management. #44 security inherits this. The relevant question for this doc: do we expect Cloudflare's "Block AI training" managed rule to become part of our posture, and if so, when does that trigger?
  6. English-only constraint vs. Arabic SERP grounding. Per ADR-0004, 961tech ships English-only through M2. AI assistants asked Arabic queries ("قطع كمبيوتر بيروت") may have less to ground from on our pages. Mitigated by the search-input Arabic-tolerance per RFC-0003, but the grounding text is English. Worth probing in M2 telemetry per personas.md §7 Arabic-cohort drift signal.

7. See also

Sources cited (canonical):