
RFC-0009: AI discoverability (LLM-citable surface)

Decision gate. This RFC carries five decisions about how 961tech becomes citable by ChatGPT / Claude / Perplexity / Google AI Overviews / Apple Intelligence. The reference doc (ai-discoverability.md) has the per-surface evidence; this RFC has the questions for MASTER.

Summary

Five decisions, each grounded in docs/reference/ai-discoverability.md:

  1. robots.txt posture — recommend fully-open for M1 (allow training, AI-search index, and on-demand UAs). Cite-traffic upside > training-licensing downside at our scale.
  2. llms.txt — recommend ship one at /llms.txt (≤5KB curated index). First-mover in the price-aggregator vertical; cost-to-skip ≈ cost-to-ship.
  3. Schema.org coverage — recommend all product pages get Product + AggregateOffer (with nested Offer[]) + BreadcrumbList in M1. Retailer pages get LocalBusiness in M2 (depends on #10). Skip Review / AggregateRating until 961tech has first-party reviews — Google policy explicitly forbids aggregating from other sites.
  4. Homepage "as of <date>" stat block — recommend ship in M1 as plain prose (not Dataset schema — that's the wrong type per Google docs). Specific, time-stamped, Lebanon-specific assertions are exactly the cite-bait AI assistants ground from.
  5. Machine-readable feeds — recommend /sitemap.xml ship M1 (universal genre baseline); RSS for price drops M2 (with #14); no public REST API through M2 (asymmetric — invites scraping by future competitors before we have a moat hedge).

Two decisions surface as Open Questions for MASTER (cannot be defaulted):

  • (a) Does the open robots.txt posture conflict with #41 monetisation's future B2B AI-data-licensing revenue stream?
  • (b) For ~78% "Call For Price" CPU listings on 961Souq, does schema.org's availability: MadeToOrder (no price), or omitting the Offer entirely, better match how MASTER wants the catalog represented?

Motivation

961tech competes for citation traffic against effectively-no-one in Lebanese MENA AI surfaces. Pricena explicitly skips Lebanon (competitive-landscape.md §4.1); SERP for Lebanese-language PC-parts queries returns no Lebanese aggregator (competitive-landscape.md §4.6). Every Lebanese user who asks "where's the cheapest RTX 4070 in Beirut" of ChatGPT / Claude / Perplexity should land on a 961tech citation — and won't, unless we ship the surface that makes that citation possible.

The question is which surfaces. Five named decisions, surveyed in the reference doc, distilled here.

The work has to land before #28 page design starts because page design has constraints that fall out of these decisions:

  • The first-paragraph-as-citation pattern is a content constraint on the product detail layout.
  • The per-listing-row "Last updated <Nh ago>" indicator is a UI constraint.
  • The homepage "as of <date>" stat block is a homepage layout constraint.
  • The retailer profile page (M2, #10) needs the LocalBusiness schema scaffolding designed in.

Doing the AI-discoverability work after #28 means rewriting the design.

Proposal

Five decisions follow. Each names the recommendation, the trade-off, and the alternatives considered.

Decision 1: robots.txt posture — fully-open in M1

Recommendation: open posture for all three AI-crawler classes (training, AI-search index, on-demand). Block nothing. Disallow /api/go/ (the click-out redirector — universal genre pattern).

The robots.txt template (final wording in implementation ticket). The named AI UAs are listed explicitly to document the posture, but all of them share a single rule group: under RFC 9309 a crawler honours only the most specific group matching its user-agent, so giving each named UA its own Allow-only group would silently exempt those crawlers from the /api/go/ disallow in the * group.

User-agent: GPTBot
User-agent: ClaudeBot
User-agent: anthropic-ai
User-agent: Claude-Web
User-agent: Google-Extended
User-agent: Applebot-Extended
User-agent: meta-externalagent
User-agent: OAI-SearchBot
User-agent: Claude-SearchBot
User-agent: PerplexityBot
User-agent: meta-webindexer
User-agent: ChatGPT-User
User-agent: Claude-User
User-agent: Perplexity-User
User-agent: DuckAssistBot
User-agent: meta-externalfetcher
User-agent: facebookexternalhit
User-agent: Applebot
User-agent: Googlebot
User-agent: Bingbot
User-agent: *
Allow: /
Disallow: /api/go/

Sitemap: https://961tech.pages.dev/sitemap.xml

Why open. Per ai-discoverability.md §2.1:

  • Citation traffic is the entire monetisation funnel for the AI-search era. Every blocked UA is a closed citation pathway.
  • Genre survey: 5/12 peers do nothing about AI bots (Newegg, LDLC, Pricena, EG-PC, EGPrices). 2/12 blanket-block (PCPartPicker, BuildMyPC) — but those are 5M+ MAU sites that can afford to assert rights against scraping. 961tech is not at that scale.
  • "Block training, allow citation" is Hypothesis-grade defensible (Skroutz does it) but blocking GPTBot / ClaudeBot carries non-zero risk of also dropping us from those vendors' search/citation indexes — vendor docs distinguish training from citation, vendor behavior sometimes doesn't. Open posture eliminates that risk.
  • Once scraping abuse emerges (as it will if we're worth scraping — see #44), Cloudflare's "Block AI training" managed rule is a one-toggle response.
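
As a sanity check on the template's shape, here is a small serializer sketch. The shipped file would be src/app/robots.ts per the implementation plan; renderRobots, RuleGroup, and the truncated UA list below are illustrative names for this example only, not the implementation ticket's code.

```typescript
// Illustrative sketch only; the real implementation is src/app/robots.ts
// in the follow-up ticket. RuleGroup/renderRobots are hypothetical names.
interface RuleGroup {
  userAgents: string[]; // one group may carry several User-agent lines (RFC 9309)
  allow?: string[];
  disallow?: string[];
}

function renderRobots(groups: RuleGroup[], sitemap: string): string {
  const body = groups
    .map((g) =>
      [
        ...g.userAgents.map((ua) => `User-agent: ${ua}`),
        ...(g.allow ?? []).map((p) => `Allow: ${p}`),
        ...(g.disallow ?? []).map((p) => `Disallow: ${p}`),
      ].join("\n"),
    )
    .join("\n\n");
  return `${body}\n\nSitemap: ${sitemap}\n`;
}

// Open posture: named AI UAs and "*" share one group, so the /api/go/
// disallow applies to every crawler. (Remaining named UAs elided here.)
const robotsTxt = renderRobots(
  [
    {
      userAgents: ["GPTBot", "ClaudeBot", "PerplexityBot", "Applebot", "*"],
      allow: ["/"],
      disallow: ["/api/go/"],
    },
  ],
  "https://961tech.pages.dev/sitemap.xml",
);
```

A single shared group also keeps the Cloudflare fallback clean: toggling the managed "Block AI training" rule at the edge never has to reconcile with per-UA exceptions in the file.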

Decision 2: /llms.txt — ship one in M1

Recommendation: ship a curated /llms.txt (≤5KB) with H1 + blockquote + three real H2 sections (Browse / Reference / Project) + the spec's ## Optional section. Skip /llms-full.txt and .md shadow URLs for now.

Suggested initial content per ai-discoverability.md §2.2.

Why ship.

  • Cost to ship is one PR (single curated markdown file at root, no JS, no schema, no ongoing maintenance until the docs site grows).
  • Genre adoption is zero. First-mover slot. If the format becomes a real grounding signal (for any of: Cursor-style coding agents, agentic browse-and-shop tools, Perplexity scaffolds), 961tech is in. If it stays dev-tools-only forever, we wasted ~2 hours.
  • Producers who do ship one (Anthropic, Vercel, Stripe, Supabase, Cloudflare, Hono, Cursor) signal that the format is at least credible enough to be worth the time. Anthropic publishing one suggests they think it matters; Anthropic not documenting Claude reading third-party llms.txt suggests it doesn't quite yet.
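
A plausible shape for the curated file, following the spec's H1 / blockquote / H2-section layout named above. The paths, titles, and blurbs below are illustrative placeholders, not the suggested content from ai-discoverability.md §2.2:

```markdown
# 961tech

> Lebanese PC-parts price aggregator: live prices across Lebanese retailers,
> normalized SKUs, and a parts-compatibility database.

## Browse

- [Categories](https://961tech.pages.dev/categories): GPU, CPU, and other part listings
- [Products](https://961tech.pages.dev/products): per-SKU price comparison pages

## Reference

- [Docs](https://961tech.pages.dev/docs): how matching and price refresh work

## Project

- [About](https://961tech.pages.dev/about): who runs 961tech and why

## Optional

- [Changelog](https://961tech.pages.dev/changelog): recent catalog and feature changes
```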

Decision 3: Schema.org coverage scope — all product pages M1, retailer pages M2, no Review/Rating

Recommendation:

  • M1 — every product detail page gets Product + AggregateOffer (containing nested Offer[]) + Brand + BreadcrumbList + additionalProperty for compat-relevant specs (socket, TDP, VRAM, etc.). Per-Offer.availability mapped to canonical ItemAvailability enum, including MadeToOrder for "Call For Price" listings.
  • M1 — every category page + product detail + build detail gets BreadcrumbList.
  • M2 — every retailer profile page (depends on #10) gets LocalBusiness + PostalAddress + OpeningHoursSpecification + sameAs social links.
  • M2 — homepage gets WebSite + Organization + SearchAction.
  • Defer indefinitely (until 961tech has first-party reviews): Review + AggregateRating. Google's review-snippet policy explicitly forbids aggregating reviews from other sites. Faking the count gets a manual action.
  • Skip entirely: Dataset markup for homepage stats. Wrong type per Google docs (Dataset is for downloadable datasets, not editorial prose summaries).

Confidence-high data only? No — emit JSON-LD for every product whose canonical Product row exists in the matcher's matched state. For unmatched/weak listings (matchStatus = 'unmatched' / 'weak' per data-model.md), emit no JSON-LD; surface them only as plain HTML. This naturally avoids "fake-default" hazards (we don't ship structured data for entries we're not confident in).

Why this scope.

  • Product pages are the primary citation surface. AI assistants asked about a specific GPU/CPU need ground truth on price + availability + retailer — the JSON-LD answers that question without LLM inference.
  • LocalBusiness for retailer pages waits on #10, which is M2. No reason to ship the markup before the page exists.
  • Review / AggregateRating is forbidden by Google policy without first-party reviews. The decision is not "should we" but "Google says we may not." Surface this finding to MASTER explicitly so it's not assumed away.
  • additionalProperty for compat specs is the schema.org-native way to expose what makes 961tech different — the compat DB. Costs ~5 fields per product, gives AI assistants something nobody else exposes.
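
A minimal sketch of what the src/lib/structured-data.ts composer could emit. productJsonLd is the name used in the implementation plan, but the CanonicalProduct/Listing shapes here are invented for the example, not data-model.md's schema:

```typescript
// Illustrative sketch only; record shapes are hypothetical, not data-model.md's.
interface Listing {
  retailer: string;
  url: string;
  priceUsd: number | null; // null = "Call For Price"
  inStock: boolean;
}

interface CanonicalProduct {
  name: string;
  brand: string;
  sku: string;
}

// Per-Offer availability mapping, including MadeToOrder for Call For Price
// (Decision 3 / Open Question (b), option 1).
function offerAvailability(l: Listing): string {
  if (l.priceUsd === null) return "https://schema.org/MadeToOrder";
  return l.inStock
    ? "https://schema.org/InStock"
    : "https://schema.org/OutOfStock";
}

function productJsonLd(p: CanonicalProduct, listings: Listing[]) {
  const prices = listings
    .map((l) => l.priceUsd)
    .filter((x): x is number => x !== null);
  // Real code should guard the all-Call-For-Price case (prices.length === 0).
  return {
    "@context": "https://schema.org",
    "@type": "Product",
    name: p.name,
    sku: p.sku,
    brand: { "@type": "Brand", name: p.brand },
    offers: {
      "@type": "AggregateOffer",
      priceCurrency: "USD",
      offerCount: listings.length,
      lowPrice: Math.min(...prices),
      highPrice: Math.max(...prices),
      // Nested per-retailer offers; price/priceCurrency omitted entirely
      // for Call-For-Price listings.
      offers: listings.map((l) => ({
        "@type": "Offer",
        url: l.url,
        seller: { "@type": "Organization", name: l.retailer },
        availability: offerAvailability(l),
        ...(l.priceUsd !== null
          ? { price: l.priceUsd, priceCurrency: "USD" }
          : {}),
      })),
    },
  };
}
```

Unmatched/weak listings never reach this composer under the confidence rule above; the server component then serializes the returned object into the page's JSON-LD script tag.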

Decision 4: Homepage "as of <date>" stat block — ship in M1 as plain prose

Recommendation: ship in M1 as visible HTML prose, not JSON-LD Dataset. Three to five Lebanese-specific time-stamped assertions on the homepage, refreshed at scrape-window cadence:

961tech tracks 1,759 SKUs across 3 Lebanese retailers as of 2026-04-28.
RTX 4070 Super in Lebanon ranges from $799 to $865 across 3 listings.
Last refreshed 2 hours ago.

(Specific copy in #28 page design; the constraint is "specific, verifiable, Lebanese-specific, time-stamped, ≤500 tokens of citable prose at the top of the homepage.")

Why now, not #28. Two reasons:

  1. The constraint exists independent of the visual design. Defining it in M1 means #28 inherits it as a content requirement; deferring it means we ship a homepage and then re-litigate it.
  2. AI-citation rate (per #43 KPIs) is meaningless without something to cite from. The stat block is the load-bearing prose for "did the AI assistant land on us and quote a real fact?"

Why prose, not Dataset markup. Per ai-discoverability.md §11, Dataset is for downloadable datasets (CSV/scientific data). Editorial stat summaries are not datasets in the schema.org sense. Mis-marking them gets the page flagged. Plain prose with a <time datetime="2026-04-28T14:30Z"> tag for the timestamp is correct.
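
The stat-block rendering can be sketched as a small helper. statBlockHtml is a hypothetical name and the copy is placeholder text (final wording belongs to #28 page design; pluralization and locale polish elided):

```typescript
// Illustrative sketch only; final copy and markup belong to #28 page design.
function statBlockHtml(
  skuCount: number,
  retailerCount: number,
  refreshedAt: Date,
): string {
  const iso = refreshedAt.toISOString();
  // Coarse hour bucket for the "Last refreshed Nh ago" line.
  const hoursAgo = Math.round((Date.now() - refreshedAt.getTime()) / 3_600_000);
  return (
    `<p>961tech tracks ${skuCount.toLocaleString("en-US")} SKUs across ` +
    `${retailerCount} Lebanese retailers as of ` +
    `<time datetime="${iso}">${iso.slice(0, 10)}</time>. ` +
    `Last refreshed ${hoursAgo} hours ago.</p>`
  );
}
```

The inputs are exactly the cheap queries named in the trade-off table: a count on Product, a count on Retailer, and max(Listing.lastSeenAt).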

Decision 5: Machine-readable feeds — sitemap M1, RSS M2, no public API

Recommendation:

  • M1: ship /sitemap.xml covering homepage + all category pages + all product detail pages + all docs pages. Reference from robots.txt with Sitemap: directive.
  • M2: ship /feed/price-drops.rss alongside #14 price drop alerts. Carries last 50 price drops with product link, old/new price, retailer, timestamp.
  • Defer through M2: no public REST API. No /api/v1/products, no JSON catalog endpoint, no Facebook Product Catalog feed. Inviting machine-readable scraping of the entire catalog before we have a monetisation moat is asymmetric — competitors can pull our normalized matcher output (which is the differentiator) wholesale.

Why.

  • Sitemap is the genre baseline (every working aggregator has one; PCPrices' SPA-fallback /sitemap.xml is an unforced error per the reference doc survey).
  • RSS for price drops is a power-user / agentic-bot feature with negligible scraping risk (50-row sliding window doesn't expose the catalog).
  • Public API would speed up adoption by hypothetical competitors and by useful agents — but the asymmetry favors caution at our scale. M3+ revisit once monetisation is settled.
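
For shape only, a self-contained sketch of the sitemap output. The real file would be src/app/sitemap.ts per the implementation plan, where Next.js serializes the XML for you; sitemapXml and the route list here are illustrative:

```typescript
// Illustrative sketch; in the real project Next.js's sitemap convention
// handles serialization, so this hand-rolled builder is for shape only.
const BASE = "https://961tech.pages.dev";

function sitemapXml(paths: string[], lastModIso: string): string {
  const urls = paths
    .map(
      (p) => `  <url><loc>${BASE}${p}</loc><lastmod>${lastModIso}</lastmod></url>`,
    )
    .join("\n");
  return (
    `<?xml version="1.0" encoding="UTF-8"?>\n` +
    `<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n` +
    `${urls}\n` +
    `</urlset>\n`
  );
}

// Homepage + category + product detail + docs pages, per Decision 5.
const xml = sitemapXml(
  ["/", "/categories/gpu", "/products/rtx-4070-super", "/docs"],
  "2026-04-28",
);
```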

Trade-offs

Each decision has costs. Surfacing them honestly so MASTER can override.

Decision 1 (open robots.txt)

| Cost | What it buys |
| --- | --- |
| Free training data for OpenAI / Anthropic / Google / Meta / Apple; we get nothing back if they monetise it. | Open citation pathways across every major AI assistant. The genre's plurality (5/12 peers) does this; the only blanket-blockers are 5M+ MAU sites that can afford to assert rights. |
| Possibly locks us out of a hypothetical future B2B AI-data-licensing revenue stream, which would exist only if we had declared rights early. (Not declaring rights at scrape time may not foreclose later licensing — see Open Question (a).) | Simplicity. One robots.txt template, no per-UA path-allowlist surgery (Skroutz's tiered model). |

Decision 2 (ship llms.txt)

| Cost | What it buys |
| --- | --- |
| ~2 hours initial + occasional updates as docs evolve. Format is informal (no IETF / W3C track), could be obsolete in 18 months. | First-mover in the price-aggregator vertical. Asymmetric — even one agentic scaffold probing /llms.txt for "compare GPU prices in Lebanon" gives us a cleaner card than competitors. |
| Adds a surface to maintain (broken llms.txt is worse than no llms.txt — wrong message). | Free signaling that the project is technically thoughtful — not a 961tech revenue moat, but a small reputational asset. |

Decision 3 (schema.org scope)

| Cost | What it buys |
| --- | --- |
| One reusable <script type="application/ld+json"> component + per-page composers; ~1 day of code work plus tests. | Google Product-snippet eligibility. AI-assistant grounding payload (price range + retailer count + availability). Compat specs surfaced in a structured way that no other Lebanese retailer ships. |
| LocalBusiness schema for retailer pages encodes a third party's data; Google may not show rich results on our domain. | Even without rich results, AI assistants ground from the markup; useful in M2. |
| No Review / AggregateRating until first-party reviews exist — could be a year or more. | Avoids a Google manual action. |

Decision 4 (homepage stat block)

| Cost | What it buys |
| --- | --- |
| Constrains #28 homepage layout — must reserve a top-of-page block for the stats (~80px on mobile, ~120px on desktop). | The most cite-worthy assertions on any Lebanese PC-parts page. AI Overview boxes love specific time-stamped facts. |
| Live data computation cost — the homepage needs to render the SKU count + retailer count + last-refresh timestamp on every request (or be cached on a short window). | Cheap to compute (count queries on Product + Retailer + max(Listing.lastSeenAt)). |

Decision 5 (no public API)

| Cost | What it buys |
| --- | --- |
| Disappoints power users + agentic tools that want JSON; they have to scrape HTML / JSON-LD instead. | Asymmetric protection: our matcher output (the canonical Product DB) is the moat per competitive-landscape.md §3.1; we're not handing it out for free. |
| Sitemap and RSS are non-trivial to implement (Next.js 16 app/sitemap.ts; custom RSS endpoint). | Sitemap is the genre baseline; RSS gates price-drop traffic naturally. |

Alternatives

For each decision, what was considered and rejected.

Decision 1 alternatives

A. Block all AI training, allow AI-search index + on-demand (Skroutz tiered model). Rejected for M1. Vendor docs distinguish training from citation, but vendor behavior sometimes conflates them — blocking GPTBot may also reduce our presence in ChatGPT search citations even though OAI-SearchBot is allowed. At our scale, the citation upside outweighs the licensing-rights principle. Revisit when 961tech is large enough that not asserting rights becomes a meaningful loss (M3+ or earlier if #41 monetisation opens a B2B-data-licensing path).

B. Strict allow-list (only OAI-SearchBot, Claude-SearchBot, PerplexityBot, Googlebot, Applebot). Rejected. Cuts off meta-webindexer (Meta AI / WhatsApp citations are Lebanese-relevant — see personas.md §5.5 WhatsApp commerce). Also creates ambiguous behavior for legacy / undocumented Anthropic UAs (Claude-Web, anthropic-ai).

C. Cloudflare-managed "Block AI bots" rule (PCPartPicker + BuildMyPC byte-identical block). Rejected for M1 — same reasoning as (A) but harder to revert because the toggle is at the edge.

D. Paid-only access (require API key for any non-search-engine UA). Rejected for M1 — kills citation entirely. Could be a M3+ Stage-4 partner offering (#41) for retailer integration, not for AI assistants.

Decision 2 alternatives

A. Skip llms.txt because adoption in genre is zero. Rejected. Genre adoption being zero is the first-mover argument. Cost to ship is one PR; cost to skip is the same; only one direction has any upside.

B. Ship full /llms-full.txt with concatenated docs. Rejected for M1 — high maintenance burden, only worth it once docs are stable. M2 candidate at most.

C. Ship .md shadow URLs for every page (Howard's second proposal). Rejected for M1 — doable in Next.js 16 but real engineering work. M2 candidate; defer until we observe a documented consumer fetching them.

Decision 3 alternatives

A. Ship Review / AggregateRating with default values or aggregated retailer reviews. Rejected — Google policy explicitly forbids both. Manual action risk. The absence here is decided, not deferred-by-laziness.

B. Confidence-only: emit JSON-LD only for products with matchStatus = 'matched' and matchConfidence > 0.85. Acceptable refinement; soft-recommend the lower bar of matchStatus = 'matched' (which already requires > 0.7 per data-model.md Invariant 3) so we don't double-define thresholds. If matcher noise causes false-positive JSON-LD, raise the bar.

C. Use AggregateOffer only, no nested Offer[]. Rejected per ai-discoverability.md §2.3 — shipping both shapes (AggregateOffer wrapping nested Offer[]) maximizes compatibility with Google and AI-assistant grounding.

D. Use individual Offer[] only, no AggregateOffer wrapper. Rejected — schema.org explicitly endorses AggregateOffer for the multi-merchant case; we lose the lowPrice/highPrice/offerCount summary fields that AI assistants quote directly.

Decision 4 alternatives

A. Defer to #28 page design. Rejected — described above. The constraint must exist before #28 starts.

B. Mark up the stat block with Dataset schema. Rejected — wrong type per Google docs.

C. Mark up the stat block with WebPage + mainEntity of type QuantitativeValue. Rejected as over-engineering; AI assistants ground from visible prose more reliably than from speculative property paths.

Decision 5 alternatives

A. Ship a public REST API in M1. Rejected — described above. Asymmetric scraping risk.

B. Ship Facebook Product Catalog feed in M1. Rejected — only useful with paid ads, which is M3+.

C. Ship a JSON Feed (modern alternative to RSS) for price drops. Acceptable refinement; if the implementer prefers JSON Feed over RSS, both serve the same purpose. RSS has wider tooling support.

Open questions

Surfaced to MASTER. Both are real decisions, not stylistic.

(a) robots.txt posture vs. #41 monetisation's future B2B data-licensing

Question. Allowing GPTBot / ClaudeBot / Google-Extended / Applebot-Extended / meta-externalagent (training-class UAs) means OpenAI / Anthropic / Google / Apple / Meta can train on 961tech's normalized matcher output without compensation. If #41 monetisation opens a B2B data-licensing path (selling normalized-MENA-PC-pricing data to AI vendors or to retailers), having scraped the data freely first may weaken our negotiating position.

The opposing argument: vendors don't typically pay for data they could have scraped, but they do sometimes pay for data they can't (gated, structured, real-time, with provenance). The catalog-as-licensable-product play is independent of robots.txt; the API behind it is what they'd pay for, not the HTML.

Options for MASTER:

  1. Open posture (recommended). Allow everything. Maximize citation traffic now. If a B2B-data-licensing path opens later, build the licensable product behind a gated API and let the open HTML continue. The two surfaces don't conflict: HTML for citation, API for licensing.
  2. Block training UAs only (GPTBot / ClaudeBot / Google-Extended / Applebot-Extended / meta-externalagent); allow AI-search and on-demand. Skroutz tiered model. Preserves citation pathways while declaring rights for future licensing. Costs a small but non-zero amount of citation traffic if vendors conflate training and citation in practice.
  3. Block everything not explicitly in our allow-list. Maximally protective; minimal citation. Wrong for our scale but the cleanest licensing posture.

(b) Schema.org availability for "Call For Price" listings

Question. ~78% of 961Souq's CPU listings are "Call For Price" with no public price (retailers.md §2.2). Schema.org has no QuoteOnRequest enum value. Three honest options:

  1. availability: https://schema.org/MadeToOrder, omit price and priceCurrency, keep url + seller (recommended). Closest canonical semantics ("produced/quoted on demand"). Honest, valid, lets AI assistants surface "available, contact retailer for pricing."
  2. availability: https://schema.org/LimitedAvailability, omit price and priceCurrency. Acceptable fallback; weaker semantic match.
  3. Omit the Offer entirely. Strictest; the SKU disappears from our schema.org coverage on those listings. Worst for citation surface (an AI assistant asking "is X available in Lebanon" gets no signal).

Recommend (1). The remaining decision is whether MASTER prefers any of the alternatives.
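
Under option (1), the nested Offer for a Call-For-Price listing carries availability, url, and seller but no price fields. An illustrative fragment (hostname and path invented for the example):

```json
{
  "@type": "Offer",
  "url": "https://961souq.example/products/some-cpu",
  "seller": { "@type": "Organization", "name": "961Souq" },
  "availability": "https://schema.org/MadeToOrder"
}
```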

Implementation plan

Mapped to a follow-up code ticket (not built in this RFC). Sequence:

  • Lock decisions as ADR 0006 (or whichever number is next at lock time)
  • Implementation ticket: feat: AI-discoverability surface (M1) covering:
  • src/app/robots.ts — open posture per Decision 1
  • src/app/sitemap.ts — homepage + categories + products + docs per Decision 5
  • public/llms.txt (or equivalent route) — curated index per Decision 2
  • src/lib/structured-data.ts — JSON-LD composers (productJsonLd, breadcrumbJsonLd, etc.) per Decision 3
  • src/components/StructuredData.tsx — server component rendering <script type="application/ld+json">
  • Wire JSON-LD into product detail (src/app/products/[slug]/page.tsx) and category pages
  • Wire OpenGraph + Twitter Card into Next.js metadata exports per ai-discoverability.md §2.4
  • Homepage stat block component (HTML prose with <time> tag) per Decision 4 — implementation lands inside #28 page design
  • Last updated <Nh ago> per listing row — implementation inside #28
  • Vitest specs: tests/lib/structured-data.test.ts covering Product/AggregateOffer/Offer/availability mapping (incl. MadeToOrder for null priceUsd); tests/app/robots.test.ts for content; tests/app/sitemap.test.ts for URL coverage
  • Validation step: every product fixture run through Schema Markup Validator equivalent (or a local schema-dts type-check)
  • M2 follow-up ticket: feat: AI-discoverability surface (M2) covering retailer profile LocalBusiness, homepage WebSite/Organization, RSS price drops, social-share image renders
  • Update docs/reference/ai-discoverability.md with implementation notes (file paths, gotchas) once code lands

Out of scope

  • Google-search SEO in the classic ranking sense (#38). The two interact (structured data overlaps); whichever ticket lands second adopts the other's choices and reduces to the marginal delta.
  • Security / WAF / rate-limiting (#44). robots.txt declares policy; WAF enforces it. We declare; #44 enforces.
  • KPI definition (#43). We surface "AI-citation rate" as a KPI candidate; #43 defines it.
  • Implementation code. No next.config.ts directives, no JSON-LD components, no app/robots.ts in this RFC.
  • i18n SEO (hreflang, per-locale sitemaps). Per ADR-0004 we ship English-only through M2; revisit when i18n revisits.
  • Schema.org for build pages (UC-9 saved/shared builds). M2 candidate alongside #9 completed-build gallery — what's the right schema.org type for a curated PC build? Possibly ItemList of Product, possibly HowTo. Decide there, not here.
  • AI-citation telemetry pipeline. Tracking when a referrer is chat.openai.com / claude.ai / perplexity.ai / gemini.google.com / duckduckgo.com is a #43 concern.
  • Per-page editorial copy. "First-paragraph-as-citation" is a constraint; the actual copy is #28 page design + #42 brand voice.