AI bot allowlist runbook

Per ADR-0013 D1, 961tech ships a fully open robots.txt for every AI bot class today (training, AI-search, on-demand). New AI bots arrive monthly; this runbook is how we keep up.

Posture summary

| Class | Default | Notes |
| --- | --- | --- |
| Training (GPTBot, ClaudeBot, anthropic-ai, Google-Extended, Applebot-Extended, meta-externalagent) | Allow | Future B2B data-licensing path lives behind a gated API per ADR-0013 OQ-(a), not behind robots.txt |
| AI-search index (OAI-SearchBot, Claude-SearchBot, PerplexityBot, meta-webindexer) | Allow | Citation traffic is the entire monetisation funnel for the AI-search era |
| On-demand fetcher (ChatGPT-User, Claude-User, Perplexity-User, DuckAssistBot, meta-externalfetcher, facebookexternalhit) | Allow | User-initiated fetches; we never benefit from blocking them |
| Conventional search (Googlebot, Bingbot, Applebot) | Allow | Genre baseline |
| Wildcard * | Allow / + Disallow /api/go/ | Scrapers can index the public catalog; the click redirector stays unindexed |
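
In code, the posture above corresponds to a rules array plus a serialiser along these lines. This is a sketch only, not the actual src/app/robots.ts (which follows Next.js's MetadataRoute.Robots shape), and the bot list is abbreviated:

```typescript
// Illustrative shape of the allowlist; the real file is src/app/robots.ts.
type RobotsRule = { userAgent: string; allow: string; disallow?: string };

const rules: RobotsRule[] = [
  // Training crawlers — allowed per ADR-0013 D1
  { userAgent: "GPTBot", allow: "/" },
  { userAgent: "ClaudeBot", allow: "/" },
  // AI-search indexers — citation traffic is the funnel
  { userAgent: "OAI-SearchBot", allow: "/" },
  { userAgent: "PerplexityBot", allow: "/" },
  // Wildcard: public catalog open, click redirector unindexed
  { userAgent: "*", allow: "/", disallow: "/api/go/" },
];

// Serialise the rules to robots.txt body text.
function toRobotsTxt(rs: RobotsRule[]): string {
  return rs
    .map((r) => {
      const lines = [`User-agent: ${r.userAgent}`, `Allow: ${r.allow}`];
      if (r.disallow) lines.push(`Disallow: ${r.disallow}`);
      return lines.join("\n");
    })
    .join("\n\n");
}

console.log(toRobotsTxt(rules));
```
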

Monthly review checklist

Run on the first Monday of each month:

  1. Pull the Cloudflare AI bot list and the robots.txt files of major outlets, and scan both for new UAs. Any UA seen in our logs but missing from our robots.txt is a candidate for a new explicit allow line.

  2. Check Workers Logs for pre_tool_use events with novel UAs:

    wrangler tail --format=json \
      | head -n 1000 \
      | jq -r '.event.request.headers["user-agent"]' \
      | sort -u
    
    Annotate each unrecognised UA: training? AI-search? on-demand? unrelated?

  3. Update src/app/robots.ts if a new class member arrived. Each new line gets a comment with the source (vendor docs URL).

  4. Update Reference → AI discoverability §2.1 with the new UA + a one-line characterisation.
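
The annotation pass in step 2 amounts to a class lookup. A minimal sketch — the token lists here are illustrative subsets of the posture table, not the canonical allowlist:

```typescript
// Triage an observed UA string into one of the runbook's bot classes.
type BotClass = "training" | "ai-search" | "on-demand" | "unknown";

const CLASSES: Record<Exclude<BotClass, "unknown">, string[]> = {
  training: ["GPTBot", "ClaudeBot", "anthropic-ai", "Google-Extended"],
  "ai-search": ["OAI-SearchBot", "Claude-SearchBot", "PerplexityBot"],
  "on-demand": ["ChatGPT-User", "Claude-User", "Perplexity-User", "DuckAssistBot"],
};

function classify(ua: string): BotClass {
  const lower = ua.toLowerCase();
  for (const [cls, tokens] of Object.entries(CLASSES)) {
    if (tokens.some((t) => lower.includes(t.toLowerCase()))) {
      return cls as BotClass;
    }
  }
  return "unknown"; // annotate by hand: unrelated tool, or a new class member
}
```
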

When to upgrade to Cloudflare WAF block

ADR-0013 D1 makes robots.txt declarative; ADR-0012 Q3 makes the WAF the enforcement layer. Triggers to escalate from robots.txt to a WAF block:

  • Abusive traffic from a documented training crawler — e.g. GPTBot pulling 10k+ pages/hour. Cloudflare's "Block AI training" managed rule is the one-toggle response.
  • Undocumented UA pretending to be a browser — block at WAF via custom rule on cf.bot_management.score < 30.
  • Polite but high-volume — rate-limit first (the single ADR-0012 Q3 rule on POST /api/* and the like) before considering a block.

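For the browser-impersonation case, the custom rule's filter expression would look roughly like the following. This is a sketch in Cloudflare's Rules language: the verified-bot guard is an added assumption so documented crawlers aren't caught, and the cf.bot_management.* fields assume the Bot Management add-on is available.

```
(cf.bot_management.score lt 30 and not cf.bot_management.verified_bot)
```
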
WAF blocks are reversible; don't fear them. Document each block in docs/runbooks/waf-blocks.md (create the file once blocks start accumulating).

When NOT to block

  • A novel AI-search UA we haven't seen before — default-allow until evidence of abuse.
  • A vendor's documented citation crawler — even at 10k/hr, citation traffic is asymmetric upside.
  • A bot that respects the robots.txt Crawl-delay directive — it's cooperative; work with it.

How to find a new UA in our logs

# In dev with the dev server running:
tail -F /tmp/devserver.log | grep --line-buffered -oE 'user-agent.*$'

# Once Cloudflare Workers logs are wired (per #43 implementation):
wrangler tail --format=json --status=ok \
  | jq 'select(.outcome == "ok") | .event.request.headers["user-agent"]'
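
Once tail output is flowing, counting occurrences per UA (not just deduplicating) is what surfaces the high-volume cases. A sketch that consumes the same NDJSON shape as the jq pipe above — the event field path is an assumption to verify against real tail output:

```typescript
// Tally user-agents from wrangler-tail NDJSON lines.
function tallyUserAgents(ndjsonLines: string[]): Map<string, number> {
  const counts = new Map<string, number>();
  for (const line of ndjsonLines) {
    try {
      const ev = JSON.parse(line);
      const ua = ev?.event?.request?.headers?.["user-agent"];
      if (typeof ua === "string") counts.set(ua, (counts.get(ua) ?? 0) + 1);
    } catch {
      // skip non-JSON lines (wrangler status chatter)
    }
  }
  return counts;
}
```
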

Out of scope

  • IP-block lists (CIDR ranges) — too noisy at our scale; revisit if a single ASN floods us
  • User-agent fingerprint blocking — that's an arms race we won't win
  • ML-based bot scoring — Cloudflare's Bot Management (Enterprise) does this; out of budget per ADR-0012 Q6

See also