AI bot allowlist runbook
Per ADR-0013 D1, 961tech currently ships a fully open robots.txt for every AI bot class (training, AI-search, on-demand). New AI bots appear every month; this runbook describes how we keep up.
Posture summary
| Class | Default | Notes |
|---|---|---|
| Training (GPTBot, ClaudeBot, anthropic-ai, Google-Extended, Applebot-Extended, meta-externalagent) | Allow | Future B2B data-licensing path lives behind a gated API per ADR-0013 OQ-(a), not behind robots.txt |
| AI-search index (OAI-SearchBot, Claude-SearchBot, PerplexityBot, meta-webindexer) | Allow | Citation traffic is the entire monetisation funnel in the AI-search era |
| On-demand fetcher (ChatGPT-User, Claude-User, Perplexity-User, DuckAssistBot, meta-externalfetcher, facebookexternalhit) | Allow | User-initiated fetches; we never benefit from blocking them |
| Conventional search (Googlebot, Bingbot, Applebot) | Allow | Genre baseline |
| Wildcard `*` | Allow `/` + Disallow `/api/go/` | Scrapers can index the public catalog; the click redirector stays unindexed |
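The posture table can be expressed as data and rendered to robots.txt text. This is a minimal sketch, not the contents of `src/app/robots.ts`: the `RobotsRule` type, `buildRules`, and `renderRobotsTxt` are hypothetical names introduced here, and the real file may use a framework type instead.

```typescript
// Hypothetical sketch: local types stand in for whatever src/app/robots.ts really uses.
type BotClass = "training" | "ai-search" | "on-demand" | "conventional-search";

interface RobotsRule {
  userAgent: string | string[];
  allow?: string;
  disallow?: string;
}

// User agents copied from the posture table above.
const BOTS: Record<BotClass, string[]> = {
  training: ["GPTBot", "ClaudeBot", "anthropic-ai", "Google-Extended", "Applebot-Extended", "meta-externalagent"],
  "ai-search": ["OAI-SearchBot", "Claude-SearchBot", "PerplexityBot", "meta-webindexer"],
  "on-demand": ["ChatGPT-User", "Claude-User", "Perplexity-User", "DuckAssistBot", "meta-externalfetcher", "facebookexternalhit"],
  "conventional-search": ["Googlebot", "Bingbot", "Applebot"],
};

// One allow-all group per class, followed by the wildcard group.
function buildRules(): RobotsRule[] {
  const classRules: RobotsRule[] = (Object.keys(BOTS) as BotClass[]).map((cls) => ({
    userAgent: BOTS[cls],
    allow: "/",
  }));
  // Wildcard row: the public catalog is crawlable, the click redirector is not.
  return [...classRules, { userAgent: "*", allow: "/", disallow: "/api/go/" }];
}

// Render rule groups as robots.txt text, one blank line between groups.
function renderRobotsTxt(rules: RobotsRule[]): string {
  return rules
    .map((rule) => {
      const agents = Array.isArray(rule.userAgent) ? rule.userAgent : [rule.userAgent];
      const lines = agents.map((ua) => `User-agent: ${ua}`);
      if (rule.allow !== undefined) lines.push(`Allow: ${rule.allow}`);
      if (rule.disallow !== undefined) lines.push(`Disallow: ${rule.disallow}`);
      return lines.join("\n");
    })
    .join("\n\n");
}
```

Under RFC 9309 a crawler follows the most specific group that matches its user agent, so the named bots in this sketch would not inherit the wildcard `Disallow: /api/go/`. Whether the real file repeats that line per group is not stated in this runbook.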
Monthly review checklist
Run on the first Monday of each month:
- Check the Cloudflare AI Bot list and the robots.txt files of major outlets for new UAs. Anything seen in our logs that is not in our `robots.txt` is a candidate for a new explicit allow line.
- Check Workers Logs for `pre_tool_use` events with novel UAs:
  - Annotate each unrecognised UA: training? AI-search? on-demand? unrelated?
- Update `src/app/robots.ts` if a new class member arrived. Each new line gets a comment with the source (vendor docs URL).
- Update Reference → AI discoverability §2.1 with the new UA + a one-line characterisation.
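The "seen in our logs but not in our robots.txt" step can be sketched as a small filter. This is an illustration under assumptions: `findCandidates`, `KNOWN_TOKENS`, and the `looksLikeBot` heuristic are hypothetical, the token list is abbreviated, and the sample user agents in the usage note are made up. Classification stays a manual step, so every candidate starts as `"unclassified"`.

```typescript
type BotClass = "training" | "ai-search" | "on-demand" | "unrelated";

interface Candidate {
  userAgent: string;
  hits: number;
  classification: BotClass | "unclassified";
}

// Product tokens already listed in robots.txt (abbreviated for the example).
const KNOWN_TOKENS = ["GPTBot", "ClaudeBot", "OAI-SearchBot", "PerplexityBot", "Googlebot", "Bingbot"];

// Case-insensitive substring match against the known product tokens.
function isKnown(userAgent: string, knownTokens: string[]): boolean {
  const ua = userAgent.toLowerCase();
  return knownTokens.some((token) => ua.includes(token.toLowerCase()));
}

// Assumption: a crude heuristic to skip ordinary browsers; tune it to real logs.
function looksLikeBot(userAgent: string): boolean {
  return /bot|crawler|spider|fetch/i.test(userAgent);
}

// Return unknown bot-like UAs with hit counts, busiest first.
function findCandidates(observed: string[], knownTokens: string[]): Candidate[] {
  const counts = new Map<string, number>();
  for (const ua of observed) {
    if (isKnown(ua, knownTokens) || !looksLikeBot(ua)) continue;
    counts.set(ua, (counts.get(ua) ?? 0) + 1);
  }
  return [...counts.entries()]
    .map(([userAgent, hits]) => ({ userAgent, hits, classification: "unclassified" as const }))
    .sort((a, b) => b.hits - a.hits);
}
```

For example, `findCandidates(["GPTBot/1.1", "ExampleAIBot/0.1", "ExampleAIBot/0.1"], KNOWN_TOKENS)` yields a single candidate for the made-up `ExampleAIBot/0.1` with two hits, ready to be annotated by hand.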
When to upgrade to Cloudflare WAF block
ADR-0013 D1 makes robots.txt declarative; ADR-0012 Q3 makes the WAF the enforcement layer. Triggers for escalating from robots.txt to a WAF block:
- Abusive traffic from a documented training crawler, e.g. GPTBot pulling 10k+ pages/hour: Cloudflare's "Block AI training" managed rule is the one-toggle response.
- Undocumented UA pretending to be a browser: block at the WAF via a custom rule on `cf.bot_management.score < 30`.
- Polite but high-volume: rate-limit (ADR-0012 Q3 single rule on `/api/*` POST etc.) before considering a block.
WAF blocks are reversible; don't fear them. Document each block in docs/runbooks/waf-blocks.md (the file will be created as blocks accumulate).
When NOT to block
- A novel AI-search UA we haven't seen before: default-allow until there is evidence of abuse.
- A vendor's documented citation crawler: even at 10k pages/hour, citation traffic is asymmetric upside.
- A bot that respects `Crawl-delay` directives in robots.txt: it is cooperative, so work with it.
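The escalation triggers and the exceptions above can be read as one decision function. This is a sketch under assumptions, not an implemented policy: `Observation`, `Action`, and `decide` are hypothetical names, and reusing the 10k pages/hour example as the "high-volume" cutoff is an assumption, since the runbook gives no figure for that case. Crawl-delay compliance is left out of the model.

```typescript
type BotKind = "training" | "ai-search" | "on-demand" | "unknown";
type Action = "allow" | "rate-limit" | "managed-rule-block" | "custom-waf-block";

interface Observation {
  kind: BotKind;            // class taken from vendor docs, or "unknown"
  documented: boolean;      // the vendor publishes this UA
  pagesPerHour: number;
  claimsBrowserUa: boolean; // the UA string looks like an ordinary browser
  botScore?: number;        // cf.bot_management.score, when available
}

const ABUSE_PAGES_PER_HOUR = 10_000; // example figure from the runbook; assumed as the cutoff
const BOT_SCORE_THRESHOLD = 30;      // threshold from the runbook's custom rule

function decide(o: Observation): Action {
  // Undocumented UA pretending to be a browser, with a low bot score: custom WAF rule.
  if (!o.documented && o.claimsBrowserUa && o.botScore !== undefined && o.botScore < BOT_SCORE_THRESHOLD) {
    return "custom-waf-block";
  }
  // Documented citation crawler: volume alone is never a reason to block.
  if (o.documented && o.kind === "ai-search") return "allow";
  // Documented training crawler at abusive volume: the managed-rule toggle.
  if (o.documented && o.kind === "training" && o.pagesPerHour >= ABUSE_PAGES_PER_HOUR) {
    return "managed-rule-block";
  }
  // Everything else at high volume: rate-limit before considering a block.
  if (o.pagesPerHour >= ABUSE_PAGES_PER_HOUR) return "rate-limit";
  // Default-allow, including novel UAs with no evidence of abuse.
  return "allow";
}
```

The order of the checks matters: the citation-crawler exception is tested before any volume rule, so a documented AI-search bot is allowed regardless of its request rate.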
How to find a new UA in our logs
```bash
# In dev, with the dev server running (-i because header-name casing varies):
tail -F /tmp/devserver.log | grep --line-buffered -ioE 'user-agent.*$'

# Once Cloudflare Workers logs are wired (per #43 implementation):
wrangler tail --format=json --status=ok \
  | jq -r 'select(.outcome == "ok") | .event.request.headers["user-agent"]'
```
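To turn that stream into counts for the monthly review, the tail events can be tallied per user agent. This is a minimal sketch: `tallyUserAgents` is a hypothetical helper, the field names mirror the jq filter above, and it assumes each input string is one complete JSON event (for example after compacting the output with `jq -c .`).

```typescript
// Field names mirror the jq filter: .outcome and .event.request.headers["user-agent"].
interface TailEvent {
  outcome?: string;
  event?: { request?: { headers?: Record<string, string> } } | null;
}

// Count successful requests per user agent; malformed lines are skipped.
function tallyUserAgents(jsonLines: string[]): Map<string, number> {
  const counts = new Map<string, number>();
  for (const line of jsonLines) {
    if (line.trim() === "") continue;
    let parsed: TailEvent | null;
    try {
      parsed = JSON.parse(line) as TailEvent | null;
    } catch {
      continue; // partial or non-JSON line
    }
    if (parsed === null || typeof parsed !== "object") continue;
    if (parsed.outcome !== "ok") continue;
    const ua = parsed.event?.request?.headers?.["user-agent"];
    if (typeof ua !== "string" || ua === "") continue;
    counts.set(ua, (counts.get(ua) ?? 0) + 1);
  }
  return counts;
}
```

The resulting map gives one entry per distinct UA string, which is a convenient input for comparing against the tokens already in robots.txt.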
Out of scope
- IP-block lists (CIDR ranges) — too noisy at our scale; revisit if a single ASN floods us
- User-agent fingerprint blocking — that's an arms race we won't win
- ML-based bot scoring — Cloudflare's Bot Management (Enterprise) does this; out of budget per ADR-0012 Q6