AI bot allowlist runbook

Per ADR-0013 D1, 961tech ships a fully open robots.txt for every AI bot class today (training, AI-search, on-demand). New AI bots arrive monthly; this runbook is how we keep up.

Posture summary

| Class | Default | Notes |
| --- | --- | --- |
| Training (GPTBot, ClaudeBot, anthropic-ai, Google-Extended, Applebot-Extended, meta-externalagent) | Allow | Future B2B data-licensing path lives behind a gated API per ADR-0013 OQ-(a), not behind robots.txt |
| AI-search index (OAI-SearchBot, Claude-SearchBot, PerplexityBot, meta-webindexer) | Allow | Citation traffic is the entire monetisation funnel for the AI-search era |
| On-demand fetcher (ChatGPT-User, Claude-User, Perplexity-User, DuckAssistBot, meta-externalfetcher, facebookexternalhit) | Allow | User-initiated fetches; we never benefit from blocking them |
| Conventional search (Googlebot, Bingbot, Applebot) | Allow | Genre baseline |
| Wildcard * | Allow / + Disallow /api/go/ | Scrapers can index the public catalog; the click redirector stays unindexed |
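
In code, the posture above corresponds to a rules array plus a serialiser along these lines. This is a sketch only, not the actual src/app/robots.ts (which follows Next.js's MetadataRoute.Robots shape), and the bot list is abbreviated:

```typescript
// Illustrative shape of the allowlist; the real file is src/app/robots.ts.
type RobotsRule = { userAgent: string; allow: string; disallow?: string };

const rules: RobotsRule[] = [
  // Training crawlers — allowed per ADR-0013 D1
  { userAgent: "GPTBot", allow: "/" },
  { userAgent: "ClaudeBot", allow: "/" },
  // AI-search indexers — citation traffic is the funnel
  { userAgent: "OAI-SearchBot", allow: "/" },
  { userAgent: "PerplexityBot", allow: "/" },
  // Wildcard: public catalog open, click redirector unindexed
  { userAgent: "*", allow: "/", disallow: "/api/go/" },
];

// Serialise the rules to robots.txt body text.
function toRobotsTxt(rs: RobotsRule[]): string {
  return rs
    .map((r) => {
      const lines = [`User-agent: ${r.userAgent}`, `Allow: ${r.allow}`];
      if (r.disallow) lines.push(`Disallow: ${r.disallow}`);
      return lines.join("\n");
    })
    .join("\n\n");
}

console.log(toRobotsTxt(rules));
```
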

Monthly review checklist

Run on the first Monday of each month:

  1. Pull the Cloudflare AI bot list and the robots.txt files of major outlets, and scan both for new UAs. Any UA seen in our logs but missing from our robots.txt is a candidate for a new explicit allow line.

  2. Check Workers Logs for pre_tool_use events with novel UAs:

    wrangler tail --format=json \
      | head -n 1000 \
      | jq -r '.event.request.headers["user-agent"]' \
      | sort -u
    
    Annotate each unrecognised UA: training? AI-search? on-demand? unrelated?

  3. Update src/app/robots.ts if a new class member arrived. Each new line gets a comment with the source (vendor docs URL).

  4. Update Reference → AI discoverability §2.1 with the new UA + a one-line characterisation.
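
The annotation pass in step 2 amounts to a class lookup. A minimal sketch — the token lists here are illustrative subsets of the posture table, not the canonical allowlist:

```typescript
// Triage an observed UA string into one of the runbook's bot classes.
type BotClass = "training" | "ai-search" | "on-demand" | "unknown";

const CLASSES: Record<Exclude<BotClass, "unknown">, string[]> = {
  training: ["GPTBot", "ClaudeBot", "anthropic-ai", "Google-Extended"],
  "ai-search": ["OAI-SearchBot", "Claude-SearchBot", "PerplexityBot"],
  "on-demand": ["ChatGPT-User", "Claude-User", "Perplexity-User", "DuckAssistBot"],
};

function classify(ua: string): BotClass {
  const lower = ua.toLowerCase();
  for (const [cls, tokens] of Object.entries(CLASSES)) {
    if (tokens.some((t) => lower.includes(t.toLowerCase()))) {
      return cls as BotClass;
    }
  }
  return "unknown"; // annotate by hand: unrelated tool, or a new class member
}
```
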

When to upgrade to Cloudflare WAF block

ADR-0013 D1 makes robots.txt declarative; ADR-0012 Q3 makes the WAF the enforcement layer. Triggers to escalate from robots.txt to a WAF block:

  • Abusive traffic from a documented training crawler — e.g. GPTBot pulling 10k+ pages/hour. Cloudflare's "Block AI training" managed rule is the one-toggle response.
  • Undocumented UA pretending to be a browser — block at WAF via custom rule on cf.bot_management.score < 30.
  • Polite but high-volume — rate-limit first (the single ADR-0012 Q3 rule on POST /api/* and the like) before considering a block.

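For the browser-impersonation case, the custom rule's filter expression would look roughly like the following. This is a sketch in Cloudflare's Rules language: the verified-bot guard is an added assumption so documented crawlers aren't caught, and the cf.bot_management.* fields assume the Bot Management add-on is available.

```
(cf.bot_management.score lt 30 and not cf.bot_management.verified_bot)
```
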
WAF blocks are reversible; don't fear them. Document each block in docs/runbooks/waf-blocks.md (create the file once blocks start accumulating).

When NOT to block

  • A novel AI-search UA we haven't seen before — default-allow until evidence of abuse.
  • A vendor's documented citation crawler — even at 10k/hr, citation traffic is asymmetric upside.
  • A bot that respects the robots.txt Crawl-delay directive — it's cooperative; work with it.

How to find a new UA in our logs

# In dev with the dev server running:
tail -F /tmp/devserver.log | grep --line-buffered -oE 'user-agent.*$'

# Once Cloudflare Workers logs are wired (per #43 implementation):
wrangler tail --format=json --status=ok \
  | jq 'select(.outcome == "ok") | .event.request.headers["user-agent"]'
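
Once tail output is flowing, counting occurrences per UA (not just deduplicating) is what surfaces the high-volume cases. A sketch that consumes the same NDJSON shape as the jq pipe above — the event field path is an assumption to verify against real tail output:

```typescript
// Tally user-agents from wrangler-tail NDJSON lines.
function tallyUserAgents(ndjsonLines: string[]): Map<string, number> {
  const counts = new Map<string, number>();
  for (const line of ndjsonLines) {
    try {
      const ev = JSON.parse(line);
      const ua = ev?.event?.request?.headers?.["user-agent"];
      if (typeof ua === "string") counts.set(ua, (counts.get(ua) ?? 0) + 1);
    } catch {
      // skip non-JSON lines (wrangler status chatter)
    }
  }
  return counts;
}
```
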

Out of scope

  • IP-block lists (CIDR ranges) — too noisy at our scale; revisit if a single ASN floods us
  • User-agent fingerprint blocking — that's an arms race we won't win
  • ML-based bot scoring — Cloudflare's Bot Management (Enterprise) does this; out of budget per ADR-0012 Q6

See also