RFC-0007: Observability stack

Status: Draft — needs MASTER signoff
Author: MASTER (drafted by Claude as part of #43)
Date: 2026-04-28
Related: #43, #19, #18, #14, #28, #29, #44, RFC-0001, RFC-0002, kpis.md, performance-budget.md

Decision gate. This RFC proposes the observability stack — log shipper, error reporter, metrics layer, RUM, alerting destination, and dashboard surface. It does not lock anything in. Once MASTER picks, the choice gets captured as an ADR and #43 closes; implementation lands in a follow-up alongside #19.

Summary

961tech needs to see what its production infrastructure is doing without paying for it. Today, tech-stack.md § Observability is "nothing instrumented." RFC-0001 picks Cloudflare Pages + Workers + Cron Triggers (BEY PoP, ~$5/mo all-in); RFC-0002 picks pg-boss on the existing Postgres (no new Redis). Those constraints plus a $5–20/mo M1 budget plus a "ship M1 now" mindset collapse the option space tightly.

Recommendation. Cloudflare-native primary stack — Workers Observability (logs + invocation metrics + cron history), Cloudflare Web Analytics (pageviews + Core Web Vitals RUM), Workers Analytics Engine (custom metrics, including outbound-click rollups for the north-star KPI), Cloudflare Notifications (email alerts) — augmented by Sentry Free (error grouping + release tagging the Cloudflare-native stack lacks) and Axiom Free (30-day log retention backstop via native Cloudflare Logpush integration). Total incremental observability spend at M1 = $0 above the $5/mo Workers Paid baseline already in RFC-0001. Trajectory at M2 (5× M1) stays inside all free tiers.

Defer distributed tracing and APM to M2/M3 — Cloudflare Workers performance.now() zeroes for CPU-bound spans, making server-side traces low-signal until the platform fixes it. Defer the Telegram-notify bridge as opt-in (MASTER-decision); default alert path is Cloudflare email.

Motivation

Three concrete, immediate costs of not deciding:

No way to validate the north-star KPI. kpis.md defines Weekly Qualified Outbound Clicks. Without an event sink for the existing /api/go/r/[retailerId]/p/[listingId] route, that number cannot be computed at all.
No drift detection on scrapers. personas.md §5.6 makes "no real-time stock signal" a severity-3 pain across four personas. The product's whole-trust play depends on freshness; without scraper-success-rate alerting, a parser break that returns 0 listings for PCAndParts goes silently to production until a user complains. architecture/deployment.md § Observability names this as a planned signal but no infra exists.
No way to police the performance budget. performance-budget.md sets P75 LCP ≤ 2.5s on Lebanese mobile. Without a RUM beacon, that's a wish, not a measurement.

Plus the foundation cost: every feature ticket from M1 forward carries an implicit "we'll add observability later" footnote. "Later" is the wrong default for a solo evening project — pick something cheap that ships M1.

Proposal

Recommendation: Cloudflare-native primary + Sentry Free + Axiom Free

Concern	Tool	Cost
Pageviews + sessions + referrers + geo	Cloudflare Web Analytics (Pages-project auto-enabled, no script tag)	$0
Core Web Vitals RUM (LCP / CLS / INP / TTFB / FCP)	Cloudflare Web Analytics RUM beacon (auto-enabled)	$0
Worker invocation metrics (count, errors, P95 wall-clock, CPU)	Cloudflare Workers Observability dashboard (built-in to Workers Paid)	$0 (in $5/mo Workers Paid from RFC-0001)
Application logs (route handlers, scraper events, structured JSON)	Cloudflare Workers Logs (20M events/mo / 7-day retention on Workers Paid)	$0
Cron Trigger execution history	Cloudflare Workers Cron Events panel (last 100 runs, status, duration)	$0
Custom metrics (north-star: outbound clicks per retailer per day; secondary: build-session funnel)	Workers Analytics Engine `writeDataPoint` API + SQL query	$0 (currently unbilled per Cloudflare's published "you will not be billed for AE" disclaimer; published $0.25/M-write rate stays under $1/mo at M1 even if billing flips on)
Long-retention log archive (30 days, queryable)	Axiom Free Personal tier via Cloudflare Logpush native app (1.5 GB/mo of 500 GB free; 30-day retention; APL query language; native dashboards)	$0
Error grouping + stack-trace dedup + release tagging + Next.js + Cloudflare Workers SDKs	Sentry Free Developer tier (5k errors/mo / 5GB logs / 5M spans / 50 replays / 1 cron monitor / 1 user / 30-day lookback / email-only alerts)	$0
Alerting destination (default)	Cloudflare Notifications email (free, included on Workers Paid; fires on Worker errors, script failures, Logpush job health)	$0
Alerting destination (rich, optional)	Tail Worker filtering events → Discord/Slack incoming webhook (~30 lines of code; Tail Workers billed only by their own CPU time, rounding-error within Workers Paid included CPU)	$0
Alerting destination (personal channel, opt-in MASTER decision)	`telegram-notify` MCP bridge via Tail Worker → Telegram Bot API	$0 (gated on MASTER decision, see Open question 4)
Distributed tracing / APM	Deferred to M2/M3 — see Trade-offs	$0
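
The optional "rich" alerting row above describes a Tail Worker that filters events and forwards them to a Discord or Slack webhook. A minimal sketch of that bridge might look like the following. It assumes a hypothetical `ALERT_WEBHOOK_URL` secret, and it declares locally only the subset of the Tail Worker event shape it reads (a real project would take the types from `@cloudflare/workers-types`). The payload uses Discord's `content` field; a Slack incoming webhook expects `text` instead.

```typescript
// Minimal subset of the Tail Worker event shape used by this sketch.
// Declared locally so the example is self-contained.
interface TailException { name: string; message: string; timestamp: number }
interface TailEvent {
  scriptName: string | null;
  outcome: string; // e.g. "ok", "exception", "exceededCpu"
  exceptions: TailException[];
}

// Hypothetical binding name: the webhook URL stored as a Worker secret.
interface Env { ALERT_WEBHOOK_URL: string }

// Pure filter + formatter: one alert line per failed invocation.
function formatAlerts(events: TailEvent[]): string[] {
  return events
    .filter((e) => e.outcome !== "ok" || e.exceptions.length > 0)
    .map((e) => {
      const first = e.exceptions[0];
      const detail = first ? `${first.name}: ${first.message}` : "no exception recorded";
      return `[${e.scriptName ?? "unknown"}] outcome=${e.outcome} ${detail}`;
    });
}

const tailWorker = {
  // Tail handler: receives batches of events from the producer Worker.
  async tail(events: TailEvent[], env: Env): Promise<void> {
    const lines = formatAlerts(events);
    if (lines.length === 0) return; // nothing to report
    await fetch(env.ALERT_WEBHOOK_URL, {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      // Discord caps message content at 2000 characters.
      body: JSON.stringify({ content: lines.join("\n").slice(0, 2000) }),
    });
  },
};

export default tailWorker;
```

Keeping the filter in a pure function means the alert wording can be unit-tested without a webhook, and the same formatter could feed the Telegram path if MASTER opts in.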

What changes from today

The app moves from "no instrumentation" to: Workers Observability switched on via the "observability" key in wrangler.jsonc (the [observability] table applies only if the project uses wrangler.toml), Cloudflare Web Analytics enabled on the Pages project, and the @sentry/cloudflare and @sentry/nextjs SDK packages added. The SDKs are the only package.json change this RFC implies, and they land in the implementation ticket, not in this RFC's commit.
Scraper jobs (per RFC-0002) emit structured logs from each pg-boss handler — picked up by Workers Logs automatically.
Click-out route (/api/go/...) writes one Analytics Engine data point per click in addition to the existing Click row insert.
Cloudflare Notifications wires email alerts on: (a) Worker error rate > 1%, (b) Logpush job failure, (c) Workers script deploy failure. Scraper-failure alert is implemented as a saved Workers Logs query that triggers via the same Notifications channel.
Axiom Logpush job created via the Axiom Cloudflare app (token paste, no glue code).
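
The click-out change above (one Analytics Engine data point per click) could be sketched as follows. The `OutboundClick` shape and the blob layout are illustrative assumptions, not the existing Click schema; the dataset interface is declared locally and mirrors only the `writeDataPoint` call shape (string blobs, numeric doubles, a single index).

```typescript
// Structural type matching the writeDataPoint call shape on an Analytics
// Engine dataset binding. Declared locally to keep the sketch self-contained.
interface DataPoint { blobs?: string[]; doubles?: number[]; indexes?: string[] }
interface ClickDataset { writeDataPoint(point: DataPoint): void }

// Hypothetical click shape; the real Click row may carry more fields.
interface OutboundClick { retailerId: string; listingId: string; country: string }

// Build the data point separately so the field layout can be unit-tested.
function toDataPoint(click: OutboundClick): DataPoint {
  return {
    indexes: [click.retailerId], // sampling key: keeps per-retailer rollups accurate
    blobs: [click.retailerId, click.listingId, click.country],
    doubles: [1], // one click per data point
  };
}

// Called from the click-out route alongside the existing Click row insert.
// writeDataPoint does not return a promise, so nothing is awaited here.
function recordOutboundClick(dataset: ClickDataset, click: OutboundClick): void {
  dataset.writeDataPoint(toDataPoint(click));
}
```

In the route handler this would be called as `recordOutboundClick(env.OUTBOUND_CLICKS, click)`, where `OUTBOUND_CLICKS` is the dataset name used in the verification checklist and the binding name is an assumption.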

Behaviour

graph TB
    subgraph Visitor["Lebanese mobile visitor (BEY PoP)"]
      U[Visitor]
    end

    subgraph CF["Cloudflare edge"]
      Pages[Cloudflare Pages<br/>Next.js on Workers]
      WA[Web Analytics RUM beacon<br/>auto-enabled]
      WL[Workers Logs<br/>20M/mo, 7-day]
      WO[Workers Observability dashboard<br/>invocations/errors/P95]
      AE[Workers Analytics Engine<br/>custom metrics]
      CF_N[Cloudflare Notifications<br/>email]
      Cron[Cron Triggers<br/>scraper invocation]
    end

    subgraph Backstops["Free-tier augments"]
      AX[Axiom Personal<br/>30-day log archive]
      SE[Sentry Free Developer<br/>error grouping]
    end

    U -->|page load + RUM beacon| WA
    U -->|HTTP request| Pages
    Pages -->|console.log + structured| WL
    Pages -->|invocation telemetry| WO
    Pages -->|writeDataPoint per click| AE
    Pages -->|exception| SE
    Cron -->|scrape job| Pages

    WL -->|Logpush native app| AX
    WO -->|threshold breach| CF_N
    WL -->|saved query alert| CF_N

    CF_N -->|email| MASTER[MASTER inbox]
    SE -->|email| MASTER

    Backstops -.->|optional| Tail[Tail Worker → Discord/Slack/Telegram]
    Tail -.-> MASTER

Dashboard surfaces

Cloudflare dashboard — Workers & Pages → 961tech-app → Observability tab (logs + metrics + cron events). Web Analytics → 961tech-pages.dev (pageviews + Web Vitals).
Sentry web UI — Issues view; Releases tagged via withSentryConfig build-time. Email-only alerts on free tier.
Axiom web UI — saved dashboards with APL queries: per-retailer scraper success rate, per-route P95 latency from logs, error-rate trend.
No custom admin surface in 961tech itself at M1. Workers Analytics Engine SQL queries can be issued ad hoc via the Analytics Engine SQL API (an HTTP endpoint, usable from curl) or the dashboard. A /admin/jobs page reading pg-boss state per RFC-0002 § What we add lands in M2 polish, not on this RFC's critical path.
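
The per-retailer scraper success rate dashboard depends on every scraper handler emitting the same structured fields. A minimal sketch of such a log event follows; the field names and the `scrape.finished` event name are hypothetical, chosen only for illustration.

```typescript
// Hypothetical field names; the point is that every handler emits the same
// keys so saved queries can group and filter on them.
interface ScrapeResult {
  retailer: string;
  listingsFound: number;
  durationMs: number;
  error?: string;
}

interface ScrapeLogEvent extends ScrapeResult {
  event: "scrape.finished";
  ok: boolean;
}

// A run counts as failed if it reported an error OR returned zero listings,
// which is the silent parser-break case described in Motivation.
function buildScrapeLogEvent(result: ScrapeResult): ScrapeLogEvent {
  const ok = result.error === undefined && result.listingsFound > 0;
  return { event: "scrape.finished", ok, ...result };
}

// One JSON object per job, emitted with a single console.log call.
function logScrapeResult(result: ScrapeResult): void {
  console.log(JSON.stringify(buildScrapeLogEvent(result)));
}
```

Treating "zero listings" as a failure is a design choice in this sketch: it turns a parser break into a countable `ok: false` event rather than a run that looks healthy.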

Trade-offs

Cost	What it buys
Cloudflare Workers `performance.now()` zeroing on CPU-bound spans. Server-side spans show 0ms duration because Workers' clock only advances after I/O (anti-timing-attack). Affects all frameworks on Workers including Next.js and Sentry's tracing SDK. Means distributed tracing at M1 would be low-signal.	We defer tracing without losing functional observability. Workers Observability still gives wall-clock + CPU per invocation; query timing comes from in-handler `Date.now()` deltas around DB calls + structured logging, not from spans. Revisit when Cloudflare fixes the platform clock or when a tracing-friendly host enters scope.
Sentry email-only alerts on free tier. Slack / Discord / webhook integrations are gated to Team ($26/mo). Means error notifications stack in MASTER's inbox unless we route email → Telegram via a bridge.	Free tier covers everything else needed (5k errors/mo >> M1 needs, Next.js + Workers SDKs both GA, source-mapped traces, release tagging, 30-day lookback). The email constraint is bridgeable for $0; the alternative ($26/mo for Slack) eats the entire observability budget for one tool.
Workers Logs 7-day retention. Anything older requires Logpush to a destination.	Axiom Free covers it (500 GB/mo / 30-day) with native one-click integration. The 7-day Workers Logs window is plenty for live debugging; Axiom is the long-tail backstop.
Cloudflare Analytics Engine billing is "currently not billed." The published rate ($0.25/M write, $1/M read) could activate without notice.	Even at full billing, M1 lands at < $1/mo (well inside hard ceiling). Migration off AE if it ever becomes painful is mechanical — write the same data points to a Postgres table with a daily aggregator.
No session replay. PostHog (1k recordings/mo on free) and Highlight (500 sessions/mo) both ship replay; we don't.	Session replay is high-value at scale; at M1 with low traffic the DX cost (cookie consent UX, bundle weight from `rrweb` ~30-50 KB gz, privacy-masking config) outweighs the diagnostic gain. Add it later if diagnosing a specific bug demands it; meanwhile, MASTER screen-recording a session with browser DevTools open is a free workaround.
No first-class Slack/Telegram integration on day one. Cloudflare Notifications and Sentry Free both default to email.	Tail Worker → webhook bridge is ~30 lines of code; Telegram via Bot API is straightforward. Wired in implementation ticket only if MASTER opts in (Open question 4).
Sentry bundle weight (~35-55 KB gz client SDK). Counts against the `performance-budget.md` §2.2 100 KB initial-JS budget.	Tree-shake aggressively: disabling `replayIntegration` (not used) and `browserTracingIntegration` (Workers tracing is zero-signal anyway) cuts the SDK to ~15-20 KB gz. The SDK pays for itself the first time a real-user error has a symbolicated stack trace.
Lock-in risk. Workers Analytics Engine, Logpush, and Web Analytics are Cloudflare-specific; Sentry is a SaaS.	RFC-0001 already places us on Cloudflare; the lock-in is shared, not new. Sentry has a self-host path (getsentry/self-hosted, formerly getsentry/onpremise) if we ever need it. Axiom is SaaS-only, but at $0/mo for the foreseeable future that is an acceptable risk.
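
The first trade-off row relies on in-handler `Date.now()` deltas around DB calls instead of spans. One way to package that pattern is sketched below; the helper names (`startTimer`, `timed`) and the `timing` event name are hypothetical. The clock is injectable so the helper can be tested without real I/O.

```typescript
type Clock = () => number;

// Returns a stop function that reports elapsed milliseconds. On Workers the
// clock only advances across I/O, so wrap awaited calls (DB queries, fetch),
// not CPU-only code.
function startTimer(now: Clock = Date.now): () => number {
  const startedAt = now();
  return () => now() - startedAt;
}

// Wraps an async operation and logs its duration as one structured line,
// whether the operation resolves or throws.
async function timed<T>(
  label: string,
  op: () => Promise<T>,
  now: Clock = Date.now,
): Promise<T> {
  const stop = startTimer(now);
  let ok = true;
  try {
    return await op();
  } catch (err) {
    ok = false;
    throw err; // rethrow so callers and Sentry still see the failure
  } finally {
    console.log(JSON.stringify({ event: "timing", label, ok, durationMs: stop() }));
  }
}
```

A handler would wrap a query as `await timed("listings.byRetailer", () => queryListings(retailerId))`, where `queryListings` stands in for whatever DB call the route makes.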

Alternatives

Alternative A: Sentry-only ($0 at M1, $26 at M2)

Use Sentry Free for everything — errors, traces (low-signal), logs (5GB/mo on free), and a single cron monitor for the scraper heartbeat.

Where it loses. No RUM (Web Vitals captured weakly via Sentry's perf tab; not as good as Cloudflare Web Analytics, CFWA below). No first-class custom-metrics API (you can stash counts in tags but querying is awkward). No pageview attribution by referrer/geo. Email-only alerts on free.
Cost. $0 at M1; jumps to $26/mo Team if Slack/Telegram alerts become non-negotiable. Otherwise stays free through M2.
Why it's not the pick. Misses the natural fit of Cloudflare-native primitives that are already free with the chosen host.

Alternative B: PostHog Free ($0 at M1, $0–30 at M2)

Use PostHog (cloud free tier) for everything — product analytics, error tracking, session replay, feature flags. Native Workers SDK exists with documented caveats.

Where it wins. Most-bundled feature set; MIT-licensed exit path; product-analytics + feature flags are real wins for a future A/B-driven product.
Where it loses. Bundle weight: full posthog-js is ~60 KB gz with replay enabled, ~10-15 KB gz with posthog-js-lite. Even the lite build is competing for bundle headroom we'd prefer to spend on Sentry's error symbolication. Cookie consent UX cost (cookies on by default; cookieless mode is a config choice). Doesn't replace CFWA's free Web Vitals beacon.
Cost. $0 at M1; $0-30/mo at M2 with replay sampled at ≤5%.
Why it's not the pick. Genuinely strong second choice; the deciding factor is that we want Web Vitals + low bundle weight, and CFWA gives both for free. PostHog is the right pivot if/when feature-flag-driven experimentation becomes important (M3+).

Alternative C: Highlight.io PAYG ($50/mo from day one)

OTel-native observability + session replay + errors + logs.

Where it loses. PAYG floor of $50/mo flat exceeds the entire $20/mo observability budget. Free tier (500 sessions/mo) is too thin for production use. Self-host requires a CX32+ Hetzner ($13/mo) and ops time we don't have.
Why it's not the pick. Wrong shape for a $0–10/mo solo project.

Alternative D: Datadog ($15+/mo floor)

Best-in-class log + APM + dashboard product.

Where it loses. Per-host commitments, complex SKU mixing, and a Pro-tier minimum that kicks in at ~$15/mo just for logs at our volume. Overkill for a solo evening project.
Why it's not the pick. Right answer at $5k/mo budget; wrong answer at $20.

Alternative E: Custom DIY (R2 + grep, Workers + Discord webhooks)

Pure DIY: Logpush to R2 ($0.10/mo storage), grep over downloaded NDJSON, Tail Worker → Discord webhook for alerts. No SaaS dependencies.

Where it wins. Zero subscription cost. Maximum control.
Where it loses. No query UI (R2 SQL is preview), no error grouping (every exception is a fresh log line), no dashboards, no symbolication, no Web Vitals collection without writing the beacon ourselves. The engineering time to backfill these capabilities is >100 hours; not a fit for evening-and-weekends.
Why it's not the pick. Replicating the capabilities that Sentry Free + Axiom Free + Cloudflare Web Analytics provide for $0 would cost real engineering time.

Alternative F: Plausible Cloud ($9/mo Starter)

Replace CFWA with Plausible for pageview analytics + outbound-click goal tracking.

Where it wins. Plausible's outbound-link goal feature maps directly to the Weekly Qualified Outbound Clicks (WQOC) north-star metric as a first-class feature. Lighter script (1.3 KB gz vs ~5-6 KB gz CFWA beacon).
Where it loses. No Web Vitals product (Plausible doesn't track LCP/INP/CLS); we'd still need CFWA or a custom RUM beacon for performance-budget.md compliance. $9/mo for what CFWA + a 30-line outbound-tracking Worker gives free.
Why it's not the pick. The custom outbound-click tracker is already needed for the north-star metric anyway (we want server-side counts, not client-side beacons that adblockers eat) — so Plausible's value-add over CFWA is small. Revisit if M2 wants Plausible's per-goal funnel UI without building one.

Open questions

These need MASTER input before this RFC moves to Accepted.

North-star confirmation. kpis.md §1 proposes Weekly Qualified Outbound Clicks. The observability stack collects everything needed for that. If MASTER prefers a quality-first north-star (median match-rate per category, "comparable retailers shown per session", etc.), the collection picks here still hold — but the dashboard surfaces in Axiom + Workers Analytics Engine SQL get re-shaped. Confirm the metric before the dashboards land.
Cost ceiling. This RFC lands at $0 incremental observability spend. If MASTER's actual ceiling is higher — say $30/mo to unlock Sentry Team + Slack alerts + a Plausible Starter for richer goal funnels — the recommendation shifts to Sentry Team + Plausible Starter alongside the Cloudflare-native primitives. Default assumption: **$5–10/mo is the real ceiling** (matches RFC-0001's "5-20" framing but tightened by the observed Cloudflare-native fit).
Sentry Free email-alert bridge. Sentry Free is email-only for alerts. Acceptable to route Sentry email → Telegram via a simple IFTTT-style bridge or a tiny inbox-watcher Worker, or do we want to upgrade to Sentry Team ($26/mo, breaks the $20 ceiling)? Default assumption: bridge it — keep it free.
Telegram-notify integration. MASTER's brain repo has a telegram-notify MCP / tooling. Should the alerting layer wire alerts directly into MASTER's personal Telegram, or keep alerts in a project-only inbox/email channel? Default: opt-in, not default. Mixing personal-life notifications and project-alert noise is a UX cost; surface as MASTER decision.
Sampling on RUM beacon. Cloudflare Web Analytics defaults to 100% sampling on free; high-volume sites sometimes sample down. At M1 traffic this is moot; surface for M2+ if pageview-quota becomes a concern (currently no quota documented).
Telemetry retention. Workers Logs: 7-day. Axiom: 30-day. Workers Analytics Engine: 90-day default for AE event data. For long-tail trend analysis (year-over-year WQOC), a small daily aggregator that stashes per-day rollups in Postgres covers it; in scope for the implementation ticket if MASTER wants > 90-day historical metrics.
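
The daily aggregator mentioned under Telemetry retention (and as the migration path off Analytics Engine in Trade-offs) reduces to a group-by on day and retailer. A minimal sketch of that rollup step, assuming hypothetical row shapes with epoch-millisecond timestamps and UTC day boundaries:

```typescript
// Hypothetical row shapes; timestamps are epoch milliseconds.
interface ClickEvent { retailerId: string; clickedAt: number }
interface DailyRollup { day: string; retailerId: string; clicks: number }

// Groups clicks by UTC calendar day and retailer. The output rows are what
// a daily job would upsert into a Postgres rollup table.
function rollupByDay(events: ClickEvent[]): DailyRollup[] {
  const counts = new Map<string, DailyRollup>();
  for (const e of events) {
    const day = new Date(e.clickedAt).toISOString().slice(0, 10); // YYYY-MM-DD
    const key = `${day}|${e.retailerId}`;
    const row = counts.get(key) ?? { day, retailerId: e.retailerId, clicks: 0 };
    row.clicks += 1;
    counts.set(key, row);
  }
  // Stable ordering makes the output easy to diff and to test.
  return Array.from(counts.values()).sort(
    (a, b) => a.day.localeCompare(b.day) || a.retailerId.localeCompare(b.retailerId),
  );
}
```

Whether the day boundary should be UTC or Beirut local time is left open here; a weekly WQOC figure is sensitive to that choice only at week edges.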

Implementation plan

Once MASTER picks this stack, the implementation work for #43 (and the wiring that follows after #19 ships hosting):

Out of scope

APM / distributed tracing. Workers performance.now() zeroes on CPU-bound spans; defer until platform clock changes or until non-Workers host enters scope.
Session replay. Add later if a specific debugging scenario demands it (PostHog or Highlight free tier covers it on demand).
Synthetic monitoring (Pingdom, UptimeRobot equivalents). Cloudflare's Workers Observability detects errors after the fact; synthetic checks (Cloudflare Health Checks free tier covers basic uptime) are a follow-up if needed.
Real distributed-tracing destinations (Honeycomb, Tempo). Same reason as APM — server-side traces are zero-signal on Workers today.
Per-feature analytics (funnels, retention cohorts). Deferred until product-feedback loop demands them; PostHog Free is the natural pivot if/when that need surfaces.
Affiliate-postback observability. Owned by #17 once that ships; uses the same Workers Analytics Engine + Workers Logs pipes this RFC sets up.
Security-event observability (CSP violations, rate-limit triggers, WAF events). kpis.md §2.4 health row names them as M2 metrics; the collection uses the same Workers Logs + Notifications path this RFC picks. Which alerts cross the "wake someone up" bar is owned by #44.
Backups + disaster recovery observability. Pulled along by RFC-0001 § Open questions; separate ADR after RFC-0001 closes.
Code changes in this branch. This RFC is decision-time; no package.json edits, no env-var additions, no wrangler.jsonc changes land in this commit per the foundation-ticket workflow.

Verification checklist (post-implementation)

Once the implementation ticket ships, the stack is "working" iff:

Workers Observability dashboard shows non-zero invocations + per-route P95 wall-clock for the deployed app
Cloudflare Web Analytics shows pageviews + LCP/CLS/INP per page within 24h of first deploy
A deliberately-thrown test exception in a route handler appears in Sentry within 5 minutes, with source-mapped stack trace
A click on /api/go/... produces a row in the Click table AND a data point in the Workers Analytics Engine OUTBOUND_CLICKS dataset (verified with a query against the Analytics Engine SQL API)
A scrape job failure (manually triggered with a forced exception) fires a Cloudflare Notification email within 10 minutes
Axiom dataset shows logs from Workers within 1h of the first deploy with default 30-day retention
All-in monthly cost reading on Cloudflare billing + Sentry usage + Axiom usage = $5/mo (the Workers Paid baseline) within first calendar month
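
For the outbound-click check in the list above, one way to read the dataset back is Analytics Engine's SQL API over HTTP. The sketch below is illustrative only: the account ID and API token are supplied by the caller, and it assumes the dataset is named OUTBOUND_CLICKS with the retailer ID stored in `blob1`, which this RFC does not fix.

```typescript
interface SqlRequest {
  url: string;
  init: { method: string; headers: Record<string, string>; body: string };
}

// Builds the HTTP request for a per-retailer click count over the last N days.
function buildClicksQuery(accountId: string, apiToken: string, days: number): SqlRequest {
  // Guard the only interpolated value so the SQL text stays well-formed.
  if (!Number.isInteger(days) || days <= 0) {
    throw new RangeError("days must be a positive integer");
  }
  // _sample_interval weights each stored row by the number of events it represents.
  const sql =
    `SELECT blob1 AS retailer_id, SUM(_sample_interval) AS clicks ` +
    `FROM OUTBOUND_CLICKS ` +
    `WHERE timestamp > NOW() - INTERVAL '${days}' DAY ` +
    `GROUP BY retailer_id ORDER BY clicks DESC`;
  return {
    url: `https://api.cloudflare.com/client/v4/accounts/${accountId}/analytics_engine/sql`,
    init: { method: "POST", headers: { Authorization: `Bearer ${apiToken}` }, body: sql },
  };
}

// Sends the query and returns the parsed JSON response.
async function fetchClicks(accountId: string, apiToken: string, days = 7): Promise<unknown> {
  const { url, init } = buildClicksQuery(accountId, apiToken, days);
  const res = await fetch(url, init);
  if (!res.ok) throw new Error(`Analytics Engine SQL API returned ${res.status}`);
  return res.json();
}
```

A non-empty result for the retailer used in the test click would satisfy the Analytics Engine half of that checklist item; the Click table half is a plain Postgres query.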