ADR-0015: Observability stack
- Status: Accepted
- Date: 2026-04-28
- Deciders: MASTER
- Related: RFC-0007, #43, #19, #18, #14, #28, #29, #44, ADR-0006 hosting Cloudflare, ADR-0007 background jobs pg-boss, ADR-0011 monetisation rates, ADR-0012 security controls posture, ADR-0014 compliance baseline, Reference → KPIs, Reference → Performance budget
Context
RFC-0007 surveyed observability options against three concrete costs of not deciding: (1) the north-star Weekly Qualified Outbound Clicks (kpis.md) cannot be computed without an event sink on /api/go/...; (2) scraper drift goes silently to production without per-retailer success-rate alerting; (3) the performance-budget.md P75 LCP ≤ 2.5s target is unenforceable without a RUM beacon.
Three constraints collapsed the option space: ADR-0006 places us on Cloudflare Pages + Workers + Cron Triggers; ADR-0007 puts background jobs on pg-boss with no Redis dependency; and the M1 cost ceiling is $5–10/mo total infra.
Decision
Primary stack — Cloudflare-native, augmented by Sentry Free and Axiom Free. Total incremental observability spend at M1 = $0 above the $5/mo Workers Paid baseline already locked by ADR-0006.
| Concern | Tool |
|---|---|
| Pageviews + sessions + referrers + geo | Cloudflare Web Analytics (Pages-project auto-enabled, no script tag) |
| Core Web Vitals RUM (LCP / CLS / INP / TTFB / FCP) | Cloudflare Web Analytics RUM beacon (auto-enabled) |
| Worker invocation metrics (count, errors, P95 wall-clock, CPU) | Cloudflare Workers Observability dashboard |
| Application logs (route handlers, scraper events, structured JSON) | Cloudflare Workers Logs (20M events/mo / 7-day retention) |
| Cron Trigger execution history | Cloudflare Workers Cron Events panel |
| Custom metrics (north-star: WQOC; secondary: build-session funnel) | Workers Analytics Engine writeDataPoint API |
| Long-retention log archive (30 days, queryable) | Axiom Free Personal via Cloudflare Logpush native app |
| Error grouping + stack-trace dedup + release tagging | Sentry Free Developer |
| Alerting destination | Cloudflare Notifications email (default); optional Tail Worker → webhook |
| Distributed tracing / APM | Deferred to M2/M3 — see Consequences |
| Session replay | Deferred — add if a specific bug demands it |
Q1 (north-star confirmation). Weekly Qualified Outbound Clicks per kpis.md is the north-star. Workers Analytics Engine writeDataPoint per outbound click on /api/go/r/[retailerId]/p/[listingId] with (retailerId, categoryId, listingId, deviceClass, refererClass) dimensions. Daily aggregator (Q6) preserves long-horizon trend. Quality-first metrics (match-rate per category, comparable retailers per session) ride the same collection pipes; surface them as secondary dashboard cards once UX feedback identifies which one matters at M2.
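A minimal sketch of what the per-click write could look like, assuming the five dimensions above map onto Analytics Engine blobs in a fixed order. The `OutboundClick` type, the `recordOutboundClick` helper, and the allowed `deviceClass` / `refererClass` values are illustrative assumptions, not part of this ADR; `AnalyticsEngineLike` is a local stand-in for the binding type that `@cloudflare/workers-types` provides.

```typescript
// Local stand-in for the Analytics Engine binding type, so the sketch is self-contained.
interface AnalyticsEngineLike {
  writeDataPoint(point: { blobs?: string[]; doubles?: number[]; indexes?: string[] }): void;
}

// Hypothetical shape of one outbound click; field names follow the ADR's dimensions.
// The value sets for deviceClass and refererClass are assumptions.
interface OutboundClick {
  retailerId: string;
  categoryId: string;
  listingId: string;
  deviceClass: "mobile" | "tablet" | "desktop";
  refererClass: "search" | "social" | "direct" | "other";
}

function recordOutboundClick(dataset: AnalyticsEngineLike, click: OutboundClick): void {
  dataset.writeDataPoint({
    // Analytics Engine takes a single index; retailerId is used here as an assumption.
    indexes: [click.retailerId],
    // Blobs are string dimensions; their order is the contract for later queries.
    blobs: [
      click.retailerId,
      click.categoryId,
      click.listingId,
      click.deviceClass,
      click.refererClass,
    ],
    // One numeric value of 1 per click, so click counts can be computed as a sum.
    doubles: [1],
  });
}
```

Keeping the blob order in one helper means dashboard queries and the daily aggregator only need to agree on a single positional layout.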
Q2 (cost ceiling). $0 incremental at M1. Sentry Team ($26/mo) + Plausible Starter ($9/mo) is the upgrade path only if (a) Slack alerting becomes mission-critical AND the bridge in Q3 falls short, OR (b) goal-funnel UX from Plausible meaningfully outperforms what we can build on Workers Analytics Engine + Axiom dashboards.
Q3 (Sentry email-alert bridge). Sentry Free is email-only for alerts. Bridge to richer destinations using a tiny inbox-watcher Worker (or just keep alerts in Gmail) before paying for Sentry Team. The $26/mo upgrade is the wrong slope at M1.
Q4 (Telegram-notify integration). Opt-in, not default. Mixing personal-life notifications and project-alert noise is a real UX cost. The Tail Worker → webhook bridge in the implementation ticket leaves the Telegram destination switchable; default route is Cloudflare Notifications email + Sentry email.
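One way the opt-in bridge could be shaped, as a sketch only: a pure function turns tail events into alert records, and the Tail Worker forwards them only when a webhook is configured. `TraceLike` is a trimmed local view of the runtime's trace item, and `toAlerts`, `Alert`, and the `ALERT_WEBHOOK_URL` variable are hypothetical names introduced for illustration.

```typescript
// Trimmed local view of a Tail Worker trace item; the runtime type carries more fields.
interface TraceLike {
  scriptName: string | null;
  outcome: string; // "ok" for a clean invocation, otherwise a failure outcome
  exceptions: { name: string; message: string }[];
}

// Hypothetical alert record handed to whichever destination is switched on.
interface Alert {
  script: string;
  outcome: string;
  summary: string;
}

function toAlerts(events: TraceLike[]): Alert[] {
  return events
    // Keep only invocations that failed or recorded at least one exception.
    .filter((e) => e.outcome !== "ok" || e.exceptions.length > 0)
    .map((e) => ({
      script: e.scriptName ?? "unknown",
      outcome: e.outcome,
      summary:
        e.exceptions.map((x) => `${x.name}: ${x.message}`).join("; ") ||
        "no exception recorded",
    }));
}

// Sketch of the handler wiring (not exercised here). ALERT_WEBHOOK_URL is a
// hypothetical variable; leaving it unset keeps the email-only default.
//
// export default {
//   async tail(events: TraceLike[], env: { ALERT_WEBHOOK_URL?: string }) {
//     if (!env.ALERT_WEBHOOK_URL) return;
//     for (const alert of toAlerts(events)) {
//       await fetch(env.ALERT_WEBHOOK_URL, {
//         method: "POST",
//         headers: { "content-type": "application/json" },
//         body: JSON.stringify(alert), // body shape depends on the destination
//       });
//     }
//   },
// };
```

Because the destination is a single configuration value, switching between Discord, Slack, or Telegram would change the body format and URL, not the filtering logic.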
Q5 (RUM beacon sampling). 100% at M1; revisit at M2 if Cloudflare Web Analytics introduces a quota or pageview limit. Currently no documented quota; this is a watch-list item, not an active decision.
Q6 (telemetry retention). Daily aggregator pg-boss job stashes per-day rollups in Postgres for indefinite trend retention, while preserving the per-flow retention table in ADR-0014 D4 for raw log data (90-day click logs, 30-day Axiom archive, 7-day Workers Logs). Aggregates and raw events live on different retention timelines: aggregates forever, raw events per ADR-0014.
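The rollup step could look roughly like the following sketch, which collapses raw click events into per-day, per-retailer counts before they are written to Postgres. The `ClickEvent` and `DailyRollup` shapes and the choice of UTC day boundaries are assumptions for illustration; the pg-boss scheduling and the database write are left out.

```typescript
// Hypothetical raw event and rollup row shapes.
interface ClickEvent {
  timestampMs: number;
  retailerId: string;
}

interface DailyRollup {
  day: string; // YYYY-MM-DD, UTC (assumption: the ADR does not fix a timezone)
  retailerId: string;
  clicks: number;
}

function rollupClicksByDay(events: ClickEvent[]): DailyRollup[] {
  const rows = new Map<string, DailyRollup>();
  for (const e of events) {
    // toISOString() is always UTC, so the first 10 characters are the UTC date.
    const day = new Date(e.timestampMs).toISOString().slice(0, 10);
    const key = `${day}|${e.retailerId}`;
    const row = rows.get(key) ?? { day, retailerId: e.retailerId, clicks: 0 };
    row.clicks += 1;
    rows.set(key, row);
  }
  // Deterministic order makes the output easy to diff and to upsert.
  return Array.from(rows.values()).sort(
    (a, b) => a.day.localeCompare(b.day) || a.retailerId.localeCompare(b.retailerId),
  );
}
```

Only the rollup rows need to outlive the raw-event windows, which is what lets aggregates and raw events sit on separate retention timelines.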
Bundle weight discipline. Sentry SDK is the only client-side observability cost. Tree-shake aggressively in next.config.ts — disable replayIntegration (not used) and browserTracingIntegration (Workers tracing zero-signal anyway). Target ≤ 20 KB gz on the initial-JS bundle; counts against the performance-budget.md §2.2 100 KB ceiling.
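The two limits above reduce to simple arithmetic, sketched here as a check a build script could run. The function name, the report shape, and the reading of "KB" as 1024 bytes are assumptions; measuring the gzipped sizes is left to whatever bundle analyser the build already uses.

```typescript
const KB = 1024; // assumption: budget figures are counted in 1024-byte units
const SENTRY_TARGET_GZ = 20 * KB; // Sentry SDK target from this ADR
const INITIAL_JS_CEILING_GZ = 100 * KB; // performance-budget.md §2.2 ceiling

interface BudgetReport {
  sentryOk: boolean;
  totalOk: boolean;
  headroomBytes: number; // negative when the ceiling is exceeded
}

function checkInitialJsBudget(sentryGzBytes: number, otherGzBytes: number): BudgetReport {
  const total = sentryGzBytes + otherGzBytes;
  return {
    sentryOk: sentryGzBytes <= SENTRY_TARGET_GZ,
    totalOk: total <= INITIAL_JS_CEILING_GZ,
    headroomBytes: INITIAL_JS_CEILING_GZ - total,
  };
}
```

For example, an 18 KB Sentry chunk alongside 70 KB of other initial JS passes both checks with 12 KB of headroom.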
Re-evaluation triggers.
- Q1: re-shape if MASTER picks a different north-star at quarterly KPI review.
- Q2: revisit if budget unlocks for richer alerting OR if a bug class repeatedly bites that the free tier doesn't surface.
- Q4: revisit when alerting volume justifies the personal-Telegram destination (currently zero alerts/day).
- Q5: revisit at M2 if Cloudflare introduces a Web Analytics quota.
- APM / distributed tracing: revisit when Cloudflare fixes the Workers performance.now() zeroing on CPU-bound spans, OR when 961tech adds a non-Workers host to the deployment topology.
- Session replay: revisit on the first concrete bug class that can't be diagnosed without screen recording.
Consequences
Positive
- Zero incremental cost at M1. Total infra observability spend stays at $5/mo (Workers Paid, already in ADR-0006). Trajectory at M2 (5× M1) stays inside all free tiers.
- All three "cost of not deciding" gaps closed: north-star is collectible, scraper drift is alertable, performance budget is measurable.
- Single host of truth for invocation metrics + logs + crons in the Cloudflare dashboard. No "three tabs to debug one Worker" UX cost.
- Sentry Free covers symbolicated stack traces + release tagging — the parts the Cloudflare-native stack lacks. The first time a real-user error has a stack trace pointing at a specific commit, the Sentry SDK pays for itself.
- Axiom Free 30-day archive via native Logpush with no glue code; APL queries cover the long-tail debugging window.
- Workers Analytics Engine writeDataPoint is the right shape for high-cardinality custom metrics (per-retailer, per-category, per-deviceClass) without paying for Datadog or building our own time-series service. Currently unbilled per Cloudflare's published disclaimer; even at the published $0.25/M-write rate, M1 cost stays under $1/mo.
- Daily aggregator preserves long-horizon trend without violating ADR-0014 retention: aggregates forever, raw events per ADR-0014 windows.
Negative
- No distributed tracing at M1. Workers performance.now() zeroes on CPU-bound spans (anti-timing-attack). Spans show 0ms, making server-side traces low-signal. Mitigated by in-handler Date.now() deltas around DB calls + structured logs; revisit when the platform fixes the clock.
- Sentry email-only alerts on free tier. Slack/Discord/webhook integrations gated to Team ($26/mo). Bridge via inbox-watcher Worker is functional but has a one-link maintenance surface.
- Workers Logs 7-day retention. Anything older requires Axiom. Acceptable; live debugging is always within 7 days, post-mortem digs go to Axiom.
- Workers Analytics Engine billing risk. Cloudflare's "currently not billed" disclaimer could activate without notice. Mitigated by daily-aggregator-to-Postgres path — if AE billing flips on and economics turn unfavourable, migration is mechanical.
- Cloudflare lock-in on Web Analytics + Workers Logs + Logpush. Mitigated by the host already being Cloudflare; lock-in is shared with ADR-0006, not new.
- No session replay. Diagnostic gain at M1 is low-value vs cookie-consent-UX cost (per ADR-0014 D1) + bundle weight from rrweb (~30–50 KB gz). Trade-off acceptable; revisit if a specific bug class demands it.
- Sentry SDK bundle weight. Tree-shaken to ~15–20 KB gz; counts against performance-budget.md §2.2 100 KB initial-JS budget. Discipline required in next.config.ts.
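The Date.now() mitigation named in the tracing item above could be wrapped in a small helper like this sketch, which measures wall-clock time across an awaited call and returns a structured log record. `startSpan`, `SpanLog`, and the field names are hypothetical; the clock is injectable only so the helper can be tested deterministically.

```typescript
// Hypothetical structured log record for one timed operation.
interface SpanLog {
  event: "span";
  label: string;
  durationMs: number;
  [extra: string]: unknown;
}

// Starts a timer and returns a function that stops it and builds the log record.
function startSpan(
  label: string,
  now: () => number = Date.now,
): (extra?: Record<string, unknown>) => SpanLog {
  const startedAt = now();
  return (extra = {}) => ({
    ...extra,
    event: "span",
    label,
    durationMs: now() - startedAt,
  });
}

// Usage inside a route handler (sketch; `db` is a placeholder):
//   const stop = startSpan("db.listings.byId");
//   const rows = await db.query(/* ... */);
//   console.log(JSON.stringify(stop({ rows: rows.length })));
```

Emitting the record as one JSON line keeps it queryable in Workers Logs and in the Axiom archive without extra parsing.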
Neutral
- This ADR does not author any code. Implementation lands in a follow-up ticket post-#19 hosting deploy. The follow-up ticket spec is captured in the implementation plan below.
- The Cloudflare Notifications + Sentry email path is a no-glue baseline. The Tail Worker → webhook bridge is opt-in; the implementation ticket carries the option but does not commit to a specific destination (Discord vs Slack vs Telegram is MASTER's call at implementation time).
- This ADR does not address synthetic monitoring (Pingdom-equivalent). Cloudflare Health Checks free tier covers basic uptime if/when needed; not on the M1 critical path.
Alternatives considered
Sentry-only at M1
Rejected. No first-class custom-metrics API for the WQOC north-star (tag-based stashing is awkward to query); Web Vitals captured weakly via Sentry's perf tab vs Cloudflare Web Analytics' first-class beacon; no pageview attribution by referrer/geo. Misses the natural fit of the Cloudflare-native primitives that come free with the chosen host.
PostHog Free at M1
Rejected. Genuinely strong second choice; deciding factor is bundle weight (full posthog-js ~60 KB gz with replay; lite ~10–15 KB gz competing for headroom we want for Sentry symbolication) and the cookie-consent-UX cost (ADR-0014 D1 stays banner-free at M1, and PostHog defaults to cookies-on). PostHog is the right pivot if/when feature-flag-driven experimentation becomes important (M3+).
Highlight.io PAYG
Rejected. PAYG floor of $50/mo flat exceeds the entire $20/mo observability budget; free tier (500 sessions/mo) too thin to be production. Self-host adds a CX32+ Hetzner ($13/mo) and ops time we don't have.
Datadog
Rejected. Per-host commitments + complex SKU mixing; Pro-tier minimum kicks in at ~$15/mo just for logs at our volume. Right answer at $5k/mo budget; wrong at $20.
Custom DIY (R2 + grep + Workers + Discord webhooks)
Rejected. Engineering time to backfill error grouping, source-mapped traces, dashboards, Web Vitals collection > 100 hours. Not a fit for evening-and-weekends — Sentry Free + Axiom Free + Cloudflare Web Analytics deliver the same capability for $0.
Plausible Cloud at M1
Rejected. $9/mo for what Cloudflare Web Analytics + a 30-line outbound-tracking Worker gives free; Plausible has no Web Vitals product so we'd need CFWA anyway. Custom outbound tracker is already needed for the WQOC north-star (server-side counts beat client beacons that adblockers eat). Revisit if M2 wants Plausible's per-goal funnel UI specifically.
Sentry Team upgrade for Slack alerting
Rejected at M1. $26/mo is the wrong slope when an inbox-watcher Worker bridges email → webhook for $0. Revisit if alert volume + on-call posture justifies the upgrade.
Telegram-notify default-on
Rejected. Mixing personal-life and project-alert noise is real UX cost. Opt-in via Tail Worker keeps the destination switchable when MASTER decides.
APM / distributed tracing at M1
Rejected. Workers performance.now() zeroes on CPU-bound spans (Cloudflare-platform behaviour, not framework choice). Server-side spans low-signal until platform fixes the clock or non-Workers host enters scope.
Session replay at M1
Rejected. High-value at scale; at M1 with low traffic the diagnostic gain is outweighed by cookie-consent UX cost (banned by ADR-0014 D1) + ~30–50 KB gz bundle weight. Revisit on specific debugging demand.
Implementation plan (deferred to follow-up ticket)
This ADR locks the decision; implementation lands in the follow-up ticket #43 after #19 hosting ships. The follow-up ticket spec:
- Cloudflare Pages project: enable Web Analytics on the project (dashboard click)
- wrangler.jsonc: enable the observability block; set head_sampling_rate to 1.0 for M1
- Add @sentry/cloudflare + @sentry/nextjs to dependencies; instrumentation.ts per Sentry docs; withSentryConfig in next.config.ts; disable replayIntegration + browserTracingIntegration; verify client bundle ≤ 20 KB gz
- Sentry org + project: free Developer plan; capture DSN as SENTRY_DSN env var (per env-vars.md)
- Axiom org + dataset: free Personal plan; install Axiom Cloudflare Logpush app; verify ingest within 1 hour
- Workers Analytics Engine: bind OUTBOUND_CLICKS dataset in wrangler.jsonc; writeDataPoint per outbound click in /api/go/... route handler with (retailerId, categoryId, listingId, deviceClass, refererClass) dimensions
- Cloudflare Notifications: configure email alerts for: (a) Worker error rate > 1% over 5 min; (b) Logpush job health; (c) Worker script deploy failure; (d) saved-Workers-Logs-query alert at scraper-success-rate < 80% over 24h
- (Optional, MASTER-decision at implementation time) Tail Worker forwarding errors to webhook destination (Discord / Slack / Telegram)
- Daily aggregator pg-boss job: rolls up WQOC + per-retailer scraper success + per-route P95 latency into a MetricRollup Postgres table for indefinite trend retention
- Document SQL queries for M1 north-star + secondary metrics in runbooks/observability-queries.md
- Update tech-stack.md § Observability to replace "Nothing instrumented" with the live stack
- Update architecture/deployment.md § Observability to replace placeholder with actual diagram
- Verification checklist (post-deploy) per RFC-0007 § Verification checklist
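The scraper-success-rate alert in the list above is specified as a saved Workers Logs query; the equivalent logic can be sketched in code to make the threshold semantics concrete. `ScrapeRun` and `retailersBelowThreshold` are hypothetical names, and the caller is assumed to pass only runs from the trailing 24 hours.

```typescript
// Hypothetical record of one scraper run, as it might appear in structured logs.
interface ScrapeRun {
  retailerId: string;
  ok: boolean;
}

// Returns retailers whose success rate is strictly below the threshold (default 80%).
function retailersBelowThreshold(
  runs: ScrapeRun[],
  threshold = 0.8,
): { retailerId: string; successRate: number }[] {
  const tally = new Map<string, { ok: number; total: number }>();
  for (const run of runs) {
    const t = tally.get(run.retailerId) ?? { ok: 0, total: 0 };
    t.total += 1;
    if (run.ok) t.ok += 1;
    tally.set(run.retailerId, t);
  }
  return Array.from(tally.entries())
    .map(([retailerId, t]) => ({ retailerId, successRate: t.ok / t.total }))
    .filter((r) => r.successRate < threshold);
}
```

Computing the rate per retailer, not across all retailers, is what lets drift on a single site surface even when the overall success rate still looks healthy.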
References
- RFC-0007 — Observability stack (full evidence + per-tool cost analysis)
- Reference → KPIs — what this stack collects (north-star + secondary + health rows)
- Reference → Performance budget — what some signals police (RUM enforces P75 LCP)
- Reference → Tech stack §Observability — current "nothing instrumented" state this ADR replaces
- Reference → Personas §5.2 — Lebanese mobile reality (RUM Web Vitals priority)
- Reference → Competitive landscape §3.7 — out-craft-the-genre mandate enabled by real RUM
- Architecture → Deployment §Observability — placeholder this ADR replaces
- ADR-0006 hosting Cloudflare — infra constraint that drove the Cloudflare-native fit
- ADR-0007 background jobs pg-boss — scraper job source for many signals + daily aggregator host
- ADR-0011 monetisation rates — WQOC ties to CPS revenue path
- ADR-0012 security controls posture — security-event observability shares the same Workers Logs + Notifications pipes
- ADR-0014 compliance baseline — D4 retention windows constrain raw-event retention; aggregator preserves trend separately
- #43 — implementation ticket
- #19 hosting — implementation gate (this ADR can't deploy until #19 ships)
- #44 security review — security-event alerts share collection pipes; alert-vs-noise tuning owned there
- #17 affiliate reconciliation — affiliate-postback observability uses same WAE + Workers Logs pipes