
RFC-0007: Observability stack

Decision gate. This RFC proposes the observability stack — log shipper, error reporter, metrics layer, RUM, alerting destination, and dashboard surface. It does not lock anything in. Once MASTER picks, the choice gets captured as an ADR and #43 closes; implementation lands in a follow-up alongside #19.

Summary

961tech needs to see what its production infrastructure is doing without paying for it. Today, tech-stack.md § Observability is "nothing instrumented." RFC-0001 picks Cloudflare Pages + Workers + Cron Triggers (BEY PoP, ~$5/mo all-in); RFC-0002 picks pg-boss on the existing Postgres (no new Redis). Those constraints plus a $5–20/mo M1 budget plus a "ship M1 now" mindset collapse the option space tightly.

Recommendation. Cloudflare-native primary stack — Workers Observability (logs + invocation metrics + cron history), Cloudflare Web Analytics (pageviews + Core Web Vitals RUM), Workers Analytics Engine (custom metrics, including outbound-click rollups for the north-star KPI), Cloudflare Notifications (email alerts) — augmented by Sentry Free (error grouping + release tagging the Cloudflare-native stack lacks) and Axiom Free (30-day log retention backstop via native Cloudflare Logpush integration). Total incremental observability spend at M1 = $0 above the $5/mo Workers Paid baseline already in RFC-0001. Trajectory at M2 (5× M1) stays inside all free tiers.

Defer distributed tracing and APM to M2/M3 — Cloudflare Workers performance.now() zeroes for CPU-bound spans, making server-side traces low-signal until the platform fixes it. Defer the Telegram-notify bridge as opt-in (MASTER-decision); default alert path is Cloudflare email.

Motivation

Three concrete, immediate costs of not deciding:

  1. No way to validate the north-star KPI. kpis.md defines Weekly Qualified Outbound Clicks. Without an event sink for the existing /api/go/r/[retailerId]/p/[listingId] route, that number cannot be computed at all.
  2. No drift detection on scrapers. personas.md §5.6 makes "no real-time stock signal" a severity-3 pain across four personas. The product's trust proposition depends on freshness; without scraper-success-rate alerting, a parser break that returns 0 listings for PCAndParts stays silently in production until a user complains. architecture/deployment.md § Observability names this as a planned signal, but no infra exists.
  3. No way to police the performance budget. performance-budget.md sets P75 LCP ≤ 2.5s on Lebanese mobile. Without a RUM beacon, that's a wish, not a measurement.

Plus the foundation cost: every feature ticket from M1 forward carries an implicit "we'll add observability later" footnote. "Later" is the wrong default for a solo evening project — pick something cheap that ships M1.

Proposal

Recommendation: Cloudflare-native primary + Sentry Free + Axiom Free

| Concern | Tool | Cost |
| --- | --- | --- |
| Pageviews + sessions + referrers + geo | Cloudflare Web Analytics (Pages-project auto-enabled, no script tag) | $0 |
| Core Web Vitals RUM (LCP / CLS / INP / TTFB / FCP) | Cloudflare Web Analytics RUM beacon (auto-enabled) | $0 |
| Worker invocation metrics (count, errors, P95 wall-clock, CPU) | Cloudflare Workers Observability dashboard (built-in to Workers Paid) | $0 (in $5/mo Workers Paid from RFC-0001) |
| Application logs (route handlers, scraper events, structured JSON) | Cloudflare Workers Logs (20M events/mo / 7-day retention on Workers Paid) | $0 |
| Cron Trigger execution history | Cloudflare Workers Cron Events panel (last 100 runs, status, duration) | $0 |
| Custom metrics (north-star: outbound clicks per retailer per day; secondary: build-session funnel) | Workers Analytics Engine writeDataPoint API + SQL query | $0 (currently unbilled per Cloudflare's published "you will not be billed for AE" disclaimer; published $0.25/M-write rate stays under $1/mo at M1 even if billing flips on) |
| Long-retention log archive (30 days, queryable) | Axiom Free Personal tier via Cloudflare Logpush native app (1.5 GB/mo of 500 GB free; 30-day retention; APL query language; native dashboards) | $0 |
| Error grouping + stack-trace dedup + release tagging + Next.js + Cloudflare Workers SDKs | Sentry Free Developer tier (5k errors/mo / 5GB logs / 5M spans / 50 replays / 1 cron monitor / 1 user / 30-day lookback / email-only alerts) | $0 |
| Alerting destination (default) | Cloudflare Notifications email (free, included on Workers Paid; fires on Worker errors, script failures, Logpush job health) | $0 |
| Alerting destination (rich, optional) | Tail Worker filtering events → Discord/Slack incoming webhook (~30 lines of code; Tail Workers billed only by their own CPU time, rounding-error within Workers Paid included CPU) | $0 |
| Alerting destination (personal channel, opt-in MASTER decision) | telegram-notify MCP bridge via Tail Worker → Telegram Bot API | $0 (gated on MASTER decision, see Open question 4) |
| Distributed tracing / APM | Deferred to M2/M3 (see Trade-offs) | $0 |
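
To make the Analytics Engine worst case concrete, here is a small sketch of the arithmetic. It assumes the rates quoted in this RFC ($0.25 per million writes, $1 per million reads) apply from the first event with no free allowance. The function names and any volumes passed in are illustrative assumptions, not measured M1 traffic.

```typescript
// Worst-case Workers Analytics Engine cost if billing is switched on.
// Rates are the ones quoted in this RFC; volumes are hypothetical inputs.
const USD_PER_MILLION_WRITES = 0.25;
const USD_PER_MILLION_READS = 1.0;

function estimateAeMonthlyCostUsd(writesPerMonth: number, readsPerMonth: number): number {
  const writeCost = (writesPerMonth / 1_000_000) * USD_PER_MILLION_WRITES;
  const readCost = (readsPerMonth / 1_000_000) * USD_PER_MILLION_READS;
  return writeCost + readCost;
}

// Largest monthly write volume that stays at or under a budget, ignoring reads.
function maxWritesForBudget(budgetUsd: number): number {
  return Math.floor((budgetUsd / USD_PER_MILLION_WRITES) * 1_000_000);
}
```

Under these assumptions, the write side stays under $1/mo for any volume below 4 million data points a month.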

What changes from today

  • App moves from "no instrumentation" to: Workers Observability switched on via the observability block in wrangler.jsonc, Cloudflare Web Analytics enabled on the Pages project, and the @sentry/cloudflare + @sentry/nextjs SDK packages added. These packages are the only package.json change this RFC implies, and they land in the implementation ticket, not in this RFC's commit.
  • Scraper jobs (per RFC-0002) emit structured logs from each pg-boss handler — picked up by Workers Logs automatically.
  • Click-out route (/api/go/...) writes one Analytics Engine data point per click in addition to the existing Click row insert.
  • Cloudflare Notifications wires email alerts on: (a) Worker error rate > 1%, (b) Logpush job failure, (c) Workers script deploy failure. Scraper-failure alert is implemented as a saved Workers Logs query that triggers via the same Notifications channel.
  • Axiom Logpush job created via the Axiom Cloudflare app (token paste, no glue code).
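
The click-out change above could look roughly like the following sketch. The `writeDataPoint` call with `indexes`, `blobs`, and `doubles` is the Analytics Engine binding API; the local interfaces, the `ClickContext` shape, the `category` dimension, and the blob ordering are assumptions made for illustration.

```typescript
// Minimal structural types for the Analytics Engine binding; in a real project
// the types come from @cloudflare/workers-types.
interface AnalyticsEngineDataPoint {
  indexes?: string[];
  blobs?: string[];
  doubles?: number[];
}
interface AnalyticsEngineDataset {
  writeDataPoint(point: AnalyticsEngineDataPoint): void;
}

// Hypothetical per-click context resolved by the route handler.
interface ClickContext {
  retailerId: string;
  listingId: string;
  category: string;
}

// Build the data point for one outbound click.
// Assumed layout: blob1 = retailer, blob2 = category, blob3 = listing, double1 = count.
function buildClickDataPoint(click: ClickContext): AnalyticsEngineDataPoint {
  return {
    indexes: [click.retailerId], // sampling key; Analytics Engine takes a single index
    blobs: [click.retailerId, click.category, click.listingId],
    doubles: [1],
  };
}

// Called from the /api/go/... handler next to the existing Click row insert.
function recordOutboundClick(dataset: AnalyticsEngineDataset, click: ClickContext): void {
  dataset.writeDataPoint(buildClickDataPoint(click));
}
```

Keeping the data-point construction in a pure function means the blob layout can be unit-tested without a Workers runtime, and the Postgres insert stays the source of truth if the Analytics Engine write is ever dropped.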

Behaviour

```mermaid
graph TB
    subgraph Visitor["Lebanese mobile visitor (BEY PoP)"]
      U[Visitor]
    end

    subgraph CF["Cloudflare edge"]
      Pages[Cloudflare Pages<br/>Next.js on Workers]
      WA[Web Analytics RUM beacon<br/>auto-enabled]
      WL[Workers Logs<br/>20M/mo, 7-day]
      WO[Workers Observability dashboard<br/>invocations/errors/P95]
      AE[Workers Analytics Engine<br/>custom metrics]
      CF_N[Cloudflare Notifications<br/>email]
      Cron[Cron Triggers<br/>scraper invocation]
    end

    subgraph Backstops["Free-tier augments"]
      AX[Axiom Personal<br/>30-day log archive]
      SE[Sentry Free Developer<br/>error grouping]
    end

    U -->|page load + RUM beacon| WA
    U -->|HTTP request| Pages
    Pages -->|console.log + structured| WL
    Pages -->|invocation telemetry| WO
    Pages -->|writeDataPoint per click| AE
    Pages -->|exception| SE
    Cron -->|scrape job| Pages

    WL -->|Logpush native app| AX
    WO -->|threshold breach| CF_N
    WL -->|saved query alert| CF_N

    CF_N -->|email| MASTER[MASTER inbox]
    SE -->|email| MASTER

    Backstops -.->|optional| Tail[Tail Worker → Discord/Slack/Telegram]
    Tail -.-> MASTER
```
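
The optional dotted path in the diagram can be sketched as a Tail Worker that forwards failed invocations to a webhook. This is a minimal sketch, not the planned implementation: the event shape is a trimmed-down assumption covering only the fields read here (`scriptName`, `outcome`, `exceptions`), and `ALERT_WEBHOOK_URL` is a hypothetical secret name.

```typescript
// Minimal shapes for the fields this sketch reads from a Tail Worker event;
// the full TraceItem type lives in @cloudflare/workers-types.
interface TailException { name: string; message: string }
interface TailEvent {
  scriptName: string | null;
  outcome: string; // e.g. "ok", "exception", "exceededCpu"
  exceptions: TailException[];
}
interface Env { ALERT_WEBHOOK_URL: string } // hypothetical secret name

// Turn failed invocations into one message per event; healthy events are dropped.
function formatAlerts(events: TailEvent[]): string[] {
  return events
    .filter((e) => e.outcome !== "ok" || e.exceptions.length > 0)
    .map((e) => {
      const first = e.exceptions[0];
      const detail = first ? `${first.name}: ${first.message}` : "no exception captured";
      return `[${e.scriptName ?? "unknown"}] outcome=${e.outcome} ${detail}`;
    });
}

// In a Worker module this object would be the default export.
const tailWorker = {
  async tail(events: TailEvent[], env: Env): Promise<void> {
    const alerts = formatAlerts(events);
    if (alerts.length === 0) return;
    // Discord incoming webhooks accept {"content": "..."}; Slack expects {"text": "..."}.
    await fetch(env.ALERT_WEBHOOK_URL, {
      method: "POST",
      headers: { "content-type": "application/json" },
      body: JSON.stringify({ content: alerts.join("\n").slice(0, 1900) }),
    });
  },
};
```

The 1900-character cut keeps the payload under Discord's 2000-character message limit; a Telegram variant would post to the Bot API `sendMessage` method instead.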

Dashboard surfaces

  • Cloudflare dashboard — Workers & Pages → 961tech-app → Observability tab (logs + metrics + cron events). Web Analytics → 961tech-pages.dev (pageviews + Web Vitals).
  • Sentry web UI — Issues view; Releases tagged via withSentryConfig build-time. Email-only alerts on free tier.
  • Axiom web UI — saved dashboards with APL queries: per-retailer scraper success rate, per-route P95 latency from logs, error-rate trend.
  • No custom admin surface in 961tech itself at M1. Workers Analytics Engine SQL queries can be issued ad-hoc via the Analytics Engine SQL API (HTTP) or the dashboard. A /admin/jobs page reading pg-boss state per RFC-0002 § What we add lands in M2 polish, not on this RFC's critical path.
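
The per-retailer scraper success rate mentioned above presumes each scraper handler emits one structured event per run. A hedged sketch of what such an event and the rate computation might look like follows; the `scrape_run` event name and its fields are assumptions, not an existing schema.

```typescript
// Hypothetical shape of the one-line JSON event each scraper handler logs.
interface ScrapeRunEvent {
  event: "scrape_run";
  retailer: string;
  ok: boolean;
  listings: number;
  durationMs: number;
}

// Emit the event as a single JSON line so log tooling can filter on its fields.
function logScrapeRun(e: Omit<ScrapeRunEvent, "event">): ScrapeRunEvent {
  const record: ScrapeRunEvent = { event: "scrape_run", ...e };
  console.log(JSON.stringify(record));
  return record;
}

// Per-retailer success rate over a window of events. A run that parses zero
// listings counts as a failure, matching the silent parser-break scenario.
function successRateByRetailer(events: ScrapeRunEvent[]): Map<string, number> {
  const totals = new Map<string, { ok: number; all: number }>();
  for (const e of events) {
    const t = totals.get(e.retailer) ?? { ok: 0, all: 0 };
    t.all += 1;
    if (e.ok && e.listings > 0) t.ok += 1;
    totals.set(e.retailer, t);
  }
  const rates = new Map<string, number>();
  totals.forEach((t, retailer) => rates.set(retailer, t.ok / t.all));
  return rates;
}
```

An alert rule would then compare each rate against a threshold, for example flagging any retailer whose rate falls below 0.8 over the window.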

Trade-offs

| Cost | What it buys |
| --- | --- |
| Cloudflare Workers performance.now() zeroing on CPU-bound spans. Server-side spans show 0ms duration because Workers' clock only advances after I/O (anti-timing-attack). Affects all frameworks on Workers including Next.js and Sentry's tracing SDK. Means distributed tracing at M1 would be low-signal. | We defer tracing without losing functional observability. Workers Observability still gives wall-clock + CPU per invocation; query timing comes from in-handler Date.now() deltas around DB calls + structured logging, not from spans. Revisit when Cloudflare fixes the platform clock or when a tracing-friendly host enters scope. |
| Sentry email-only alerts on free tier. Slack / Discord / webhook integrations are gated to Team ($26/mo). Means error notifications stack in MASTER's inbox unless we route email → Telegram via a bridge. | Free tier covers everything else needed (5k errors/mo >> M1 needs, Next.js + Workers SDKs both GA, source-mapped traces, release tagging, 30-day lookback). The email constraint is bridgeable for $0; the alternative ($26/mo for Slack) eats the entire observability budget for one tool. |
| Workers Logs 7-day retention. Anything older requires Logpush to a destination. | Axiom Free covers it (500 GB/mo / 30-day) with native one-click integration. The 7-day Workers Logs window is plenty for live debugging; Axiom is the long-tail backstop. |
| Cloudflare Analytics Engine billing is "currently not billed." The published rate ($0.25/M write, $1/M read) could activate without notice. | Even at full billing, M1 lands at < $1/mo (well inside hard ceiling). Migration off AE if it ever becomes painful is mechanical: write the same data points to a Postgres table with a daily aggregator. |
| No session replay. PostHog (1k recordings/mo on free) and Highlight (500 sessions/mo) both ship replay; we don't. | Session replay is high-value at scale; at M1 with low traffic the DX cost (cookie consent UX, bundle weight from rrweb ~30-50 KB gz, privacy-masking config) outweighs the diagnostic gain. Add later if diagnosing a specific bug demands it; meanwhile, a screen recording plus browser DevTools is a free workflow for MASTER. |
| No first-class Slack/Telegram integration on day one. Cloudflare Notifications and Sentry Free both default to email. | Tail Worker → webhook bridge is ~30 lines of code; Telegram via Bot API is straightforward. Wired in implementation ticket only if MASTER opts in (Open question 4). |
| Sentry bundle weight (~35-55 KB gz client SDK). Counts against performance-budget.md §2.2 100 KB initial-JS budget. | Tree-shake aggressively: disabling replayIntegration (not used) and browserTracingIntegration (Workers tracing is zero-signal anyway) cuts the SDK to ~15-20 KB gz. The SDK pays for itself the first time a real-user error has a symbolicated stack trace. |
| Lock-in risk. Workers Analytics Engine, Logpush, Web Analytics are Cloudflare-specific; Sentry is a SaaS. | RFC-0001 already places us on Cloudflare; the lock-in is shared, not new. Sentry has a self-host path (getsentry/self-hosted, formerly onpremise) if we ever need it; Axiom is SaaS-only, but at $0/mo for years to come, that's an acceptable risk. |
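
The "in-handler Date.now() deltas around DB calls" approach from the first trade-off row can be sketched as a small wrapper. Because the Workers clock advances after I/O, a delta measured around an awaited database call is meaningful even though CPU-bound spans read as zero. The helper name `timed`, the `timing` event name, and the injectable clock and logger are assumptions for illustration.

```typescript
// Wrap an async operation and log its wall-clock duration as one JSON line.
// The clock and logger are injectable so the helper can be tested without real I/O.
async function timed<T>(
  label: string,
  op: () => Promise<T>,
  now: () => number = Date.now,
  log: (line: string) => void = console.log,
): Promise<T> {
  const start = now();
  try {
    return await op();
  } finally {
    // Runs on success and on failure, so slow failing queries are still measured.
    log(JSON.stringify({ event: "timing", label, durationMs: now() - start }));
  }
}
```

A handler would wrap each query, for example `await timed("db.listings", () => queryListings())`, where `queryListings` stands in for whatever data-access function the route already calls.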

Alternatives

Alternative A: Sentry-only ($0 at M1, $26 at M2)

Use Sentry Free for everything — errors, traces (low-signal), logs (5GB/mo on free), and a single cron monitor for the scraper heartbeat.

  • Where it loses. No RUM (Web Vitals captured weakly via Sentry's perf tab; not as good as CFWA). No first-class custom-metrics API (you can stash counts in tags but querying is awkward). No pageview attribution by referrer/geo. Email-only alerts on free.
  • Cost. $0 at M1; jumps to $26/mo Team if Slack/Telegram alerts become non-negotiable. Otherwise stays free through M2.
  • Why it's not the pick. Misses the natural fit of Cloudflare-native primitives that are already free with the chosen host.

Alternative B: PostHog Free ($0 at M1, $0–30 at M2)

Use PostHog (cloud free tier) for everything — product analytics, error tracking, session replay, feature flags. Native Workers SDK exists with documented caveats.

  • Where it wins. Most-bundled feature set; MIT-licensed exit path; product-analytics + feature flags are real wins for a future A/B-driven product.
  • Where it loses. Bundle weight: full posthog-js is ~60 KB gz with replay enabled, ~10-15 KB gz with posthog-js-lite. Even the lite build is competing for bundle headroom we'd prefer to spend on Sentry's error symbolication. Cookie consent UX cost (cookies on by default; cookieless mode is a config choice). Doesn't replace CFWA's free Web Vitals beacon.
  • Cost. $0 at M1; $0-30/mo at M2 with replay sampled at ≤5%.
  • Why it's not the pick. Genuinely strong second choice; the deciding factor is that we want Web Vitals + low bundle weight, and CFWA gives both for free. PostHog is the right pivot if/when feature-flag-driven experimentation becomes important (M3+).

Alternative C: Highlight.io PAYG ($50/mo from day one)

OTel-native observability + session replay + errors + logs.

  • Where it loses. PAYG floor of $50/mo flat exceeds the entire $20/mo observability budget. Free tier (500 sessions/mo) is too thin for production. Self-host requires a CX32+ Hetzner ($13/mo) and ops time we don't have.
  • Why it's not the pick. Wrong shape for a $0–10/mo solo project.

Alternative D: Datadog ($15+/mo floor)

Best-in-class log + APM + dashboard product.

  • Where it loses. Per-host commitments, complex SKU mixing, Pro-tier minimum kicks in at ~$15/mo just for logs at our volume. Overkill for a solo evening project.
  • Why it's not the pick. Right answer at $5k/mo budget; wrong answer at $20.

Alternative E: Custom DIY (R2 + grep, Workers + Discord webhooks)

Pure DIY: Logpush to R2 ($0.10/mo storage), grep over downloaded NDJSON, Tail Worker → Discord webhook for alerts. No SaaS dependencies.

  • Where it wins. Zero subscription cost. Maximum control.
  • Where it loses. No query UI (R2 SQL is preview), no error grouping (every exception is a fresh log line), no dashboards, no symbolication, no Web Vitals collection without writing the beacon ourselves. The engineering time to backfill these capabilities is >100 hours; not a fit for evening-and-weekends.
  • Why it's not the pick. All the capabilities Sentry Free + Axiom Free + Cloudflare Web Analytics give us for $0 cost real engineering time to replicate.

Alternative F: Plausible Cloud ($9/mo Starter)

Replace CFWA with Plausible for pageview analytics + outbound-click goal tracking.

  • Where it wins. Plausible's outbound-link goal feature directly maps to the WQOC north-star metric — first-class. Lighter script (1.3 KB gz vs ~5-6 KB gz CFWA beacon).
  • Where it loses. No Web Vitals product (Plausible doesn't track LCP/INP/CLS); we'd still need CFWA or a custom RUM beacon for performance-budget.md compliance. $9/mo for what CFWA + a 30-line outbound-tracking Worker gives free.
  • Why it's not the pick. The custom outbound-click tracker is already needed for the north-star metric anyway (we want server-side counts, not client-side beacons that adblockers eat) — so Plausible's value-add over CFWA is small. Revisit if M2 wants Plausible's per-goal funnel UI without building one.

Open questions

These need MASTER input before this RFC moves to Accepted.

  1. North-star confirmation. kpis.md §1 proposes Weekly Qualified Outbound Clicks. The observability stack collects everything needed for that. If MASTER prefers a quality-first north-star (median match-rate per category, "comparable retailers shown per session", etc.), the collection picks here still hold — but the dashboard surfaces in Axiom + Workers Analytics Engine SQL get re-shaped. Confirm the metric before the dashboards land.
  2. Cost ceiling. This RFC lands at $0 incremental observability spend. If MASTER's actual ceiling is higher, say $30/mo to unlock Sentry Team + Slack alerts + a Plausible Starter for richer goal funnels, the recommendation shifts to Sentry Team + Plausible Starter alongside the Cloudflare-native primitives. Default assumption: $5–10/mo is the real ceiling (matches RFC-0001's "5-20" framing but tightened by the observed Cloudflare-native fit).
  3. Sentry Free email-alert bridge. Sentry Free is email-only for alerts. Acceptable to route Sentry email → Telegram via a simple IFTTT-style bridge or a tiny inbox-watcher Worker, or do we want to upgrade to Sentry Team ($26/mo, breaks the $20 ceiling)? Default assumption: bridge it — keep it free.
  4. Telegram-notify integration. MASTER's brain repo has a telegram-notify MCP / tooling. Should the alerting layer wire alerts directly into MASTER's personal Telegram, or keep alerts in a project-only inbox/email channel? Default: opt-in, not default. Mixing personal-life notifications and project-alert noise is a UX cost; surface as MASTER decision.
  5. Sampling on RUM beacon. Cloudflare Web Analytics defaults to 100% sampling on free; high-volume sites sometimes sample down. At M1 traffic this is moot; surface for M2+ if pageview-quota becomes a concern (currently no quota documented).
  6. Telemetry retention. Workers Logs: 7-day. Axiom: 30-day. Workers Analytics Engine: 90-day default for AE event data. For long-tail trend analysis (year-over-year WQOC), a small daily aggregator that stashes per-day rollups in Postgres covers it; in scope for the implementation ticket if MASTER wants > 90-day historical metrics.

Implementation plan

Once MASTER picks this stack, the implementation work for #43 (and the wiring that follows once #19 ships hosting) is:

  • Lock decision as ADR (next sequential number after 0005)
  • Cloudflare Pages project: enable Web Analytics on the project (dashboard click)
  • wrangler.jsonc: enable the "observability" block ("enabled": true) and set "head_sampling_rate": 1 for M1
  • Add @sentry/cloudflare + @sentry/nextjs to dependencies; instrumentation.ts per Sentry docs; withSentryConfig in next.config.ts; disable replayIntegration and browserTracingIntegration to keep client bundle ≤ 20 KB gz
  • Sentry org + project: free Developer plan; capture DSN as SENTRY_DSN env var (per env-vars.md)
  • Axiom org + dataset: free Personal plan; install Axiom Cloudflare Logpush app (paste API token); verify ingest within 1 hour
  • Workers Analytics Engine: bind OUTBOUND_CLICKS dataset in wrangler.jsonc; writeDataPoint per outbound click in /api/go/... route handler with retailer + category dimensions
  • Cloudflare Notifications: configure email alerts for (a) Worker error rate > 1% over 5 min, (b) Logpush job health, (c) Worker script deploy failure, (d) Saved-Workers-Logs-query alert: scraper-success-rate < 80% over 24h
  • (Optional, MASTER-decision) Tail Worker forwarding errors to Discord/Slack/Telegram webhook
  • Document SQL queries for the M1 north-star + secondary metrics in a runbook (docs/runbooks/observability-queries.md)
  • Update tech-stack.md § Observability to replace "Nothing instrumented today" with the live stack
  • Update architecture/deployment.md § Observability to replace the placeholder with the actual diagram
  • M2: write /admin/jobs page reading pgboss.job per RFC-0002 § What we add; fold in scraper-success-rate + recent-errors per retailer
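
As a starting point for the runbook queries in the plan above, the weekly north-star rollup might look like the sketch below. It assumes the dataset is named after the OUTBOUND_CLICKS binding and that blob1 carries the retailer id; both are illustrative choices. `SUM(_sample_interval)` is used instead of `COUNT()` so the total stays correct if Analytics Engine samples the data. The account id and API token are placeholders supplied by the caller.

```typescript
// Build the weekly outbound-click rollup per retailer.
// Assumptions: blob1 = retailer id; dataset name matches the binding name.
function weeklyClicksSql(dataset: string = "OUTBOUND_CLICKS"): string {
  if (!/^[A-Za-z_][A-Za-z0-9_]*$/.test(dataset)) {
    throw new Error(`Invalid dataset name: ${dataset}`);
  }
  return [
    "SELECT blob1 AS retailer, SUM(_sample_interval) AS clicks",
    `FROM ${dataset}`,
    "WHERE timestamp > NOW() - INTERVAL '7' DAY",
    "GROUP BY retailer",
    "ORDER BY clicks DESC",
  ].join("\n");
}

// POST the query text to the Analytics Engine SQL API.
async function runAeQuery(accountId: string, apiToken: string, sql: string): Promise<unknown> {
  const url = `https://api.cloudflare.com/client/v4/accounts/${accountId}/analytics_engine/sql`;
  const res = await fetch(url, {
    method: "POST",
    headers: { Authorization: `Bearer ${apiToken}` },
    body: sql,
  });
  if (!res.ok) throw new Error(`Analytics Engine query failed: ${res.status}`);
  return res.json();
}
```

The same builder can back both the runbook entry and a later daily aggregator that copies per-day rollups into Postgres.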

Out of scope

  • APM / distributed tracing. Workers performance.now() zeroes on CPU-bound spans; defer until platform clock changes or until non-Workers host enters scope.
  • Session replay. Add later if a specific debugging scenario demands it (PostHog or Highlight free tier covers it on demand).
  • Synthetic monitoring (Pingdom, UptimeRobot equivalents). Cloudflare's Workers Observability detects errors after the fact; synthetic checks (Cloudflare Health Checks free tier covers basic uptime) are a follow-up if needed.
  • Real distributed-tracing destinations (Honeycomb, Tempo). Same reason as APM — server-side traces are zero-signal on Workers today.
  • Per-feature analytics (funnels, retention cohorts). Deferred until product-feedback loop demands them; PostHog Free is the natural pivot if/when that need surfaces.
  • Affiliate-postback observability. Owned by #17 once that ships; uses the same Workers Analytics Engine + Workers Logs pipes this RFC sets up.
  • Security-event observability (CSP violations, rate-limit triggers, WAF events). kpis.md §2.4 health row names them as M2 metrics; the collection uses the same Workers Logs + Notifications path this RFC picks. Which alerts cross the "wake someone up" bar is owned by #44.
  • Backups + disaster recovery observability. Pulled along by RFC-0001 § Open questions; separate ADR after RFC-0001 closes.
  • Code changes in this branch. This RFC is decision-time; no package.json edits, no env-var additions, no wrangler.jsonc changes land in this commit per the foundation-ticket workflow.

Verification checklist (post-implementation)

Once the implementation ticket ships, the stack is "working" iff:

  • Workers Observability dashboard shows non-zero invocations + per-route P95 wall-clock for the deployed app
  • Cloudflare Web Analytics shows pageviews + LCP/CLS/INP per page within 24h of first deploy
  • A deliberately-thrown test exception in a route handler appears in Sentry within 5 minutes, with source-mapped stack trace
  • A click on /api/go/... produces a row in the Click table AND a data point in the Workers Analytics Engine OUTBOUND_CLICKS dataset (verified with a query against the Analytics Engine SQL API)
  • A scrape job failure (manually triggered with a forced exception) fires a Cloudflare Notification email within 10 minutes
  • Axiom dataset shows logs from Workers within 1h of the first deploy with default 30-day retention
  • All-in monthly cost reading on Cloudflare billing + Sentry usage + Axiom usage = $5/mo (the Workers Paid baseline) within first calendar month

See also