# RFC-0007: Observability stack
- Status: Draft — needs MASTER signoff
- Author: MASTER (drafted by Claude as part of #43)
- Date: 2026-04-28
- Related: #43, #19, #18, #14, #28, #29, #44, RFC-0001, RFC-0002, `kpis.md`, `performance-budget.md`
**Decision gate.** This RFC proposes the observability stack — log shipper, error reporter, metrics layer, RUM, alerting destination, and dashboard surface. It does not lock anything in. Once MASTER picks, the choice gets captured as an ADR and #43 closes; implementation lands in a follow-up alongside #19.
## Summary
961tech needs to see what its production infrastructure is doing without paying for it. Today, tech-stack.md § Observability is "nothing instrumented." RFC-0001 picks Cloudflare Pages + Workers + Cron Triggers (BEY PoP, ~$5/mo all-in); RFC-0002 picks pg-boss on the existing Postgres (no new Redis). Those constraints plus a $5–20/mo M1 budget plus a "ship M1 now" mindset collapse the option space tightly.
**Recommendation.** Cloudflare-native primary stack — Workers Observability (logs + invocation metrics + cron history), Cloudflare Web Analytics (pageviews + Core Web Vitals RUM), Workers Analytics Engine (custom metrics, including outbound-click rollups for the north-star KPI), Cloudflare Notifications (email alerts) — augmented by Sentry Free (error grouping + release tagging the Cloudflare-native stack lacks) and Axiom Free (30-day log retention backstop via native Cloudflare Logpush integration). Total incremental observability spend at M1 = $0 above the $5/mo Workers Paid baseline already in RFC-0001. Trajectory at M2 (5× M1) stays inside all free tiers.
Defer distributed tracing and APM to M2/M3 — on Cloudflare Workers, `performance.now()` does not advance during CPU-bound work, so server-side spans read as 0 ms, making traces low-signal until the platform fixes it. Defer the Telegram-notify bridge as opt-in (MASTER decision); the default alert path is Cloudflare email.
## Motivation
Three concrete, immediate costs of not deciding:
- **No way to validate the north-star KPI.** `kpis.md` defines Weekly Qualified Outbound Clicks. Without an event sink for the existing `/api/go/r/[retailerId]/p/[listingId]` route, that number cannot be computed at all.
- **No drift detection on scrapers.** `personas.md` §5.6 makes "no real-time stock signal" a severity-3 pain across four personas. The product's whole trust play depends on freshness; without scraper-success-rate alerting, a parser break that returns 0 listings for PCAndParts goes silently to production until a user complains. `architecture/deployment.md` § Observability names this as a planned signal, but no infra exists.
- **No way to police the performance budget.** `performance-budget.md` sets P75 LCP ≤ 2.5s on Lebanese mobile. Without a RUM beacon, that's a wish, not a measurement.
Plus the foundation cost: every feature ticket from M1 forward carries an implicit "we'll add observability later" footnote. "Later" is the wrong default for a solo evening project — pick something cheap that ships M1.
## Proposal
### Recommendation: Cloudflare-native primary + Sentry Free + Axiom Free
| Concern | Tool | Cost |
|---|---|---|
| Pageviews + sessions + referrers + geo | Cloudflare Web Analytics (Pages-project auto-enabled, no script tag) | $0 |
| Core Web Vitals RUM (LCP / CLS / INP / TTFB / FCP) | Cloudflare Web Analytics RUM beacon (auto-enabled) | $0 |
| Worker invocation metrics (count, errors, P95 wall-clock, CPU) | Cloudflare Workers Observability dashboard (built-in to Workers Paid) | $0 (in $5/mo Workers Paid from RFC-0001) |
| Application logs (route handlers, scraper events, structured JSON) | Cloudflare Workers Logs (20M events/mo / 7-day retention on Workers Paid) | $0 |
| Cron Trigger execution history | Cloudflare Workers Cron Events panel (last 100 runs, status, duration) | $0 |
| Custom metrics (north-star: outbound clicks per retailer per day; secondary: build-session funnel) | Workers Analytics Engine writeDataPoint API + SQL query | $0 (currently unbilled per Cloudflare's published "you will not be billed for AE" disclaimer; the published $0.25/M-write rate stays under $1/mo at M1 even if billing flips on) |
| Long-retention log archive (30 days, queryable) | Axiom Free Personal tier via Cloudflare Logpush native app (1.5 GB/mo of 500 GB free; 30-day retention; APL query language; native dashboards) | $0 |
| Error grouping + stack-trace dedup + release tagging + Next.js + Cloudflare Workers SDKs | Sentry Free Developer tier (5k errors/mo / 5GB logs / 5M spans / 50 replays / 1 cron monitor / 1 user / 30-day lookback / email-only alerts) | $0 |
| Alerting destination (default) | Cloudflare Notifications email (free, included on Workers Paid; fires on Worker errors, script failures, Logpush job health) | $0 |
| Alerting destination (rich, optional) | Tail Worker filtering events → Discord/Slack incoming webhook (~30 lines of code; Tail Workers billed only by their own CPU time, rounding-error within Workers Paid included CPU) | $0 |
| Alerting destination (personal channel, opt-in MASTER decision) | telegram-notify MCP bridge via Tail Worker → Telegram Bot API | $0 (gated on MASTER decision, see Open question 4) |
| Distributed tracing / APM | Deferred to M2/M3 — see Trade-offs | $0 |
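The optional Tail Worker → webhook bridge in the table above could look roughly like this — a sketch, not a prescribed implementation. The `DISCORD_WEBHOOK_URL` secret name and the `formatTailAlert` helper are this RFC's own illustrative choices; the `tail` handler shape is Cloudflare's.

```typescript
// Sketch of the optional Tail Worker → webhook bridge (~30 lines).
// Assumes a DISCORD_WEBHOOK_URL secret; formatTailAlert is our own helper.

interface TailItem {
  scriptName?: string;
  outcome: string; // "ok" | "exception" | "exceededCpu" | ...
  exceptions: { name: string; message: string }[];
}

// Pure helper: keep only failed invocations, render one alert line each.
export function formatTailAlert(events: TailItem[]): string | null {
  const failed = events.filter(
    (e) => e.outcome !== "ok" || e.exceptions.length > 0,
  );
  if (failed.length === 0) return null;
  return failed
    .map(
      (e) =>
        `[alert] ${e.scriptName ?? "worker"}: ${e.outcome}` +
        e.exceptions.map((x) => `; ${x.name}: ${x.message}`).join(""),
    )
    .join("\n");
}

export default {
  async tail(events: TailItem[], env: { DISCORD_WEBHOOK_URL: string }) {
    const content = formatTailAlert(events);
    if (!content) return; // nothing alert-worthy in this batch
    await fetch(env.DISCORD_WEBHOOK_URL, {
      method: "POST",
      headers: { "content-type": "application/json" },
      body: JSON.stringify({ content }),
    });
  },
};
```

Swapping Discord for Slack or Telegram only changes the request body shape, which is why the row treats all three as one bridge.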
### What changes from today
- App moves from "no instrumentation" to: Workers Observability auto-on via an `[observability]` block in `wrangler.jsonc`, Cloudflare Web Analytics enabled on the Pages project, and the `@sentry/cloudflare` + `@sentry/nextjs` SDK packages added (the only `package.json` change this RFC implies — and not in this RFC's commit; it lands in the implementation ticket).
- Scraper jobs (per RFC-0002) emit structured logs from each pg-boss handler — picked up by Workers Logs automatically.
- Click-out route (`/api/go/...`) writes one Analytics Engine data point per click in addition to the existing `Click` row insert.
- Cloudflare Notifications wires email alerts on: (a) Worker error rate > 1%, (b) Logpush job failure, (c) Workers script deploy failure. The scraper-failure alert is implemented as a saved Workers Logs query that triggers via the same Notifications channel.
- Axiom Logpush job created via the Axiom Cloudflare app (token paste, no glue code).
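The click-out change can be sketched as follows — a hedged illustration, not the final handler. The `OUTBOUND_CLICKS` binding name and the choice of dimensions are assumptions for this RFC; the blobs/doubles/indexes shape is Analytics Engine's data-point format.

```typescript
// Sketch: one Analytics Engine data point per outbound click, written
// alongside the existing Click row insert in the /api/go/... handler.
// Binding name and dimension layout are illustrative choices.

interface ClickEvent {
  retailerId: string;
  listingId: string;
  category: string;
}

// AE data points carry string "blobs", numeric "doubles", and an
// "indexes" value used for sampling/grouping.
export function toDataPoint(click: ClickEvent) {
  return {
    blobs: [click.retailerId, click.listingId, click.category],
    doubles: [1], // one click
    indexes: [click.retailerId], // sample/group by retailer
  };
}

// In the route handler (assuming an AE binding on env):
//   env.OUTBOUND_CLICKS.writeDataPoint(toDataPoint(click));
// ...then issue the existing Click insert and the 302 redirect as today.
```

Keeping the count server-side (in the redirect handler) is deliberate: adblockers eat client-side beacons, and the north-star metric needs the honest number.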
### Behaviour
```mermaid
graph TB
    subgraph Visitor["Lebanese mobile visitor (BEY PoP)"]
        U[Visitor]
    end
    subgraph CF["Cloudflare edge"]
        Pages[Cloudflare Pages<br/>Next.js on Workers]
        WA[Web Analytics RUM beacon<br/>auto-enabled]
        WL[Workers Logs<br/>20M/mo, 7-day]
        WO[Workers Observability dashboard<br/>invocations/errors/P95]
        AE[Workers Analytics Engine<br/>custom metrics]
        CF_N[Cloudflare Notifications<br/>email]
        Cron[Cron Triggers<br/>scraper invocation]
    end
    subgraph Backstops["Free-tier augments"]
        AX[Axiom Personal<br/>30-day log archive]
        SE[Sentry Free Developer<br/>error grouping]
    end
    U -->|page load + RUM beacon| WA
    U -->|HTTP request| Pages
    Pages -->|console.log + structured| WL
    Pages -->|invocation telemetry| WO
    Pages -->|writeDataPoint per click| AE
    Pages -->|exception| SE
    Cron -->|scrape job| Pages
    WL -->|Logpush native app| AX
    WO -->|threshold breach| CF_N
    WL -->|saved query alert| CF_N
    CF_N -->|email| MASTER[MASTER inbox]
    SE -->|email| MASTER
    Backstops -.->|optional| Tail[Tail Worker → Discord/Slack/Telegram]
    Tail -.-> MASTER
```
### Dashboard surfaces
- Cloudflare dashboard — Workers & Pages → 961tech-app → Observability tab (logs + metrics + cron events). Web Analytics → 961tech-pages.dev (pageviews + Web Vitals).
- Sentry web UI — Issues view; releases tagged via `withSentryConfig` at build time. Email-only alerts on free tier.
- Axiom web UI — saved dashboards with APL queries: per-retailer scraper success rate, per-route P95 latency from logs, error-rate trend.
- No custom admin surface in 961tech itself at M1. Workers Analytics Engine SQL queries can be issued ad hoc via the `wrangler` CLI or the dashboard. A `/admin/jobs` page reading pg-boss state per RFC-0002 § What we add lands in M2 polish — not on this RFC's critical path.
## Trade-offs
| Cost | What it buys |
|---|---|
| Cloudflare Workers performance.now() zeroing on CPU-bound spans. Server-side spans show 0ms duration because Workers' clock only advances after I/O (anti-timing-attack). Affects all frameworks on Workers, including Next.js and Sentry's tracing SDK. Means distributed tracing at M1 would be low-signal. | We defer tracing without losing functional observability. Workers Observability still gives wall-clock + CPU per invocation; query timing comes from in-handler Date.now() deltas around DB calls + structured logging, not from spans. Revisit when Cloudflare fixes the platform clock or when a tracing-friendly host enters scope. |
| Sentry email-only alerts on free tier. Slack / Discord / webhook integrations are gated to Team ($26/mo). Means error notifications stack in MASTER's inbox unless we route email → Telegram via a bridge. | Free tier covers everything else needed (5k errors/mo >> M1 needs, Next.js + Workers SDKs both GA, source-mapped traces, release tagging, 30-day lookback). The email constraint is bridgeable for $0; the alternative ($26/mo for Slack) eats the entire observability budget for one tool. |
| Workers Logs 7-day retention. Anything older requires Logpush to a destination. | Axiom Free covers it (500 GB/mo / 30-day) with native one-click integration. The 7-day Workers Logs window is plenty for live debugging; Axiom is the long-tail backstop. |
| Cloudflare Analytics Engine is "currently not billed." The published rate ($0.25/M writes, $1/M reads) could activate without notice. | Even at full billing, M1 lands at < $1/mo (well inside the hard ceiling). Migration off AE if it ever becomes painful is mechanical — write the same data points to a Postgres table with a daily aggregator. |
| No session replay. PostHog (1k recordings/mo on free) and Highlight (500 sessions/mo) both ship replay; we don't. | Session replay is high-value at scale; at M1 with low traffic the DX cost (cookie consent UX, bundle weight from rrweb ~30-50 KB gz, privacy-masking config) outweighs the diagnostic gain. Add later if trip-reporting on a specific bug demands it; meanwhile screen-recording is a free MASTER + browser DevTools workflow. |
| No first-class Slack/Telegram integration on day one. Cloudflare Notifications and Sentry Free both default to email. | Tail Worker → webhook bridge is ~30 lines of code; Telegram via Bot API is straightforward. Wired in implementation ticket only if MASTER opts in (Open question 4). |
| Sentry bundle weight (~35-55 KB gz client SDK). Counts against performance-budget.md §2.2's 100 KB initial-JS budget. | Tree-shake aggressively — disabling replayIntegration (not used) and browserTracingIntegration (Workers tracing is zero-signal anyway) cuts it to ~15-20 KB gz. The SDK pays for itself the first time a real-user error has a symbolicated stack trace. |
| Lock-in risk. Workers Analytics Engine, Logpush, Web Analytics are Cloudflare-specific; Sentry is a SaaS. | RFC-0001 already places us on Cloudflare; the lock-in is shared, not new. Sentry has a self-host path (sentry/onpremise) if we ever need it; Axiom has a SaaS-only model — but at $0/mo for years to come, that's an acceptable risk. |
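The Date.now()-delta workaround named in the first trade-off row can be sketched as a small wrapper — a sketch only; the `timed` helper name and the log field names (`evt`, `op`, `ms`) are conventions this RFC invents, not a library API.

```typescript
// Sketch: wall-clock deltas around awaited I/O (DB calls), emitted as
// structured JSON so Workers Logs / Axiom can aggregate them. Date.now()
// advances across awaits on Workers, so I/O-bound deltas are meaningful
// even while CPU-bound spans read as zero.

export async function timed<T>(
  op: string,
  fn: () => Promise<T>,
): Promise<T> {
  const start = Date.now();
  try {
    return await fn();
  } finally {
    // One JSON line per operation; queryable by op name downstream.
    console.log(
      JSON.stringify({ evt: "timing", op, ms: Date.now() - start }),
    );
  }
}

// Usage in a handler (db.listing is illustrative):
//   const listings = await timed("db.listings.byRetailer", () =>
//     db.listing.findMany({ where: { retailerId } }),
//   );
```

This is deliberately not a tracing SDK: it measures only the awaited calls we care about, costs nothing on the client, and survives the Workers clock limitation.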
## Alternatives
### Alternative A: Sentry-only ($0 at M1, $26 at M2)
Use Sentry Free for everything — errors, traces (low-signal), logs (5GB/mo on free), and a single cron monitor for the scraper heartbeat.
- Where it loses. No RUM (Web Vitals captured weakly via Sentry's perf tab; not as good as CFWA). No first-class custom-metrics API (you can stash counts in tags but querying is awkward). No pageview attribution by referrer/geo. Email-only alerts on free.
- Cost. $0 at M1; jumps to $26/mo Team if Slack/Telegram alerts become non-negotiable. Otherwise stays free through M2.
- Why it's not the pick. Misses the natural fit of Cloudflare-native primitives that are already free with the chosen host.
### Alternative B: PostHog Free ($0 at M1, $0–30 at M2)
Use PostHog (cloud free tier) for everything — product analytics, error tracking, session replay, feature flags. Native Workers SDK exists with documented caveats.
- Where it wins. Most-bundled feature set; MIT-licensed exit path; product-analytics + feature flags are real wins for a future A/B-driven product.
- Where it loses. Bundle weight: full `posthog-js` is ~60 KB gz with replay enabled, ~10-15 KB gz with `posthog-js-lite`. Even the lite build is competing for bundle headroom we'd prefer to spend on Sentry's error symbolication. Cookie consent UX cost (cookies on by default; cookieless mode is a config choice). Doesn't replace CFWA's free Web Vitals beacon.
- Cost. $0 at M1; $0-30/mo at M2 with replay sampled at ≤5%.
- Why it's not the pick. Genuinely strong second choice; the deciding factor is that we want Web Vitals + low bundle weight, and CFWA gives both for free. PostHog is the right pivot if/when feature-flag-driven experimentation becomes important (M3+).
### Alternative C: Highlight.io PAYG ($50/mo from day one)
OTel-native observability + session replay + errors + logs.
- Where it loses. The PAYG floor of $50/mo flat exceeds the entire $20/mo observability budget. The free tier (500 sessions/mo) is too thin for production. Self-host requires a CX32+ Hetzner ($13/mo) and ops time we don't have.
- Why it's not the pick. Wrong shape for a $0–10/mo solo project.
### Alternative D: Datadog ($15+/mo floor)
Best-in-class log + APM + dashboard product.
- Where it loses. Per-host commitments, complex SKU mixing, and a Pro-tier minimum that kicks in at ~$15/mo just for logs at our volume. Overkill for a solo evening project.
- Why it's not the pick. Right answer at a $5k/mo budget; wrong answer at $20.
### Alternative E: Custom DIY (R2 + grep, Workers + Discord webhooks)
Pure DIY: Logpush to R2 ($0.10/mo storage), grep over downloaded NDJSON, Tail Worker → Discord webhook for alerts. No SaaS dependencies.
- Where it wins. Zero subscription cost. Maximum control.
- Where it loses. No query UI (R2 SQL is preview), no error grouping (every exception is a fresh log line), no dashboards, no symbolication, no Web Vitals collection without writing the beacon ourselves. The engineering time to backfill these capabilities is >100 hours; not a fit for evening-and-weekends.
- Why it's not the pick. Everything Sentry Free + Axiom Free + Cloudflare Web Analytics give us for $0 would cost real engineering time to replicate.
### Alternative F: Plausible Cloud ($9/mo Starter)
Replace CFWA with Plausible for pageview analytics + outbound-click goal tracking.
- Where it wins. Plausible's outbound-link goal feature directly maps to the WQOC north-star metric — first-class. Lighter script (1.3 KB gz vs ~5-6 KB gz CFWA beacon).
- Where it loses. No Web Vitals product (Plausible doesn't track LCP/INP/CLS); we'd still need CFWA or a custom RUM beacon for `performance-budget.md` compliance. $9/mo for what CFWA + a 30-line outbound-tracking Worker gives for free.
- Why it's not the pick. The custom outbound-click tracker is already needed for the north-star metric anyway (we want server-side counts, not client-side beacons that adblockers eat) — so Plausible's value-add over CFWA is small. Revisit if M2 wants Plausible's per-goal funnel UI without building one.
## Open questions
These need MASTER input before this RFC moves to Accepted.
- North-star confirmation. `kpis.md` §1 proposes Weekly Qualified Outbound Clicks. The observability stack collects everything needed for that. If MASTER prefers a quality-first north-star (median match-rate per category, "comparable retailers shown per session", etc.), the collection picks here still hold — but the dashboard surfaces in Axiom + Workers Analytics Engine SQL get re-shaped. Confirm the metric before the dashboards land.
- Cost ceiling. This RFC lands at $0 incremental observability spend. If MASTER's actual ceiling is higher — say $30/mo to unlock Sentry Team + Slack alerts + a Plausible Starter for richer goal funnels — the recommendation shifts to Sentry Team + Plausible Starter alongside the Cloudflare-native primitives. Default assumption: **$5–10/mo is the real ceiling** (matches RFC-0001's "$5–20" framing but tightened by the observed Cloudflare-native fit).
- Sentry Free email-alert bridge. Sentry Free is email-only for alerts. Is it acceptable to route Sentry email → Telegram via a simple IFTTT-style bridge or a tiny inbox-watcher Worker, or do we want to upgrade to Sentry Team ($26/mo, breaks the $20 ceiling)? Default assumption: bridge it — keep it free.
- Telegram-notify integration. MASTER's brain repo has `telegram-notify` MCP tooling. Should the alerting layer wire alerts directly into MASTER's personal Telegram, or keep alerts in a project-only inbox/email channel? Default: opt-in, not default. Mixing personal-life notifications and project-alert noise is a UX cost; surface as a MASTER decision.
- Sampling on RUM beacon. Cloudflare Web Analytics defaults to 100% sampling on free; high-volume sites sometimes sample down. At M1 traffic this is moot; surface for M2+ if pageview quota becomes a concern (currently no quota documented).
- Telemetry retention. Workers Logs: 7 days. Axiom: 30 days. Workers Analytics Engine: 90 days default for AE event data. For long-tail trend analysis (year-over-year WQOC), a small daily aggregator that stashes per-day rollups in Postgres covers it; in scope for the implementation ticket if MASTER wants > 90-day historical metrics.
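The daily aggregator in the retention question could be as small as the rollup step below — a sketch under assumptions: the raw event shape and the destination table name are invented for illustration, and the Cron/pg-boss wiring belongs to the implementation ticket.

```typescript
// Sketch of the >90-day retention aggregator: collapse raw click events
// into per-day, per-retailer counts destined for a small Postgres table.
// Event shape and destination table are assumptions for illustration.

interface RawClick {
  retailerId: string;
  ts: string; // ISO timestamp, e.g. "2026-04-01T10:00:00Z"
}

export function rollupDaily(
  clicks: RawClick[],
): { day: string; retailerId: string; clicks: number }[] {
  const counts = new Map<string, number>();
  for (const c of clicks) {
    const day = c.ts.slice(0, 10); // YYYY-MM-DD
    const key = `${day}|${c.retailerId}`;
    counts.set(key, (counts.get(key) ?? 0) + 1);
  }
  return [...counts.entries()].map(([key, clicks]) => {
    const [day, retailerId] = key.split("|");
    return { day, retailerId, clicks };
  });
}

// A daily job would then upsert each row into e.g. a click_rollup_daily
// Postgres table, outliving Workers Logs' 7-day and AE's 90-day windows.
```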
## Implementation plan
Once MASTER picks this stack, the implementation work for #43 (and the wiring that follows after #19 ships hosting):
- Lock decision as ADR (next sequential number after 0005)
- Cloudflare Pages project: enable Web Analytics on the project (dashboard click)
- `wrangler.jsonc`: enable the `[observability]` block; set `head_sampling_rate = 1.0` for M1
- Add `@sentry/cloudflare` + `@sentry/nextjs` to dependencies; `instrumentation.ts` per Sentry docs; `withSentryConfig` in `next.config.ts`; disable `replayIntegration` and `browserTracingIntegration` to keep the client bundle ≤ 20 KB gz
- Sentry org + project: free Developer plan; capture DSN as `SENTRY_DSN` env var (per `env-vars.md`)
- Axiom org + dataset: free Personal plan; install the Axiom Cloudflare Logpush app (paste API token); verify ingest within 1 hour
- Workers Analytics Engine: bind the `OUTBOUND_CLICKS` dataset in `wrangler.jsonc`; `writeDataPoint` per outbound click in the `/api/go/...` route handler with retailer + category dimensions
- Cloudflare Notifications: configure email alerts for (a) Worker error rate > 1% over 5 min, (b) Logpush job health, (c) Worker script deploy failure, (d) saved-Workers-Logs-query alert: scraper success rate < 80% over 24h
- (Optional, MASTER decision) Tail Worker forwarding errors to a Discord/Slack/Telegram webhook
- Document SQL queries for the M1 north-star + secondary metrics in a runbook (`docs/runbooks/observability-queries.md`)
- Update `tech-stack.md` § Observability to replace "Nothing instrumented today" with the live stack
- Update `architecture/deployment.md` § Observability to replace the placeholder with the actual diagram
- M2: write the `/admin/jobs` page reading `pgboss.job` per RFC-0002 § What we add; fold in scraper success rate + recent errors per retailer
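The two `wrangler.jsonc` steps above amount to roughly this fragment — a sketch only; the dataset/binding names are this RFC's examples, and exact field names should be verified against the current wrangler configuration docs before the implementation ticket lands.

```jsonc
{
  // Workers Observability: logs + invocation metrics, full sampling at M1.
  "observability": {
    "enabled": true,
    "head_sampling_rate": 1.0
  },
  // Analytics Engine binding for the click-out route's writeDataPoint calls.
  "analytics_engine_datasets": [
    { "binding": "OUTBOUND_CLICKS", "dataset": "outbound_clicks" }
  ]
}
```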
## Out of scope
- APM / distributed tracing. Workers `performance.now()` zeroes on CPU-bound spans; defer until the platform clock changes or until a non-Workers host enters scope.
- Session replay. Add later if a specific debugging scenario demands it (PostHog or Highlight free tier covers it on demand).
- Synthetic monitoring (Pingdom, UptimeRobot equivalents). Cloudflare's Workers Observability detects errors after the fact; synthetic checks (Cloudflare Health Checks free tier covers basic uptime) are a follow-up if needed.
- Real distributed-tracing destinations (Honeycomb, Tempo). Same reason as APM — server-side traces are zero-signal on Workers today.
- Per-feature analytics (funnels, retention cohorts). Deferred until the product-feedback loop demands them; PostHog Free is the natural pivot if/when that need surfaces.
- Affiliate-postback observability. Owned by #17 once that ships; uses the same Workers Analytics Engine + Workers Logs pipes this RFC sets up.
- Security-event observability (CSP violations, rate-limit triggers, WAF events). `kpis.md` §2.4's health row names them as M2 metrics; the collection uses the same Workers Logs + Notifications path this RFC picks. Which alerts cross the "wake someone up" bar is owned by #44.
- Backups + disaster-recovery observability. Pulled along by RFC-0001 § Open questions; separate ADR after RFC-0001 closes.
- Code changes in this branch. This RFC is decision-time; no `package.json` edits, no env-var additions, no `wrangler.jsonc` changes land in this commit per the foundation-ticket workflow.
## Verification checklist (post-implementation)
Once the implementation ticket ships, the stack is "working" iff:
- Workers Observability dashboard shows non-zero invocations + per-route P95 wall-clock for the deployed app
- Cloudflare Web Analytics shows pageviews + LCP/CLS/INP per page within 24h of first deploy
- A deliberately-thrown test exception in a route handler appears in Sentry within 5 minutes, with source-mapped stack trace
- A click on `/api/go/...` produces a row in the `Click` table AND a data point in the Workers Analytics Engine `OUTBOUND_CLICKS` dataset (verified via `wrangler analytics-engine sql`)
- A scrape-job failure (manually triggered with a forced exception) fires a Cloudflare Notifications email within 10 minutes
- Axiom dataset shows logs from Workers within 1h of the first deploy with default 30-day retention
- All-in monthly cost reading on Cloudflare billing + Sentry usage + Axiom usage = $5/mo (the Workers Paid baseline) within first calendar month
## See also
- `kpis.md` — what this stack collects
- `performance-budget.md` — what some of these signals police
- RFC-0001 — Hosting target — the infra constraint
- RFC-0002 — Background jobs — the scraper job source for many of these signals
- `personas.md` §5.2 device + connection — the rationale for prioritising RUM Web Vitals
- `competitive-landscape.md` §3.7 craft — the "out-craft the genre" mandate this stack enables by giving us real RUM data to enforce a craft bar against
- `tech-stack.md` § Observability — the current state ("nothing instrumented") this RFC replaces
- `architecture/deployment.md` § Observability — the placeholder diagram this RFC replaces