RFC-0003: Search backend for product browse and build picker

Decision recorded. Postgres FTS + pg_trgm + unaccent + curated synonym map, behind src/lib/search.ts. See ADR-0008 for the locked decision; this RFC remains as the comparative analysis that produced it.

Summary

961tech browse and the build picker need search — typo-tolerant, faceted, and able to handle Arabic + English queries with common transliterations ("كيس" / "case", "معالج" / "CPU"). Recommendation: Postgres FTS + pg_trgm + unaccent + a hand-curated synonym/transliteration map, with the implementation hidden behind a src/lib/search.ts interface so the migration to Meilisearch (the most likely upgrade path) is bounded. The migration trigger is one of: catalog crosses ~25k listings, instant-search-as-you-type becomes a real product requirement, or Arabic-query CTR data shows the trigram fallback is failing real users.

Motivation

Today, browse is a category-filtered list rendered server-side with Prisma findMany (see Architecture → Overview § Routes). There is no search. #39 is the search ticket and is currently unscoped.

Search has to be designed before it's built because retrofitting is expensive: schema changes (extra tsvector columns or index sync pipelines), query patterns, the layer that owns query parsing, and — most importantly — the synonym/transliteration map are all decisions that a "just bolt search on later" approach gets wrong. The Arabic dimension makes that especially true.

The Lebanese market context comes from Reference → Personas and Reference → Competitive landscape: users land via Google with Arabic SERP queries, brand-name typos are common ("ryzen" / "رايزن"), and the AIB-vs-reference GPU naming problem ("ASUS ROG STRIX RTX 4070" → "RTX 4070") needs to be addressed at the matching layer, not the search layer.

Proposal

Three options were researched (Postgres FTS + pg_trgm + unaccent; self-hosted Meilisearch; Typesense). The proposal is option 1, with a portability harness that makes option 2 cheap to migrate to later.

Recommendation: Postgres FTS + pg_trgm + unaccent + synonym map

Schema additions.

  • Add a generated tsvector column on Product covering brand and model, weighted via setweight() so title-level fields outrank body text, with the dictionary configured as simple (so we don't lose Arabic — the arabic snowball stemmer is too lossy) plus the normalization step described below.
  • Add a generated tsvector column on Listing covering titleRaw.
  • GIN index on each tsvector.
  • GIN trigram index (pg_trgm) on Product.brand || ' ' || Product.model and Listing.titleRaw for typo / partial-match fallback.
  • unaccent extension enabled (handles Latin diacritics, plus we'll wrap it for Arabic — see below).
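
As a sketch, the migration could look roughly like the following. Column and index names are illustrative, not final; note that unaccent() is only marked stable (not immutable) in Postgres, so it can't appear in a generated column, which is why Latin de-accenting happens at query time instead:

```sql
-- Illustrative migration sketch; real names land in the Prisma migration.
CREATE EXTENSION IF NOT EXISTS pg_trgm;
CREATE EXTENSION IF NOT EXISTS unaccent;

ALTER TABLE "Product" ADD COLUMN "searchTsv" tsvector
  GENERATED ALWAYS AS (
    setweight(to_tsvector('simple', coalesce("brand", '') || ' ' || coalesce("model", '')), 'A')
  ) STORED;

ALTER TABLE "Listing" ADD COLUMN "searchTsv" tsvector
  GENERATED ALWAYS AS (to_tsvector('simple', coalesce("titleRaw", ''))) STORED;

CREATE INDEX "Product_searchTsv_idx" ON "Product" USING GIN ("searchTsv");
CREATE INDEX "Listing_searchTsv_idx" ON "Listing" USING GIN ("searchTsv");
CREATE INDEX "Product_trgm_idx"  ON "Product" USING GIN (("brand" || ' ' || "model") gin_trgm_ops);
CREATE INDEX "Listing_trgm_idx"  ON "Listing" USING GIN ("titleRaw" gin_trgm_ops);
```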

Arabic + English normalization.

Postgres' built-in arabic stemmer is too coarse and unaccent is Latin-only. Both languages get a normalization pass in app code before insert / before query:

  1. Strip Arabic tashkeel (diacritics).
  2. Normalize alef forms: ا/أ/إ/آ → ا.
  3. Normalize ya: ي/ى → ي.
  4. Normalize ta marbuta: ة → ه.
  5. Lowercase Latin.
  6. Strip Latin diacritics via unaccent() at query time.

This lives in src/lib/search.ts so it's symmetric (same normalization for indexed text and query text).
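
A minimal sketch of that pass in TypeScript (the function name and exact diacritic range are illustrative; step 6 runs inside Postgres via unaccent() at query time, so it is not replicated here):

```typescript
// Shared normalization pass (steps 1–5 above), applied to both indexed
// text and query text so the two sides stay symmetric.
const TASHKEEL = /[\u064B-\u065F\u0670]/g; // Arabic diacritics + superscript alef

export function normalizeForSearch(text: string): string {
  return text
    .replace(TASHKEEL, '')   // 1. strip tashkeel
    .replace(/[أإآ]/g, 'ا')  // 2. unify alef forms
    .replace(/ى/g, 'ي')      // 3. unify ya
    .replace(/ة/g, 'ه')      // 4. ta marbuta → ha
    .toLowerCase()           // 5. lowercase Latin
    .trim();
}
```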

Synonym / transliteration map.

A plain JSON file at src/lib/search-synonyms.json mapping:

{
  "كيس": ["case", "كيسه"],
  "معالج": ["cpu", "processor"],
  "كرت شاشة": ["gpu", "graphics card"],
  "ذاكرة": ["ram", "memory"],
  "مزود": ["psu", "power supply"]
}

Query expansion happens at query time — for each token, look up synonyms, OR them into the tsquery. The map is the actual product — it's portable across all three backend choices and lives in the repo.
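
A sketch of that expansion (names illustrative; multi-word keys like "كرت شاشة" would need phrase matching before tokenization, which this single-token sketch omits):

```typescript
// Query-time synonym expansion into a tsquery string.
// The map shape matches src/lib/search-synonyms.json.
type SynonymMap = Record<string, string[]>;

export function expandToken(token: string, synonyms: SynonymMap): string[] {
  return [token, ...(synonyms[token] ?? [])];
}

// Each token becomes an OR-group of itself plus its synonyms;
// the groups are then ANDed together.
export function toTsquery(tokens: string[], synonyms: SynonymMap): string {
  return tokens
    .map((t) => {
      const group = expandToken(t, synonyms);
      return group.length > 1 ? `(${group.join(' | ')})` : group[0];
    })
    .join(' & ');
}
```

For example, `toTsquery(['معالج', 'ryzen'], map)` would yield `(معالج | cpu | processor) & ryzen`.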

Faceting.

Standard GROUP BY ... FILTER (WHERE ...) queries against the same indexes. Categories, brands, retailers, in-stock, price-range — all cheap at our scale.
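
One sketch of the query shape, assuming a generated tsvector column here called "searchTsv" and an "inStock" flag (both names illustrative):

```sql
-- Illustrative facet query: brand counts plus an in-stock breakdown in one pass.
SELECT
  "brand",
  count(*)                          AS total,
  count(*) FILTER (WHERE "inStock") AS in_stock
FROM "Product"
WHERE "searchTsv" @@ to_tsquery('simple', 'rtx & 4070')
GROUP BY "brand"
ORDER BY total DESC;
```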

Autocomplete.

pg_trgm GIN + ILIKE 'prefix%', single-digit ms at our scale.
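
A sketch of how suggest() might build that query, with LIKE-wildcard escaping so a user-typed % or _ doesn't act as a wildcard (function names and the exact SQL are illustrative):

```typescript
// Escape LIKE metacharacters in a user-typed prefix before using it
// in ILIKE 'prefix%', which the trigram GIN index can serve.
export function escapeLikePrefix(prefix: string): string {
  return prefix.replace(/[\\%_]/g, (ch) => '\\' + ch);
}

// Parameterized suggest query: prefix filter + pg_trgm similarity ranking.
export function buildSuggestSql(prefix: string): { sql: string; params: string[] } {
  return {
    sql: `SELECT "titleRaw" FROM "Listing"
          WHERE "titleRaw" ILIKE $1
          ORDER BY similarity("titleRaw", $2) DESC
          LIMIT 10`,
    params: [escapeLikePrefix(prefix) + '%', prefix],
  };
}
```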

Public interface — the portability harness

Every search call in src/ goes through one module:

// src/lib/search.ts

export interface SearchHits {
  hits: Array<{ kind: 'product' | 'listing'; id: string; score: number; highlight?: string }>;
  facets: Record<string, Array<{ value: string; count: number }>>;
  total: number;
  tookMs: number;
}

export async function search(
  query: string,
  filters: { category?: Category; brand?: string; retailerId?: string; inStock?: boolean; priceUsdMin?: number; priceUsdMax?: number },
  facets: Array<'category' | 'brand' | 'retailer' | 'inStock'>,
): Promise<SearchHits> { ... }

export async function suggest(prefix: string, limit?: number): Promise<Array<{ value: string; kind: 'product' | 'listing' }>> { ... }

Implementation detail (Postgres queries, normalization, synonym expansion) lives behind this interface. The migration to Meilisearch later is "swap the implementation, keep the signature."

What this proposal does NOT do

  • Does not introduce ML reranking, vector search, or LLM-assisted query understanding. All three are deferrable to specific future RFCs once we have query data.
  • Does not solve AIB-vs-reference GPU naming — that's a matching problem (src/lib/matching.ts), not a search problem.

Trade-offs

Cost | What it buys
---- | ----
Arabic morphology is shallow. Postgres has no proper Arabic root-form unification ("كاتب / كتاب / مكتوب" don't unify); the same caveat applies, less severely, to Meili and Typesense. | The synonym map covers the practical pain points (transliterations, common variants); deep morphology isn't a real M1 user need.
Trigram fallback latency starts to feel laggy around 10k rows. | At our M1 scale (~2k listings, ~5–20k within a year), trigram is fine.
Sync is implicit (no separate pipeline). | Indexes update inside the same Postgres transaction as the scrape upsert: zero "out-of-sync" failure mode, no backup story for a separate index, no second container.
Faceting via GROUP BY ... FILTER gets verbose at high cardinality. | Cheap at our cardinality; worth a lib/search.ts helper to normalize.
No instant-as-you-type at scale: sub-100ms typo-tolerant autocomplete on 50k+ items makes Postgres a fight. | At M1 we're not promising that UX; the migration trigger handles this case.
Synonym map is hand-curated: someone has to maintain src/lib/search-synonyms.json. | The map is the actual value — it lives in the repo, it's reviewable, and it's portable to any backend choice.

Alternatives

Meilisearch (self-hosted Docker)

Purpose-built fuzzy search engine. Best-in-class typo tolerance (configurable typo budget per word length, prefix search, sub-50ms p95). Arabic supported via the charabia tokenizer (Arabic segmentation, alef/ya normalization, tashkeel stripping) since v1.3+ — better than raw Postgres but still no deep morphology. Faceting first-class. Self-hosted: one container, ~200MB–1GB RAM. Meili Cloud entry tier ~$30/mo.

  • Where it wins: above ~10–20k documents, the moment instant-search-as-you-type becomes a product requirement, or when Arabic handling becomes a user complaint.
  • Where it loses for us today: extra container (and its own backup story, sync pipeline, failure mode) for a problem we don't have yet. Sync is eventual — not transactional with Postgres.
  • Migration cost from FTS: bounded if we hide search behind lib/search.ts. Mostly a re-implementation of the module's body + a sync handler in the scrape pipeline.

Typesense

Comparable to Meilisearch — typo tolerance, Arabic tokenizer (Unicode segmentation, normalization, similar caveats), faceting, curation rules (pinned/hidden results per query — Meili needs more work for this), vector search. Self-hosted ~300MB–1GB RAM; Typesense Cloud entry tier $19/mo.

  • Where it wins over Meili (marginally): richer faceting (range facets out-of-box), curation rules, slightly more mature multi-node story.
  • Where it loses: same as Meili for our workload — extra infra + sync pipeline. Marginal feature differences don't pay for the operational delta over Postgres FTS today.

Open questions

  1. Is Arabic-query SERP a hard requirement at M1? Reference → Personas and Reference → Competitive landscape suggest yes — but is it required to launch or required by M2? If launch-required, the synonym map needs investment up front; if M2-required, ship with English-first FTS and accept thin Arabic coverage at first.
  2. Who owns the synonym map? Recommend MASTER. The map is small (10s to low-100s of entries) and reflects real user vocabulary that an automated extractor won't get right. It can be expanded over time from search-log analysis.
  3. Migration trigger thresholds. Above ~25k listings, or when a specific UX requirement (instant typo-tolerant autocomplete) is committed to. Worth writing the trigger into the future ADR so we don't drift past it without noticing.
  4. How is search latency observed? Decided once observability lands (RFC-0001). For now, log tookMs in lib/search.ts and surface in the same place we surface other slow-query telemetry.
  5. Do we want unified search across products + listings, or two separate indexes / endpoints? The interface above models them as a single result set; this is fine for product-card-style results, harder for power-user queries that want listing-level prices. Default to unified; revisit if M2 introduces an "advanced search" UI.

Implementation plan

Once MASTER picks Postgres FTS, the implementation work for #39 is roughly:

  • Lock the decision as an ADR (recorded as ADR-0008; see the note at the top)
  • Migration: enable pg_trgm + unaccent extensions; add tsvector columns + GIN indexes on Product and Listing
  • Create src/lib/search.ts with the interface above; implement search() and suggest() against Postgres
  • Create src/lib/search-synonyms.json with the initial transliteration / synonym map (MASTER-curated)
  • Wire normalization into the scrape pipeline so tsvector columns stay current on upsert
  • Surface search in the UI: (a) global search box in TopNav, (b) the /build/choose/[category] picker page, (c) /products?q=... query
  • Tests: golden-path queries (English, Arabic, transliterated), facet correctness, prefix autocomplete latency
  • Architecture page update: add a docs/architecture/search.md page narrating the indexing + query path

Out of scope

  • Vector / embedding search. Future RFC. Worth waiting for query data to know whether we even need it.
  • LLM-assisted query understanding (e.g., "best AM5 build under $2000" → filter set). Future RFC.
  • Search-result personalization / ranking models. Way out of scope at M1 traffic.
  • Search log / analytics pipeline. Pulled along by the observability decision. For now, server logs of tookMs and the q= query string are the input to "is this working" intuition.
  • Spell correction / "did you mean" suggestions. Trigram fallback gives us approximate-match scoring already; explicit "did you mean" UI is a future polish.
  • Indexing build guides / blog posts. When #19 eventually adds editorial guides (Architecture → Overview § Routes), they get their own search story — out of scope here.