Skip to content

ADR-0008: Search backend is Postgres FTS + pg_trgm + unaccent + curated synonym/transliteration map

Context

RFC-0003 compared three search backends for product browse + the build picker: Postgres FTS (pg_trgm + unaccent), self-hosted Meilisearch, and Typesense.

961tech needs typo-tolerant, faceted search that handles Arabic + English queries despite the UI being English-only (ADR-0004). Lebanese SERP traffic arrives partly in Arabic ("كرت شاشة" / "GPU"), and brand-name typos / transliterations are common. Catalog scale: ~2k listings at M1, ~5–20k expected within a year.

Decision

Use Postgres full-text search with pg_trgm (trigram fuzzy fallback), unaccent (Latin diacritics), and a hand-curated JSON synonym/transliteration map, all hidden behind a src/lib/search.ts interface so a future Meilisearch migration is bounded.

Schema additions (when #39 implements)

  • Generated tsvector column on Product covering brand, model, with weighted setweight() for title vs body. Dictionary configured as simple (the arabic snowball stemmer is too lossy).
  • Generated tsvector column on Listing covering titleRaw.
  • GIN index on each tsvector.
  • GIN trigram index (pg_trgm) on Product.brand || ' ' || Product.model and Listing.titleRaw for typo / partial-match fallback.
  • unaccent extension enabled.

Arabic + English normalization (in app code, applied symmetrically at index time and query time)

  1. Strip Arabic tashkeel (diacritics)
  2. Normalize alef forms: ا/أ/إ/آ → ا
  3. Normalize ya: ي/ى → ي
  4. Normalize ta marbuta: ة → ه
  5. Lowercase Latin
  6. Strip Latin diacritics via unaccent() at query time

Lives in src/lib/search.ts.

Synonym / transliteration map

A plain JSON file at src/lib/search-synonyms.json mapping Arabic ↔ English: كيس ↔ case, معالج ↔ cpu, كرت شاشة ↔ gpu, ذاكرة ↔ ram, مزود ↔ psu, etc. Query-time expansion ORs synonyms into the tsquery. The map is curated, lives in the repo, is portable to any future backend.

Public interface (src/lib/search.ts)

Exposed signatures: search(query, filters, facets) → SearchHits, suggest(prefix, limit) → string[]. All call sites use this; the implementation is swappable.

Consequences

Positive

  • Sync is implicit. Indexes update inside the same Postgres transaction as the scrape upsert. No separate pipeline; no out-of-sync failure mode; no second backup story.
  • No new infrastructure. Same managed Postgres (ADR-0006). One less container, one less vendor-cost tier.
  • Synonym map is the actual product — portable, reviewable, expandable from search-log analysis (when #43 observability lands).
  • Faceting via GROUP BY ... FILTER is cheap at our cardinality.
  • Right-sized for M1/M2. ~2k listings now → ~5–20k within a year. Postgres FTS comfortably handles that.

Negative

  • Arabic morphology is shallow. No root-form unification (كاتب / كتاب / مكتوب don't unify). Same caveat applies to Meili and Typesense, less severely.
  • Trigram fallback latency starts to feel laggy ~10k rows. We're not there yet.
  • No instant-search-as-you-type at scale. If we want sub-100ms typo-tolerant autocomplete on 50k+ items, Postgres becomes a fight. Migration trigger documented below.
  • Faceting at high cardinality gets verbose. Wrap in a lib/search.ts helper.
  • Synonym map is hand-curated. Someone maintains src/lib/search-synonyms.json. Curated by MASTER; expanded over time from search-log analysis.

Neutral

  • The portability harness costs nothing at runtime. The interface is just TypeScript types + a single import path. Migration to Meili later is "swap the implementation, keep the signature."

Migration triggers — pre-recorded

Switch to Meilisearch (likely) when any fires:

  1. Catalog crosses ~25k listings.
  2. Instant-search-as-you-type becomes a real product requirement.
  3. Arabic-query CTR data shows trigram fallback failing real users (i.e. queries with no English-equivalent in the synonym map are losing buyers — observed via #43 telemetry).

Documenting in advance so we don't drift past the threshold without noticing.

Alternatives considered

Meilisearch (self-hosted Docker) — rejected for now

Best-in-class fuzzy search; Arabic via charabia tokenizer (better than raw Postgres). Loses today on: extra container + sync pipeline + backup story. Sync becomes eventual rather than transactional with Postgres.

Typesense — rejected for now

Comparable to Meilisearch. Marginally better for facet ranges + curation rules. Loses for the same reasons as Meili: extra infra delta vs Postgres FTS. The marginal feature differences don't pay for the operational cost at our M1 scale.

Decisions on RFC-0003 open questions

  • Arabic-query SERP at launch: ship light — bootstrap the synonym map with ~30 high-value entries (top Arabic equivalents for each component category) for M1; expand over time from search-log analysis once #43 lands. Don't gate launch on a comprehensive map.
  • Synonym map ownership: MASTER curates. Initial bootstrap can be AI-assisted with MASTER review; ongoing expansion driven by real search logs.
  • Migration trigger thresholds: ~25k listings + instant-search-UX requirement OR Arabic CTR failure (see above). Recorded.
  • Search latency observability: decided when ADR-0006 / #43 observability lands. For now, log tookMs in lib/search.ts and surface in slow-query telemetry.
  • Unified vs separate (products + listings): unified result set in M1 (single API, simpler UX). Revisit if M2 introduces an "advanced search" UI.

References