Architecture

How Firn turns cheap object storage into a fast, multi-tenant search engine.

Tiered storage

Firn's core idea is a three-tier storage hierarchy. Data lives permanently on S3, but queries are served from the fastest available tier:

| Tier | Backed by | Latency | Role |
| --- | --- | --- | --- |
| L1: RAM | foyer in-memory cache | sub-microsecond | Hottest queries, configurable size |
| L2: NVMe | foyer disk cache | microseconds | Overflow from RAM tier, larger capacity |
| L3: Object storage | LanceDB on S3/MinIO/R2 | milliseconds to seconds | Source of truth, unlimited capacity, near-zero idle cost |

The cache stores complete, serialised query result sets. When a query hits the cache, zero bytes are read from S3. When it misses, the result is fetched from S3 and stored in the cache for future use.

Query path

Every query follows this exact sequence:

  1. Hash the query. The full QueryRequest (vector, k, nprobes, text) is serialised with bincode and hashed with xxh3 to produce a deterministic QueryHash.
  2. Build the cache key. The key is a tuple of (namespace, generation, query_hash). The generation counter ensures stale results are never returned after a write.
  3. Check foyer. The HybridCache checks RAM first, then NVMe. On a hit, the serialised result is returned and a cache_hits_total metric is recorded.
  4. On miss: query S3. LanceDB runs the query against the Lance table on S3 (vector nearest-neighbour, BM25 FTS, or hybrid). The result is serialised with bincode and stored in foyer. The cache_misses_total and s3_requests_total metrics are recorded.
  5. Return the result. The query duration (cache hit or miss) is recorded in the query_duration_seconds histogram.
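
The sequence above can be sketched in a few lines. This is a minimal illustration, not Firn's implementation: a std `HashMap` and `DefaultHasher` stand in for foyer and xxh3, and the S3 fetch is a stub; all names here are hypothetical.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

type QueryHash = u64;
// (namespace, generation, query_hash) — mirrors the cache key described above.
type CacheKey = (String, u64, QueryHash);

// Step 1: hash the serialised query (DefaultHasher stands in for xxh3).
fn hash_query(query: &str) -> QueryHash {
    let mut h = DefaultHasher::new();
    query.hash(&mut h);
    h.finish()
}

// Stub for the LanceDB query against S3.
fn fetch_from_s3(query: &str) -> Vec<u8> {
    format!("results for {query}").into_bytes()
}

// Steps 2-5: build the key, check the cache, fall back to "S3" on a miss.
// Returns the result bytes and whether the lookup was a cache hit.
fn run_query(
    cache: &mut HashMap<CacheKey, Vec<u8>>,
    namespace: &str,
    generation: u64,
    query: &str,
) -> (Vec<u8>, bool) {
    let key = (namespace.to_string(), generation, hash_query(query));
    if let Some(result) = cache.get(&key) {
        return (result.clone(), true); // hit: zero bytes read from S3
    }
    let result = fetch_from_s3(query); // miss: query S3, then populate the cache
    cache.insert(key, result.clone());
    (result, false)
}

fn main() {
    let mut cache = HashMap::new();
    let (_, hit) = run_query(&mut cache, "docs", 0, "rust tutorial");
    assert!(!hit); // first query misses and populates the cache
    let (_, hit) = run_query(&mut cache, "docs", 0, "rust tutorial");
    assert!(hit); // identical query at the same generation hits
}
```

Note that the generation is part of the key: the same query at a newer generation is a miss by construction, which is exactly how the write path invalidates stale results.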

Write path

Writes follow a strict ordering to prevent stale cache state:

  1. Append to S3. Rows are appended to the namespace's Lance table via LanceDB's native append API.
  2. Invalidate the cache. Only after the S3 write succeeds, the namespace's generation counter is atomically incremented. This makes all previously cached entries for that namespace unreachable by key.
  3. Record metrics. write_duration_seconds and s3_requests_total{operation=upsert} are updated.

The "invalidate after confirmed write" ordering is critical: if the S3 write fails, the cache is left untouched with valid data. There is no window where the cache is empty and queries would storm S3.
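
The ordering can be made concrete with a small sketch. The append is a stub that can be forced to fail; the point is that the generation (and therefore the cache) is only touched after the write succeeds. Function names are illustrative, not Firn's API.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Stand-in for LanceDB's native append; on failure, no state changes.
fn append_to_s3(rows: &[&str], fail: bool) -> Result<(), String> {
    if fail {
        return Err("S3 write failed".into());
    }
    let _ = rows;
    Ok(())
}

// Invalidate only after the S3 write is confirmed. The `?` returns early
// on failure, leaving the generation — and every cached entry — intact.
fn write(generation: &AtomicU64, rows: &[&str], fail: bool) -> Result<u64, String> {
    append_to_s3(rows, fail)?;
    Ok(generation.fetch_add(1, Ordering::SeqCst) + 1)
}

fn main() {
    let generation = AtomicU64::new(0);
    // A failed write leaves the cache serving valid (old) data.
    assert!(write(&generation, &["row"], true).is_err());
    assert_eq!(generation.load(Ordering::SeqCst), 0);
    // A successful write bumps the generation, invalidating old entries.
    assert_eq!(write(&generation, &["row"], false).unwrap(), 1);
}
```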

Cache invalidation

Firn uses a generation counter strategy for cache invalidation. This was chosen after evaluating several alternatives.

How it works

A DashMap<NamespaceId, AtomicU64> tracks a generation number for each namespace. The foyer cache key includes the generation:

```rust
CacheKey {
    namespace: NamespaceId,
    generation: u64,
    query: QueryHash,
}
```

On any write to a namespace, the generation is atomically incremented. All previously cached entries (at the old generation) are now unreachable by key construction. foyer's normal LFU/LRU eviction reclaims their space over time.
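
A self-contained sketch of the mechanism, with a plain `HashMap` standing in for both DashMap and foyer (names are illustrative):

```rust
use std::collections::HashMap;

// (namespace, generation, query_hash) — mirrors Firn's CacheKey.
type CacheKey = (String, u64, u64);

struct Store {
    generations: HashMap<String, u64>, // stand-in for DashMap<NamespaceId, AtomicU64>
    cache: HashMap<CacheKey, Vec<u8>>,
}

impl Store {
    fn generation(&self, ns: &str) -> u64 {
        *self.generations.get(ns).unwrap_or(&0)
    }

    // O(1) invalidation: one counter bump makes every cached entry for
    // this namespace unreachable by key construction, however many exist.
    fn invalidate(&mut self, ns: &str) {
        *self.generations.entry(ns.to_string()).or_insert(0) += 1;
    }

    fn get(&self, ns: &str, query_hash: u64) -> Option<&Vec<u8>> {
        self.cache.get(&(ns.to_string(), self.generation(ns), query_hash))
    }

    fn put(&mut self, ns: &str, query_hash: u64, result: Vec<u8>) {
        let key = (ns.to_string(), self.generation(ns), query_hash);
        self.cache.insert(key, result);
    }
}

fn main() {
    let mut store = Store { generations: HashMap::new(), cache: HashMap::new() };
    store.put("docs", 42, b"old result".to_vec());
    assert!(store.get("docs", 42).is_some());
    store.invalidate("docs"); // a write happened
    assert!(store.get("docs", 42).is_none()); // old entry now unreachable
}
```

The stale entry is still physically in the map — nothing scans or deletes it — which is the trade-off noted below: unreachable entries occupy space until normal eviction reclaims them.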

Why this approach

| Property | Value |
| --- | --- |
| Invalidation cost | O(1) per write, regardless of cached query count |
| Auxiliary memory | One u64 per namespace (not per cache entry) |
| Race conditions | None. The generation is captured at query start, so a concurrent write cannot cause a stale result to be cached at the new generation. |
| Trade-off | Stale-generation entries remain in the cache until evicted by foyer. Under write-heavy workloads, the NVMe tier may hold unreachable entries temporarily. |

Namespace isolation

Each namespace is a fully isolated unit, with its own Lance table, its own generation counter, and its own cache key space.

Cross-namespace queries are not supported and return a 400 error.

Per-namespace vector dimensions

There is no global vector dimension setting. Each namespace determines its dimension independently.

This means a single Firn instance can serve namespaces at different dimensions simultaneously, for example 384-dim sentence embeddings alongside 1536-dim OpenAI embeddings.
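
One way to picture this property is a per-namespace dimension check. The mechanism below (dimension locked by the first upsert into a namespace) is an assumed illustration, not a description of Firn's actual code:

```rust
use std::collections::HashMap;

// Hypothetical sketch: each namespace's dimension is fixed by its first
// upsert, and later upserts must match it.
struct Dimensions {
    by_namespace: HashMap<String, usize>,
}

impl Dimensions {
    fn check(&mut self, ns: &str, vector: &[f32]) -> Result<(), String> {
        // First write to a namespace locks in its dimension.
        let dim = self.by_namespace.entry(ns.to_string()).or_insert(vector.len());
        if *dim != vector.len() {
            return Err(format!("namespace {ns} expects dim={dim}, got {}", vector.len()));
        }
        Ok(())
    }
}

fn main() {
    let mut dims = Dimensions { by_namespace: HashMap::new() };
    // One instance serves 384-dim and 1536-dim namespaces side by side.
    assert!(dims.check("sentences", &vec![0.0; 384]).is_ok());
    assert!(dims.check("openai", &vec![0.0; 1536]).is_ok());
    // Mismatched dimensions within a namespace are rejected.
    assert!(dims.check("sentences", &vec![0.0; 1536]).is_err());
}
```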

ANN index

Firn supports explicit IVF_PQ (Inverted File with Product Quantisation) index builds via the /ns/{ns}/index endpoint. The index is optional; without it, queries perform a linear scan.

When to build an index

The index is worth building once cold queries dominate: without it, every cache miss performs a linear scan of the full table on S3, so miss latency grows with namespace size. The benchmark below quantifies the effect at 100k vectors.

Impact on latency

On AWS S3 with 100k vectors at dim=1536:

|  | Without index | With IVF_PQ | Speedup |
| --- | --- | --- | --- |
| Cold query (p50) | 25.14 s | 979 ms | 25.7x |
| Warm query (p50) | 66 µs | 72 µs | (cache dominates) |

The index matters most for cold queries. Once a result is cached, the index makes no difference.

The nprobes parameter

When an IVF_PQ index exists, the nprobes query parameter controls how many IVF partitions are searched. Higher values improve recall but increase latency. The default is 20.
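
The recall/latency trade-off behind nprobes can be shown with a toy one-dimensional IVF: vectors are bucketed by nearest centroid, and a search scans only the nprobes closest partitions. This is a sketch of the concept, not LanceDB's IVF_PQ implementation.

```rust
fn dist(a: f32, b: f32) -> f32 {
    (a - b).abs()
}

// Assign each vector to the partition of its nearest centroid.
fn build_ivf(centroids: &[f32], vectors: &[f32]) -> Vec<Vec<f32>> {
    let mut partitions = vec![Vec::new(); centroids.len()];
    for &v in vectors {
        let (i, _) = centroids
            .iter()
            .enumerate()
            .min_by(|a, b| dist(*a.1, v).partial_cmp(&dist(*b.1, v)).unwrap())
            .unwrap();
        partitions[i].push(v);
    }
    partitions
}

// Scan only the `nprobes` partitions whose centroids are closest to the query.
fn search(centroids: &[f32], partitions: &[Vec<f32>], query: f32, nprobes: usize) -> Option<f32> {
    let mut order: Vec<usize> = (0..centroids.len()).collect();
    order.sort_by(|&a, &b| {
        dist(centroids[a], query).partial_cmp(&dist(centroids[b], query)).unwrap()
    });
    order
        .iter()
        .take(nprobes)
        .flat_map(|&i| partitions[i].iter().copied())
        .min_by(|a, b| dist(*a, query).partial_cmp(&dist(*b, query)).unwrap())
}

fn main() {
    let centroids = [0.0f32, 10.0, 20.0];
    let vectors = [1.0, 2.0, 9.0, 11.0, 19.0, 21.0];
    let partitions = build_ivf(&centroids, &vectors);
    // Probing only the nearest partition misses the true neighbour of 5.4 (2.0);
    // a second probe finds it: higher nprobes, higher recall, more data scanned.
    assert_eq!(search(&centroids, &partitions, 5.4, 1), Some(9.0));
    assert_eq!(search(&centroids, &partitions, 5.4, 2), Some(2.0));
}
```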

Full-text search

Each namespace's schema includes a nullable text column. When text data is present and an FTS index has been built (via /ns/{ns}/fts-index), three query modes are available: vector nearest-neighbour, BM25 full-text, and hybrid (both combined).

Compaction

Each upsert creates a new Lance data fragment on S3. After many small upserts, the namespace accumulates many small files, which increases cold query latency (more S3 GET requests per scan). Compaction merges these fragments into fewer, larger files.

Concurrency

Firn relies on LanceDB's native concurrency model, which uses S3 conditional writes (If-None-Match: *) to prevent conflicts between concurrent writers. This has been stress-tested with multiple simultaneous writers on both MinIO and AWS S3, with 100 runs each showing zero row count discrepancies.
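
The conditional-write primitive behaves like put-if-absent: a PUT with If-None-Match: * succeeds only if no object exists at that key yet, so exactly one of two racing writers wins each commit. A minimal sketch of that semantics (the manifest key naming here is illustrative, not LanceDB's exact layout):

```rust
use std::collections::HashMap;

// Stand-in for an S3 PUT with `If-None-Match: *`: succeeds only if the
// key is not yet present; otherwise the caller gets a precondition failure.
fn put_if_absent(store: &mut HashMap<String, Vec<u8>>, key: &str, body: Vec<u8>) -> bool {
    if store.contains_key(key) {
        return false; // 412 Precondition Failed: another writer committed first
    }
    store.insert(key.to_string(), body);
    true
}

fn main() {
    let mut store = HashMap::new();
    // Two writers race to commit the same table version.
    assert!(put_if_absent(&mut store, "manifest/v7", b"writer A".to_vec()));
    assert!(!put_if_absent(&mut store, "manifest/v7", b"writer B".to_vec()));
    // The loser observes the conflict and retries against the next version.
    assert!(put_if_absent(&mut store, "manifest/v8", b"writer B".to_vec()));
}
```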

Single-node design

Firn currently operates as a single-node service. The cache is in-process (not distributed). Horizontal scaling would require a shared cache layer or request routing, which is not yet implemented.

Serialisation

Cached result sets are serialised with bincode 2 (serde path). This was benchmarked against realistic payloads:

| Result set size | Round-trip p99 |
| --- | --- |
| 10 results (1536-dim) | 32 µs |
| 100 results (1536-dim) | 318 µs |
| 1000 results (1536-dim) | 3 ms |

The architecture includes a documented upgrade path to rkyv (zero-copy deserialisation) if serialisation overhead becomes a bottleneck at scale.