Monitoring

Firn exposes Prometheus metrics at GET /metrics that give full visibility into cache effectiveness and S3 cost savings.

Scrape configuration

Add Firn to your Prometheus scrape targets:

```yaml
# prometheus.yml
scrape_configs:
  - job_name: firn
    scrape_interval: 15s
    static_configs:
      - targets: ['firn:3000']
```

The endpoint returns metrics in Prometheus text exposition format (text/plain; version=0.0.4).
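A response might look like the following (values are illustrative and the HELP strings are paraphrased; the metrics themselves are documented below):

```text
# HELP firnflow_cache_hits_total Total cache hits.
# TYPE firnflow_cache_hits_total counter
firnflow_cache_hits_total{namespace="production"} 48211
# HELP firnflow_cache_misses_total Total cache misses.
# TYPE firnflow_cache_misses_total counter
firnflow_cache_misses_total{namespace="production"} 1893
# TYPE firnflow_active_namespaces gauge
firnflow_active_namespaces 3
```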

Metric reference

Cache metrics

| Metric | Type | Labels | Description |
|---|---|---|---|
| `firnflow_cache_hits_total` | Counter | `namespace` | Total cache hits. Each hit means the result was served from RAM or NVMe with zero S3 access. |
| `firnflow_cache_misses_total` | Counter | `namespace` | Total cache misses. Each miss triggered a query to S3 via LanceDB and populated the cache. |

Latency metrics

| Metric | Type | Labels | Description |
|---|---|---|---|
| `firnflow_query_duration_seconds` | Histogram | `namespace`, `query_type` | End-to-end query latency through the cache-aside path, including serialisation. The `query_type` label is `vector`, `fts`, or `hybrid`. |
| `firnflow_write_duration_seconds` | Histogram | `namespace` | Upsert or delete latency, including cache invalidation time. |
| `firnflow_index_build_duration_seconds` | Histogram | `namespace`, `kind` | Time to build a vector or FTS index. Buckets go up to 600 seconds. The `kind` label is `ivf_pq` or `fts`. |
| `firnflow_compaction_duration_seconds` | Histogram | `namespace` | Time to compact data files. Buckets go up to 600 seconds. |

Cost metrics

| Metric | Type | Labels | Description |
|---|---|---|---|
| `firnflow_s3_requests_total` | Counter | `namespace`, `operation` | Number of Firn-initiated operations that hit S3. Operations: `query` (cache miss), `upsert`, `delete`. This is the primary signal for whether the cache is saving you S3 costs. |
| `firnflow_active_namespaces` | Gauge | none | Number of distinct namespaces that have been accessed since startup. |

**The key metric:** `firnflow_s3_requests_total` is the metric that proves the cache is working. `firnflow_s3_requests_total{operation="query"}` and `firnflow_cache_misses_total` should be equal; the difference between total queries and cache misses is how many S3 requests the cache has eliminated.
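To put a rough dollar figure on those avoided requests, multiply the hit counter by your S3 GET request price. The price below is an assumption ($0.0004 per 1,000 GET requests, the published S3 Standard rate in many regions at the time of writing); substitute your region's actual pricing:

```promql
# Approximate S3 GET spend avoided since startup (USD),
# assuming $0.0004 per 1,000 GET requests -- check your region's pricing
sum(firnflow_cache_hits_total) * 0.0004 / 1000
```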

PromQL examples

Cache hit rate (per namespace)

The fraction of queries served from cache without touching S3:

```promql
firnflow_cache_hits_total{namespace="production"}
/
(firnflow_cache_hits_total{namespace="production"} + firnflow_cache_misses_total{namespace="production"})
```

Cache hit rate (global, over 5 minutes)

```promql
sum(rate(firnflow_cache_hits_total[5m]))
/
(sum(rate(firnflow_cache_hits_total[5m])) + sum(rate(firnflow_cache_misses_total[5m])))
```
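If you chart or alert on the hit rate in several places, a Prometheus recording rule avoids re-evaluating the expression everywhere. A sketch — the rule name below is a suggestion following the usual `level:metric:operations` convention, not something Firn ships:

```yaml
# recording-rules.yml (rule name is a suggestion, not part of Firn)
groups:
  - name: firn-recording
    rules:
      - record: job:firnflow_cache_hit_rate:ratio_rate5m
        expr: |
          sum(rate(firnflow_cache_hits_total[5m]))
          /
          (sum(rate(firnflow_cache_hits_total[5m]))
           + sum(rate(firnflow_cache_misses_total[5m])))
```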

Query latency p50 / p99

Aggregate the buckets by `le` before taking the quantile; otherwise `histogram_quantile` returns one series per `namespace` and `query_type` combination:

```promql
# p50
histogram_quantile(0.50, sum by (le) (rate(firnflow_query_duration_seconds_bucket[5m])))

# p99
histogram_quantile(0.99, sum by (le) (rate(firnflow_query_duration_seconds_bucket[5m])))
```
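Because the histogram carries `namespace` and `query_type` labels, you can also break the quantile down by query type by keeping that label in the bucket aggregation:

```promql
# p99 per query type (vector / fts / hybrid)
histogram_quantile(0.99,
  sum by (le, query_type) (rate(firnflow_query_duration_seconds_bucket[5m]))
)
```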

S3 request rate (per namespace)

```promql
rate(firnflow_s3_requests_total{namespace="production"}[5m])
```

S3 requests saved (total avoided queries)

```promql
sum(firnflow_cache_hits_total)
```

Each cache hit is one S3 request that did not happen.

Write throughput

```promql
rate(firnflow_s3_requests_total{operation="upsert"}[5m])
```
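One derived signal worth watching (a heuristic built from the documented metrics, not an official Firn metric): the share of S3 traffic caused by writes. If this climbs toward 1 while the cache hit rate falls, write-driven invalidation is the likely cause:

```promql
# Fraction of S3 requests that are writes rather than cache-miss queries
sum(rate(firnflow_s3_requests_total{operation=~"upsert|delete"}[5m]))
/
sum(rate(firnflow_s3_requests_total[5m]))
```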

Alerting rules

Suggested Prometheus alerting rules for production deployments:

```yaml
# alerts.yml
groups:
  - name: firn
    rules:

      # Cache hit rate dropping below 80% over 15 minutes
      - alert: FirnLowCacheHitRate
        expr: |
          sum(rate(firnflow_cache_hits_total[15m]))
          /
          (sum(rate(firnflow_cache_hits_total[15m]))
           + sum(rate(firnflow_cache_misses_total[15m])))
          < 0.80
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Firn cache hit rate is below 80%"
          description: >
            The cache hit rate has been below 80% for 10 minutes.
            This may indicate the working set exceeds cache capacity
            or a write-heavy workload is causing frequent invalidation.

      # Query latency p99 above 1 second (cold queries are slow)
      - alert: FirnHighQueryLatency
        expr: |
          histogram_quantile(0.99,
            sum by (le) (rate(firnflow_query_duration_seconds_bucket[5m]))
          ) > 1.0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Firn query latency p99 above 1 second"
          description: >
            High query latency suggests frequent cache misses
            hitting S3. Consider increasing cache size, building an
            index, or warming the cache.

      # S3 request rate spike (unexpected backend load)
      - alert: FirnHighS3RequestRate
        expr: |
          sum(rate(firnflow_s3_requests_total{operation="query"}[5m])) > 10
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Firn S3 query request rate above 10/s"
          description: >
            The cache is not absorbing enough queries. This increases
            S3 costs and latency. Check if the working set has
            changed or if write-heavy invalidation is the cause.
```

Grafana dashboard

A minimal Grafana dashboard for Firn should include these panels:

| Panel | Type | PromQL |
|---|---|---|
| Cache hit rate | Gauge | `sum(rate(firnflow_cache_hits_total[5m])) / (sum(rate(firnflow_cache_hits_total[5m])) + sum(rate(firnflow_cache_misses_total[5m])))` |
| Query latency (p50, p99) | Time series | `histogram_quantile(0.50, sum by (le) (rate(firnflow_query_duration_seconds_bucket[5m])))` (repeat with `0.99`) |
| S3 requests/sec by operation | Time series | `sum by (operation) (rate(firnflow_s3_requests_total[5m]))` |
| S3 requests saved (counter) | Stat | `sum(firnflow_cache_hits_total)` |
| Active namespaces | Stat | `firnflow_active_namespaces` |
| Write latency (p50, p99) | Time series | `histogram_quantile(0.50, sum by (le) (rate(firnflow_write_duration_seconds_bucket[5m])))` (repeat with `0.99`) |
| Cache hits vs misses | Time series (stacked) | `rate(firnflow_cache_hits_total[5m])` and `rate(firnflow_cache_misses_total[5m])` |

Interpreting the metrics

Healthy signals

Warning signals