Observability (Prometheus)

The collector serves its metrics in the Prometheus text exposition format at GET /metrics. Operators scrape it from inside the same private network: there is no auth, and the endpoint bypasses the rate limiter so that it stays reachable when limits trip.

GET /metrics
Content-Type: text/plain; version=0.0.4; charset=utf-8

A 200 response carries the exposition. A 503 with an empty body means the recorder is not initialised, which happens in test fixtures that build state without going through main.rs. Production should never see 503; alert on it as a deployment bug.
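The status handling above can be sketched as a small out-of-band probe. This is a minimal illustration, not part of the collector: `classify_status` and `check_metrics` are hypothetical names, and the base URL passed in is whatever private address your deployment uses.

```python
import urllib.error
import urllib.request


def classify_status(status: int) -> str:
    """Map a /metrics HTTP status to an operator-facing verdict."""
    if status == 200:
        return "ok"
    if status == 503:
        # Recorder not initialised: a deployment bug if seen in production.
        return "recorder_not_initialised"
    return "unexpected"


def check_metrics(base_url: str, timeout: float = 5.0) -> str:
    """Fetch GET /metrics and classify the response status.

    Connection failures (urllib.error.URLError) are left to propagate.
    """
    try:
        with urllib.request.urlopen(f"{base_url}/metrics", timeout=timeout) as resp:
            return classify_status(resp.status)
    except urllib.error.HTTPError as err:
        # urllib raises for non-2xx responses; the status is on the exception.
        return classify_status(err.code)
```

A probe like this complements the Prometheus scrape rather than replacing it: a failed scrape already shows up in Prometheus, but an explicit check makes the 503 case easy to tell apart from a network failure.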

Labels on Prometheus metrics are bounded to closed-set categoricals:

| Label | Allowed values |
| --- | --- |
| endpoint | events, s2s |
| reason | accepted, accepted_consent, bot, internal, consent_required, rate_limit, quota, schema_invalid |
| method | POST, GET |

There are no per-tenant, per-user, per-anon, or per-site labels. A workspace_id label would push cardinality into the millions; the platform refuses it on principle. Per-tenant breakdowns go through the audit log and analytics queries instead.
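One way to check the closed-set guarantee from the outside is to scan a scraped exposition for label values outside the table above. The sketch below is an illustration under assumptions: `unexpected_labels` is a hypothetical helper, the parsing covers only simple `name{labels} value` sample lines, and the `le` label is skipped because the exposition format adds it to histogram bucket series.

```python
import re

# Closed label sets copied from the table above.
ALLOWED = {
    "endpoint": {"events", "s2s"},
    "reason": {
        "accepted", "accepted_consent", "bot", "internal",
        "consent_required", "rate_limit", "quota", "schema_invalid",
    },
    "method": {"POST", "GET"},
}

# Matches one label pair such as endpoint="events" inside the braces.
LABEL_PAIR = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def unexpected_labels(exposition: str) -> list[tuple[str, str, str]]:
    """Return (metric, label, value) for every label outside the closed sets."""
    problems = []
    for line in exposition.splitlines():
        line = line.strip()
        # Skip blanks, HELP/TYPE comments, and samples without labels.
        if not line or line.startswith("#") or "{" not in line:
            continue
        metric, _, rest = line.partition("{")
        labels = rest.rpartition("}")[0]
        for name, value in LABEL_PAIR.findall(labels):
            if name == "le":
                continue  # histogram bucket bound, not an application label
            if value not in ALLOWED.get(name, set()):
                problems.append((metric, name, value))
    return problems
```

Run against a real scrape, an empty result is the expected outcome; any entry, such as a `workspace_id` label, would indicate a build that breaks the cardinality rule.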

Counters:

| Metric | Labels | Meaning |
| --- | --- | --- |
| syntarie_events_received_total | endpoint | Every request, including rejects. |
| syntarie_events_accepted_total | endpoint | Events that reached storage. |
| syntarie_events_dropped_total | endpoint, reason | Events dropped at any gate. |
| syntarie_events_quarantined_total | endpoint | Events sent to the DLQ. |
| syntarie_events_deduped_total | endpoint | Events suppressed by the dedup window. |

Histograms:

| Metric | Labels | Buckets |
| --- | --- | --- |
| syntarie_ingest_duration_seconds | endpoint | 1 ms / 2 ms / 5 ms / 10 ms / 25 ms / 50 ms / 100 ms |
| syntarie_validation_duration_seconds | endpoint | 100 µs / 500 µs / 1 ms / 5 ms |
| syntarie_storage_write_duration_seconds | endpoint | 1 ms / 5 ms / 10 ms / 50 ms / 100 ms |

Gauges:

| Metric | Labels | Meaning |
| --- | --- | --- |
| syntarie_storage_pool_open_connections | (none) | Open Postgres connections. |
| syntarie_storage_pool_idle_connections | (none) | Idle Postgres connections. |
| syntarie_dlq_size | (none) | DLQ row count, sampled every 60 s. |

Two metrics every operator should alert on

1. p99 ingest latency past the 5 ms budget.

histogram_quantile(
  0.99,
  sum(rate(syntarie_ingest_duration_seconds_bucket[5m])) by (le)
) > 0.005

When this fires, the cause is almost always either a slow Postgres write or a saturated CPU. Check syntarie_storage_write_duration_seconds first.
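To make the p99 query less opaque, here is a simplified sketch of how a quantile is estimated from cumulative bucket counts by linear interpolation, the same general approach `histogram_quantile` takes. The function below is an illustration, not Prometheus's implementation, and the sample counts in the usage note are made up; only the bucket bounds come from the histogram table.

```python
import math

# Upper bounds (seconds) of syntarie_ingest_duration_seconds from the
# histogram table, plus the implicit +Inf bucket.
INGEST_BOUNDS = [0.001, 0.002, 0.005, 0.01, 0.025, 0.05, 0.1, math.inf]


def histogram_quantile(q: float, bounds: list[float], cumulative: list[float]) -> float:
    """Estimate quantile q (0 < q <= 1) from cumulative bucket counts."""
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = cumulative[-1]
    if total == 0:
        return math.nan
    rank = q * total
    # Find the first bucket whose cumulative count reaches the rank.
    for i, count in enumerate(cumulative):
        if count >= rank:
            break
    if math.isinf(bounds[i]):
        # Rank falls in the +Inf bucket: report the highest finite bound.
        return bounds[i - 1]
    lower = bounds[i - 1] if i > 0 else 0.0
    prev = cumulative[i - 1] if i > 0 else 0.0
    # Assume observations are spread evenly inside the bucket.
    return lower + (bounds[i] - lower) * (rank - prev) / (count - prev)
```

With hypothetical cumulative counts of `[900, 950, 985, 995, 1000, 1000, 1000, 1000]`, the 0.99 rank (990) lands halfway through the 5 ms to 10 ms bucket, so the estimate is about 7.5 ms and the alert would fire. Because 5 ms is itself a bucket boundary, the alert condition amounts to "fewer than 99% of requests landed in the `le="0.005"` bucket", which does not depend on the interpolation.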

2. Sustained drops at any gate.

sum(rate(syntarie_events_dropped_total[5m])) by (reason) > 10

A burst is normal (a release, a synthetic test). A sustained five-minute drop rate above ~10 events/sec is usually a config issue (rate limit too low, consent flag misconfigured, schema bundle too strict).
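The burst-versus-sustained distinction follows from averaging over the five-minute window. The sketch below is a simplified stand-in for `rate()`: it handles counter resets but ignores the extrapolation Prometheus applies at the window edges, and `simple_rate` is a hypothetical name used only for illustration.

```python
def simple_rate(samples: list[tuple[float, float]]) -> float:
    """Per-second increase of a counter over the sampled window.

    samples: (unix_timestamp, counter_value) pairs in time order.
    A drop in value is treated as a counter reset (process restart).
    """
    if len(samples) < 2:
        return 0.0
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        # After a reset the counter restarts from zero, so the whole
        # current value counts as new increase.
        increase += (cur - prev) if cur >= prev else cur
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed if elapsed > 0 else 0.0
```

For example, 1,200 drops arriving within a single 15 s scrape interval average out to 4 events/sec over a 300 s window and stay under the threshold, while a steady 12 events/sec crosses it. A short burst can still trip the alert if it is large enough, so pairing the expression with a hold duration in the alerting rule is a reasonable extra safeguard.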

The query API does not expose Prometheus metrics in v1.0. It is on the v1.1 roadmap. For now, monitor the API at the platform level (CPU, RSS, HTTP error rate) and rely on Postgres slow-query logs for read-side latency analysis.

The scrape endpoint is private — operators should firewall port 8080 at the network layer and only expose /events and /s2s/events to the public internet. A typical Prometheus scrape config:

- job_name: 'syntarie-collector'
  scrape_interval: 15s
  static_configs:
    - targets:
        - 'collector-1.private:8080'
        - 'collector-2.private:8080'

Independently of Prometheus, the collector emits structured (JSON) logs for every accept / reject decision. These are operator-facing and may expand in any release; do not parse them programmatically. They are intended for log aggregators, not for alerting (which uses Prometheus).