Observability (Prometheus)

The collector serves its metrics in the Prometheus text exposition format at GET /metrics. Operators scrape it from inside the same private network. The endpoint has no authentication, and it bypasses the rate limiter so that it stays reachable when limits trip.

Endpoint

GET /metrics
Content-Type: text/plain; version=0.0.4; charset=utf-8

A 200 response carries the exposition. A 503 with an empty body means the metrics recorder was not initialised, which happens in test fixtures that build state without going through main.rs. Production should never return 503; treat one as a deployment bug and alert on it.
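As a rough illustration of the two documented outcomes, the sketch below probes the endpoint and maps the response to a short status string. The function names, the status strings, and the two-second timeout are illustrative assumptions, not part of the collector.

```python
import urllib.error
import urllib.request


def classify_metrics_response(status: int, body: bytes) -> str:
    """Map a /metrics response to a coarse health verdict."""
    if status == 200 and body:
        return "ok"
    if status == 503:
        # Documented meaning: the metrics recorder was never initialised.
        return "recorder_not_initialised"
    return "unexpected"


def probe(url: str, timeout: float = 2.0) -> str:
    """Fetch `url` and classify the result.

    urllib raises HTTPError for non-2xx statuses, so the 503 case is
    handled in the except branch. Connection failures (URLError) are
    deliberately left to propagate to the caller.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return classify_metrics_response(resp.status, resp.read())
    except urllib.error.HTTPError as err:
        return classify_metrics_response(err.code, err.read())
```

A deployment smoke test could call probe() against a hypothetical address such as http://collector-1.private:8080/metrics and fail the rollout on anything other than "ok".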

Cardinality rules

Every label on a Prometheus metric takes its value from a small, closed set:

Label	Allowed values
`endpoint`	`events`, `s2s`
`reason`	`accepted`, `accepted_consent`, `bot`, `internal`, `consent_required`, `rate_limit`, `quota`, `schema_invalid`
`method`	`POST`, `GET`

There are no per-tenant, per-user, per-anon, or per-site labels. A workspace_id label would push cardinality into the millions; the platform refuses it on principle. Per-tenant breakdowns go through the audit log and analytics queries instead.
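The closed-set rule can be pictured as a lookup table plus a guard. The sketch below is not the collector's implementation; the names ALLOWED_LABELS, validate_labels, and max_series are hypothetical, and only the label names and values come from the table above.

```python
import math

# Closed sets copied from the label table in this section.
ALLOWED_LABELS = {
    "endpoint": {"events", "s2s"},
    "reason": {
        "accepted", "accepted_consent", "bot", "internal",
        "consent_required", "rate_limit", "quota", "schema_invalid",
    },
    "method": {"POST", "GET"},
}


def validate_labels(labels: dict[str, str]) -> None:
    """Raise ValueError if a label name or value falls outside the closed sets."""
    for name, value in labels.items():
        allowed = ALLOWED_LABELS.get(name)
        if allowed is None:
            raise ValueError(f"label {name!r} is not permitted")
        if value not in allowed:
            raise ValueError(f"value {value!r} is not permitted for label {name!r}")


def max_series(label_names: list[str]) -> int:
    """Upper bound on time series for one metric carrying these labels."""
    return math.prod(len(ALLOWED_LABELS[name]) for name in label_names)
```

With these sets, a metric labelled by endpoint and reason can never exceed 2 × 8 = 16 series, whereas a workspace_id label would have no such bound.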

The metric set

Counters:

Metric	Labels	Meaning
`syntarie_events_received_total`	`endpoint`	Every request, including rejects.
`syntarie_events_accepted_total`	`endpoint`	Events that reached storage.
`syntarie_events_dropped_total`	`endpoint`, `reason`	Events dropped at any gate.
`syntarie_events_quarantined_total`	`endpoint`	Events sent to the DLQ.
`syntarie_events_deduped_total`	`endpoint`	Events suppressed by the dedup window.

Histograms:

Metric	Labels	Buckets
`syntarie_ingest_duration_seconds`	`endpoint`	1 ms / 2 ms / 5 ms / 10 ms / 25 ms / 50 ms / 100 ms
`syntarie_validation_duration_seconds`	`endpoint`	100 µs / 500 µs / 1 ms / 5 ms
`syntarie_storage_write_duration_seconds`	`endpoint`	1 ms / 5 ms / 10 ms / 50 ms / 100 ms

Gauges:

Metric	Labels	Meaning
`syntarie_storage_pool_open_connections`	—	Open Postgres connections.
`syntarie_storage_pool_idle_connections`	—	Idle Postgres connections.
`syntarie_dlq_size`	—	DLQ row count, sampled every 60 s.

Two metrics every operator should alert on

1. p99 ingest latency past the 5 ms budget.

histogram_quantile(
  0.99,
  sum(rate(syntarie_ingest_duration_seconds_bucket[5m])) by (le)
) > 0.005

When this fires, the cause is almost always either a slow Postgres write or a saturated CPU. Check syntarie_storage_write_duration_seconds first.
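To see why the alert compares against a bucket boundary, it helps to know that histogram_quantile interpolates linearly inside the bucket that contains the target rank. The sketch below approximates that calculation under simplifying assumptions: it takes cumulative counts rather than per-second rates (the result is the same because the two are proportional), it expects at least one finite bucket, and the sample counts are invented for illustration.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative histogram buckets.

    `buckets` is a list of (upper_bound_seconds, cumulative_count) pairs,
    sorted by upper bound and ending with (math.inf, total_count).
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan  # no observations in the window
    rank = q * total
    # First bucket whose cumulative count reaches the target rank.
    for i, (upper, cumulative) in enumerate(buckets):
        if cumulative >= rank:
            break
    if math.isinf(upper):
        # The rank lies beyond the largest finite bound, so report that bound.
        return buckets[-2][0]
    lower = buckets[i - 1][0] if i > 0 else 0.0
    below = buckets[i - 1][1] if i > 0 else 0
    # Linear interpolation between the bucket's lower and upper bounds.
    return lower + (upper - lower) * (rank - below) / (cumulative - below)


# Bounds match syntarie_ingest_duration_seconds; the counts are made up.
sample_buckets = [
    (0.001, 50), (0.002, 80), (0.005, 95), (0.01, 99),
    (0.025, 100), (0.05, 100), (0.1, 100), (math.inf, 100),
]
p99 = histogram_quantile(0.99, sample_buckets)
alert_fires = p99 > 0.005
```

In this invented sample, 95 of 100 requests finish within 5 ms, so the 99th-ranked request falls in the 5 ms to 10 ms bucket and the estimate exceeds the 0.005 threshold.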

2. Sustained drops at any gate.

sum(rate(syntarie_events_dropped_total[5m])) by (reason) > 10

A burst is normal (a release, a synthetic test). A drop rate that stays above ~10 events/sec for any single reason, averaged over five minutes, is usually a config issue (rate limit too low, consent flag misconfigured, schema bundle too strict).
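The 10 events/sec threshold corresponds to roughly 3,000 dropped events over a five-minute window. The sketch below shows a simplified version of the per-second rate of a counter, including the usual handling of counter resets after a restart. It is an approximation: Prometheus additionally extrapolates to the window boundaries, which this version skips, and the function names are hypothetical.

```python
def counter_increase(samples):
    """Total increase across (timestamp, value) samples.

    A drop in value is treated as a counter reset (process restart),
    so the new value counts as growth from zero.
    """
    total = 0.0
    for (_, prev), (_, curr) in zip(samples, samples[1:]):
        total += curr - prev if curr >= prev else curr
    return total


def per_second_rate(samples):
    """Average per-second increase between the first and last sample."""
    elapsed = samples[-1][0] - samples[0][0]
    if elapsed <= 0:
        raise ValueError("need at least two samples at distinct timestamps")
    return counter_increase(samples) / elapsed


# 21 scrapes at 15 s intervals, 180 drops per interval: 3600 drops in 300 s.
steady = [(t * 15, t * 180) for t in range(21)]
sustained = per_second_rate(steady) > 10
```

Here the counter grows by 12 events/sec, so the alert condition holds for that reason.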

What about the query API?

The query API does not expose Prometheus metrics in v1.0; they are on the v1.1 roadmap. For now, monitor the API at the platform level (CPU, RSS, HTTP error rate) and rely on Postgres slow-query logs for read-side latency analysis.

Scraping

The scrape endpoint is private. Operators should firewall port 8080 at the network layer and expose only /events and /s2s/events to the public internet. A typical Prometheus scrape config:

- job_name: 'syntarie-collector'
  scrape_interval: 15s
  static_configs:
    - targets:
      - 'collector-1.private:8080'
      - 'collector-2.private:8080'

What gets logged

Independently of Prometheus, the collector emits structured (JSON) logs for every accept / reject decision. These are operator-facing and may expand in any release; do not parse them programmatically. They are intended for log aggregators, not for alerting (which uses Prometheus).