Bot detection

Every accepted ingest request passes through the bot filter. Matches are either tagged in storage (is_bot=true) or silently dropped, depending on operator config.

BOT_MODE=Tag # default — persist with is_bot=true
BOT_MODE=Drop # silently discard

Tag is the right default for v1.0 — analytics queries that filter is_bot=false exclude bots, but the rows are still inspectable for forensic work. Drop is appropriate when storage cost dominates and operators have decided bots are pure noise.

The detector is a small ordered list of rules. The first matching rule wins, and its label flows into the structured log (operator-facing and unstable; do not parse it).

  1. IAB spider list: the User-Agent matches a curated list of declared crawler tokens (Googlebot, Bingbot, etc.). Match label: bot_iab_spider.
  2. Headless browser signals: the UA contains HeadlessChrome, PhantomJS, Puppeteer, or Playwright markers, or the navigator.webdriver flag is set in the captured context.ua. Match label: bot_headless.
  3. Impossible interaction rate: more than 30 events in a 1-second window for a single anon_id. Match label: bot_rate_anomaly.
  4. Empty / unset User-Agent: a request with no UA from a browser pathway. Match label: bot_no_ua.

  • The detector does not run a probabilistic ML model (v1.6 candidate).
  • It does not honour robots.txt for any crawler; it is not a robot service. The ingest path assumes the operator has already excluded unwanted paths upstream.
  • It does not consult a real-time threat-intelligence feed. Adding one would tie the platform to an external service in the hot path; the offline IAB list is the v1.0 trade-off.
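
The ordered, first-match-wins rules listed above can be sketched as follows. This is a minimal illustration, not the platform's implementation: the `Event` fields, the `BotDetector` class, and the abbreviated token tuples are all hypothetical, the "browser pathway" condition on the empty-UA rule is not modelled, and timestamps are assumed to arrive in order per anon_id.

```python
from collections import defaultdict, deque
from dataclasses import dataclass
from typing import Optional

# Abbreviated, hypothetical token lists; the curated IAB list is far longer.
SPIDER_TOKENS = ("googlebot", "bingbot")
HEADLESS_TOKENS = ("headlesschrome", "phantomjs", "puppeteer", "playwright")

RATE_LIMIT = 30      # events allowed per window
RATE_WINDOW = 1.0    # window length in seconds


@dataclass
class Event:
    anon_id: str
    user_agent: Optional[str]
    webdriver: bool  # assumed field for the captured navigator.webdriver flag
    ts: float        # event time in seconds


class BotDetector:
    def __init__(self) -> None:
        # Per-anon_id timestamps that fall inside the current window.
        self._recent = defaultdict(deque)

    def _rate_exceeded(self, ev: Event) -> bool:
        window = self._recent[ev.anon_id]
        window.append(ev.ts)
        # Evict timestamps a full window or more older than this event.
        while ev.ts - window[0] >= RATE_WINDOW:
            window.popleft()
        return len(window) > RATE_LIMIT

    def classify(self, ev: Event) -> Optional[str]:
        """Return the label of the first matching rule, or None."""
        ua = (ev.user_agent or "").strip().lower()
        # Always record the event so the rate window stays accurate,
        # even when an earlier rule ends up winning.
        rate_hit = self._rate_exceeded(ev)
        if any(tok in ua for tok in SPIDER_TOKENS):
            return "bot_iab_spider"
        if ev.webdriver or any(tok in ua for tok in HEADLESS_TOKENS):
            return "bot_headless"
        if rate_hit:
            return "bot_rate_anomaly"
        if not ua:
            # Simplification: the "browser pathway" condition is not modelled.
            return "bot_no_ua"
        return None
```

Because the rules are checked in list order, a declared crawler that also sets the webdriver flag is labelled bot_iab_spider rather than bot_headless in this sketch.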

When BOT_MODE=Tag:

  • Bot events are persisted with is_bot=true.
  • All analytics endpoints (pageviews, sessions, events, users/.../timeline) filter is_bot=false by default, so bot rows do not pollute user-facing metrics.
  • The is_bot=true rows are reachable via the underlying SQL (operators can SELECT … WHERE is_bot=true for forensic analysis).
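
To make the Tag-mode split concrete, here is a small sketch using an in-memory SQLite database. The `events` table and its `path` and `user_agent` columns are invented for illustration; only the is_bot flag comes from the description above, and the real storage schema may look quite different.

```python
import sqlite3

# Hypothetical schema: only the is_bot column is taken from the docs.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events (path TEXT, user_agent TEXT, is_bot INTEGER NOT NULL)"
)
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [
        ("/pricing", "Mozilla/5.0", 0),
        ("/pricing", "Googlebot/2.1", 1),
        ("/docs", "Mozilla/5.0", 0),
    ],
)

# What an analytics endpoint would see: bot rows filtered out by default.
pageviews = conn.execute(
    "SELECT path, COUNT(*) FROM events WHERE is_bot = 0 "
    "GROUP BY path ORDER BY path"
).fetchall()

# What an operator would query for forensic work: only the tagged rows.
bot_rows = conn.execute(
    "SELECT path, user_agent FROM events WHERE is_bot = 1"
).fetchall()
```

The same table serves both views; the tagged row is excluded from the pageview counts but remains available to the forensic query.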

When BOT_MODE=Drop:

  • The request still returns 202, so the SDK retry orchestrator does not loop.
  • The event is never persisted. Storage cost is zero.
  • The bot decision is recorded in Prometheus as syntarie_events_dropped_total{reason="bot"}.
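
The difference between the two modes can be sketched as a single branch after detection. Everything named here is hypothetical: `handle_event`, the list standing in for storage, and the `Counter` standing in for the syntarie_events_dropped_total{reason="bot"} metric. Exact-case matching on the mode string is also an assumption.

```python
from collections import Counter
from typing import Optional

ACCEPTED = 202  # status returned in both modes


def handle_event(event: dict, bot_label: Optional[str], mode: str,
                 store: list, dropped: Counter) -> int:
    """Apply BOT_MODE to one event that has already been classified."""
    is_bot = bot_label is not None
    if is_bot and mode == "Drop":
        # Drop mode: nothing is persisted, only the counter moves.
        dropped["bot"] += 1
        return ACCEPTED
    # Tag mode, or a non-bot event in either mode: persist with the flag.
    store.append({**event, "is_bot": is_bot})
    return ACCEPTED
```

In this sketch the caller cannot tell from the status code whether an event was dropped, which matches the behaviour described above of returning 202 in both cases.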

The detector is conservative: false positives are worse than false negatives at the analytics layer. Every match label is operator-facing, so you can audit a sampled day’s logs to see what is being tagged. If your dashboard shows pageviews from a known crawler that was intentionally not included in the IAB list (a private internal scraper, say), you are in the false-negative case: those events land tagged is_bot=false. Use the internal traffic exclusion for that.

The match label inside the structured warn log (reason="bot_iab_spider", etc.) is operator-facing and may be expanded in any version. Customers must not parse it. The is_bot=true boolean and the Drop-mode behaviour are part of the operator-visible contract and will not change within v1.x. See Versioning §5.5.