Bot detection
Every accepted ingest request passes through the bot filter. Matches are
either tagged in storage (is_bot=true) or silently dropped, depending on
operator config.
BOT_MODE=Tag   # default: persist with is_bot=true
BOT_MODE=Drop  # silently discard
Tag is the right default for v1.0: analytics queries that filter
is_bot=false exclude bots, but the rows are still inspectable for
forensic work. Drop is appropriate when storage cost dominates and
operators have decided bots are pure noise.
Detection rules
The detector is a small ordered list. The first matching rule wins; the matched rule label flows into the structured log (operator-facing, unstable; do not parse).
- IAB spider list: User-Agent matches a curated list of declared crawler tokens (Googlebot, Bingbot, etc.). Match label: bot_iab_spider.
- Headless browser signals: UA contains HeadlessChrome, PhantomJS, Puppeteer, or Playwright markers, or the navigator-webdriver flag is set in the captured context.ua. Match label: bot_headless.
- Impossible interaction rate: more than 30 events in a 1-second window for a single anon_id. Match label: bot_rate_anomaly.
- Empty / unset User-Agent: a request with no UA from a browser pathway. Match label: bot_no_ua.
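The first-match-wins ordering can be illustrated with a minimal sketch. It is not the actual detector: the Event shape, the BotDetector class, the two-entry token list, and the sliding-window bookkeeping are all assumptions made for illustration, and every event is assumed to arrive via a browser pathway.

```python
from collections import defaultdict, deque
from dataclasses import dataclass
from typing import Optional

# Assumption: a tiny stand-in for the curated IAB list, which is much larger.
IAB_TOKENS = ("googlebot", "bingbot")
HEADLESS_MARKERS = ("headlesschrome", "phantomjs", "puppeteer", "playwright")
RATE_LIMIT = 30         # more than this many events per window is a match
WINDOW_SECONDS = 1.0


@dataclass
class Event:
    """Hypothetical event shape; field names are illustrative only."""
    anon_id: str
    user_agent: Optional[str]
    webdriver: bool      # navigator-webdriver flag from the captured context
    timestamp: float     # seconds, assumed non-decreasing per anon_id


class BotDetector:
    def __init__(self) -> None:
        # Recent event timestamps per anon_id, oldest first.
        self._recent = defaultdict(deque)

    def _rate_exceeded(self, event: Event) -> bool:
        window = self._recent[event.anon_id]
        window.append(event.timestamp)
        # Discard timestamps that fell out of the 1-second window.
        while event.timestamp - window[0] >= WINDOW_SECONDS:
            window.popleft()
        return len(window) > RATE_LIMIT

    def classify(self, event: Event) -> Optional[str]:
        """Return the label of the first matching rule, or None."""
        ua = (event.user_agent or "").strip().lower()
        # Count every event, even when an earlier rule ends up matching.
        rate_hit = self._rate_exceeded(event)
        if any(token in ua for token in IAB_TOKENS):
            return "bot_iab_spider"
        if event.webdriver or any(m in ua for m in HEADLESS_MARKERS):
            return "bot_headless"
        if rate_hit:
            return "bot_rate_anomaly"
        if not ua:
            return "bot_no_ua"
        return None
```

Because the rules are checked in order, a declared crawler that also sets the webdriver flag is labelled bot_iab_spider, not bot_headless.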
What the detector does NOT do in v1.0
- It does not run a probabilistic ML model (v1.6 candidate).
- It does not honour robots.txt for any crawler; it is not a robot service. The ingest path assumes the operator has already excluded unwanted paths upstream.
- It does not consult a real-time threat-intelligence feed. Adding one would tie the platform to an external service in the hot path; the offline IAB list is the v1.0 trade-off.
Tag mode workflow
When BOT_MODE=Tag:
- Bot events are persisted with is_bot=true.
- All analytics endpoints (pageviews, sessions, events, users/.../timeline) filter is_bot=false by default. Bot rows do not pollute the user-facing metric.
- The is_bot=true rows are reachable via the underlying SQL (operators can SELECT … WHERE is_bot=true for forensic analysis).
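The split between default analytics queries and forensic queries can be sketched against an in-memory SQLite database. The events table, its columns, and the helper names are hypothetical; the real schema and storage engine are not described here.

```python
import sqlite3

# Assumption: a simplified stand-in table, not the platform's real schema.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events (id INTEGER PRIMARY KEY, path TEXT, is_bot INTEGER NOT NULL)"
)
conn.executemany(
    "INSERT INTO events (path, is_bot) VALUES (?, ?)",
    [("/", 0), ("/pricing", 0), ("/", 1)],
)


def pageview_count(db: sqlite3.Connection, include_bots: bool = False) -> int:
    """Count pageviews, excluding tagged bot rows unless asked otherwise."""
    sql = "SELECT COUNT(*) FROM events"
    if not include_bots:
        sql += " WHERE is_bot = 0"
    return db.execute(sql).fetchone()[0]


def bot_rows(db: sqlite3.Connection) -> list:
    """Forensic view: only the rows the detector tagged as bots."""
    return db.execute(
        "SELECT id, path FROM events WHERE is_bot = 1 ORDER BY id"
    ).fetchall()
```

In this sketch the default count ignores the tagged row, while bot_rows still returns it for inspection.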
Drop mode workflow
When BOT_MODE=Drop:
- The request still returns 202 (the SDK retry orchestrator does not loop).
- The event is never persisted. Storage cost is zero.
- The bot decision is recorded in Prometheus as syntarie_events_dropped_total{reason="bot"}.
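How the two modes diverge after a rule matches can be sketched as follows. The names BotMode, load_bot_mode, and IngestSketch are hypothetical, a plain Counter stands in for the Prometheus counter, and the sketch assumes BOT_MODE is matched case-sensitively and that accepted non-bot events also return 202.

```python
import os
from collections import Counter
from enum import Enum
from typing import Mapping, Optional


class BotMode(Enum):
    TAG = "Tag"
    DROP = "Drop"


def load_bot_mode(env: Optional[Mapping[str, str]] = None) -> BotMode:
    """Read BOT_MODE, defaulting to Tag; unknown values raise ValueError."""
    source = os.environ if env is None else env
    return BotMode(source.get("BOT_MODE", "Tag"))


class IngestSketch:
    """Hypothetical ingest step that runs after the detector has decided."""

    def __init__(self, mode: BotMode) -> None:
        self.mode = mode
        self.stored = []                 # stand-in for persistent storage
        self.dropped_total = Counter()   # stand-in for the Prometheus counter

    def handle(self, event: dict, match_label: Optional[str]) -> int:
        is_bot = match_label is not None
        if is_bot and self.mode is BotMode.DROP:
            # Drop mode: nothing is persisted, only the counter moves.
            self.dropped_total["bot"] += 1
        else:
            # Tag mode, or a non-bot event: persist with the is_bot flag.
            self.stored.append({**event, "is_bot": is_bot})
        # Same status either way, so the SDK does not retry dropped events.
        return 202
```

Returning the same status in both modes keeps the drop invisible to the client, which is what stops the retry loop.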
False positives
The detector is conservative: false positives are worse than false
negatives at the analytics layer. Every match label is operator-facing
and you can audit a sampled day’s logs to see what is being tagged. If
your dashboard shows pageviews from a known crawler that was
intentionally not included in the IAB list (a private internal scraper,
say), you are in the false-negative case: its events land tagged
is_bot=false and count as regular traffic. Use the internal traffic
exclusion for that.
Stability
The match label inside the structured warn log (reason="bot_iab_spider",
etc.) is operator-facing and may be expanded in any version. Customers
must not parse it. The is_bot=true boolean and the Drop-mode behaviour
ARE part of the operator-visible contract and won’t change inside v1.x.
See Versioning §5.5.