Observability
How to monitor TameFlare with Prometheus metrics, Grafana dashboards, and alerting. TameFlare exposes a /metrics endpoint on the gateway for scraping.
Prometheus metrics endpoint
The gateway exposes Prometheus-compatible metrics at:
GET http://localhost:9443/metrics
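A quick way to confirm the gateway is exposing metrics is to curl the endpoint and filter on the aaf_ prefix used by the metrics below:
curl -s http://localhost:9443/metrics | grep aaf_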
Available metrics
| Metric | Type | Labels | Description |
|---|---|---|---|
| aaf_requests_total | counter | agent, connector, decision, action_type | Total proxied requests |
| aaf_requests_denied_total | counter | agent, connector, reason | Total denied requests |
| aaf_requests_approved_total | counter | agent, connector | Total requests that went through approval |
| aaf_request_duration_seconds | histogram | agent, connector | Request latency (proxy overhead + upstream) |
| aaf_proxy_overhead_seconds | histogram | agent, connector | Proxy overhead only (permission check + credential injection) |
| aaf_active_connections | gauge | agent | Currently open connections per agent |
| aaf_kill_switch_active | gauge | scope | 1 if kill switch is active, 0 otherwise |
| aaf_approvals_pending | gauge | — | Number of pending approval requests |
| aaf_connector_errors_total | counter | connector, error_type | Upstream API errors (401, 429, 500, timeout) |
| aaf_rate_limit_hits_total | counter | agent | Rate limit rejections per agent |
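The endpoint serves standard Prometheus exposition format. A rough illustration of the scraped output (metric names and labels come from the table above; the label values and numbers here are made up, not TameFlare defaults):
# HELP aaf_requests_total Total proxied requests
# TYPE aaf_requests_total counter
aaf_requests_total{agent="support-bot",connector="github",decision="allow",action_type="read"} 1042
aaf_requests_denied_total{agent="support-bot",connector="github",reason="policy_violation"} 17
aaf_kill_switch_active{scope="global"} 0
aaf_approvals_pending 3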
Scrape configuration
Add to your prometheus.yml:
scrape_configs:
  - job_name: 'TameFlare-gateway'
    scrape_interval: 15s
    static_configs:
      - targets: ['localhost:9443']
    metrics_path: /metrics
Grafana dashboard
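The panels below assume Prometheus is already added as a Grafana data source. If you provision data sources from files, a minimal sketch (assuming Prometheus is reachable at http://localhost:9090) would be:
# /etc/grafana/provisioning/datasources/prometheus.yml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://localhost:9090
    isDefault: true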
Example dashboard panels
Actions over time (stacked area)
# Actions per minute by decision
sum(rate(aaf_requests_total[5m])) by (decision) * 60
Denial rate
# Percentage of requests denied
sum(rate(aaf_requests_denied_total[5m])) / sum(rate(aaf_requests_total[5m])) * 100
Top agents by volume
# Top 10 agents by request count
topk(10, sum(rate(aaf_requests_total[5m])) by (agent))
Proxy latency (p95)
# 95th percentile proxy overhead
histogram_quantile(0.95, sum(rate(aaf_proxy_overhead_seconds_bucket[5m])) by (le))
Kill switch status
# 1 = active, 0 = inactive
aaf_kill_switch_active
Connector error rate
# Upstream errors per connector
sum(rate(aaf_connector_errors_total[5m])) by (connector, error_type)
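The metrics table above supports further panels as well; two more sketches that tend to be useful:
Pending approvals
# Approval requests currently awaiting a decision
aaf_approvals_pending
Denials by reason
# Denied requests per second, broken down by reason
sum(rate(aaf_requests_denied_total[5m])) by (reason)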
Importing a dashboard
A pre-built Grafana dashboard JSON is available at apps/gateway-v2/grafana-dashboard.json (if present); otherwise, create one from the PromQL queries above.
To import via the Grafana UI:
- Open Grafana → Dashboards → Import
- Paste the JSON or upload the file
- Select your Prometheus data source
- Save
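To script the import instead, Grafana's HTTP API accepts the same JSON; a sketch, assuming a service account token in GRAFANA_TOKEN and Grafana at https://grafana.example.com (the API expects the dashboard wrapped in a small envelope):
# wrap the exported dashboard JSON and POST it to Grafana
jq '{dashboard: ., overwrite: true}' grafana-dashboard.json \
  | curl -s -X POST https://grafana.example.com/api/dashboards/db \
      -H "Authorization: Bearer $GRAFANA_TOKEN" \
      -H "Content-Type: application/json" \
      -d @-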
Alerting
Prometheus alert rules
Create an alert rules file (TameFlare-alerts.yml):
groups:
  - name: TameFlare
    rules:
      # High denial rate
      - alert: AAFHighDenialRate
        expr: >
          sum(rate(aaf_requests_denied_total[5m]))
          / sum(rate(aaf_requests_total[5m])) > 0.5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "TameFlare denial rate above 50%"
          description: "More than half of agent requests are being denied. Check policies and permissions."

      # Kill switch activated
      - alert: AAFKillSwitchActive
        expr: aaf_kill_switch_active == 1
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "TameFlare kill switch is active"
          description: "The kill switch is blocking agent traffic. Scope: {{ $labels.scope }}"

      # Gateway down
      - alert: AAFGatewayDown
        expr: up{job="TameFlare-gateway"} == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "TameFlare gateway is down"
          description: "The TameFlare gateway is not responding to Prometheus scrapes."

      # High proxy latency
      - alert: AAFHighLatency
        expr: >
          histogram_quantile(0.95,
            sum(rate(aaf_proxy_overhead_seconds_bucket[5m])) by (le)
          ) > 0.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "TameFlare proxy p95 latency above 100ms"
          description: "Proxy overhead is unusually high. Check gateway resource usage."

      # Agent rate limited
      - alert: AAFAgentRateLimited
        expr: sum(rate(aaf_rate_limit_hits_total[5m])) by (agent) > 0
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "Agent {{ $labels.agent }} is being rate limited"
          description: "Agent is exceeding 120 req/min. Check for runaway behavior."

      # Upstream errors spike
      - alert: AAFUpstreamErrors
        expr: sum(rate(aaf_connector_errors_total[5m])) by (connector) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High upstream error rate for {{ $labels.connector }}"
          description: "Connector is experiencing frequent upstream API errors."
Alertmanager integration
Route alerts to Slack, PagerDuty, or email via Alertmanager:
# alertmanager.yml
route:
  receiver: 'slack'
  routes:
    - match:
        severity: critical
      receiver: 'pagerduty'

receivers:
  - name: 'slack'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/xxx'
        channel: '#TameFlare-alerts'
        title: '{{ .GroupLabels.alertname }}'
        text: '{{ .CommonAnnotations.description }}'
  - name: 'pagerduty'
    pagerduty_configs:
      - service_key: 'xxx'
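For alerts to reach Alertmanager, Prometheus also needs to know where it runs; a minimal sketch, assuming Alertmanager listens on localhost:9093:
# prometheus.yml
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['localhost:9093']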
Datadog integration
TameFlare does not have a native Datadog integration. Options:
| Approach | How |
|---|---|
| Prometheus → Datadog | Use the Datadog Prometheus integration to scrape TameFlare's /metrics endpoint |
| StatsD forwarding | Not currently supported. Planned for a future release. |
| Log forwarding | Forward TameFlare's stdout/stderr logs to Datadog Agent for log-based monitoring |
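For the Prometheus → Datadog route, an OpenMetrics check on the Datadog Agent along these lines should work (a sketch; the namespace and metric filter are arbitrary choices, not TameFlare defaults):
# conf.d/openmetrics.d/conf.yaml on the Datadog Agent host
instances:
  - openmetrics_endpoint: http://localhost:9443/metrics
    namespace: tameflare
    metrics:
      - aaf_.*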
Health monitoring
Gateway health endpoint
curl http://localhost:9443/health
# {"status": "ok", "uptime_seconds": 3600, "agents": 3, "connectors": 5}Control plane health endpoint
curl http://localhost:3000/api/health
# {"status": "healthy", "database": "ok", "gateway": "online", "latency_ms": 5}Recommended health checks
| Check | Endpoint | Frequency | Alert if |
|---|---|---|---|
| Gateway alive | GET /health on port 9443 | Every 30s | No response for 2 min |
| Control plane alive | GET /api/health on port 3000 | Every 30s | No response for 2 min |
| Database writable | Included in /api/health | Every 30s | Status not "ok" |
| Gateway reachable from control plane | Included in /api/health | Every 30s | Gateway status "offline" |
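If the gateway runs in Kubernetes, the checks above map naturally onto container probes; a sketch for the gateway container, with thresholds mirroring the 30s interval and 2-minute alert window from the table:
livenessProbe:
  httpGet:
    path: /health
    port: 9443
  periodSeconds: 30
  failureThreshold: 4
readinessProbe:
  httpGet:
    path: /health
    port: 9443
  periodSeconds: 30
  failureThreshold: 2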
Next steps
- Architecture — deployment topologies and scaling
- Performance — latency and resource benchmarks
- Kill Switch Recovery — investigation runbook
- Failure Modes — behavior under failure conditions