Observability
How to monitor TameFlare with Prometheus metrics, Grafana dashboards, and alerting. TameFlare exposes a /metrics endpoint on the gateway for scraping.
Prometheus metrics endpoint
The gateway exposes Prometheus-compatible metrics at:
GET http://localhost:9443/metrics
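A quick way to confirm the gateway is exposing metrics is to curl the endpoint and filter on the aaf_ prefix used by the metrics below:
curl -s http://localhost:9443/metrics | grep aaf_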
Available metrics
| Metric | Type | Labels | Description |
|---|---|---|---|
| aaf_requests_total | counter | agent, connector, decision, action_type | Total proxied requests |
| aaf_requests_denied_total | counter | agent, connector, reason | Total denied requests |
| aaf_requests_approved_total | counter | agent, connector | Total requests that went through approval |
| aaf_request_duration_seconds | histogram | agent, connector | Request latency (proxy overhead + upstream) |
| aaf_proxy_overhead_seconds | histogram | agent, connector | Proxy overhead only (permission check + credential injection) |
| aaf_active_connections | gauge | agent | Currently open connections per agent |
| aaf_kill_switch_active | gauge | scope | 1 if kill switch is active, 0 otherwise |
| aaf_approvals_pending | gauge | — | Number of pending approval requests |
| aaf_connector_errors_total | counter | connector, error_type | Upstream API errors (401, 429, 500, timeout) |
| aaf_rate_limit_hits_total | counter | agent | Rate limit rejections per agent |
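The endpoint serves standard Prometheus exposition format. A rough illustration of the scraped output (metric names and labels come from the table above; the label values and numbers here are made up, not TameFlare defaults):
# HELP aaf_requests_total Total proxied requests
# TYPE aaf_requests_total counter
aaf_requests_total{agent="support-bot",connector="github",decision="allow",action_type="read"} 1042
aaf_requests_denied_total{agent="support-bot",connector="github",reason="policy_violation"} 17
aaf_kill_switch_active{scope="global"} 0
aaf_approvals_pending 3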
Scrape configuration
Add to your prometheus.yml:
scrape_configs:
  - job_name: 'TameFlare-gateway'
    scrape_interval: 15s
    static_configs:
      - targets: ['localhost:9443']
    metrics_path: /metrics
Grafana dashboard
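The panels below assume Prometheus is already added as a Grafana data source. If you provision data sources from files, a minimal sketch (assuming Prometheus is reachable at http://localhost:9090) would be:
# /etc/grafana/provisioning/datasources/prometheus.yml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://localhost:9090
    isDefault: true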
Example dashboard panels
Actions over time (stacked area)
# Actions per minute by decision
sum(rate(aaf_requests_total[5m])) by (decision) * 60
Denial rate
# Percentage of requests denied
sum(rate(aaf_requests_denied_total[5m])) / sum(rate(aaf_requests_total[5m])) * 100
Top agents by volume
# Top 10 agents by request count
topk(10, sum(rate(aaf_requests_total[5m])) by (agent))
Proxy latency (p95)
# 95th percentile proxy overhead
histogram_quantile(0.95, sum(rate(aaf_proxy_overhead_seconds_bucket[5m])) by (le))
Kill switch status
# 1 = active, 0 = inactive
aaf_kill_switch_active
Connector error rate
# Upstream errors per connector
sum(rate(aaf_connector_errors_total[5m])) by (connector, error_type)
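The metrics table above supports further panels as well; two more sketches that tend to be useful:
Pending approvals
# Approval requests currently awaiting a decision
aaf_approvals_pending
Denials by reason
# Denied requests per second, broken down by reason
sum(rate(aaf_requests_denied_total[5m])) by (reason)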
Importing a dashboard
A pre-built Grafana dashboard JSON is available at apps/gateway-v2/grafana-dashboard.json (if present); otherwise, create one from the PromQL queries above.
To import via the Grafana UI:
- Open Grafana → Dashboards → Import
- Paste the JSON or upload the file
- Select your Prometheus data source
- Save
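To script the import instead, Grafana's HTTP API accepts the same JSON; a sketch, assuming a service account token in GRAFANA_TOKEN and Grafana at https://grafana.example.com (the API expects the dashboard wrapped in a small envelope):
# wrap the exported dashboard JSON and POST it to Grafana
jq '{dashboard: ., overwrite: true}' grafana-dashboard.json \
  | curl -s -X POST https://grafana.example.com/api/dashboards/db \
      -H "Authorization: Bearer $GRAFANA_TOKEN" \
      -H "Content-Type: application/json" \
      -d @-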
Alerting
Prometheus alert rules
Create an alert rules file (TameFlare-alerts.yml):
groups:
  - name: TameFlare
    rules:
      # High denial rate
      - alert: AAFHighDenialRate
        expr: >
          sum(rate(aaf_requests_denied_total[5m]))
          / sum(rate(aaf_requests_total[5m])) > 0.5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "TameFlare denial rate above 50%"
          description: "More than half of agent requests are being denied. Check policies and permissions."

      # Kill switch activated
      - alert: AAFKillSwitchActive
        expr: aaf_kill_switch_active == 1
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "TameFlare kill switch is active"
          description: "The kill switch is blocking agent traffic. Scope: {{ $labels.scope }}"

      # Gateway down
      - alert: AAFGatewayDown
        expr: up{job="TameFlare-gateway"} == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "TameFlare gateway is down"
          description: "The TameFlare gateway is not responding to Prometheus scrapes."

      # High proxy latency
      - alert: AAFHighLatency
        expr: >
          histogram_quantile(0.95,
            sum(rate(aaf_proxy_overhead_seconds_bucket[5m])) by (le)
          ) > 0.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "TameFlare proxy p95 latency above 100ms"
          description: "Proxy overhead is unusually high. Check gateway resource usage."

      # Agent rate limited
      - alert: AAFAgentRateLimited
        expr: sum(rate(aaf_rate_limit_hits_total[5m])) by (agent) > 0
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "Agent {{ $labels.agent }} is being rate limited"
          description: "Agent is exceeding 120 req/min. Check for runaway behavior."

      # Upstream errors spike
      - alert: AAFUpstreamErrors
        expr: sum(rate(aaf_connector_errors_total[5m])) by (connector) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High upstream error rate for {{ $labels.connector }}"
          description: "Connector is experiencing frequent upstream API errors."
Alertmanager integration
Route alerts to Slack, PagerDuty, or email via Alertmanager:
# alertmanager.yml
route:
  receiver: 'slack'
  routes:
    - match:
        severity: critical
      receiver: 'pagerduty'

receivers:
  - name: 'slack'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/xxx'
        channel: '#TameFlare-alerts'
        title: '{{ .GroupLabels.alertname }}'
        text: '{{ .CommonAnnotations.description }}'
  - name: 'pagerduty'
    pagerduty_configs:
      - service_key: 'xxx'
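For alerts to reach Alertmanager, Prometheus also needs to know where it runs; a minimal sketch, assuming Alertmanager listens on localhost:9093:
# prometheus.yml
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['localhost:9093']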
Datadog integration
TameFlare does not have a native Datadog integration. Options:
| Approach | How |
|---|---|
| Prometheus → Datadog | Use the Datadog Prometheus integration to scrape TameFlare's /metrics endpoint |
| StatsD forwarding | Not currently supported. Planned for a future release. |
| Log forwarding | Forward TameFlare's stdout/stderr logs to Datadog Agent for log-based monitoring |
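For the Prometheus → Datadog route, an OpenMetrics check on the Datadog Agent along these lines should work (a sketch; the namespace and metric filter are arbitrary choices, not TameFlare defaults):
# conf.d/openmetrics.d/conf.yaml on the Datadog Agent host
instances:
  - openmetrics_endpoint: http://localhost:9443/metrics
    namespace: tameflare
    metrics:
      - aaf_.*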
Health monitoring
Gateway health endpoint
curl http://localhost:9443/health
# {"status": "ok", "uptime_seconds": 3600, "agents": 3, "connectors": 5}Control plane health endpoint
curl http://localhost:3000/api/health
# {"status": "healthy", "database": "ok", "gateway": "online", "latency_ms": 5}Recommended health checks
| Check | Endpoint | Frequency | Alert if |
|---|---|---|---|
| Gateway alive | GET /health on port 9443 | Every 30s | No response for 2 min |
| Control plane alive | GET /api/health on port 3000 | Every 30s | No response for 2 min |
| Database writable | Included in /api/health | Every 30s | Status not "ok" |
| Gateway reachable from control plane | Included in /api/health | Every 30s | Gateway status "offline" |
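If the gateway runs in Kubernetes, the checks above map naturally onto container probes; a sketch for the gateway container, with thresholds mirroring the 30s interval and 2-minute alert window from the table:
livenessProbe:
  httpGet:
    path: /health
    port: 9443
  periodSeconds: 30
  failureThreshold: 4
readinessProbe:
  httpGet:
    path: /health
    port: 9443
  periodSeconds: 30
  failureThreshold: 2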
Next steps
- Architecture — deployment topologies and scaling
- Performance — latency and resource benchmarks
- Kill Switch Recovery — investigation runbook
- Failure Modes — behavior under failure conditions