# Failure Modes

How TameFlare behaves when things go wrong. Every failure mode defaults to deny — TameFlare never fails open.


## Gateway crash

If the Go gateway process crashes or is killed:

| Scenario | Agent behavior |
|---|---|
| Gateway exits while agent is running | Agent's HTTP requests fail with connection refused (proxy unreachable). No requests leak to the real API. |
| Gateway restarts | In-flight requests are lost. Pending approvals are lost (stored in gateway memory). Agents must retry. |
| Gateway OOM-killed | Same as crash. SQLite WAL is recovered on restart — no data loss for traffic logs. |

Recovery: Restart the gateway. `tf run` does not auto-restart the gateway — use a process manager (systemd, Docker restart policy) for production.
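
If you run the gateway in Docker, a restart policy is the simplest option. A minimal sketch; the image name below is a placeholder, not a published image, and the port mapping matches the health-check example later on this page:

```bash
# Auto-restart the gateway if it crashes or is OOM-killed
# (image name is a placeholder; adjust the port mapping to your setup)
docker run -d \
  --name tameflare-gateway \
  --restart unless-stopped \
  -p 9443:9443 \
  example/tameflare-gateway:latest
```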

> **Tip:** The control plane (Next.js) and gateway (Go) are independent processes. A gateway crash does not affect the dashboard, API, or audit log.

## SQLite corruption

SQLite is crash-safe by default (WAL mode). Corruption is rare but possible with:

- Hardware failure (disk errors, power loss during a write)
- Multiple processes writing to the same `.db` file simultaneously
- Filesystem bugs (NFS, network-mounted storage)

Recovery (a shell sketch of these steps follows the list):

1. Stop the gateway and control plane
2. Run the SQLite integrity check: `sqlite3 local.db "PRAGMA integrity_check;"`
3. If corrupt, restore from backup (see Backup & Restore)
4. If there is no backup, export what you can: `sqlite3 local.db ".dump" > recovery.sql`
5. Create a fresh database: delete `local.db`, restart, run `pnpm db:push`
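
The same procedure as a shell sketch, assuming the default `local.db` path and that both processes are already stopped:

```bash
# Step 2: integrity check
sqlite3 local.db "PRAGMA integrity_check;"

# Step 4 (only if the check fails and no backup exists): salvage what you can
sqlite3 local.db ".dump" > recovery.sql

# Step 5: remove the corrupt database and its sidecar files, then recreate the schema
rm -f local.db local.db-wal local.db-shm
pnpm db:push
```
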
> **Warning:** Never run two TameFlare instances writing to the same SQLite file. Use Turso (hosted libSQL) if you need multi-instance access.

## Approval timeout

When an action requires approval in proxy mode, the gateway holds the HTTP connection for up to 5 minutes.

| Event | Result |
|---|---|
| Human approves within 5 min | Request proceeds — credentials injected, forwarded to API |
| Human denies within 5 min | Request returns 403 to the agent |
| No response in 5 min | Request returns 408 (timeout) to the agent |
| Gateway restarts during hold | Request fails — agent must retry |

The 5-minute timeout is currently not configurable. Agents using the SDK can poll for approval status instead of holding a connection.
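
A polling sketch in shell for agents that prefer not to hold a connection open. The endpoint path, port, and `status` field below are hypothetical placeholders; check the SDK/API reference for the real route and response shape:

```bash
# Hypothetical polling loop: the endpoint and response fields are placeholders,
# not the documented TameFlare API.
APPROVAL_ID="..."   # returned when the action request is submitted
while true; do
  status=$(curl -sf "http://localhost:3000/api/approvals/$APPROVAL_ID" | jq -r '.status')
  [ "$status" != "pending" ] && break
  sleep 5
done
echo "approval outcome: $status"
```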


## DNS failure

If the gateway cannot resolve a domain, or one component cannot reach another over the network:

| Scenario | Result |
|---|---|
| Connector domain unreachable (e.g., `api.github.com`) | Request fails with 502 Bad Gateway. Logged in the traffic log with `error: dns_resolution_failed`. |
| Control plane unreachable from gateway | Gateway operates independently using local permissions (SQLite). Token verification is skipped — only permission-based decisions apply. |
| Agent cannot reach gateway | Connection refused. No requests forwarded. |
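
To confirm the failure is DNS rather than the connector itself, resolve the domain from the gateway host with standard tooling:

```bash
# Does the gateway host resolve the connector domain at all?
getent hosts api.github.com || echo "local DNS resolution failed"

# Cross-check against a public resolver to rule out a broken local resolver
dig +short api.github.com @1.1.1.1
```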


## Malicious policies

The policy engine is not Turing-complete and cannot execute arbitrary code:

| Attack vector | Result |
|---|---|
| YAML bomb (billion laughs) | YAML parser has depth/size limits. Oversized policies are rejected at parse time. |
| Regex denial of service (ReDoS) | The `matches` operator uses Go's `regexp` (linear time, no backtracking). ReDoS is not possible. |
| Injection via field values | Policy conditions compare strings — no eval, no template injection, no shell execution. |
| Malformed YAML | Parse error caught. Policy is skipped. Other policies continue to evaluate. Action is denied if no policy matches (default deny). |


## Disk full

| Component | Behavior |
|---|---|
| Gateway SQLite (traffic log) | New traffic log entries fail silently. Proxy continues to forward allowed requests. Permissions are still enforced from the in-memory cache. |
| Control plane SQLite | API returns 500 for write operations (action requests, audit events). Read operations (dashboard, status) continue. |
| Credential vault | Vault is read at startup. Disk full does not affect credential injection for already-loaded credentials. Adding new credentials fails. |

Recovery: Free disk space, then restart. No data migration needed — SQLite recovers automatically.
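
Before restarting, confirm the volume holding the database files actually has free space again (filenames assume the default `local.db` layout):

```bash
# Free space on the volume that holds the database (run from the data directory)
df -h .

# How much the database and its WAL sidecar are using
du -h local.db local.db-wal 2>/dev/null
```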


## Connector auth failure

When a connector's API key is invalid or expired, or the upstream API rate-limits, times out, or returns an error:

| Scenario | Result |
|---|---|
| Invalid API key | Upstream returns 401/403. Gateway forwards the error to the agent. Logged in the traffic log. |
| Upstream rate limit (e.g., GitHub 5000 req/hr) | Upstream returns 429. Gateway forwards the 429 + `Retry-After` to the agent. |
| Upstream timeout | Gateway returns 504 after its own timeout (30s default). |
| Upstream 500 error | Gateway forwards the 500 to the agent. |

TameFlare does not retry upstream failures — the agent is responsible for retry logic. The traffic log records the upstream status code for debugging.
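
Because errors are forwarded verbatim, a simple agent-side backoff that honors `Retry-After` is usually enough. A `curl` sketch: the proxy address and target URL are examples, and it assumes `Retry-After` is given in seconds:

```bash
URL="https://api.github.com/user/repos"
PROXY="http://localhost:9443"   # example gateway address; use your agent's HTTP_PROXY value

for attempt in 1 2 3 4 5; do
  code=$(curl -s -o /dev/null -D /tmp/headers.txt -w '%{http_code}' -x "$PROXY" "$URL")
  case "$code" in
    2??) echo "success"; break ;;
    429) # honor Retry-After when present, otherwise back off linearly
         delay=$(grep -i '^retry-after:' /tmp/headers.txt | awk '{print $2}' | tr -d '\r')
         sleep "${delay:-$attempt}" ;;
    5??) sleep "$attempt" ;;   # transient upstream error, back off and retry
    *)   echo "non-retryable status $code"; break ;;
  esac
done
```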


## Cold start time

| Component | Cold start | Notes |
|---|---|---|
| Gateway (Go) | ~50-200ms | Binary startup + SQLite open + credential vault decrypt + CA cert load |
| Control plane (Next.js) | ~2-5s (dev), ~500ms-1s (production) | Next.js server startup + DB connection |
| First request | +10-50ms | Connector registry initialization, first TLS cert generation |

For CI/CD pipelines, add a health check loop before running agents:

```bash
# Wait for gateway to be ready
until curl -sf http://localhost:9443/health > /dev/null; do sleep 0.5; done
```

## Load balancer compatibility

TameFlare's gateway is a forward proxy (not a reverse proxy). It is not designed to sit behind a load balancer for agent traffic.

| Topology | Supported? |
|---|---|
| Single gateway, multiple agents | Yes — each agent gets a dedicated port |
| Multiple gateways, shared DB | No — SQLite does not support concurrent writers. Use one gateway per host. |
| Reverse proxy in front of control plane | Yes — use nginx/Caddy for TLS termination to the Next.js app |
| Reverse proxy in front of gateway | Not recommended — agents connect directly to the gateway via `HTTP_PROXY` |

For high availability, run the control plane behind a reverse proxy with Turso (cloud SQLite) for shared state. The gateway should run on the same host as the agents it governs.
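
For the reverse-proxy piece, a single Caddy command is often enough for TLS termination in front of the control plane (port 3000 is the Next.js default, not a TameFlare-documented value; adjust to your deployment):

```bash
# Terminate TLS and forward to the Next.js control plane
caddy reverse-proxy --from tameflare.example.com --to localhost:3000
```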


## Next steps