Failure Modes
How TameFlare behaves when things go wrong. Every failure mode defaults to deny — TameFlare never fails open.
Gateway crash
If the Go gateway process crashes or is killed:
| Scenario | Agent behavior |
|---|---|
| Gateway exits while agent is running | Agent's HTTP requests fail with connection refused (proxy unreachable). No requests leak to the real API. |
| Gateway restarts | In-flight requests are lost. Pending approvals are lost (stored in gateway memory). Agents must retry. |
| Gateway OOM-killed | Same as crash. SQLite WAL is recovered on restart — no data loss for traffic logs. |
Recovery: Restart the gateway. `tf run` does not auto-restart the gateway — use a process manager (systemd, Docker restart policy) for production.
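If you run the gateway in Docker, a restart policy covers crashes and OOM kills. A minimal sketch, assuming a containerized gateway (the image name, port mapping, and volume path below are placeholders, not part of TameFlare's documented interface):

```bash
# Restart the gateway automatically on crash or OOM kill.
# "tameflare/gateway", the port mapping, and the volume path are illustrative placeholders.
docker run -d \
  --name tameflare-gateway \
  --restart unless-stopped \
  -p 9443:9443 \
  -v "$(pwd)/local.db:/data/local.db" \
  tameflare/gateway
```

With systemd, `Restart=on-failure` in the unit file achieves the same effect.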
SQLite corruption
SQLite is crash-safe by default (WAL mode). Corruption is rare but possible with:
- Hardware failure (disk errors, power loss during write)
- Running multiple processes writing to the same `.db` file simultaneously
- Filesystem bugs (NFS, network-mounted storage)
Recovery:
- Stop the gateway and control plane
- Run SQLite integrity check: `sqlite3 local.db "PRAGMA integrity_check;"`
- If corrupt, restore from backup (see Backup & Restore)
- If no backup, export what you can: `sqlite3 local.db ".dump" > recovery.sql`
- Create a fresh database: delete `local.db`, restart, run `pnpm db:push`
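The same procedure as a shell sketch, assuming the database file lives at `./local.db` relative to where you run the commands:

```bash
# 1. With the gateway and control plane stopped, check integrity ("ok" means healthy)
sqlite3 local.db "PRAGMA integrity_check;"

# 2. If the check reports errors and there is no backup, salvage what is still readable
sqlite3 local.db ".dump" > recovery.sql

# 3. Start fresh: set the damaged file aside, restart, then recreate the schema
mv local.db local.db.corrupt
pnpm db:push
```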
Approval timeout
When an action requires approval in proxy mode, the gateway holds the HTTP connection for up to 5 minutes.
| Event | Result |
|---|---|
| Human approves within 5 min | Request proceeds — credentials injected, forwarded to API |
| Human denies within 5 min | Request returns 403 to the agent |
| No response in 5 min | Request returns 408 (timeout) to the agent |
| Gateway restarts during hold | Request fails — agent must retry |
The 5-minute timeout is currently not configurable. Agents using the SDK can poll for approval status instead of holding a connection.
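For agents that do hold the connection, the status codes above are enough to drive the retry decision. A sketch from the agent side, assuming the agent's dedicated proxy port is 9443 and the gateway's CA certificate is already trusted (the target URL is illustrative):

```bash
# Retry on approval timeout (408); stop on explicit denial (403).
export HTTP_PROXY=http://localhost:9443 HTTPS_PROXY=http://localhost:9443

while :; do
  status=$(curl -s -o /dev/null -w '%{http_code}' https://api.github.com/user/repos)
  case "$status" in
    408) echo "approval timed out, retrying" ;;
    403) echo "denied by reviewer"; break ;;
    *)   echo "got $status"; break ;;
  esac
done
```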
DNS failure
If the gateway cannot resolve a domain:
| Scenario | Result |
|---|---|
| Connector domain unreachable (e.g., api.github.com) | Request fails with 502 Bad Gateway. Logged in traffic log with error: dns_resolution_failed. |
| Control plane unreachable from gateway | Gateway operates independently using local permissions (SQLite). Token verification is skipped — only permission-based decisions apply. |
| Agent cannot reach gateway | Connection refused. No requests forwarded. |
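When a request fails with a 502, a quick check from the gateway host tells you whether DNS is actually the problem (assumes a Linux host with `getent` available):

```bash
# Does the connector domain resolve from the gateway host?
getent hosts api.github.com || echo "dns_resolution_failed: check /etc/resolv.conf and upstream DNS"
```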
Malicious policies
The policy engine is not Turing-complete and cannot execute arbitrary code:
| Attack vector | Result |
|---|---|
| YAML bomb (billion laughs) | YAML parser has depth/size limits. Oversized policies are rejected at parse time. |
| Regex denial of service (ReDoS) | The matches operator uses Go's regexp (linear time, no backtracking). ReDoS is not possible. |
| Injection via field values | Policy conditions compare strings — no eval, no template injection, no shell execution. |
| Malformed YAML | Parse error caught. Policy is skipped. Other policies continue to evaluate. Action is denied if no policy matches (default deny). |
Disk full
| Component | Behavior |
|---|---|
| Gateway SQLite (traffic log) | New traffic log entries fail silently. Proxy continues to forward allowed requests. Permissions still enforced from in-memory cache. |
| Control plane SQLite | API returns 500 for write operations (action requests, audit events). Read operations (dashboard, status) continue. |
| Credential vault | Vault is read at startup. Disk full does not affect credential injection for already-loaded credentials. Adding new credentials fails. |
Recovery: Free disk space, then restart. No data migration needed — SQLite recovers automatically.
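Before restarting, it can help to confirm where the space went. A short sketch, assuming the SQLite files sit in the current directory (adjust paths to your deployment):

```bash
# Check free space and the size of the SQLite files (the -wal/-shm files sit next to the main .db)
df -h .
du -sh local.db local.db-wal local.db-shm 2>/dev/null

# Free space (rotate logs, remove stale artifacts), then restart the gateway and control plane
```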
Connector auth failure
When a connector's API key is invalid, expired, or rate-limited by the upstream API:
| Scenario | Result |
|---|---|
| Invalid API key | Upstream returns 401/403. Gateway forwards the error to the agent. Logged in traffic log. |
| Upstream rate limit (e.g., GitHub 5000 req/hr) | Upstream returns 429. Gateway forwards the 429 + Retry-After to the agent. |
| Upstream timeout | Gateway returns 504 after its own timeout (30s default). |
| Upstream 500 error | Gateway forwards the 500 to the agent. |
TameFlare does not retry upstream failures — the agent is responsible for retry logic. The traffic log records the upstream status code for debugging.
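Because upstream errors are passed through unchanged, backoff belongs in the agent. A sketch of agent-side retry that honors `Retry-After` on 429 and backs off exponentially on 5xx (the target URL is illustrative; the proxy environment is assumed to be configured already):

```bash
# Honor Retry-After on 429; exponential backoff on 5xx; give up on anything else.
url=https://api.github.com/repos/acme/widgets/issues
delay=1
for attempt in 1 2 3 4 5; do
  status=$(curl -s -D /tmp/headers.txt -o /dev/null -w '%{http_code}' "$url")
  case "$status" in
    2??) echo "succeeded on attempt $attempt"; break ;;
    429) retry_after=$(grep -i '^retry-after:' /tmp/headers.txt | awk '{print $2}' | tr -d '\r')
         sleep "${retry_after:-$delay}" ;;
    5??) sleep "$delay"; delay=$((delay * 2)) ;;
    *)   echo "non-retryable status $status"; break ;;
  esac
done
```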
Cold start time
| Component | Cold start | Notes |
|---|---|---|
| Gateway (Go) | ~50-200ms | Binary startup + SQLite open + credential vault decrypt + CA cert load |
| Control plane (Next.js) | ~2-5s (dev), ~500ms-1s (production) | Next.js server startup + DB connection |
| First request | +10-50ms | Connector registry initialization, first TLS cert generation |
For CI/CD pipelines, add a health check loop before running agents:
```bash
# Wait for gateway to be ready
until curl -sf http://localhost:9443/health > /dev/null; do sleep 0.5; done
```

Load balancer compatibility
TameFlare's gateway is a forward proxy (not a reverse proxy). It is not designed to sit behind a load balancer for agent traffic.
| Topology | Supported? |
|---|---|
| Single gateway, multiple agents | Yes — each agent gets a dedicated port |
| Multiple gateways, shared DB | No — SQLite does not support concurrent writers. Use one gateway per host. |
| Reverse proxy in front of control plane | Yes — use nginx/Caddy for TLS termination to the Next.js app |
| Reverse proxy in front of gateway | Not recommended — agents connect directly to the gateway via HTTP_PROXY |
For high availability, run the control plane behind a reverse proxy with Turso (cloud SQLite) for shared state. The gateway should run on the same host as the agents it governs.
Next steps
- Backup & Restore — database backup procedures
- Security — full security model and operational checklist
- Troubleshooting — common issues and solutions