Spooky

This page is the operator quick-reference for common high-signal failure modes and maintenance actions.

Before You Touch Production

keep a tested rollback path
know whether the change requires cert reload or drain-and-restart
know where metrics, logs, and control API endpoints are exposed
keep a traffic-reduction plan ready before making invasive changes

Scenario: Rising 503 Rate

Check:

spooky_overload_shed_by_reason_total
route latency metrics
active connections and inflight pressure
backend health state
recent config or backend changes

Likely causes:

global or scoped inflight limits reached
route queue cap exceeded
upstream or backend overload
backend timeout surge

Immediate actions:

Determine whether the 503s are overload-generated or upstream-generated.
Reduce traffic or shed non-critical traffic first.
Verify backend health and recent latency changes.
Roll back the most recent risky config change if the spike correlates with change timing.
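
One way to approach the first action is to total the shed counter by reason and compare it with the number of 503s observed in the same window. The following is a minimal sketch, not Spooky tooling: it assumes the metrics endpoint serves Prometheus text format and that spooky_overload_shed_by_reason_total carries a label named reason. Both are assumptions, and the reason values in any sample input are illustrative.

```python
import re
from collections import defaultdict

# Matches sample lines such as: metric_name{reason="x",route="y"} 12
_SAMPLE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)(?:\{(?P<labels>[^}]*)\})?\s+(?P<value>\S+)'
)
_LABEL = re.compile(r'(\w+)="([^"]*)"')


def shed_by_reason(metrics_text, metric="spooky_overload_shed_by_reason_total"):
    """Sum shed counter samples per `reason` label (label name is an assumption)."""
    totals = defaultdict(float)
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comments
        m = _SAMPLE.match(line)
        if not m or m.group("name") != metric:
            continue
        labels = dict(_LABEL.findall(m.group("labels") or ""))
        totals[labels.get("reason", "unknown")] += float(m.group("value"))
    return dict(totals)
```

If the shed totals account for most of the 503s in the window, the proxy is generating them itself; if they stay flat while 503s climb, look upstream.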

Scenario: Handshake Failures Or Client Connection Failures

Check:

downstream TLS metrics
ALPN selection metrics
certificate expiry/selection metrics
listener cert/key presence and permissions

Likely causes:

invalid or expired certificate material
wrong SNI mapping
missing client certificate in required-client-cert mode
client-side protocol mismatch

Immediate actions:

Verify the listener is presenting the expected certificate.
Verify whether failures are concentrated on one hostname or all hostnames.
If only certificate material changed, use the certificate reload path when appropriate.
If listener routing or policy changed, prefer drain-and-restart with rollback readiness.

Scenario: Backend Timeout Surge

Check:

route latency percentiles
backend timeout counters
backend health transitions
per-upstream and per-backend inflight pressure

Likely causes:

unhealthy backend pool
sudden backend latency regression
connection establishment failures
under-sized backend fleet

Immediate actions:

Confirm whether the issue is localized to one upstream or all traffic.
Remove or isolate failing backends if health signals are clear.
Reduce concurrency pressure if the proxy is amplifying backend collapse.
Roll back recent backend or network changes first, not just proxy config.
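
To decide whether the surge is localized, it can help to compare two scrapes of the per-upstream timeout counters rather than their absolute values. A minimal sketch, assuming the counters have already been read into plain dictionaries keyed by upstream name; the function names and the 0.8 dominance threshold are illustrative choices, not Spooky defaults.

```python
def timeout_deltas(before, after):
    """Per-upstream increase between two counter snapshots.

    A value lower than the previous one is treated as a counter reset
    (for example after a restart), so the new value is used as the delta.
    """
    deltas = {}
    for upstream, value in after.items():
        prev = before.get(upstream, 0)
        deltas[upstream] = value - prev if value >= prev else value
    return deltas


def classify(deltas, dominance=0.8):
    """Return ("localized", name), ("widespread", None) or ("none", None)."""
    total = sum(deltas.values())
    if total == 0:
        return ("none", None)
    worst = max(deltas, key=deltas.get)
    if deltas[worst] / total >= dominance:
        return ("localized", worst)
    return ("widespread", None)
```

A localized result points at one backend pool; a widespread result makes a shared dependency, the network, or the proxy itself more likely.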

Scenario: Control API Or Metrics Endpoint Unavailable

Check:

bind address and port config
local firewall rules
listener startup logs
whether endpoints are configured as required or optional

Immediate actions:

Confirm whether the process is healthy but only the admin plane is down.
If the admin endpoints are configured with required: true, treat a startup failure as intentional protection, not as a defect to work around.
If the admin endpoints are configured with required: false, decide whether to fail closed operationally and restart into a safer config.
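
A quick reachability probe can show whether only the admin plane is down. This sketch assumes the control API and metrics endpoints listen on TCP, which this page does not state; the endpoint names, hosts, and ports passed in are placeholders to replace with the configured bind addresses.

```python
import socket


def tcp_reachable(host, port, timeout=2.0):
    """True if a TCP connection to host:port can be opened within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def probe_all(endpoints, timeout=2.0):
    """Probe a mapping of name -> (host, port) and return name -> reachable."""
    return {
        name: tcp_reachable(host, port, timeout)
        for name, (host, port) in endpoints.items()
    }
```

A refused connection on the admin port while traffic is still being served suggests the admin listener failed to bind; a timeout is more consistent with a firewall rule.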

Scenario: Cert Rotation

Safe approach:

Place new cert and key material with correct permissions.
Validate hostname coverage and expiry before activation.
Use certificate reload for listener cert replacement.
Verify new handshakes present the new certificate.
Keep previous material until verification is complete.
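
The hostname-coverage and expiry checks can be scripted before activation. The sketch below assumes the SAN entries and the notAfter timestamp have already been extracted from the new certificate (for example with openssl x509); the function names and the 14-day minimum are illustrative, not part of Spooky.

```python
import ssl


def san_covers(san, hostname):
    """True if one SAN DNS entry covers hostname.

    A leading "*." matches exactly one left-most label.
    """
    san = san.lower().rstrip(".")
    hostname = hostname.lower().rstrip(".")
    if san.startswith("*."):
        head, _, tail = hostname.partition(".")
        return bool(head) and tail == san[2:]
    return san == hostname


def precheck(sans, not_after, required_hosts, now_epoch, min_days_left=14):
    """Check coverage and remaining lifetime.

    not_after uses the certificate time format, e.g. "Jan  5 09:34:43 2018 GMT".
    """
    missing = [h for h in required_hosts
               if not any(san_covers(s, h) for s in sans)]
    days_left = (ssl.cert_time_to_seconds(not_after) - now_epoch) / 86400
    return {
        "missing_hosts": missing,
        "days_left": days_left,
        "ok": not missing and days_left >= min_days_left,
    }
```

Run the check against every hostname the listener is expected to serve, and only proceed to the certificate reload when ok is true.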

Scenario: Route Or Upstream Change

Current operational model:

certificate-only changes can use cert reload
route, upstream, timeout, and policy changes should be treated as drain-and-restart changes

Recommended sequence:

Validate config offline.
Stage on a canary node or bounded traffic slice.
Drain and restart one instance at a time.
Watch error rate, route latency, health transitions, and shed counters.
Expand only after the canary stays stable.
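
The sequence above can be sketched as a loop that treats the first instance as the canary and halts at the first unstable result. The restart and is_stable callables are hypothetical hooks supplied by the operator's own tooling; nothing here calls a real Spooky API.

```python
def rolling_restart(instances, restart, is_stable):
    """Drain and restart instances one at a time; the first one is the canary.

    restart(name)   -- hypothetical hook: drain, then restart with the new config
    is_stable(name) -- hypothetical hook: check error rate, route latency,
                       health transitions, and shed counters after the restart
    Stops at the first instance that does not stabilize.
    """
    completed = []
    for name in instances:
        restart(name)
        if not is_stable(name):
            return {"completed": completed, "halted_at": name}
        completed.append(name)
    return {"completed": completed, "halted_at": None}
```

Because the loop stops on the first failure, at most one instance is running an unverified config at any time, which keeps the rollback scope small.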

Scenario: Brownout Or Overload Triggering

Check:

overload shed counters by reason
brownout state transitions
active connections
inflight metrics versus configured caps

Actions:

Confirm whether the system is protecting itself correctly rather than failing unexpectedly.
Preserve core traffic first.
Reduce demand or increase backend capacity before simply widening limits.
Avoid increasing caps blindly without memory and latency validation.

Scenario: Draining For Deploy Or Maintenance

Stop sending new traffic to the instance.
Trigger the drain-aware restart workflow.
Watch for completion before hard termination whenever possible.
Use the configured forced-drain timeout only as a safety boundary, not as the primary shutdown path.
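
A wrapper script can make the difference between a clean drain and a forced one explicit. This is a minimal sketch: get_inflight is a hypothetical callable that returns the instance's current inflight count (for example, read from its metrics), and forced_timeout_s stands in for the configured forced-drain timeout.

```python
import time


def wait_for_drain(get_inflight, forced_timeout_s, poll_s=1.0,
                   clock=time.monotonic, sleep=time.sleep):
    """Poll until inflight work reaches zero or the forced timeout elapses.

    Returns "drained" on clean completion and "forced" if the deadline hit first.
    clock and sleep are injectable so the loop can be tested without waiting.
    """
    deadline = clock() + forced_timeout_s
    while True:
        if get_inflight() == 0:
            return "drained"
        if clock() >= deadline:
            return "forced"
        sleep(poll_s)
```

Logging or alerting on the "forced" result makes it visible when the safety boundary is being used as the normal shutdown path.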

After Any Incident

record what metric or symptom first signaled the issue
record whether the proxy was the root cause or only surfaced a backend failure
record what config or dependency changed
add or tighten alerts and runbook steps for the same class of issue
