RustChat Operations Runbook

This runbook provides procedures for common operational tasks and incident response.

Deployment
Monitoring
Incident Response
Maintenance

Deployment

First-Time Deployment

bash

# 1. Generate secrets
export RUSTCHAT_JWT_SECRET="$(openssl rand -base64 48)"
export RUSTCHAT_ENCRYPTION_KEY="$(openssl rand -base64 48)"

# 2. Create environment file
cat > .env << EOF
RUSTCHAT_ENVIRONMENT=production
RUSTCHAT_JWT_SECRET=${RUSTCHAT_JWT_SECRET}
RUSTCHAT_ENCRYPTION_KEY=${RUSTCHAT_ENCRYPTION_KEY}
RUSTCHAT_DATABASE_URL=postgres://user:pass@db:5432/rustchat
RUSTCHAT_REDIS_URL=redis://redis:6379
RUSTCHAT_SECURITY_OAUTH_TOKEN_DELIVERY=cookie
EOF

# 3. Verify configuration
./rustchat --check-config

# 4. Run database migrations
./rustchat --migrate-only

# 5. Start service
./rustchat

Rolling Update (Zero-Downtime)

bash

# 1. Verify health of existing nodes
for node in node1 node2 node3; do
  curl -sf http://${node}:3000/api/v1/health/ready || exit 1
done

# 2. Update nodes one at a time
for node in node1 node2 node3; do
  echo "Updating ${node}..."
  
  # Drain connections (if using a load balancer)
  curl -X POST http://lb/admin/drain/${node}
  
  # Wait for connections to drain
  sleep 30
  
  # Deploy new version
  ssh ${node} "systemctl restart rustchat"
  
  # Wait for health check
  until curl -sf http://${node}:3000/api/v1/health/ready; do
    sleep 5
  done
  
  # Re-enable in load balancer
  curl -X POST http://lb/admin/enable/${node}
done

Monitoring

Key Metrics

Metric	Query	Alert Threshold
Error Rate	`rate(rustchat_http_requests_total{status=~"5.."}[5m])`	> 1%
Latency p99	`histogram_quantile(0.99, rate(rustchat_http_request_duration_seconds_bucket[5m]))`	> 500ms
DB Pool Saturation	`rustchat_db_pool_saturation_ratio`	> 0.8
WS Connections	`rustchat_websocket_active_connections`	Per-node limit
Circuit Breaker	`rustchat_circuit_breaker_state_changes_total`	Any state change

Health Checks

bash

# Liveness (Kubernetes)
curl http://localhost:3000/api/v1/health/live
# Expected: {"status":"ok","version":"0.3.1","uptime_seconds":3600}

# Readiness (Kubernetes)
curl http://localhost:3000/api/v1/health/ready
# Expected: 200 OK with all checks passing

# Metrics (Prometheus)
curl http://localhost:3000/api/v1/health/metrics
# Expected: Prometheus exposition format

# Stats (JSON)
curl http://localhost:3000/api/v1/health/stats
# Expected: {"websocket_connections":150,"active_users":42,"db_pool_saturation":0.3}

Log Queries

bash

# Error spikes
kubectl logs -l app=rustchat | jq 'select(.level=="ERROR")'

# Slow queries
kubectl logs -l app=rustchat | jq 'select(.fields.latency_ms > 1000)'

# Authentication failures
kubectl logs -l app=rustchat | jq 'select(.message | contains("Unauthorized"))'

# WebSocket events
kubectl logs -l app=rustchat | jq 'select(.span.name=="ws_handler")'

Incident Response

P1: Service Down

Symptoms: Health checks failing, 0% availability

bash

# 1. Check if process is running
systemctl status rustchat

# 2. Check logs for startup errors
journalctl -u rustchat -n 100 --no-pager

# 3. Verify database connectivity
psql $RUSTCHAT_DATABASE_URL -c "SELECT 1"

# 4. Verify Redis connectivity
redis-cli -u $RUSTCHAT_REDIS_URL PING

# 5. Check port binding
ss -tlnp | grep 3000

# 6. Emergency restart
systemctl restart rustchat

P2: High Error Rate

Symptoms: > 5% 5xx errors

bash

# 1. Identify error types
kubectl logs -l app=rustchat | jq -r '.message' | sort | uniq -c | sort -rn | head -20

# 2. Check database performance
# Look for slow queries in PostgreSQL logs

# 3. Check Redis latency
redis-cli --latency -h redis

# 4. Check circuit breaker states
curl http://localhost:3000/api/v1/health/stats | jq '.circuit_breakers'

# 5. Scale up if needed
kubectl scale deployment rustchat --replicas=6

P3: Database Connection Exhaustion

Symptoms: "pool timed out" errors

bash

# 1. Check current connections
psql -c "SELECT count(*) FROM pg_stat_activity WHERE application_name='rustchat';"

# 2. Check for idle connections
psql -c "SELECT pid, state, query_start, query FROM pg_stat_activity WHERE state='idle in transaction';"

# 3. Kill idle transactions (carefully!)
psql -c "SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE state='idle in transaction' AND age(now(), query_start) > interval '5 minutes';"

# 4. Temporary fix: Increase pool size
export RUSTCHAT_DB_POOL__MAX_CONNECTIONS=100
systemctl restart rustchat

# 5. Permanent fix: Check for connection leaks in code

P4: WebSocket Connection Issues

Symptoms: Users reporting real-time updates not working

bash

# 1. Check WebSocket connection count
curl http://localhost:3000/api/v1/health/stats | jq '.websocket_connections'

# 2. Check for connection limit violations
kubectl logs -l app=rustchat | grep "Too many connections"

# 3. Check Redis pub/sub
redis-cli PUBSUB CHANNELS | grep rustchat

# 4. Check cluster heartbeat
kubectl logs -l app=rustchat | grep "cluster heartbeat"

# 5. Test WebSocket directly
websocat ws://localhost:3000/api/v1/ws -H "Authorization: Bearer TOKEN"

P5: OAuth Authentication Failure

Symptoms: Users can't log in via OAuth

bash

# 1. Check circuit breaker state for OIDC
kubectl logs -l app=rustchat | grep "circuit.*oidc"

# 2. Test OIDC discovery
curl https://accounts.google.com/.well-known/openid-configuration | jq '.authorization_endpoint'

# 3. Check OAuth callback URL configuration
# Verify RUSTCHAT_SITE_URL matches registered redirect URI

# 4. Check for state parameter mismatches
kubectl logs -l app=rustchat | grep "OAuth state"

# 5. Reset circuit breaker (emergency)
# Restart affected node or wait for auto-recovery

P6: Rate Limiting Too Aggressive

Symptoms: Legitimate users getting 429 errors

bash

# 1. Check rate limit metrics
curl http://localhost:3000/api/v1/health/metrics | grep rate_limit

# 2. Temporarily increase limits
export RUSTCHAT_SECURITY_RATE_LIMIT_AUTH_PER_MINUTE=20
export RUSTCHAT_SECURITY_RATE_LIMIT_WS_PER_MINUTE=60
systemctl restart rustchat

# 3. Identify source of high traffic
kubectl logs -l app=rustchat | grep "Rate limit exceeded" | jq '.ip' | sort | uniq -c | sort -rn | head -10

Maintenance

Secret Rotation

bash

# 1. Generate new secret
NEW_JWT_SECRET="$(openssl rand -base64 48)"

# 2. Deploy with new secret (rolling update)
# Users will be logged out as tokens become invalid

# 3. Monitor for auth errors
watch 'kubectl logs -l app=rustchat | grep -c "Invalid token"'

# 4. Users will naturally re-authenticate

Database Maintenance

bash

# 1. Check table bloat
psql -c "SELECT schemaname, tablename, n_tup_ins, n_tup_upd, n_tup_del FROM pg_stat_user_tables ORDER BY n_tup_del DESC;"

# 2. Vacuum analyze
psql -c "VACUUM ANALYZE;"

# 3. Check index usage
psql -c "SELECT schemaname, tablename, indexname, idx_scan FROM pg_stat_user_indexes ORDER BY idx_scan DESC;"

# 4. Reindex if needed
psql -c "REINDEX INDEX CONCURRENTLY idx_name;"

Redis Maintenance

bash

# 1. Check memory usage
redis-cli INFO memory

# 2. Check key count
redis-cli DBSIZE

# 3. Find large keys
redis-cli --bigkeys

# 4. Expire old keys manually (if needed)
redis-cli EVAL "return redis.call('del', unpack(redis.call('keys', 'rustchat:oauth:code:*')))" 0

Backup Procedures

bash

# Database backup
pg_dump $RUSTCHAT_DATABASE_URL | gzip > rustchat_backup_$(date +%Y%m%d).sql.gz

# Redis backup (if using persistence)
redis-cli BGSAVE

# Configuration backup
cp .env .env.backup_$(date +%Y%m%d)

Performance Tuning

Database Optimization

sql

-- Add index for common queries if missing
CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_posts_channel_created 
ON posts(channel_id, created_at DESC);

-- Check for slow queries
SELECT query, mean_exec_time, calls 
FROM pg_stat_statements 
ORDER BY mean_exec_time DESC 
LIMIT 10;

Connection Pool Tuning

bash

# Formula: max_connections = (CPU cores * 2) + effective_spindle_count
# For cloud: max_connections = vCPU * 4

# Example for 4 vCPU instance
export RUSTCHAT_DB_POOL__MAX_CONNECTIONS=16
export RUSTCHAT_DB_POOL__MIN_CONNECTIONS=4

WebSocket Optimization

bash

# Increase per-node connection limit (if memory allows)
# Update in database:
UPDATE server_config 
SET site = jsonb_set(site, '{max_simultaneous_connections}', '10')
WHERE id = 'default';

# Scale horizontally instead of vertically
kubectl scale deployment rustchat --replicas=10

Security Incident Response

Suspected Breach

Immediate Actions:

bash

# Rotate all secrets immediately
kubectl create secret generic rustchat-secrets --from-literal=jwt=$(openssl rand -base64 48) --dry-run -o yaml | kubectl apply -f -
kubectl rollout restart deployment/rustchat

Audit:

bash

# Check auth logs
kubectl logs -l app=rustchat --since=24h | jq 'select(.message | contains("auth"))'

# Check for unusual access patterns
kubectl logs -l app=rustchat | jq 'select(.fields.status | tonumber > 400)'

Notify:
- Security team
- Users (if data affected)

Contact & Escalation

Severity	Response Time	Contact
P1 (Service Down)	15 minutes	On-call engineer
P2 (Degraded)	1 hour	Engineering team
P3 (Minor)	4 hours	Engineering team
Security	Immediate	Security team

RustChat Operations Runbook ​

Table of Contents ​

Deployment ​

First-Time Deployment ​

Rolling Update (Zero-Downtime) ​

Monitoring ​

Key Metrics ​

Health Checks ​

Log Queries ​

Incident Response ​

P1: Service Down ​

P2: High Error Rate ​

P3: Database Connection Exhaustion ​

P4: WebSocket Connection Issues ​

P5: OAuth Authentication Failure ​

P6: Rate Limiting Too Aggressive ​

Maintenance ​

Secret Rotation ​

Database Maintenance ​

Redis Maintenance ​

Backup Procedures ​

Performance Tuning ​

Database Optimization ​

Connection Pool Tuning ​

WebSocket Optimization ​

Security Incident Response ​

Suspected Breach ​

Contact & Escalation ​

RustChat Operations Runbook

Table of Contents

Deployment

First-Time Deployment

Rolling Update (Zero-Downtime)

Monitoring

Key Metrics

Health Checks

Log Queries

Incident Response

P1: Service Down

P2: High Error Rate

P3: Database Connection Exhaustion

P4: WebSocket Connection Issues

P5: OAuth Authentication Failure

P6: Rate Limiting Too Aggressive

Maintenance

Secret Rotation

Database Maintenance

Redis Maintenance

Backup Procedures

Performance Tuning

Database Optimization

Connection Pool Tuning

WebSocket Optimization

Security Incident Response

Suspected Breach

Contact & Escalation