A client's API was crashing under load. 2,000 requests per second? Down. 5,000? Down. They needed to handle 10,000+ requests per second with 99.99% uptime.
"We're losing customers every time we go down," the CTO said. "This needs to be bulletproof."
Six months later, their API handles 15,000 requests per second with 99.99% uptime. Here's how we built it.
The Architecture
We built a multi-layered defense system:
- Rate Limiting: Prevent abuse and overload
- Circuit Breakers: Fail fast when dependencies are down
- Graceful Degradation: Serve partial responses when possible
- Load Balancing: Distribute traffic across instances
- Monitoring: Real-time alerts and dashboards
Layer 1: Rate Limiting
We implemented three types of rate limiting:
1. Per-User Rate Limiting
Using a fixed-window counter in Redis:
import { Redis } from 'ioredis';

// Defaults to localhost:6379; point this at your Redis deployment in production.
const redis = new Redis();

async function checkRateLimit(userId: string, limit: number, windowSeconds: number) {
  const key = `rate_limit:${userId}`;

  // Count this request; the first hit in a window creates the key and sets its TTL.
  const current = await redis.incr(key);
  if (current === 1) {
    await redis.expire(key, windowSeconds);
  }

  if (current > limit) {
    return { allowed: false, remaining: 0 };
  }
  return { allowed: true, remaining: limit - current };
}

// Usage (inside a request handler)
const { allowed, remaining } = await checkRateLimit(userId, 100, 60); // 100 req/min
if (!allowed) {
  return res.status(429).json({ error: 'Rate limit exceeded' });
}
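A fixed window is simple and cheap, but it lets a client burst up to twice the limit across a window boundary. If you need smoother limiting, a sliding-window log over a Redis sorted set is a common alternative. Here is a minimal sketch reusing the same ioredis client; the key prefix and the fail-closed fallback are illustrative assumptions, not part of the original system:

async function checkRateLimitSliding(userId: string, limit: number, windowSeconds: number) {
  const key = `rate_limit:sliding:${userId}`;
  const now = Date.now();
  const windowStart = now - windowSeconds * 1000;

  // Each request is a timestamped member of a sorted set, scored by arrival time.
  const results = await redis
    .multi()
    .zremrangebyscore(key, 0, windowStart)      // drop entries that have left the window
    .zadd(key, now, `${now}-${Math.random()}`)  // record this request
    .zcard(key)                                 // count requests still inside the window
    .expire(key, windowSeconds)                 // clean up idle keys
    .exec();

  // exec() returns [error, result] pairs; index 2 is the zcard count. Fail closed if the transaction failed.
  const current = results ? Number(results[2][1]) : limit + 1;
  return current > limit
    ? { allowed: false, remaining: 0 }
    : { allowed: true, remaining: limit - current };
}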
2. Global Rate Limiting
Protect against DDoS and traffic spikes:
# At the load balancer (Nginx). The zone directive belongs in the http {} block.
limit_req_zone $binary_remote_addr zone=api_limit:10m rate=100r/s;

server {
    location /api/ {
        # Allow bursts of up to 200 requests above the rate without delaying them
        limit_req zone=api_limit burst=200 nodelay;
        proxy_pass http://backend;
    }
}
3. Tiered Rate Limits
Different limits for different user tiers, wired into the limiter as sketched after this list:
- Free tier: 100 requests/minute
- Pro tier: 1,000 requests/minute
- Enterprise: 10,000 requests/minute
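Applying the tiers is just a lookup before the per-user check. A minimal sketch; the TIER_LIMITS map and the caller's `user.tier` field are illustrative assumptions:

// Requests per minute by plan (values from the tiers above)
const TIER_LIMITS: Record<string, number> = {
  free: 100,
  pro: 1_000,
  enterprise: 10_000,
};

async function checkTieredRateLimit(userId: string, tier: string) {
  const limit = TIER_LIMITS[tier] ?? TIER_LIMITS.free; // unknown tiers fall back to the free limit
  return checkRateLimit(userId, limit, 60);            // 60-second window
}

// Usage: const { allowed } = await checkTieredRateLimit(user.id, user.tier);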
Layer 2: Circuit Breakers
When external dependencies fail, fail fast—don't cascade:
class CircuitBreaker {
  private failures = 0;
  private state: 'closed' | 'open' | 'half-open' = 'closed';
  private nextAttempt = Date.now();

  constructor(
    private threshold: number = 5,   // consecutive failures before opening
    private timeout: number = 60000  // how long to stay open, in ms
  ) {}

  async execute<T>(fn: () => Promise<T>): Promise<T> {
    if (this.state === 'open') {
      if (Date.now() < this.nextAttempt) {
        throw new Error('Circuit breaker is OPEN');
      }
      // Timeout elapsed: let one trial request through
      this.state = 'half-open';
    }

    try {
      const result = await fn();
      this.onSuccess();
      return result;
    } catch (error) {
      this.onFailure();
      throw error;
    }
  }

  private onSuccess() {
    this.failures = 0;
    this.state = 'closed';
  }

  private onFailure() {
    this.failures++;
    if (this.failures >= this.threshold) {
      this.state = 'open';
      this.nextAttempt = Date.now() + this.timeout;
    }
  }
}

// Usage
const dbBreaker = new CircuitBreaker(5, 60000);

try {
  const data = await dbBreaker.execute(() => database.query(sql));
} catch (error) {
  // Return cached data or a default response instead of cascading the failure
  return getCachedData();
}
Layer 3: Graceful Degradation
When services are down, serve what you can:
async function getProductData(productId: string) {
  try {
    // Try the primary data sources
    const product = await db.query('SELECT * FROM products WHERE id = ?', [productId]);
    const reviews = await reviewService.getReviews(productId);
    const recommendations = await aiService.getRecommendations(productId);

    return {
      product,
      reviews,
      recommendations,
      source: 'full'
    };
  } catch (error) {
    // Fallback: serve from cache
    const cached = await cache.get(`product:${productId}`);
    if (cached) {
      return {
        ...cached,
        source: 'cache',
        note: 'Some features temporarily unavailable'
      };
    }

    // Last resort: minimal response
    return {
      product: { id: productId, name: 'Product unavailable' },
      source: 'minimal'
    };
  }
}
Layer 4: Load Balancing & Auto-Scaling
We use Kubernetes with horizontal pod autoscaling:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  minReplicas: 3
  maxReplicas: 50
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
When average CPU utilization across the pods exceeds 70%, Kubernetes adds replicas; when traffic drops, it scales back down, staying within the 3-to-50 replica bounds.
Layer 5: Monitoring & Alerting
We track the following signals, instrumented as sketched at the end of this layer:
- Request rate: Requests per second
- Error rate: 4xx and 5xx responses
- Latency: P50, P95, P99 response times
- Circuit breaker state: Open/closed status
- Rate limit hits: How many requests are throttled
- Dependency health: Database, cache, external APIs
Alert thresholds:
- Error rate > 1%: Warning alert
- Error rate > 5%: Critical alert
- P99 latency > 1s: Warning alert
- P99 latency > 3s: Critical alert
- Circuit breaker opens: Immediate alert
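The service-side metrics above can be exported with a Prometheus client library. Here is a minimal sketch using prom-client with an Express-style middleware; the metric names, labels, and bucket boundaries are illustrative choices, not the production configuration:

import client from 'prom-client';
import type { Request, Response, NextFunction } from 'express';

// Default process metrics (CPU, memory, event loop lag)
client.collectDefaultMetrics();

// Request rate and error rate both fall out of a single labelled counter
const httpRequests = new client.Counter({
  name: 'http_requests_total',
  help: 'Total HTTP requests',
  labelNames: ['method', 'route', 'status'],
});

// Latency histogram: P50/P95/P99 are computed from these buckets in Prometheus
const httpDuration = new client.Histogram({
  name: 'http_request_duration_seconds',
  help: 'HTTP request duration in seconds',
  labelNames: ['method', 'route'],
  buckets: [0.05, 0.1, 0.25, 0.5, 1, 3, 5],
});

// Express-style middleware that records both metrics per request.
// In production, normalize req.path to the route template to bound label cardinality.
function metricsMiddleware(req: Request, res: Response, next: NextFunction) {
  const end = httpDuration.startTimer({ method: req.method, route: req.path });
  res.on('finish', () => {
    end();
    httpRequests.inc({ method: req.method, route: req.path, status: String(res.statusCode) });
  });
  next();
}

// Expose /metrics for Prometheus to scrape:
// app.get('/metrics', async (_req, res) => res.type('text/plain').send(await client.register.metrics()));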
Real-World Example: Handling a Traffic Spike
A client got featured on Product Hunt. Traffic spiked from 500 req/s to 8,000 req/s in 10 minutes.
Here's what happened:
- Rate limiting: Throttled abusive requests (saved 30% capacity)
- Auto-scaling: Kubernetes scaled from 3 to 25 pods (handled the load)
- Circuit breakers: Protected against database overload (prevented cascade failure)
- Graceful degradation: Served cached data when database was slow (maintained UX)
Result: API stayed up, 99.99% uptime maintained, zero data loss.
Common Mistakes to Avoid
❌ Don't Do This:
- Rate limit without proper error messages (users get confused)
- Set circuit breaker threshold too low (opens on normal failures)
- Forget to implement graceful degradation (everything breaks)
- Monitor only error rates (miss latency issues)
- Scale manually (too slow for traffic spikes)
The Results
- Peak throughput: 15,000 requests/second
- Uptime: 99.99% (roughly 4 minutes of downtime per month)
- P99 latency: 250ms (under 1s target)
- Error rate: 0.05% (under 0.1% target)
- Rate limit effectiveness: Blocked roughly 15% of incoming requests as abusive
Implementation Checklist
For your mission-critical API:
- Implement per-user rate limiting (Redis)
- Add global rate limiting (load balancer)
- Set up circuit breakers for all external dependencies
- Implement graceful degradation (cached fallbacks)
- Configure auto-scaling (Kubernetes HPA or similar)
- Set up monitoring (Prometheus + Grafana)
- Configure alerting (PagerDuty or similar)
- Load test your API (find breaking points)
- Document runbooks (what to do when alerts fire)
- Run chaos engineering tests (simulate failures)
The Bottom Line
99.99% uptime isn't achieved by hoping nothing breaks. It's achieved by building layers of defense: rate limiting, circuit breakers, graceful degradation, and monitoring.
At NetForceLabs, we don't build APIs that work when everything is perfect. We build APIs that work when everything is broken.