Database Migrations in Production: Zero-Downtime Strategies That Actually Work

Dec 20, 2024 • 12 min read • Engineering

Last year, a fintech client needed to migrate their PostgreSQL database—50 million rows, processing $50M in daily transactions. The catch? Zero downtime allowed. Not even a maintenance window.

"If we go down, we lose $2M in revenue per hour," the CTO told me.

We completed the migration in 4 hours with zero downtime. Here's the exact strategy we used—and how you can replicate it.

The Challenge

The migration involved:

  • Adding 3 new columns to a table with 20M rows
  • Changing column types (VARCHAR to JSONB)
  • Creating 5 new indexes
  • All while handling 1,000+ requests per second

Traditional approach? Lock the table, run migrations, hope nothing breaks. That's not acceptable for mission-critical systems.
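
To make the risk concrete, here is what the naive in-place type change would look like. This is a hypothetical sketch (legacy_metadata is an illustrative column name): the statement takes an ACCESS EXCLUSIVE lock and rewrites the whole table, blocking every read and write until it finishes.

-- Hypothetical: a direct in-place type change on a live 20M-row table.
-- The rewrite holds an ACCESS EXCLUSIVE lock for its full duration,
-- and the cast assumes the column already contains valid JSON.
ALTER TABLE transactions
ALTER COLUMN legacy_metadata TYPE JSONB USING legacy_metadata::jsonb;

That lock is exactly what the strategies below are designed to avoid.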

Strategy 1: Blue-Green Deployment

We ran two database environments in parallel:

Blue Environment (Current):

  • Production database (current schema)
  • Handling 100% of traffic

Green Environment (New):

  • Read replica with new schema
  • Receiving replication from Blue
  • Ready to become primary

Step-by-Step Process

Phase 1: Create Read Replica (30 minutes)

-- On Green (subscriber)
CREATE SUBSCRIPTION green_subscription
CONNECTION 'host=blue-db.example.com dbname=production'
PUBLICATION blue_publication;

-- On Blue (publisher): verify replication lag
SELECT * FROM pg_stat_replication;

We monitored replication lag until it was under 1 second.
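
For completeness: CREATE SUBSCRIPTION assumes the schema already exists on Green and that a matching publication exists on Blue. A minimal sketch of the publisher side (object names are illustrative; the replay_lag column requires PostgreSQL 10+):

-- On Blue (publisher): publish the tables Green should receive
CREATE PUBLICATION blue_publication FOR ALL TABLES;

-- Green's schema must be created first, e.g. restored from:
--   pg_dump --schema-only production

-- On Blue: replication lag per subscriber
SELECT application_name, state, replay_lag
FROM pg_stat_replication;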

Phase 2: Run Migrations on Green (2 hours)

With Green as a read replica, we could run migrations without affecting production:

-- Step 1: Add new columns (nullable first)
ALTER TABLE transactions 
ADD COLUMN metadata JSONB,
ADD COLUMN processed_at TIMESTAMP,
ADD COLUMN version INTEGER DEFAULT 1;

-- Step 2: Backfill data (in batches)
UPDATE transactions 
SET metadata = jsonb_build_object('legacy', true)
WHERE metadata IS NULL
AND id BETWEEN 1 AND 1000000;

-- Step 3: Create indexes concurrently (non-blocking)
CREATE INDEX CONCURRENTLY idx_transactions_metadata 
ON transactions USING GIN (metadata);

-- Step 4: Make columns NOT NULL (after backfill)
ALTER TABLE transactions 
ALTER COLUMN metadata SET NOT NULL;

Key insight: We used CONCURRENTLY for index creation so writes were never blocked while the indexes were being built.
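
The Step 2 UPDATE above only covers the first million ids. On the full table, the backfill has to be repeated per id range so that each batch commits on its own and row locks stay short. A rough sketch of that loop (PostgreSQL 11+ is needed for COMMIT inside a DO block; the batch size is illustrative):

-- Backfill in 1M-row batches; each iteration commits separately
DO $$
DECLARE
  batch_start BIGINT := 1;
  batch_size  BIGINT := 1000000;
  max_id      BIGINT;
BEGIN
  SELECT max(id) INTO max_id FROM transactions;
  WHILE batch_start <= max_id LOOP
    UPDATE transactions
    SET metadata = jsonb_build_object('legacy', true)
    WHERE metadata IS NULL
      AND id BETWEEN batch_start AND batch_start + batch_size - 1;
    COMMIT;  -- release locks before the next batch
    batch_start := batch_start + batch_size;
  END LOOP;
END $$;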

Phase 3: Switch Traffic Gradually (1 hour)

We used feature flags to route traffic gradually:

// Application code
const useGreenDB = featureFlag.isEnabled('use-green-db', userId);

const db = useGreenDB ? greenConnection : blueConnection;

// Start with 1% of users
// Monitor error rates
// Gradually increase to 100%

Traffic routing schedule:

  • 0-15 min: 1% traffic to Green
  • 15-30 min: 10% traffic to Green (if error rate < 0.1%)
  • 30-45 min: 50% traffic to Green
  • 45-60 min: 100% traffic to Green

Phase 4: Promote Green to Primary (30 minutes)

Once 100% of traffic was on Green, we promoted it:

-- Stop logical replication from Blue
ALTER SUBSCRIPTION green_subscription DISABLE;

-- Green is a logical subscriber, so it is already writable;
-- disabling the subscription is the actual cutover point.
-- (pg_promote() only applies when promoting a physical standby.)

-- Update DNS/connection strings
-- Blue becomes standby
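
How Blue actually becomes a standby is worth spelling out. One option, sketched below under the assumption that the same migrations are first applied to Blue so its schema matches Green, is to reverse the replication direction so Green's writes flow back to Blue and it remains a viable rollback target (hostnames and object names are illustrative):

-- On Green (new primary): publish changes back to Blue
CREATE PUBLICATION green_publication FOR ALL TABLES;

-- On Blue: subscribe without the initial table copy, since Blue
-- already holds the data (requires Blue's schema to match Green's)
CREATE SUBSCRIPTION blue_subscription
CONNECTION 'host=green-db.example.com dbname=production'
PUBLICATION green_publication
WITH (copy_data = false);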

Strategy 2: Backward-Compatible Migrations

Every migration was designed to be backward-compatible:

Migration Rules (a sketch of the pattern follows this list):

  1. Add, don't remove: New columns are nullable first
  2. Expand, don't contract: VARCHAR(50) → VARCHAR(255) is safe
  3. Deploy code first: Code supports both old and new schema
  4. Migrate data second: Backfill new columns
  5. Remove old code last: Only after migration complete
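
Applied to the VARCHAR → JSONB change, these rules translate into an expand-then-contract sequence rather than an in-place ALTER. A rough sketch, assuming the old column is called legacy_metadata (an illustrative name) and holds valid JSON:

-- Expand: add the new column alongside the old one, nullable first
ALTER TABLE transactions ADD COLUMN metadata JSONB;

-- Deploy application code that writes both columns and reads
-- metadata, falling back to legacy_metadata when it is NULL

-- Backfill the new column in batches (see Phase 2)
UPDATE transactions
SET metadata = legacy_metadata::jsonb
WHERE metadata IS NULL
  AND id BETWEEN 1 AND 1000000;

-- Contract: drop the old column only after no code reads it
ALTER TABLE transactions DROP COLUMN legacy_metadata;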

Strategy 3: Rollback Plan

We had a rollback plan ready:

  • Instant rollback: Feature flag to route back to Blue
  • Data rollback: Scripts to reverse schema changes (sketched below)
  • Monitoring: Alerts on error rates, latency, replication lag

Thankfully, we didn't need it—but having it ready was critical.
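
For a migration like this one, the data-rollback script is essentially the inverse of the Phase 2 DDL. A simplified sketch of what it would run on Green:

-- Reverse the schema changes if the migration has to be abandoned
DROP INDEX CONCURRENTLY IF EXISTS idx_transactions_metadata;

ALTER TABLE transactions
DROP COLUMN IF EXISTS metadata,
DROP COLUMN IF EXISTS processed_at,
DROP COLUMN IF EXISTS version;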

Monitoring & Metrics

We tracked:

  • Replication lag: Must stay under 1 second
  • Error rates: Per environment, per endpoint
  • Latency: P50, P95, P99 for both environments
  • Database locks: Monitor for blocking queries (see the sample query after this list)
  • Transaction volume: Ensure no transactions lost
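
For the lock monitoring in particular, a generic query along these lines (a common pattern, not necessarily the exact dashboard used here) lists sessions that are currently blocked and which PIDs are blocking them:

-- Sessions currently waiting on a lock, and the PIDs blocking them
SELECT pid,
       pg_blocking_pids(pid) AS blocked_by,
       state,
       wait_event_type,
       left(query, 80) AS query
FROM pg_stat_activity
WHERE cardinality(pg_blocking_pids(pid)) > 0;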

Common Pitfalls to Avoid

❌ Don't Do This:

  • Run migrations during peak traffic
  • Create indexes without CONCURRENTLY
  • Change column types directly (VARCHAR → JSONB)
  • Skip replication lag monitoring
  • Deploy code and migrations simultaneously

The Results

  • Downtime: 0 seconds
  • Data loss: 0 rows
  • Error rate increase: 0.02% (within normal variance)
  • Migration time: 4 hours
  • Rollback time: 2 minutes (if needed)

Implementation Checklist

For your next zero-downtime migration:

  1. Set up read replica (Green environment)
  2. Verify replication lag is minimal
  3. Design backward-compatible migrations
  4. Deploy application code that supports both schemas
  5. Run migrations on Green
  6. Backfill data in batches
  7. Create indexes with CONCURRENTLY
  8. Test Green environment thoroughly
  9. Route traffic gradually using feature flags
  10. Monitor metrics at each step
  11. Promote Green to primary
  12. Keep Blue as standby for 48 hours

The Bottom Line

Zero-downtime migrations aren't magic—they're engineering. With the right strategy, you can migrate databases with millions of rows without a single second of downtime.

At NetForceLabs, we don't take maintenance windows. We build systems that migrate themselves.