Database Migrations in Production: Zero-Downtime Strategies That Actually Work

Dec 20, 2024 • 12 min read • Engineering

Last year, a fintech client needed to migrate their PostgreSQL database—50 million rows, processing $50M in daily transactions. The catch? Zero downtime allowed. Not even a maintenance window.

"If we go down, we lose $2M in revenue per hour," the CTO told me.

We completed the migration in 4 hours with zero downtime. Here's the exact strategy we used—and how you can replicate it.

The Challenge

The migration involved:

  • Adding 3 new columns to a table with 20M rows
  • Changing column types (VARCHAR to JSONB)
  • Creating 5 new indexes
  • All while handling 1,000+ requests per second

Traditional approach? Lock the table, run migrations, hope nothing breaks. That's not acceptable for mission-critical systems.
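
To make the risk concrete, here is what the naive in-place type change would look like. This is a hypothetical sketch (legacy_metadata is an illustrative column name): the statement takes an ACCESS EXCLUSIVE lock and rewrites the whole table, blocking every read and write until it finishes.

-- Hypothetical: a direct in-place type change on a live 20M-row table.
-- The rewrite holds an ACCESS EXCLUSIVE lock for its full duration,
-- and the cast assumes the column already contains valid JSON.
ALTER TABLE transactions
ALTER COLUMN legacy_metadata TYPE JSONB USING legacy_metadata::jsonb;

That lock is exactly what the strategies below are designed to avoid.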

Strategy 1: Blue-Green Deployment

We ran two database environments in parallel:

Blue Environment (Current):

  • Production database (current schema)
  • Handling 100% of traffic

Green Environment (New):

  • Read replica with new schema
  • Receiving replication from Blue
  • Ready to become primary

Step-by-Step Process

Phase 1: Create Read Replica (30 minutes)

-- On Green (subscriber)
CREATE SUBSCRIPTION green_subscription
CONNECTION 'host=blue-db.example.com dbname=production'
PUBLICATION blue_publication;

-- On Blue (publisher): verify replication lag
SELECT * FROM pg_stat_replication;

We monitored replication lag until it was under 1 second.
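
For completeness: CREATE SUBSCRIPTION assumes the schema already exists on Green and that a matching publication exists on Blue. A minimal sketch of the publisher side (object names are illustrative; the replay_lag column requires PostgreSQL 10+):

-- On Blue (publisher): publish the tables Green should receive
CREATE PUBLICATION blue_publication FOR ALL TABLES;

-- Green's schema must be created first, e.g. restored from:
--   pg_dump --schema-only production

-- On Blue: replication lag per subscriber
SELECT application_name, state, replay_lag
FROM pg_stat_replication;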

Phase 2: Run Migrations on Green (2 hours)

With Green as a read replica, we could run migrations without affecting production:

-- Step 1: Add new columns (nullable first)
ALTER TABLE transactions 
ADD COLUMN metadata JSONB,
ADD COLUMN processed_at TIMESTAMP,
ADD COLUMN version INTEGER DEFAULT 1;

-- Step 2: Backfill data (in batches)
UPDATE transactions 
SET metadata = jsonb_build_object('legacy', true)
WHERE metadata IS NULL
AND id BETWEEN 1 AND 1000000;

-- Step 3: Create indexes concurrently (non-blocking)
CREATE INDEX CONCURRENTLY idx_transactions_metadata 
ON transactions USING GIN (metadata);

-- Step 4: Make columns NOT NULL (after backfill)
ALTER TABLE transactions 
ALTER COLUMN metadata SET NOT NULL;

Key insight: We used CONCURRENTLY for index creation so writes were never blocked while the indexes were being built.
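
The Step 2 UPDATE above only covers the first million ids. On the full table, the backfill has to be repeated per id range so that each batch commits on its own and row locks stay short. A rough sketch of that loop (PostgreSQL 11+ is needed for COMMIT inside a DO block; the batch size is illustrative):

-- Backfill in 1M-row batches; each iteration commits separately
DO $$
DECLARE
  batch_start BIGINT := 1;
  batch_size  BIGINT := 1000000;
  max_id      BIGINT;
BEGIN
  SELECT max(id) INTO max_id FROM transactions;
  WHILE batch_start <= max_id LOOP
    UPDATE transactions
    SET metadata = jsonb_build_object('legacy', true)
    WHERE metadata IS NULL
      AND id BETWEEN batch_start AND batch_start + batch_size - 1;
    COMMIT;  -- release locks before the next batch
    batch_start := batch_start + batch_size;
  END LOOP;
END $$;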

Phase 3: Switch Traffic Gradually (1 hour)

We used feature flags to route traffic gradually:

// Application code
const useGreenDB = featureFlag.isEnabled('use-green-db', userId);

const db = useGreenDB ? greenConnection : blueConnection;

// Start with 1% of users
// Monitor error rates
// Gradually increase to 100%

Traffic routing schedule:

  • 0-15 min: 1% traffic to Green
  • 15-30 min: 10% traffic to Green (if error rate < 0.1%)
  • 30-45 min: 50% traffic to Green
  • 45-60 min: 100% traffic to Green

Phase 4: Promote Green to Primary (30 minutes)

Once 100% of traffic was on Green, we promoted it:

-- Stop logical replication from Blue
ALTER SUBSCRIPTION green_subscription DISABLE;

-- Green is a logical subscriber, so it is already writable;
-- disabling the subscription is the actual cutover point.
-- (pg_promote() only applies when promoting a physical standby.)

-- Update DNS/connection strings
-- Blue becomes standby
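
How Blue actually becomes a standby is worth spelling out. One option, sketched below under the assumption that the same migrations are first applied to Blue so its schema matches Green, is to reverse the replication direction so Green's writes flow back to Blue and it remains a viable rollback target (hostnames and object names are illustrative):

-- On Green (new primary): publish changes back to Blue
CREATE PUBLICATION green_publication FOR ALL TABLES;

-- On Blue: subscribe without the initial table copy, since Blue
-- already holds the data (requires Blue's schema to match Green's)
CREATE SUBSCRIPTION blue_subscription
CONNECTION 'host=green-db.example.com dbname=production'
PUBLICATION green_publication
WITH (copy_data = false);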

Strategy 2: Backward-Compatible Migrations

Every migration was designed to be backward-compatible:

Migration Rules (a sketch of the pattern follows this list):

  1. Add, don't remove: New columns are nullable first
  2. Expand, don't contract: VARCHAR(50) → VARCHAR(255) is safe
  3. Deploy code first: Code supports both old and new schema
  4. Migrate data second: Backfill new columns
  5. Remove old code last: Only after migration complete
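
Applied to the VARCHAR → JSONB change, these rules translate into an expand-then-contract sequence rather than an in-place ALTER. A rough sketch, assuming the old column is called legacy_metadata (an illustrative name) and holds valid JSON:

-- Expand: add the new column alongside the old one, nullable first
ALTER TABLE transactions ADD COLUMN metadata JSONB;

-- Deploy application code that writes both columns and reads
-- metadata, falling back to legacy_metadata when it is NULL

-- Backfill the new column in batches (see Phase 2)
UPDATE transactions
SET metadata = legacy_metadata::jsonb
WHERE metadata IS NULL
  AND id BETWEEN 1 AND 1000000;

-- Contract: drop the old column only after no code reads it
ALTER TABLE transactions DROP COLUMN legacy_metadata;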

Strategy 3: Rollback Plan

We had a rollback plan ready:

  • Instant rollback: Feature flag to route back to Blue
  • Data rollback: Scripts to reverse schema changes (sketched below)
  • Monitoring: Alerts on error rates, latency, replication lag

Thankfully, we didn't need it—but having it ready was critical.
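
For a migration like this one, the data-rollback script is essentially the inverse of the Phase 2 DDL. A simplified sketch of what it would run on Green:

-- Reverse the schema changes if the migration has to be abandoned
DROP INDEX CONCURRENTLY IF EXISTS idx_transactions_metadata;

ALTER TABLE transactions
DROP COLUMN IF EXISTS metadata,
DROP COLUMN IF EXISTS processed_at,
DROP COLUMN IF EXISTS version;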

Monitoring & Metrics

We tracked:

  • Replication lag: Must stay under 1 second
  • Error rates: Per environment, per endpoint
  • Latency: P50, P95, P99 for both environments
  • Database locks: Monitor for blocking queries (see the sample query after this list)
  • Transaction volume: Ensure no transactions lost
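
For the lock monitoring in particular, a generic query along these lines (a common pattern, not necessarily the exact dashboard used here) lists sessions that are currently blocked and which PIDs are blocking them:

-- Sessions currently waiting on a lock, and the PIDs blocking them
SELECT pid,
       pg_blocking_pids(pid) AS blocked_by,
       state,
       wait_event_type,
       left(query, 80) AS query
FROM pg_stat_activity
WHERE cardinality(pg_blocking_pids(pid)) > 0;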

Common Pitfalls to Avoid

❌ Don't Do This:

  • Run migrations during peak traffic
  • Create indexes without CONCURRENTLY
  • Change column types directly (VARCHAR → JSONB)
  • Skip replication lag monitoring
  • Deploy code and migrations simultaneously

The Results

  • Downtime: 0 seconds
  • Data loss: 0 rows
  • Error rate increase: 0.02% (within normal variance)
  • Migration time: 4 hours
  • Rollback time: 2 minutes (if needed)

Implementation Checklist

For your next zero-downtime migration:

  1. Set up read replica (Green environment)
  2. Verify replication lag is minimal
  3. Design backward-compatible migrations
  4. Deploy application code that supports both schemas
  5. Run migrations on Green
  6. Backfill data in batches
  7. Create indexes with CONCURRENTLY
  8. Test Green environment thoroughly
  9. Route traffic gradually using feature flags
  10. Monitor metrics at each step
  11. Promote Green to primary
  12. Keep Blue as standby for 48 hours

The Bottom Line

Zero-downtime migrations aren't magic—they're engineering. With the right strategy, you can migrate databases with millions of rows without a single second of downtime.

At NetForceLabs, we don't take maintenance windows. We build systems that migrate themselves.