Last year, a fintech client needed to migrate their PostgreSQL database—50 million rows, processing $50M in daily transactions. The catch? Zero downtime allowed. Not even a maintenance window.
"If we go down, we lose $2M in revenue per hour," the CTO told me.
We completed the migration in 4 hours with zero downtime. Here's the exact strategy we used—and how you can replicate it.
The Challenge
The migration involved:
- Adding 3 new columns to a table with 20M rows
- Changing column types (VARCHAR to JSONB)
- Creating 5 new indexes
- All while handling 1,000+ requests per second
Traditional approach? Lock the table, run migrations, hope nothing breaks. That's not acceptable for mission-critical systems.
Strategy 1: Blue-Green Deployment
We ran two database environments in parallel:
Blue Environment (Current):
- Production database (current schema)
- Handling 100% of traffic
Green Environment (New):
- Read replica with new schema
- Receiving replication from Blue
- Ready to become primary
Step-by-Step Process
Phase 1: Create Read Replica (30 minutes)
-- On Blue (publisher): publish the tables first
CREATE PUBLICATION blue_publication FOR ALL TABLES;  -- or list specific tables
-- On Green (subscriber): the schema must already exist here, then subscribe
CREATE SUBSCRIPTION green_subscription
CONNECTION 'host=blue-db.example.com dbname=production'
PUBLICATION blue_publication;
-- Verify replication lag (run on Blue, the publisher)
SELECT * FROM pg_stat_replication;
We monitored replication lag until it was under 1 second.
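pg_stat_replication reports positions as LSNs rather than a time or byte figure. One way to turn that into a concrete number on the publisher (a sketch, assuming PostgreSQL 10 or later) is a query along these lines:
-- On Blue (publisher): approximate lag in bytes for each replication connection
SELECT application_name,
       pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS lag_bytes
FROM pg_stat_replication;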
Phase 2: Run Migrations on Green (2 hours)
Because Green was a logical-replication subscriber rather than a physical standby, we could run DDL on it without affecting production:
-- Step 1: Add new columns (nullable first)
ALTER TABLE transactions
ADD COLUMN metadata JSONB,
ADD COLUMN processed_at TIMESTAMP,
ADD COLUMN version INTEGER DEFAULT 1;
-- Step 2: Backfill data (in batches)
UPDATE transactions
SET metadata = jsonb_build_object('legacy', true)
WHERE metadata IS NULL
AND id BETWEEN 1 AND 1000000;
-- Step 3: Create indexes concurrently (non-blocking)
CREATE INDEX CONCURRENTLY idx_transactions_metadata
ON transactions USING GIN (metadata);
-- Step 4: Make columns NOT NULL (after backfill)
ALTER TABLE transactions
ALTER COLUMN metadata SET NOT NULL;
Key insight: CREATE INDEX CONCURRENTLY builds each index without holding a lock that blocks reads or writes. It takes longer, but the table stays fully available.
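The Step 2 snippet above shows only the first batch. Here is a sketch of what the full loop can look like (assuming PostgreSQL 11 or later, where COMMIT is allowed inside a top-level DO block; the 100,000-row batch size is illustrative):
-- Backfill in batches so each transaction stays short and replication keeps up
DO $$
DECLARE
  batch_start BIGINT := 1;
  batch_size  CONSTANT BIGINT := 100000;  -- illustrative batch size; tune to your write load
  max_id      BIGINT;
BEGIN
  SELECT max(id) INTO max_id FROM transactions;
  WHILE batch_start <= max_id LOOP
    UPDATE transactions
    SET metadata = jsonb_build_object('legacy', true)
    WHERE metadata IS NULL
      AND id BETWEEN batch_start AND batch_start + batch_size - 1;
    COMMIT;  -- end the transaction after each batch to release locks
    batch_start := batch_start + batch_size;
  END LOOP;
END;
$$;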
Phase 3: Switch Traffic Gradually (1 hour)
We used feature flags to route traffic gradually:
// Application code
const useGreenDB = featureFlag.isEnabled('use-green-db', userId);
const db = useGreenDB ? greenConnection : blueConnection;
// Start with 1% of users
// Monitor error rates
// Gradually increase to 100%
Traffic routing schedule:
- 0-15 min: 1% traffic to Green
- 15-30 min: 10% traffic to Green (if error rate < 0.1%)
- 30-45 min: 50% traffic to Green
- 45-60 min: 100% traffic to Green
Phase 4: Promote Green to Primary (30 minutes)
Once 100% of traffic was on Green, we promoted it:
-- Confirm replication has fully caught up, then stop it
ALTER SUBSCRIPTION green_subscription DISABLE;
-- Green is a logical-replication subscriber, so it is already writable;
-- no pg_promote() is needed (that applies to physical standbys)
-- Update DNS/connection strings so all writes go to Green
-- Blue becomes the standby (reverse replication sketched below)
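With logical replication, making Blue the standby just means reversing the roles. A sketch (the publication, subscription, and host names here are assumptions, mirroring the setup above):
-- Blue's schema must first be updated to match Green's, or replication will error
-- On Green (now the primary): publish changes
CREATE PUBLICATION green_publication FOR ALL TABLES;
-- On Blue: subscribe without re-copying data Blue already has
CREATE SUBSCRIPTION blue_subscription
CONNECTION 'host=green-db.example.com dbname=production'
PUBLICATION green_publication
WITH (copy_data = false);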
Strategy 2: Backward-Compatible Migrations
Every migration was designed to be backward-compatible:
Migration Rules:
- Add, don't remove: New columns are nullable first
- Expand, don't contract: VARCHAR(50) → VARCHAR(255) is safe
- Deploy code first: Code supports both old and new schema
- Migrate data second: Backfill new columns
- Remove old code last: Only after the migration is complete (see the sketch below for the VARCHAR → JSONB case)
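Applied to the VARCHAR → JSONB change from the challenge list, these rules turn into an expand-and-contract sequence. A sketch (the details column name is illustrative, not the client's actual schema):
-- Expand: add the new column alongside the old one (nullable, so writes are unaffected)
ALTER TABLE transactions ADD COLUMN details_jsonb JSONB;
-- Deploy application code that writes both columns and prefers the new one when reading
-- Migrate: backfill in batches, casting the old text value
-- (the cast fails on rows that are not valid JSON, so clean those up first)
UPDATE transactions
SET details_jsonb = details::jsonb
WHERE details_jsonb IS NULL
  AND id BETWEEN 1 AND 1000000;
-- Contract: only after every reader uses the new column
ALTER TABLE transactions DROP COLUMN details;
ALTER TABLE transactions RENAME COLUMN details_jsonb TO details;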
Strategy 3: Rollback Plan
We had a rollback plan ready:
- Instant rollback: Feature flag to route back to Blue
- Data rollback: Scripts to reverse schema changes (sketched below)
- Monitoring: Alerts on error rates, latency, replication lag
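For the data-rollback item, the scripts reverse the Phase 2 schema changes. A sketch of their shape (an assumption about their contents, not a verbatim copy):
-- Relax the constraint first so older application code can write again
ALTER TABLE transactions ALTER COLUMN metadata DROP NOT NULL;
-- Drop the new index without blocking writes
DROP INDEX CONCURRENTLY IF EXISTS idx_transactions_metadata;
-- Dropping the new columns waits until no code path references them
-- ALTER TABLE transactions DROP COLUMN metadata, DROP COLUMN processed_at, DROP COLUMN version;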
Thankfully, we didn't need it—but having it ready was critical.
Monitoring & Metrics
We tracked:
- Replication lag: Must stay under 1 second
- Error rates: Per environment, per endpoint
- Latency: P50, P95, P99 for both environments
- Database locks: Monitor for blocking queries (query sketch below)
- Transaction volume: Ensure no transactions lost
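For the lock monitoring, a query like this (assuming PostgreSQL 9.6 or later, where pg_blocking_pids() exists) surfaces sessions stuck behind another backend:
-- Sessions currently blocked, and the backends blocking them
SELECT pid,
       pg_blocking_pids(pid) AS blocked_by,
       wait_event_type,
       left(query, 80) AS query
FROM pg_stat_activity
WHERE cardinality(pg_blocking_pids(pid)) > 0;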
Common Pitfalls to Avoid
❌ Don't Do This:
- Run migrations during peak traffic
- Create indexes without CONCURRENTLY
- Change column types directly (VARCHAR → JSONB)
- Skip replication lag monitoring
- Deploy code and migrations simultaneously
The Results
- Downtime: 0 seconds
- Data loss: 0 rows
- Error rate increase: 0.02% (within normal variance)
- Migration time: 4 hours
- Rollback time: 2 minutes (if needed)
Implementation Checklist
For your next zero-downtime migration:
- Set up read replica (Green environment)
- Verify replication lag is minimal
- Design backward-compatible migrations
- Deploy application code that supports both schemas
- Run migrations on Green
- Backfill data in batches
- Create indexes with CONCURRENTLY
- Test Green environment thoroughly
- Route traffic gradually using feature flags
- Monitor metrics at each step
- Promote Green to primary
- Keep Blue as standby for 48 hours
The Bottom Line
Zero-downtime migrations aren't magic—they're engineering. With the right strategy, you can migrate databases with millions of rows without a single second of downtime.
At NetForceLabs, we don't take maintenance windows. We build systems that migrate themselves.