Most database migration failures are not caused by SQL syntax. They are caused by treating schema changes like isolated database tasks instead of distributed production changes. The moment a new column, index, or constraint touches live traffic, you are coordinating application code, background workers, API behavior, dashboards, rollback strategy, and often tenant-specific data shapes at the same time.
That is why zero-downtime migration work sits at the intersection of several disciplines already covered elsewhere on this blog. The transaction boundaries from Database Transactions and Concurrency Control in TypeScript APIs, the query-awareness from Database Indexing Strategies Every Backend Developer Should Know, the rollout discipline in Best Practices for API Versioning and Backward Compatibility, and the kill-switch mindset in How to Build Your Own Feature Flag System from Scratch all matter here.
This guide focuses on relational databases in TypeScript backends, especially Postgres-backed SaaS systems. We will walk through a production-safe expand-contract workflow, show how to dual-write and backfill without crushing the primary database, and cover the observability you need to know whether the migration is healthy while it is happening. If your system already relies on worker fleets, the operational model should feel similar to Patterns for Background Jobs and Queues in Web Apps, and your instrumentation should look a lot like Designing a High Quality Logging Pipeline with Attention to Cost and Structure.
TL;DR
- Treat schema changes as multi-step rollouts, not one-shot DDL events.
- Prefer expand-contract: add compatible schema first, then dual-write, backfill, cut reads over, and only then remove old fields.
- Keep migrations backward compatible with both old and new application versions during rollout.
- Run backfills as controlled background work with batching, checkpointing, and clear pause controls.
- Instrument every phase with logs, metrics, and reconciliation queries before you delete the old path.
Why database migrations break production
When teams say a migration "caused downtime," the database is often only part of the story. The real problem is usually one of these:
- The new app version expects schema that has not been deployed yet.
- The migration locks hot tables or scans far more data than expected.
- Old and new code paths disagree about where the source of truth lives.
- Backfills overload the database and starve user traffic.
- Rollback is impossible because the migration changed data in only one direction.
This is the same compatibility problem described in Best Practices for API Versioning and Backward Compatibility, just applied to your persistence layer instead of your public API. During deployment, you often have multiple application versions running at once. If schema changes are not backward compatible across that window, a routine deploy becomes a live-fire migration event.
The other hidden problem is concurrency. Backfills, dual writes, and cleanup jobs create new write patterns under real traffic. If those writes are not scoped carefully, they can introduce the exact race conditions discussed in Database Transactions and Concurrency Control in TypeScript APIs. Zero-downtime work is not just about availability. It is also about preserving correctness while multiple versions of your system overlap.
The expand-contract mental model
The safest migration pattern for production SaaS systems is usually expand-contract.
Phase 1: Expand
Add new schema in a backward-compatible way
Phase 2: Dual write
New application code writes old and new representations
Phase 3: Backfill
Historical rows are copied or transformed gradually
Phase 4: Read cutover
Application starts reading from new schema behind a flag
Phase 5: Contract
Old columns, indexes, and fallback paths are removed

The important property is compatibility. Old code still works after the expand phase. New code still works before contract is complete. That compatibility window is what buys you safe deploys, canaries, and reversibility.
If you already use controlled rollout techniques from How to Build Your Own Feature Flag System from Scratch, apply the same model here. The schema migration is not the rollout. It is the prerequisite for the rollout.
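If your migration tooling tracks these phases explicitly, the sequence above can be encoded as a tiny state machine. This is a sketch under assumed conventions; the phase names and the reversibility rule are ours, not part of any framework:

```typescript
// The five expand-contract phases as an ordered tuple. Each phase may only
// advance to the next one; skipping phases is how migrations break.
export const phases = [
  "expand",
  "dual_write",
  "backfill",
  "read_cutover",
  "contract",
] as const;

export type Phase = (typeof phases)[number];

// Returns the next phase, or null when the migration is complete.
export const nextPhase = (current: Phase): Phase | null => {
  const index = phases.indexOf(current);
  return index < phases.length - 1 ? phases[index + 1] : null;
};

// Contract is the point of no return: every phase before it can be unwound
// by disabling flags and stopping jobs, because the old schema is intact.
export const isReversible = (current: Phase): boolean => current !== "contract";
```

The useful property to encode is exactly that last function: before contract, rollback is a runtime decision rather than a restore-from-backup event.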
Step 1: Design for compatibility before touching SQL
Before you write a migration file, define these questions clearly:
- What is changing in the read path?
- What is changing in the write path?
- Can old code and new code run at the same time?
- What makes rollback safe?
- How will you prove that the new representation is complete and correct?
Suppose you are moving from a single full_name column to separate first_name and last_name columns.
The unsafe approach is obvious:
- Drop full_name
- Add first_name and last_name
- Deploy new code

That introduces an instant compatibility break. The safer design is:

- Add first_name and last_name as nullable columns.
- Deploy code that still reads full_name by default but writes both shapes.
- Backfill historical rows.
- Cut reads over to the new columns behind a flag.
- Keep fallback logic until confidence is high.
- Remove full_name later.
This mirrors the additive-first approach from Best Practices for API Versioning and Backward Compatibility. You are effectively versioning the data model inside your own system.
Step 2: Expand with additive schema changes only
The expand phase should avoid irreversible moves. Favor additions over mutations:
- Add nullable columns instead of replacing existing ones
- Add new tables instead of repurposing old ones in place
- Add indexes before traffic depends on them
- Add constraints in a staged way when the database supports it
Example Postgres migration:
ALTER TABLE customers
ADD COLUMN first_name text,
ADD COLUMN last_name text;
-- Build indexes before the new read path needs them
CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_customers_first_name
ON customers (first_name);

Two practical notes matter here.
First, index creation must match real query patterns. If you are about to cut reads over to first_name plus tenant_id, index for that access pattern rather than guessing. The workflow in Database Indexing Strategies Every Backend Developer Should Know applies directly here because migrations fail just as often from bad access paths as from bad SQL.
Second, treat locks as a first-class risk. Large ALTER TABLE operations can still block writes depending on the database, the specific DDL, and the table size. Review each statement with the same caution you would apply to a hot transactional path in Database Transactions and Concurrency Control in TypeScript APIs. In production, "quick migration" is not a guarantee. It is a hypothesis you have to validate.
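One concrete guard in Postgres is to run every DDL statement with a short lock_timeout, so an ALTER TABLE that cannot acquire its lock fails fast instead of queuing and stalling all traffic behind it. The helper below is a sketch: the timeout values are arbitrary, and RawExecutor stands in for whatever raw-SQL interface your migration runner exposes:

```typescript
// Wrap a DDL statement with conservative timeouts so it fails fast instead
// of queuing behind long-running transactions and blocking live writes.
// The specific timeout values are illustrative assumptions, not recommendations.
type RawExecutor = { executeRaw: (sql: string) => Promise<void> };

export const buildGuardedDdl = (statement: string): string[] => [
  "SET lock_timeout = '2s';", // give up quickly if the table is busy
  "SET statement_timeout = '5min';", // bound the DDL itself
  statement,
];

export const runDdl = async (db: RawExecutor, statement: string): Promise<void> => {
  for (const sql of buildGuardedDdl(statement)) {
    // If the lock_timeout fires, this throws; retry later with backoff
    // rather than raising the timeout.
    await db.executeRaw(sql);
  }
};
```

On failure, rerun at a quieter moment instead of loosening the timeout. Note that CREATE INDEX CONCURRENTLY needs separate handling because it cannot run inside a transaction block.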
Step 3: Dual-write inside a single transactional boundary
Once the new schema exists, application code should write both the old and new representations for a while. This is what keeps rollback possible after deployment.
For example, in a repository or service layer:
type UpdateCustomerNameInput = {
customerId: string;
tenantId: string;
fullName: string;
};
const splitName = (fullName: string) => {
const trimmed = fullName.trim();
const [firstName, ...rest] = trimmed.split(/\s+/);
return {
firstName: firstName ?? "",
lastName: rest.join(" "),
};
};
export const updateCustomerName = async ({
customerId,
tenantId,
fullName,
}: UpdateCustomerNameInput) => {
const { firstName, lastName } = splitName(fullName);
await db.transaction().execute(async (trx) => {
await trx
.updateTable("customers")
.set({
full_name: fullName,
first_name: firstName,
last_name: lastName,
})
.where("id", "=", customerId)
.where("tenant_id", "=", tenantId)
.executeTakeFirstOrThrow();
});
};

Do not dual-write in separate operations unless you enjoy creating reconciliation work for yourself. The old and new shapes must commit atomically, which is exactly why the locking and transaction guidance in Database Transactions and Concurrency Control in TypeScript APIs matters so much during migrations.
This is also where feature flags help. The schema can be live everywhere while the new read path stays disabled until you are ready. That separation of data readiness from user-facing rollout is one of the strongest lessons in How to Build Your Own Feature Flag System from Scratch.
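One way to keep that separation honest is to validate the flag combination itself, since serving reads from the new columns before dual-writes are enabled is never a legal state. A minimal sketch with invented flag names:

```typescript
// Dual-write must be enabled before (and while) the new read path is on,
// otherwise reads can observe rows that were never written in the new shape.
// These flag names are invented for illustration.
type MigrationFlags = {
  writeBothShapes: boolean; // dual-write old + new columns
  readNewShape: boolean; // serve reads from the new columns
};

export const validateFlags = (flags: MigrationFlags): string[] => {
  const problems: string[] = [];
  if (flags.readNewShape && !flags.writeBothShapes) {
    problems.push(
      "readNewShape requires writeBothShapes: newly written rows would be missing data",
    );
  }
  return problems;
};
```

Running a check like this at flag-evaluation time turns a silent data fork into a loud configuration error.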
Step 4: Backfill historical data like a background system, not a one-off script
Backfills are where many "safe" migrations become dangerous. Someone writes a quick script that scans 80 million rows, issues unbounded updates, and wonders why primary latency doubled.
Treat backfills as production workloads. The patterns from Patterns for Background Jobs and Queues in Web Apps apply almost perfectly:
- Batch work into small chunks
- Checkpoint progress
- Rate-limit worker concurrency
- Make jobs idempotent
- Expose pause and resume controls
- Log progress in a structured way
Minimal example:
type BackfillJob = {
tenantId: string;
lastProcessedId?: string;
batchSize: number;
};
export const backfillCustomerNames = async ({
tenantId,
lastProcessedId,
batchSize,
}: BackfillJob) => {
const rows = await db
.selectFrom("customers")
.select(["id", "full_name"])
.where("tenant_id", "=", tenantId)
.where("id", ">", lastProcessedId ?? "")
.where("first_name", "is", null)
.orderBy("id", "asc")
.limit(batchSize)
.execute();
for (const row of rows) {
const { firstName, lastName } = splitName(row.full_name ?? "");
await db
.updateTable("customers")
.set({
first_name: firstName,
last_name: lastName,
})
.where("id", "=", row.id)
.where("tenant_id", "=", tenantId)
.where("first_name", "is", null)
.executeTakeFirst();
}
return {
processed: rows.length,
nextCursor: rows.at(-1)?.id,
};
};

There are three production details people skip too often.
First, scope the job to the same access pattern your indexes support. If the backfill filters by tenant_id, id, and nullability, design supporting indexes with the same discipline described in Database Indexing Strategies Every Backend Developer Should Know. A backfill that performs a full table scan every minute is not a maintenance task. It is a latent incident.
Second, do not hold long transactions across giant batches. Short, repeatable batches are much friendlier to live traffic and align with the lock minimization mindset from Database Transactions and Concurrency Control in TypeScript APIs.
Third, instrument the backfill like a real subsystem. The structured event design in Designing a High Quality Logging Pipeline with Attention to Cost and Structure is exactly what you want here: batch size, duration, rows updated, tenant id, retry count, and error class.
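Tying those details together, the batch function is usually wrapped in a driver that checkpoints the cursor, honors a pause switch, and spaces batches out. This is a sketch; runBatch, saveCheckpoint, isPaused, and sleepMs are assumed hooks into your own job system:

```typescript
type BatchResult = { processed: number; nextCursor?: string };

type BackfillHooks = {
  runBatch: (cursor?: string) => Promise<BatchResult>;
  saveCheckpoint: (cursor: string) => Promise<void>; // persist progress for restarts
  isPaused: () => Promise<boolean>; // operator kill switch
  sleepMs?: number; // crude rate limit between batches
};

// Drives a backfill batch-by-batch; returns the total rows processed this run.
export const driveBackfill = async (hooks: BackfillHooks): Promise<number> => {
  let cursor: string | undefined;
  let total = 0;
  while (true) {
    // Paused: stop now and resume later from the saved checkpoint.
    if (await hooks.isPaused()) return total;
    const result = await hooks.runBatch(cursor);
    total += result.processed;
    // No next cursor means the scan is exhausted: backfill complete.
    if (!result.nextCursor) return total;
    cursor = result.nextCursor;
    await hooks.saveCheckpoint(cursor);
    if (hooks.sleepMs) await new Promise((r) => setTimeout(r, hooks.sleepMs));
  }
};
```

Because each batch filters on rows that still have a null first_name, restarting from an old checkpoint is safe: already-backfilled rows are simply skipped, which is the idempotency property the job needs.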
Step 5: Add reconciliation before read cutover
Before switching the read path, verify that the new representation is actually complete.
Useful checks include:
- Rows where old and new representations disagree
- Rows that were never backfilled
- Dual-write drift introduced by application bugs
- Tenant-specific skew, where a few large tenants are lagging behind
Example reconciliation query:
SELECT tenant_id, count(*) AS mismatched_rows
FROM customers
WHERE full_name IS NOT NULL
AND concat_ws(' ', first_name, last_name) <> full_name
GROUP BY tenant_id
ORDER BY mismatched_rows DESC;

This is where observability becomes a release gate, not an afterthought. Dashboards should show backfill completion percentage, migration read errors, reconciliation drift, and latency impact. If your logging and metric fields are inconsistent, you will not know whether the migration is healthy until users tell you. That is exactly the failure mode Designing a High Quality Logging Pipeline with Attention to Cost and Structure helps you avoid.
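Those dashboard signals can be turned into an explicit release gate instead of a judgment call. A minimal sketch; the field names and thresholds are assumptions you would tune per system:

```typescript
// Health signals gathered from backfill progress, the reconciliation query,
// and new-read-path error metrics. Field names here are illustrative.
type MigrationHealth = {
  backfillCompletePct: number; // 0..100
  mismatchedRows: number; // from the reconciliation query
  readErrorRate: number; // errors per 10k reads on the new path, for example
};

// Thresholds are arbitrary examples; pick values matching your risk tolerance.
export const isReadyForCutover = (h: MigrationHealth): boolean =>
  h.backfillCompletePct >= 99.99 && h.mismatchedRows === 0 && h.readErrorRate === 0;
```

Evaluating this in CI or in the flag system before each rollout expansion makes "is the migration healthy" a yes/no question with an audit trail.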
Step 6: Cut reads over behind a feature flag
Do not couple read cutover to deployment if you can avoid it. Make it a controlled runtime decision.
type CustomerRow = {
full_name: string | null;
first_name: string | null;
last_name: string | null;
};
const buildDisplayName = (row: CustomerRow) => {
const candidate = [row.first_name, row.last_name].filter(Boolean).join(" ").trim();
return candidate || row.full_name || "Unknown customer";
};
export const mapCustomerName = async (row: CustomerRow, tenantId: string) => {
const useSplitName = await flags.isEnabled("customers-read-split-name", { tenantId });
if (!useSplitName) {
return row.full_name || buildDisplayName(row);
}
return buildDisplayName(row);
};

The rollout process should look familiar if you have ever shipped percentage rollouts with How to Build Your Own Feature Flag System from Scratch:
- Enable for internal traffic first
- Enable for one or two low-risk tenants
- Watch query latency, error rate, and reconciliation drift
- Expand gradually
This is effectively schema canarying. And just like API rollout safety in Best Practices for API Versioning and Backward Compatibility, the old path should remain available until the new one has proven itself under real load.
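For the percentage steps to be safe, bucketing must be deterministic per tenant, so the same tenant does not flip between code paths on every request. A common trick is hashing the tenant id into a stable bucket; FNV-1a here is just one cheap, assumed choice:

```typescript
// Deterministically map a tenant id to a bucket in [0, 100) so rollout
// percentage changes are sticky per tenant. FNV-1a 32-bit is one cheap option.
const fnv1a = (input: string): number => {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0; // multiply by FNV prime, keep unsigned
  }
  return hash;
};

// A tenant is in the rollout when its bucket falls below the percentage,
// so raising the percentage only ever adds tenants, never swaps them.
export const inRollout = (tenantId: string, percentage: number): boolean =>
  fnv1a(tenantId) % 100 < percentage;
```

The monotonic property matters: moving from 10% to 25% keeps every tenant from the 10% cohort enabled, which keeps your observations comparable across rollout steps.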
Step 7: Special handling for multi-tenant SaaS
Zero-downtime work gets harder in multi-tenant environments because "done" rarely means globally uniform. Some tenants are tiny, some have years of historical data, and some have custom integrations that exercise paths nobody else touches.
Use the guardrails from Designing Multi-Tenant SaaS Isolation: Data, Controls, and Cost Guardrails during migration rollout:
- Backfill per tenant instead of globally when feasible
- Track completion and drift per tenant
- Use tenant-aware kill switches for read cutover
- Keep all migration writes scoped by tenant_id
- Avoid cross-tenant scans on hot tables
Per-tenant progression also makes rollback far less painful. If tenant 42 exposes a hidden edge case, you can disable the flag just for that tenant and keep learning without globally reverting the migration.
This is another place where Patterns for Background Jobs and Queues in Web Apps becomes relevant. Large-scale backfills are rarely a single job. They are operational queues whose unit of control should usually be a tenant, partition, or bounded key range.
Step 8: Contract only after the system has forgotten the old world
The contract phase is when you remove the old column, old write path, and fallback logic. This should be the least dramatic step, not the most dramatic one.
Before contract, verify all of the following:
- No running application version still reads or writes the old field
- Reconciliation has been clean for a meaningful period
- Rollback no longer depends on the old schema
- Dashboards and alerts show no migration-specific anomalies
- Documentation and downstream consumers have been updated
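That checklist can live in code as a gate the contract migration refuses to run without. A sketch where the inputs are assumed to come from deploy metadata and dashboards, and the seven-day window is an arbitrary example:

```typescript
// Preconditions mirroring the contract checklist above. These inputs would
// come from deploy metadata, reconciliation history, and alerting; the names
// and the 7-day window are illustrative assumptions.
type ContractPreconditions = {
  oldFieldStillReferenced: boolean; // any running version reads/writes full_name
  cleanReconciliationDays: number; // consecutive days with zero drift
  rollbackNeedsOldSchema: boolean;
  openMigrationAlerts: number;
};

// Returns human-readable blockers; contract may proceed only when empty.
export const blockersForContract = (p: ContractPreconditions): string[] => {
  const blockers: string[] = [];
  if (p.oldFieldStillReferenced) blockers.push("old field still referenced by running code");
  if (p.cleanReconciliationDays < 7) blockers.push("reconciliation not clean long enough");
  if (p.rollbackNeedsOldSchema) blockers.push("rollback still depends on old schema");
  if (p.openMigrationAlerts > 0) blockers.push("migration alerts still firing");
  return blockers;
};
```

Printing the blockers instead of returning a bare boolean gives the operator a reason, not just a refusal.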
Only then should you remove the old path:
ALTER TABLE customers
DROP COLUMN full_name;

If external consumers or internal services still depend on the old shape, treat that as a versioning problem rather than a database housekeeping problem. The rollout coordination principles in Best Practices for API Versioning and Backward Compatibility are just as relevant during contract as they were during expand.
Common failure modes that still catch experienced teams
Renaming in place
Column renames feel harmless because the data stays put. Operationally, they are often still breaking changes because old code refers to the old name. Prefer additive changes plus delayed cleanup.
Backfilling without supporting indexes
If the backfill query path is not indexed, your migration load will fight user traffic for the same resources. Revisit Database Indexing Strategies Every Backend Developer Should Know before you start the backfill, not after dashboards go red.
Dual-write drift
If one code path writes only the new field and another still writes only the old field, your data will fork quietly. Dual-write logic belongs in a central service or repository boundary, not scattered across handlers. This is where transactional discipline from Database Transactions and Concurrency Control in TypeScript APIs saves you from subtle corruption.
Treating migration scripts like disposable code
Migration workers need the same retry policy, idempotency, observability, and pause controls as any other production background workload. If that sounds like a queueing problem, it is because it is one. Patterns for Background Jobs and Queues in Web Apps is the right mental model here.
No fast rollback lever
If the only rollback plan is "restore from backup," you do not have a realistic rollback plan. A runtime kill switch from How to Build Your Own Feature Flag System from Scratch is often the fastest way to stop the blast radius while you investigate.
A production checklist for zero-downtime migrations
Use this before shipping any migration that changes live read or write behavior:
- Define the compatibility window between old and new application versions.
- Ship additive schema changes first.
- Add required indexes before moving reads or backfills to the new shape.
- Implement dual-write in one transactional boundary.
- Run backfills as bounded background jobs with checkpointing.
- Emit structured logs and metrics for every migration phase.
- Reconcile old and new representations until drift is zero or understood.
- Cut reads over gradually behind a feature flag.
- Roll out per tenant when the system is multi-tenant.
- Delete old columns only after the system has fully stopped depending on them.
Conclusion
Zero-downtime migrations are less about clever SQL and more about disciplined change management. The database must stay compatible while code versions overlap, writes must remain correct while two representations coexist, and backfills must behave like controlled production workloads rather than heroic one-time scripts.
If you combine transactional correctness from Database Transactions and Concurrency Control in TypeScript APIs, query-aware rollout planning from Database Indexing Strategies Every Backend Developer Should Know, staged compatibility thinking from Best Practices for API Versioning and Backward Compatibility, and operational control from How to Build Your Own Feature Flag System from Scratch, database migrations stop feeling like midnight roulette and start feeling like any other well-managed production rollout.
Actionable takeaways
- Pick one migration in your backlog and rewrite it as an expand-contract plan before touching code.
- Add reconciliation queries and migration-specific logs up front, using the field discipline from Designing a High Quality Logging Pipeline with Attention to Cost and Structure.
- Move any large backfill into your worker system and manage it with the same controls you use for Patterns for Background Jobs and Queues in Web Apps.
- For SaaS products, make rollout state visible per tenant and align it with the boundaries in Designing Multi-Tenant SaaS Isolation: Data, Controls, and Cost Guardrails.
