Schema migration

Last updated

In software engineering, a schema migration (also database migration, database change management) refers to the management of version-controlled, incremental and sometimes reversible changes to relational database schemas. A schema migration is performed on a database whenever it is necessary to update or revert that database's schema to some newer or older version.

Contents

Migrations are performed programmatically by using a schema migration tool. When invoked with a specified desired schema version, the tool automates the successive application or reversal of an appropriate sequence of schema changes until it is brought to the desired state.

Most schema migration tools aim to minimize the impact of schema changes on any existing data in the database. Despite this, preservation of data in general is not guaranteed because schema changes such as the deletion of a database column can destroy data (i.e. all values stored under that column for all rows in that table are deleted). Instead, the tools help to preserve the meaning of the data or to reorganize existing data to meet new requirements. Since meaning of the data often cannot be encoded, the configuration of the tools usually needs manual intervention.

Risks and benefits

Schema migration allows for fixing mistakes and adapting the data as requirements change. They are an essential part of software evolution, especially in agile environments (see below).

Applying a schema migration to a production database is always a risk. Development and test databases tend to be smaller and cleaner. The data in them is better understood or, if everything else fails, the amount of data is small enough for a human to process. Production databases are usually huge, old and full of surprises. The surprises can come from many sources:

For these reasons, the migration process needs a high level of discipline, thorough testing and a sound backup strategy.

Migration strategies

In the steady state, one version of an application only understands one version of a schema. So the most basic strategy is to shut down the application, execute the schema migration, and then start the newer version of the application. While simple, this strategy causes a downtime. Depending on the criticality of the system and its usage patterns, downtimes of various durations may be tolerated, but in some cases none may be tolerated at all. In those cases, one of the following zero-downtime strategies may be used.

Dual writing [1]

These are the general steps of dual reading (also called double reading):

  1. Prepare the schema so that it can hold data in both the old and new formats. This might mean adding a new version of a column or a table, without affecting existing data.
  2. Deploy a new version of the application which writes data in both the old and new formats (hence the name dual writing). It's important to ensure consistency of these writes. After this point, all newly written data will exist in both old and new formats.
  3. Execute a backfill in the database: copy data from the old format to the new format that existed previously, and hasn't been updated recently, so it's not dual written yet. After this point, the database has a complete replica of the data in both the old and new formats.
  4. Deploy a new version of the application which switches to reading data in the new format, and stops dual writing. In distributed systems, it's important to switch the reading path before stopping dual writing, so this step may be divided into two.
  5. Remove the old format data from the schema.

Dual reading [2]

Dual reading (also called double reading) is similar to dual writing, with the following steps:

  1. Prepare the schema so that it can hold data in both the old and new formats. Same as above.
  2. Deploy a new version of the application which tries to read both the old and new formats (hence the name dual reading), and works with whichever format is currently present.
  3. Deploy another version of the application which stops writing the old format and starts writing the new format instead. Everything should continue working normally as the dual reading will recognize that it has to read the new format for newly written rows.
  4. Execute a backfill in the database: for all data that was written in the old format, move it over to the new format.
  5. Deploy an application change once more that stops reading the old data.
  6. Remove the old format data from the schema.

Dual reading and writing

In this combined approach, the application is changed to both dual read and dual write. Since both individual strategies ensure that the database can remain online without interruption, the combined approach achieves the same as well. This strategy allows for more fine grained control over the backfill, which can be divided into smaller batches, and feature flags may be used to toggle both the reading and writing paths more freely and separately from each other. This can also be useful when regular dual writing alone can't be guaranteed to happen in consistent transactions.

Comparison

Schema migration in agile software development

When developing software applications backed by a database, developers typically develop the application source code in tandem with an evolving database schema. The code typically has rigid expectations of what columns, tables and constraints are present in the database schema whenever it needs to interact with one, so only the version of database schema against which the code was developed is considered fully compatible with that version of source code.

In software testing, while developers may mock the presence of a compatible database system for unit testing, any level of testing higher than this (e.g. integration testing or system testing) it is common for developers to test their application against a local or remote test database schematically compatible with the version of source code under test. In advanced applications, the migration itself can be subject to migration testing.

With schema migration technology, data models no longer need to be fully designed up-front, and are more capable of being adapted with changing project requirements throughout the software development lifecycle.

Relation to revision control systems

Teams of software developers usually use version control systems to manage and collaborate on changes made to versions of source code. Different developers can develop on divergent, relatively older or newer branches of the same source code to make changes and additions during development.

Supposing that the software under development interacts with a database, every version of the source code can be associated with at least one database schema with which it is compatible.

Under good software testing practice, schema migrations can be performed on test databases to ensure that their schema is compatible to the source code. To streamline this process, a schema migration tool is usually invoked as a part of an automated software build as a prerequisite of the automated testing phase.

Schema migration tools can be said to solve versioning problems for database schemas just as version control systems solve versioning problems for source code. In practice, many schema migration tools actually rely on a textual representation of schema changes (such as files containing SQL statements) such that the version history of schema changes can effectively be stored alongside program source code within VCS. This approach ensures that the information necessary to recover a compatible database schema for a particular code branch is recoverable from the source tree itself. Another benefit of this approach is the handling of concurrent conflicting schema changes; developers may simply use their usual text-based conflict resolution tools to reconcile differences.

Relation to schema evolution

Schema migration tooling could be seen as a facility to track the history of an evolving schema (i.e. schema evolution).

Advantages

Developers no longer need to remove the entire test database in order to create a new test database from scratch (e.g. using schema creation scripts from DDL generation tools). Further, if generation of test data costs a lot of time, developers can avoid regenerating test data for small, non-destructive changes to the schema.

Related Research Articles

<span class="mw-page-title-main">Legacy system</span> Old computing technology or system that remains in use

In computing, a legacy system is an old method, technology, computer system, or application program, "of, relating to, or being a previous or outdated computer system", yet still in use. Often referencing a system as "legacy" means that it paved the way for the standards that would follow it. This can also imply that the system is out of date or in need of replacement.

<span class="mw-page-title-main">Microsoft Access</span> Database manager part of the Microsoft 365 package

Microsoft Access is a database management system (DBMS) from Microsoft that combines the relational Access Database Engine (ACE) with a graphical user interface and software-development tools. It is a member of the Microsoft 365 suite of applications, included in the Professional and higher editions or sold separately.

<span class="mw-page-title-main">Software testing</span> Checking software against a standard

Software testing is the act of checking whether software satisfies expectations.

<span class="mw-page-title-main">Eclipse (software)</span> Software development environment

Eclipse is an integrated development environment (IDE) used in computer programming. It contains a base workspace and an extensible plug-in system for customizing the environment. It is the second-most-popular IDE for Java development, and, until 2016, was the most popular. Eclipse is written mostly in Java and its primary use is for developing Java applications, but it may also be used to develop applications in other programming languages via plug-ins, including Ada, ABAP, C, C++, C#, Clojure, COBOL, D, Erlang, Fortran, Groovy, Haskell, JavaScript, Julia, Lasso, Lua, NATURAL, Perl, PHP, Prolog, Python, R, Ruby, Rust, Scala, and Scheme. It can also be used to develop documents with LaTeX and packages for the software Mathematica. Development environments include the Eclipse Java development tools (JDT) for Java and Scala, Eclipse CDT for C/C++, and Eclipse PDT for PHP, among others.

<span class="mw-page-title-main">WebObjects</span> Java web application server and framework originally developed by NeXT Software

WebObjects is a discontinued Java web application server and a server-based web application framework originally developed by NeXT Software, Inc.

Data migration is the process of selecting, preparing, extracting, and transforming data and permanently transferring it from one computer storage system to another. Additionally, the validation of migrated data for completeness and the decommissioning of legacy data storage are considered part of the entire data migration process. Data migration is a key consideration for any system implementation, upgrade, or consolidation, and it is typically performed in such a way as to be as automated as possible, freeing up human resources from tedious tasks. Data migration occurs for a variety of reasons, including server or storage equipment replacements, maintenance or upgrades, application migration, website consolidation, disaster recovery, and data center relocation.

UVC-based preservation is an archival strategy for handling the preservation of digital objects. It employs the use of a Universal Virtual Computer (UVC)—a virtual machine (VM) specifically designed for archival purposes, that allows both emulation and migration to a language-neutral format like XML.

GRIB is a concise data format commonly used in meteorology to store historical and forecast weather data. It is standardized by the World Meteorological Organization's Commission for Basic Systems, known under number GRIB FM 92-IX, described in WMO Manual on Codes No.306. Currently there are three versions of GRIB. Version 0 was used to a limited extent by projects such as TOGA, and is no longer in operational use. The first edition is used operationally worldwide by most meteorological centers, for Numerical Weather Prediction output (NWP). A newer generation has been introduced, known as GRIB second edition, and data is slowly changing over to this format. Some of the second-generation GRIB is used for derived products distributed in the Eumetcast of Meteosat Second Generation. Another example is the NAM model.

The Extensible Metadata Platform (XMP) is an ISO standard, originally created by Adobe Systems Inc., for the creation, processing and interchange of standardized and custom metadata for digital documents and data sets.

<span class="mw-page-title-main">CakePHP</span> Open-source web framework in PHP

CakePHP is an open-source web framework. It follows the model–view–controller (MVC) approach and is written in PHP, modeled after the concepts of Ruby on Rails, and distributed under the MIT License.

ER/Studio is data architecture and database design software developed by IDERA, Inc. ER/Studio is compatible with multiple database platforms and is used to create and manage database designs, as well as to document and reuse data assets. In 2015, Embarcadero Technologies was acquired by database and infrastructure management software company IDERA, Inc. Since the acquisition by IDERA, Inc., ER/Studio has been renamed to ER/Studio Data Architect with updated features.

Perst is an open source, dual license, object-oriented embedded database management system (ODBMS). Both the Java programming language, and the C# programming language versions are compact and Perst has been implemented on smart phones running the Android and Windows Phone (WP7) operating systems.

Profile Scripting Language (PSL) is a superset of the MUMPS programming language that adds object-oriented language features. It is currently developed by Fidelity National Information Services.

Web2py is an open-source web application framework written in the Python programming language. Web2py allows web developers to program dynamic web content using Python. Web2py is designed to help reduce tedious web development tasks, such as developing web forms from scratch, although a web developer may build a form from scratch if required.

A file format is a standard way that information is encoded for storage in a computer file. It specifies how bits are used to encode information in a digital storage medium. File formats may be either proprietary or free.

<span class="mw-page-title-main">Db4o</span> Open source object database

db4o was an embeddable open-source object database for Java and .NET developers. It was developed, commercially licensed and supported by Actian. In October 2014, Actian declined to continue to actively pursue and promote the commercial db4o product offering for new customers.

Continuous delivery (CD) is a software engineering approach in which teams produce software in short cycles, ensuring that the software can be reliably released at any time and following a pipeline through a "production-like environment", without doing so manually. It aims at building, testing, and releasing software with greater speed and frequency. The approach helps reduce the cost, time, and risk of delivering changes by allowing for more incremental updates to applications in production. A straightforward and repeatable deployment process is important for continuous delivery.

<span class="mw-page-title-main">Database Workbench</span>

Database Workbench is a software application for development and administration of multiple relational databases using SQL, with interoperationality between different database systems, developed by Upscene Productions.

Evolutionary database design involves incremental improvements to the database schema so that it can be continuously updated with changes, reflecting the customer's requirements. People across the globe work on the same piece of software at the same time hence, there is a need for techniques that allow a smooth evolution of database as the design develops. Such methods utilize automated refactoring and continuous integration so that it supports agile methodologies for software development. These development techniques are applied on systems that are in pre-production stage as well on systems that have already been released. These techniques not only cover relevant changes in the database schema according to customer's changing needs, but also migration of modified data into the database and also customizing the database access code accordingly without changing the data semantics.

This article discusses a set of tactics useful in software testing. It is intended as a comprehensive list of tactical approaches to Software Quality Assurance (more widely colloquially known as Quality Assurance and general application of the test method.

References

  1. "Safe Database Migration Pattern Without Downtime" . Retrieved 24 May 2024.
  2. "Data store migrations with no downtime" . Retrieved 24 May 2024.