Database refactoring

A database refactoring is a simple change to a database schema that improves its design while retaining both its behavioral and informational semantics. Database refactoring does not change the way data is interpreted or used, and it does not fix bugs or add new functionality. Every refactoring leaves the system in a working state, so it does not cause maintenance lags, provided that meaningful data exists in the production environment.

A database refactoring is conceptually more difficult than a code refactoring; code refactorings only need to maintain behavioral semantics while database refactorings also must maintain informational semantics.

A database schema is typically refactored for one of several reasons:

  1. To develop the schema in an evolutionary manner in parallel with the evolutionary design of the rest of the system.
  2. To fix design problems with an existing legacy database schema. Database refactorings are often motivated by the desire for database normalization of an existing production database, typically to "clean up" the design of the database.
  3. To implement what would be a large (and potentially risky) change as a series of small, low-risk changes.

Categories of database refactoring

In 2006, Scott Ambler and Pramod Sadalage [1] described the following categories of database refactoring: [2]

Architecture refactoring: a change that improves the overall manner in which external programs interact with a database.

Methods of the Architecture Refactoring category: Add CRUD Methods; Add Mirror Table; Add Read Method; Encapsulate Table With View; Introduce Calculation Method; Introduce Index; Introduce Read Only Table; Migrate Method From Database; Migrate Method To Database; Replace Method(s) With View; Replace View With Method(s); Use Official Data Source.
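As an illustration of one method from this category, the sketch below shows Encapsulate Table With View using Python's built-in sqlite3 module. The choice of SQLite, the table `customer`, the view `customer_v`, and the helper `read_customers` are all assumptions made for the example and are not taken from the catalog.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (
        id INTEGER PRIMARY KEY,
        name TEXT NOT NULL,
        internal_credit_score INTEGER
    );
    INSERT INTO customer (id, name, internal_credit_score) VALUES
        (1, 'Ada', 710),
        (2, 'Grace', 655);

    -- External programs read from the view rather than the base table.
    CREATE VIEW customer_v AS
        SELECT id, name FROM customer;
""")


def read_customers(connection):
    """Return (id, name) rows through the view (hypothetical caller code)."""
    return connection.execute(
        "SELECT id, name FROM customer_v ORDER BY id"
    ).fetchall()


rows_before = read_customers(conn)

# The base table changes, but callers of the view are unaffected.
conn.execute("ALTER TABLE customer ADD COLUMN email TEXT")

rows_after = read_customers(conn)
```

Because callers depend only on the view, the base table can evolve behind it while the rows they read stay the same.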

Structural refactoring: a change to the table structure of your database schema.

Methods of Structural Refactoring category: Drop Column; Drop Table; Drop View; Introduce Calculated Column; Introduce Surrogate Key; Merge Columns; Merge Tables; Move Column; Rename Column; Rename Table; Rename View; Replace LOB With Table; Replace Column; Replace One-To-Many With Associative Tables; Replace Surrogate Key With Natural Key; Split Column; Split Table.
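A structural refactoring such as Rename Column is often rolled out gradually so that existing programs keep working. The following is a minimal sketch of that idea using Python's sqlite3 module: the new column is added beside the old one, existing values are copied, and triggers keep the two in sync while legacy code still writes the old column. The table `customer`, the columns `fname` and `first_name`, and the trigger names are hypothetical, and the trigger-based synchronization is one possible approach rather than the only one.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (id INTEGER PRIMARY KEY, fname TEXT);
    INSERT INTO customer (id, fname) VALUES (1, 'Ada'), (2, 'Grace');

    -- Step 1: introduce the new column next to the old one.
    ALTER TABLE customer ADD COLUMN first_name TEXT;

    -- Step 2: copy existing values so no information is lost.
    UPDATE customer SET first_name = fname;

    -- Step 3: keep the new column in sync while old code still writes fname.
    CREATE TRIGGER customer_fname_insert AFTER INSERT ON customer
    BEGIN
        UPDATE customer SET first_name = NEW.fname
        WHERE id = NEW.id AND first_name IS NULL;
    END;

    CREATE TRIGGER customer_fname_update AFTER UPDATE OF fname ON customer
    BEGIN
        UPDATE customer SET first_name = NEW.fname WHERE id = NEW.id;
    END;
""")

# Legacy code paths that only know the old column name.
conn.execute("INSERT INTO customer (id, fname) VALUES (3, 'Edsger')")
conn.execute("UPDATE customer SET fname = 'Augusta' WHERE id = 1")

synced = conn.execute(
    "SELECT id, fname, first_name FROM customer ORDER BY id"
).fetchall()
```

Once every program reads and writes `first_name`, the old column and the triggers could be dropped to complete the rename.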

Data quality refactoring: a change that improves and/or ensures the consistency and usage of the values stored within the database.

Methods of Data Quality Refactoring category: Add Lookup Table; Apply Standard Codes; Apply Standard Type; Consolidate Key Strategy; Drop Column Constraint; Drop Default Value; Drop Non Nullable; Introduce Column Constraint; Introduce Common Format; Introduce Default Value; Make Column Non Nullable; Move Data; Replace Type Code With Property Flags.
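To make one of these methods concrete, here is a hedged sketch of Apply Standard Codes with Python's sqlite3 module. The `address` table, the `country` column, and the mapping in `STANDARD_COUNTRY_CODES` are invented for the example. The row count is compared before and after as a simple check that no records were lost.

```python
import sqlite3

# Hypothetical mapping from legacy spellings to one standard code.
STANDARD_COUNTRY_CODES = {
    "USA": "US",
    "U.S.": "US",
    "United States": "US",
    "Canada": "CA",
}


def apply_standard_codes(connection, mapping):
    """Rewrite legacy values in address.country to their standard code."""
    with connection:  # commit on success, roll back on error
        for legacy, standard in mapping.items():
            connection.execute(
                "UPDATE address SET country = ? WHERE country = ?",
                (standard, legacy),
            )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE address (id INTEGER PRIMARY KEY, country TEXT)")
conn.executemany(
    "INSERT INTO address (id, country) VALUES (?, ?)",
    [(1, "USA"), (2, "U.S."), (3, "CA"), (4, "Canada"), (5, "US")],
)
conn.commit()

count_before = conn.execute("SELECT COUNT(*) FROM address").fetchone()[0]
apply_standard_codes(conn, STANDARD_COUNTRY_CODES)
count_after = conn.execute("SELECT COUNT(*) FROM address").fetchone()[0]

distinct_codes = [
    row[0]
    for row in conn.execute(
        "SELECT DISTINCT country FROM address ORDER BY country"
    )
]
```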

Referential integrity refactoring: a change that ensures that a referenced row exists within another table and/or that a row which is no longer needed is removed appropriately.

Methods of Referential Integrity Refactoring category: Add Foreign Key Constraint; Add Trigger for Calculated Column; Drop Foreign Key Constraint; Introduce Cascading Delete; Introduce Hard Delete; Introduce Soft Delete; Introduce Trigger for History.
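The sketch below illustrates Introduce Soft Delete from this category, again with Python's sqlite3 module. The tables `customer` and `invoice`, the flag column `is_deleted`, the view `active_customer`, and the helper `soft_delete_customer` are hypothetical names chosen for the example. Using a flag column plus a view is one common way to implement the idea, not the only one.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE invoice (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customer(id)
    );
    INSERT INTO customer (id, name) VALUES (1, 'Ada'), (2, 'Grace');
    INSERT INTO invoice (id, customer_id) VALUES (10, 1);

    -- The refactoring: a flag column marks rows as removed.
    ALTER TABLE customer ADD COLUMN is_deleted INTEGER NOT NULL DEFAULT 0;

    -- Readers that should only see live rows use this view.
    CREATE VIEW active_customer AS
        SELECT id, name FROM customer WHERE is_deleted = 0;
""")


def soft_delete_customer(connection, customer_id):
    """Mark a customer as deleted instead of removing the row."""
    with connection:
        connection.execute(
            "UPDATE customer SET is_deleted = 1 WHERE id = ?", (customer_id,)
        )


soft_delete_customer(conn, 1)

active = conn.execute(
    "SELECT id, name FROM active_customer ORDER BY id"
).fetchall()

# Count invoices whose customer row no longer exists.
orphans = conn.execute("""
    SELECT COUNT(*) FROM invoice
    LEFT JOIN customer ON customer.id = invoice.customer_id
    WHERE customer.id IS NULL
""").fetchone()[0]
```

The soft-deleted customer disappears from `active_customer`, while the invoice that references it still points at an existing row.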

Transformation: a change that alters the semantics of your database schema by adding new elements to it or by modifying existing elements.

Methods of Transformation category: Insert Data; Introduce New Column; Introduce New Table; Introduce View; Update Data.

Method refactoring: a change that improves the quality of a stored procedure, stored function, or trigger.

Methods of the Method Refactoring category: Parameterize Methods; Remove Parameter; Rename Method; Reorder Parameters; Replace Parameter with Explicit Methods; Consolidate Conditional Expression; Decompose Conditional; Extract Method; Introduce Variable; Remove Control Flag; Remove Middle Man; Replace Literal with Table Lookup; Replace Nested Conditional with Guard Clauses; Split Temporary Variable; Substitute Algorithm.

In 2019 Vladislav Struzik supplemented the categories of database refactoring with a new one: [3]

Access refactoring: a change that relates to data access.

Methods of the Access Refactoring category: [4] [5] Change Authentication Attributes; Revoke Authorization Privileges; Grant Authorization Privileges; Extract Database Schema; Merge Database Schemas.

Process of database refactoring

The process of database refactoring is the act of applying database refactorings to evolve an existing database schema; database refactoring is a core practice of evolutionary database design. Three considerations need to be taken into account:

  1. How a single refactoring is implemented
  2. How database refactorings are tracked and shared within organizations
  3. How a series of database refactorings are applied
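One way the second and third considerations might be handled is to keep an ordered list of refactorings and record each applied one in a log table. The sketch below, in Python with sqlite3, rests on several assumptions: the `refactoring_log` table, the `REFACTORINGS` list, and the `apply_pending` function are hypothetical, and the connection is opened with `isolation_level=None` so that transactions are controlled explicitly.

```python
import sqlite3

# Hypothetical ordered list of refactorings: (version, description, SQL).
REFACTORINGS = [
    (1, "Introduce New Table: customer",
     "CREATE TABLE customer (id INTEGER PRIMARY KEY, fname TEXT)"),
    (2, "Introduce New Column: customer.email",
     "ALTER TABLE customer ADD COLUMN email TEXT"),
    (3, "Introduce Index: customer.email",
     "CREATE INDEX customer_email_idx ON customer (email)"),
]


def apply_pending(connection, refactorings):
    """Apply refactorings not yet in the log, in version order.

    Assumes the connection was opened with isolation_level=None,
    so BEGIN/COMMIT/ROLLBACK are issued explicitly here.
    """
    connection.execute(
        "CREATE TABLE IF NOT EXISTS refactoring_log ("
        "version INTEGER PRIMARY KEY, description TEXT NOT NULL)"
    )
    applied = {
        row[0]
        for row in connection.execute("SELECT version FROM refactoring_log")
    }
    newly_applied = []
    for version, description, sql in sorted(refactorings):
        if version in applied:
            continue
        # The schema change and its log entry share one transaction.
        connection.execute("BEGIN")
        try:
            connection.execute(sql)
            connection.execute(
                "INSERT INTO refactoring_log (version, description) "
                "VALUES (?, ?)",
                (version, description),
            )
            connection.execute("COMMIT")
        except Exception:
            connection.execute("ROLLBACK")
            raise
        newly_applied.append(version)
    return newly_applied


conn = sqlite3.connect(":memory:", isolation_level=None)
first_run = apply_pending(conn, REFACTORINGS[:2])
second_run = apply_pending(conn, REFACTORINGS)
```

Running the function a second time applies only the refactorings that are missing from the log, so the same list can be shared and replayed against databases that are at different versions.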

References

  1. Scott Ambler, Pramod Sadalage. Refactoring Databases: Evolutionary Database Design. Addison-Wesley Professional; 1st edition (March 3, 2006). 384 p. ISBN 978-0321774514.
  2. Scott Ambler. Catalog of Database Refactorings. Agile Data. URL: http://agiledata.org/essays/databaseRefactoringCatalog.html
  3. Struzik, V. A. The access refactoring category (in Ukrainian) / V. A. Struzik // Computer Science, Information Technologies and Control Systems: International Scientific and Technical Conference of Students, Postgraduates and Young Scientists, 27–29 November 2019. – Ivano-Frankivsk: Vasyl Stefanyk Precarpathian National University, 2019. – pp. 20–21. URL: http://dspace.nuft.edu.ua/jspui/handle/123456789/31516
  4. Struzik, V. A. The access refactoring category (in Ukrainian) / V. A. Struzik, S. V. Hrybkov, V. V. Chobanu // Scientific Works of NUFT. – Vol. 26, No. 2. – 2020. – pp. 31–49. URL: http://dspace.nuft.edu.ua/jspui/handle/123456789/31515
  5. Vladislav Struzik, PhD. Refactoring: yesterday, today, tomorrow. URL: https://medium.com/@struzik/refactoring-yesterday-today-tomorrow-7fc8c845cfb1