Database scalability


Database scalability is the ability of a database to handle changing demands by adding or removing resources. Databases use a host of techniques to cope with growing data volumes and workloads. [1]


History

The early history of database scalability was one of providing service on ever smaller computers. The first database management systems, such as IMS, ran on mainframe computers. The second generation, including Ingres, Informix, Sybase, RDB and Oracle, emerged on minicomputers. The third generation, including dBase and Oracle (again), ran on personal computers. [2]

During the same period, attention turned to handling more data and more demanding workloads. One key software innovation in the late 1980s was to reduce update locking granularity from tables and disk blocks to individual rows. This eliminated a critical scalability bottleneck: coarser locks could delay access to rows that were not directly involved in a transaction, so systems that used them gained little from additional resources. [3]

Once software limitations had been addressed, attention turned to hardware. Innovation occurred in many areas. The first was to support multiprocessor computers. This involved allowing multiple processors to handle database requests simultaneously, without blocking each other. This evolved into support for multi-core processors.

A much more significant change involved allowing distributed transactions to affect data stored on separate computers, using the two-phase commit protocol and establishing the shared-nothing architecture. [4]

Still later, Oracle introduced the shared-everything architecture, which provided full functionality on multi-server clusters. [5]

Another innovation was storing copies of tables on multiple computers (database replication), which improved both availability (processing could continue on a copy even if the main system was unavailable) and scalability, particularly for query/analysis workloads, in that requests could be routed to a copy if the primary reached its capacity. [6]

In the early twenty-first century, NoSQL systems gained favor over relational databases for some workloads. Motivations included still greater scalability and support for documents and other "non-relational" data types. Often sacrificed were the strict ACID consistency protocols that guaranteed perfect consistency at all times, in favor of eventual consistency, which ensures that all nodes will eventually return the latest data. Some systems even allowed transactions to occasionally be lost, as long as the system could handle sufficiently many requests. [7] The most prominent early system was Google's BigTable/MapReduce, developed in 2004. It achieved near-linear scalability across multiple server farms, at the cost of features such as multi-row transactions and joins. [8]

In 2007, the first NewSQL system, H-Store, was developed. NewSQL systems attempt to combine NoSQL scalability with ACID transactions and SQL interfaces. [9]

Dimensions

Database scalability has three basic dimensions: amount of data, volume of requests and size of requests. Requests come in many sizes: transactions generally affect small amounts of data, but may approach thousands per second; analytic queries are generally fewer, but may access more data. A related concept is elasticity, the ability of a system to transparently add and subtract capacity to meet changing workloads. [10]

Vertical

Vertical database scaling implies that the database system can fully exploit maximally configured systems, typically multiprocessors with large memories and vast storage capacity. Such systems are relatively simple to administer, but may offer reduced availability. However, any single computer has a maximum configuration. If workloads expand beyond that limit, the choices are either to migrate to a different, still larger system, or to rearchitect the system to achieve horizontal scalability. [10]

Horizontal

Horizontal database scaling involves adding more servers to work on a single workload. Most horizontally scalable systems come with functionality compromises. If an application requires more functionality, migration to a vertically scaled system may be preferable. [10]

Techniques

Hardware

Databases run on individual hardware ranging in capacity from smartwatches to supercomputers, as well as on multiple transparently reconfigurable server farms. [2] Databases have also scaled vertically to run on 64-bit microprocessors, multi-core CPUs, and large SMP multiprocessors, using multi-threaded implementations.

Contention

Fully exploiting a hardware configuration requires a variety of locking techniques, ranging from locking an entire database, to entire tables, to disk blocks, to individual table rows. The appropriate lock granularity depends on the workload: the smaller the object that is locked, the lower the chance of requests blocking each other and leaving hardware idle. Row locks are typically necessary to support high-volume transaction processing applications, at the cost of the processing overhead of managing a much larger number of locks. [3]
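
The trade-off can be illustrated outside any particular database engine. The sketch below is a toy Python model contrasting one table-wide lock with per-row locks; the class names and in-memory dictionaries are purely illustrative, not how a real DBMS implements its lock manager.

```python
import threading


class CoarseLockTable:
    """One lock guards the whole table: any two updates serialize,
    even when they touch unrelated rows."""

    def __init__(self, row_ids):
        self._lock = threading.Lock()
        self._rows = {row_id: None for row_id in row_ids}

    def update(self, row_id, value):
        with self._lock:              # blocks every other update on the table
            self._rows[row_id] = value


class RowLockTable:
    """One lock per row: updates block each other only when they touch
    the same row, so more of the hardware stays busy."""

    def __init__(self, row_ids):
        self._locks = {row_id: threading.Lock() for row_id in row_ids}
        self._rows = {row_id: None for row_id in row_ids}

    def update(self, row_id, value):
        with self._locks[row_id]:     # blocks only concurrent updates to this row
            self._rows[row_id] = value
```

The finer-grained version trades memory and bookkeeping (one lock object per row) for concurrency, mirroring the overhead/throughput trade-off described above.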

Further, some systems ensure that a query sees a time-consistent view of the database by locking the data that the query is examining, preventing updates from modifying it but stalling other work. Alternatively, some databases use multi-version read consistency to avoid (blocking) read locks while still providing consistent query results. [11]
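
A minimal sketch of the multi-version idea follows, assuming a single in-memory store with a global version counter; real implementations add conflict detection, garbage collection of old versions, and durability, and all names here are hypothetical.

```python
import itertools
import threading


class MVCCStore:
    """Toy multi-version store: writers append new row versions,
    readers see a consistent snapshot and never take read locks."""

    def __init__(self):
        self._clock = itertools.count(1)     # monotonically increasing version counter
        self._versions = {}                  # key -> list of (version, value), ascending
        self._write_lock = threading.Lock()

    def write(self, key, value):
        with self._write_lock:               # writers coordinate; readers never block
            version = next(self._clock)
            self._versions.setdefault(key, []).append((version, value))

    def snapshot(self):
        """Capture the current version; later writes stay invisible to this reader."""
        return next(self._clock)

    def read(self, key, snapshot_version):
        for version, value in reversed(self._versions.get(key, [])):
            if version <= snapshot_version:
                return value                 # newest version visible at the snapshot
        return None


store = MVCCStore()
store.write("balance", 100)
snap = store.snapshot()
store.write("balance", 250)                  # committed after the reader's snapshot
print(store.read("balance", snap))           # 100 -- the reader's view is unchanged
```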

Another potential bottleneck can occur in some systems when many requests attempt to access the same data at the same time. For example, in OLTP systems, many transactions may attempt to insert rows into the same table simultaneously. In a shared-nothing system, at any given moment, all such inserts are processed by the single server that manages that partition (shard) of the table, possibly overwhelming it while the rest of the system has little to do. Many such tables use a sequence number as their primary key that increases for each newly inserted row. The index on that key can also experience contention ("overheat") as it processes those inserts. One solution is to reverse the digits of the primary key, which spreads the inserts into both the table and its index across multiple parts of the database. [12]
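
A back-of-the-envelope sketch of the digit-reversal idea (an illustration of the principle only, not the reverse-key index feature of any particular product; the key width and partition count are arbitrary assumptions):

```python
KEY_WIDTH = 10                    # assume sequence numbers of at most 10 decimal digits
KEY_SPACE = 10 ** KEY_WIDTH


def reverse_key(seq_no: int) -> int:
    """Reverse the decimal digits of a zero-padded sequence number."""
    return int(str(seq_no).zfill(KEY_WIDTH)[::-1])


def partition_for(key: int, n_partitions: int = 10) -> int:
    """Range-partition the key space into equal slices."""
    return key * n_partitions // KEY_SPACE


for seq in range(1000, 1006):
    print(seq, "->", partition_for(seq), "vs", partition_for(reverse_key(seq)))
# Consecutive plain keys all land in partition 0 (a single "hot" shard/index block),
# while their reversed counterparts scatter across partitions 0 through 5.
```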

Partitioning

A basic technique is to split large tables into multiple partitions based on ranges of values in a key field. For example, the data for each year could be held on a separate disk drive or on a separate computer. Partitioning removes the limits on the size of a single table.
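
As a rough illustration, the sketch below routes rows to per-year partitions based on a date column; the row layout and the order_date field are hypothetical, and the dictionaries stand in for separate drives or servers.

```python
import datetime


class YearPartitionedTable:
    """Toy range partitioning: each year's rows live in a separate store."""

    def __init__(self):
        self._partitions = {}                       # year -> list of rows

    def insert(self, row):
        year = row["order_date"].year               # partition key: range of dates
        self._partitions.setdefault(year, []).append(row)

    def query_year(self, year):
        """A query restricted to one year touches only that partition."""
        return self._partitions.get(year, [])


table = YearPartitionedTable()
table.insert({"id": 1, "order_date": datetime.date(2023, 5, 1), "amount": 40})
table.insert({"id": 2, "order_date": datetime.date(2024, 1, 9), "amount": 75})
print(table.query_year(2024))                       # only the 2024 partition is scanned
```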

Replication

Replicated databases maintain copies of tables or databases on multiple computers. This scaling technique is particularly convenient for seldom or never-updated data, such as transaction history or tax tables. [6]
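
A minimal sketch of the idea, assuming naive synchronous copying to every replica (production systems replicate asynchronously or via logs and handle failover); the names are illustrative.

```python
import random


class ReplicatedTable:
    """Toy primary/replica setup: writes go to the primary and are copied
    to every replica; reads can be served by any copy."""

    def __init__(self, n_replicas=2):
        self.primary = {}
        self.replicas = [{} for _ in range(n_replicas)]

    def write(self, key, value):
        self.primary[key] = value
        for replica in self.replicas:                           # naive synchronous copy
            replica[key] = value

    def read(self, key):
        copy = random.choice([self.primary] + self.replicas)    # spread read load
        return copy.get(key)


tax_tables = ReplicatedTable(n_replicas=3)
tax_tables.write("rate:2024", 0.21)
print(tax_tables.read("rate:2024"))                 # served by any of the four copies
```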

Clustered computers

A variety of approaches are used to scale beyond the limits of a single computer. HP Enterprise's NonStop SQL uses the shared-nothing architecture, in which neither data nor memory is shared across server boundaries. A coordinator routes database requests to the correct server. This architecture provides near-linear scalability.
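
The coordinator's routing role can be sketched as follows. The hash-based placement and the dictionaries standing in for servers are illustrative assumptions, not NonStop SQL's actual mechanism.

```python
import hashlib


class Coordinator:
    """Toy shared-nothing router: each server owns a disjoint shard of the data;
    the coordinator only routes requests, it never touches the data itself."""

    def __init__(self, servers):
        self.servers = servers                      # plain dicts standing in for servers

    def _owner(self, key):
        digest = hashlib.sha1(str(key).encode()).hexdigest()
        return self.servers[int(digest, 16) % len(self.servers)]

    def put(self, key, value):
        self._owner(key)[key] = value               # exactly one server handles this key

    def get(self, key):
        return self._owner(key).get(key)


cluster = Coordinator(servers=[{} for _ in range(4)])
cluster.put("customer:42", {"name": "Ada"})
print(cluster.get("customer:42"))
```

Adding servers adds capacity because keys, and therefore work, are spread across more owners; the near-linear scaling holds as long as keys hash evenly and requests rarely span servers.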

The widely supported X/Open XA standard employs a global transaction monitor to coordinate distributed transactions among semi-autonomous XA-compliant transaction resources.
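
The heart of the protocol can be outlined in a few lines. This is a simplified model with no durable logging, timeouts, or recovery, and the participant names are hypothetical.

```python
class Participant:
    """Toy XA-style resource: votes in phase one, then commits or rolls back."""

    def __init__(self, name):
        self.name = name
        self.pending = None

    def prepare(self, work):
        self.pending = work          # a real resource durably records its intent here
        return True                  # vote "yes"; a real resource may vote "no"

    def commit(self):
        print(f"{self.name}: committed {self.pending}")

    def rollback(self):
        print(f"{self.name}: rolled back {self.pending}")


def two_phase_commit(participants, work):
    # Phase 1: ask every participant to prepare and collect the votes.
    votes = [p.prepare(work) for p in participants]
    # Phase 2: commit everywhere only if every vote was "yes"; otherwise roll back.
    if all(votes):
        for p in participants:
            p.commit()
    else:
        for p in participants:
            p.rollback()


two_phase_commit([Participant("orders_db"), Participant("billing_db")], "debit $10")
```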

Oracle RAC uses a different model to achieve scalability, based on a "shared-everything" architecture. This approach incorporates the shared-disk approach, which allows multiple computers to access any disk in the cluster. Network-attached storage (NAS) and storage area networks (SANs) coupled with local area networks and Fibre Channel technology enable such configurations. The approach includes a "shared" logical cache in which data that has been cached in memory on one server is made available to other servers without requiring them to read the data from disk again. Each page is moved from server to server to satisfy requests. Updates generally happen very quickly, so that a "popular" page can be updated by multiple transactions with little delay. This approach is claimed to support clusters containing up to 100 servers. [13]
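
A toy model of the shared-cache idea follows. It illustrates only the shipping of a cached page between servers instead of re-reading it from disk, and is not a description of Oracle's actual Cache Fusion protocol; all names are hypothetical.

```python
class ClusterCache:
    """Toy 'shared' logical cache: if another server already holds a page in
    memory, ship it across the interconnect instead of re-reading the disk."""

    def __init__(self, n_servers):
        self.caches = [dict() for _ in range(n_servers)]   # per-server buffer caches

    def get_page(self, server_id, page_id, read_from_disk):
        local = self.caches[server_id]
        if page_id in local:
            return local[page_id]                  # local cache hit
        for other in self.caches:
            if page_id in other:
                local[page_id] = other[page_id]    # page shipped between servers
                return local[page_id]
        local[page_id] = read_from_disk(page_id)   # fall back to shared storage
        return local[page_id]


cache = ClusterCache(n_servers=3)
disk = {"page-7": "customer rows ..."}
print(cache.get_page(0, "page-7", disk.get))       # read from "disk"
print(cache.get_page(1, "page-7", disk.get))       # shipped from server 0's cache
```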

Some researchers question the inherent limitations of relational database management systems. GigaSpaces, for example, contends that space-based architecture is required to achieve performance and scalability. Base One makes the case for extreme scalability within mainstream relational database technology. [14]


References

  1. Bondi, André B. (2000). "Characteristics of scalability and their impact on performance". Proceedings of the Second International Workshop on Software and Performance – WOSP '00. p. 195. doi:10.1145/350391.350432. ISBN 158113195X.
  2. Chopra, Rajiv (2010). Database Management System (DBMS): A Practical Approach. S. Chand Publishing. p. 33. ISBN 9788121932455.
  3. "Row locks vs table locks in Oracle". www.dba-oracle.com. Retrieved 2019-04-11.
  4. "The Advantages of a Shared Nothing Architecture for Truly Non-Disruptive Upgrades". solidfire.com. 2014-09-17. Archived from the original on 2015-04-24. Retrieved 2015-04-21.
  5. "Real Application Clusters Administration and Deployment Guide". docs.oracle.com. Retrieved 2019-04-11.
  6. "A Primer on Database Replication". www.brianstorti.com. Retrieved 2019-04-11.
  7. Zapletal, Martin (2015-06-11). "Large volume data analysis on the Typesafe Reactive Platform".
  8. "Overview of Cloud Bigtable | Cloud Bigtable Documentation". Google Cloud. Retrieved 2019-04-11.
  9. Aslett, Matthew (2011). "How Will The Database Incumbents Respond To NoSQL And NewSQL?" (PDF). 451 Group (published 2011-04-04). Retrieved 2012-07-06.
  10. Branson, Tony (2016-12-06). "Two main approaches to database scalability". Infosecurity Magazine. Retrieved 2019-04-11.
  11. "Clojure – Refs and Transactions". clojure.org. Retrieved 2019-04-12.
  12. "Introduction To Reverse Key Indexes: Part I". Richard Foote's Oracle Blog. 2008-01-14. Retrieved 2019-04-13.
  13. "Clustering" (PDF). Oracle.com. Retrieved 2012-11-07.
  14. Base One (2007). "Database Scalability – Dispelling myths about the limits of database-centric architecture". Retrieved May 23, 2007.