Very large database

Last updated

A very large database, (originally written very large data base) or VLDB, [1] is a database that contains a very large amount of data, so much that it can require specialized architectural, management, processing and maintenance methodologies. [2] [3] [4] [5]

Contents

Definition

The vague adjectives of very and large allow for a broad and subjective interpretation, but attempts at defining a metric and threshold have been made. Early metrics were the size of the database in a canonical form via database normalization or the time for a full database operation like a backup. Technology improvements have continually changed what is considered very large. [6] [7]

One definition has suggested that a database has become a VLDB when it is "too large to be maintained within the window of opportunity… the time when the database is quiet". [8]

Sizes of a VLDB database

There is no absolute amount of data that can be cited. For example, one cannot say that any database with more than 1 TB of data is considered a VLDB. This absolute amount of data has varied over time as computer processing, storage and backup methods have become better able to handle larger amounts of data. [5] That said, VLDB issues may start to appear when 1 TB is approached, [8] [9] and are more than likely to have appeared as 30 TB or so is exceeded. [10]

VLDB challenges

Key areas where a VLDB may present challenges include configuration, storage, performance, maintenance, administration, availability and server resources. [11] :11

Configuration

Careful configuration of databases that lie in the VLDB realm is necessary to alleviate or reduce issues raised by VLDB databases. [11] :36–53 [12]

Administration

The complexities of managing a VLDB can increase exponentially for the database administrator as database size increases. [13]

Availability and maintenance

When dealing with VLDB operations relating to maintenance and recovery such as database reorganizations and file copies which were quite practical on a non-VLDB take very significant amounts of time and resources for a VLDB database. [14] In particular it typically infeasible to meet a typical recovery time objective (RTO), the maximum expected time a database is expected to be unavailable due to interruption, by methods which involve copying files from disk or other storage archives. [13] To overcome these issues techniques such as clustering, cloned/replicated/standby databases, file-snapshots, storage snapshots or a backup manager may help achieve the RTO and availability, although individual methods may have limitations, caveats, license, and infrastructure requirements while some may risk data loss and not meet the recovery point objective (RPO). [15] [16] [13] [17] [18] For many systems only geographically remote solutions may be acceptable. [19]

Backup and recovery

Best practice is for backup and recovery to be architectured in terms of the overall availability and business continuity solution. [20] [21]

Performance

Given the same infrastructure there may typically be a decrease in performance, that is increase in response time as database size increases. Some accesses will simply have more data to process (scan) which will take proportionally longer (linear time); while the indexes used to access data may grow slightly in height requiring perhaps an extra storage access to reach the data (sub-linear time). [22] Other effects can be caching becoming less efficient because proportionally less data can be cached and while some indexes such as the B+ automatically sustain well with growth others such as a hash table may need to be rebuilt.

Should an increase in database size cause the number of accessors of the database to increase then more server and network resources may be consumed, and the risk of contention will increase. Some solutions to regaining performance include partitioning, clustering, possibly with sharding, or use of a database machine. [23] :390 [24]

Partitioning

Partitioning may be able assist the performance of bulk operations on a VLDB including backup and recovery., [25] bulk movements due to information lifecycle management (ILM), [26] :3 [27] :105–118 reducing contention [27] :327–329 as well as allowing optimization of some query processing. [27] :215–230

Storage

In order to satisfy needs of a VLDB the database storage needs to have low access latency and contention, high throughput, and high availability.

Server resources

The increasing size of a VLDB may put pressure on server and network resources and a bottleneck may appear that may require infrastructure investment to resolve. [13] [28]

Relationship to big data

VLDB is not the same as big data , but the storage aspect of big data may involve a VLDB database. [2] That said some of the storage solutions supporting big data were designed from the start to support large volumes of data, so database administrators may not encounter VLDB issues that older versions of traditional RDBMS's might encounter. [29]

See also

Related Research Articles

<span class="mw-page-title-main">MySQL</span> SQL database engine software

MySQL is an open-source relational database management system (RDBMS). Its name is a combination of "My", the name of co-founder Michael Widenius's daughter My, and "SQL", the acronym for Structured Query Language. A relational database organizes data into one or more data tables in which data may be related to each other; these relations help structure the data. SQL is a language that programmers use to create, modify and extract data from the relational database, as well as control user access to the database. In addition to relational databases and SQL, an RDBMS like MySQL works with an operating system to implement a relational database in a computer's storage system, manages users, allows for network access and facilitates testing database integrity and creation of backups.

<span class="mw-page-title-main">Ingres (database)</span> Database software

Ingres Database is a proprietary SQL relational database management system intended to support large commercial and government applications.

<span class="mw-page-title-main">IBM Db2</span> Relational model database server

Db2 is a family of data management products, including database servers, developed by IBM. It initially supported the relational model, but was extended to support object–relational features and non-relational structures like JSON and XML. The brand name was originally styled as DB2 until 2017, when it changed to its present form.

A database administrator (DBA) manages computer databases. The role may include capacity planning, installation, configuration, database design, migration, performance monitoring, security, troubleshooting, as well as backup and data recovery.

SAP ASE (Adaptive Server Enterprise), originally known as Sybase SQL Server, and also commonly known as Sybase DB or Sybase ASE, is a relational model database server developed by Sybase Corporation, which later became part of SAP SE. ASE was developed for the Unix operating system, and is also available for Microsoft Windows.

In information technology, a backup, or data backup is a copy of computer data taken and stored elsewhere so that it may be used to restore the original after a data loss event. The verb form, referring to the process of doing so, is "back up", whereas the noun and adjective form is "backup". Backups can be used to recover data after its loss from data deletion or corruption, or to recover data from an earlier time. Backups provide a simple form of disaster recovery; however not all backup systems are able to reconstitute a computer system or other complex configuration such as a computer cluster, active directory server, or database server.

MySQL Cluster is a technology providing shared-nothing clustering and auto-sharding for the MySQL database management system. It is designed to provide high availability and high throughput with low latency, while allowing for near linear scalability. MySQL Cluster is implemented through the NDB or NDBCLUSTER storage engine for MySQL.

The following tables compare general and technical information for a number of relational database management systems. Please see the individual products' articles for further information. Unless otherwise specified in footnotes, comparisons are based on the stable versions without any add-ons, extensions or external programs.

Online transaction processing (OLTP) is a type of database system used in transaction-oriented applications, such as many operational systems. "Online" refers to that such systems are expected to respond to user requests and process them in real-time. The term is contrasted with online analytical processing (OLAP) which instead focuses on data analysis.

IBM Storage Protect is a data protection platform that gives enterprises a single point of control and administration for backup and recovery. It is the flagship product in the IBM Spectrum Protect family.

Veritas Backup Exec is a data protection software product designed for customers with mixed physical and virtual environments, and who are moving to public cloud services. Supported platforms include VMware and Hyper-V virtualization, Windows and Linux operating systems, Amazon S3, Microsoft Azure and Google Cloud Storage, among others. All management and configuration operations are performed with a single user interface. Backup Exec also provides integrated deduplication, replication, and disaster recovery capabilities and helps to manage multiple backup servers or multi-drive tape loaders.

SAP IQ is a column-based, petabyte scale, relational database software system used for business intelligence, data warehousing, and data marts. Produced by Sybase Inc., now an SAP company, its primary function is to analyze large amounts of data in a low-cost, highly available environment. SAP IQ is often credited with pioneering the commercialization of column-store technology.

An embedded database system is a database management system (DBMS) which is tightly integrated with an application software; it is embedded in the application. It is a broad technology category that includes:

A database shard, or simply a shard, is a horizontal partition of data in a database or search engine. Each shard is held on a separate database server instance, to spread load.

Amazon Relational Database Service is a distributed relational database service by Amazon Web Services (AWS). It is a web service running "in the cloud" designed to simplify the setup, operation, and scaling of a relational database for use in applications. Administration processes like patching the database software, backing up databases and enabling point-in-time recovery are managed automatically. Scaling storage and compute resources can be performed by a single API call to the AWS control plane on-demand. AWS does not offer an SSH connection to the underlying virtual machine as part of the managed service.

Resilient File System (ReFS), codenamed "Protogon", is a Microsoft proprietary file system introduced with Windows Server 2012 with the intent of becoming the "next generation" file system after NTFS.

H-Store is an experimental database management system (DBMS). It was designed for online transaction processing applications. H-Store was developed by a team at Brown University, Carnegie Mellon University, the Massachusetts Institute of Technology, and Yale University in 2007 by researchers Michael Stonebraker, Sam Madden, Andy Pavlo and Daniel Abadi.

<span class="mw-page-title-main">Oracle NoSQL Database</span> Distributed database

Oracle NoSQL Database is a NoSQL-type distributed key-value database from Oracle Corporation. It provides transactional semantics for data manipulation, horizontal scalability, and simple administration and monitoring.

The following outline is provided as an overview of and topical guide to MySQL:

<span class="mw-page-title-main">MSP360</span> Application service provider

MSP360, formerly CloudBerry Lab, is a software and application service provider company that develops online backup, remote desktop and file management products integrated with more than 20 cloud storage providers.

References

  1. "Oracle Database Online Documentation 11g Release 1 (11.1) / Database Administration Database Concepts" . oracle. 18 Very Large Databases (VLDB). Retrieved 3 October 2018.
  2. 1 2 "Very Large Database (VLDB)". Technopedia. Archived from the original on 4 July 2018. Retrieved 3 October 2018.
  3. Gaines, R. S. and R. Gammill. Very Large Data Bases: An Emerging Research Area, Informal working paper, RAND Corporation
  4. Data Processing Magazine . North American Publishing Company. 1964. p. 18,58.
  5. 1 2 Widlake, Marin (18 September 2009). "What is a VLDB?". mwidlake. Archived from the original on 6 October 2018. Retrieved 7 October 2018.
  6. Sidley, Edgar H. (1 April 1980). Encyclopedia of Computer Science and Technology: Volume 14 - Very Large Data Base Systems to Zero-Memory and Markov Information Source. CRC Press. pp. 1–18. ISBN   9780824722142.
  7. Gerritsen, Rob; Morgan, Howard; Zisman, Michael (June 1977). "On some metrics for databases or what is a very large database?". ACM SIGMOD Record. 9 (1): 50–74. doi: 10.1145/984382.984393 . ISSN   0163-5808. S2CID   6359244.
  8. 1 2 Rankins, Ray; Jensen, Paul; Bertucci, Paul (18 December 2002). "21" . Microsoft SQL Server 2000 (2nd ed.). SAMS. ISBN   978-0672324673. Administering Very Large SQL Server Databases.
  9. "Oracle Database Release 18 - VLDB and Partitioning Guide" . Oracle. 1 Introduction to Very Large Databases. Archived from the original on 3 October 2018. Retrieved 3 October 2018.
  10. "The Very Large Database Problem - How to Backup & Recover 30–100 TB Databases" (PDF). actifio. Archived (PDF) from the original on 19 February 2018.
  11. 1 2 Hussain, Syed Jaffer (2014). "Tuning & Applying Best Practices On Very Large Databases (VLDB)" (PDF). Sangam: AIOUG. Archived (PDF) from the original on 4 October 2018.
  12. Chaves, Warner (7 January 2015). "Top 10 Must-Do Items for your SQL Server Very Large Database". SQLTURBO. Archived from the original on 13 December 2017. Retrieved 5 October 2018.
  13. 1 2 3 4 Furman, Dimitri (22 January 2018). Rajesh Setlem; Mike Weiner; Xiaochen Wu (eds.). "SQL Server VLDB in Azure: DBA Tasks Made Simple". MSDN. Archived from the original on 6 October 2018. Retrieved 6 October 2018.
  14. "Specialized Requirements for Relational Data Warehouse Servers". Red Brick Systems, Inc. 21 June 1996. Archived from the original on 10 October 1997.
  15. "Cluster design considerations". Crouchbase. Archived from the original on 17 October 2018. Retrieved 17 October 2017.
  16. "Cross Datacenter Replication (XDCR)". Crouchbase. Archived from the original on 17 October 2018. Retrieved 17 October 2017.
  17. Chien, Tim. "Snapshots Are NOT Backups" . Oracle technetwork. Archived from the original on 7 September 2018. Retrieved 10 October 2018.
  18. "Using a split mirror as a backup image" . IBM Knowledge Center. Archived from the original on 9 January 2018. Retrieved 10 October 2018.
  19. "Chapter 1 High Availability and Scalability". dev.mysql. Archived from the original on 15 December 2016. Retrieved 12 October 2018.
  20. Brooks, Charlotte; Leung, Clem; Mirza, Aslam; Neal, Curtis; Qiu, Yin Lei; Sing, John; Wong, Francis TH; Wright, Ian R (March 2007). "Chapter 1. Three Business solution segments defined". IBM System Storage Business Continuity: Part 2 Solutions Guide. IBM Redbooks. ISBN   978-0738489728.
  21. Akhtar, Ali Navid; Buchholtz, Jeff; Ryan, Michael; Setty, Kumar (2012). "Database Backup and Recovery Best Practices". Archived from the original on 29 June 2018. Retrieved 12 October 2012.
  22. Tariq, Ovais (14 July 2011). "Understanding B+tree Indexes and how they Impact Performance". ovaistariq.net. Archived from the original on 7 February 2018. Retrieved 10 October 2018.
  23. Shrestha, Raju (2017). High Availability and Performance of Database in the Cloud - Traditional Master-slave Replication versus Modern Cluster-based Solutions. 7th International Conference on Cloud Computing and Services. Vol. 1: CLOSER. SCITEPRESS – Science and Technology Publications, Lda. doi: 10.5220/0006294604130420 . hdl: 10642/6140 . ISBN   978-989-758-243-1. Archived from the original on 17 October 2018.
  24. "Encyclopedia". Definition of: database machine. Archived from the original on 4 July 2016. Retrieved 10 October 2018.
  25. Burleson, Donald (26 March 2015). "Oracle Backup VLDB tips". Burleson Consulting. Archived from the original on 30 June 2017. Retrieved 11 October 2016.
  26. "Oracle Partitioning in Oracle Database 12c Release 2 Extreme Data Management and Performance for every System" (PDF). Oracle. March 2017. Archived (PDF) from the original on 15 December 2017. Retrieved 17 October 2018.
  27. 1 2 3 Teske, Thomas (8 February 2018). Get the best out of Oracle Partitioning - A practical guide and reference (PDF) (Speech). Cern. Hermann Bär. 40-S2-C01 - Salle Curie (CERN): Oracle. Archived (PDF) from the original on 12 October 2018. Retrieved 12 October 2018.{{cite speech}}: CS1 maint: location (link)
  28. Steel, Phil; Poggemeyer, Liza; Plett, Corey (1 August 2018). "Server Hardware Performance Considerations". Microsoft IT Pro Center. Archived from the original on 17 October 2018. Retrieved 17 October 2018.
  29. Li, Yishan; Manoharan, Sathiamoorthy (2013). A performance comparison of SQL and NoSQL databases. 2013 IEEE Pacific Rim Conference on Communications, Computers and Signal Processing (PACRIM). IEEE. p. 15. doi:10.1109/PACRIM.2013.6625441. ISBN   978-1-4799-1501-9.