MonetDB

Last updated
MonetDB
Developer(s) MonetDB Foundation
Stable release
Aug2024 (11.51) [1] /  ()
Repository
Written in C
Operating system Cross-platform
Type Column-oriented DBMS
RDBMS
License Mozilla Public License, version 2.0
Website www.monetdb.org

MonetDB is an open-source column-oriented relational database management system (RDBMS) originally developed at the Centrum Wiskunde & Informatica (CWI) in the Netherlands. It is designed to provide high performance on complex queries against large databases, such as combining tables with hundreds of columns and millions of rows. MonetDB has been applied in high-performance applications for online analytical processing, data mining, geographic information system (GIS), [2] Resource Description Framework (RDF), [3] text retrieval and sequence alignment processing. [4]

Contents

History

Data mining projects in the 1990s required improved analytical database support. This resulted in a CWI spin-off called Data Distilleries, which used early MonetDB implementations in its analytical suite. Data Distilleries eventually became a subsidiary of SPSS in 2003, which in turn was acquired by IBM in 2009. [5]

MonetDB in its current form was first created in 2002 by doctoral student Peter Boncz and professor Martin L. Kersten as part of the 1990s' MAGNUM research project at University of Amsterdam. [6] It was initially called simply Monet, after the French impressionist painter Claude Monet. The first version under an open-source software license (a modified version of the Mozilla Public License) was released on September 30, 2004. When MonetDB version 4 was released into the open-source domain, many extensions to the code base were added by the MonetDB/CWI team, including a new SQL front end, supporting the SQL:2003 standard. [7]

MonetDB introduced innovations in all layers of the DBMS: a storage model based on vertical fragmentation, a modern CPU-tuned query execution architecture that often gave MonetDB a speed advantage over the same algorithm over a typical interpreter-based RDBMS. It was one of the first database systems to tune query optimization for CPU caches. MonetDB includes automatic and self-tuning indexes, run-time query optimization, and a modular software architecture. [8] [9]

By 2008, a follow-on project called X100 (MonetDB/X100) started, which evolved into the VectorWise technology. VectorWise was acquired by Actian Corporation, integrated with the Ingres database and sold as a commercial product. [10] [11]

In 2011 a major effort to renovate the MonetDB codebase was started. As part of it, the code for the MonetDB 4 kernel and its XQuery components were frozen. In MonetDB 5, parts of the SQL layer were pushed into the kernel. [7] The resulting changes created a difference in internal APIs, as it transitioned from MonetDB Instruction Language (MIL) to MonetDB Assembly Language (MAL). Older, no-longer maintained top-level query interfaces were also removed. First was XQuery, which relied on MonetDB 4 and was never ported to version 5. [12] The experimental Jaql interface support was removed with the October 2014 release. [13] With the July 2015 release, MonetDB gained support for read-only data sharding and persistent indices. In this release the deprecated streaming data module DataCell was also removed from the main codebase in an effort to streamline the code. [14] In addition, the license has been changed into the Mozilla Public License, version 2.0.

Architecture

MonetDB architecture is represented in three layers, each with its own set of optimizers. [15] The front end is the top layer, providing query interface for SQL, with SciQL and SPARQL interfaces under development. Queries are parsed into domain-specific representations, like relational algebra for SQL, and optimized. The generated logical execution plans are then translated into MonetDB Assembly Language (MAL) instructions, which are passed to the next layer. The middle or back-end layer provides a number of cost-based optimizers for the MAL. The bottom layer is the database kernel, which provides access to the data stored in Binary Association Tables (BATs). Each BAT is a table consisting of an Object-identifier and value columns, representing a single column in the database. [15]

MonetDB internal data representation also relies on the memory addressing ranges of contemporary CPUs using demand paging of memory mapped files, and thus departing from traditional DBMS designs involving complex management of large data stores in limited memory.

Query Recycling

Query recycling is an architecture for reusing the byproducts of the operator-at-a-time paradigm in a column store DBMS. Recycling makes use of the generic idea of storing and reusing the results of expensive computations. Unlike low-level instruction caches, query recycling uses an optimizer to pre-select instructions to cache. The technique is designed to improve query response times and throughput, while working in a self-organizing fashion. [16] The authors from the CWI Database Architectures group, composed of Milena Ivanova, Martin Kersten, Niels Nes and Romulo Goncalves, won the "Best Paper Runner Up" at the ACM SIGMOD 2009 conference for their work on Query Recycling. [17] [18]

Database Cracking

MonetDB was one of the first databases to introduce Database Cracking. Database Cracking is an incremental partial indexing and/or sorting of the data. It directly exploits the columnar nature of MonetDB. Cracking is a technique that shifts the cost of index maintenance from updates to query processing. The query pipeline optimizers are used to massage the query plans to crack and to propagate this information. The technique allows for improved access times and self-organized behavior. [19] Database Cracking received the ACM SIGMOD 2011 J.Gray best dissertation award. [20]

Components

A number of extensions exist for MonetDB that extend the functionality of the database engine. Due to the three-layer architecture, top-level query interfaces can benefit from optimizations done in the backend and kernel layers.

SQL

MonetDB/SQL is a top-level extension, which provides complete support for transactions in compliance with the SQL:2003 standard. [15]

GIS

MonetDB/GIS is an extension to MonetDB/SQL with support for the Simple Features Access standard of Open Geospatial Consortium (OGC). [2]

SciQL

SciQL an SQL-based query language for science applications with arrays as first class citizens. SciQL allows MonetDB to effectively function as an array database. SciQL is used in the European Union PlanetData Archived 2014-05-30 at the Wayback Machine and TELEIOS project, together with the Data Vault technology, providing transparent access to large scientific data repositories. [21] Data Vaults map the data from the distributed repositories to SciQL arrays, allowing for improved handling of spatio-temporal data in MonetDB. [22] SciQL will be further extended for the Human Brain Project. [23]

Data Vaults

Data Vault is a database-attached external file repository for MonetDB, similar to the SQL/MED standard. The Data Vault technology allows for transparent integration with distributed/remote file repositories. It is designed for scientific data data exploration and mining, specifically for remote sensing data. [22] There is support for the GeoTIFF (Earth observation), FITS (astronomy), MiniSEED (seismology) and NetCDF formats. [22] [24] The data is stored in the file repository in the original format, and loaded in the database in a lazy fashion, only when needed. The system can also process the data upon ingestion, if the data format requires it. [25] As a result, even very large file repositories can be efficiently analyzed, as only the required data is processed in the database. The data can be accessed through either the MonetDB SQL or SciQL interfaces. The Data Vault technology was used in the European Union's TELEIOS project, which was aimed at building a virtual observatory for Earth observation data. [24] Data Vaults for FITS files have also been used for processing astronomical survey data for The INT Photometric H-Alpha Survey (IPHAS) [26] [27]

SAM/BAM

MonetDB has a SAM/BAM module for efficient processing of sequence alignment data. Aimed at the bioinformatics research, the module has a SAM/BAM data loader and a set of SQL UDFs for working with DNA data. [4] The module uses the popular SAMtools library. [28]

RDF/SPARQL

MonetDB/RDF is a SPARQL-based extension for working with linked data, which adds support for RDF and allowing MonetDB to function as a triplestore. Under development for the Linked Open Data 2 project. [3]

R integration

MonetDB/R module allows for UDFs written in R to be executed in the SQL layer of the system. This is done using the native R support for running embedded in another application, inside the RDBMS in this case. Previously the MonetDB.R connector allowed the using MonetDB data sources and process them in an R session. The newer R integration feature of MonetDB does not require data to be transferred between the RDBMS and the R session, reducing overhead and improving performance. The feature is intended to give users access to functions of the R statistical software for in-line analysis of data stored in the RDBMS. It complements the existing support for C UDFs and is intended to be used for in-database processing. [29]

Python integration

Similarly to the embedded R UDFs in MonetDB, the database now has support for UDFs written in Python/NumPy. The implementation uses Numpy arrays (themselves Python wrappers for C arrays), as a result there is limited overhead - providing a functional Python integration with speed matching native SQL functions. The Embedded Python functions also support mapped operations, allowing user to execute Python functions in parallel within SQL queries. The practical side of the feature gives users access to Python/NumPy/SciPy libraries, which can provide a large selection of statistical/analytical functions. [30]

MonetDB embedded

Following the release of an embedded driver for R and R UDFs in MonetDB (MonetDB/R), the authors created an embedded version of MonetDB in R called MonetDBLite, embedded versions for Python and Java followed. They are distributed as embeddable packages, removing the need to manage a database server, required for the previous API integrations. The DBMS runs within the process itself, eliminating socket communication and serialisation overhead - greatly improving efficiency. The idea behind it is to easily embed an SQLite-like package with the performance of an in-memory optimized columnar store. [31]

Former extensions

A number of former extensions have been deprecated and removed from the stable code base over time. Some notable examples include an XQuery extension removed in MonetDB version 5; a JAQL extension, and a streaming data extension called Data Cell. [15] [32] [33]

See also

Related Research Articles

<span class="mw-page-title-main">Database</span> Organized collection of data in computing

In computing, a database is an organized collection of data or a type of data store based on the use of a database management system (DBMS), the software that interacts with end users, applications, and the database itself to capture and analyze the data. The DBMS additionally encompasses the core facilities provided to administer the database. The sum total of the database, the DBMS and the associated applications can be referred to as a database system. Often the term "database" is also used loosely to refer to any of the DBMS, the database system or an application associated with the database.

<span class="mw-page-title-main">Object database</span> Type of database management system

An object database or object-oriented database is a database management system in which information is represented in the form of objects as used in object-oriented programming. Object databases are different from relational databases which are table-oriented. A third type, object–relational databases, is a hybrid of both approaches. Object databases have been considered since the early 1980s.

A relational database (RDB) is a database based on the relational model of data, as proposed by E. F. Codd in 1970. A database management system used to maintain relational databases is a relational database management system (RDBMS). Many relational database systems are equipped with the option of using SQL for querying and updating the database.

The Centrum Wiskunde & Informatica is a research centre in the field of mathematics and theoretical computer science. It is part of the institutes organization of the Dutch Research Council (NWO) and is located at the Amsterdam Science Park. This institute is famous as the creation site of the programming language Python. It was a founding member of the European Research Consortium for Informatics and Mathematics (ERCIM).

Datalog is a declarative logic programming language. While it is syntactically a subset of Prolog, Datalog generally uses a bottom-up rather than top-down evaluation model. This difference yields significantly different behavior and properties from Prolog. It is often used as a query language for deductive databases. Datalog has been applied to problems in data integration, networking, program analysis, and more.

Query optimization is a feature of many relational database management systems and other databases such as NoSQL and graph databases. The query optimizer attempts to determine the most efficient way to execute a given query by considering the possible query plans.

A virtual observatory (VO) is a collection of interoperating data archives and software tools which utilize the internet to form a scientific research environment in which research programs can be conducted. Historically addressing astronomical subjects, space physics and geosciences have also been a focus of VO development.

.QL is an object-oriented query language used to retrieve data from relational database management systems. It is reminiscent of the standard query language SQL and the object-oriented programming language Java. .QL is an object-oriented variant of a logical query language called Datalog. Hierarchical data can therefore be naturally queried in .QL in a recursive manner.

SQL/XML or XML-Related Specifications is part 14 of the Structured Query Language (SQL) specification. In addition to the traditional predefined SQL data types like NUMERIC, CHAR, TIMESTAMP, ... it introduces the predefined data type XML together with constructors, several routines, functions, and XML-to-SQL data type mappings to support manipulation and storage of XML in a SQL database.

XLDB was a yearly conference about databases, data management and analytics held from 2007 to 2019. The definition of extremely large refers to data sets that are too big in terms of volume, and/or velocity, and/or variety to be handled using conventional solutions. This conference dealt with the high-end of very large databases (VLDB). It was conceived and chaired by Jacek Becla.

<span class="mw-page-title-main">Michael Stonebraker</span> American computer scientist (born 1943)

Michael Ralph Stonebraker is an American computer scientist specializing in database systems. Through a series of academic prototypes and commercial startups, Stonebraker's research and products are central to many relational databases. He is also the founder of many database companies, including Ingres Corporation, Illustra, Paradigm4, StreamBase Systems, Tamr, Vertica and VoltDB, and served as chief technical officer of Informix. For his contributions to database research, Stonebraker received the 2014 Turing Award, often described as "the Nobel Prize for computing."

Richard Thomas Snodgrass is an American computer scientist and writer and is professor emeritus at the University of Arizona. He is best known for his work on temporal databases, query language design, query optimization and evaluation, storage structures, database design, and ergalics.

Patrick Eugene O'Neil was an American computer scientist, an expert on databases, and a professor of computer science at the University of Massachusetts Boston. He is of Irish descent.

QUEL is a relational database query language, based on tuple relational calculus, with some similarities to SQL. It was created as a part of the Ingres DBMS effort at University of California, Berkeley, based on Codd's earlier suggested but not implemented Data Sub-Language ALPHA. QUEL was used for a short time in most products based on the freely available Ingres source code, most notably in an implementation called POSTQUEL supported by POSTGRES. As Oracle and DB2 gained market share in the early 1980s, most companies then supporting QUEL moved to SQL instead. QUEL continues to be available as a part of the Ingres DBMS, although no QUEL-specific language enhancements have been added for many years.

<span class="mw-page-title-main">Actian Vector</span> SQL relational database management system

Actian Vector is an SQL relational database management system designed for high performance in analytical database applications. It published record breaking results on the Transaction Processing Performance Council's TPC-H benchmark for database sizes of 100 GB, 300 GB, 1 TB and 3 TB on non-clustered hardware.

The following is provided as an overview of and topical guide to databases:

<span class="mw-page-title-main">Array DBMS</span> System that provides database services specifically for arrays

An array database management system or array DBMS provides database services specifically for arrays, that is: homogeneous collections of data items, sitting on a regular grid of one, two, or more dimensions. Often arrays are used to represent sensor, simulation, image, or statistics data. Such arrays tend to be Big Data, with single objects frequently ranging into Terabyte and soon Petabyte sizes; for example, today's earth and space observation archives typically grow by Terabytes a day. Array databases aim at offering flexible, scalable storage and retrieval on this information category.

NewSQL is a class of relational database management systems that seek to provide the scalability of NoSQL systems for online transaction processing (OLTP) workloads while maintaining the ACID guarantees of a traditional database system.

<span class="mw-page-title-main">Martin L. Kersten</span> Dutch computer scientist (born 1953)

Martin L. Kersten was a computer scientist with research focus on database architectures, query optimization and their use in scientific databases. He was an architect of the MonetDB system, an open-source column store for data warehouses, online analytical processing (OLAP) and geographic information systems (GIS). He has been (co-) founder of several successful spin-offs of the Centrum Wiskunde & Informatica (CWI).

Peter Boncz is a Dutch computer scientist specializing in database systems. He is a researcher at the Centrum Wiskunde & Informatica and professor at the Vrije Universiteit Amsterdam in the special chair of Large-Scale Analytical Data Management.

References

  1. "MonetDB Release Notes". 19 August 2024.
  2. 1 2 "GeoSpatial - MonetDB". 25 July 2023.
  3. 1 2 "MonetDB - LOD2 - Creating Knowledge out of Interlined Data". 6 March 2014.
  4. 1 2 "Life Sciences in MonetDB". 25 July 2023.
  5. "A short history about us - MonetDB". 6 March 2014.
  6. Boncz, Peter (May 2002). Monet: A Next-Generation DBMS Kernel For Query-Intensive Applications (PDF) (Ph.D. Thesis). Universiteit van Amsterdam. Archived from the original (PDF) on 13 August 2011.
  7. 1 2 MonetDB historic background
  8. Stefan Manegold (June 2006). "An Empirical Evaluation of XQuery Processors" (PDF). Proceedings of the International Workshop on Performance and Evaluation of Data Management Systems (ExpDB). 33 (2). ACM: 203–220. doi:10.1016/j.is.2007.05.004 . Retrieved December 11, 2013.
  9. P. A. Boncz, T. Grust, M. van Keulen, S. Manegold, J. Rittinger, J. Teubner. MonetDB/XQuery: A Fast XQuery Processor Powered by a Relational Engine Archived 2008-05-19 at the Wayback Machine . In Proceedings of the ACM SIGMOD International Conference on Management of Data, Chicago, IL, USA, June 2006.
  10. Marcin Zukowski; Peter Boncz (May 20, 2012). "From x100 to vectorwise". Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data. ACM. pp. 861–862. doi:10.1145/2213836.2213967. ISBN   978-1-4503-1247-9. S2CID   9187072.
  11. Inkster, D.; Zukowski, M.; Boncz, P. A. (September 20, 2011). "Integration of VectorWise with Ingres" (PDF). ACM SIGMOD Record. 40 (3). ACM: 45. CiteSeerX   10.1.1.297.4985 . doi:10.1145/2070736.2070747. S2CID   6372175.
  12. "XQuery". 12 December 2014.
  13. "MonetDB Oct2014 Release Notes". 12 December 2014.
  14. "MonetDB July 2015 Released". 31 August 2015.
  15. 1 2 3 4 Idreos, S.; Groffen, F. E.; Nes, N. J.; Manegold, S.; Mullender, K. S.; Kersten, M. L. (March 2012). "MonetDB: Two Decades of Research in Column-oriented Database Architectures" (PDF). IEEE Data Engineering Bulletin. IEEE: 40–45. Retrieved March 6, 2014.
  16. "CWI database team wins Best Paper Runner Up at SIGMOD 2009". CWI Amsterdam. Retrieved 2009-07-01.
  17. "SIGMOD Awards". ACM SIGMOD. Retrieved 2014-07-01.
  18. Idreos, Stratos; Kersten, Martin L; Manegold, Stefan (2007). Database cracking. Proceedings of CIDR.
  19. "SIGMOD Awards". ACM SIGMOD. Retrieved 2014-12-12.
  20. Zhang, Y.; Scheers, L. H. A.; Kersten, M. L.; Ivanova, M.; Nes, N. J. (2011). "Astronomical Data Processing Using SciQL, an SQL Based Query Language for Array Data". Astronomical Data Analysis Software and Systems.
  21. 1 2 3 Ivanova, Milena; Kersten, Martin; Manegold, Stefan (2012). "Data vaults: a symbiosis between database technology and scientific file repositories". Scientific and Statistical Database Management. SSDBM 20212. Springer Berlin Heidelberg. pp. 485–494.
  22. "SciQL". 4 March 2014.
  23. 1 2 Ivanova, Milena; Kargin, Yagiz; Kersten, Martin; Manegold, Stefan; Zhang, Ying; Datcu, Mihai; Molina, Daniela Espinoza (2013). "Data Vaults: A Database Welcome to Scientific File Repositories". Proceedings of the 25th International Conference on Scientific and Statistical Database Management. SSDBM. ACM. doi:10.1145/2484838.2484876. ISBN   978-1-4503-1921-8.
  24. Kargin, Yagiz; Ivanova, Milena; Zhang, Ying; Manegold, Stefan; Kersten, Martin (August 2013). "Lazy ETL in Action: ETL Technology Dates Scientific Data" (PDF). Proceedings of the VLDB Endowment. 6 (12): 1286–1289. doi:10.14778/2536274.2536297. ISSN   2150-8097.
  25. "Astronomical data analysis with MonetDB Data Vaults". 2015-09-09.
  26. "Data Vaults". 2015-09-09.
  27. "SAM/BAM installation". 24 November 2014.
  28. "Embedded R in MonetDB". 13 November 2014. Archived from the original on 13 November 2014. Retrieved 12 November 2014.
  29. "Embedded Python/NumPy in MonetDB". 11 January 2015.
  30. "MonetDBLite for R". 25 November 2015.
  31. "Xquery (obsolete)". MonetDB. Retrieved 2015-05-26.
  32. "Announcement: New Oct2014 Feature release of MonetDB suite". MonetDB. Retrieved 2015-05-26.

Bibliography