Developer(s) | DuckDB Labs |
---|---|
Stable release | v1.1.3 / November 4, 2024 |
Repository | |
Written in | C++ |
Operating system | Cross-platform |
Type | Column-oriented DBMS, RDBMS |
License | MIT License |
Website | www |
DuckDB is an open-source column-oriented relational database management system (RDBMS). [1] It is designed to provide high performance on complex queries against large databases in an embedded configuration, [2] such as queries that combine tables with hundreds of columns and billions of rows. Unlike other embedded databases (for example, SQLite), DuckDB does not focus on transactional (OLTP) applications and is instead specialized for online analytical processing (OLAP) workloads. [3] The project has over 6 million downloads per month. [4] [5] [6]
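The difference between row-oriented and column-oriented layouts can be illustrated with a minimal sketch. It uses only the Python standard library and does not reflect DuckDB's actual storage format; the table, its field names, and the helper functions are hypothetical. An analytical query that aggregates one attribute only needs to read that attribute's column, whereas a row store visits every full record.

```python
from array import array

# Hypothetical sales table, stored row by row (one record per order).
rows = [
    {"order_id": 1, "region": "EU", "amount": 120.0},
    {"order_id": 2, "region": "US", "amount": 75.5},
    {"order_id": 3, "region": "EU", "amount": 30.0},
    {"order_id": 4, "region": "APAC", "amount": 210.25},
]


def to_columns(records):
    """Pivot row-oriented records into one contiguous sequence per column."""
    return {
        "order_id": array("q", (r["order_id"] for r in records)),
        "region": [r["region"] for r in records],
        "amount": array("d", (r["amount"] for r in records)),
    }


def total_amount_row_store(records):
    # Every full record is visited even though only one field is needed.
    return sum(r["amount"] for r in records)


def total_amount_column_store(columns):
    # Only the "amount" column is read; the other columns are never touched.
    return sum(columns["amount"])


columns = to_columns(rows)
row_total = total_amount_row_store(rows)
col_total = total_amount_column_store(columns)
print(row_total, col_total)
```

Both functions compute the same total. The columnar version touches a single typed array, which is the access pattern that analytical workloads over wide tables tend to benefit from.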
DuckDB was originally developed by Mark Raasveldt and Hannes Mühleisen at the Centrum Wiskunde & Informatica (CWI) in the Netherlands. [2] The project co-founders designed DuckDB to address the need for an in-process OLAP database solution. [7] DuckDB was first released in 2019. [8] DuckDB version 1.0.0 was released on June 3, 2024 under the codename SnowDuck. [9]
DuckDB uses a vectorized query processing engine. [10] Unlike many database management systems, it has no external dependencies and can be built with just a C++11 compiler. [11] DuckDB also deviates from the traditional client–server model by running inside a host process; for example, its Python bindings run inside the Python interpreter and can place data directly into NumPy arrays. [2] DuckDB's SQL parser is derived from the pg_query library developed by Lukas Fittl, which is itself a stripped-down version of PostgreSQL's SQL parser. [12] [13] DuckDB stores data on disk in a single-file storage format designed to support efficient scans as well as bulk updates, appends and deletes. [14]
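The idea behind vectorized query processing can be sketched as follows. This is an illustration of the general technique, not DuckDB's implementation: the operator names, the `VECTOR_SIZE` constant, and the sample data are all assumptions made for the example. Each operator consumes and produces small batches ("vectors") of values instead of handling one tuple at a time, which amortizes per-call overhead across the batch.

```python
from typing import Iterable, Iterator, List

VECTOR_SIZE = 4  # illustrative only; real engines use much larger batches


def scan(column: List[int], vector_size: int = VECTOR_SIZE) -> Iterator[List[int]]:
    """Yield the column in fixed-size chunks (the last chunk may be shorter)."""
    for start in range(0, len(column), vector_size):
        yield column[start:start + vector_size]


def filter_greater_than(vectors: Iterable[List[int]], threshold: int) -> Iterator[List[int]]:
    """Apply the predicate to a whole vector at once; skip empty results."""
    for vec in vectors:
        selected = [v for v in vec if v > threshold]
        if selected:
            yield selected


def sum_aggregate(vectors: Iterable[List[int]]) -> int:
    """Fold each incoming vector into a running total."""
    total = 0
    for vec in vectors:
        total += sum(vec)
    return total


values = list(range(1, 11))  # the integers 1 through 10
# Roughly equivalent to: SELECT sum(v) FROM t WHERE v > 5
result = sum_aggregate(filter_greater_than(scan(values), 5))
print(result)
```

The operators are chained as generators, so only one vector per operator is in flight at any time. A production engine would additionally use typed, contiguous buffers and selection vectors rather than Python lists.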
In its OLAP niche, DuckDB does not compete with traditional DBMSs such as Microsoft SQL Server, PostgreSQL and Oracle Database. While using SQL for queries, DuckDB targets serverless applications and can query Apache Parquet files directly with low latency. These attributes make it a popular choice for interactive analysis of large datasets, but some commenters have argued that the serverless nature of DuckDB makes it, as a standalone tool, "not so suitable for enterprise data warehousing". [15]
DuckDB is used at Facebook, Google, and Airbnb. [16]
DuckDB co-author Mühleisen also runs DuckDB Labs, a support and consultancy firm for the software. [8] The company has chosen not to take venture capital funding, stating "We feel investment would force the project direction towards monetization, and we would much prefer keeping DuckDB open and available for as many people as possible". [6] Another company, MotherDuck, has received $100 million in funding for its data platform based on DuckDB, with investors including Andreessen Horowitz. [17]
The independent non-profit DuckDB Foundation safeguards the long-term maintenance and development of DuckDB. The foundation holds much of the intellectual property of the project and is funded by charitable donations. [18] The DuckDB Foundation's statutes ensure DuckDB remains open-source under the MIT license in perpetuity. [19]
In addition to the native C and C++ APIs, DuckDB provides client APIs for a range of programming languages.
Language | Notes | Reference |
---|---|---|
Java | The Java API is implemented using JNI. [20] Integration with the Apache Arrow [21] format is provided. | [22] |
Python | The Python API implements support for the Pandas, [23] Apache Arrow [24] and Polars data analysis packages. | [25] |
Rust | The Rust API is distributed as a Rust crate that wraps the native C API. | [26] |
Node.js | Node API | [27] |
R | R API | [28] |
Julia | Julia API | [29] |
Swift | Swift API | [30] |