DuckDB

Last updated
DuckDB
Developer(s) DuckDB Labs
Stable release
v1.1.3 / November 4, 2024 (2024-11-04)
Repository
Written in C++
Operating system Cross-platform
Type Column-oriented DBMS
RDBMS
License MIT License
Website www.duckdb.org

DuckDB is an open-source column-oriented relational database management system (RDBMS). [1] It is designed to provide high performance on complex queries against large databases in embedded configuration, [2] such as combining tables with hundreds of columns and billions of rows. Unlike other embedded databases (for example, SQLite) DuckDB is not focusing on transactional (OLTP) applications and instead is specialized for online analytical processing (OLAP) workloads. [3] The project has over 6 million downloads per month. [4] [5] [6]

Contents

History

DuckDB was originally developed by Mark Raasveldt and Hannes Mühleisen at the Centrum Wiskunde & Informatica (CWI) in the Netherlands. [2] The project co-founders designed DuckDB to address the need for an in-process OLAP database solution. [7] DuckDB was first released in 2019. [8] DuckDB version 1.0.0 was released on June 3, 2024 under the codename SnowDuck. [9]

Features

DuckDB uses a vectorized query processing engine. [10] DuckDB is special amongst database management systems because it does not have any external dependencies and can build with just a C++11 compiler. [11] DuckDB also deviates from the traditional client–server model by running inside a host process (it has bindings, for example, for a Python interpreter with the ability to directly place data into NumPy arrays [2] ). DuckDB's SQL parser is derived from the pg_query library developed by Lukas Fittl, which is itself derived from PostgreSQL's SQL parser that has been stripped down as much as possible. [12] [13] DuckDB uses a single-file storage format to store data on disk, designed to support efficient scans and bulk updates, appends and deletes. [14] . DuckDB is also compiled to WebAssembly using emscripten which enables DuckDB to run SQL in browser-based analytics tools. [15] [16]

Comparison

DuckDB in its OLAP niche does not compete with the traditional DBMS like MSSQL, PostgreSQL and Oracle database. While using SQL for queries, DuckDB targets serverless applications and provides extremely fast responses using Apache Parquet files for storage. These attributes make it a popular choice for large dataset analysis in interactive mode, but certain commenters[ who? ] have indicated that they believe the serverless nature of DuckDB makes it, as a stand alone tool, "not so suitable for enterprise data warehousing". [17]

Commercial use

DuckDB is used at Facebook, Google, and Airbnb. [18]

DuckDB co-author Mühleisen also runs a support and consultancy firm for the software, DuckDB Labs. [8] The company has chosen not to take venture capital funding, stating "We feel investment would force the project direction towards monetization, and we would much prefer keeping DuckDB open and available for as many people as possible". [6] Another company, MotherDuck, has received $100m funding for its data platform based on DuckDB, with investors including Andreessen Horowitz. [19]

DuckDB Foundation

The independent non-profit DuckDB Foundation safeguards the long-term maintenance and development of DuckDB. The foundation holds much of the intellectual property of the project and is funded by charitable donations. [20] The DuckDB Foundation's statutes ensure DuckDB remains open-source under the MIT license in perpetuity. [21]

Language support

In addition to the native C and C++ APIs, DuckDB supports a range of programming languages.

Client APIs
LanguageNotesReference
Java The Java API is implemented using JNI. [22] Integration with the Apache Arrow [23] format is provided. [24]
Python The Python API implements support for the Pandas, [25] Apache Arrow [26] and Polars data analysis packages. [27]
Rust The Rust API is distributed as a rust crate that exposes an elegant wrapper over the native C API. [28]
Node.JS Node API [29]
R R API [30]
Julia Julia API [31]
Swift Swift API [32]
WebAssembly WASM API [33]

Extensions

DuckDB's architecture supports extensions, allowing additional functionality to be added dynamically. [34] Many popular extensions are maintained by the core DuckDB team, and there are over 30 community extensions maintained by third parties. [35] [36] [37]

Related Research Articles

<span class="mw-page-title-main">IBM Db2</span> Relational model database server

Db2 is a family of data management products, including database servers, developed by IBM. It initially supported the relational model, but was extended to support object–relational features and non-relational structures like JSON and XML. The brand name was originally styled as DB2 until 2017, when it changed to its present form. In the early days, it was sometimes wrongly styled as DB/2 in a false derivation from the operating system OS/2.

In computing, online analytical processing, or OLAP, is an approach to quickly answer multi-dimensional analytical (MDA) queries. The term OLAP was created as a slight modification of the traditional database term online transaction processing (OLTP). OLAP is part of the broader category of business intelligence, which also encompasses relational databases, report writing and data mining. Typical applications of OLAP include business reporting for sales, marketing, management reporting, business process management (BPM), budgeting and forecasting, financial reporting and similar areas, with new applications emerging, such as agriculture.

<span class="mw-page-title-main">SQLite</span> Serverless relational database management system (RDBMS)

SQLite is a database engine written in the C programming language. It is not a standalone app; rather, it is a library that software developers embed in their apps. As such, it belongs to the family of embedded databases. It is the most widely deployed database engine, as it is used by several of the top web browsers, operating systems, mobile phones, and other embedded systems.

<span class="mw-page-title-main">MonetDB</span> Open source column-oriented relational database management system

MonetDB is an open-source column-oriented relational database management system (RDBMS) originally developed at the Centrum Wiskunde & Informatica (CWI) in the Netherlands. It is designed to provide high performance on complex queries against large databases, such as combining tables with hundreds of columns and millions of rows. MonetDB has been applied in high-performance applications for online analytical processing, data mining, geographic information system (GIS), Resource Description Framework (RDF), text retrieval and sequence alignment processing.

Multidimensional Expressions (MDX) is a query language for online analytical processing (OLAP) using a database management system. Much like SQL, it is a query language for OLAP cubes. It is also a calculation language, with syntax similar to spreadsheet formulae.

Microsoft SQL Server Analysis Services (SSAS) is an online analytical processing (OLAP) and data mining tool in Microsoft SQL Server. SSAS is used as a tool by organizations to analyze and make sense of information possibly spread out across multiple databases, or in disparate tables or files. Microsoft has included a number of services in SQL Server related to business intelligence and data warehousing. These services include Integration Services, Reporting Services and Analysis Services. Analysis Services includes a group of OLAP and data mining capabilities and comes in two flavors multidimensional and tabular, where the difference between the two is how the data is presented. In a tabular model, the information is arranged in two-dimensional tables which can thus be more readable for a human. A multidimensional model can contain information with many degrees of freedom, and must be unfolded to increase readability by a human.

An embedded database system is a database management system (DBMS) which is tightly integrated with an application software; it is embedded in the application. It is a broad technology category that includes:

<span class="mw-page-title-main">Exasol</span> German database management software company

Exasol is an analytics database management software company. Its product is called Exasol, an in-memory, column-oriented, relational database management system

The following tables compare general and technical information for a number of online analytical processing (OLAP) servers. Please see the individual products articles for further information.

Redis is a source-available, in-memory storage, used as a distributed, in-memory key–value database, cache and message broker, with optional durability. Because it holds all data in memory and because of its design, Redis offers low-latency reads and writes, making it particularly suitable for use cases that require a cache. Redis is the most popular NoSQL database, and one of the most popular databases overall. Companies that use Redis include Twitter, Airbnb, Tinder, Yahoo, Adobe, Hulu, Amazon and OpenAI.

<span class="mw-page-title-main">Neo4j</span> Graph database implemented in Java

Neo4j is a graph database management system (GDBMS) developed by Neo4j Inc.

Druid is a column-oriented, open-source, distributed data store written in Java. Druid is designed to quickly ingest massive quantities of event data, and provide low-latency queries on top of the data. The name Druid comes from the shapeshifting Druid class in many role-playing games, to reflect that the architecture of the system can shift to solve different types of data problems.

Cubes is a light-weight open source multidimensional modelling and OLAP toolkit for development reporting applications and browsing of aggregated data written in Python programming language released under the MIT License.

<span class="mw-page-title-main">Cosmos DB</span> Cloud-based NoSQL database service

Azure Cosmos DB is a globally distributed, multi-model database service offered by Microsoft. It is designed to provide high availability, scalability, and low-latency access to data for modern applications. Unlike traditional relational databases, Cosmos DB is a NoSQL and vector database, which means it can handle unstructured, semi-structured, structured, and vector data types.

<span class="mw-page-title-main">Apache Kylin</span> Open-source distributed analytics engine

Apache Kylin is an open source distributed analytics engine designed to provide a SQL interface and multi-dimensional analysis (OLAP) on Hadoop and Alluxio supporting extremely large datasets.

<span class="mw-page-title-main">RocksDB</span> Embedded key-value database

RocksDB is a high performance embedded database for key-value data. It is a fork of Google's LevelDB optimized to exploit multi-core processors (CPUs), and make efficient use of fast storage, such as solid-state drives (SSD), for input/output (I/O) bound workloads. It is based on a log-structured merge-tree data structure. It is written in C++ and provides official language bindings for C++, C, and Java. Many third-party language bindings exist. RocksDB is free and open-source software, released originally under a BSD 3-clause license. However, in July 2017 the project was migrated to a dual license of both Apache 2.0 and GPLv2 license. This change helped its adoption in Apache Software Foundation's projects after blacklist of the previous BSD+Patents license clause.

<span class="mw-page-title-main">ClickHouse</span> Open-source database management system

ClickHouse is an open-source column-oriented DBMS for online analytical processing (OLAP) that allows users to generate analytical reports using SQL queries in real-time. ClickHouse Inc. is headquartered in the San Francisco Bay Area with the subsidiary, ClickHouse B.V., based in Amsterdam, Netherlands.

Microsoft Azure Stream Analytics is a serverless scalable complex event processing engine by Microsoft that enables users to develop and run real-time analytics on multiple streams of data from sources such as devices, sensors, web sites, social media, and other applications. Users can set up alerts to detect anomalies, predict trends, trigger necessary workflows when certain conditions are observed, and make data available to other downstream applications and services for presentation, archiving, or further analysis.

RavenDB is an open-source document-oriented database written in C#, developed by Hibernating Rhinos Ltd. It is cross-platform, supported on Windows, Linux, and Mac OS. RavenDB stores data as JSON documents and can be deployed in distributed clusters with master-master replication.

<span class="mw-page-title-main">Valkey</span> Open source in-memory key–value database

Valkey is an open-source in-memory storage, used as a distributed, in-memory key–value database, cache and message broker, with optional durability. Because it holds all data in memory and because of its design, Valkey offers low-latency reads and writes, making it particularly suitable for use cases that require a cache. Valkey is a successor to Redis, the most popular NoSQL database, and one of the most popular databases overall. Valkey or its predecessor Redis are used in companies like Twitter, Airbnb, Tinder, Yahoo, Adobe, Hulu, Amazon and OpenAI.

References

  1. "DuckDB Documentation SQL Introduction" . Retrieved 2024-11-20.
  2. 1 2 3 Kamphuis, Chris (2020). "Graph Databases for Information Retrieval". Advances in Information Retrieval. Lecture Notes in Computer Science. Vol. 12036. Cham: Springer International Publishing. pp. 608–612. doi:10.1007/978-3-030-45442-5_79. ISBN   978-3-030-45441-8. PMC   7148032 .
  3. Raasveldt, Mark; Mühleisen, Hannes (2019-06-25). DuckDB: an Embeddable Analytical Database. ACM. pp. 1981–1984. doi:10.1145/3299869.3320212. ISBN   978-1-4503-5643-5.
  4. "PyPi Download Stats". www.pypistats.org. Archived from the original on 2024-08-13. Retrieved 2024-08-13.
  5. "DuckDB Python Downloads Dashboard". duckdbstats.com. Archived from the original on 2024-08-13. Retrieved 2024-08-13.
  6. 1 2 Clark, Lindsay. "DuckDB Labs puts limit on free support, rules out VC funding". www.theregister.com. Archived from the original on 2024-03-23. Retrieved 2024-03-23.
  7. van der Ent, Leendert (April 2023). "DuckDB: Introducing a New Class of Data Management Systems" (PDF). I/O Magazine. ICT Research Platform Nederland. Retrieved 12 November 2024.
  8. 1 2 Clark, Lindsay. "DuckDB reaches version 0.5.0". www.theregister.com. Archived from the original on 2024-03-07. Retrieved 2024-03-23.
  9. Raasveldt, Mark; Mühleisen, Hannes (3 June 2024). "Announcing DuckDB 1.0.0" . Retrieved 12 November 2024.
  10. Raasveldt, Mark; Mühleisen, Hannes (2019-06-25). DuckDB: an Embeddable Analytical Database. ACM. pp. 1981–1984. doi:10.1145/3299869.3320212. ISBN   978-1-4503-5643-5.
  11. "DuckDB Building Instructions" . Retrieved 2024-08-16.
  12. Raasveldt, Mark; Mühleisen, Hannes (2019-06-25). DuckDB: an Embeddable Analytical Database. ACM. pp. 1981–1984. doi:10.1145/3299869.3320212. ISBN   978-1-4503-5643-5.
  13. Slot, Marco (24 May 2024). "How We Fused DuckDB into Postgres with Crunchy Bridge for Analytics" . Retrieved 12 November 2024.
  14. Raasveldt, Mark; Mühleisen, Hannes (2020). Data Management for Data Science Towards Embedded Analytics (PDF). Conference on Innovative Data Systems Research.
  15. "Introducing Universal SQL" . Retrieved 2025-01-17.
  16. "How we evolved our query architecture with DuckDB" . Retrieved 2025-01-17.
  17. Bannert, M. (2024). Research Software Engineering: A Guide to the Open Source Ecosystem. Chapman & Hall/CRC Data Science Series. CRC Press. p. 25. ISBN   978-1-04-000513-2. Archived from the original on 2024-03-23. Retrieved 2024-03-23.
  18. Clark, Lindsay. "Scale-up database wrangler MotherDuck scores $47.5 million". www.theregister.com. Archived from the original on 2024-03-23. Retrieved 2024-03-23.
  19. Clark, Lindsay. "MotherDuck serverless analytics platform wins $52.5M funding". www.theregister.com. Archived from the original on 2024-03-23. Retrieved 2024-03-23.
  20. "DuckDB Foundation" . Retrieved 2024-11-09.
  21. "DuckDB Project FAQs" . Retrieved 2024-11-09.
  22. "Java JNI Source Code". www.github.com. Retrieved 2024-09-07.
  23. "DuckDB Java Arrow Source Code". www.github.com. Retrieved 2024-09-07.
  24. "DuckDB Java Source Code". www.github.com. Retrieved 2024-09-07.
  25. "DuckDB Pandas Source". www.github.com. Retrieved 2024-09-07.
  26. "DuckDB PyArrow Source". www.github.com. Retrieved 2024-09-07.
  27. "DuckDB Python Source Code". www.github.com. Retrieved 2024-09-07.
  28. "DuckDB Rust Source Code". www.github.com. Retrieved 2024-09-07.
  29. "DuckDB Node Source Code". www.github.com. Retrieved 2024-09-07.
  30. "DuckDB R Source Code". www.github.com. Retrieved 2024-09-07.
  31. "DuckDB Jullia Source Code". www.github.com. Retrieved 2024-09-07.
  32. "DuckDB Swift Source Code". www.github.com. Retrieved 2024-09-07.
  33. "DuckDB Swift Source Code". www.github.com. Retrieved 2025-01-17.
  34. "DuckDB Extensions Overview". www.duckdb.org. Retrieved 2025-01-17.
  35. "Core DuckDB Extensions". www.duckdb.org. Retrieved 2025-01-17.
  36. "List of Community Extensions". www.duckdb.org. Retrieved 2025-01-17.
  37. "DuckDB Extension Radar". www.github.com. Retrieved 2025-01-17.

Further reading