Apache IoTDB

Developer(s) Apache Software Foundation
Stable release 1.1.0 / 3 April 2023
Repository github.com/apache/iotdb
Written in Java
Platform Cross-platform
Type Time series database
License Apache License 2.0
Website iotdb.apache.org

Apache IoTDB is an open-source, column-oriented time-series database (TSDB) management system written in Java. [1] It is available in both edge and cloud versions and combines an optimized columnar file format for efficient time-series storage with a TSDB that offers a high ingestion rate, low-latency queries and support for data analysis. It is specifically optimized for time-series operations such as aggregation queries, downsampling and sub-sequence similarity search. The name IoTDB stands for Internet of Things (IoT) Database: it was designed as an IoT-native TSDB that addresses the pain points of typical IoT scenarios, including massive data generation, high-frequency sampling, out-of-order data, specialized analytics requirements, high storage and operation and maintenance costs, and the low computational power of IoT devices. [2]

History

Apache IoTDB is a project initiated by Prof. Jianmin Wang's team in the School of Software at Tsinghua University. [1] In 2011, the team chose open-source NoSQL technology instead of Oracle for a project that managed massive amounts of machine data, and found NoSQL insufficient for industrial internet of things (IIoT) scenarios. The team then started to develop its own data management system and, in March 2016, formally proposed TsFile, [3] a compact columnar file format optimized for time-series data. The source code was then published on GitHub.

In June 2016, the team began to develop IoTDB on top of TsFile as an IIoT database supporting real-time reads, writes and analysis.

In November 2018, the IoTDB project entered incubation at the Apache Software Foundation (ASF). [4]

On September 16, 2020, the ASF officially issued a resolution promoting Apache IoTDB to a Top-Level Project (TLP), following a public discussion and vote by the community and a show-of-hands vote by the board. [1] [5]

Architecture

Figure: The architecture of Apache IoTDB.

The complete storage system of Apache IoTDB follows a client-server architecture, consisting of the IoTDB engine (server) and a set of components that form the IoTDB suite (client). In practice, the IoTDB suite covers data collection, writing, storage, query, visualization and analysis. Data collected by sensors is continuously persisted on the server, where it can be queried natively or shipped to other open-source platforms for analysis. In particular, IoTDB provides a mode called "Edge-Cloud Cooperation", in which the Sync Tool synchronizes collected data from one IoTDB instance to another at a user-configured interval. [6] [7]

Users can use JDBC to write time-series data to a local or remote IoTDB instance. This data may be system state data (such as server load, CPU usage and memory usage), message queue data, time-series data from applications, or other time-series data held in a database. The data can be written directly to TsFile, either locally or on the Hadoop Distributed File System (HDFS).

TsFile is a columnar storage file format developed for accessing, compressing and storing time-series data in Apache IoTDB. Its structure is based on the LSM-tree (log-structured merge-tree), which reduces the computational resources required and improves the performance of Apache IoTDB. [3] [8]

TsFile files can also be written to HDFS, which makes it possible to run data processing tasks such as anomaly detection and machine learning on the Hadoop or Spark data processing platforms.

For data written to HDFS or to local TsFile files, users can use TsFile-Hadoop-Connector or TsFile-Spark-Connector to let Hadoop or Spark process the data. The analysis results can be written back to TsFile in the same way. In addition, IoTDB and TsFile provide client tools for writing and viewing data in SQL, script and graphical form. [2] [9] [10] [11] [12] [13]

Features

Flexible and cross-platform deployment

IoTDB is designed to fit three deployment scenarios: 1) file-based storage or an embedded time-series database on an edge appliance such as a Raspberry Pi, 2) a standalone TSDB on an industrial PC and 3) a distributed TSDB or a Hadoop cluster with TsFile. IoTDB provides a one-click installation tool for the cloud, a terminal tool that is ready to use once decompressed, and a bridging tool between cloud platforms and terminal tools (the Data Synchronization Tool). [2] [6]

Low storage cost

IoTDB achieves a high compression ratio on disk, so it can store the same amount of data at a lower hardware cost. [2] [3] [14]
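
One reason columnar time-series storage tends to compress well is that timestamps sampled at a fixed interval are highly regular. The sketch below uses second-order (delta-of-delta) differencing to show the idea. It is an illustrative technique with hypothetical function names and is not claimed to be the encoding that TsFile actually applies.

```python
def delta_of_delta_encode(timestamps):
    """Encode timestamps as [first value, then second-order differences]."""
    if not timestamps:
        return []
    encoded = [timestamps[0]]
    prev, prev_delta = timestamps[0], 0
    for t in timestamps[1:]:
        delta = t - prev
        encoded.append(delta - prev_delta)  # zero while the interval is steady
        prev, prev_delta = t, delta
    return encoded


def delta_of_delta_decode(encoded):
    """Invert delta_of_delta_encode."""
    if not encoded:
        return []
    timestamps = [encoded[0]]
    delta = 0
    for dd in encoded[1:]:
        delta += dd
        timestamps.append(timestamps[-1] + delta)
    return timestamps


# Four samples one second apart, then one arriving after only 500 ms.
samples = [1000, 2000, 3000, 4000, 4500]
encoded = delta_of_delta_encode(samples)
print(encoded)
```

With a steady sampling interval most encoded values are zero, which a subsequent bit-packing or run-length stage can store in very little space.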

Efficient directory structure

IoTDB supports efficient organization of complex time-series data structures from intelligent networking devices, organization of time-series data from devices of the same type, and a fuzzy search strategy over massive and complex directories of time-series data. [1] [2] [3]
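
To make the idea of searching a directory-like hierarchy of series concrete, the sketch below matches dot-separated series paths against wildcard patterns. The paths and function names are hypothetical, and the wildcard semantics (`*` for exactly one level, `**` for one or more levels) are an assumption made for illustration, not a statement of how IoTDB implements its search.

```python
def _match(pattern_nodes, path_nodes):
    """Recursively match a list of pattern nodes against path nodes."""
    if not pattern_nodes:
        return not path_nodes
    head, rest = pattern_nodes[0], pattern_nodes[1:]
    if head == "**":
        # Assumed semantics: consume one or more levels.
        return any(
            _match(rest, path_nodes[i:])
            for i in range(1, len(path_nodes) + 1)
        )
    if not path_nodes:
        return False
    if head == "*" or head == path_nodes[0]:
        return _match(rest, path_nodes[1:])
    return False


def find_series(pattern, paths):
    """Return the paths that match the pattern, in input order."""
    pattern_nodes = pattern.split(".")
    return [p for p in paths if _match(pattern_nodes, p.split("."))]


# Hypothetical series paths.
paths = [
    "root.plant1.device1.temperature",
    "root.plant1.device2.temperature",
    "root.plant2.device1.speed",
]
print(find_series("root.*.device1.*", paths))
print(find_series("root.**.temperature", paths))
```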

High-throughput read and write

IoTDB supports strong-connection data access for millions of low-power devices, as well as high-speed data reads and writes for intelligent networking devices and for mixed deployments of both. IoTDB currently supports an ingestion rate of up to 30 million data points per second on a single node. [1] [2] [14] [15]

Rich query semantics

IoTDB supports time alignment of time-series data across devices and sensors, computation on time-series fields (such as frequency-domain transformation) and a rich set of aggregation functions along the time dimension. [2] [14]
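
The two query ideas named here, time alignment and aggregation along the time dimension, can be sketched in a few lines. The example below is a plain in-memory illustration with hypothetical names and data; it does not reflect IoTDB's query engine or its SQL syntax.

```python
def align(series_by_name):
    """Align several {timestamp: value} series on the union of timestamps.

    Missing readings are reported as None.
    """
    names = list(series_by_name)
    all_times = sorted(set().union(*(s.keys() for s in series_by_name.values())))
    return [
        (t, tuple(series_by_name[name].get(t) for name in names))
        for t in all_times
    ]


def downsample_mean(series, start, end, interval):
    """Average the values in each window [w, w + interval) within [start, end)."""
    result = []
    for w in range(start, end, interval):
        upper = min(w + interval, end)
        values = [v for t, v in series.items() if w <= t < upper]
        result.append((w, sum(values) / len(values) if values else None))
    return result


# Hypothetical readings from two sensors, keyed by timestamp.
temperature = {0: 20.0, 10: 22.0, 20: 24.0}
speed = {0: 5.0, 20: 7.0}

aligned = align({"temperature": temperature, "speed": speed})
averaged = downsample_mean(temperature, 0, 40, 20)
print(aligned)
print(averaged)
```

Here `aligned` pairs the two sensors row by row, leaving a gap where `speed` has no reading at timestamp 10, and `averaged` reduces `temperature` to one mean value per 20-unit window.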

Easy to get started

IoTDB supports an SQL-like language, the standard JDBC API and easy-to-use import/export tools. [1] [2] [14]

Tight integration with the open-source ecosystem

IoTDB integrates with analysis ecosystems such as Hadoop and Spark and with the Grafana visualization tool. [1] [2] [16]

Licensing

The Apache License 2.0 is a permissive free software license written by the Apache Software Foundation. It allows users to modify and redistribute the code, provided that redistributed copies include the license and the notices it requires. [17]

References

  1. Sally (23 September 2020). "The Apache Software Foundation Announces Apache® IoTDB™ as A Top-Level Project". The Apache Software Foundation Blog. Retrieved 18 November 2022.
  2. Wang, Chen; Huang, Xiangdong; Qiao, Jialin; Jiang, Tian; Rui, Lei; Zhang, Jinrui; Kang, Rong; Feinauer, Julian; McGrail, Keven A.; Wang, Peng; Yuan, Jun; Wang, Jianmin; Sun, Jiaguang (August 2020). "Apache IoTDB" (PDF). Proceedings of the VLDB Endowment. 13 (12): 2901–2904. doi:10.14778/3415478.3415504. S2CID 221352039.
  3. Hou, Haonan (14 March 2022). "TsFile Format". ASF Confluence. Retrieved 18 November 2022.
  4. "Apache IoTDB Project Incubation Status". Apache Incubator. Retrieved 18 November 2022.
  5. heise online (23 September 2020). "Apache Software Foundation erhebt IoTDB zum Top-Level-Projekt". Developer (in German). Retrieved 13 December 2022.
  6. "IoTDB User Guide: System Architecture". Apache IoTDB. Retrieved 18 November 2022.
  7. "Apache IoTDB". Database of Databases. 27 June 2022. Retrieved 18 November 2022.
  8. Xiao, Jinzhao; Huang, Yuxiang; Hu, Changyu; Song, Shaoxu; Huang, Xiangdong; Wang, Jianmin (7 September 2022). "Time series data encoding for efficient storage: a comparative analysis in Apache IoTDB". Proceedings of the VLDB Endowment. 15 (10): 2148–2160. doi:10.14778/3547305.3547319. ISSN 2150-8097. S2CID 252135944.
  9. Huang, Xiangdong; Wang, Jianmin; Wong, Raymond; Zhang, Jinrui; Wang, Chen (24 October 2016). "PISA". Proceedings of the 25th ACM International on Conference on Information and Knowledge Management. CIKM '16. New York, NY, USA: Association for Computing Machinery. pp. 979–988. doi:10.1145/2983323.2983775. ISBN 978-1-4503-4073-1. S2CID 12456810.
  10. Kang, Rong; Wang, Chen; Wang, Peng; Ding, Yuting; Wang, Jianmin (2018). "Matching Consecutive Subpatterns over Streaming Time Series". In Cai, Yi; Ishikawa, Yoshiharu; Xu, Jianliang (eds.). Web and Big Data. Lecture Notes in Computer Science. Vol. 10988. Cham: Springer International Publishing. pp. 90–105. arXiv:1805.06757. doi:10.1007/978-3-319-96893-3_8. ISBN 978-3-319-96893-3. S2CID 21687305.
  11. Wu, Jiaye; Wang, Peng; Pan, Ningting; Wang, Chen; Wang, Wei; Wang, Jianmin (2019). "KV-Match: A Subsequence Matching Approach Supporting Normalization and Time Warping". 2019 IEEE 35th International Conference on Data Engineering (ICDE). pp. 866–877. arXiv:1710.00560. doi:10.1109/ICDE.2019.00082. ISBN 978-1-5386-7474-1. S2CID 46926461.
  12. Mao, Dongfang; Li, Tianan; Huang, Xiangdong; Yuan, Jun; Xu, Yi; Wang, Jianmin (27 April 2020). "The design of Apache IoTDB distributed framework". National Database Conference. 50 (5): 621–636. doi:10.1360/SSI-2019-0189. S2CID 219053248.
  13. Qiao, Jialin; Huang, Xiangdong; Wang, Jianmin; Wong, Raymond K. (1 January 2020). "Dual-PISA: An index for aggregation operations on time series data". Information Systems. 87: 101427. doi:10.1016/j.is.2019.101427. ISSN 0306-4379. S2CID 201127537.
  14. "IoTDB User Guide: Features". Apache IoTDB website. Retrieved 18 November 2022.
  15. vogler. "Automation Gateway with Apache IoTDB… | RocWorks". Retrieved 13 December 2022.
  16. "Apache IoTDB Dashboard v0.13.1". Grafana Labs. Retrieved 13 December 2022.
  17. "Apache License, version 2.0". The Apache Software Foundation. Retrieved 18 November 2022.