Aiyara cluster

A0, the first Aiyara cluster built at Suranaree University of Technology (image caption)

An Aiyara cluster is a low-power computer cluster specially designed to process Big Data. The Aiyara cluster model can be considered a specialization of the Beowulf cluster in the sense that Aiyara is also built from commodity hardware, although from system-on-chip computer boards rather than inexpensive personal computers. Unlike Beowulf, applications of an Aiyara cluster are scoped only to the Big Data area, not to scientific high-performance computing. Another important property of an Aiyara cluster is its low power consumption: it must be built with a class of processing units that produces little heat.

Name

The name Aiyara originally referred to the first ARM-based cluster built by Wichai Srisuruk and Chanwit Kaewkasi at Suranaree University of Technology. "Aiyara" comes from a Thai word meaning elephant, reflecting the cluster's underlying software stack, Apache Hadoop, whose mascot is an elephant.

Like Beowulf, an Aiyara cluster does not define a particular software stack to run atop it. A cluster normally runs a variant of the Linux operating system. Commonly used Big Data software stacks are Apache Hadoop and Apache Spark.
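As an illustration only, the sketch below shows a minimal PySpark word-count job of the kind such a stack could run over HDFS; the namenode host, port, and file paths are hypothetical placeholders rather than values taken from the Aiyara clusters.

```python
# Minimal PySpark word count over HDFS (illustrative sketch only).
# The namenode host, port, and paths are hypothetical placeholders.
from operator import add

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("aiyara-wordcount-sketch").getOrCreate()
sc = spark.sparkContext

lines = sc.textFile("hdfs://namenode:9000/data/input.txt")     # read from HDFS
counts = (lines.flatMap(lambda line: line.split())             # split lines into words
               .map(lambda word: (word, 1))                    # pair each word with 1
               .reduceByKey(add))                               # sum counts per word

counts.saveAsTextFile("hdfs://namenode:9000/data/wordcount-output")
spark.stop()
```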

Development

A report of the Aiyara hardware successfully processing a non-trivial amount of Big Data was published in the Proceedings of ICSEC 2014.[1] Aiyara Mk-I, the second Aiyara cluster, consists of 22 Cubieboards. It is the first known SoC-based ARM cluster able to process Big Data using the Spark and HDFS stack.[2]

The Aiyara cluster model, a technical description explaining how to build an Aiyara cluster, was later published by Chanwit Kaewkasi in DZone's 2014 Big Data Guide.[3] Further results and cluster optimization techniques, which boosted the cluster's processing rate to 0.9 GB/min while preserving low power consumption, were reported in the Proceedings of IEEE TENCON 2014.[4]

In that work the whole architecture of the software stack, including the runtime, data integrity verification, and data compression, was studied and improved. The reported optimizations achieved a processing rate of almost 0.9 GB/min, completing the same benchmark as the previous work in roughly 38 minutes.
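As a back-of-the-envelope check using only the figures quoted above (the papers themselves report the exact benchmark sizes), a rate of 0.9 GB/min sustained for roughly 38 minutes corresponds to a workload of about 34 GB:

```python
# Rough consistency check of the figures quoted above (illustrative only).
rate_gb_per_min = 0.9   # reported processing rate
runtime_min = 38        # reported benchmark runtime
print(f"Implied benchmark size: about {rate_gb_per_min * runtime_min:.0f} GB")  # ~34 GB
```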

Related Research Articles

Commodity computing involves the use of large numbers of already-available computing components for parallel computing, to get the greatest amount of useful computation at low cost. It is computing done in commodity computers as opposed to in high-cost superminicomputers or in boutique computers. Commodity computers are computer systems, manufactured by multiple vendors, incorporating components based on open standards.

MapReduce is a programming model and an associated implementation for processing and generating big data sets with a parallel, distributed algorithm on a cluster.
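For example, the classic word count fits this model; the following is a single-process Python sketch of the map, shuffle, and reduce phases, not code for any particular MapReduce implementation.

```python
# Single-process sketch of the MapReduce programming model (word count).
from collections import defaultdict

def map_phase(document):
    # Emit (key, value) pairs: one (word, 1) per word.
    for word in document.split():
        yield word, 1

def shuffle(pairs):
    # Group values by key, as the framework would between map and reduce.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Combine all values for a key into a single result.
    return key, sum(values)

documents = ["big data on small boards", "big data big cluster"]
mapped = [pair for doc in documents for pair in map_phase(doc)]
results = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(results)   # e.g. {'big': 3, 'data': 2, ...}
```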

Apache Hadoop is a collection of open-source software utilities that facilitates using a network of many computers to solve problems involving massive amounts of data and computation. It provides a software framework for distributed storage and processing of big data using the MapReduce programming model. Hadoop was originally designed for computer clusters built from commodity hardware, which is still the common use. It has since also found use on clusters of higher-end hardware. All the modules in Hadoop are designed with a fundamental assumption that hardware failures are common occurrences and should be automatically handled by the framework.

HBase is an open-source non-relational distributed database modeled after Google's Bigtable and written in Java. It is developed as part of Apache Software Foundation's Apache Hadoop project and runs on top of HDFS or Alluxio, providing Bigtable-like capabilities for Hadoop. That is, it provides a fault-tolerant way of storing large quantities of sparse data.

Apache ZooKeeper is an open-source server for highly reliable distributed coordination of cloud applications. It is a project of the Apache Software Foundation.

Apache Hive is a data warehouse software project built on top of Apache Hadoop for providing data query and analysis. Hive gives an SQL-like interface to query data stored in various databases and file systems that integrate with Hadoop. Traditional SQL queries must be implemented in the MapReduce Java API to execute SQL applications and queries over distributed data. Hive provides the necessary SQL abstraction to integrate SQL-like queries (HiveQL) into the underlying Java without the need to implement queries in the low-level Java API. Since most data warehousing applications work with SQL-based querying languages, Hive aids portability of SQL-based applications to Hadoop. While initially developed by Facebook, Apache Hive is used and developed by other companies such as Netflix and the Financial Industry Regulatory Authority (FINRA). Amazon maintains a software fork of Apache Hive included in Amazon Elastic MapReduce on Amazon Web Services.
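As a hedged sketch of that SQL-like interface, the snippet below issues a HiveQL query through the third-party PyHive client; the server address, credentials, and the page_views table are hypothetical and not drawn from the article above.

```python
# Illustrative HiveQL query via the PyHive client (sketch only).
# Host, port, database, and the page_views table are hypothetical.
from pyhive import hive

conn = hive.Connection(host="hive-server.example.org", port=10000,
                       username="analyst", database="default")
cursor = conn.cursor()

# Hive translates this SQL-like query into jobs on the underlying engine,
# so no low-level MapReduce Java code has to be written by hand.
cursor.execute("""
    SELECT country, COUNT(*) AS views
    FROM page_views
    GROUP BY country
    ORDER BY views DESC
    LIMIT 10
""")
for country, views in cursor.fetchall():
    print(country, views)
conn.close()
```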

Data-intensive computing is a class of parallel computing applications which use a data parallel approach to process large volumes of data, typically terabytes or petabytes in size and typically referred to as big data. Computing applications which devote most of their execution time to computational requirements are deemed compute-intensive, whereas computing applications which require large volumes of data and devote most of their processing time to I/O and manipulation of data are deemed data-intensive.

HPCC, also known as DAS, is an open source, data-intensive computing system platform developed by LexisNexis Risk Solutions. The HPCC platform incorporates a software architecture implemented on commodity computing clusters to provide high-performance, data-parallel processing for applications utilizing big data. The HPCC platform includes system configurations to support both parallel batch data processing (Thor) and high-performance online query applications using indexed data files (Roxie). The HPCC platform also includes a data-centric declarative programming language for parallel data processing called ECL.

Within computing database management systems, the RCFile is a data placement structure that determines how to store relational tables on computer clusters. It is designed for systems using the MapReduce framework. The RCFile structure includes a data storage format, data compression approach, and optimization techniques for data reading. It is able to meet all four requirements of data placement: (1) fast data loading, (2) fast query processing, (3) highly efficient storage space utilization, and (4) strong adaptivity to dynamic data access patterns.

MapR was a business software company headquartered in Santa Clara, California. MapR software provides access to a variety of data sources from a single computer cluster, including big data workloads such as Apache Hadoop and Apache Spark, a distributed file system, a multi-model database management system, and event stream processing, combining analytics in real-time with operational applications. Its technology runs on both commodity hardware and public cloud computing services. In August 2019, following financial difficulties, the technology and intellectual property of the company were sold to Hewlett Packard Enterprise.

The Oracle Big Data Appliance consists of hardware and software from Oracle Corporation sold as a computer appliance. It was announced in 2011, and is used for consolidating and loading unstructured data into Oracle Database software.

Actian Vector is an SQL relational database management system designed for high performance in analytical database applications. It published record-breaking results on the Transaction Processing Performance Council's TPC-H benchmark for database sizes of 100 GB, 300 GB, 1 TB and 3 TB on non-clustered hardware.

Cubieboard is a single-board computer, made in Zhuhai, Guangdong, China. The first short run of prototype boards was sold internationally in September 2012, and the production version started to be sold in October 2012. It can run Android 4 ICS, Ubuntu 12.04 desktop, Fedora 19 ARM Remix desktop, Armbian, Arch Linux ARM, a Debian-based Cubian distribution, FreeBSD, or OpenBSD.

Apache Drill is an open-source software framework that supports data-intensive distributed applications for interactive analysis of large-scale datasets. Built chiefly through contributions from developers at MapR, Drill is inspired by Google's Dremel system, which is also productized as BigQuery. Drill was designated an Apache Software Foundation top-level project in December 2016. Tom Shiran is the founder of the Apache Drill Project.

Apache Impala is an open source massively parallel processing (MPP) SQL query engine for data stored in a computer cluster running Apache Hadoop. Impala has been described as the open-source equivalent of Google F1, which inspired its development in 2012.

Apache Spark is an open-source unified analytics engine for large-scale data processing. Spark provides an interface for programming clusters with implicit data parallelism and fault tolerance. Originally developed at the University of California, Berkeley's AMPLab, the Spark codebase was later donated to the Apache Software Foundation, which has maintained it since.
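A minimal sketch of the implicit data parallelism mentioned above, assuming a PySpark installation: the same driver code runs unchanged on a single machine or a cluster, with Spark partitioning the data and recomputing lost partitions on failure.

```python
# Implicit data parallelism in Spark: no explicit threads or message passing.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parallelism-sketch").getOrCreate()
sc = spark.sparkContext

numbers = sc.parallelize(range(1_000_000))                  # partitioned across the cluster
total = numbers.map(lambda x: x * x).reduce(lambda a, b: a + b)
print(total)
spark.stop()
```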

oneAPI Data Analytics Library is a library of optimized algorithmic building blocks for data analysis stages most commonly associated with solving Big Data problems.

Apache Arrow is a language-agnostic software framework for developing data analytics applications that process columnar data. It contains a standardized column-oriented memory format that is able to represent flat and hierarchical data for efficient analytic operations on modern CPU and GPU hardware. This reduces or eliminates factors that limit the feasibility of working with large sets of data, such as the cost, volatility, or physical constraints of dynamic random-access memory.
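For instance, with pyarrow (Arrow's Python implementation) a small table can be held in the columnar format and a whole column aggregated at once; the column names and values below are made up for illustration.

```python
# Columnar in-memory data with Apache Arrow's pyarrow library (illustrative values).
import pyarrow as pa
import pyarrow.compute as pc

# Each column is stored contiguously in memory, independent of the other columns.
table = pa.table({
    "user_id": [1, 2, 3, 4],
    "clicks":  [10, 0, 7, 3],
})

clicks = table.column("clicks")      # view of a single column
print(pc.sum(clicks).as_py())        # 20
print(table.schema)                  # column names and types
```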

Apache ORC is a free and open-source column-oriented data storage format. It is similar to the other columnar-storage file formats available in the Hadoop ecosystem such as RCFile and Parquet. It is used by most of the data processing frameworks in the Hadoop ecosystem, including Apache Spark, Apache Hive, Apache Flink, and Apache Hadoop.

References

  1. C. Kaewkasi and W. Srisuruk. "A Study of Big Data Processing Constraints on a Low-Power Hadoop Cluster." Proceedings of the 18th ICSEC, 2014, pp. 308–313.
  2. "The first Spark/Hadoop ARM cluster runs atop Cubieboards." Cubieboard.org, April 8, 2014.
  3. Chanwit Kaewkasi. "The DIY Big Data Cluster." DZone Big Data Guide 2014, September 22, 2014, pp. 20–21.
  4. C. Kaewkasi and W. Srisuruk. "Optimizing Performance and Power Consumption for an ARM-Based Big Data Cluster." Proceedings of the 2014 IEEE Region 10 Conference (TENCON), 2014, pp. 1–6.