H2O (software)

H2O
Original author(s): H2O.ai
Developer(s): H2O.ai
Initial release: 2011
Stable release: 3.20.0.8 / August 1, 2018
Written in: Java, Python, and R [1] [2] [3]
Operating system: Linux, macOS, and Microsoft Windows
Platform: Apache Hadoop Distributed File System; Amazon EC2, Google Compute Engine, and Microsoft Azure
Standard(s): Databricks certified on Spark [3]
Available in: English
Type: Big data analytics, machine learning, statistical learning theory [4]
License: Apache License 2.0 [5]
Website: www.h2o.ai

H2O is open-source software for big-data analysis. It is produced by the company H2O.ai. H2O allows users to fit thousands of potential models as part of discovering patterns in data.

The H2O software can be called from the statistical package R, from Python, and from other environments. It is used for exploring and analyzing datasets held in cloud computing systems and in the Apache Hadoop Distributed File System, as well as on the conventional operating systems Linux, macOS, and Microsoft Windows. The H2O software is written in Java, Python, and R. Its graphical user interface is compatible with four browsers: Chrome, Safari, Firefox, and Internet Explorer.

H2O

The H2O project aims to develop an analytical interface for cloud computing, providing users with tools for data analysis. [1] The software is open-source and freely distributed. The company receives fees for providing customer service and customized extensions.

Mining of big data

Big datasets are often too large to be analyzed with traditional software such as R. The H2O software provides data structures and methods suitable for big data. H2O allows users to analyze and visualize whole sets of data without the Procrustean strategy of studying only a small subset with a conventional statistical package. [2] H2O's statistical algorithms include K-means clustering, generalized linear models, distributed random forests, gradient boosting machines, naive Bayes, principal component analysis, and generalized low-rank models. [6]
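
As a sketch of how such a model is fitted, the following uses the h2o Python package; the file path and the column names are hypothetical placeholders, and the gradient boosting machine stands in for any of the algorithms above:

    # A minimal sketch using the h2o Python package. The CSV path and the
    # column names "x1", "x2", and "label" are hypothetical placeholders.
    import h2o
    from h2o.estimators import H2OGradientBoostingEstimator

    h2o.init()                             # start or attach to a local H2O cluster

    # Data is parsed into a distributed H2OFrame, not into client memory.
    frame = h2o.import_file("data.csv")

    model = H2OGradientBoostingEstimator(ntrees=50)
    model.train(x=["x1", "x2"], y="label", training_frame=frame)
    print(model.model_performance(frame))  # metrics on the training frame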

H2O can also run on Apache Spark, through the Sparkling Water extension. [7]

Iterative methods for real-time problems

H2O uses iterative methods that provide quick answers using all of the client's data. When a client cannot wait for an optimal solution, the client can interrupt the computations and use an approximate solution. [1] In its approach to deep learning, [2] [6] [8] H2O divides all the data into subsets and then analyzes each subset simultaneously using the same method. The results are combined to estimate parameters using the Hogwild scheme, [9] a parallel stochastic gradient method. [10] These methods allow H2O to provide answers that use all the client's data, rather than throwing away most of it and analyzing a subset with conventional software.
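
As a rough illustration of the Hogwild pattern (a toy sketch, not H2O's implementation), the Python code below lets several threads apply stochastic gradient updates to a shared weight vector without any locking; the data and the least-squares model are synthetic:

    # Toy Hogwild-style SGD on a synthetic least-squares problem.
    # Threads read and write the shared vector w with no locks; in CPython
    # the GIL serializes the updates, so this only illustrates the pattern.
    import threading
    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(10_000, 5))
    true_w = np.arange(1.0, 6.0)
    y = X @ true_w + 0.01 * rng.normal(size=10_000)

    w = np.zeros(5)      # shared, unprotected parameter vector
    lr = 0.01

    def worker(rows):
        for i in rows:
            grad = (X[i] @ w - y[i]) * X[i]  # gradient of 0.5*(x.w - y)^2
            w[:] = w - lr * grad             # lock-free in-place update

    # Each thread sweeps its own subset of the rows concurrently.
    parts = np.array_split(rng.permutation(len(X)), 4)
    threads = [threading.Thread(target=worker, args=(p,)) for p in parts]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print(w)             # approximately true_w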

Software

Programming languages

The H2O software has an interface to the following programming languages: Java (6 or later), Python (2.7.x, 3.5.x), R (3.0.0 or later), and Scala (1.4-1.6). [2] [3]
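
From Python, for instance, a session begins by attaching to (or launching) an H2O instance; this minimal sketch assumes the h2o package, and the host and port shown are H2O's defaults:

    # Attach to a running H2O instance from Python; "localhost" and
    # 54321 are H2O's default host and port.
    import h2o

    h2o.init(ip="localhost", port=54321)  # connect, or start a local JVM
    h2o.cluster().show_status()           # print a summary of the cluster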

Operating systems

The H2O software runs on the conventional operating systems Microsoft Windows (7 or later), Mac OS X (10.9 or later), and Linux (Ubuntu 12.04; RHEL/CentOS 6 or later). [3] It also runs on big-data systems, in particular several popular distributions of the Apache Hadoop Distributed File System (HDFS): Cloudera (5.1 or later), MapR (3.0 or later), and Hortonworks (HDP 2.1 or later). It also operates in cloud computing environments, for example on Amazon EC2, Google Compute Engine, and Microsoft Azure. The H2O Sparkling Water software is Databricks-certified on Apache Spark. [3]

Graphical user interface and browsers

Its graphical user interface is compatible with four browsers (in their latest versions as of 1 June 2015, unless otherwise specified): Chrome, Safari, Firefox, and Internet Explorer (IE10). [3]

Notes

  1. Harris (2012)
  2. Novet (2014)
  3. "Recommended systems for H2O". 0xdata.com. H2O.ai. May 2015. Archived from the original on 2015-05-30. Retrieved 2015-06-01.
  4. Hardy (2014)
  5. https://github.com/h2oai/h2o-2/blob/master/LICENSE.txt
  6. Aiello, Spencer; Kraljevic, Tom; Maj, Petr (2015), with contributions from the 0xdata team. "h2o: R Interface for H2O". The Comprehensive R Archive Network (CRAN), Contributed Packages. The R Project for Statistical Computing (3.0.0.12).
  7. "FAQ — H2O 3.10.2.1 documentation". docs.h2o.ai. Retrieved 2017-01-28.
  8. Tripathi, Rashmi; Kumari, Vandana; Patel, Sunil; Singh, Yashbir; Varadwaj, Pritish (2015). "Prediction of lncRNA using Deep Learning Approach". International Conference on Advances in Biotechnology (BioTech), Proceedings: 138–142. Singapore: Global Science and Technology Forum.
  9. Description of the iterative method for computing maximum-likelihood estimates for a generalized linear model.
  10. Recht, Benjamin; Ré, Christopher; Wright, Stephen; Niu, Feng (2011). J. Shawe-Taylor; R. S. Zemel; P. L. Bartlett; F. Pereira; K. Q. Weinberger, eds. "Hogwild: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent" (PDF). Advances in Neural Information Processing Systems. Curran Associates, Inc. 24: 693–701.
