Cascading (software)

Stable release 3.3.0 / March 24, 2018 [1]
Preview release 4.0-wip-120 / March 27, 2021 [2]
Repository github.com/Cascading/cascading
Written in Java
License Apache License v2 [3]
Website www.cascading.org

Cascading is a software abstraction layer for Apache Hadoop and Apache Flink. It is used to create and execute complex data processing workflows on a Hadoop cluster from any JVM-based language (Java, JRuby, Clojure, etc.), hiding the underlying complexity of MapReduce jobs. It is open source and available under the Apache License. Commercial support is available from Driven, Inc. [4]

Cascading was originally authored by Chris Wensel, who later founded Concurrent, Inc., which has since been rebranded as Driven. [5] Cascading is actively developed by the community [citation needed], and a number of add-on modules are available. [6]

Architecture

To use Cascading, Apache Hadoop must also be installed, and the Hadoop job .jar must contain the Cascading .jars. Cascading consists of a data processing API, an integration API, a process planner and a process scheduler.

Cascading leverages the scalability of Hadoop but abstracts standard data processing operations away from the underlying map and reduce tasks. [7] [better source needed] Developers use Cascading to create a .jar file that describes the required processes. It follows a ‘source-pipe-sink’ paradigm: data is captured from sources, passes through reusable ‘pipes’ that perform data analysis processes, and the results are stored in output files or ‘sinks’. Pipes are created independently of the data they will process. Once a pipe assembly is tied to data sources and sinks, it is called a ‘flow’. Flows can be grouped into a ‘cascade’, and the process scheduler ensures that a given flow does not execute until all its dependencies are satisfied. Pipes and flows can be reused and reordered to support different business needs. [8]
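The source-pipe-sink paradigm and dependency-ordered scheduling described above can be illustrated with a small, self-contained sketch in plain Java. This is a toy model, not the Cascading API: the names `CascadeSketch`, `Pipe`, `Flow`, `schedule`, `SPLIT_WORDS` and `COUNT_WORDS` are hypothetical, in-memory lists stand in for sources and sinks, and nothing here runs on Hadoop.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

public class CascadeSketch {

    /** A pipe is a reusable transformation, defined without reference to any data. */
    interface Pipe {
        List<String> apply(List<String> input);
    }

    /** A flow binds a pipe to a source and a sink, and may depend on other flows. */
    static final class Flow {
        final String name;
        final List<String> source;
        final Pipe pipe;
        final List<String> sink;
        final List<Flow> dependsOn;

        Flow(String name, List<String> source, Pipe pipe, List<String> sink, List<Flow> dependsOn) {
            this.name = name;
            this.source = source;
            this.pipe = pipe;
            this.sink = sink;
            this.dependsOn = dependsOn;
        }

        void run() {
            sink.addAll(pipe.apply(source));
        }
    }

    /** Hypothetical pipe: splits each line into whitespace-separated words. */
    static final Pipe SPLIT_WORDS = lines -> {
        List<String> words = new ArrayList<>();
        for (String line : lines) {
            for (String w : line.trim().split("\\s+")) {
                if (!w.isEmpty()) words.add(w);
            }
        }
        return words;
    };

    /** Hypothetical pipe: counts occurrences and emits "word=count" in sorted order. */
    static final Pipe COUNT_WORDS = words -> {
        Map<String, Integer> counts = new TreeMap<>();
        for (String w : words) counts.merge(w, 1, Integer::sum);
        List<String> out = new ArrayList<>();
        counts.forEach((word, n) -> out.add(word + "=" + n));
        return out;
    };

    /** Orders flows so that every flow appears after all flows it depends on. */
    static List<Flow> schedule(List<Flow> flows) {
        List<Flow> ordered = new ArrayList<>();
        Set<Flow> visiting = new HashSet<>();
        for (Flow f : flows) visit(f, visiting, ordered);
        return ordered;
    }

    private static void visit(Flow f, Set<Flow> visiting, List<Flow> ordered) {
        if (ordered.contains(f)) return;
        if (!visiting.add(f)) throw new IllegalStateException("cyclic dependency at " + f.name);
        for (Flow dep : f.dependsOn) visit(dep, visiting, ordered);
        visiting.remove(f);
        ordered.add(f);
    }

    public static void main(String[] args) {
        List<String> lines = List.of("to be or not", "to be");
        List<String> words = new ArrayList<>();   // sink of the first flow, source of the second
        List<String> counts = new ArrayList<>();

        Flow split = new Flow("split", lines, SPLIT_WORDS, words, List.of());
        Flow count = new Flow("count", words, COUNT_WORDS, counts, List.of(split));

        // The flows are passed out of order; the schedule runs "split" first.
        for (Flow f : schedule(List.of(count, split))) f.run();

        System.out.println(counts); // prints [be=2, not=1, or=1, to=2]
    }
}
```

The two pipes are defined before any data exists and could be bound to different sources and sinks in other flows, which is the reuse property the paragraph describes. The depth-first ordering in `schedule` is only one simple way to respect dependencies; it is not a description of how Cascading's own planner or scheduler is implemented.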

Developers write the code in a JVM-based language and do not need to learn MapReduce. The resulting program can be regression tested and integrated with external applications like any other Java application. [9]

Cascading is most often used for ad targeting, log file analysis, bioinformatics, machine learning, predictive analytics, web content mining, and extract, transform and load (ETL) applications. [5]

Uses of Cascading

Cascading was cited as one of the top five most powerful Hadoop projects by SD Times in 2011, [10] [unreliable source?] and as a major open source project relevant to bioinformatics, [11] [unreliable source?] and it is included in Hadoop: The Definitive Guide, by Tom White. [12] The project has also been cited in presentations, conference proceedings and Hadoop user group meetings as a useful tool for working with Hadoop [13] [14] [15] [16] and with Apache Spark. [17]

Domain-Specific Languages Built on Cascading

References

  1. "Releases · Cascading/cascading". github.com. Retrieved 2021-03-29.
  2. "Releases · cwensel/cascading". github.com. Retrieved 2021-03-29.
  3. "cascading/LICENSE.txt at 3.3 · Cascading/cascading". github.com. Retrieved 2021-03-29.
  4. "Cascading and Driven | Support". Driven.
  5. "Integrate.io - One Platform To Support Your Entire Data Journey". Integrate.io.
  6. "Cascading modules". Archived from the original on 2011-08-11. Retrieved 2011-08-22.
  7. Blog post by Etsy describing their use of Cascading with Hadoop.
  8. "Cascading User Guide" (PDF). Archived from the original (PDF) on February 6, 2011.
  9. "Hadoop Application Performance Management - DRIVEN's Features". Driven.
  10. Handy, Alex (1 June 2011). "The top five most powerful Hadoop projects". SD Times. Retrieved 26 October 2013.
  11. Taylor, Ronald (21 December 2010). "An overview of the Hadoop/MapReduce/HBase framework and its current applications in bioinformatics". BioMed Central. Springer Science+Business Media. Retrieved 26 October 2013.
  12. White, Tom (September 24, 2010). Hadoop: The Definitive Guide. O'Reilly Media, Inc. ISBN 9781449396893, via Google Books.
  13. "Getting Started on Hadoop". www.slideshare.net.
  14. "Julio Guijarro, Steve Loughran and Paolo Castagna, "Hadoop and beyond," HP Labs, Bristol UK, 2008" (PDF). Archived from the original (PDF) on 2011-10-01. Retrieved 2011-08-22.
  15. "Flightcaster Presentation Hadoop". www.slideshare.net.
  16. "NoSQL, Hadoop, Cascading June 2010". www.slideshare.net.
  17. "Using Cascading to Build Data-centric Applications on Spark". Spark Summit 2014. 2014-05-07. Retrieved 2016-03-25.
  18. "Cascading.Multitool on AWS".
  19. "AWS Articles". Amazon Web Services, Inc.
  20. BackType blog Archived August 25, 2011, at the Wayback Machine
  21. "FlightCaster".
  22. "Ion Flux". Archived from the original on October 23, 2011.
  23. RapLeaf Blog Archived February 1, 2011, at the Wayback Machine
  24. "Razorfish Case Study". Amazon Web Services, Inc.
  25. "PyCascading is no longer maintained". GitHub. 17 September 2021.
  26. "Cascading.JRuby". August 8, 2018 via GitHub.
  27. "Cascalog". June 23, 2023 via GitHub.
  28. "Scalding". June 22, 2023 via GitHub.