Data-centric programming language

A data-centric programming language is one whose primary function is the management and manipulation of data. Such a language includes built-in processing primitives for accessing data stored in sets, tables, lists, and other data structures and databases, and for the specific manipulation and transformation of data required by an application. Data-centric programming languages are typically declarative and often dataflow-oriented: the programmer defines the desired processing result, and the specific processing steps required to produce it are left to the language compiler. The SQL relational database language is an example of a declarative, data-centric language. Declarative, data-centric programming languages are ideal for data-intensive computing applications.

Background

The rapid growth of the Internet and World Wide Web has led to huge amounts of information available online, along with the need for Big Data processing capabilities. Business and government organizations create large amounts of both structured and unstructured information which needs to be processed, analyzed, and linked. [1] Storing, managing, accessing, and processing this vast amount of data represents a fundamental need and an immense challenge if the data is to be searched, analyzed, mined, and visualized as information. [2] Declarative, data-centric languages are increasingly addressing these problems, because focusing on the data makes these problems much simpler to express. [3]

Computer system architectures such as Hadoop and HPCC, which can support data-parallel applications, are a potential solution to the terabyte- and petabyte-scale data processing requirements of data-intensive computing. [4] [5] Clusters of commodity hardware are commonly used to address Big Data problems. [6] The fundamental challenges for Big Data applications and data-intensive computing [7] are managing and processing exponentially growing data volumes, significantly reducing the associated data analysis cycles to support practical, timely applications, and developing new algorithms which can scale to search and process massive amounts of data. The National Science Foundation has identified key issues related to data-intensive computing, such as programming abstractions, including models, languages, and algorithms, which allow a natural expression of parallel processing of data. [8] Declarative, data-centric programming languages are well suited to this class of problems.

Data-centric programming languages provide a processing approach in which applications are expressed in terms of high-level operations on data, and the runtime system transparently controls the scheduling, execution, load balancing, communications, and movement of programs and data across the computing cluster. [9] The programming abstraction and language tools allow the processing to be expressed in terms of data flows and transformations incorporating shared libraries of common data manipulation algorithms such as sorting.

Declarative, data-centric programming languages are inherently adaptable to various forms of distributed computing, including clusters, data grids, and cloud computing. [10] Using declarative, data-centric programming languages suggests more than just adapting to a new computing capability; it also suggests changes to the thought process of data analysis and application design. [11]

Data-centric language examples

SQL is the best-known declarative, data-centric programming language; in use since the 1980s, it became a de facto standard for use with relational databases. However, a variety of new system architectures and associated programming languages have been implemented for data-intensive computing, Big Data applications, and large-scale data analysis applications. Most data growth is in unstructured form, [12] and new processing paradigms with more flexible data models were needed. Several solutions have emerged, including the MapReduce architecture pioneered by Google and now available in an open-source implementation called Hadoop, used by Yahoo, Facebook, and others, and the HPCC system architecture offered by LexisNexis Risk Solutions.
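
The declarative style can be seen in even a short SQL query: the statement below names only the desired result, an average pagerank per category over a hypothetical urls table, and leaves the access paths, grouping strategy, and execution order to the database engine. The table and column names are illustrative and are not drawn from any of the systems described here.

    -- Hypothetical table: urls(url, category, pagerank)
    SELECT category,
           AVG(pagerank) AS avg_rank
    FROM urls
    WHERE pagerank > 0.2        -- keep only higher-ranked pages
    GROUP BY category
    ORDER BY avg_rank DESC;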

Hadoop Pig

Figure 1: Sample Pig Latin program
Figure 2: Pig program translation to MapReduce

Hadoop is an open-source software project sponsored by The Apache Software Foundation (http://www.apache.org) which implements the MapReduce architecture. The Hadoop execution environment supports additional distributed data processing capabilities which are designed to run using the Hadoop MapReduce architecture. These include Pig, a high-level data-flow programming language and execution framework for data-intensive computing. Pig was developed at Yahoo! to provide a specific data-centric language notation for data analysis applications, to improve programmer productivity, and to reduce development cycles when using the Hadoop MapReduce environment. Pig programs are automatically translated into sequences of MapReduce programs as needed by the execution environment. Pig provides language capabilities for loading, storing, filtering, grouping, de-duplication, ordering, sorting, aggregation, and joining operations on the data. [13] Figure 1 shows a sample Pig program and Figure 2 shows how it is translated into a series of MapReduce operations.
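
Since the original figure is not reproduced here, the following is a minimal Pig Latin sketch in the spirit of the data-flow style described above; the input file, field names, and filter threshold are illustrative assumptions rather than the exact content of Figure 1.

    -- Load a hypothetical input file; field names and types are assumptions.
    urls      = LOAD 'urls.txt' AS (url:chararray, category:chararray, pagerank:float);
    -- Filter: keep only pages above a chosen pagerank threshold.
    good_urls = FILTER urls BY pagerank > 0.2;
    -- Group the remaining pages by category.
    grouped   = GROUP good_urls BY category;
    -- Aggregate: compute the average pagerank per category.
    avg_ranks = FOREACH grouped GENERATE group, AVG(good_urls.pagerank);
    -- Store the result; Pig translates these steps into MapReduce jobs.
    STORE avg_ranks INTO 'avg_ranks_by_category';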

HPCC ECL

Figure 3: ECL sample syntax for JOIN operation

The HPCC data-intensive computing platform from LexisNexis Risk Solutions includes a new high-level declarative, data-centric programming language called ECL. ECL allows the programmer to define what the data processing result should be and the dataflows and transformations that are necessary to achieve the result. The ECL language includes extensive capabilities for data definition, filtering, data management, and data transformation, and provides an extensive set of built-in functions to operate on records in datasets which can include user-defined transformation functions. ECL programs are compiled into optimized C++ source code, which is subsequently compiled into executable code and distributed to the nodes of a processing cluster. ECL combines data representation with algorithm implementation, and is the fusion of both a query language and a parallel data processing language.

Figure 4: ECL code example
Figure 5: ECL code example for NLP

ECL includes built-in data transform operations which operate across entire datasets, including PROJECT, ITERATE, ROLLUP, JOIN, COMBINE, FETCH, NORMALIZE, DENORMALIZE, and PROCESS. For example, the transform function defined for a JOIN operation receives two records, one from each dataset being joined, can perform any operations on the fields in the pair of records, and returns an output record which can be completely different from either of the input records. Example syntax for the JOIN operation from the ECL Language Reference Manual is shown in Figure 3. Figure 4 shows an example of the equivalent ECL code for the Pig example program shown in Figure 1.
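
As a concrete illustration of the JOIN pattern described above (and not the code shown in Figure 3 or Figure 4), the following is a minimal ECL sketch: the logical file names and record layouts are hypothetical, and the TRANSFORM simply combines one field from each input record into an output layout that matches neither input.

    // Hypothetical record layouts and logical file names.
    PersonRec := RECORD
      UNSIGNED4 id;
      STRING30  name;
    END;
    AddrRec := RECORD
      UNSIGNED4 id;
      STRING50  city;
    END;
    people    := DATASET('~example::people', PersonRec, THOR);
    addresses := DATASET('~example::addresses', AddrRec, THOR);

    // The output record need not match either input layout.
    OutRec := RECORD
      STRING30 name;
      STRING50 city;
    END;

    // The TRANSFORM receives one record from each dataset being joined.
    OutRec JoinRecs(PersonRec L, AddrRec R) := TRANSFORM
      SELF.name := L.name;
      SELF.city := R.city;
    END;

    joined := JOIN(people, addresses, LEFT.id = RIGHT.id, JoinRecs(LEFT, RIGHT));
    OUTPUT(joined);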

The ECL programming language also provides built-in primitives for Natural language processing (NLP) with PATTERN statements and the built-in PARSE operation. PATTERN statements allow matching patterns including regular expressions to be defined and used to parse information from unstructured data such as raw text. PATTERN statements can be combined to implement complex parsing operations or complete grammars from Backus–Naur form (BNF) definitions. The PARSE operation operates across a dataset of records on a specific field within a record, this field could be an entire line in a text file for example. Using this capability of the ECL language is possible to implement parallel processing form information extraction applications across document files and all types of unstructured and semi-structured data including XML-based documents or Web pages. Figure 5 shows an example of ECL code used in a log analysis application which incorporates NLP.
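
As a rough sketch of how these primitives fit together (and not the log-analysis code of Figure 5), the fragment below defines a simple PATTERN from a regular expression and applies PARSE to a hypothetical dataset of raw text lines. The dataset name and the MAX option are illustrative assumptions; the ECL Language Reference documents the full set of PARSE options.

    // A hypothetical dataset of raw text: one free-text line per record.
    LineRec := RECORD
      STRING line;
    END;
    lines := DATASET('~example::raw_text', LineRec, CSV);

    // PATTERN statements define what to match, here via a regular expression;
    // patterns can be composed to build up larger grammars.
    PATTERN Alpha := PATTERN('[A-Za-z]+');
    PATTERN Word  := Alpha;

    // MATCHTEXT returns the text matched by the named pattern.
    OutRec := RECORD
      STRING token := MATCHTEXT(Word);
    END;

    // PARSE runs the pattern over the chosen field of every record in the dataset.
    tokens := PARSE(lines, line, Word, OutRec, MAX);
    OUTPUT(tokens);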

Related Research Articles

IBM Db2 Family

Db2 is a family of data management products, including database servers, developed by IBM. They initially supported the relational model, but were extended to support object-relational features and non-relational structures like JSON and XML. The brand name was originally styled as DB/2, then DB2 until 2017 and finally changed to its present form.

MapReduce is a programming model and an associated implementation for processing and generating big data sets with a parallel, distributed algorithm on a cluster.

In computing, hardware acceleration is the use of computer hardware specially made to perform some functions more efficiently than is possible in software running on a general-purpose central processing unit (CPU). Any transformation of data or routine that can be computed can be calculated purely in software running on a generic CPU, purely in custom-made hardware, or in some mix of both. An operation can be computed faster in application-specific hardware designed or programmed to compute it than in software running on a general-purpose computer processor. Each approach has advantages and disadvantages. The implementation of computing tasks in hardware to decrease latency and increase throughput is known as hardware acceleration.

Stream processing is a computer programming paradigm, equivalent to dataflow programming, event stream processing, and reactive programming, that allows some applications to more easily exploit a limited form of parallel processing. Such applications can use multiple computational units, such as the floating point unit on a graphics processing unit or field-programmable gate arrays (FPGAs), without explicitly managing allocation, synchronization, or communication among those units.

Apache Hadoop is a collection of open-source software utilities that facilitate using a network of many computers to solve problems involving massive amounts of data and computation. It provides a software framework for distributed storage and processing of big data using the MapReduce programming model. Hadoop was originally designed for computer clusters built from commodity hardware, which is still the common use. It has since also found use on clusters of higher-end hardware. All the modules in Hadoop are designed with a fundamental assumption that hardware failures are common occurrences and should be automatically handled by the framework.

Dryad was a research project at Microsoft Research for a general purpose runtime for execution of data parallel applications. The research prototypes of the Dryad and DryadLINQ data-parallel processing frameworks are available in source form at GitHub.

Sector/Sphere is an open source software suite for high-performance distributed data storage and processing. It can be broadly compared to Google's GFS and MapReduce technology. Sector is a distributed file system targeting data storage over a large number of commodity computers. Sphere is the programming architecture framework that supports in-storage parallel data processing for data stored in Sector. Sector/Sphere operates in a wide area network (WAN) setting.

In computing, algorithmic skeletons, or parallelism patterns, are a high-level parallel programming model for parallel and distributed computing.

Apache Pig

Apache Pig is a high-level platform for creating programs that run on Apache Hadoop. The language for this platform is called Pig Latin. Pig can execute its Hadoop jobs in MapReduce, Apache Tez, or Apache Spark. Pig Latin abstracts the programming from the Java MapReduce idiom into a notation which makes MapReduce programming high level, similar to that of SQL for relational database management systems. Pig Latin can be extended using user-defined functions (UDFs) which the user can write in Java, Python, JavaScript, Ruby or Groovy and then call directly from the language.

Apache Hive

Apache Hive is a data warehouse software project built on top of Apache Hadoop for providing data query and analysis. Hive gives an SQL-like interface to query data stored in various databases and file systems that integrate with Hadoop. Traditional SQL queries must be implemented in the MapReduce Java API to execute SQL applications and queries over distributed data. Hive provides the necessary SQL abstraction to integrate SQL-like queries (HiveQL) into the underlying Java without the need to implement queries in the low-level Java API. Since most data warehousing applications work with SQL-based querying languages, Hive aids portability of SQL-based applications to Hadoop. While initially developed by Facebook, Apache Hive is used and developed by other companies such as Netflix and the Financial Industry Regulatory Authority (FINRA). Amazon maintains a software fork of Apache Hive included in Amazon Elastic MapReduce on Amazon Web Services.

Data-intensive computing is a class of parallel computing applications which use a data parallel approach to process large volumes of data, typically terabytes or petabytes in size, commonly referred to as big data. Computing applications which devote most of their execution time to computational requirements are deemed compute-intensive, whereas computing applications which require large volumes of data and devote most of their processing time to I/O and manipulation of data are deemed data-intensive.

ECL is a declarative, data-centric programming language designed in 2000 to allow a team of programmers to process big data across a high performance computing cluster without the programmer being involved in many of the lower level, imperative decisions.

HPCC

HPCC, also known as DAS, is an open source, data-intensive computing system platform developed by LexisNexis Risk Solutions. The HPCC platform incorporates a software architecture implemented on commodity computing clusters to provide high-performance, data-parallel processing for applications utilizing big data. The HPCC platform includes system configurations to support both parallel batch data processing (Thor) and high-performance online query applications using indexed data files (Roxie). The HPCC platform also includes a data-centric declarative programming language for parallel data processing called ECL.

Apache Drill

Apache Drill is an open-source software framework that supports data-intensive distributed applications for interactive analysis of large-scale datasets. Drill is the open source version of Google's Dremel system which is available as an infrastructure service called Google BigQuery. One explicitly stated design goal is that Drill is able to scale to 10,000 servers or more and to be able to process petabytes of data and trillions of records in seconds. Drill is an Apache top-level project.

A distributed file system for cloud is a file system that allows many clients to have access to data and supports operations on that data. Each data file may be partitioned into several parts called chunks. Each chunk may be stored on different remote machines, facilitating the parallel execution of applications. Typically, data is stored in files in a hierarchical tree, where the nodes represent directories. There are several ways to share files in a distributed architecture: each solution must be suitable for a certain type of application, depending on how complex the application is. Meanwhile, the security of the system must be ensured. Confidentiality, availability and integrity are the main keys for a secure system.

Apache Spark

Apache Spark is an open-source distributed general-purpose cluster-computing framework. Spark provides an interface for programming entire clusters with implicit data parallelism and fault tolerance. Originally developed at the University of California, Berkeley's AMPLab, the Spark codebase was later donated to the Apache Software Foundation, which has maintained it since.

Data lineage includes the data origin, what happens to it and where it moves over time. Data lineage gives visibility while greatly simplifying the ability to trace errors back to the root cause in a data analytics process.

Azure Data Lake is a scalable data storage and analytics service. The service is hosted in Azure, Microsoft's public cloud.

John Darlington (academic)

John Darlington is a British academic, researcher and author. He is an Emeritus Professor at Imperial College London. He was Director of the London e-Science Centre and was head of the Functional Programming and Social Computing Sections at Imperial.

References

  1. "Data-Intensive Technologies for Cloud Computing" by A. M. Middleton. Handbook of Cloud Computing. Springer, 2010.
  2. "Got Data? A Guide to Data Preservation in the Information Age" by F. Berman. Communications of the ACM, Vol. 51, No. 12, 2008, pp. 50–66.
  3. "The Data Centric Gambit" by J. Hellerstein, 2008.
  4. "A Design Methodology for Data-Parallel Applications" by L. S. Nyland, J. F. Prins, A. Goldberg, and P. H. Mills. Handbook of Cloud Computing. Springer, 2010.
  5. "The terascale challenge" by D. Ravichandran, P. Pantel, and E. Hovy. Proceedings of the KDD Workshop on Mining for and from the Semantic Web, 2004.
  6. "BOOM: Data-Centric Programming in the Datacenter" by P. Alvaro, T. Condie, N. Conway, K. Elmeleegy, J. Hellerstein, and R. Sears. Electrical Engineering and Computer Sciences Department, University of California at Berkeley, Technical Report, 2009.
  7. "Data-Intensive Computing in the 21st Century" by I. Gorton, P. Greenfield, A. Szalay, and R. Williams. IEEE Computer, Vol. 41, No. 4, 2008, pp. 30–32.
  8. "Data-Intensive Computing" NSF, 2009.
  9. "Data Intensive Scalable Computing" by R. E. Bryant, 2008.
  10. "Bamboo: A Data-Centric, Object-Oriented Approach to Many-core Software" by J. Zhou and B. Demsky. Programming Language Design and Implementation, 2010.
  11. "Data-Centric Computing with the Netezza Architecture" by G. S. Davison, K. W. Boyack, R. A. Zacharski, S. C. Helmreich, and J. R. Cowie. Sandia National Laboratories, Technical Report, 2006.
  12. "The Expanding Digital Universe" by J. F. Gantz, D. Reinsel, C. Chute, W. Schlichting, J. McArthur, S. Minton, J. Xheneti, A. Toncheva, and A. Manfrediz. IDC, White Paper, 2007.
  13. "Pig Latin: A Not-So-Foreign Language for Data Processing" by C. Olston, B. Reed, U. Srivastava, R. Kumar, and A. Tomkins. Stanford University, 2008.