ECL (data-centric programming language)

ECL
Paradigm: declarative, structured, data-centric
Developer: HPCC Systems®, LexisNexis Risk Solutions
First appeared: 2000
Typing discipline: static, strong, safe
OS: GNU/Linux
Website: http://hpccsystems.com/
Influenced by: Prolog, Pascal, SQL, Snobol4, C++, Clarion

ECL is a declarative, data-centric programming language designed in 2000 to allow a team of programmers to process big data across a high-performance computing cluster without being involved in many of the lower-level, imperative decisions. [1] [2]

History

ECL was initially designed and developed in 2000 by David Bayliss as an in-house productivity tool within Seisint Inc and was considered to be a ‘secret weapon’ that allowed Seisint to gain market share in its data business. Equifax had an SQL-based process for predicting who would go bankrupt in the next 30 days, but it took 26 days to run the data. The first ECL implementation solved the same problem in 6 minutes. The technology was cited as a driving force behind the acquisition of Seisint by LexisNexis and then again as a major source of synergies when LexisNexis acquired ChoicePoint Inc. [3]

Language constructs

ECL, at least in its purest form, is a declarative, data-centric language. Programs, in the strictest sense, do not exist. Rather, an ECL application specifies a number of core datasets (or data values) and then the operations to be performed on those values.

Hello world

ECL is designed to have succinct solutions to problems and sensible defaults. The "Hello World" program is characteristically short:

'Hello World'

Perhaps a more flavorful example would take a list of strings, sort them into order, and then return that as a result instead.

// First declare a dataset with one column containing a list of strings
// Datasets can also be binary, CSV, XML or externally defined structures
D := DATASET([{'ECL'},{'Declarative'},{'Data'},{'Centric'},{'Programming'},{'Language'}],{STRING Value;});
SD := SORT(D,Value);
OUTPUT(SD)

Statements containing := are defined in ECL as attribute definitions. They do not denote an action; rather, they define a term. Thus, logically, an ECL program can be read "bottom to top":

OUTPUT(SD)

What is an SD?

SD:=SORT(D,Value);

SD is a D that has been sorted by ‘Value’

What is a D?

D:=DATASET([{'ECL'},{'Declarative'},{'Data'},{'Centric'},{'Programming'},{'Language'}],{STRING Value;});

D is a dataset with one column labeled ‘Value’, containing the list of strings given in its definition.

ECL primitives

ECL primitives that act upon datasets include SORT, ROLLUP, DEDUP, ITERATE, PROJECT, JOIN, NORMALIZE, DENORMALIZE, PARSE, CHOOSEN, ENTH, TOPN, and DISTRIBUTE.
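As a rough illustration of how a few of these primitives combine, the sketch below sorts a small inline dataset, removes duplicates, and keeps the two records with the highest age. The record layout and the names PersonRec, people, uniquePeople and oldest are hypothetical and chosen only for this example.

```ecl
// Hypothetical record layout, used only for this illustration
PersonRec := RECORD
  STRING    firstName;
  UNSIGNED1 age;
END;

people := DATASET([{'Ann', 34}, {'Bob', 27}, {'Ann', 34}, {'Cy', 51}], PersonRec);

// DEDUP compares adjacent records, so the input is sorted on the same field first
uniquePeople := DEDUP(SORT(people, firstName), firstName);

// TOPN returns the first 2 records in the given order (descending age here)
oldest := TOPN(uniquePeople, 2, -age);

OUTPUT(uniquePeople);
OUTPUT(oldest);
```

Each line with := is again only a definition; nothing is evaluated until an action such as OUTPUT requires it.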

ECL encapsulation

Whilst ECL is terse, and LexisNexis claims that 1 line of ECL is roughly equivalent to 120 lines of C++, it still has significant support for large-scale programming, including data encapsulation and code re-use. The constructs available include MODULE, FUNCTION, FUNCTIONMACRO, INTERFACE, MACRO, EXPORT, and SHARED.
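A minimal sketch of how some of these constructs fit together is shown below, assuming a module defined inline in a single file. The names Greeter, Prefix, Greet, who and greeting are hypothetical and not taken from any real library.

```ecl
// Hypothetical module, defined inline for illustration
Greeter := MODULE
  // SHARED: usable by other definitions in the module, not by outside callers
  SHARED STRING Prefix := 'Hello, ';

  // EXPORT: part of the module's public interface
  EXPORT STRING Greet(STRING who) := FUNCTION
    greeting := Prefix + who;
    RETURN greeting;
  END;
END;

OUTPUT(Greeter.Greet('World'));
```

Callers refer to exported definitions with dot notation (Greeter.Greet), while Prefix stays an implementation detail of the module.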

Support for Parallelism in ECL

In the HPCC implementation, by default, most ECL constructs will execute in parallel across the hardware being used. Many of the primitives also have a LOCAL option to specify that the operation is to occur locally on each node.
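The difference between a global operation and its LOCAL form can be sketched as follows. This is an illustrative example only; the layout and the names WordRec, words, byKey, globalSorted and localSorted are assumptions made for this sketch.

```ecl
// Hypothetical layout and data, for illustration only
WordRec := RECORD
  STRING word;
END;

words := DATASET([{'data'}, {'ecl'}, {'data'}, {'cluster'}], WordRec);

// Global sort: the result is ordered across all nodes of the cluster
globalSorted := SORT(words, word);

// Place records with equal keys on the same node,
// then sort each node's records independently of the other nodes
byKey       := DISTRIBUTE(words, HASH(word));
localSorted := SORT(byKey, word, LOCAL);

OUTPUT(globalSorted, NAMED('GlobalSort'));
OUTPUT(localSorted, NAMED('LocalSort'));
```

With LOCAL, no ordering is guaranteed between nodes; only the records held on each individual node are sorted.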

Comparison to Map-Reduce

The Hadoop Map-Reduce paradigm consists of three phases which correlate to ECL primitives as follows.

Hadoop Name/Term | ECL equivalent | Comments
MAPing within the MAPper | PROJECT/TRANSFORM | Takes a record and converts to a different format; in the Hadoop case the conversion is into a key-value pair
SHUFFLE (Phase 1) | DISTRIBUTE(,HASH(KeyValue)) | The records from the mapper are distributed depending upon the KEY value
SHUFFLE (Phase 2) | SORT(,LOCAL) | The records arriving at a particular reducer are sorted into KEY order
REDUCE | ROLLUP(,Key,LOCAL) | The records for a particular KEY value are now combined
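The correspondence can be made concrete with a word-count sketch that applies the four ECL steps in order. It is an illustration under assumed names (WordRec, PairRec, ToPair, MergeCounts and the intermediate definitions are all hypothetical), not a prescribed way to count words in ECL.

```ecl
// Hypothetical layouts and data, for illustration only
WordRec := RECORD
  STRING word;
END;

PairRec := RECORD
  STRING    word;
  UNSIGNED4 cnt;
END;

words := DATASET([{'data'}, {'ecl'}, {'data'}, {'cluster'}, {'ecl'}, {'data'}], WordRec);

// MAP: turn each input record into a (word, 1) pair
PairRec ToPair(WordRec L) := TRANSFORM
  SELF.word := L.word;
  SELF.cnt  := 1;
END;
mapped := PROJECT(words, ToPair(LEFT));

// SHUFFLE (Phase 1): send records with the same key to the same node
shuffled := DISTRIBUTE(mapped, HASH(word));

// SHUFFLE (Phase 2): sort each node's records into key order
keySorted := SORT(shuffled, word, LOCAL);

// REDUCE: combine adjacent records that share a key
PairRec MergeCounts(PairRec L, PairRec R) := TRANSFORM
  SELF.word := L.word;
  SELF.cnt  := L.cnt + R.cnt;
END;
reduced := ROLLUP(keySorted, LEFT.word = RIGHT.word, MergeCounts(LEFT, RIGHT), LOCAL);

OUTPUT(reduced);
```

Because DISTRIBUTE places all records for a given word on one node, the local SORT and ROLLUP are sufficient to produce one total per word.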

References

  1. "A Guide to ECL". Lexis-Nexis.
  2. Yoo, A.; Kaplan, I. (2009). "Evaluating use of data flow systems for large graph analysis". Proceedings of the 2nd Workshop on Many-Task Computing on Grids and Supercomputers (MTAGS).
  3. "Acquisition of Seisint". Archived from the original on 2011-06-21. Retrieved 2011-03-24.