Jaql

Last updated
Jaql
Paradigm Functional
Designed by Vuk Ercegovac (Google)
First appearedOctober 9, 2008;15 years ago (2008-10-09)
Stable release
0.5.1 / July 12, 2010;14 years ago (2010-07-12)
Implementation language Java
OS Cross-platform
License Apache License 2.0
Website code.google.com/p/jaql/m
Major implementations
IBM BigInsights

Jaql (pronounced "jackal") is a functional data processing and query language most commonly used for JSON query processing on big data.

Contents

It started as an open source project at Google [1] but the latest release was on 2010-07-12. IBM [2] took it over as primary data processing language for their Hadoop software package BigInsights.

Although having been developed for JSON it supports a variety of other data sources like CSV, TSV, XML.

A comparison [3] to other BigData query languages like PIG Latin and Hive QL illustrates performance and usability aspects of these technologies.

Jaql supports [4] lazy evaluation, so expressions are only materialized when needed.

Syntax

The basic concept of Jaql is

source->operator(parameter)->sink;

where a sink can be a source for a downstream operator. So typically a Jaql program has to following structure, expressing a data processing graph:

source->operator1(parameter)->operator2(parameter)->operator2(parameter)->operator3(parameter)->operator4(parameter)->sink;

Most commonly for readability reasons Jaql programs are linebreaked after the arrow, as is also a common idiom in Twitter Scalding:

source->operator1(parameter)->operator2(parameter)->operator2(parameter)->operator3(parameter)->operator4(parameter)->sink;

Core operators

Source: [5]

Expand

Use the EXPAND expression to flatten nested arrays. This expression takes as input an array of nested arrays [ [ T ] ] and produces an output array [ T ], by promoting the elements of each nested array to the top-level output array.

Filter

Use the FILTER operator to filter away elements from the specified input array. This operator takes as input an array of elements of type T and outputs an array of the same type, retaining those elements for which a predicate evaluates to true. It is the Jaql equivalent of the SQL WHERE clause. Example:

data=[{name:"Jon Doe",income:20000,manager:false},{name:"Vince Wayne",income:32500,manager:false},{name:"Jane Dean",income:72000,manager:true},{name:"Alex Smith",income:25000,manager:false}];data->filter$.manager;[{"income":72000,"manager":true,"name":"Jane Dean"}]data->filter$.income<30000;[{"income":20000,"manager":false,"name":"Jon Doe"},{"income":25000,"manager":false,"name":"Alex Smith"}]

Group

Use the GROUP expression to group one or more input arrays on a grouping key and applies an aggregate function per group.

Join

Use the JOIN operator to express a join between two or more input arrays. This operator supports multiple types of joins, including natural, left-outer, right-outer, and outer joins.

Sort

Use the SORT operator to sort an input by one or more fields.

Top

The TOP expression selects the first k elements of its input. If a comparator is provided, the output is semantically equivalent to sorting the input, then selecting the first k elements.

Transform

Use the TRANSFORM operator to realize a projection or to apply a function to all items of an output.

See also

Related Research Articles

<span class="mw-page-title-main">AWK</span> Programming language

AWK is a domain-specific language designed for text processing and typically used as a data extraction and reporting tool. Like sed and grep, it is a filter, and is a standard feature of most Unix-like operating systems.

<span class="mw-page-title-main">JavaScript</span> High-level programming language

JavaScript, often abbreviated as JS, is a programming language and core technology of the Web, alongside HTML and CSS. 99% of websites use JavaScript on the client side for webpage behavior.

XSLT is a language originally designed for transforming XML documents into other XML documents, or other formats such as HTML for web pages, plain text or XSL Formatting Objects, which may subsequently be converted to other formats, such as PDF, PostScript and PNG. Support for JSON and plain-text transformation was added in later updates to the XSLT 1.0 specification.

Java Platform, Standard Edition is a computing platform for development and deployment of portable code for desktop and server environments. Java SE was formerly known as Java 2 Platform, Standard Edition (J2SE).

A list comprehension is a syntactic construct available in some programming languages for creating a list based on existing lists. It follows the form of the mathematical set-builder notation as distinct from the use of map and filter functions.

The SQL SELECT statement returns a result set of rows, from one or more tables.

<span class="mw-page-title-main">Java syntax</span> Set of rules defining correctly structured program

The syntax of Java is the set of rules defining how a Java program is written and interpreted.

<span class="mw-page-title-main">JavaScript syntax</span> Set of rules defining correctly structured programs

The syntax of JavaScript is the set of rules that define a correctly structured JavaScript program.

Tacit programming, also called point-free style, is a programming paradigm in which function definitions do not identify the arguments on which they operate. Instead the definitions merely compose other functions, among which are combinators that manipulate the arguments. Tacit programming is of theoretical interest, because the strict use of composition results in programs that are well adapted for equational reasoning. It is also the natural style of certain programming languages, including APL and its derivatives, and concatenative languages such as Forth. The lack of argument naming gives point-free style a reputation of being unnecessarily obscure, hence the epithet "pointless style".

Language Integrated Query is a Microsoft .NET Framework component that adds native data querying capabilities to .NET languages, originally released as a major part of .NET Framework 3.5 in 2007.

XPath is an expression language designed to support the query or transformation of XML documents. It was defined by the World Wide Web Consortium (W3C) in 1999, and can be used to compute values from the content of an XML document. Support for XPath exists in applications that support XML, such as web browsers, and many programming languages.

EMML, or Enterprise Mashup Markup Language, is an XML markup language for creating enterprise mashups, which are software applications that consume and mash data from variety of sources. These applications often perform logical or mathematical operations as well as present the data.

Ateji PX is an object-oriented programming language extension for Java. It is intended to facilliate parallel computing on multi-core processors, GPU, Grid and Cloud.

Data-intensive computing is a class of parallel computing applications which use a data parallel approach to process large volumes of data typically terabytes or petabytes in size and typically referred to as big data. Computing applications that devote most of their execution time to computational requirements are deemed compute-intensive, whereas applications are deemed data-intensive require large volumes of data and devote most of their processing time to I/O and manipulation of data.

<span class="mw-page-title-main">Array DBMS</span> System that provides database services specifically for arrays

An array database management system or array DBMS provides database services specifically for arrays, that is: homogeneous collections of data items, sitting on a regular grid of one, two, or more dimensions. Often arrays are used to represent sensor, simulation, image, or statistics data. Such arrays tend to be Big Data, with single objects frequently ranging into Terabyte and soon Petabyte sizes; for example, today's earth and space observation archives typically grow by Terabytes a day. Array databases aim at offering flexible, scalable storage and retrieval on this information category.

<span class="mw-page-title-main">UCBLogo</span> Logo programming language dialect

UCBLogo, also termed Berkeley Logo, is a programming language, a dialect of Logo, which derived from Lisp. It is a dialect of Logo intended to be a "minimum Logo standard".

Tritium is a simple scripting language for efficiently transforming structured data like HTML, XML, and JSON. It is similar in purpose to XSLT but has a syntax influenced by jQuery, Sass, and CSS versus XSLT's XML based syntax.

JSONiq is a query and functional programming language that is designed to declaratively query and transform collections of hierarchical and heterogeneous data in format of JSON, XML, as well as unstructured, textual data.

Data lineage includes the data origin, what happens to it, and where it moves over time. Data lineage provides visibility and simplifies tracing errors back to the root cause in a data analytics process.

jq (programming language) Programming language for JSON

jq is a very high-level lexically scoped functional programming language in which every JSON value is a constant. jq supports backtracking and managing indefinitely long streams of JSON data. It is related to the Icon and Haskell programming languages. The language supports a namespace-based module system and has some support for closures. In particular, functions and functional expressions can be used as parameters of other functions.

References

  1. Original Jaql project
  2. Initial Publication
  3. Stewart, Robert J.; Trinder, Phil W.; Loidl, Hans-Wolfgang (2011). "Comparing High Level MapReduce Query Languages". Advanced Parallel Processing Technologies. Lecture Notes in Computer Science. Vol. 6965. pp. 58–72. doi:10.1007/978-3-642-24151-2_5. ISBN   978-3-642-24150-5.
  4. JAQL in Hadoop, a brief introduction
  5. IBM BigInsights Documentation