Jaql

Paradigm Functional
Designed by Vuk Ercegovac (Google)
First appeared October 9, 2008
Stable release 0.5.1 / July 12, 2010
Implementation language Java
OS Cross-platform
License Apache License 2.0
Website code.google.com/p/jaql/m
Major implementations IBM BigInsights

Jaql (pronounced "jackal") is a functional data processing and query language most commonly used for JSON query processing on big data.

Jaql started as an open source project at Google [1], but its latest release dates from 2010-07-12. IBM [2] adopted it as the primary data processing language for its Hadoop software package, BigInsights.

Although it was developed for JSON, it supports a variety of other data sources such as CSV, TSV, and XML.

A comparison [3] with other big data query languages such as Pig Latin and Hive QL illustrates the performance and usability aspects of these technologies.

Jaql supports [4] lazy evaluation, so expressions are only materialized when needed.
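The idea of lazy evaluation can be illustrated outside Jaql. The following is a minimal Python sketch, not Jaql code: the source `read_records`, its fields, and the `log` list are hypothetical and exist only to show that building a pipeline reads nothing, and that only as many records are read as the final result needs.

```python
from itertools import islice

def read_records(log):
    """Hypothetical source: yields records one at a time and logs each read."""
    for i in range(1_000_000):
        log.append(i)
        yield {"id": i, "income": i * 1000}

def lazy_pipeline(log):
    records = read_records(log)                          # nothing is read yet
    high = (r for r in records if r["income"] >= 3000)   # still nothing is read
    return islice(high, 2)                               # still lazy

log = []
pipeline = lazy_pipeline(log)
reads_before = len(log)      # 0: defining the pipeline consumed no input
first_two = list(pipeline)   # materialization happens only here
reads_after = len(log)       # 5: reading stopped once two matches were found
```

Only five of the one million possible records are produced, because the consumer asks for just two matching elements.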

Syntax

The basic concept of Jaql is

source->operator(parameter)->sink;

where a sink can be a source for a downstream operator. So typically a Jaql program has the following structure, expressing a data processing graph:

source->operator1(parameter)->operator2(parameter)->operator2(parameter)->operator3(parameter)->operator4(parameter)->sink;

For readability, Jaql programs are most commonly broken onto a new line after each arrow, an idiom that is also common in Twitter Scalding:

source->
operator1(parameter)->
operator2(parameter)->
operator2(parameter)->
operator3(parameter)->
operator4(parameter)->
sink;
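The source, operator, sink chain described above can be mimicked in Python as left-to-right function composition. This is only an analogue of the arrow notation, not Jaql syntax; the helper names `pipe`, `keep`, and `apply_each` are made up for this sketch.

```python
from functools import reduce

def pipe(source, *operators):
    """Feed `source` through each operator in order, like source -> op1 -> op2."""
    return reduce(lambda data, op: op(data), operators, source)

def keep(predicate):
    """Build an operator that retains the elements satisfying `predicate`."""
    return lambda items: [x for x in items if predicate(x)]

def apply_each(fn):
    """Build an operator that applies `fn` to every element."""
    return lambda items: [fn(x) for x in items]

result = pipe(
    [1, 2, 3, 4, 5, 6],            # source
    keep(lambda x: x % 2 == 0),    # operator 1
    apply_each(lambda x: x * 10),  # operator 2
    sorted,                        # operator 3; its output plays the role of the sink
)
```

Each operator receives the output of the previous one, which is the sense in which a sink can act as the source for a downstream operator.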

Core operators [5]

Expand

Use the EXPAND expression to flatten nested arrays. This expression takes as input an array of nested arrays [ [ T ] ] and produces an output array [ T ], by promoting the elements of each nested array to the top-level output array.
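As a rough illustration of this [ [ T ] ] to [ T ] behaviour, here is a Python sketch of one-level flattening. It is an analogue of the described semantics rather than Jaql code, and the function name `expand` is chosen only to mirror the operator.

```python
from itertools import chain

def expand(nested):
    """Promote the elements of each inner list to a single top-level list."""
    return list(chain.from_iterable(nested))

nested = [[1, 2], [], [3], [4, 5, 6]]
flat = expand(nested)
```

Only one level of nesting is removed, so an input of lists of lists of lists would still yield lists as elements.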

Filter

Use the FILTER operator to filter away elements from the specified input array. This operator takes as input an array of elements of type T and outputs an array of the same type, retaining those elements for which a predicate evaluates to true. It is the Jaql equivalent of the SQL WHERE clause. Example:

data = [
  { name: "Jon Doe", income: 20000, manager: false },
  { name: "Vince Wayne", income: 32500, manager: false },
  { name: "Jane Dean", income: 72000, manager: true },
  { name: "Alex Smith", income: 25000, manager: false }
];

data -> filter $.manager;

[
  { "income": 72000, "manager": true, "name": "Jane Dean" }
]

data -> filter $.income < 30000;

[
  { "income": 20000, "manager": false, "name": "Jon Doe" },
  { "income": 25000, "manager": false, "name": "Alex Smith" }
]

Group

Use the GROUP expression to group one or more input arrays on a grouping key and apply an aggregate function per group.
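The grouping semantics can be sketched in Python for the single-array case. This is an analogue, not Jaql syntax: `group_by` and its parameters are hypothetical names, and the records reuse the employee data from the Filter example.

```python
from collections import defaultdict

def group_by(records, key, aggregate):
    """Group records on key(record), then apply aggregate(key, members) per group."""
    groups = defaultdict(list)
    for record in records:
        groups[key(record)].append(record)
    # Dicts preserve insertion order, so groups appear in first-seen order.
    return [aggregate(k, members) for k, members in groups.items()]

employees = [
    {"name": "Jon Doe", "income": 20000, "manager": False},
    {"name": "Vince Wayne", "income": 32500, "manager": False},
    {"name": "Jane Dean", "income": 72000, "manager": True},
    {"name": "Alex Smith", "income": 25000, "manager": False},
]

totals = group_by(
    employees,
    key=lambda r: r["manager"],
    aggregate=lambda k, members: {
        "manager": k,
        "total": sum(m["income"] for m in members),
    },
)
```

Grouping several input arrays on a common key is not covered by this sketch.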

Join

Use the JOIN operator to express a join between two or more input arrays. This operator supports multiple types of joins, including natural, left-outer, right-outer, and outer joins.
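To make the difference between join types concrete, the following Python sketch implements an equi-join on one shared field with an inner and a left-outer variant. It is not Jaql code and covers only two arrays; the function `join`, its `how` parameter, and the sample `users` and `orders` records are invented for illustration.

```python
def join(left, right, key, how="inner"):
    """Equi-join two lists of dicts on a shared field name."""
    # Index the right side by key so each left record can look up its matches.
    index = {}
    for r in right:
        index.setdefault(r[key], []).append(r)

    out = []
    for l in left:
        matches = index.get(l[key], [])
        for r in matches:
            out.append({**l, **r})   # merge the fields of both records
        if not matches and how == "left":
            out.append(dict(l))      # left-outer: keep the unmatched left record
    return out

users = [{"id": 1, "name": "Jon Doe"}, {"id": 2, "name": "Jane Dean"}]
orders = [{"id": 1, "item": "book"}, {"id": 1, "item": "pen"}]

inner = join(users, orders, "id")
left_outer = join(users, orders, "id", how="left")
```

The inner join drops the user without orders, while the left-outer join keeps that record without any order fields.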

Sort

Use the SORT operator to sort an input by one or more fields.
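Sorting by more than one field can be sketched in Python with a tuple-valued key. This is an analogue of the described behaviour, not Jaql syntax; the records reuse the employee data from the Filter example, and the choice of fields and directions is arbitrary.

```python
employees = [
    {"name": "Jon Doe", "income": 20000, "manager": False},
    {"name": "Vince Wayne", "income": 32500, "manager": False},
    {"name": "Jane Dean", "income": 72000, "manager": True},
    {"name": "Alex Smith", "income": 25000, "manager": False},
]

# Sort by manager flag ascending (False before True), then by income descending.
by_role_then_income = sorted(
    employees,
    key=lambda r: (r["manager"], -r["income"]),
)
sorted_names = [r["name"] for r in by_role_then_income]
```

Negating the numeric field is a simple way to mix ascending and descending order within one key.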

Top

The TOP expression selects the first k elements of its input. If a comparator is provided, the output is semantically equivalent to sorting the input, then selecting the first k elements.
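The stated equivalence between a top-k selection with a comparator and a sort followed by taking the first k elements can be checked with a small Python sketch. This is an analogue rather than Jaql code; `top` and its `key` parameter are hypothetical, with a key function standing in for the comparator.

```python
import heapq
from itertools import islice

def top(items, k, key=None):
    """Return the first k items, or the k smallest under `key` if one is given."""
    if key is None:
        return list(islice(items, k))         # no ordering: take the first k as-is
    return heapq.nsmallest(k, items, key=key)  # same result as sorted(...)[:k]

incomes = [20000, 32500, 72000, 25000]

first_two = top(incomes, 2)
highest_two = top(incomes, 2, key=lambda x: -x)
sorted_then_cut = sorted(incomes, key=lambda x: -x)[:2]
```

A heap-based selection avoids sorting the whole input, which matters when k is much smaller than the input size.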

Transform

Use the TRANSFORM operator to realize a projection or to apply a function to all items of an input.
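Both uses, projection and per-item function application, can be sketched in Python as a map over the array. This is an analogue, not Jaql syntax; `transform` is a hypothetical helper, the records reuse part of the employee data from the Filter example, and the doubling of incomes is an arbitrary example function.

```python
def transform(items, fn):
    """Apply fn to every item and collect the results in order."""
    return [fn(item) for item in items]

employees = [
    {"name": "Jon Doe", "income": 20000, "manager": False},
    {"name": "Jane Dean", "income": 72000, "manager": True},
]

# Projection: keep only the name field of each record.
names = transform(employees, lambda r: {"name": r["name"]})

# Function application: derive a new value for a field of each record.
doubled = transform(employees, lambda r: {**r, "income": r["income"] * 2})
```

The output always has the same number of items as the input, which distinguishes this operation from filtering or expanding.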

References

  1. Original Jaql project
  2. Initial Publication
  3. Comparing High Level MapReduce Query Languages, 9th International Symposium of Advanced Parallel Processing Technologies 2011, Shanghai, China
  4. JAQL in Hadoop, a brief introduction
  5. IBM BigInsights Documentation