Paradigm | Functional |
---|---|
Designed by | Vuk Ercegovac (Google) |
First appeared | October 9, 2008 |
Stable release | 0.5.1 / July 12, 2010 |
Implementation language | Java |
OS | Cross-platform |
License | Apache License 2.0 |
Website | code |
Major implementations | |
IBM BigInsights |
Jaql (pronounced "jackal") is a functional data processing and query language most commonly used for JSON query processing on big data.
It started as an open source project at Google [1] but the latest release was on 2010-07-12. IBM [2] took it over as primary data processing language for their Hadoop software package BigInsights.
Although having been developed for JSON it supports a variety of other data sources like CSV, TSV, XML.
A comparison [3] to other BigData query languages like PIG Latin and Hive QL illustrates performance and usability aspects of these technologies.
Jaql supports [4] lazy evaluation, so expressions are only materialized when needed.
The basic concept of Jaql is
source->operator(parameter)->sink;
where a sink can be a source for a downstream operator. So typically a Jaql program has to following structure, expressing a data processing graph:
source->operator1(parameter)->operator2(parameter)->operator2(parameter)->operator3(parameter)->operator4(parameter)->sink;
Most commonly for readability reasons Jaql programs are linebreaked after the arrow, as is also a common idiom in Twitter Scalding:
source->operator1(parameter)->operator2(parameter)->operator2(parameter)->operator3(parameter)->operator4(parameter)->sink;
Source: [5]
Use the EXPAND expression to flatten nested arrays. This expression takes as input an array of nested arrays [ [ T ] ] and produces an output array [ T ], by promoting the elements of each nested array to the top-level output array.
Use the FILTER operator to filter away elements from the specified input array. This operator takes as input an array of elements of type T and outputs an array of the same type, retaining those elements for which a predicate evaluates to true. It is the Jaql equivalent of the SQL WHERE clause. Example:
data=[{name:"Jon Doe",income:20000,manager:false},{name:"Vince Wayne",income:32500,manager:false},{name:"Jane Dean",income:72000,manager:true},{name:"Alex Smith",income:25000,manager:false}];data->filter$.manager;[{"income":72000,"manager":true,"name":"Jane Dean"}]data->filter$.income<30000;[{"income":20000,"manager":false,"name":"Jon Doe"},{"income":25000,"manager":false,"name":"Alex Smith"}]
Use the GROUP expression to group one or more input arrays on a grouping key and applies an aggregate function per group.
Use the JOIN operator to express a join between two or more input arrays. This operator supports multiple types of joins, including natural, left-outer, right-outer, and outer joins.
Use the SORT operator to sort an input by one or more fields.
The TOP expression selects the first k elements of its input. If a comparator is provided, the output is semantically equivalent to sorting the input, then selecting the first k elements.
Use the TRANSFORM operator to realize a projection or to apply a function to all items of an output.
AWK is a domain-specific language designed for text processing and typically used as a data extraction and reporting tool. Like sed and grep, it is a filter, and it is a standard feature of most Unix-like operating systems.
XSLT is a language originally designed for transforming XML documents into other XML documents, or other formats such as HTML for web pages, plain text, or XSL Formatting Objects. These formats can be subsequently converted to formats such as PDF, PostScript, and PNG. Support for JSON and plain-text transformation was added in later updates to the XSLT 1.0 specification.
Java Platform, Standard Edition is a computing platform for development and deployment of portable code for desktop and server environments. Java SE was formerly known as Java 2 Platform, Standard Edition (J2SE).
A list comprehension is a syntactic construct available in some programming languages for creating a list based on existing lists. It follows the form of the mathematical set-builder notation as distinct from the use of map and filter functions.
The SQL SELECT statement returns a result set of rows, from one or more tables.
JSON is an open standard file format and data interchange format that uses human-readable text to store and transmit data objects consisting of name–value pairs and arrays. It is a commonly used data format with diverse uses in electronic data interchange, including that of web applications with servers.
The syntax of JavaScript is the set of rules that define a correctly structured JavaScript program.
Tacit programming, also called point-free style, is a programming paradigm in which function definitions do not identify the arguments on which they operate. Instead the definitions merely compose other functions, among which are combinators that manipulate the arguments. Tacit programming is of theoretical interest, because the strict use of composition results in programs that are well adapted for equational reasoning. It is also the natural style of certain programming languages, including APL and its derivatives, and concatenative languages such as Forth. The lack of argument naming gives point-free style a reputation of being unnecessarily obscure, hence the epithet "pointless style".
Language Integrated Query is a Microsoft .NET Framework component that adds native data querying capabilities to .NET languages, originally released as a major part of .NET Framework 3.5 in 2007.
XPath is an expression language designed to support the query or transformation of XML documents. It was defined by the World Wide Web Consortium (W3C) in 1999, and can be used to compute values from the content of an XML document. Support for XPath exists in applications that support XML, such as web browsers, and many programming languages.
XQuery is a query and functional programming language that queries and transforms collections of structured and unstructured data, usually in the form of XML, text and with vendor-specific extensions for other data formats. The language is developed by the XML Query working group of the W3C. The work is closely coordinated with the development of XSLT by the XSL Working Group; the two groups share responsibility for XPath, which is a subset of XQuery.
Ateji PX is an object-oriented programming language extension for Java. It is intended to facilliate parallel computing on multi-core processors, GPU, Grid and Cloud.
Apache Pig is a high-level platform for creating programs that run on Apache Hadoop. The language for this platform is called Pig Latin. Pig can execute its Hadoop jobs in MapReduce, Apache Tez, or Apache Spark. Pig Latin abstracts the programming from the Java MapReduce idiom into a notation which makes MapReduce programming high level, similar to that of SQL for relational database management systems. Pig Latin can be extended using user-defined functions (UDFs) which the user can write in Java, Python, JavaScript, Ruby or Groovy and then call directly from the language.
Data-intensive computing is a class of parallel computing applications which use a data parallel approach to process large volumes of data typically terabytes or petabytes in size and typically referred to as big data. Computing applications that devote most of their execution time to computational requirements are deemed compute-intensive, whereas applications are deemed data-intensive if they require large volumes of data and devote most of their processing time to input/output and manipulation of data.
An array database management system or array DBMS provides database services specifically for arrays, that is: homogeneous collections of data items, sitting on a regular grid of one, two, or more dimensions. Often arrays are used to represent sensor, simulation, image, or statistics data. Such arrays tend to be Big Data, with single objects frequently ranging into Terabyte and soon Petabyte sizes; for example, today's earth and space observation archives typically grow by Terabytes a day. Array databases aim at offering flexible, scalable storage and retrieval on this information category.
UCBLogo, also termed Berkeley Logo, is a programming language, a dialect of Logo, which derived from Lisp. It is a dialect of Logo intended to be a "minimum Logo standard".
Tritium is a simple scripting language for efficiently transforming structured data like HTML, XML, and JSON. It is similar in purpose to XSLT but has a syntax influenced by jQuery, Sass, and CSS versus XSLT's XML based syntax.
JSONiq is a query and functional programming language that is designed to declaratively query and transform collections of hierarchical and heterogeneous data in format of JSON, XML, as well as unstructured, textual data.
Data lineage refers to the process of tracking how data is generated, transformed, transmitted and used across a system over time. It documents data's origins, transformations and movements, providing detailed visibility into its life cycle. This process simplifies the identification of errors in data analytics workflows, by enabling users to trace issues back to their root causes.
jq is a very high-level lexically scoped functional programming language in which every JSON value is a constant. jq supports backtracking and managing indefinitely long streams of JSON data. It is related to the Icon and Haskell programming languages. The language supports a namespace-based module system and has some support for closures. In particular, functions and functional expressions can be used as parameters of other functions.