Paradigm | Functional |
---|---|
Designed by | Vuk Ercegovac (Google) |
First appeared | October 9, 2008 |
Stable release | 0.5.1 / July 12, 2010 |
Implementation language | Java |
OS | Cross-platform |
License | Apache License 2.0 |
Website | code |
Major implementations | |
IBM BigInsights |
Jaql (pronounced "jackal") is a functional data processing and query language most commonly used for JSON query processing on big data.
It started as an open source project at Google [1] but the latest release was on 2010-07-12. IBM [2] took it over as primary data processing language for their Hadoop software package BigInsights.
Although having been developed for JSON it supports a variety of other data sources like CSV, TSV, XML.
A comparison [3] to other BigData query languages like PIG Latin and Hive QL illustrates performance and usability aspects of these technologies.
Jaql supports [4] lazy evaluation, so expressions are only materialized when needed.
The basic concept of Jaql is
source->operator(parameter)->sink;
where a sink can be a source for a downstream operator. So typically a Jaql program has to following structure, expressing a data processing graph:
source->operator1(parameter)->operator2(parameter)->operator2(parameter)->operator3(parameter)->operator4(parameter)->sink;
Most commonly for readability reasons Jaql programs are linebreaked after the arrow, as is also a common idiom in Twitter Scalding:
source->operator1(parameter)->operator2(parameter)->operator2(parameter)->operator3(parameter)->operator4(parameter)->sink;
Source: [5]
Use the EXPAND expression to flatten nested arrays. This expression takes as input an array of nested arrays [ [ T ] ] and produces an output array [ T ], by promoting the elements of each nested array to the top-level output array.
Use the FILTER operator to filter away elements from the specified input array. This operator takes as input an array of elements of type T and outputs an array of the same type, retaining those elements for which a predicate evaluates to true. It is the Jaql equivalent of the SQL WHERE clause. Example:
data=[{name:"Jon Doe",income:20000,manager:false},{name:"Vince Wayne",income:32500,manager:false},{name:"Jane Dean",income:72000,manager:true},{name:"Alex Smith",income:25000,manager:false}];data->filter$.manager;[{"income":72000,"manager":true,"name":"Jane Dean"}]data->filter$.income<30000;[{"income":20000,"manager":false,"name":"Jon Doe"},{"income":25000,"manager":false,"name":"Alex Smith"}]
Use the GROUP expression to group one or more input arrays on a grouping key and applies an aggregate function per group.
Use the JOIN operator to express a join between two or more input arrays. This operator supports multiple types of joins, including natural, left-outer, right-outer, and outer joins.
Use the SORT operator to sort an input by one or more fields.
The TOP expression selects the first k elements of its input. If a comparator is provided, the output is semantically equivalent to sorting the input, then selecting the first k elements.
Use the TRANSFORM operator to realize a projection or to apply a function to all items of an output.
AWK is a domain-specific language designed for text processing and typically used as a data extraction and reporting tool. Like sed and grep, it is a filter, and is a standard feature of most Unix-like operating systems.
JavaScript, often abbreviated as JS, is a programming language and core technology of the Web, alongside HTML and CSS. 99% of websites use JavaScript on the client side for webpage behavior.
XSLT is a language originally designed for transforming XML documents into other XML documents, or other formats such as HTML for web pages, plain text or XSL Formatting Objects, which may subsequently be converted to other formats, such as PDF, PostScript and PNG. Support for JSON and plain-text transformation was added in later updates to the XSLT 1.0 specification.
Java Platform, Standard Edition is a computing platform for development and deployment of portable code for desktop and server environments. Java SE was formerly known as Java 2 Platform, Standard Edition (J2SE).
A list comprehension is a syntactic construct available in some programming languages for creating a list based on existing lists. It follows the form of the mathematical set-builder notation as distinct from the use of map and filter functions.
The SQL SELECT statement returns a result set of rows, from one or more tables.
The syntax of Java is the set of rules defining how a Java program is written and interpreted.
The syntax of JavaScript is the set of rules that define a correctly structured JavaScript program.
Tacit programming, also called point-free style, is a programming paradigm in which function definitions do not identify the arguments on which they operate. Instead the definitions merely compose other functions, among which are combinators that manipulate the arguments. Tacit programming is of theoretical interest, because the strict use of composition results in programs that are well adapted for equational reasoning. It is also the natural style of certain programming languages, including APL and its derivatives, and concatenative languages such as Forth. The lack of argument naming gives point-free style a reputation of being unnecessarily obscure, hence the epithet "pointless style".
Language Integrated Query is a Microsoft .NET Framework component that adds native data querying capabilities to .NET languages, originally released as a major part of .NET Framework 3.5 in 2007.
XPath is an expression language designed to support the query or transformation of XML documents. It was defined by the World Wide Web Consortium (W3C) in 1999, and can be used to compute values from the content of an XML document. Support for XPath exists in applications that support XML, such as web browsers, and many programming languages.
EMML, or Enterprise Mashup Markup Language, is an XML markup language for creating enterprise mashups, which are software applications that consume and mash data from variety of sources. These applications often perform logical or mathematical operations as well as present the data.
Ateji PX is an object-oriented programming language extension for Java. It is intended to facilliate parallel computing on multi-core processors, GPU, Grid and Cloud.
Data-intensive computing is a class of parallel computing applications which use a data parallel approach to process large volumes of data typically terabytes or petabytes in size and typically referred to as big data. Computing applications that devote most of their execution time to computational requirements are deemed compute-intensive, whereas applications are deemed data-intensive require large volumes of data and devote most of their processing time to I/O and manipulation of data.
An array database management system or array DBMS provides database services specifically for arrays, that is: homogeneous collections of data items, sitting on a regular grid of one, two, or more dimensions. Often arrays are used to represent sensor, simulation, image, or statistics data. Such arrays tend to be Big Data, with single objects frequently ranging into Terabyte and soon Petabyte sizes; for example, today's earth and space observation archives typically grow by Terabytes a day. Array databases aim at offering flexible, scalable storage and retrieval on this information category.
UCBLogo, also termed Berkeley Logo, is a programming language, a dialect of Logo, which derived from Lisp. It is a dialect of Logo intended to be a "minimum Logo standard".
Tritium is a simple scripting language for efficiently transforming structured data like HTML, XML, and JSON. It is similar in purpose to XSLT but has a syntax influenced by jQuery, Sass, and CSS versus XSLT's XML based syntax.
JSONiq is a query and functional programming language that is designed to declaratively query and transform collections of hierarchical and heterogeneous data in format of JSON, XML, as well as unstructured, textual data.
Data lineage includes the data origin, what happens to it, and where it moves over time. Data lineage provides visibility and simplifies tracing errors back to the root cause in a data analytics process.
jq is a very high-level lexically scoped functional programming language in which every JSON value is a constant. jq supports backtracking and managing indefinitely long streams of JSON data. It is related to the Icon and Haskell programming languages. The language supports a namespace-based module system and has some support for closures. In particular, functions and functional expressions can be used as parameters of other functions.