Cuneiform (programming language)

Cuneiform
Paradigm functional, scientific workflow
Designed by Jörgen Brandt
First appeared 2013
Stable release 3.0.4 / November 19, 2018
Typing discipline static, simple types
Implementation language Erlang
OS Linux, Mac OS
License Apache License 2.0
Filename extensions .cfl
Website cuneiform-lang.org
Influenced by Swift (parallel scripting language)

Cuneiform is an open-source workflow language for large-scale scientific data analysis. [1] [2] It is a statically typed functional programming language that promotes parallel computing. It features a versatile foreign function interface that allows users to integrate software from many external programming languages. At the organizational level, Cuneiform provides facilities such as conditional branching and general recursion, which make it Turing-complete. In this, Cuneiform attempts to close the gap between scientific workflow systems like Taverna, KNIME, or Galaxy and large-scale data analysis programming models like MapReduce or Pig Latin, while offering the generality of a functional programming language.

Cuneiform is implemented in distributed Erlang. If run in distributed mode it drives a POSIX-compliant distributed file system like Gluster or Ceph (or a FUSE integration of some other file system, e.g., HDFS). Alternatively, Cuneiform scripts can be executed on top of HTCondor or Hadoop. [3] [4] [5] [6]

Cuneiform is influenced by the work of Peter Kelly who proposes functional programming as a model for scientific workflow execution. [7] [8] In this, Cuneiform is distinct from related workflow languages based on dataflow programming like Swift. [9]

External software integration

External tools and libraries (e.g., R or Python libraries) are integrated via a foreign function interface. In this, Cuneiform resembles, e.g., KNIME, which allows the use of external software through snippet nodes, or Taverna, which offers BeanShell services for integrating Java software. By defining a task in a foreign language, it is possible to use the API of an external tool or library. This way, tools can be integrated directly without the need to write a wrapper or reimplement the tool. [10]

As of version 3.0.x, the supported foreign programming languages are Bash, Erlang, Java, MATLAB, GNU Octave, Perl, Python, R, and Racket (see Release history).

Support for AWK and gnuplot is planned.

Type system

Cuneiform provides a simple, statically checked type system. [11] While Cuneiform provides lists as compound data types, it omits the traditional list accessors (head and tail) to avoid the runtime errors that might arise when accessing the empty list. Instead, lists are accessed in an all-or-nothing fashion, only by mapping or folding over them. Additionally, Cuneiform omits arithmetic at the organizational level, which excludes the possibility of division by zero. Omitting every partially defined operation makes it possible to guarantee that runtime errors can arise exclusively in foreign code.
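The reasoning above can be illustrated outside Cuneiform. The following is a minimal Python sketch, not Cuneiform code: the helper names head, map_all, and fold_all are hypothetical and chosen only for this illustration. It shows that a head accessor is undefined for the empty list, whereas mapping and folding are defined for every list, including the empty one.

```python
from functools import reduce
from typing import Callable, List, TypeVar

A = TypeVar("A")
B = TypeVar("B")


def head(xs: List[A]) -> A:
    """Partial accessor: raises IndexError when xs is empty."""
    return xs[0]


def map_all(f: Callable[[A], B], xs: List[A]) -> List[B]:
    """Total operation: the empty list maps to the empty list."""
    return [f(x) for x in xs]


def fold_all(f: Callable[[B, A], B], acc0: B, xs: List[A]) -> B:
    """Total operation: folding the empty list returns the initial accumulator."""
    return reduce(f, xs, acc0)


empty: List[str] = []

mapped = map_all(str.upper, empty)                    # defined: []
folded = fold_all(lambda acc, x: acc + x, "", empty)  # defined: ""

# The partial accessor is the only operation here that can fail at runtime.
try:
    head(empty)
    head_failed = False
except IndexError:
    head_failed = True
```

A language that offers only the two total operations has no way to express the failing case at all.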

Base data types

As base data types Cuneiform provides Booleans, strings, and files. Herein, files are used to exchange data in arbitrary format between foreign functions.

Records and pattern matching

Cuneiform provides records (structs) as compound data types. The example below shows the definition of a variable r being a record with two fields a1 and a2, the first being a string and the second being a Boolean.

    let r : <a1 : Str, a2 : Bool> =
      <a1 = "my string", a2 = true>;

Records can be accessed either via projection or via pattern matching. The example below extracts the two fields a1 and a2 from the record r.

    let a1 : Str = ( r|a1 );

    let <a2 = a2 : Bool> = r;

Lists and list processing

Furthermore, Cuneiform provides lists as compound data types. The example below shows the definition of a variable xs being a file list with three elements.

    let xs : [File] =
      ['a.txt', 'b.txt', 'c.txt' : File];

Lists can be processed with the for and fold operators. The for operator can be given multiple lists, which it consumes element-wise (similar to for/list in Racket, mapcar in Common Lisp, or zipwith in Erlang).

The example below shows how to map over a single list, the result being a file list.

    for x <- xs do
      process-one( arg1 = x )
      : File
    end;

The example below shows how to zip two lists, the result also being a file list.

    for x <- xs, y <- ys do
      process-two( arg1 = x, arg2 = y )
      : File
    end;

Finally, lists can be aggregated by using the fold operator. The following example sums up the elements of a list.

    fold acc = 0, x <- xs do
      add( a = acc, b = x )
    end;
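For readers more familiar with mainstream languages, the for and fold operators correspond roughly to map, zip-with, and reduce. The Python sketch below is an analogy only. The helper names for_one, for_two, and fold, as well as the lambda bodies standing in for process-one, process-two, and add, are hypothetical. Python's zip stops at the shorter list; how Cuneiform treats lists of unequal length is not covered here.

```python
from functools import reduce
from typing import Callable, List, TypeVar

A = TypeVar("A")
B = TypeVar("B")
C = TypeVar("C")


def for_one(xs: List[A], body: Callable[[A], B]) -> List[B]:
    """Analogue of `for x <- xs do ... end`: map the body over one list."""
    return [body(x) for x in xs]


def for_two(xs: List[A], ys: List[B], body: Callable[[A, B], C]) -> List[C]:
    """Analogue of `for x <- xs, y <- ys do ... end`: consume both lists element-wise."""
    return [body(x, y) for x, y in zip(xs, ys)]


def fold(acc0: B, xs: List[A], body: Callable[[B, A], B]) -> B:
    """Analogue of `fold acc = acc0, x <- xs do ... end`: thread an accumulator."""
    return reduce(body, xs, acc0)


xs = ["a.txt", "b.txt", "c.txt"]
ys = ["1.txt", "2.txt", "3.txt"]

# Hypothetical stand-ins for process-one and process-two.
mapped = for_one(xs, lambda x: x + ".out")
zipped = for_two(xs, ys, lambda x, y: x + "+" + y)

# Hypothetical stand-in for add, summing a list of numbers.
total = fold(0, [1, 2, 3], lambda acc, x: acc + x)
```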

Parallel execution

Cuneiform is a purely functional language, i.e., it does not support mutable references. As a consequence, it can use subterm independence to divide a program into parallelizable portions. The Cuneiform scheduler distributes these portions to worker nodes. In addition, Cuneiform uses a call-by-name evaluation strategy to compute values only if they contribute to the computation result. Finally, foreign function applications are memoized to speed up computations that contain previously derived results.
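The memoization of foreign function applications can be pictured as a cache keyed on the function name and its argument bindings. The Python sketch below illustrates the idea only and makes no claim about how the Cuneiform runtime stores or looks up results. The class MemoizingRunner, its apply method, and the greet function are hypothetical names introduced for this example.

```python
from typing import Callable, Dict, Tuple

# Cache key: function name plus its sorted (argument name, value) pairs.
Key = Tuple[str, Tuple[Tuple[str, str], ...]]


class MemoizingRunner:
    """Runs named functions and reuses results for repeated applications."""

    def __init__(self) -> None:
        self._cache: Dict[Key, str] = {}
        self.executions = 0  # counts how often a function body actually ran

    def apply(self, name: str, fn: Callable[..., str], **args: str) -> str:
        key: Key = (name, tuple(sorted(args.items())))
        if key not in self._cache:
            self.executions += 1
            self._cache[key] = fn(**args)
        return self._cache[key]


def greet(person: str) -> str:
    # Hypothetical stand-in for an expensive foreign function.
    return "Hello " + person


runner = MemoizingRunner()
first = runner.apply("greet", greet, person="world")
second = runner.apply("greet", greet, person="world")  # served from the cache
third = runner.apply("greet", greet, person="moon")    # new arguments, runs again
```

Because the language is purely functional, an application with the same name and arguments always denotes the same result, which is what makes such a cache safe.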

For example, the following Cuneiform program allows the applications of f and g to run in parallel while h is dependent and can be started only when both f and g are finished.

    let output-of-f : File = f();
    let output-of-g : File = g();

    h( f = output-of-f, g = output-of-g );
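The dependency structure of this program can be mimicked with futures. The Python sketch below is an analogy under stated assumptions: the bodies of f, g, and h are placeholders, and a local thread pool stands in for the worker nodes of the Cuneiform scheduler. The applications of f and g are submitted together, and h starts only after both results are available.

```python
from concurrent.futures import ThreadPoolExecutor


def f() -> str:
    # Placeholder body; independent of g.
    return "output-of-f"


def g() -> str:
    # Placeholder body; independent of f.
    return "output-of-g"


def h(f_out: str, g_out: str) -> str:
    # Depends on the results of both f and g.
    return "h(" + f_out + ", " + g_out + ")"


def run_workflow() -> str:
    with ThreadPoolExecutor(max_workers=2) as pool:
        future_f = pool.submit(f)  # may run concurrently with g
        future_g = pool.submit(g)
        # result() blocks until the respective application has finished.
        return h(future_f.result(), future_g.result())


result = run_workflow()
```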

The following Cuneiform program creates three parallel applications of the function f by mapping f over a three-element list:

    let xs : [File] =
      ['a.txt', 'b.txt', 'c.txt' : File];

    for x <- xs do
      f( x = x )
      : File
    end;

Similarly, the applications of f and g are independent in the construction of the record r and can, thus, be run in parallel:

    let r : <a : File, b : File> =
      <a = f(), b = g()>;

Examples

A hello-world script:

    def greet( person : Str ) -> <out : Str>
    in Bash *{
      out="Hello $person"
    }*

    ( greet( person = "world" )|out );

This script defines a task greet in Bash which prepends "Hello " to its string argument person. The function produces a record with a single string field out. Applying greet with the argument person bound to the string "world" produces the record <out = "Hello world">. Projecting this record to its field out evaluates to the string "Hello world".
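One way to picture such a task application is to bind the arguments as shell variables, run the task body, and read back the output variable. The Python sketch below does this with a subprocess. It is an illustration, not the mechanism Cuneiform actually uses: the function run_shell_task and its parameters are hypothetical, arguments are passed as environment variables, and out_field is assumed to be a valid shell variable name.

```python
import os
import shutil
import subprocess
from typing import Dict


def run_shell_task(body: str, args: Dict[str, str], out_field: str) -> Dict[str, str]:
    """Run a shell task body and return a one-field record with its output."""
    # Prefer Bash; fall back to a POSIX shell if Bash is not installed.
    shell = shutil.which("bash") or "sh"
    # After the body has run, print the value of the output variable.
    script = body + '\nprintf "%s" "$' + out_field + '"\n'
    completed = subprocess.run(
        [shell, "-c", script],
        env={**os.environ, **args},  # arguments become shell variables
        capture_output=True,
        text=True,
        check=True,
    )
    return {out_field: completed.stdout}


# Analogue of greet( person = "world" ), followed by the projection to out.
record = run_shell_task('out="Hello $person"', {"person": "world"}, "out")
greeting = record["out"]
```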

Command line tools can be integrated by defining a task in Bash:

    def samtoolsSort( bam : File ) -> <sorted : File>
    in Bash *{
      sorted=sorted.bam
      samtools sort -m 2G $bam -o $sorted
    }*

In this example a task samtoolsSort is defined. It calls the tool SAMtools, consuming an input file, in BAM format, and producing a sorted output file, also in BAM format.

Release history

| Version | Appearance | Implementation language | Distribution platform | Foreign languages |
|---|---|---|---|---|
| 1.0.0 | May 2014 | Java | Apache Hadoop | Bash, Common Lisp, GNU Octave, Perl, Python, R, Scala |
| 2.0.x | Mar. 2015 | Java | HTCondor, Apache Hadoop | Bash, BeanShell, Common Lisp, MATLAB, GNU Octave, Perl, Python, R, Scala |
| 2.2.x | Apr. 2016 | Erlang | HTCondor, Apache Hadoop | Bash, Perl, Python, R |
| 3.0.x | Feb. 2018 | Erlang | Distributed Erlang | Bash, Erlang, Java, MATLAB, GNU Octave, Perl, Python, R, Racket |

In April 2016, Cuneiform's implementation language switched from Java to Erlang and, in February 2018, its major distributed execution platform changed from Hadoop to distributed Erlang. Additionally, from 2015 to 2018, HTCondor was maintained as an alternative execution platform.

Cuneiform's surface syntax was revised twice, as reflected in the major version number.

Version 1

In its first draft, published in May 2014, Cuneiform was closely related to Make in that it constructed a static data dependency graph which the interpreter traversed during execution. The major difference from later versions was the lack of conditionals, recursion, and static type checking. Files were distinguished from strings by prefixing single-quoted string values with a tilde ~. The script's query expression was introduced with the target keyword. Bash was the default foreign language. Function application had to be performed using an apply form that took task as its first keyword argument. One year later, this surface syntax was replaced by a streamlined but similar version.

The following example script downloads a reference genome from an FTP server.

    declare download-ref-genome;

    deftask download-fa( fa : ~path ~id ) *{
        wget $path/$id.fa.gz
        gunzip $id.fa.gz
        mv $id.fa $fa
    }*

    ref-genome-path = ~'ftp://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes';
    ref-genome-id = ~'chr22';

    ref-genome = apply(
        task : download-fa
        path : ref-genome-path
        id : ref-genome-id
    );

    target ref-genome;

Version 2

Figure: Swing-based editor and REPL for Cuneiform 2.0.3

The second draft of the Cuneiform surface syntax, first published in March 2015, remained in use for three years, outlasting the transition from Java to Erlang as Cuneiform's implementation language. Evaluation differs from the earlier approach in that the interpreter reduces a query expression instead of traversing a static graph. While this surface syntax was in use, the interpreter was formalized and simplified, which resulted in a first specification of Cuneiform's semantics. The syntax featured conditionals. However, Booleans were encoded as lists, recycling the empty list as Boolean false and the non-empty list as Boolean true. Recursion was added later as a byproduct of formalization. Static type checking, however, was introduced only in Version 3.

The following script decompresses a zipped file and splits it into evenly sized partitions.

    deftask unzip( <out( File )> : zip( File ) ) in bash *{
      unzip -d dir $zip
      out=`ls dir | awk '{print "dir/" $0}'`
    }*

    deftask split( <out( File )> : file( File ) ) in bash *{
      split -l 1024 $file txt
      out=txt*
    }*

    sotu = "sotu/stateoftheunion1790-2014.txt.zip";
    fileLst = split( file: unzip( zip: sotu ) );

    fileLst;


Version 3

The current version of Cuneiform's surface syntax, in comparison to earlier drafts, is an attempt to close the gap to mainstream functional programming languages. It features a simple, statically checked type system and introduces records in addition to lists as a second type of compound data structure. Booleans are a separate base data type.

The following script untars a file resulting in a file list.

    def untar( tar : File ) -> <fileLst : [File]>
    in Bash *{
      tar xf $tar
      fileLst=`tar tf $tar`
    }*

    let hg38Tar : File =
      'hg38/hg38.tar';

    let <fileLst = faLst : [File]> =
      untar( tar = hg38Tar );

    faLst;


References

  1. "Joergen7/Cuneiform". GitHub. 14 October 2021.
  2. Brandt, Jörgen; Bux, Marc N.; Leser, Ulf (2015). "Cuneiform: A functional language for large scale scientific data analysis" (PDF). Proceedings of the Workshops of the EDBT/ICDT. 1330: 17–26.
  3. "Scalable Multi-Language Data Analysis on Beam: The Cuneiform Experience by Jörgen Brandt". Erlang Central. Archived from the original on 2 October 2016. Retrieved 28 October 2016.
  4. Bux, Marc; Brandt, Jörgen; Lipka, Carsten; Hakimzadeh, Kamal; Dowling, Jim; Leser, Ulf (2015). "SAASFEE: scalable scientific workflow execution engine" (PDF). Proceedings of the VLDB Endowment. 8 (12): 1892–1895. doi:10.14778/2824032.2824094.
  5. Bessani, Alysson; Brandt, Jörgen; Bux, Marc; Cogo, Vinicius; Dimitrova, Lora; Dowling, Jim; Gholami, Ali; Hakimzadeh, Kamal; Hummel, Michael; Ismail, Mahmoud; Laure, Erwin; Leser, Ulf; Litton, Jan-Eric; Martinez, Roxanna; Niazi, Salman; Reichel, Jane; Zimmermann, Karin (2015). "Biobankcloud: a platform for the secure storage, sharing, and processing of large biomedical data sets" (PDF). The First International Workshop on Data Management and Analytics for Medicine and Healthcare (DMAH 2015).
  6. "Scalable Multi-Language Data Analysis on Beam: The Cuneiform Experience". Erlang-factory.com. Retrieved 28 October 2016.
  7. Kelly, Peter M.; Coddington, Paul D.; Wendelborn, Andrew L. (2009). "Lambda calculus as a workflow model". Concurrency and Computation: Practice and Experience. 21 (16): 1999–2017. doi:10.1002/cpe.1448. S2CID 10833434.
  8. Barseghian, Derik; Altintas, Ilkay; Jones, Matthew B.; Crawl, Daniel; Potter, Nathan; Gallagher, James; Cornillon, Peter; Schildhauer, Mark; Borer, Elizabeth T.; Seabloom, Eric W. (2010). "Workflows and extensions to the Kepler scientific workflow system to support environmental sensor data access and analysis" (PDF). Ecological Informatics. 5 (1): 42–50. doi:10.1016/j.ecoinf.2009.08.008. S2CID 16392118.
  9. Di Tommaso, Paolo; Chatzou, Maria; Floden, Evan W.; Barja, Pablo Prieto; Palumbo, Emilio; Notredame, Cedric (2017). "Nextflow enables reproducible computational workflows". Nature Biotechnology. 35 (4): 316–319. doi:10.1038/nbt.3820. PMID 28398311. S2CID 9690740.
  10. "A Functional Workflow Language Implementation in Erlang" (PDF). Retrieved 28 October 2016.
  11. Brandt, Jörgen; Reisig, Wolfgang; Leser, Ulf (2017). "Computation semantics of the functional scientific workflow language Cuneiform". Journal of Functional Programming. 27. doi:10.1017/S0956796817000119. S2CID 6128299.