Dataflow programming


In computer programming, dataflow programming is a programming paradigm that models a program as a directed graph of the data flowing between operations, thus implementing dataflow principles and architecture.[1] Dataflow programming languages share some features of functional languages, and were generally developed in order to bring some functional concepts to a language more suitable for numeric processing. Some authors use the term datastream instead of dataflow to avoid confusion with dataflow computing or dataflow architecture, which is based on an indeterministic machine paradigm. Dataflow programming was pioneered by Jack Dennis and his graduate students at MIT in the 1960s.


Considerations

Traditionally, a program is modelled as a series of operations happening in a specific order; this may be referred to as sequential,[2]:p.3 procedural,[3] control flow[3] (indicating that the program chooses a specific path), or imperative programming. The program focuses on commands, in line with the von Neumann[2]:p.3 vision of sequential programming, where data is normally "at rest".[3]:p.7

In contrast, dataflow programming emphasizes the movement of data and models programs as a series of connections. Explicitly defined inputs and outputs connect operations, which function like black boxes.[3]:p.2 An operation runs as soon as all of its inputs become valid.[4] Thus, dataflow languages are inherently parallel and can work well in large, decentralized systems.[2]:p.3[5][6]
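As an illustration, consider the expression (a + b) * (a - b). A dataflow program describes it not as a sequence of steps but as a graph of operations and their connections. The following minimal Python sketch, with invented node names and purely for illustration, records only which values each operation consumes and how it produces its output:

# A small dataflow graph for (a + b) * (a - b).
# Each node is a black box: named inputs in, a single named output.
graph = {
    "sum":  {"inputs": ("a", "b"),      "op": lambda x, y: x + y},
    "diff": {"inputs": ("a", "b"),      "op": lambda x, y: x - y},
    "prod": {"inputs": ("sum", "diff"), "op": lambda x, y: x * y},
}

def ready(graph, name, values):
    # A node may fire as soon as every one of its inputs holds a value.
    return name not in values and all(i in values for i in graph[name]["inputs"])

Nothing in the graph fixes an order between sum and diff: either may run first, or both may run at once, which is why such descriptions parallelize naturally.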

State

One of the key concepts in computer programming is the idea of state, essentially a snapshot of various conditions in the system. Most programming languages require a considerable amount of state information, which is generally hidden from the programmer. Often, the computer itself has no idea which piece of information encodes the enduring state. This is a serious problem, as the state information needs to be shared across multiple processors in parallel processing machines. Most languages force the programmer to add extra code to indicate which data and parts of the code are important to the state. This code tends to be both expensive in terms of performance and difficult to read or debug. Explicit parallelism is one of the main reasons for the poor performance of Enterprise Java Beans when building data-intensive, non-OLTP applications.[citation needed]

Where a sequential program can be imagined as a single worker moving between tasks (operations), a dataflow program is more like a series of workers on an assembly line, each doing a specific task whenever materials are available. Since the operations are only concerned with the availability of data inputs, they have no hidden state to track, and are all "ready" at the same time.

Representation

Dataflow programs are represented in different ways. A traditional program is usually represented as a series of text instructions, which is reasonable for describing a serial system which pipes data between small, single-purpose tools that receive, process, and return. Dataflow programs start with an input, perhaps the command line parameters, and illustrate how that data is used and modified. The flow of data is explicit, often visually illustrated as a line or pipe.

In terms of encoding, a dataflow program might be implemented as a hash table, with uniquely identified inputs as the keys, used to look up pointers to the instructions. When any operation completes, the program scans down the list of operations until it finds the first operation where all inputs are currently valid, and runs it. When that operation finishes, it will typically output data, thereby making another operation become valid.
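A naive single-threaded runtime along those lines might look like the following sketch (Python; it reuses the hypothetical graph and ready helper from the earlier example, and is meant only to illustrate the scanning loop, not any particular implementation):

def run(graph, initial_inputs):
    # The table of produced values is the entire program state.
    values = dict(initial_inputs)
    fired = True
    while fired:                      # keep scanning while progress is made
        fired = False
        for name, node in graph.items():
            if ready(graph, name, values):
                args = [values[i] for i in node["inputs"]]
                values[name] = node["op"](*args)   # output may enable other ops
                fired = True
    return values

print(run(graph, {"a": 6, "b": 2})["prod"])        # prints 32

Real implementations avoid rescanning the whole table by keeping a work list of operations whose inputs have just become valid.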

For parallel operation, only the list needs to be shared; it is the state of the entire program. Thus the task of maintaining state is removed from the programmer and given to the language's runtime. On machines with a single processor core where an implementation designed for parallel operation would simply introduce overhead, this overhead can be removed completely by using a different runtime.
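Continuing the same hypothetical sketch, a parallel variant only has to share the value table; each wave fires every currently ready operation on a thread pool (locking and scheduling details that a real runtime would need are omitted):

from concurrent.futures import ThreadPoolExecutor

def run_parallel(graph, initial_inputs, workers=4):
    values = dict(initial_inputs)              # the only shared state
    with ThreadPoolExecutor(max_workers=workers) as pool:
        while True:
            runnable = [n for n in graph if ready(graph, n, values)]
            if not runnable:
                break
            # Fire every ready operation concurrently, then publish the results.
            futures = {n: pool.submit(graph[n]["op"],
                                      *(values[i] for i in graph[n]["inputs"]))
                       for n in runnable}
            for n, f in futures.items():
                values[n] = f.result()         # may make further operations ready
    return values

In the example graph, sum and diff run in the same wave while prod waits for the next one; on a single-core machine the same graph can instead be handed to the sequential run above, avoiding the thread-pool overhead.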

Incremental updates

Some recent dataflow libraries such as Differential/Timely Dataflow have used incremental computing for much more efficient data processing.[1][7][8]
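Differential and Timely Dataflow track changes as timestamped deltas and are far more sophisticated than any toy example; purely as an illustration of the idea (and not of those libraries' APIs), incremental evaluation over the hypothetical graph above can be approximated by invalidating and re-firing only the operations downstream of a changed input:

def downstream(graph, changed):
    # Every operation that transitively depends on a changed value.
    dirty, grew = set(), True
    while grew:
        grew = False
        for name, node in graph.items():
            if name not in dirty and any(i in changed or i in dirty
                                         for i in node["inputs"]):
                dirty.add(name)
                grew = True
    return dirty

def update(graph, values, **new_inputs):
    # Apply the changed inputs, discard stale results, recompute the rest.
    values.update(new_inputs)
    for name in downstream(graph, set(new_inputs)):
        values.pop(name, None)
    return run(graph, values)   # only the invalidated operations re-fire

For instance, update(graph, values, b=3) recomputes sum, diff, and prod, but any operation not reachable from b keeps its previous result.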

History

A pioneer dataflow language was BLODI (BLOck DIagram), published in 1961 by John Larry Kelly, Jr., Carol Lochbaum and Victor A. Vyssotsky for specifying sampled data systems.[9] A BLODI specification of functional units (amplifiers, adders, delay lines, etc.) and their interconnections was compiled into a single loop that updated the entire system for one clock tick.

In a 1966 Ph.D. thesis, The On-line Graphical Specification of Computer Procedures,[10] Bert Sutherland created one of the first graphical dataflow programming frameworks in order to make parallel programming easier. Subsequent dataflow languages were often developed at the large supercomputer labs. POGOL, an otherwise conventional data-processing language developed at NSA, compiled large-scale applications composed of multiple file-to-file operations, e.g. merge, select, summarize, or transform, into efficient code that eliminated the creation of or writing to intermediate files to the greatest extent possible.[11] SISAL, a popular dataflow language developed at Lawrence Livermore National Laboratory, looks like most statement-driven languages, but its variables may be assigned only once, which allows the compiler to easily identify the inputs and outputs. A number of offshoots of SISAL have been developed, including SAC, Single Assignment C, which tries to remain as close to the popular C programming language as possible.

The United States Navy funded development of ACOS and SPGN (signal processing graph notation) starting in the early 1980s. This is in use on a number of platforms in the field today.[12]

A more radical concept is Prograph, in which programs are constructed as graphs onscreen, and variables are replaced entirely with lines linking inputs to outputs. Incidentally, Prograph was originally written on the Macintosh, which remained single-processor until the introduction of the DayStar Genesis MP in 1996.[citation needed]

Several hardware architectures are oriented toward the efficient implementation of dataflow programming models; an example is MIT's tagged token dataflow architecture, designed by Greg Papadopoulos.

Data flow has also been proposed as an abstraction for specifying the global behavior of distributed system components: in the live distributed objects programming model, distributed data flows are used to store and communicate state, and as such, they play a role analogous to variables, fields, and parameters in Java-like programming languages.

Languages

Dataflow programming languages include BLODI, POGOL, Lucid, SISAL, SAC (Single Assignment C), SPGN, Prograph, and SIGNAL, among others.

Libraries

See also

Related Research Articles

In computer science, an abstract machine is a theoretical model that allows for a detailed and precise analysis of how a computer system functions. It is similar to a mathematical function in that it receives inputs and produces outputs based on predefined rules. Abstract machines differ from literal machines in that they are expected to perform correctly independently of hardware. Abstract machines are "machines" because they allow step-by-step execution of programmes; they are "abstract" because they ignore many aspects of actual (hardware) machines. A typical abstract machine consists of a definition in terms of input, output, and the set of allowable operations used to turn the former into the latter. They can be used for purely theoretical reasons as well as models for real-world computer systems. In the theory of computation, abstract machines are often used in thought experiments regarding computability or to analyse the complexity of algorithms; this use is fundamental to computational complexity theory. Examples include finite state machines, Mealy machines, push-down automata, and Turing machines.

In computer engineering, a hardware description language (HDL) is a specialized computer language used to describe the structure and behavior of electronic circuits, most commonly to design ASICs and program FPGAs.

Message Passing Interface (MPI) is a standardized and portable message-passing standard designed to function on parallel computing architectures. The MPI standard defines the syntax and semantics of library routines that are useful to a wide range of users writing portable message-passing programs in C, C++, and Fortran. There are several open-source MPI implementations, which fostered the development of a parallel software industry, and encouraged development of portable and scalable large-scale parallel applications.

Visual programming language

In computing, a visual programming language, also known as diagrammatic programming, graphical programming or block coding, is a programming language that lets users create programs by manipulating program elements graphically rather than by specifying them textually. A VPL allows programming with visual expressions, spatial arrangements of text and graphic symbols, used either as elements of syntax or secondary notation. For example, many VPLs are based on the idea of "boxes and arrows", where boxes or other screen objects are treated as entities, connected by arrows, lines or arcs which represent relations.

In computing, dataflow is a broad concept, which has various meanings depending on the application and context. In the context of software architecture, data flow relates to stream processing or reactive programming.

Dataflow architecture is a dataflow-based computer architecture that directly contrasts with the traditional von Neumann architecture or control flow architecture. Dataflow architectures have no program counter, in concept: the executability and execution of instructions is determined solely by the availability of input arguments to the instructions, so the order of instruction execution may be hard to predict.

SystemVerilog

SystemVerilog, standardized as IEEE 1800, is a hardware description and hardware verification language used to model, design, simulate, test and implement electronic systems. SystemVerilog is based on Verilog and some extensions, and since 2008 Verilog has been part of the same IEEE standard. It is commonly used in the semiconductor and electronic design industry as an evolution of Verilog.

Concurrent computing is a form of computing in which several computations are executed concurrently—during overlapping time periods—instead of sequentially—with one completing before the next starts.

In computer science, stream processing is a programming paradigm which views streams, or sequences of events in time, as the central input and output objects of computation. Stream processing encompasses dataflow programming, reactive programming, and distributed data processing. Stream processing systems aim to expose parallel processing for data streams and rely on streaming algorithms for efficient implementation. The software stack for these systems includes components such as programming models and query languages, for expressing computation; stream management systems, for distribution and scheduling; and hardware components for acceleration including floating-point units, graphics processing units, and field-programmable gate arrays.

In computer programming, flow-based programming (FBP) is a programming paradigm that defines applications as networks of black box processes, which exchange data across predefined connections by message passing, where the connections are specified externally to the processes. These black box processes can be reconnected endlessly to form different applications without having to be changed internally. FBP is thus naturally component-oriented.

Arvind is the Johnson Professor of Computer Science and Engineering in the Computer Science and Artificial Intelligence Laboratory (CSAIL) at the Massachusetts Institute of Technology (MIT). He is a Fellow of the Institute of Electrical and Electronics Engineers (IEEE) and the Association for Computing Machinery (ACM). He was also elected as a member into the National Academy of Engineering in 2008 for contributions to dataflow and multithread computing and the development of tools for the high-level synthesis of digital electronics hardware.

In computing, reactive programming is a declarative programming paradigm concerned with data streams and the propagation of change. With this paradigm, it's possible to express static or dynamic data streams with ease, and also communicate that an inferred dependency within the associated execution model exists, which facilitates the automatic propagation of the changed data flow.

Data-intensive computing is a class of parallel computing applications which use a data parallel approach to process large volumes of data, typically terabytes or petabytes in size, commonly referred to as big data. Computing applications that devote most of their execution time to computational requirements are deemed compute-intensive, whereas applications deemed data-intensive require large volumes of data and devote most of their processing time to I/O and manipulation of data.

SIGNAL is a programming language based on synchronized data-flow: a process is a set of equations on elementary flows describing both data and control.

Apache Spark

Apache Spark is an open-source unified analytics engine for large-scale data processing. Spark provides an interface for programming clusters with implicit data parallelism and fault tolerance. Originally developed at the University of California, Berkeley's AMPLab, the Spark codebase was later donated to the Apache Software Foundation, which has maintained it since.

Apache Flink

Apache Flink is an open-source, unified stream-processing and batch-processing framework developed by the Apache Software Foundation. The core of Apache Flink is a distributed streaming data-flow engine written in Java and Scala. Flink executes arbitrary dataflow programs in a data-parallel and pipelined manner. Flink's pipelined runtime system enables the execution of bulk/batch and stream processing programs. Furthermore, Flink's runtime supports the execution of iterative algorithms natively.

Apache Beam

Apache Beam is an open-source unified programming model to define and execute data processing pipelines, including ETL, batch and stream (continuous) processing. Beam pipelines are defined using one of the provided SDKs and executed in one of Beam's supported runners, including Apache Flink, Apache Samza, Apache Spark, and Google Cloud Dataflow.

References

1. Schwarzkopf, Malte (7 March 2020). "The Remarkable Utility of Dataflow Computing". ACM SIGOPS. Retrieved 31 July 2022.
2. Johnston, Wesley M.; Hanna, J. R. Paul; Millar, Richard J. (March 2004). "Advances in Dataflow Programming Languages" (PDF). ACM Computing Surveys. 36: 1–34. doi:10.1145/1013208.1013209. S2CID 5257722. Retrieved 15 August 2013.
3. Wadge, William W.; Ashcroft, Edward A. (1985). Lucid, the Dataflow Programming Language (illustrated ed.). Academic Press. ISBN 9780127296500. Retrieved 15 August 2013.
4. "Dataflow Programming Basics". Getting Started with NI Products. National Instruments Corporation. Retrieved 15 August 2013.
5. Harter, Richard. "Data Flow languages and programming - Part I". Richard Harter's World. Archived from the original on 8 December 2015. Retrieved 15 August 2013.
6. "Why Dataflow Programming Languages are Ideal for Programming Parallel Hardware". Multicore Programming Fundamentals Whitepaper Series. National Instruments Corporation. Retrieved 15 August 2013.
7. McSherry, Frank; Murray, Derek; Isaacs, Rebecca; Isard, Michael (5 January 2013). "Differential dataflow". Microsoft. Retrieved 31 July 2022.
8. "Differential Dataflow". Timely Dataflow. 30 July 2022. Retrieved 31 July 2022.
9. Kelly, John L., Jr.; Lochbaum, Carol; Vyssotsky, V. A. (1961). "A block diagram compiler". Bell System Technical Journal. 40 (3): 669–678. doi:10.1002/j.1538-7305.1961.tb03236.x.
10. Sutherland, William Robert (January 1966). The On-line Graphical Specification of Computer Procedures (PhD thesis). MIT. hdl:1721.1/13474. Retrieved 25 August 2022.
11. Lambert, Gloria (1973). "Large scale file processing: POGOL". POPL '73: Proceedings of the 1st Annual ACM SIGACT-SIGPLAN Symposium on Principles of Programming Languages. ACM. pp. 226–234.
12. Chan, Y. T. Underwater Acoustic Data Processing.