Dataflow architecture


Dataflow architecture is a computer architecture based on dataflow principles that directly contrasts with the traditional von Neumann, or control flow, architecture. Dataflow architectures have no program counter, in concept: the executability and execution of instructions are determined solely by the availability of input arguments to the instructions, [1] so the order of instruction execution may be hard to predict.


Although no commercially successful general-purpose computer hardware has used a dataflow architecture, it has been successfully implemented in specialized hardware such as digital signal processing, network routing, graphics processing, telemetry, and more recently in data warehousing and artificial intelligence (for example, polymorphic dataflow, [2] Convolution Engine, [3] structure-driven, [4] and dataflow scheduling [5] architectures). It is also very relevant in many software architectures today, including database engine designs and parallel computing frameworks.[citation needed]

Synchronous dataflow architectures can be tuned to match the workload presented by real-time data-path applications such as wire-speed packet forwarding. Dataflow architectures that are deterministic in nature enable programmers to manage complex tasks such as processor load balancing, synchronization, and access to common resources. [6]

There is, meanwhile, a clash of terminology, since the term dataflow is also used for a subarea of parallel programming: dataflow programming.

History

Hardware architectures for dataflow were a major topic in computer architecture research in the 1970s and early 1980s. Jack Dennis of MIT pioneered the field of static dataflow architectures, while the Manchester Dataflow Machine [7] and MIT Tagged Token architecture were major projects in dynamic dataflow.

The research, however, never overcame a fundamental problem:

Instructions and their data dependencies proved to be too fine-grained to be effectively distributed in a large network. That is, the time for the instructions and tagged results to travel through a large connection network was longer than the time to do many computations.

Nonetheless, out-of-order execution (OOE) has become the dominant computing paradigm since the 1990s. It is a form of restricted dataflow. This paradigm introduced the idea of an execution window. The execution window follows the sequential order of the von Neumann architecture; within the window, however, instructions are allowed to be completed in data-dependency order. This is accomplished in CPUs that dynamically tag the data dependencies of the code in the execution window. The logical complexity of dynamically keeping track of the data dependencies restricts OOE CPUs to a small number of execution units (2–6) and limits the execution window sizes to the range of 32 to 200 instructions, much smaller than envisioned for full dataflow machines.[citation needed]
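
The following is a minimal sketch, in Python, of the restricted-dataflow idea described above; it is not the logic of any real CPU. A fixed-size window slides over a sequential instruction stream, and an instruction inside the window may complete as soon as the values it reads are available. All names (schedule, window_size, the toy register names) are invented for illustration, and name (WAR/WAW) hazards are ignored, as if register renaming had already removed them.

```python
# Toy model: an instruction is (destination_register, source_registers).
def schedule(program, window_size=4):
    produced = set()                          # registers whose values exist already
    written = {dest for dest, _ in program}   # registers produced somewhere in the program
    pending = list(program)                   # remaining instructions, in program order
    completion_order = []

    while pending:
        window = pending[:window_size]        # only look at the next few instructions
        # fire every windowed instruction whose sources are all available
        # (a source never written by the program counts as an initial input)
        fired = [ins for ins in window
                 if all(src in produced or src not in written for src in ins[1])]
        if not fired:
            raise RuntimeError("unresolved dependency inside the window")
        for ins in fired:
            completion_order.append(ins)
            produced.add(ins[0])
            pending.remove(ins)
    return completion_order

prog = [("a", ("x",)),        # a = f(x)
        ("b", ("a",)),        # b = g(a)      depends on a
        ("c", ("y",)),        # c = h(y)      independent of a and b
        ("d", ("b", "c"))]    # d = k(b, c)
print(schedule(prog))         # c completes before b: data order, not program order
```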

Dataflow architecture topics

Static and dynamic dataflow machines

Designs that use conventional memory addresses as data dependency tags are called static dataflow machines. These machines did not allow multiple instances of the same routines to be executed simultaneously because the simple tags could not differentiate between them.

Designs that use content-addressable memory (CAM) are called dynamic dataflow machines. They use tags in memory to facilitate parallelism.
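
A hypothetical illustration of the difference, with invented names: in a static design the tag is essentially the consumer instruction's address, so two concurrent activations of the same routine would collide at the same slot; in a dynamic design the tag also carries an activation (or loop-iteration) identifier, so the instances stay separate.

```python
# Static tagging: the tag is just the consumer instruction's address.
static_store = {}
def static_deliver(instr_addr, value):
    if instr_addr in static_store:            # a second activation would collide with
        raise RuntimeError("tag collision")   # the first one's still-pending operand
    static_store[instr_addr] = value

# Dynamic tagging: the tag also names the activation, so concurrent
# instances of the same routine can be in flight simultaneously.
dynamic_store = {}
def dynamic_deliver(instr_addr, activation, value):
    dynamic_store[(instr_addr, activation)] = value

static_deliver(0x40, 1.0)
# static_deliver(0x40, 2.0)                  # would raise: same routine, second instance
dynamic_deliver(0x40, activation=0, value=1.0)
dynamic_deliver(0x40, activation=1, value=2.0)   # both instances coexist
print(dynamic_store)
```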

Compiler

Normally, in the control flow architecture, compilers analyze program source code for data dependencies between instructions in order to better organize the instruction sequences in the binary output files. The instructions are organized sequentially but the dependency information itself is not recorded in the binaries. Binaries compiled for a dataflow machine contain this dependency information.

A dataflow compiler records these dependencies by creating unique tags for each dependency instead of using variable names. Giving each dependency a unique tag allows the non-dependent code segments in the binary to be executed out of order and in parallel. The compiler also detects loops, break statements, and other control-flow constructs and translates them into dataflow form.
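
As a rough sketch of the tagging step only (not a real dataflow compiler, and ignoring control flow), the toy function below assigns a fresh tag to each produced value, so a reused variable name such as t maps to a new tag on each redefinition and consumers refer to tags rather than names. The representation and helper names are invented for illustration.

```python
from itertools import count

fresh = count()                      # generator of fresh, unique tags

def compile_dataflow(source):
    """source: list of (var, op, operand_vars); returns tag-based instructions."""
    tag_of = {}                      # variable name -> tag currently holding its value
    program = []
    for var, op, operands in source:
        in_tags = [tag_of.get(v, f"input:{v}") for v in operands]
        out_tag = next(fresh)
        tag_of[var] = out_tag        # later reads of `var` refer to this tag
        program.append({"op": op, "inputs": in_tags, "output": out_tag})
    return program

# t = x + y;  u = t * t;  t = u - 1   (the redefinition of t gets its own tag)
src = [("t", "add", ["x", "y"]),
       ("u", "mul", ["t", "t"]),
       ("t", "sub", ["u"])]
for ins in compile_dataflow(src):
    print(ins)
```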

Programs

Programs are loaded into the CAM of a dynamic dataflow computer. When all of the tagged operands of an instruction become available (that is, output from previous instructions and/or user input), the instruction is marked as ready for execution by an execution unit.

This is known as activating or firing the instruction. Once an instruction is completed by an execution unit, its output data is sent (with its tag) to the CAM. Any instructions that are dependent upon this particular datum (identified by its tag value) are then marked as ready for execution. In this way, subsequent instructions are executed in proper order, avoiding race conditions. This order may differ from the sequential order envisioned by the human programmer, the programmed order.
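
A minimal sketch of this firing rule is shown below, under loose assumptions: a single execution unit, a Python dictionary standing in for the CAM, and an invented token and instruction format. When the number of operands waiting under a consumer's tag reaches the instruction's arity, the instruction fires and its result is sent, tagged, to its consumers.

```python
from collections import defaultdict

# instruction store: tag -> (operation, number of operands, consumer tags)
instructions = {
    "add1": (lambda a, b: a + b, 2, ["mul1", "mul1"]),   # result feeds both mul1 inputs
    "mul1": (lambda a, b: a * b, 2, ["out"]),
}

waiting = defaultdict(list)      # stand-in CAM: consumer tag -> operands received so far

def deliver(token_tag, value):
    """Match a data token against the CAM; fire any instruction that becomes ready."""
    if token_tag == "out":
        print("result:", value)
        return
    op, arity, consumers = instructions[token_tag]
    waiting[token_tag].append(value)
    if len(waiting[token_tag]) == arity:          # all operands present: activate/fire
        result = op(*waiting.pop(token_tag))
        for consumer in consumers:                # emit tagged data tokens
            deliver(consumer, result)

# feed the initial operands: computes (3 + 4) * (3 + 4) = 49
deliver("add1", 3)
deliver("add1", 4)
```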

Instructions

An instruction, along with its required data operands, is transmitted to an execution unit as a packet, also called an instruction token. Similarly, output data is transmitted back to the CAM as a data token. The packetization of instructions and results allows for parallel execution of ready instructions on a large scale.

Dataflow networks deliver the instruction tokens to the execution units and return the data tokens to the CAM. In contrast to the conventional von Neumann architecture, data tokens are not permanently stored in memory; rather, they are transient messages that only exist while in transit to the instruction storage.
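
As a purely illustrative sketch of the packetization described above (the field names and layout are invented, not taken from any particular machine), the two kinds of token might be modeled like this:

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class DataToken:
    tag: int                      # identifies the consumer instruction (and activation)
    value: float                  # the datum itself; exists only while in transit

@dataclass(frozen=True)
class InstructionToken:
    opcode: str                   # e.g. "add", "mul"
    operands: Tuple[float, ...]   # all inputs travel with the instruction
    result_tags: Tuple[int, ...]  # where the output data tokens must be sent

print(DataToken(tag=7, value=3.0))
print(InstructionToken(opcode="add", operands=(3.0, 4.0), result_tags=(7,)))
```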

Related Research Articles

Superscalar processor

A superscalar processor is a CPU that implements a form of parallelism called instruction-level parallelism within a single processor. In contrast to a scalar processor, which can execute at most one single instruction per clock cycle, a superscalar processor can execute more than one instruction during a clock cycle by simultaneously dispatching multiple instructions to different execution units on the processor. It therefore allows more throughput than would otherwise be possible at a given clock rate. Each execution unit is not a separate processor, but an execution resource within a single CPU such as an arithmetic logic unit.

Program counter

The program counter (PC), commonly called the instruction pointer (IP) in Intel x86 and Itanium microprocessors, and sometimes called the instruction address register (IAR), the instruction counter, or just part of the instruction sequencer, is a processor register that indicates where a computer is in its program sequence.

Instruction-level parallelism

Instruction-level parallelism (ILP) is the parallel or simultaneous execution of a sequence of instructions in a computer program. More specifically ILP refers to the average number of instructions run per step of this parallel execution.

In information technology and computer science, a system is described as stateful if it is designed to remember preceding events or user interactions; the remembered information is called the state of the system.

In parallel computer architectures, a systolic array is a homogeneous network of tightly coupled data processing units (DPUs) called cells or nodes. Each node or DPU independently computes a partial result as a function of the data received from its upstream neighbours, stores the result within itself and passes it downstream. Systolic arrays were first used in Colossus, which was an early computer used to break German Lorenz ciphers during World War II. Due to the classified nature of Colossus, they were independently invented or rediscovered by H. T. Kung and Charles Leiserson, who described arrays for many dense linear algebra computations for banded matrices. Early applications include computing greatest common divisors of integers and polynomials. They are sometimes classified as multiple-instruction single-data (MISD) architectures under Flynn's taxonomy, but this classification is questionable because a strong argument can be made to distinguish systolic arrays from any of Flynn's four categories: SISD, SIMD, MISD, and MIMD.

Tomasulo's algorithm is a computer architecture hardware algorithm for dynamic scheduling of instructions that allows out-of-order execution and enables more efficient use of multiple execution units. It was developed by Robert Tomasulo at IBM in 1967 and was first implemented in the IBM System/360 Model 91’s floating point unit.

In computer architecture, register renaming is a technique that abstracts logical registers from physical registers. Every logical register has a set of physical registers associated with it. When a machine language instruction refers to a particular logical register, the processor transposes this name to one specific physical register on the fly. The physical registers are opaque and cannot be referenced directly but only via the canonical names.
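
As a rough, invented illustration of that mapping (not any real core's renaming logic), each write to a logical register can be given a fresh physical register, so a later reuse of the same logical name does not disturb earlier, still-needed values:

```python
from itertools import count

free_physical = count()          # pretend-unbounded physical register file
rename_map = {}                  # logical name -> current physical register

def rename(instr):
    dest, srcs = instr
    phys_srcs = [rename_map[s] for s in srcs]   # read the current mappings
    phys_dest = next(free_physical)             # allocate a fresh physical register
    rename_map[dest] = phys_dest                # later readers of `dest` see this one
    return phys_dest, phys_srcs

rename_map.update({"r1": next(free_physical), "r2": next(free_physical)})
print(rename(("r3", ["r1", "r2"])))   # r3 gets a fresh physical register
print(rename(("r1", ["r3"])))         # rewriting r1 does not touch the old r1 value
```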

In computing, dataflow is a broad concept, which has various meanings depending on the application and context. In the context of software architecture, data flow relates to stream processing or reactive programming.

In computer programming, dataflow programming is a programming paradigm that models a program as a directed graph of the data flowing between operations, thus implementing dataflow principles and architecture. Dataflow programming languages share some features of functional languages, and were generally developed in order to bring some functional concepts to a language more suitable for numeric processing. Some authors use the term datastream instead of dataflow to avoid confusion with dataflow computing or dataflow architecture, based on an indeterministic machine paradigm. Dataflow programming was pioneered by Jack Dennis and his graduate students at MIT in the 1960s.

In computing, a pipeline, also known as a data pipeline, is a set of data processing elements connected in series, where the output of one element is the input of the next one. The elements of a pipeline are often executed in parallel or in time-sliced fashion. Some amount of buffer storage is often inserted between elements.

Automatic parallelization, also auto parallelization, or autoparallelization refers to converting sequential code into multi-threaded and/or vectorized code in order to use multiple processors simultaneously in a shared-memory multiprocessor (SMP) machine. Fully automatic parallelization of sequential programs is a challenge because it requires complex program analysis and the best approach may depend upon parameter values that are not known at compilation time.

Kahn process networks

A Kahn process network is a distributed model of computation in which a group of deterministic sequential processes communicate through unbounded first in, first out channels. The model requires that reading from a channel is blocking while writing is non-blocking. Due to these key restrictions, the resulting process network exhibits deterministic behavior that does not depend on the timing of computation nor on communication delays.
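
A minimal two-process sketch of that idea, using Python threads and queue.Queue as the FIFO channel (an unbounded Queue gives non-blocking writes, and get() gives blocking reads), is shown below; the process names and the None end-of-stream marker are our own conventions, not part of the model.

```python
import threading, queue

channel = queue.Queue()          # FIFO channel between the two processes
results = queue.Queue()

def producer():
    for i in range(5):
        channel.put(i * i)       # non-blocking write on an unbounded queue
    channel.put(None)            # end-of-stream marker (our own convention)

def consumer():
    while True:
        item = channel.get()     # blocking read, as the model requires
        if item is None:
            break
        results.put(item + 1)

threads = [threading.Thread(target=producer), threading.Thread(target=consumer)]
for t in threads: t.start()
for t in threads: t.join()
print([results.get() for _ in range(results.qsize())])   # deterministic: [1, 2, 5, 10, 17]
```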

Concurrent computing is a form of computing in which several computations are executed concurrently—during overlapping time periods—instead of sequentially—with one completing before the next starts.

Binary Modular Dataflow Machine

Binary Modular Dataflow Machine (BMDFM) is a software package that enables running an application in parallel on shared memory symmetric multiprocessing (SMP) computers using the multiple processors to speed up the execution of single applications. BMDFM automatically identifies and exploits parallelism due to the static and mainly dynamic scheduling of the dataflow instruction sequences derived from the formerly sequential program.

In computer science, stream processing is a programming paradigm which views streams, or sequences of events in time, as the central input and output objects of computation. Stream processing encompasses dataflow programming, reactive programming, and distributed data processing. Stream processing systems aim to expose parallel processing for data streams and rely on streaming algorithms for efficient implementation. The software stack for these systems includes components such as programming models and query languages, for expressing computation; stream management systems, for distribution and scheduling; and hardware components for acceleration including floating-point units, graphics processing units, and field-programmable gate arrays.

A data dependency in computer science is a situation in which a program statement (instruction) refers to the data of a preceding statement. In compiler theory, the technique used to discover data dependencies among statements is called dependence analysis.

Explicit data graph execution, or EDGE, is a type of instruction set architecture (ISA) which intends to improve computing performance compared to common processors like the Intel x86 line. EDGE combines many individual instructions into a larger group known as a "hyperblock". Hyperblocks are designed to be able to easily run in parallel.

CAL is a high-level programming language for writing (dataflow) actors, which are stateful operators that transform input streams of data objects (tokens) into output streams. CAL has been compiled to a variety of target platforms, including single-core processors, multicore processors, and programmable hardware. It has been used in several application areas, including video and image processing, compression, and cryptography. The MPEG Reconfigurable Video Coding (RVC) working group has adopted CAL as part of their standardization efforts.

Specialized computer hardware is often used to execute artificial intelligence (AI) programs faster, and with less energy, such as Lisp machines, neuromorphic engineering, event cameras, and physical neural networks. As of 2023, the market for AI hardware is dominated by GPUs.

References

  1. Veen, Arthur H. (December 1986). "Dataflow Machine Architecture". ACM Computing Surveys. 18 (4): 365–396. doi:10.1145/27633.28055. S2CID 5467025. Retrieved 5 March 2019.
  2. Maxfield, Max (24 December 2020). "Say Hello to Deep Vision's Polymorphic Dataflow Architecture". Electronic Engineering Journal. Techfocus media.
  3. "Kinara (formerly Deep Vision)". Kinara. 2022. Retrieved 2022-12-11.
  4. "Hailo". Hailo. Retrieved 2022-12-11.
  5. Lie, Sean (29 August 2022). Cerebras Architecture Deep Dive: First Look Inside the HW/SW Co-Design for Deep Learning. Cerebras (Report).
  6. "HX300 Family of NPUs and Programmable Ethernet Switches to the Fiber Access Market". EN-Genius (Press release). June 18, 2008. Archived from the original on 2011-07-22.
  7. Manchester Dataflow Research Project, Research Reports: Abstracts, September 1997