# Kahn process networks


A Kahn process network (KPN, or process network) is a distributed model of computation in which a group of deterministic sequential processes communicate through unbounded FIFO channels. The resulting process network exhibits deterministic behavior that does not depend on computation or communication delays. The model was originally developed for modeling distributed systems but has also proven convenient for modeling signal processing systems. As such, KPNs have found many applications in modeling embedded systems, high-performance computing systems, and other computational tasks. KPNs were first introduced by Gilles Kahn. [1]

## Execution model

KPNs are a common model for describing signal processing systems in which infinite streams of data are incrementally transformed by processes executing in sequence or in parallel. Although the processes are conceptually parallel, executing the model does not require multitasking or parallel hardware.

In a KPN, processes communicate via unbounded FIFO channels. Processes read atomic data elements, also called tokens, from channels and write them to channels. Writing to a channel is non-blocking: it always succeeds and does not stall the process. Reading from a channel is blocking: a process that reads from an empty channel stalls and can continue only when the channel contains enough tokens. Processes are not allowed to test an input channel for the existence of tokens without consuming them. A FIFO cannot be consumed by multiple processes, nor can multiple processes produce to a single FIFO. Given a specific input token history, a process must be deterministic, so that it always produces the same output tokens. The timing or execution order of processes must not affect the result, which is why testing input channels for tokens is forbidden.
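The channel rules above can be sketched with threads and unbounded queues. This is a minimal illustration, not a reference implementation: the `DONE` end-of-stream marker, the process names, and the sample token values are assumptions introduced only so that the example terminates. The KPN model itself works on infinite streams.

```python
import queue
import threading

DONE = object()  # hypothetical end-of-stream marker, not part of the KPN model

def source(out, values):
    # A pure data source: it has no input channels and only writes.
    for v in values:
        out.put(v)  # queue.Queue() is unbounded by default, so put() never blocks
    out.put(DONE)

def adder(in_a, in_b, out):
    # Blocking reads: get() stalls until the channel holds a token.
    while True:
        a = in_a.get()
        b = in_b.get()
        if a is DONE or b is DONE:
            out.put(DONE)
            return
        out.put(a + b)

def run_network():
    # One FIFO per (producer, consumer) pair; no channel is shared.
    a, b, c = queue.Queue(), queue.Queue(), queue.Queue()
    procs = [
        threading.Thread(target=source, args=(a, [1, 2, 3])),
        threading.Thread(target=source, args=(b, [10, 20, 30])),
        threading.Thread(target=adder, args=(a, b, c)),
    ]
    for p in procs:
        p.start()
    result = []
    while True:
        tok = c.get()
        if tok is DONE:
            break
        result.append(tok)
    for p in procs:
        p.join()
    return result

print(run_network())
```

Because `adder` never polls its inputs and each queue has a single writer and a single reader, the tokens collected from channel `c` do not depend on how the operating system interleaves the threads.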

### Notes on processes

• A process need not read any input or have any input channels, as it may act as a pure data source.
• A process need not write any output or have any output channels.
• Testing input channels for emptiness (or non-blocking reads) could be allowed for optimization purposes, but it must not affect outputs. It can be beneficial to do some work in advance rather than wait for a channel. For example, assume a process performs two reads from different channels. If the first read would stall (wait for a token) but the second could obtain a token immediately, it could be beneficial to perform the second read first to save time, because the reading itself often takes some time (e.g. for memory allocation or copying).

### Process firing semantics as Petri nets

Assume that a process P in a KPN is constructed so that it first reads data from channel A, then from channel B, computes something, and then writes data to channel C. The execution model of this process can be modeled with a Petri net. A single token in the PE resource place prevents the process from being executed simultaneously for different input data. When data arrives at channel A or B, tokens are placed into the places FIFO A and FIFO B respectively. The transitions of the Petri net are associated with the respective I/O operations and the computation. When the data has been written to channel C, the PE resource place is refilled with its initial marking, allowing new data to be read.
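The firing behavior described above can be played through with a small token-game simulation. This is a sketch under assumptions: the intermediate places `got_A` and `ready`, and the merging of computation and writing into one transition, are choices made for this illustration and are not taken from the original figure.

```python
# Transition name -> (input places, output places).
# "got_A" and "ready" are hypothetical intermediate places for this sketch.
TRANSITIONS = {
    "read_A": (["FIFO_A", "PE"], ["got_A"]),
    "read_B": (["FIFO_B", "got_A"], ["ready"]),
    "compute_write_C": (["ready"], ["FIFO_C", "PE"]),
}

def enabled(marking, name):
    # A transition is enabled when every input place holds at least one token.
    inputs, _ = TRANSITIONS[name]
    return all(marking[p] >= 1 for p in inputs)

def fire(marking, name):
    # Firing removes one token per input place and adds one per output place.
    inputs, outputs = TRANSITIONS[name]
    new = dict(marking)
    for p in inputs:
        new[p] -= 1
    for p in outputs:
        new[p] += 1
    return new

def run(marking):
    # Fire enabled transitions until none is left; return final marking and trace.
    trace = []
    while True:
        ready = [t for t in TRANSITIONS if enabled(marking, t)]
        if not ready:
            return marking, trace
        marking = fire(marking, ready[0])
        trace.append(ready[0])

initial = {"FIFO_A": 2, "FIFO_B": 1, "PE": 1, "got_A": 0, "ready": 0, "FIFO_C": 0}
final, trace = run(initial)
print(trace)
print(final)
```

With two tokens on FIFO A and one on FIFO B, the process completes one full firing, starts a second by reading A, and then stalls waiting for B. The single PE token is what keeps the second `read_A` from firing before the first result has been written.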

### Process as a finite state machine

A process can be modeled as a finite state machine that is in one of two states:

• Active; the process computes or writes data
• Wait; the process is blocked (waiting) for data

Assuming the finite state machine reads program elements associated with the process, it may read three kinds of tokens: "Compute", "Read" and "Write token". In the Wait state, it can return to the Active state only by reading a special "Get token", which indicates that the communication channel associated with the wait contains readable data.
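The two-state machine can be written down as a transition function. The sketch below makes one simplifying assumption that the text does not state explicitly: every "Read" moves the machine into the Wait state, and the following "Get token" brings it back, even when data is already available.

```python
from enum import Enum

class State(Enum):
    ACTIVE = "Active"
    WAIT = "Wait"

def step(state, token):
    # Active: computing and writing never block; a read waits for data.
    if state is State.ACTIVE:
        if token in ("Compute", "Write token"):
            return State.ACTIVE
        if token == "Read":
            return State.WAIT
    # Wait: only "Get token" (data has become readable) resumes the process.
    elif state is State.WAIT and token == "Get token":
        return State.ACTIVE
    raise ValueError(f"token {token!r} not accepted in state {state.value}")

def run(tokens, state=State.ACTIVE):
    # Return the sequence of states visited, starting with the initial state.
    history = [state]
    for t in tokens:
        state = step(state, t)
        history.append(state)
    return history

print([s.value for s in run(["Read", "Get token", "Compute", "Write token"])])
```

Rejecting every token except "Get token" in the Wait state mirrors the blocking read: a stalled process cannot compute or write until its input arrives.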

## Properties

### Boundedness of channels

A channel is strictly bounded by $b$ if it has at most $b$ unconsumed tokens for any possible execution. A KPN is strictly bounded by $b$ if all channels are strictly bounded by $b$.

The number of unconsumed tokens depends on the execution order (scheduling) of processes. A spontaneous data source could produce arbitrarily many tokens into a channel if the scheduler did not execute the processes consuming those tokens.

A real application cannot have unbounded FIFOs, so the scheduling and the maximum capacity of FIFOs must be designed into a practical implementation. The maximum capacity of FIFOs can be handled in several ways:

• FIFO bounds can be derived mathematically at design time to avoid FIFO overflows. This is, however, not possible for all KPNs: testing whether a KPN is strictly bounded by $b$ is an undecidable problem.[citation needed] Moreover, in practical situations, the bound may be data dependent.
• FIFO bounds can be grown on demand. [2]
• Blocking writes can be used so that a process blocks if a FIFO is full. This approach may lead to an artificial deadlock unless the designer properly derives safe bounds for the FIFOs (Parks, 1995). Local detection of artificial deadlocks at run-time may be necessary to guarantee the production of the correct output. [3]
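The last two options can be combined: run with small bounded FIFOs and blocking writes, and enlarge a FIFO only when an artificial deadlock is detected. The sketch below simulates this with generator-based processes and a round-robin scheduler. The process bodies, the channel names `X` and `Y`, and the rule of growing the smallest full channel by one slot are assumptions of this illustration.

```python
from collections import deque

def producer(n):
    # Writes two tokens to X before each token on Y.
    for i in range(n):
        yield ("write", "X", 2 * i)
        yield ("write", "X", 2 * i + 1)
        yield ("write", "Y", i)

def consumer(n, out):
    # Reads Y first, so the tokens on X must be buffered in the meantime.
    for _ in range(n):
        y = yield ("read", "Y")
        x0 = yield ("read", "X")
        x1 = yield ("read", "X")
        out.append((y, x0, x1))

def run(processes, capacity):
    fifos = {ch: deque() for ch in capacity}
    capacity = dict(capacity)
    pending = {}  # process name -> operation it is currently trying to perform
    for name, gen in processes.items():
        try:
            pending[name] = next(gen)
        except StopIteration:
            pass
    while pending:
        progressed = False
        for name in list(pending):
            op = pending[name]
            ch = op[1]
            if op[0] == "read":
                if not fifos[ch]:
                    continue  # blocking read on an empty channel
                reply = fifos[ch].popleft()
            else:
                if len(fifos[ch]) >= capacity[ch]:
                    continue  # blocking write on a full channel
                fifos[ch].append(op[2])
                reply = None
            progressed = True
            try:
                pending[name] = processes[name].send(reply)
            except StopIteration:
                del pending[name]
        if not progressed:
            blocked_writes = [op[1] for op in pending.values() if op[0] == "write"]
            if not blocked_writes:
                break  # real deadlock: every process waits on an empty channel
            # Artificial deadlock: enlarge the smallest full channel by one slot.
            smallest = min(blocked_writes, key=lambda c: capacity[c])
            capacity[smallest] += 1
    return capacity

out = []
caps = run({"producer": producer(2), "consumer": consumer(2, out)}, {"X": 1, "Y": 1})
print(out, caps)
```

With both capacities set to 1, the producer blocks on its second write to `X` while the consumer is still waiting on `Y`. Neither can proceed, although the unbounded network would not deadlock, so the scheduler raises the capacity of `X` and execution continues.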

### Closed and open systems

A closed KPN has no external input or output channels. Processes that have no input channels act as data sources and processes that have no output channels act as data sinks. In an open KPN each process has at least one input and output channel.

### Determinism

Processes of a KPN are deterministic: for the same input history, they must always produce exactly the same output. Processes can be modeled as sequential programs that perform reads and writes on ports in any order or quantity, as long as the determinism property is preserved. As a consequence, the KPN model is deterministic, so that the following factors entirely determine the outputs of the system:

• processes
• the network
• initial tokens

Hence, timing of the processes does not affect outputs of the system.
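The independence from timing can be checked experimentally by running one network under many different schedules. The sketch below uses generator-based processes and a scheduler that picks a random runnable process at each step. The network (two sources feeding an adder), the channel names, and the seeds are assumptions made for illustration.

```python
import random
from collections import deque

def source(ch, values):
    for v in values:
        yield ("write", ch, v)

def adder(a, b, out):
    while True:
        x = yield ("read", a)
        y = yield ("read", b)
        yield ("write", out, x + y)

def run(seed):
    rng = random.Random(seed)  # each seed gives a different execution order
    fifos = {"A": deque(), "B": deque(), "C": deque()}
    procs = [source("A", [1, 2, 3]), source("B", [10, 20, 30]), adder("A", "B", "C")]
    pending = {i: next(p) for i, p in enumerate(procs)}
    while True:
        # Writes are always possible; reads need a token on the channel.
        runnable = [i for i, op in pending.items()
                    if op[0] == "write" or fifos[op[1]]]
        if not runnable:
            return list(fifos["C"])  # nothing can proceed: collect the output
        i = rng.choice(runnable)
        op = pending[i]
        if op[0] == "write":
            fifos[op[1]].append(op[2])
            reply = None
        else:
            reply = fifos[op[1]].popleft()
        try:
            pending[i] = procs[i].send(reply)
        except StopIteration:
            del pending[i]

print({tuple(run(seed)) for seed in range(20)})
```

Every seed interleaves the three processes differently, yet the set of observed outputs contains a single sequence, since the result is fixed by the processes, the network, and the initial tokens alone.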

### Monotonicity

KPN processes are monotonic, which means that they need only partial information about the input stream in order to produce partial information about the output stream. Monotonicity allows parallelism. In a KPN, the events (tokens) within a single signal (the stream carried by one channel) are totally ordered. However, there is no order relation between events in different signals. Thus, KPNs are only partially ordered, which classifies them as an untimed model.
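One common way to make monotonicity precise is the prefix order on streams: if an input X is a prefix of an input Y, the output for X must be a prefix of the output for Y. The sketch below checks this property on finite prefixes of a sample stream. The functions `running_sum` and `last_only` are hypothetical processes chosen for illustration.

```python
def running_sum(stream):
    # Emits one output token per input token; earlier outputs are never revised.
    total, out = 0, []
    for x in stream:
        total += x
        out.append(total)
    return out

def last_only(stream):
    # Counter-example: the output is replaced each time more input arrives.
    return stream[-1:]

def is_prefix(short, long):
    return long[:len(short)] == short

def is_monotonic_on(f, stream):
    # Compare the outputs for every pair of consecutive prefixes of the stream.
    outputs = [f(stream[:n]) for n in range(len(stream) + 1)]
    return all(is_prefix(outputs[n], outputs[n + 1]) for n in range(len(stream)))

print(is_monotonic_on(running_sum, [3, 1, 4]))
print(is_monotonic_on(last_only, [3, 1, 4]))
```

A process like `running_sum` can emit output as soon as part of its input is known, so a downstream process may start working before the upstream one has finished. A function like `last_only` would have to retract tokens it has already sent, which a FIFO channel cannot express.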

## Applications

Due to their high expressiveness and succinctness, KPNs are used as the underlying model of computation in several academic modeling tools to represent streaming applications, which have certain properties (e.g., dataflow-oriented, stream-based).

The open source Daedalus framework [4], maintained by the Leiden Embedded Research Center at Leiden University, accepts sequential programs written in C and generates a corresponding KPN. This KPN could then, for example, be mapped systematically onto an FPGA-based platform.

The Ambric Am2045 massively parallel processor array is a KPN implemented in actual silicon. [5] Its 336 32-bit processors are connected by a programmable interconnect of dedicated FIFOs. Thus its channels are strictly bounded with blocking writes.

## References

1. Kahn, G. (1974). Rosenfeld, Jack L. (ed.). The semantics of a simple language for parallel programming. Proc. IFIP Congress on Information Processing. North-Holland. ISBN 0-7204-2803-3.
2. Parks, Thomas M. (1995). Bounded Scheduling of Process Networks (Ph.D.). University of California, Berkeley.
3. Geilen, Marc; Basten, Twan (2003). Degano, P. (ed.). Requirements on the Execution of Kahn Process Networks. Proc. 12th European Symposium on Programming Languages and Systems (ESOP). Springer. pp. 319–334.
4. LIACS Daedalus framework. http://daedalus.liacs.nl
5. Butts, Mike; Jones, Anthony Mark; Wasson, Paul (April 2007). A Structural Object Programming Model, Architecture, Chip and Tools for Reconfigurable Computing. Proceedings of FCCM. IEEE Computer Society.