Distributed data flow

Last updated
An illustration of the basic concepts involved in the definition of a distributed data flow. Definition of a Distributed Data Flow.gif
An illustration of the basic concepts involved in the definition of a distributed data flow.

Distributed data flow (also abbreviated as distributed flow) refers to a set of events in a distributed application or protocol.

In computing, an event is an action or occurrence recognized by software, often originating asynchronously from the external environment, that may be handled by the software. Because an event is an entity which encapsulates the action and the contextual variables triggering the action we can use the acrostic mnemonic of an event as an "Execution Variable Encapsulating Named Trigger" to clarify the concept. Computer events can be generated or triggered by the system, by the user or in other ways. Typically, events are handled synchronously with the program flow, that is, the software may have one or more dedicated places where events are handled, frequently an event loop. A source of events includes the user, who may interact with the software by way of, for example, keystrokes on the keyboard. Another source is a hardware device such as a timer. Software can also trigger its own set of events into the event loop, e.g. to communicate the completion of a task. Software that changes its behavior in response to events is said to be event-driven, often with the goal of being interactive.

Distributed computing is a field of computer science that studies distributed systems. A distributed system is a system whose components are located on different networked computers, which communicate and coordinate their actions by passing messages to one another. The components interact with one another in order to achieve a common goal. Three significant characteristics of distributed systems are: concurrency of components, lack of a global clock, and independent failure of components. Examples of distributed systems vary from SOA-based systems to massively multiplayer online games to peer-to-peer applications.

Contents

Distributed data flows serve a purpose analogous to variables or method parameters in programming languages such as Java, in that they can represent state that is stored or communicated by a layer of software. Unlike variables or parameters, which represent a unit of state that resides in a single location, distributed flows are dynamic and distributed: they simultaneously appear in multiple locations within the network at the same time. As such, distributed flows are a more natural way of modeling the semantics and inner workings of certain classes of distributed systems. In particular, the distributed data flow abstraction has been used as a convenient way of expressing the high-level logical relationships between parts of distributed protocols. [1] [2] [3]

Programming language language designed to communicate instructions to a machine

A programming language is a formal language, which comprises a set of instructions that produce various kinds of output. Programming languages are used in computer programming to implement algorithms.

Java (programming language) Object-oriented programming language

Java is a general-purpose computer-programming language that is concurrent, class-based, object-oriented, and specifically designed to have as few implementation dependencies as possible. It is intended to let application developers "write once, run anywhere" (WORA), meaning that compiled Java code can run on all platforms that support Java without the need for recompilation. Java applications are typically compiled to "bytecode" that can run on any Java virtual machine (JVM) regardless of the underlying computer architecture. The language derives much of its original features from SmallTalk, with a syntax similar to C and C++, but it has fewer low-level facilities than either of them. As of 2016, Java was one of the most popular programming languages in use, particularly for client-server web applications, with a reported 9 million developers.

Informal properties

A distributed data flow satisfies the following informal properties.

In multithreaded computer programming, asynchronous method invocation (AMI), also known as asynchronous method calls or the asynchronous pattern is a design pattern in which the call site is not blocked while waiting for the called code to finish. Instead, the calling thread is notified when the reply arrives. Polling for a reply is an undesired option.

In computer science, message passing is a technique for invoking behavior on a computer. The invoking program sends a message to a process and relies on the process and the supporting infrastructure to select and invoke the actual code to run. Message passing differs from conventional programming where a process, subroutine, or function is directly invoked by name. Message passing is key to some models of concurrency and object-oriented programming.

Multicast a computer networking technique for forwarding transmissions from one sender to multiple receivers

In computer networking, multicast is group communication where data transmission is addressed to a group of destination computers simultaneously. Multicast can be one-to-many or many-to-many distribution. Multicast should not be confused with physical layer point-to-multipoint communication.

Formal representation

Formally, we represent each event in a distributed flow as a quadruple of the form (x,t,k,v), where x is the location (e.g., the network address of a physical node) at which the event occurs, t is the time at which this happens, k is a version, or a sequence number identifying the particular event, and v is a value that represents the event payload (e.g., all the arguments passed in a method call). Each distributed flow is a (possibly infinite) set of such quadruples that satisfies the following three formal properties.

In addition to the above, flows can have a number of additional properties.

Related Research Articles

OSI model Model with 7 layers to describe communications systems

The Open Systems Interconnection model is a conceptual model that characterizes and standardizes the communication functions of a telecommunication or computing system without regard to its underlying internal structure and technology. Its goal is the interoperability of diverse communication systems with standard protocols. The model partitions a communication system into abstraction layers. The original version of the model defined seven layers.

In distributed computing, a remote procedure call (RPC) is when a computer program causes a procedure (subroutine) to execute in a different address space, which is coded as if it were a normal (local) procedure call, without the programmer explicitly coding the details for the remote interaction. That is, the programmer writes essentially the same code whether the subroutine is local to the executing program, or remote. This is a form of client–server interaction, typically implemented via a request–response message-passing system. In the object-oriented programming paradigm, RPC calls are represented by remote method invocation (RMI). The RPC model implies a level of location transparency, namely that calling procedures is largely the same whether it is local or remote, but usually they are not identical, so local calls can be distinguished from remote calls. Remote calls are usually orders of magnitude slower and less reliable than local calls, so distinguishing them is important.

In computer networking, the transport layer is a conceptual division of methods in the layered architecture of protocols in the network stack in the Internet protocol suite and the OSI model. The protocols of this layer provide host-to-host communication services for applications. It provides services such as connection-oriented communication, reliability, flow control, and multiplexing.

Message-oriented middleware (MOM) is software or hardware infrastructure supporting sending and receiving messages between distributed systems. MOM allows application modules to be distributed over heterogeneous platforms and reduces the complexity of developing applications that span multiple operating systems and network protocols. The middleware creates a distributed communications layer that insulates the application developer from the details of the various operating systems and network interfaces. APIs that extend across diverse platforms and networks are typically provided by MOM.

The Resource Reservation Protocol (RSVP) is a transport layer protocol designed to reserve resources across a network for quality of service (QoS) using the integrated services model. RSVP operates over an IPv4 or IPv6 and provides receiver-initiated setup of resource reservations for multicast or unicast data flows. It does not transport application data but is similar to a control protocol, like Internet Control Message Protocol (ICMP) or Internet Group Management Protocol (IGMP). RSVP is described in RFC 2205.

An overlay network is a computer network that is built on top of another network.

In software architecture, publish–subscribe is a messaging pattern where senders of messages, called publishers, do not program the messages to be sent directly to specific receivers, called subscribers, but instead categorize published messages into classes without knowledge of which subscribers, if any, there may be. Similarly, subscribers express interest in one or more classes and only receive messages that are of interest, without knowledge of which publishers, if any, there are.

Distributed object

In distributed computing, distributed objects are objects that are distributed across different address spaces, either in different processes on the same computer, or even in multiple computers connected via a network, but which work together by sharing data and invoking methods. This often involves location transparency, where remote objects appear the same as local objects. The main method of distributed object communication is with remote method invocation, generally by message-passing: one object sends a message to another object in a remote machine or process to perform some task. The results are sent back to the calling object.

Connection-oriented communication is a network communication mode in telecommunications and computer networking, where a communication session or a semi-permanent connection is established before any useful data can be transferred, and where a stream of data is delivered in the same order as it was sent. The alternative to connection-oriented transmission is connectionless communication, for example the datagram mode communication used by the IP and UDP protocols, where data may be delivered out of order, since different network packets are routed independently, and may be delivered over different paths.

In computer science, consistent hashing is a special kind of hashing such that when a hash table is resized, only keys need to be remapped on average, where is the number of keys, and is the number of slots. In contrast, in most traditional hash tables, a change in the number of array slots causes nearly all keys to be remapped because the mapping between the keys and the slots is defined by a modular operation.

Replication in computing involves sharing information so as to ensure consistency between redundant resources, such as software or hardware components, to improve reliability, fault-tolerance, or accessibility.

Computer network collection of autonomous computers interconnected by a single technology

A computer network is a digital telecommunications network which allows nodes to share resources. In computer networks, computing devices exchange data with each other using connections between nodes. These data links are established over cable media such as wires or optic cables, or wireless media such as Wi-Fi.

Transparent Inter Process Communication (TIPC) is an Inter-process communication (IPC) service designed for cluster wide operation. It is sometimes presented as Cluster Domain Sockets, in contrast to the well-known Unix Domain Socket service; the latter working only on a single kernel.

JGroups is a library for reliable one-to-one or one-to-many communication written in the Java language.

Virtual synchrony is an interprocess message passing technology. Virtual synchrony systems allow programs running in a network to organize themselves into process groups, and to send messages to groups. Each message is delivered to all the group members, in the identical order, and this is true even when two messages are transmitted simultaneously by different senders. Application design and implementation is greatly simplified by this property: every group member sees the same events in the same order.

Live distributed object

Live distributed object refers to a running instance of a distributed multi-party protocol, viewed from the object-oriented perspective, as an entity that has a distinct identity, may encapsulate internal state and threads of execution, and that exhibits a well-defined externally visible behavior.

Gbcast is a reliable multicast protocol that provides ordered, fault-tolerant (all-or-none) message delivery in a group of receivers within a network of machines that experience crash failure. The protocol is capable of solving Consensus in a network of unreliable processors, and can be used to implement state machine replication. Gbcast can be used in a standalone manner, or can support the virtual synchrony execution model, in which case Gbcast is normally used for group membership management while other, faster, protocols are often favored for routine communication tasks.

Jem The Bee

JEM, the BEE is a Java, cloud-aware application which implements a Batch Execution Environment, to help and manage the execution of jobs, described by a Job Control Language (JCL). JEM, the BEE performs the following functions:

Ken Birman is a professor in the Department of Computer Science at Cornell University.

References

  1. Ostrowski, K., Birman, K., Dolev, D., and Sakoda, C. (2009). "Implementing Reliable Event Streams in Large Systems via Distributed Data Flows and Recursive Delegation", 3rd ACM International Conference on Distributed Event-Based Systems (DEBS 2009), Nashville, TN, USA, July 6–9, 2009, http://www.cs.cornell.edu/~krzys/krzys_debs2009.pdf
  2. Ostrowski, K., Birman, K., and Dolev, D. (2009). "Distributed Data Flow Language for Multi-Party Protocols", 5th ACM SIGOPS Workshop on Programming Languages and Operating Systems (PLOS 2009), Big Sky, MT, USA. October 11, 2009, http://www.cs.cornell.edu/~krzys/krzys_plos2009.pdf
  3. Ostrowski, K., Birman, K., Dolev, D. (2009). "Programming Live Distributed Objects with Distributed Data Flows", Submitted to the International Conference on Object Oriented Programming, Systems, Languages and Applications (OOPSLA 2009), http://www.cs.cornell.edu/~krzys/krzys_oopsla2009.pdf