Distributed computing

Last updated

A distributed system is a system whose components are located on different networked computers, which communicate and coordinate their actions by passing messages to one another. [1] [2] Distributed computing is a field of computer science that studies distributed systems.

Contents

The components of a distributed system interact with one another in order to achieve a common goal. Three significant challenges of distributed systems are: maintaining concurrency of components, overcoming the lack of a global clock, and managing the independent failure of components. [1] When a component of one system fails, the entire system does not fail. [3] Examples of distributed systems vary from SOA-based systems to massively multiplayer online games to peer-to-peer applications.

A computer program that runs within a distributed system is called a distributed program, [4] and distributed programming is the process of writing such programs. [5] There are many different types of implementations for the message passing mechanism, including pure HTTP, RPC-like connectors and message queues. [6]

Distributed computing also refers to the use of distributed systems to solve computational problems. In distributed computing, a problem is divided into many tasks, each of which is solved by one or more computers, [7] which communicate with each other via message passing. [8]

Introduction

The word distributed in terms such as "distributed system", "distributed programming", and "distributed algorithm" originally referred to computer networks where individual computers were physically distributed within some geographical area. [9] The terms are nowadays used in a much wider sense, even referring to autonomous processes that run on the same physical computer and interact with each other by message passing. [8]

While there is no single definition of a distributed system, [10] the following defining properties are commonly used as:

A distributed system may have a common goal, such as solving a large computational problem; [13] the user then perceives the collection of autonomous processors as a unit. Alternatively, each computer may have its own user with individual needs, and the purpose of the distributed system is to coordinate the use of shared resources or provide communication services to the users. [14]

Other typical properties of distributed systems include the following:

Parallel and distributed computing

(a), (b): a distributed system.
(c): a parallel system. Distributed-parallel.svg
(a), (b): a distributed system.
(c): a parallel system.

Distributed systems are groups of networked computers which share a common goal for their work. The terms "concurrent computing", "parallel computing", and "distributed computing" have much overlap, and no clear distinction exists between them. [18] The same system may be characterized both as "parallel" and "distributed"; the processors in a typical distributed system run concurrently in parallel. [19] Parallel computing may be seen as a particularly tightly coupled form of distributed computing, [20] and distributed computing may be seen as a loosely coupled form of parallel computing. [10] Nevertheless, it is possible to roughly classify concurrent systems as "parallel" or "distributed" using the following criteria:

The figure on the right illustrates the difference between distributed and parallel systems. Figure (a) is a schematic view of a typical distributed system; the system is represented as a network topology in which each node is a computer and each line connecting the nodes is a communication link. Figure (b) shows the same distributed system in more detail: each computer has its own local memory, and information can be exchanged only by passing messages from one node to another by using the available communication links. Figure (c) shows a parallel system in which each processor has a direct access to a shared memory.

The situation is further complicated by the traditional uses of the terms parallel and distributed algorithm that do not quite match the above definitions of parallel and distributed systems (see below for more detailed discussion). Nevertheless, as a rule of thumb, high-performance parallel computation in a shared-memory multiprocessor uses parallel algorithms while the coordination of a large-scale distributed system uses distributed algorithms. [23]

History

The use of concurrent processes which communicate through message-passing has its roots in operating system architectures studied in the 1960s. [24] The first widespread distributed systems were local-area networks such as Ethernet, which was invented in the 1970s. [25]

ARPANET, one of the predecessors of the Internet, was introduced in the late 1960s, and ARPANET e-mail was invented in the early 1970s. E-mail became the most successful application of ARPANET, [26] and it is probably the earliest example of a large-scale distributed application. In addition to ARPANET (and its successor, the global Internet), other early worldwide computer networks included Usenet and FidoNet from the 1980s, both of which were used to support distributed discussion systems. [27]

The study of distributed computing became its own branch of computer science in the late 1970s and early 1980s. The first conference in the field, Symposium on Principles of Distributed Computing (PODC), dates back to 1982, and its counterpart International Symposium on Distributed Computing (DISC) was first held in Ottawa in 1985 as the International Workshop on Distributed Algorithms on Graphs. [28]

Architectures

Various hardware and software architectures are used for distributed computing. At a lower level, it is necessary to interconnect multiple CPUs with some sort of network, regardless of whether that network is printed onto a circuit board or made up of loosely coupled devices and cables. At a higher level, it is necessary to interconnect processes running on those CPUs with some sort of communication system. [29]

Whether these CPUs share resources or not determines a first distinction between three types of architecture:

Distributed programming typically falls into one of several basic architectures: client–server, three-tier, n-tier, or peer-to-peer; or categories: loose coupling, or tight coupling. [30]

Another basic aspect of distributed computing architecture is the method of communicating and coordinating work among concurrent processes. Through various message passing protocols, processes may communicate directly with one another, typically in a main/sub relationship. Alternatively, a "database-centric" architecture can enable distributed computing to be done without any form of direct inter-process communication, by utilizing a shared database. [33] Database-centric architecture in particular provides relational processing analytics in a schematic architecture allowing for live environment relay. This enables distributed computing functions both within and beyond the parameters of a networked database. [34]

Applications

Reasons for using distributed systems and distributed computing may include:

Examples

Examples of distributed systems and applications of distributed computing include the following: [36]

Theoretical foundations

Models

Many tasks that we would like to automate by using a computer are of question–answer type: we would like to ask a question and the computer should produce an answer. In theoretical computer science, such tasks are called computational problems. Formally, a computational problem consists of instances together with a solution for each instance. Instances are questions that we can ask, and solutions are desired answers to these questions.

Theoretical computer science seeks to understand which computational problems can be solved by using a computer (computability theory) and how efficiently (computational complexity theory). Traditionally, it is said that a problem can be solved by using a computer if we can design an algorithm that produces a correct solution for any given instance. Such an algorithm can be implemented as a computer program that runs on a general-purpose computer: the program reads a problem instance from input, performs some computation, and produces the solution as output. Formalisms such as random-access machines or universal Turing machines can be used as abstract models of a sequential general-purpose computer executing such an algorithm. [38] [39]

The field of concurrent and distributed computing studies similar questions in the case of either multiple computers, or a computer that executes a network of interacting processes: which computational problems can be solved in such a network and how efficiently? However, it is not at all obvious what is meant by "solving a problem" in the case of a concurrent or distributed system: for example, what is the task of the algorithm designer, and what is the concurrent or distributed equivalent of a sequential general-purpose computer?[ citation needed ]

The discussion below focuses on the case of multiple computers, although many of the issues are the same for concurrent processes running on a single computer.

Three viewpoints are commonly used:

Parallel algorithms in shared-memory model
Parallel algorithms in message-passing model
Distributed algorithms in message-passing model

In the case of distributed algorithms, computational problems are typically related to graphs. Often the graph that describes the structure of the computer network is the problem instance. This is illustrated in the following example. [44]

An example

Consider the computational problem of finding a coloring of a given graph G. Different fields might take the following approaches:

Centralized algorithms [44]
Parallel algorithms
Distributed algorithms

While the field of parallel algorithms has a different focus than the field of distributed algorithms, there is much interaction between the two fields. For example, the Cole–Vishkin algorithm for graph coloring [45] was originally presented as a parallel algorithm, but the same technique can also be used directly as a distributed algorithm.

Moreover, a parallel algorithm can be implemented either in a parallel system (using shared memory) or in a distributed system (using message passing). [46] The traditional boundary between parallel and distributed algorithms (choose a suitable network vs. run in any given network) does not lie in the same place as the boundary between parallel and distributed systems (shared memory vs. message passing).

Complexity measures

In parallel algorithms, yet another resource in addition to time and space is the number of computers. Indeed, often there is a trade-off between the running time and the number of computers: the problem can be solved faster if there are more computers running in parallel (see speedup). If a decision problem can be solved in polylogarithmic time by using a polynomial number of processors, then the problem is said to be in the class NC. [47] The class NC can be defined equally well by using the PRAM formalism or Boolean circuits—PRAM machines can simulate Boolean circuits efficiently and vice versa. [48]

In the analysis of distributed algorithms, more attention is usually paid on communication operations than computational steps. Perhaps the simplest model of distributed computing is a synchronous system where all nodes operate in a lockstep fashion. This model is commonly known as the LOCAL model. During each communication round, all nodes in parallel (1) receive the latest messages from their neighbours, (2) perform arbitrary local computation, and (3) send new messages to their neighbors. In such systems, a central complexity measure is the number of synchronous communication rounds required to complete the task. [49]

This complexity measure is closely related to the diameter of the network. Let D be the diameter of the network. On the one hand, any computable problem can be solved trivially in a synchronous distributed system in approximately 2D communication rounds: simply gather all information in one location (D rounds), solve the problem, and inform each node about the solution (D rounds).

On the other hand, if the running time of the algorithm is much smaller than D communication rounds, then the nodes in the network must produce their output without having the possibility to obtain information about distant parts of the network. In other words, the nodes must make globally consistent decisions based on information that is available in their local D-neighbourhood. Many distributed algorithms are known with the running time much smaller than D rounds, and understanding which problems can be solved by such algorithms is one of the central research questions of the field. [50] Typically an algorithm which solves a problem in polylogarithmic time in the network size is considered efficient in this model.

Another commonly used measure is the total number of bits transmitted in the network (cf. communication complexity). [51] The features of this concept are typically captured with the CONGEST(B) model, which is similarly defined as the LOCAL model, but where single messages can only contain B bits.

Other problems

Traditional computational problems take the perspective that the user asks a question, a computer (or a distributed system) processes the question, then produces an answer and stops. However, there are also problems where the system is required not to stop, including the dining philosophers problem and other similar mutual exclusion problems. In these problems, the distributed system is supposed to continuously coordinate the use of shared resources so that no conflicts or deadlocks occur.

There are also fundamental challenges that are unique to distributed computing, for example those related to fault-tolerance. Examples of related problems include consensus problems, [52] Byzantine fault tolerance, [53] and self-stabilisation. [54]

Much research is also focused on understanding the asynchronous nature of distributed systems:

Election

Coordinator election (or leader election) is the process of designating a single process as the organizer of some task distributed among several computers (nodes). Before the task is begun, all network nodes are either unaware which node will serve as the "coordinator" (or leader) of the task, or unable to communicate with the current coordinator. After a coordinator election algorithm has been run, however, each node throughout the network recognizes a particular, unique node as the task coordinator. [58]

The network nodes communicate among themselves in order to decide which of them will get into the "coordinator" state. For that, they need some method in order to break the symmetry among them. For example, if each node has unique and comparable identities, then the nodes can compare their identities, and decide that the node with the highest identity is the coordinator. [58]

The definition of this problem is often attributed to LeLann, who formalized it as a method to create a new token in a token ring network in which the token has been lost. [59]

Coordinator election algorithms are designed to be economical in terms of total bytes transmitted, and time. The algorithm suggested by Gallager, Humblet, and Spira [60] for general undirected graphs has had a strong impact on the design of distributed algorithms in general, and won the Dijkstra Prize for an influential paper in distributed computing.

Many other algorithms were suggested for different kinds of network graphs, such as undirected rings, unidirectional rings, complete graphs, grids, directed Euler graphs, and others. A general method that decouples the issue of the graph family from the design of the coordinator election algorithm was suggested by Korach, Kutten, and Moran. [61]

In order to perform coordination, distributed systems employ the concept of coordinators. The coordinator election problem is to choose a process from among a group of processes on different processors in a distributed system to act as the central coordinator. Several central coordinator election algorithms exist. [62]

Properties of distributed systems

So far the focus has been on designing a distributed system that solves a given problem. A complementary research problem is studying the properties of a given distributed system. [63] [64]

The halting problem is an analogous example from the field of centralised computation: we are given a computer program and the task is to decide whether it halts or runs forever. The halting problem is undecidable in the general case, and naturally understanding the behaviour of a computer network is at least as hard as understanding the behaviour of one computer. [65]

However, there are many interesting special cases that are decidable. In particular, it is possible to reason about the behaviour of a network of finite-state machines. One example is telling whether a given network of interacting (asynchronous and non-deterministic) finite-state machines can reach a deadlock. This problem is PSPACE-complete, [66] i.e., it is decidable, but not likely that there is an efficient (centralised, parallel or distributed) algorithm that solves the problem in the case of large networks.

See also

Notes

  1. 1 2 Tanenbaum, Andrew S.; Steen, Maarten van (2002). Distributed systems: principles and paradigms. Upper Saddle River, NJ: Pearson Prentice Hall. ISBN   0-13-088893-1. Archived from the original on 2020-08-12. Retrieved 2020-08-28.
  2. "Distributed Programs". Texts in Computer Science. London: Springer London. 2010. pp. 373–406. doi:10.1007/978-1-84882-745-5_11. ISBN   978-1-84882-744-8. ISSN   1868-0941. Systems consist of a number of physically distributed components that work independently using their private storage, but also communicate from time to time by explicit message passing. Such systems are called distributed systems.
  3. Dusseau & Dusseau 2016, p. 1–2.
  4. "Distributed Programs". Texts in Computer Science. London: Springer London. 2010. pp. 373–406. doi:10.1007/978-1-84882-745-5_11. ISBN   978-1-84882-744-8. ISSN   1868-0941. Distributed programs are abstract descriptions of distributed systems. A distributed program consists of a collection of processes that work concurrently and communicate by explicit message passing. Each process can access a set of variables which are disjoint from the variables that can be changed by any other process.
  5. Andrews (2000). Dolev (2000). Ghosh (2007), p. 10.
  6. Magnoni, L. (2015). "Modern Messaging for Distributed Sytems (sic)". Journal of Physics: Conference Series. 608 (1): 012038. doi: 10.1088/1742-6596/608/1/012038 . ISSN   1742-6596.
  7. Godfrey (2002).
  8. 1 2 Andrews (2000), p. 291–292. Dolev (2000), p. 5.
  9. Lynch (1996), p. 1.
  10. 1 2 Ghosh (2007), p. 10.
  11. Andrews (2000), pp. 8–9, 291. Dolev (2000), p. 5. Ghosh (2007), p. 3. Lynch (1996), p. xix, 1. Peleg (2000), p. xv.
  12. Andrews (2000), p. 291. Ghosh (2007), p. 3. Peleg (2000), p. 4.
  13. Ghosh (2007), p. 3–4. Peleg (2000), p. 1.
  14. Ghosh (2007), p. 4. Peleg (2000), p. 2.
  15. Ghosh (2007), p. 4, 8. Lynch (1996), p. 2–3. Peleg (2000), p. 4.
  16. Lynch (1996), p. 2. Peleg (2000), p. 1.
  17. Ghosh (2007), p. 7. Lynch (1996), p. xix, 2. Peleg (2000), p. 4.
  18. Ghosh (2007), p. 10. Keidar (2008).
  19. Lynch (1996), p. xix, 1–2. Peleg (2000), p. 1.
  20. Peleg (2000), p. 1.
  21. Papadimitriou (1994), Chapter 15. Keidar (2008).
  22. See references in Introduction.
  23. Bentaleb, A.; Yifan, L.; Xin, J.; et al. (2016). "Parallel and Distributed Algorithms" (PDF). National University of Singapore. Archived (PDF) from the original on 2017-03-26. Retrieved 20 July 2018.
  24. Andrews (2000), p. 348.
  25. Andrews (2000), p. 32.
  26. Peter (2004), The history of email Archived 2009-04-15 at the Wayback Machine .
  27. Banks, M. (2012). On the Way to the Web: The Secret History of the Internet and its Founders. Apress. pp. 44–5. ISBN   9781430250746. Archived from the original on 2023-01-20. Retrieved 2018-07-20.
  28. Tel, G. (2000). Introduction to Distributed Algorithms. Cambridge University Press. pp. 35–36. ISBN   9780521794831. Archived from the original on 2023-01-20. Retrieved 2018-07-20.
  29. Ohlídal, M.; Jaroš, J.; Schwarz, J.; et al. (2006). "Evolutionary Design of OAB and AAB Communication Schedules for Interconnection Networks". In Rothlauf, F.; Branke, J.; Cagnoni, S. (eds.). Applications of Evolutionary Computing. Springer Science & Business Media. pp. 267–78. ISBN   9783540332374.
  30. "Real Time And Distributed Computing Systems" (PDF). ISSN   2278-0661. Archived from the original (PDF) on 2017-01-10. Retrieved 2017-01-09.{{cite journal}}: Cite journal requires |journal= (help)
  31. Vigna P, Casey MJ. The Age of Cryptocurrency: How Bitcoin and the Blockchain Are Challenging the Global Economic Order St. Martin's Press January 27, 2015 ISBN   9781250065636
  32. Quang Hieu Vu; Mihai Lupu; Beng Chin Ooi (2010). Peer-to-peer computing : principles and applications. Heidelberg: Springer. p. 16. ISBN   9783642035135. OCLC   663093862.
  33. Lind P, Alm M (2006), "A database-centric virtual chemistry system", J Chem Inf Model, 46 (3): 1034–9, doi:10.1021/ci050360b, PMID   16711722.
  34. Chiu, G (1990). "A model for optimal database allocation in distributed computing systems". Proceedings. IEEE INFOCOM'90: Ninth Annual Joint Conference of the IEEE Computer and Communications Societies.
  35. Elmasri & Navathe (2000), Section 24.1.2.
  36. Andrews (2000), p. 10–11. Ghosh (2007), p. 4–6. Lynch (1996), p. xix, 1. Peleg (2000), p. xv. Elmasri & Navathe (2000), Section 24.
  37. Haussmann, J. (2019). "Cost-efficient parallel processing of irregularly structured problems in cloud computing environments". Journal of Cluster Computing. 22 (3): 887–909. doi:10.1007/s10586-018-2879-3. S2CID   54447518.
  38. Toomarian, N.B.; Barhen, J.; Gulati, S. (1992). "Neural Networks for Real-Time Robotic Applications". In Fijany, A.; Bejczy, A. (eds.). Parallel Computation Systems For Robotics: Algorithms And Architectures. World Scientific. p. 214. ISBN   9789814506175. Archived from the original on 2020-08-01. Retrieved 2018-07-20.
  39. Savage, J.E. (1998). Models of Computation: Exploring the Power of Computing. Addison Wesley. p. 209. ISBN   9780201895391.
  40. Cormen, Leiserson & Rivest (1990), Section 30.
  41. Herlihy & Shavit (2008), Chapters 2–6.
  42. Lynch (1996)
  43. Cormen, Leiserson & Rivest (1990), Sections 28 and 29.
  44. 1 2 3 TULSIRAMJI GAIKWAD-PATIL College of Engineering & Technology, Nagpur Department of Information Technology Introduction to Distributed Systems
  45. Cole & Vishkin (1986). Cormen, Leiserson & Rivest (1990), Section 30.5.
  46. Andrews (2000), p. ix.
  47. Arora & Barak (2009), Section 6.7. Papadimitriou (1994), Section 15.3.
  48. Papadimitriou (1994), Section 15.2.
  49. Lynch (1996), p. 17–23.
  50. Peleg (2000), Sections 2.3 and 7. Linial (1992). Naor & Stockmeyer (1995).
  51. Schneider, J.; Wattenhofer, R. (2011). "Trading Bit, Message, and Time Complexity of Distributed Algorithms". In Peleg, D. (ed.). Distributed Computing. Springer Science & Business Media. pp. 51–65. ISBN   9783642240997. Archived from the original on 2020-08-01. Retrieved 2018-07-20.
  52. Lynch (1996), Sections 5–7. Ghosh (2007), Chapter 13.
  53. Lynch (1996), p. 99–102. Ghosh (2007), p. 192–193.
  54. Dolev (2000). Ghosh (2007), Chapter 17.
  55. Lynch (1996), Section 16. Peleg (2000), Section 6.
  56. Lynch (1996), Section 18. Ghosh (2007), Sections 6.2–6.3.
  57. Ghosh (2007), Section 6.4.
  58. 1 2 Haloi, S. (2015). Apache ZooKeeper Essentials. Packt Publishing Ltd. pp. 100–101. ISBN   9781784398323. Archived from the original on 2023-01-20. Retrieved 2018-07-20.
  59. LeLann, G. (1977). "Distributed systems - toward a formal approach". Information Processing. 77: 155·160 via Elsevier.
  60. R. G. Gallager, P. A. Humblet, and P. M. Spira (January 1983). "A Distributed Algorithm for Minimum-Weight Spanning Trees" (PDF). ACM Transactions on Programming Languages and Systems. 5 (1): 66–77. doi:10.1145/357195.357200. S2CID   2758285. Archived (PDF) from the original on 2017-09-26.{{cite journal}}: CS1 maint: multiple names: authors list (link)
  61. Korach, Ephraim; Kutten, Shay; Moran, Shlomo (1990). "A Modular Technique for the Design of Efficient Distributed Leader Finding Algorithms" (PDF). ACM Transactions on Programming Languages and Systems. 12 (1): 84–101. CiteSeerX   10.1.1.139.7342 . doi:10.1145/77606.77610. S2CID   9175968. Archived (PDF) from the original on 2007-04-18.
  62. Hamilton, Howard. "Distributed Algorithms". Archived from the original on 2012-11-24. Retrieved 2013-03-03.
  63. "Major unsolved problems in distributed systems?". cstheory.stackexchange.com. Archived from the original on 20 January 2023. Retrieved 16 March 2018.
  64. "How big data and distributed systems solve traditional scalability problems". theserverside.com. Archived from the original on 17 March 2018. Retrieved 16 March 2018.
  65. Svozil, K. (2011). "Indeterminism and Randomness Through Physics". In Hector, Z. (ed.). Randomness Through Computation: Some Answers, More Questions. World Scientific. pp. 112–3. ISBN   9789814462631. Archived from the original on 2020-08-01. Retrieved 2018-07-20.
  66. Papadimitriou (1994), Section 19.3.

Related Research Articles

<span class="mw-page-title-main">Load balancing (computing)</span> Set of techniques to improve the distribution of workloads across multiple computing resources

In computing, load balancing is the process of distributing a set of tasks over a set of resources, with the aim of making their overall processing more efficient. Load balancing can optimize the response time and avoid unevenly overloading some compute nodes while other compute nodes are left idle.

<span class="mw-page-title-main">Parallel computing</span> Programming paradigm in which many processes are executed simultaneously

Parallel computing is a type of computation in which many calculations or processes are carried out simultaneously. Large problems can often be divided into smaller ones, which can then be solved at the same time. There are several different forms of parallel computing: bit-level, instruction-level, data, and task parallelism. Parallelism has long been employed in high-performance computing, but has gained broader interest due to the physical constraints preventing frequency scaling. As power consumption by computers has become a concern in recent years, parallel computing has become the dominant paradigm in computer architecture, mainly in the form of multi-core processors.

Computer science is the study of the theoretical foundations of information and computation and their implementation and application in computer systems. One well known subject classification system for computer science is the ACM Computing Classification System devised by the Association for Computing Machinery.

<span class="mw-page-title-main">Distributed memory</span> Multiprocessing memory architecture

In computer science, distributed memory refers to a multiprocessor computer system in which each processor has its own private memory. Computational tasks can only operate on local data, and if remote data are required, the computational task must communicate with one or more remote processors. In contrast, a shared memory multiprocessor offers a single memory space used by all processors. Processors do not have to be aware where data resides, except that there may be performance penalties, and that race conditions are to be avoided.

<span class="mw-page-title-main">Theoretical computer science</span> Subfield of computer science and mathematics

Theoretical computer science (TCS) is a subset of general computer science and mathematics that focuses on mathematical aspects of computer science such as the theory of computation, formal language theory, the lambda calculus and type theory.

In computer science, a topological sort or topological ordering of a directed graph is a linear ordering of its vertices such that for every directed edge (u,v) from vertex u to vertex v, u comes before v in the ordering. For instance, the vertices of the graph may represent tasks to be performed, and the edges may represent constraints that one task must be performed before another; in this application, a topological ordering is just a valid sequence for the tasks. Precisely, a topological sort is a graph traversal in which each node v is visited only after all its dependencies are visited. A topological ordering is possible if and only if the graph has no directed cycles, that is, if it is a directed acyclic graph (DAG). Any DAG has at least one topological ordering, and algorithms are known for constructing a topological ordering of any DAG in linear time. Topological sorting has many applications, especially in ranking problems such as feedback arc set. Topological sorting is possible even when the DAG has disconnected components.

<span class="mw-page-title-main">Concurrency (computer science)</span> Ability to execute a task in a non-serial manner

In computer science, concurrency is the ability of different parts or units of a program, algorithm, or problem to be executed out-of-order or in partial order, without affecting the outcome. This allows for parallel execution of the concurrent units, which can significantly improve overall speed of the execution in multi-processor and multi-core systems. In more technical terms, concurrency refers to the decomposability of a program, algorithm, or problem into order-independent or partially-ordered components or units of computation.

Concurrent computing is a form of computing in which several computations are executed concurrently—during overlapping time periods—instead of sequentially—with one completing before the next starts.

<span class="mw-page-title-main">Leslie Valiant</span> British American computer scientist

Leslie Gabriel Valiant is a British American computer scientist and computational theorist. He was born to a chemical engineer father and a translator mother. He is currently the T. Jefferson Coolidge Professor of Computer Science and Applied Mathematics at Harvard University. Valiant was awarded the Turing Award in 2010, having been described by the A.C.M. as a heroic figure in theoretical computer science and a role model for his courage and creativity in addressing some of the deepest unsolved problems in science; in particular for his "striking combination of depth and breadth".

A distributed algorithm is an algorithm designed to run on computer hardware constructed from interconnected processors. Distributed algorithms are used in different application areas of distributed computing, such as telecommunications, scientific computing, distributed information processing, and real-time process control. Standard problems solved by distributed algorithms include leader election, consensus, distributed search, spanning tree generation, mutual exclusion, and resource allocation.

The Dijkstra–Scholten algorithm is an algorithm for detecting termination in a distributed system. The algorithm was proposed by Dijkstra and Scholten in 1980.

A fundamental problem in distributed computing and multi-agent systems is to achieve overall system reliability in the presence of a number of faulty processes. This often requires coordinating processes to reach consensus, or agree on some data value that is needed during computation. Example applications of consensus include agreeing on what transactions to commit to a database in which order, state machine replication, and atomic broadcasts. Real-world applications often requiring consensus include cloud computing, clock synchronization, PageRank, opinion formation, smart power grids, state estimation, control of UAVs, load balancing, blockchain, and others.

<span class="mw-page-title-main">Distributed minimum spanning tree</span>

The distributed minimum spanning tree (MST) problem involves the construction of a minimum spanning tree by a distributed algorithm, in a network where nodes communicate by message passing. It is radically different from the classical sequential problem, although the most basic approach resembles Borůvka's algorithm. One important application of this problem is to find a tree that can be used for broadcasting. In particular, if the cost for a message to pass through an edge in a graph is significant, an MST can minimize the total cost for a source process to communicate with all the other processes in the network.

The bulk synchronous parallel (BSP) abstract computer is a bridging model for designing parallel algorithms. It is similar to the parallel random access machine (PRAM) model, but unlike PRAM, BSP does not take communication and synchronization for granted. In fact, quantifying the requisite synchronization and communication is an important part of analyzing a BSP algorithm.

In distributed computing, leader election is the process of designating a single process as the organizer of some task distributed among several computers (nodes). Before the task has begun, all network nodes are either unaware which node will serve as the "leader" of the task, or unable to communicate with the current coordinator. After a leader election algorithm has been run, however, each node throughout the network recognizes a particular, unique node as the task leader.

In mathematics, a graph partition is the reduction of a graph to a smaller graph by partitioning its set of nodes into mutually exclusive groups. Edges of the original graph that cross between the groups will produce edges in the partitioned graph. If the number of resulting edges is small compared to the original graph, then the partitioned graph may be better suited for analysis and problem-solving than the original. Finding a partition that simplifies graph analysis is a hard problem, but one that has applications to scientific computing, VLSI circuit design, and task scheduling in multiprocessor computers, among others. Recently, the graph partition problem has gained importance due to its application for clustering and detection of cliques in social, pathological and biological networks. For a survey on recent trends in computational methods and applications see Buluc et al. (2013). Two common examples of graph partitioning are minimum cut and maximum cut problems.

<span class="mw-page-title-main">Computer cluster</span> Set of computers configured in a distributed computing system

A computer cluster is a set of computers that work together so that they can be viewed as a single system. Unlike grid computers, computer clusters have each node set to perform the same task, controlled and scheduled by software. The newest manifestation of cluster computing is cloud computing.

Lateral computing is a lateral thinking approach to solving computing problems. Lateral thinking has been made popular by Edward de Bono. This thinking technique is applied to generate creative ideas and solve problems. Similarly, by applying lateral-computing techniques to a problem, it can become much easier to arrive at a computationally inexpensive, easy to implement, efficient, innovative or unconventional solution.

Data-intensive computing is a class of parallel computing applications which use a data parallel approach to process large volumes of data typically terabytes or petabytes in size and typically referred to as big data. Computing applications that devote most of their execution time to computational requirements are deemed compute-intensive, whereas applications are deemed data-intensive require large volumes of data and devote most of their processing time to I/O and manipulation of data.

Pointer jumping or path doubling is a design technique for parallel algorithms that operate on pointer structures, such as linked lists and directed graphs. Pointer jumping allows an algorithm to follow paths with a time complexity that is logarithmic with respect to the length of the longest path. It does this by "jumping" to the end of the path computed by neighbors.

References

Books
Articles
Web sites

Further reading

Books
Articles
Conference Papers