Parallel RAM

In computer science, a parallel random-access machine (parallel RAM or PRAM) is a shared-memory abstract machine. As its name indicates, the PRAM is intended as the parallel-computing analogy to the random-access machine (RAM) (not to be confused with random-access memory). In the same way that the RAM is used by sequential-algorithm designers to model algorithmic performance (such as time complexity), the PRAM is used by parallel-algorithm designers to model parallel algorithmic performance (such as time complexity, where the number of processors assumed is typically also stated). Similar to the way in which the RAM model neglects practical issues, such as access time to cache memory versus main memory, the PRAM model neglects issues such as synchronization and communication, but provides any (problem-size-dependent) number of processors. Algorithm cost, for instance, is estimated using two parameters: O(time) and O(time × processor_number).
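For example, summing n numbers by repeated pairwise addition takes O(log n) time on a PRAM with n/2 processors, giving a cost of O(n log n) under the time × processor_number measure. The following C sketch (the array contents and size are arbitrary illustrative values) simulates this serially: each outer-loop iteration corresponds to one PRAM time step, and each inner-loop iteration stands in for one active processor.

#include <stdio.h>

int main(void) {
    int a[8] = {3, 1, 4, 1, 5, 9, 2, 6};   /* illustrative input, n = 8 */
    int n = 8;
    int steps = 0;

    /* One iteration of the outer loop is one PRAM time step; on a real PRAM
       the inner loop's iterations would run on separate processors
       concurrently, so only O(log n) steps are needed. */
    for (int stride = 1; stride < n; stride *= 2) {
        for (int i = 0; i + stride < n; i += 2 * stride) {
            a[i] += a[i + stride];          /* work of one processor */
        }
        steps++;
    }

    printf("sum = %d after %d parallel steps\n", a[0], steps);
    return 0;
}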

Read/write conflicts

Read/write conflicts, commonly termed interlocking, arise when multiple processors access the same shared memory location simultaneously. They are resolved by one of the following strategies:

  1. Exclusive read exclusive write (EREW)—every memory cell can be read or written to by only one processor at a time
  2. Concurrent read exclusive write (CREW)—multiple processors can read a memory cell but only one can write at a time
  3. Exclusive read concurrent write (ERCW)—never considered
  4. Concurrent read concurrent write (CRCW)—multiple processors can read and write. A CRCW PRAM is sometimes called a concurrent random-access machine.[1]

Here, E and C stand for 'exclusive' and 'concurrent' respectively. A concurrent read causes no discrepancies, but a concurrent write must be further specified by one of the following rules (a sketch of one rule in use follows the list):

Common—all processors write the same value; otherwise the write is illegal
Arbitrary—only one arbitrary attempt succeeds, the others retire
Priority—processor rank indicates who gets to write
Reduction—a reduction operation such as SUM, logical AND, or MAX is applied to combine the values being written
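
As a minimal sketch of one of these rules, the C fragment below simulates a single CRCW time step under the Priority rule: every participating processor requests a write to the same cell, and the lowest-numbered (highest-priority) request takes effect. The processor count, requested values, and participation flags are arbitrary illustrative choices.

#include <stdio.h>

#define P 4   /* number of processors attempting a write (illustrative) */

int main(void) {
    int value[P] = {5, 7, 7, 9};    /* value each processor wants to write */
    int wants[P] = {0, 1, 1, 1};    /* which processors participate in this step */
    int cell = -1;                  /* the shared memory cell being written */

    /* Priority rule: scan processors in rank order; the first participating
       (i.e. highest-priority) processor's write takes effect.  Under the
       Common rule one would instead check that all participating values agree. */
    for (int p = 0; p < P; p++) {
        if (wants[p]) {
            cell = value[p];
            break;
        }
    }

    printf("cell = %d\n", cell);
    return 0;
}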

Several simplifying assumptions are made when developing algorithms for the PRAM:

  1. There is no limit on the number of processors in the machine.
  2. Any memory location is uniformly accessible from any processor.
  3. There is no limit on the amount of shared memory in the system.
  4. Resource contention is absent.
  5. The programs written on these machines are, in general, of type SIMD.
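
As a minimal illustration of the last point, the SIMD-style PRAM statement "for all i in parallel do a[i] := b[i] + c[i]" is a single time step on the abstract machine. The C sketch below (with arbitrary illustrative data) simulates that step serially, each loop iteration standing in for one processor.

#include <stdio.h>

#define N 8   /* one conceptual processor per element (illustrative) */

int main(void) {
    int b[N] = {1, 2, 3, 4, 5, 6, 7, 8};
    int c[N] = {8, 7, 6, 5, 4, 3, 2, 1};
    int a[N];

    /* "for all i in parallel do a[i] := b[i] + c[i]" is one PRAM time step;
       each iteration here stands in for the processor responsible for index i. */
    for (int i = 0; i < N; i++)
        a[i] = b[i] + c[i];

    for (int i = 0; i < N; i++)
        printf("%d ", a[i]);
    printf("\n");
    return 0;
}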

These kinds of algorithms are useful for understanding the exploitation of concurrency by dividing the original problem into similar sub-problems and solving them in parallel. The formal 'P-RAM' model was introduced in Wyllie's 1979 thesis [2] with the aim of quantifying the analysis of parallel algorithms in a way analogous to the Turing machine. The analysis focused on a MIMD model of programming using a CREW model, but showed that many variants, including implementing a CRCW model and implementing on an SIMD machine, were possible with only constant overhead.

Implementation

PRAM algorithms cannot be parallelized with the combination of CPU and dynamic random-access memory (DRAM), because DRAM does not allow concurrent access to a single bank (not even to different addresses in the bank). They can, however, be implemented in hardware or mapped onto the internal static random-access memory (SRAM) blocks of a field-programmable gate array (FPGA); this can be done using a CRCW algorithm.

However, the practical relevance of PRAM (or RAM) algorithms depends on whether their cost model provides an effective abstraction of some computer; the structure of that computer can be quite different from the abstract model. The layers of software and hardware that need to be inserted between the abstraction and the physical machine are beyond the scope of this article. But articles such as Vishkin (2011) demonstrate how a PRAM-like abstraction can be supported by the explicit multi-threading (XMT) paradigm, and articles such as Caragea & Vishkin (2011) demonstrate that a PRAM algorithm for the maximum flow problem can provide strong speedups relative to the fastest serial program for the same problem. Ghanim, Vishkin & Barua (2018) demonstrated that PRAM algorithms as-is can achieve competitive performance without any additional effort to cast them as multi-threaded programs on XMT.

Example code

This is an example of SystemVerilog code which finds the maximum value in an array in only two clock cycles. It compares all combinations of the array elements on the first clock cycle and merges the results on the second clock cycle. It uses CRCW memory: m[i] <= 1 and maxNo <= data[i] are written concurrently. The concurrency causes no conflicts because the algorithm guarantees that all processors write the same value to the same memory location. This code can be run on FPGA hardware.

module FindMax #(parameter int len = 8)
                (input bit clock, resetN,
                 input bit [7:0] data[len],
                 output bit [7:0] maxNo);

    typedef enum bit [1:0] {COMPARE, MERGE, DONE} State;
    State state;
    bit m[len];
    int i, j;

    always_ff @(posedge clock, negedge resetN) begin
        if (!resetN) begin
            for (i = 0; i < len; i++) m[i] <= 0;
            state <= COMPARE;
        end else begin
            case (state)
                COMPARE: begin
                    for (i = 0; i < len; i++) begin
                        for (j = 0; j < len; j++) begin
                            if (data[i] < data[j]) m[i] <= 1;
                        end
                    end
                    state <= MERGE;
                end

                MERGE: begin
                    for (i = 0; i < len; i++) begin
                        if (m[i] == 0) maxNo <= data[i];
                    end
                    state <= DONE;
                end
            endcase
        end
    end
endmodule
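
For comparison, the following C sketch expresses the same two-phase idea in the PRAM setting rather than in hardware: the COMPARE phase uses one conceptual processor per pair (i, j) to mark every element that is smaller than some other element, and the MERGE phase uses one processor per element to write an unmarked element as the maximum. Under the Common CRCW rule the concurrent writes are legal because all writers agree on the value. The array contents are arbitrary illustrative values; the loops simulate the processors serially.

#include <stdio.h>

#define LEN 8   /* illustrative array size, matching the parameter above */

int main(void) {
    int data[LEN] = {21, 7, 42, 3, 42, 19, 8, 30};
    int m[LEN] = {0};
    int maxNo = 0;

    /* COMPARE phase: one conceptual processor per pair (i, j); all concurrent
       writes to m[i] store the same value 1, so they are legal under the
       Common CRCW rule. */
    for (int i = 0; i < LEN; i++)
        for (int j = 0; j < LEN; j++)
            if (data[i] < data[j])
                m[i] = 1;

    /* MERGE phase: one conceptual processor per element; every unmarked
       position holds the maximum, so all concurrent writes again agree. */
    for (int i = 0; i < LEN; i++)
        if (m[i] == 0)
            maxNo = data[i];

    printf("max = %d\n", maxNo);
    return 0;
}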

See also

Distributed computing
Harvard architecture
Static random-access memory
Parallel computing
Parallel algorithm
Memory paging
Theoretical computer science
Hidden-line removal
Concurrency (computer science)
List ranking
Prefix sum
XMTC
Real RAM
Uzi Vishkin
All nearest smaller values
Explicit Multi-Threading (XMT)
SuperPascal
Analysis of parallel algorithms
Reduction operator
Parallel external memory

References

  1. Neil Immerman, "Expressibility and parallel complexity". SIAM Journal on Computing, vol. 18, no. 3, pp. 625–638, 1989.
  2. Wyllie, James C. (1979). The Complexity of Parallel Computations. PhD thesis, Department of Computer Science, Cornell University.