Cray MTA

Last updated February 18, 2021

The Cray MTA, formerly known as the Tera MTA, is a supercomputer architecture based on thousands of independent threads, fine-grain communication and synchronization between threads, and latency tolerance for irregular computations.

Each MTA processor (CPU) has a high-performance ALU with many independent register sets, each running an independent thread. For example, the Cray MTA-2 uses 128 register sets and thus 128 threads per CPU/ALU. All MTAs to date use a barrel processor arrangement, with a thread switch on every cycle, with blocked (stalled) threads skipped to avoid wasting ALU cycles. When a thread performs a memory read, execution blocks until data returns; meanwhile, other threads continue executing. With enough threads (concurrency), there are nearly always runnable threads to "cover" for blocked threads, and the ALUs stay busy. The memory system uses full/empty bits to ensure correct ordering. For example, an array A is initially written with "empty" bits, and any thread reading a value from A blocks until another thread writes a value. This ensures correct ordering, but allows fine-grained interleaving and provides a simple programming model. The memory system is also "randomized", with adjacent physical addresses going to different memory banks. Thus, when two threads access memory simultaneously, they rarely conflict unless they are accessing the same location.

A goal of the MTA is that porting codes from other machines is straightforward, but gives good performance. A parallelizing FORTRAN compiler can produce high performance for some codes with little manual intervention. Where manual porting is required, the simple and fine-grained synchronization model often allows programmers to write code the "obvious" way yet achieve good performance. A further goal is that programs for the MTA will be scalable – that is, when run on an MTA with twice as many CPUs, the same program will have nearly twice the performance. Both of these are challenges for many other high-performance computer systems.

An uncommon feature of the MTA is several workloads can be interleaved with good performance. Typically, supercomputers are dedicated to a task at a time. The MTA allows idle threads to be allocated to other tasks with very little effect on the main calculations.

Implementations

There have been three MTA implementations and as of 2009 a fourth is planned. The implementations are:

MTA-1 The MTA-1 uses a GaAs processor and was installed at the San Diego Supercomputer Center. It used four processors (512 threads)
MTA-2 The MTA-2 uses a CMOS processor and was installed at the Naval Research Laboratory. It was reportedly unstable, but being inside a secure facility was not available for debugging or repair.
MTA-3 The MTA-3 uses the same CPU as the MTA-2 but a dramatically cheaper and slower network interface. About six Cray XMT systems have been sold (2009) using the MTA-3.^[1]
MTA-4 The MTA-4 is a planned system (2009) that is architecturally similar but will use limited data caching and a faster network interface than the MTA-3.

Performance

Only a few systems have been deployed, and only MTA-2 benchmarks have been reported widely, making performance comparisons difficult.

Across several benchmarks, a 2-CPU MTA-2 shows performance similar to a 2-processor Cray T90.^[2] For the specific application of ray tracing, a 4-CPU MTA-2 was about 5x faster than a 4-CPU Cray T3E, and in scaling from 1 CPU to 4 CPUs the Tera performance improved by 3.8x, while the T3E going from 1 to 4 CPUs improved by only 3.0x.^[3]

Architectural considerations

Another way to compare systems is by inherent overheads and bottlenecks of the design.

The MTA uses many register sets, thus each register access is slow. Although concurrency (running other threads) typically hides latency, slow register file access limits performance when there are few runable threads. In existing MTA implementations, single-thread performance is 21 cycles per instruction,^[4] so performance suffers when there are fewer than 21 threads per CPU.

The MTA-1, -2, and -3 use no data caches. This reduces CPU complexity and avoids cache coherency problems. However, no data caching introduces two performance problems. First, the memory system must support the full data access bandwidth of all threads, even for unshared and thus cacheable data. Thus, good system performance requires very high memory bandwidth. Second, memory references take 150-170 cycles,^[4]^[5] a much higher latency than even a slow cache, thus increasing the number of runable threads required to keep the ALU busy. The MTA-4 will have a non-coherent cache, which can be used for read-only and unshared data (such as non-shared stack frames), but which requires software coherency e.g., if a thread is migrated between CPUs. Data cache competition is often a performance bottleneck for highly-concurrent processors, and sometimes even for 2-core systems; however, by using the cache for data that is either highly shared or has very high locality (stack frames), competition between threads can be kept low.

Full/empty status changes use polling, with a timeout for threads that poll too long. A timed-out thread may be descheduled and the hardware context used to run another thread; the OS scheduler sets a "trap on write" bit so the waited-for write will trap and put the descheduled thread back in the run queue.^[5] Where the descheduled thread is on the critical path, performance may suffer substantially.

The MTA is latency-tolerant, including irregular latency, giving good performance on irregular computations if there is enough concurrency to "cover" delays. The latency-tolerance hardware may be wasted on regular calculations, including those with latency that is high but which can be scheduled easily.

Related Research Articles

A central processing unit (CPU), also called a central processor, main processor or just processor, is the electronic circuitry within a computer that executes instructions that make up a computer program. The CPU performs basic arithmetic, logic, controlling, and input/output (I/O) operations specified by the instructions in the program. This contrasts with external components such as main memory and I/O circuitry, and specialized processors such as graphics processing units (GPUs).

Non-Uniform Memory Access (NUMA) is a computer memory design used in multiprocessing, where the memory access time depends on the memory location relative to the processor. Under NUMA, a processor can access its own local memory faster than non-local memory. The benefits of NUMA are limited to particular workloads, notably on servers where the data is often associated strongly with certain tasks or users.

In computer science, a thread of execution is the smallest sequence of programmed instructions that can be managed independently by a scheduler, which is typically a part of the operating system. The implementation of threads and processes differs between operating systems, but in most cases a thread is a component of a process. Multiple threads can exist within one process, executing concurrently and sharing resources such as memory, while different processes do not share these resources. In particular, the threads of a process share its executable code and the values of its dynamically allocated variables and non-thread-local global variables at any given time.

Superscalar processor CPU that implements instruction-level parallelism within a single processor

A superscalar processor is a CPU that implements a form of parallelism called instruction-level parallelism within a single processor. In contrast to a scalar processor that can execute at most one single instruction per clock cycle, a superscalar processor can execute more than one instruction during a clock cycle by simultaneously dispatching multiple instructions to different execution units on the processor. It therefore allows for more throughput than would otherwise be possible at a given clock rate. Each execution unit is not a separate processor, but an execution resource within a single CPU such as an arithmetic logic unit.

Direct memory access (DMA) is a feature of computer systems that allows certain hardware subsystems to access main system memory independent of the central processing unit (CPU).

In computing, a vector processor or array processor is a central processing unit (CPU) that implements an instruction set containing instructions that operate on one-dimensional arrays of data called vectors, compared to the scalar processors, whose instructions operate on single data items. Vector processors can greatly improve performance on certain workloads, notably numerical simulation and similar tasks. Vector machines appeared in the early 1970s and dominated supercomputer design through the 1970s into the 1990s, notably the various Cray platforms. The rapid fall in the price-to-performance ratio of conventional microprocessor designs led to the vector supercomputer's demise in the later 1990s.

Parallel computing is a type of computation where many calculations or the execution of processes are carried out simultaneously. Large problems can often be divided into smaller ones, which can then be solved at the same time. There are several different forms of parallel computing: bit-level, instruction-level, data, and task parallelism. Parallelism has long been employed in high-performance computing, but has gained broader interest due to the physical constraints preventing frequency scaling. As power consumption by computers has become a concern in recent years, parallel computing has become the dominant paradigm in computer architecture, mainly in the form of multi-core processors.

The Pentium Pro is a sixth-generation x86 microprocessor developed and manufactured by Intel and introduced on November 1, 1995. It introduced the P6 microarchitecture and was originally intended to replace the original Pentium in a full range of applications. While the Pentium and Pentium MMX had 3.1 and 4.5 million transistors, respectively, the Pentium Pro contained 5.5 million transistors. Later, it was reduced to a more narrow role as a server and high-end desktop processor and was used in supercomputers like ASCI Red, the first computer to reach the teraFLOPS performance mark. The Pentium Pro was capable of both dual- and quad-processor configurations. It only came in one form factor, the relatively large rectangular Socket 8. The Pentium Pro was succeeded by the Pentium II Xeon in 1998.

Cray Inc., a subsidiary of Hewlett Packard Enterprise, is an American supercomputer manufacturer headquartered in Seattle, Washington. It also manufactures systems for data storage and analytics. Several Cray supercomputer systems are listed in the TOP500, which ranks the most powerful supercomputers in the world.

Cell is a multi-core microprocessor microarchitecture that combines a general-purpose PowerPC core of modest performance with streamlined coprocessing elements which greatly accelerate multimedia and vector processing applications, as well as many other forms of dedicated computation.

A CPU cache is a hardware cache used by the central processing unit (CPU) of a computer to reduce the average cost to access data from the main memory. A cache is a smaller, faster memory, located closer to a processor core, which stores copies of the data from frequently used main memory locations. Most CPUs have a hierarchy of multiple cache levels, with separate instruction-specific and data-specific caches at level 1.

The Tera Computer Company was a manufacturer of high-performance computing software and hardware, founded in 1987 in Washington, D.C. and moved 1988 to Seattle, Washington by James Rottsolk and Burton Smith. The company's first supercomputer product, named MTA, featured interleaved multi-threading, i.e. a barrel processor. It also had no data cache, relying instead on switching between threads for latency tolerance, and used a deeply pipelined memory system to handle many simultaneous requests, with address randomization to avoid memory hot spots.

A barrel processor is a CPU that switches between threads of execution on every cycle. This CPU design technique is also known as "interleaved" or "fine-grained" temporal multithreading. Unlike simultaneous multithreading in modern superscalar architectures, it generally does not allow execution of multiple instructions in one cycle.

In computer engineering, microarchitecture, also called computer organization and sometimes abbreviated as µarch or uarch, is the way a given instruction set architecture (ISA) is implemented in a particular processor. A given ISA may be implemented with different microarchitectures; implementations may vary due to different goals of a given design or due to shifts in technology.

A multi-core processor is a computer processor on a single integrated circuit with two or more separate processing units, called cores, each of which reads and executes program instructions. The instructions are ordinary CPU instructions but the single processor can run instructions on separate cores at the same time, increasing overall speed for programs that support multithreading or other parallel computing techniques. Manufacturers typically integrate the cores onto a single integrated circuit die or onto multiple dies in a single chip package. The microprocessors currently used in almost all personal computers are multi-core.

The R4000 is a microprocessor developed by MIPS Computer Systems that implements the MIPS III instruction set architecture (ISA). Officially announced on 1 October 1991, it was one of the first 64-bit microprocessors and the first MIPS III implementation. In the early 1990s, when RISC microprocessors were expected to replace CISC microprocessors such as the Intel i486, the R4000 was selected to be the microprocessor of the Advanced Computing Environment (ACE), an industry standard that intended to define a common RISC platform. ACE ultimately failed for a number of reasons, but the R4000 found success in the workstation and server markets.

Cray XMT is a scalable multithreaded shared memory supercomputer architecture by Cray, based on the third generation of the Tera MTA architecture, targeted at large graph problems. Presented in 2005, it supersedes the earlier unsuccessful Cray MTA-2. It uses the Threadstorm3 CPUs inside Cray XT3 blades. Designed to make use of commodity parts and existing subsystems for other commercial systems, it alleviated the shortcomings of Cray MTA-2's high cost of fully custom manufacture and support. It brought various substantial improvements over Cray MTA-2, most notably nearly tripling the peak performance, and vastly increased maximum CPU count to 8,192 and maximum memory to 128 TB, with a data TLB of maximal 512 TB.

In computer architecture, multithreading is the ability of a central processing unit (CPU) to provide multiple threads of execution concurrently, supported by the operating system. This approach differs from multiprocessing. In a multithreaded application, the threads share the resources of a single or multiple cores, which include the computing units, the CPU caches, and the translation lookaside buffer (TLB).

Manycore processors are specialist multi-core processors designed for a high degree of parallel processing, containing numerous simpler, independent processor cores. Manycore processors are used extensively in embedded computers and high-performance computing.

Single instruction, multiple threads (SIMT) is an execution model used in parallel computing where single instruction, multiple data (SIMD) is combined with multithreading. It is different from SPMD in that all instructions in all "threads" are executed in lock-step. The SIMT execution model has been implemented on several GPUs and is relevant for general-purpose computing on graphics processing units (GPGPU), e.g. some supercomputers combine CPUs with GPUs.

References

↑ "Cray XMT System". 2009. Archived from the original on 2010-01-15.
↑ "Multi-processor Performance on the Tera MTA". 1999. Archived from the original on 2012-02-22.
↑ "Data Intensive Volume Visualization on the Tera MTA and Cray T3E". 1999. Archived from the original on 2010-08-15. Retrieved 2009-12-16.
1 2 "Tera MTA (Multi-Threaded Architecture)". 1999.
1 2 "Microbenchmarking the Tera MTA" (PDF). 1999.

External links

– a Cray XMT overview

This page is based on this Wikipedia article
Text is available under the CC BY-SA 4.0 license; additional terms may apply.
Images, videos and audio are available under their respective licenses.

[1] "Cray XMT System". 2009. Archived from the original on 2010-01-15.

[2] "Multi-processor Performance on the Tera MTA". 1999. Archived from the original on 2012-02-22.

[3] "Data Intensive Volume Visualization on the Tera MTA and Cray T3E". 1999. Archived from the original on 2010-08-15. Retrieved 2009-12-16.

[Movva-4] 1 2 "Tera MTA (Multi-Threaded Architecture)". 1999.

[Riedy-Vuduc-5] 1 2 "Microbenchmarking the Tera MTA" (PDF). 1999.

[1]

[2]

[3]

[4]

[5]