Intel Tera-Scale

Intel Tera-Scale is a research program by Intel that focuses on developing processors and platforms that exploit the inherent parallelism of emerging visual-computing applications. Such applications require teraFLOPS of parallel computing performance to process terabytes of data quickly. [1] Parallelism is the concept of performing multiple tasks simultaneously. Exploiting it not only increases the efficiency of central processing units (CPUs), but also increases the amount of data analyzed each second. To apply parallelism effectively, a CPU must be able to handle multiple threads, which requires multiple cores. Consumer-grade computers conventionally have 2–8 cores, and workstation-grade machines can have more, but even these counts fall well short of teraFLOPS performance, so far more cores must be added. As a result of the program, two prototype processors with many more cores than the conventional amount were manufactured and successfully demonstrated the feasibility of the approach.

Intel

Intel Corporation is an American multinational corporation and technology company headquartered in Santa Clara, California, in Silicon Valley. It is the world's second-largest and second-highest-valued semiconductor chip manufacturer by revenue, having been overtaken by Samsung, and is the inventor of the x86 series of microprocessors, the processors found in most personal computers (PCs). Intel ranked No. 46 in the 2018 Fortune 500 list of the largest United States corporations by total revenue.

Parallel computing

Parallel computing is a type of computation in which many calculations or the execution of processes are carried out simultaneously. Large problems can often be divided into smaller ones, which can then be solved at the same time. There are several different forms of parallel computing: bit-level, instruction-level, data, and task parallelism. Parallelism has long been employed in high-performance computing, but it's gaining broader interest due to the physical constraints preventing frequency scaling. As power consumption by computers has become a concern in recent years, parallel computing has become the dominant paradigm in computer architecture, mainly in the form of multi-core processors.
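
To make the division of work concrete, here is a minimal sketch in C using POSIX threads: a large array sum is split into chunks, each chunk is summed by its own thread (standing in for a core), and the partial results are combined at the end. The array size and thread count are arbitrary illustrative values.

```c
/* Minimal data-parallelism sketch: sum a large array by splitting it
 * into chunks, one thread per chunk. Compile with -pthread. */
#include <pthread.h>
#include <stdio.h>

#define N (1 << 20)   /* problem size (illustrative) */
#define THREADS 4     /* stand-in for available cores */

static double data[N];

struct chunk { int begin, end; double partial; };

static void *sum_chunk(void *arg)
{
    struct chunk *c = arg;
    c->partial = 0.0;
    for (int i = c->begin; i < c->end; i++)
        c->partial += data[i];
    return NULL;
}

int main(void)
{
    pthread_t tid[THREADS];
    struct chunk chunks[THREADS];

    for (int i = 0; i < N; i++)
        data[i] = 1.0;               /* known answer: N */

    int step = N / THREADS;
    for (int t = 0; t < THREADS; t++) {
        chunks[t].begin = t * step;
        chunks[t].end   = (t == THREADS - 1) ? N : (t + 1) * step;
        pthread_create(&tid[t], NULL, sum_chunk, &chunks[t]);
    }

    double total = 0.0;
    for (int t = 0; t < THREADS; t++) {
        pthread_join(tid[t], NULL);   /* wait, then combine results */
        total += chunks[t].partial;
    }
    printf("sum = %.0f\n", total);    /* prints 1048576 */
    return 0;
}
```

On a multi-core CPU each thread can run on its own core, so the chunks really are summed at the same time.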

In computing, floating point operations per second is a measure of computer performance, useful in fields of scientific computations that require floating-point calculations. For such cases it is a more accurate measure than measuring instructions per second.

Prototypes

Teraflops Research Chip (Polaris) is an 80-core prototype processor developed by Intel in 2007. It represents Intel's first public attempt at creating a Tera-Scale processor. The Polaris processor must run at 3.13 GHz and 1 V to live up to its teraFLOPS name. At its peak, the processor is capable of 1.28 teraFLOPS. [2]
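
As a back-of-envelope check, these figures line up with the usual peak-throughput formula, assuming each core's two floating-point multiply-accumulate engines retire up to 4 FLOPs per cycle:

    peak FLOPS = cores × FLOPs per cycle per core × clock
               = 80 × 4 × 3.13 GHz ≈ 1.0 teraFLOPS

which is what earns the chip its teraFLOPS name at the quoted operating point; the 1.28 teraFLOPS peak corresponds to the same arithmetic at a clock near 4 GHz.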

The Teraflops Research Chip is a research manycore processor, containing 80 cores developed by Intel Corporation's Tera-Scale Computing Research Program. The processor was officially announced February 11, 2007 and shown working at the 2007 International Solid-State Circuits Conference. Features of the processor include dual floating point engines, sleeping-core technology, self-correction, fixed-function cores, and three-dimensional memory stacking. The purpose of the chip is to explore the possibilities of Tera-Scale architecture and to experiment with various forms of networking and communication within the next generation of processors.

Single-chip Cloud Computer is another research processor developed by Intel in 2009. This processor consists of 48 P54C cores connected in a 6×4 2D mesh. [3]

The Single-Chip Cloud Computer (SCC) is a computer processor (CPU) created by Intel Corporation in 2009 that has 48 distinct physical cores that communicate through an architecture similar to that of a cloud computer data center. Cores are the parts of a processor that carry out the instructions that allow the computer to run. The SCC was a product of a project started by Intel to research multi-core processors and parallel processing. Intel also wanted to experiment with incorporating the design and architecture of large cloud data centers into a single processing chip: it took the aspect of cloud computing in which many remote servers communicate with each other and applied it to a microprocessor. The name "Single-chip Cloud Computer" originated from this concept.
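
A small illustrative sketch in C (not Intel's implementation) shows how distance works on such a mesh: the SCC's 48 cores sit in pairs on 24 tiles arranged six wide and four tall, and with dimension-ordered X-Y routing a message travels along the X axis first, then Y, so the hop count between two tiles is their Manhattan distance.

```c
/* Illustrative only: hop counting on an SCC-style 6x4 mesh where
 * each tile holds two cores and messages route X-first, then Y. */
#include <stdio.h>
#include <stdlib.h>

#define MESH_X 6   /* tiles per row    */
#define MESH_Y 4   /* tiles per column */

/* Map a core id (0..47) to its tile's mesh coordinates; two cores
 * share each tile, laid out row-major across the mesh. */
static void tile_of(int core, int *x, int *y)
{
    int tile = core / 2;
    *x = tile % MESH_X;
    *y = tile / MESH_X;
}

/* Hops under dimension-ordered X-Y routing: Manhattan distance. */
static int hops(int src_core, int dst_core)
{
    int sx, sy, dx, dy;
    tile_of(src_core, &sx, &sy);
    tile_of(dst_core, &dx, &dy);
    return abs(dx - sx) + abs(dy - sy);
}

int main(void)
{
    /* Worst case: cores on opposite corner tiles of the mesh. */
    printf("core 0 -> core 47: %d hops\n", hops(0, 47)); /* 5 + 3 = 8 */
    return 0;
}
```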

Ideology

Parallelism is the concept of performing multiple tasks simultaneously, effectively reducing the time needed to perform a given task. The Tera-Scale research program focuses on using many more cores than is conventional to increase performance through parallelism. In Intel's previous experience with increasing core counts on CPUs, doubling the number of cores nearly doubled performance without increasing power. A greater number of cores opens possibilities of improved energy efficiency, improved performance, extended lifetimes, and new capabilities. Tera-Scale processors would improve energy efficiency by putting unneeded cores to sleep, and improve performance by intelligently redistributing workloads to ensure an even spread across the chip. Tera-scale processors could also achieve extended lifetimes by keeping reserve cores that are brought online when a core in the processor fails. Lastly, the processors would gain new capabilities and functionality as dedicated hardware engines, such as graphics engines, could be integrated. [4]
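
A toy model in C illustrates why sleeping cores matter at these core counts; the per-core power figures below are invented for illustration and are not Intel numbers.

```c
/* Toy chip-power model: busy cores draw full power, gated ("sleeping")
 * cores draw almost none. All wattages are assumed for illustration. */
#include <stdio.h>

#define CORES 80

int main(void)
{
    const double p_active = 2.0;  /* watts per busy core (assumed)  */
    const double p_sleep  = 0.05; /* watts per gated core (assumed) */

    for (int busy = CORES; busy >= 20; busy -= 20) {
        double watts = busy * p_active + (CORES - busy) * p_sleep;
        printf("%2d busy cores -> %6.1f W\n", busy, watts);
    }
    return 0;
}
```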

Hardware

Intel Tera-Scale is focused on creating multi-core processors that can use parallel processing to reach teraFLOPS of computing performance. Current processors consist of highly complicated cores, built in a way that makes it difficult to fit many more of them on a CPU. As a result, Intel is focused on building Tera-Scale processors from many simple cores rather than a few high-performance ones. To simplify CPU cores, Intel moved from the x86 architecture to a much simpler VLIW architecture. VLIW is an uncommon architecture for desktops but is adequate for computers running specialized applications. It simplifies the hardware design at the cost of increasing the workload on the compiler side, meaning more work must be put into programming. This drawback is offset by the fact that the number of applications that will run on a Tera-Scale processor is small enough for the extra software effort to be manageable. [2]

x86

x86 is a family of instruction set architectures based on the Intel 8086 microprocessor and its 8088 variant. The 8086 was introduced in 1978 as a fully 16-bit extension of Intel's 8-bit 8080 microprocessor, with memory segmentation as a solution for addressing more memory than can be covered by a plain 16-bit address. The term "x86" came into being because the names of several successors to Intel's 8086 processor end in "86", including the 80186, 80286, 80386 and 80486 processors.

Very long instruction word (VLIW) refers to instruction set architectures designed to exploit instruction level parallelism (ILP). Whereas conventional central processing units mostly allow programs to specify instructions to execute in sequence only, a VLIW processor allows programs to explicitly specify instructions to execute in parallel. This design is intended to allow higher performance without the complexity inherent in some other designs.
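
To make the contrast with x86 concrete, here is a schematic sketch in C of the VLIW idea: the compiler packs independent operations into one wide instruction with a slot per functional unit, and the hardware issues every filled slot in the same cycle with no scheduling logic of its own. The three-slot format and operation names are invented for illustration.

```c
/* Schematic VLIW model: one "very long instruction" holds a slot per
 * functional unit; all filled slots issue together in one cycle. */
#include <stdio.h>

enum op { NOP, ADD, MUL, LOAD };

struct bundle {
    enum op alu_slot;  /* integer unit        */
    enum op fpu_slot;  /* floating-point unit */
    enum op mem_slot;  /* load/store unit     */
};

/* The hardware just issues the slots; the compiler, not the CPU,
 * guaranteed that the operations in a bundle are independent. */
static void issue(const struct bundle *b, int cycle)
{
    static const char *name[] = { "nop", "add", "mul", "load" };
    printf("cycle %d: alu=%-4s fpu=%-4s mem=%-4s\n", cycle,
           name[b->alu_slot], name[b->fpu_slot], name[b->mem_slot]);
}

int main(void)
{
    /* A two-bundle "program": three ops in cycle 0 where the compiler
     * found independent work, NOPs in cycle 1 where it could not. */
    struct bundle program[] = {
        { ADD, MUL, LOAD },
        { ADD, NOP, NOP  },
    };
    for (int i = 0; i < 2; i++)
        issue(&program[i], i);
    return 0;
}
```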

Software

With the release of the 80-core Polaris processor in 2007, people questioned the need for tens to hundreds of cores. Intel responded with a category of software called Recognition, Mining, and Synthesis (RMS) applications, which require that much computational power. Recognition applications create models based on what they identify, such as a person's face. Mining applications extract one or more instances from a large amount of data. Lastly, synthesis applications allow for prediction and projection of new environments. An example of where RMS and tera-scale processors are needed is the creation of sports summaries: a computer usually needs hours to mine through hundreds of thousands of video frames to find the short action clips a summary shows. With RMS software and a tera-scale processor, sports summaries could be created in real time during sporting events. [1] Tera-Scale processors also show potential for real-time analysis in fields such as finance, which requires a processor capable of analyzing immense amounts of data. From its past evolution from single-core to multi-core processors, Intel has learned that parallelization is the key to greater processing power in the future. The Intel Tera-Scale research program is therefore focused not only on creating many-core processors, but also on parallelizing the applications of today and the future. To show its commitment to all aspects of parallel computing, Intel set aside $20 million to establish centers that research and develop new ways to apply parallel computing to many more applications. [5]
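
The mining step lends itself directly to the many-core model. Below is a compressed sketch in C with OpenMP that scans a large batch of video frames for action clips, dividing the scan across whatever cores are present; is_action_frame() is a hypothetical stand-in for a real recognition kernel.

```c
/* Sketch of parallel frame mining; compile with -fopenmp. The detector
 * below is a dummy rule standing in for real recognition code. */
#include <stdio.h>

#define FRAMES 100000

static int is_action_frame(int frame)
{
    return frame % 997 == 0;  /* hypothetical "action" predicate */
}

int main(void)
{
    int found = 0;

    /* Each thread mines its own share of the frames; per-thread
     * counts are combined with a reduction. */
    #pragma omp parallel for reduction(+:found)
    for (int f = 0; f < FRAMES; f++)
        if (is_action_frame(f))
            found++;

    printf("action frames found: %d\n", found);
    return 0;
}
```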

Challenges

In early 2005, Intel encountered the problem of memory bandwidth: as more cores are added, memory bandwidth stays the same because of packaging size constraints, effectively bottlenecking the CPU. Intel was able to overcome the problem with die stacking, a process in which the CPU die, flash, and DRAM are stacked on top of each other, significantly raising the possible memory bus widths. [2] Another challenge Intel encountered was the physical limitation of electrical buses. A bus is the CPU's connection to the outside world, and current bus bandwidth would be unable to keep up with the teraFLOPS performance of tera-scale processors. Intel's research into silicon photonics has produced a functional optical bus that offers superior signaling speed and power efficiency compared to current buses, making optical buses an ideal solution to the bus bandwidth limitation for tera-scale processors. [2]
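
A rough worked example shows the leverage from wider interfaces (the numbers are illustrative assumptions, not Intel's): a conventional 64-bit (8-byte) memory bus moving one billion transfers per second delivers 8 bytes × 10⁹/s = 8 GB/s, while a stacked configuration that permits a 1024-bit (128-byte) interface at the same transfer rate delivers 128 GB/s, a 16× gain from bus width alone.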

Related Research Articles

Central processing unit

A central processing unit (CPU), also called a central processor or main processor, is the electronic circuitry within a computer that carries out the instructions of a computer program by performing the basic arithmetic, logic, controlling, and input/output (I/O) operations specified by the instructions. The computer industry has used the term "central processing unit" at least since the early 1960s. Traditionally, the term "CPU" refers to a processor, more specifically to its processing unit and control unit (CU), distinguishing these core elements of a computer from external components such as main memory and I/O circuitry.

Symmetric multiprocessing

Symmetric multiprocessing (SMP) involves a multiprocessor computer hardware and software architecture where two or more identical processors are connected to a single, shared main memory, have full access to all input and output devices, and are controlled by a single operating system instance that treats all processors equally, reserving none for special purposes. Most multiprocessor systems today use an SMP architecture. In the case of multi-core processors, the SMP architecture applies to the cores, treating them as separate processors.

Superscalar processor

A superscalar processor is a CPU that implements a form of parallelism called instruction-level parallelism within a single processor. In contrast to a scalar processor that can execute at most one single instruction per clock cycle, a superscalar processor can execute more than one instruction during a clock cycle by simultaneously dispatching multiple instructions to different execution units on the processor. It therefore allows more throughput than would otherwise be possible at a given clock rate. Each execution unit is not a separate processor, but an execution resource within a single CPU such as an arithmetic logic unit.

Central processing unit power dissipation or CPU power dissipation is the process in which central processing units (CPUs) consume electrical energy, and dissipate this energy in the form of heat due to the resistance in the electronic circuits.

The clock rate typically refers to the frequency at which a chip like a central processing unit (CPU), one core of a multi-core processor, is running and is used as an indicator of the processor's speed. It is measured in clock cycles per second or its equivalent, the SI unit hertz (Hz). The clock rate of the first generation of computers was measured in hertz or kilohertz (kHz), the first personal computers (PCs) to arrive throughout the 1970s and 1980s had clock rates measured in megahertz (MHz), and in the 21st century the speed of modern CPUs is commonly advertised in gigahertz (GHz). This metric is most useful when comparing processors within the same family, holding constant other features that may affect performance. Video card and CPU manufacturers commonly select their highest-performing units from a manufacturing batch and set their maximum clock rate higher, fetching a higher price.

Power management is a feature of some electrical appliances, especially copiers, computers, GPUs and computer peripherals such as monitors and printers, that turns off the power or switches the system to a low-power state when inactive. In computing this is known as PC power management and is built around a standard called ACPI. This supersedes APM. All recent (consumer) computers have ACPI support.

Simultaneous multithreading (SMT) is a technique for improving the overall efficiency of superscalar CPUs with hardware multithreading. SMT permits multiple independent threads of execution to better utilize the resources provided by modern processor architectures.

Explicitly parallel instruction computing (EPIC) is a term coined in 1997 by the HP–Intel alliance to describe a computing paradigm that researchers had been investigating since the early 1980s. This paradigm is also called Independence architectures. It was the basis for Intel and HP development of the Intel Itanium architecture, and HP later asserted that "EPIC" was merely an old term for the Itanium architecture. EPIC permits microprocessors to execute software instructions in parallel by using the compiler, rather than complex on-die circuitry, to control parallel instruction execution. This was intended to allow simple performance scaling without resorting to higher clock frequencies.

In computing, hardware acceleration is the use of computer hardware specially made to perform some functions more efficiently than is possible in software running on a general-purpose CPU. Any transformation of data or routine that can be computed can be calculated purely in software running on a generic CPU, purely in custom-made hardware, or in some mix of both. An operation can be computed faster in application-specific hardware designed or programmed to compute the operation than specified in software and performed on a general-purpose computer processor. Each approach has advantages and disadvantages. The implementation of computing tasks in hardware to decrease latency and increase throughput is known as hardware acceleration.

Multi-core processor

A multi-core processor is a single computing component with two or more independent processing units called cores, which read and execute program instructions. The instructions are ordinary CPU instructions but the single processor can run multiple instructions on separate cores at the same time, increasing overall speed for programs amenable to parallel computing. Manufacturers typically integrate the cores onto a single integrated circuit die or onto multiple dies in a single chip package. The microprocessors currently used in almost all personal computers are multi-core.

The history of general-purpose CPUs is a continuation of the earlier history of computing hardware.

Explicit data graph execution, or EDGE, is a type of instruction set architecture (ISA) which intends to improve computing performance compared to common processors like the Intel x86 line. EDGE combines many individual instructions into a larger group known as a "hyperblock". Hyperblocks are designed to be able to easily run in parallel.

TRIPS architecture

TRIPS was a microprocessor architecture designed by a team at the University of Texas at Austin in conjunction with IBM, Intel, and Sun Microsystems. TRIPS uses an instruction set architecture designed to be easily broken down into large groups of instructions (graphs) that can be run on independent processing elements. The design collects related data into the graphs, attempting to avoid expensive data reads and writes and keeping the data in high-speed memory close to the processing elements. The prototype TRIPS processor contains 16 such elements. In papers published from 2003 to 2006, TRIPS was projected to reach 1 TFLOPS on a single processor.

Stream Processors, Inc was a Silicon Valley-based fabless semiconductor company specializing in the design and manufacture of high-performance digital signal processors for applications including video surveillance, multi-function printers and video conferencing. The company ceased operations in 2009.

In computing, performance per watt is a measure of the energy efficiency of a particular computer architecture or computer hardware. Literally, it measures the rate of computation that can be delivered by a computer for every watt of power consumed. This rate is typically measured by performance on the LINPACK benchmark when trying to compare between computing systems.

Manycore processors are specialist multi-core processors designed for a high degree of parallel processing, containing a large number of simpler, independent processor cores. Manycore processors are used extensively in embedded computers and high-performance computing. As of November 2018, the world's third fastest supercomputer, the Chinese Sunway TaihuLight, obtains its performance from 40,960 SW26010 manycore processors, each containing 256 cores.

Xeon Phi

Xeon Phi is a series of x86 manycore processors designed and made by Intel. It is intended for use in supercomputers, servers, and high-end workstations. Its architecture allows use of standard programming languages and APIs such as OpenMP.

References

  1. Held, Jim; Bautista, Jerry; Koehl, Sean (2006). "From a Few Cores to Many: A Tera-scale Computing Research Overview" (PDF). White Paper, Research at Intel. Intel Corporation. Retrieved 28 October 2014.
  2. Shimpi, Anand Lal. "The Era of Tera: Intel Reveals more about 80-core CPU". AnandTech. Retrieved 29 October 2014.
  3. Mattson, Tim. "Using Intel's Single Chip Cloud Computer (SCC)" (PDF). Retrieved 11 November 2014.
  4. "Tera-scale Computing Architectural Overview". Intel. Archived from the original on 2014-11-29. Retrieved 2017-01-02.
  5. Ferguson, Scott. "Microsoft, Intel Earmark $20M for Parallel Computing". eWeek. Retrieved 6 November 2014.