|Designer||HP and Intel|
|General purpose||128 (64 bits plus 1 trap bit; 32 are static, 96 use register windows); 64 1-bit predicate registers|
IA-64 (also called Intel Itanium architecture) is the instruction set architecture (ISA) of the Itanium family of 64-bit Intel microprocessors. The basic ISA specification originated at Hewlett-Packard (HP), and was evolved and then implemented in a new processor microarchitecture by Intel with HP's continued partnership and expertise on the underlying EPIC design concepts. In order to establish what was their first new ISA in 20 years and bring an entirely new product line to market, Intel made a massive investment in product definition, design, software development tools, OS, software industry partnerships, and marketing. To support this effort Intel created the largest design team in their history and a new marketing and industry enabling team completely separate from x86. The first Itanium processor, codenamed Merced, was released in 2001.
The Itanium architecture is based on explicit instruction-level parallelism, in which the compiler decides which instructions to execute in parallel. This contrasts with superscalar architectures, which depend on the processor to manage instruction dependencies at runtime. In all Itanium models, up to and including Tukwila , cores execute up to six instructions per clock cycle.
In 2008, Itanium was the fourth-most deployed microprocessor architecture for enterprise-class systems, behind x86-64, Power ISA, and SPARC.
In 1989, HP began to become concerned that reduced instruction set computing (RISC) architectures were approaching a processing limit at one instruction per cycle. Both Intel and HP researchers had been exploring computer architecture options for future designs and separately began investigating a new concept known as very long instruction word (VLIW)which came out of research by Yale University in the early 1980s. VLIW is a computer architecture concept (like RISC and CISC) where a single instruction word contains multiple instructions encoded in one very long instruction word to facilitate the processor executing multiple instructions in each clock cycle. Typical VLIW implementations rely heavily on sophisticated compilers to determine at compile time which instructions can be executed at the same time and the proper scheduling of these instructions for execution and also to help predict the direction of branch operations. The value of this approach is to do more useful work in fewer clock cycles and to simplify processor instruction scheduling and branch prediction hardware requirements, with a penalty in increased processor complexity, cost, and energy consumption in exchange for faster execution.
During this time, HP had begun to believe that it was no longer cost-effective for individual enterprise systems companies such as itself to develop proprietary microprocessors. Intel had also been researching several architectural options for going beyond the x86 ISA to address high end enterprise server and high performance computing (HPC) requirements. Thus Intel and HP partnered in 1994 to develop the IA-64 ISA, using a variation of VLIW design concepts which Intel named explicitly parallel instruction computing (EPIC). Intel's goal was to leverage the expertise HP had developed in their early VLIW work along with their own to develop a volume product line targeted at high-end enterprise class servers and high performance computing (HPC) systems that could be sold to all original equipment manufacturers (OEMs) while HP wished to be able to purchase off-the-shelf processors built using Intel's volume manufacturing and leading edge process technology that were higher performance and more cost effective than their current PA-RISC processors. Because the resulting products would be Intel's (HP would be one of many customers) and in order to achieve volumes necessary for a successful product line, the Itanium products would be required to meet the needs of the broader customer base and that software applications, OS, and development tools be available for these customers. This required that Itanium products be designed, documented, and manufactured, and have quality and support consistent with the rest of Intel's products. Therefore, Intel took the lead on microarchitecture design, productization (packaging, test, and all other steps), industry software and operating system enabling (Linux and Windows NT), and marketing. As part of Intel's definition and marketing process they engaged a wide variety of enterprise OEM's, software, and OS vendors, as well as end customers in order to understand their requirements and ensure they were reflected in the product family so as to meet the needs of a broad range of customers and end-users. HP made a substantial contribution to the ISA definition, the Merced/Itanium microarchitecture, and Itanium 2, but productization responsibility was Intel's. The original goal for delivering the first Itanium family product (codenamed Merced) was 1998.
Intel's product marketing and industry engagement efforts were substantial and achieved design wins with the majority of enterprise server OEM's including those based on RISC processors at the time, industry analysts predicted that IA-64 would dominate in servers, workstations, and high-end desktops, and eventually supplant RISC and complex instruction set computing (CISC) architectures for all general-purpose applications.Compaq and Silicon Graphics decided to abandon further development of the Alpha and MIPS architectures respectively in favor of migrating to IA-64.
By 1997, it was apparent that the IA-64 architecture and the compiler were much more difficult to implement than originally thought, and the delivery of Itanium began slipping.Since Itanium was the first ever EPIC processor, the development effort encountered more unanticipated problems than the team was accustomed to. In addition, the EPIC concept depends on compiler capabilities that had never been implemented before, so more research was needed.
Several groups developed operating systems for the architecture, including Microsoft Windows and Unix and Unix-like systems such as Linux, HP-UX, FreeBSD, Solaris,Tru64 UNIX, and Monterey/64 (the last three were canceled before reaching the market). In 1999, Intel led the formation of an open source industry consortium to port Linux to IA-64 they named "Trillium" (and later renamed "Trillian" due to a trademark issue) which was led by Intel and included Caldera Systems, CERN, Cygnus Solutions, Hewlett-Packard, IBM, Red Hat, SGI, SuSE, TurboLinux and VA Linux Systems. As a result, a working IA-64 Linux was delivered ahead of schedule and was the first OS to run on the new Itanium processors.
Intel announced the official name of the processor, Itanium, on October 4, 1999.Within hours, the name Itanic had been coined on a Usenet newsgroup as a pun on the name Titanic, the "unsinkable" ocean liner that sank on its maiden voyage in 1912.
|Max. CPU clock rate||733 MHz to 800 MHz|
|FSB speeds||266 MT/s|
|L2 cache||96 KB|
|L3 cache||2 or 4 MB|
|Architecture and classification|
|Products, models, variants|
By the time Itanium was released in June 2001, its performance was not superior to competing RISC and CISC processors.
Recognizing that the lack of software could be a serious problem for the future, Intel made thousands of these early systems available to independent software vendors (ISVs) to stimulate development. HP and Intel brought the next-generation Itanium 2 processor to market a year later.
Itanium 2 processor
|Max. CPU clock rate||733 MHz to 2.66 GHz|
|L2 cache||256 KB on Itanium2|
256 KB (D) + 1 MB(I) or 512 KB (I) on (Itanium2 9x00 series)
|L3 cache||1.5–32 MB|
|Architecture and classification|
|Products, models, variants|
The Itanium 2 processor was released in 2002. It relieved many of the performance problems of the original Itanium processor, which were mostly caused by an inefficient memory subsystem.
In 2003, AMD released the Opteron, which implemented its own 64-bit architecture (x86-64). Opteron gained rapid acceptance in the enterprise server space because it provided an easy upgrade from x86. Intel responded by implementing x86-64 (as Em64t) in its Xeon microprocessors in 2004.
In November 2005, the major Itanium server manufacturers joined with Intel and a number of software vendors to form the Itanium Solutions Alliance to promote the architecture and accelerate software porting.
In 2006, Intel delivered Montecito (marketed as the Itanium 2 9000 series), a dual-core processor that roughly doubled performance and decreased energy consumption by about 20 percent.
The Itanium 9300 series processor, codenamed Tukwila, was released on 8 February 2010 with greater performance and memory capacity.Tukwila had originally been slated for release in 2007.
The device uses a 65 nm process, includes two to four cores, up to 24 MB on-die caches, Hyper-Threading technology and integrated memory controllers. It implements double-device data correction (DDDC), which helps to fix memory errors. Tukwila also implements Intel QuickPath Interconnect (QPI) to replace the Itanium bus-based architecture. It has a peak interprocessor bandwidth of 96 GB/s and a peak memory bandwidth of 34 GB/s. With QuickPath, the processor has integrated memory controllers and interfaces the memory directly, using QPI interfaces to directly connect to other processors and I/O hubs. QuickPath is also used on Intel processors using the Nehalem microarchitecture, making it probable that Tukwila and Nehalem will be able to use the same chipsets. Tukwila incorporates four memory controllers, each of which supports multiple DDR3 DIMMs via a separate memory controller, much like the Nehalem-based Xeon processor code-named Beckton .
This section needs to be updated.April 2017)(
The Itanium 9500 series processor, codenamed Poulson, is the follow-on processor to Tukwila features eight cores, has a 12-wide issue architecture, multithreading enhancements, and new instructions to take advantage of parallelism, especially in virtualization. MB. L2 cache size is 6 MB, 512 I KB, 256 D KB per core. Die size is 544 mm², less than its predecessor Tukwila (698.75 mm²).The Poulson L3 cache size is 32
At ISSCC 2011, Intel presented a paper called, "A 32nm 3.1 Billion Transistor 12-Wide-Issue Itanium Processor for Mission Critical Servers."Given Intel's history of disclosing details about Itanium microprocessors at ISSCC, this paper most likely refers to Poulson. Analyst David Kanter speculates that Poulson will use a new microarchitecture, with a more advanced form of multi-threading that uses as many as two threads, to improve performance for single threaded and multi-threaded workloads. Some new information was released at Hotchips conference. New information presents improvements in multithreading, resiliency improvements (Instruction Replay RAS) and few new instructions (thread priority, integer instruction, cache prefetching, data access hints).
The Kittson is the same as the 9500 Poulson, but slightly higher clocked.
In January 2019, Intel announced that Kittson would be discontinued, with a last order date of January 2020, and a last ship date of July 2021.
There is no planned successor.
Intel has extensively documented the Itanium instruction setand the technical press has provided overviews. The architecture has been renamed several times during its history. HP originally called it PA-WideWord. Intel later called it IA-64, then Itanium Processor Architecture (IPA), before settling on Intel Itanium Architecture, but it is still widely referred to as IA-64.
It is a 64-bit register-rich explicitly parallel architecture. The base data word is 64 bits, byte-addressable. The logical address space is 264 bytes. The architecture implements predication, speculation, and branch prediction. It uses variable-sized register windowing for parameter passing. The same mechanism is also used to permit parallel execution of loops. Speculation, prediction, predication, and renaming are under control of the compiler: each instruction word includes extra bits for this. This approach is the distinguishing characteristic of the architecture.
The architecture implements a large number of registers:
gr0always reads 0.
fr0always reads +0.0, and
fr1always reads +1.0.
pr0always reads 1 (true).
br0is set to the return address when a function is called with
bsppoints to the second stack, which is where the hardware will automatically spill registers when the register window wraps around.
Each 128-bit instruction word is called a bundle, and contains three slots each holding a 41-bit instruction, plus a 5-bit template indicating which type of instruction is in each slot. Those types are M-unit (memory instructions), I-unit (integer ALU, non-ALU integer, or long immediate extended instructions), F-unit (floating-point instructions), or B-unit (branch or long branch extended instructions). The template also encodes stops which indicate that a data dependency exists between data before and after the stop. All instructions between a pair of stops constitute an instruction group, regardless of their bundling, and must be free of many types of data dependencies; this knowledge allows the processor to execute instructions in parallel without having to perform its own complicated data analysis, since that analysis was already done when the instructions were written.
Within each slot, all but a few instructions are predicated, specifying a predicate register, the value of which (true or false) will determine whether the instruction is executed. Predicated instructions which should always execute are predicated on
pr0, which always reads as true.
The IA-64 assembly language and instruction format was deliberately designed to be written mainly by compilers, not by humans. Instructions must be grouped into bundles of three, ensuring that the three instructions match an allowed template. Instructions must issue stops between certain types of data dependencies, and stops can also only be used in limited places according to the allowed templates.
The fetch mechanism can read up to two bundles per clock from the L1 cache into the pipeline. When the compiler can take maximum advantage of this, the processor can execute six instructions per clock cycle. The processor has thirty functional execution units in eleven groups. Each unit can execute a particular subset of the instruction set, and each unit executes at a rate of one instruction per cycle unless execution stalls waiting for data. While not all units in a group execute identical subsets of the instruction set, common instructions can be executed in multiple units.
The execution unit groups include:
Ideally, the compiler can often group instructions into sets of six that can execute at the same time. Since the floating-point units implement a multiply–accumulate operation, a single floating point instruction can perform the work of two instructions when the application requires a multiply followed by an add: this is very common in scientific processing. When it occurs, the processor can execute four FLOPs per cycle. For example, the 800 MHz Itanium had a theoretical rating of 3.2 GFLOPS and the fastest Itanium 2, at 1.67 GHz, was rated at 6.67 GFLOPS.
In practice, the processor may often be underutilized, with not all slots filled with useful instructions due to e.g. data dependencies or limitations in the available bundle templates. The densest possible code requires 42.6 bits per instruction, compared to 32 bits per instruction on traditional RISC processors of the time, and no-ops due to wasted slots further decrease the density of code. Additional instructions for speculative loads and hints for branches and cache are difficult to generate optimally, even with modern compilers.
From 2002 to 2006, Itanium 2 processors shared a common cache hierarchy. They had 16 KB of Level 1 instruction cache and 16 KB of Level 1 data cache. The L2 cache was unified (both instruction and data) and is 256 KB. The Level 3 cache was also unified and varied in size from 1.5 MB to 24 MB. The 256 KB L2 cache contains sufficient logic to handle semaphore operations without disturbing the main arithmetic logic unit (ALU).
Main memory is accessed through a bus to an off-chip chipset. The Itanium 2 bus was initially called the McKinley bus, but is now usually referred to as the Itanium bus. The speed of the bus has increased steadily with new processor releases. The bus transfers 2×128 bits per clock cycle, so the 200 MHz McKinley bus transferred 6.4 GB/s, and the 533 MHz Montecito bus transfers 17.056 GB/s
Itanium processors released prior to 2006 had hardware support for the IA-32 architecture to permit support for legacy server applications, but performance for IA-32 code was much worse than for native code and also worse than the performance of contemporaneous x86 processors. In 2005, Intel developed the IA-32 Execution Layer (IA-32 EL), a software emulator that provides better performance. With Montecito, Intel therefore eliminated hardware support for IA-32 code.
In 2006, with the release of Montecito, Intel made a number of enhancements to the basic processor architecture including:
See Chipsets...Other markets.
Itanium is a family of 64-bit Intel microprocessors that implement the Intel Itanium architecture. Intel marketed the processors for enterprise servers and high-performance computing systems. The Itanium architecture originated at Hewlett-Packard (HP), and was later jointly developed by HP and Intel.
PA-RISC is an instruction set architecture (ISA) developed by Hewlett-Packard. As the name implies, it is a reduced instruction set computer (RISC) architecture, where the PA stands for Precision Architecture. The design is also referred to as HP/PA for Hewlett Packard Precision Architecture.
x86 is a family of instruction set architectures initially developed by Intel based on the Intel 8086 microprocessor and its 8088 variant. The 8086 was introduced in 1978 as a fully 16-bit extension of Intel's 8-bit 8080 microprocessor, with memory segmentation as a solution for addressing more memory than can be covered by a plain 16-bit address. The term "x86" came into being because the names of several successors to Intel's 8086 processor end in "86", including the 80186, 80286, 80386 and 80486 processors.
An instruction set architecture (ISA) is an abstract model of a computer. It is also referred to as architecture or computer architecture. A realization of an ISA, such as a central processing unit (CPU), is called an implementation.
Very long instruction word (VLIW) refers to instruction set architectures designed to exploit instruction level parallelism (ILP). Whereas conventional central processing units mostly allow programs to specify instructions to execute in sequence only, a VLIW processor allows programs to explicitly specify instructions to execute in parallel. This design is intended to allow higher performance without the complexity inherent in some other designs.
The Intel i860 was a RISC microprocessor design introduced by Intel in 1989. It was one of Intel's first attempts at an entirely new, high-end instruction set architecture since the failed Intel iAPX 432 from the 1980s. It was released with considerable fanfare, slightly obscuring the earlier Intel i960, which was successful in some niches of embedded systems, and which many considered to be a better design. The i860 never achieved commercial success and the project was terminated in the mid-1990s.
In computer architecture, 64-bit integers, memory addresses, or other data units are those that are 64 bits wide. Also, 64-bit CPU and ALU architectures are those that are based on registers, address buses, or data buses of that size. 64-bit microcomputers are computers in which 64-bit microprocessors are the norm. From the software perspective, 64-bit computing means the use of code with 64-bit virtual memory addresses. However, not all 64-bit instruction sets support full 64-bit virtual memory addresses; x86-64 and ARMv8, for example, support only 48 bits of virtual address, with the remaining 16 bits of the virtual address required to be all 0's or all 1's, and several 64-bit instruction sets support fewer than 64 bits of physical memory address.
The Pentium Pro is a sixth-generation x86 microprocessor developed and manufactured by Intel introduced in November 1, 1995. It introduced the P6 microarchitecture and was originally intended to replace the original Pentium in a full range of applications. While the Pentium and Pentium MMX had 3.1 and 4.5 million transistors, respectively, the Pentium Pro contained 5.5 million transistors. Later, it was reduced to a more narrow role as a server and high-end desktop processor and was used in supercomputers like ASCI Red, the first computer to reach the teraFLOPS performance mark. The Pentium Pro was capable of both dual- and quad-processor configurations. It only came in one form factor, the relatively large rectangular Socket 8. The Pentium Pro was succeeded by the Pentium II Xeon in 1998.
In computing, binary translation is a form of binary recompilation where sequences of instructions are translated from a source instruction set to the target instruction set. In some cases such as instruction set simulation, the target instruction set may be the same as the source instruction set, providing testing and debugging features such as instruction trace, conditional breakpoints and hot spot detection.
Simultaneous multithreading (SMT) is a technique for improving the overall efficiency of superscalar CPUs with hardware multithreading. SMT permits multiple independent threads of execution to better utilize the resources provided by modern processor architectures.
Memory protection is a way to control memory access rights on a computer, and is a part of most modern instruction set architectures and operating systems. The main purpose of memory protection is to prevent a process from accessing memory that has not been allocated to it. This prevents a bug or malware within a process from affecting other processes, or the operating system itself. Protection may encompass all accesses to a specified area of memory, write accesses, or attempts to execute the contents of the area. An attempt to access unowned memory results in a hardware fault, called a segmentation fault or storage violation exception, generally causing abnormal termination of the offending process. Memory protection for computer security includes additional techniques such as address space layout randomization and executable space protection.
Explicitly parallel instruction computing (EPIC) is a term coined in 1997 by the HP–Intel alliance to describe a computing paradigm that researchers had been investigating since the early 1980s. This paradigm is also called Independence architectures. It was the basis for Intel and HP development of the Intel Itanium architecture, and HP later asserted that "EPIC" was merely an old term for the Itanium architecture. EPIC permits microprocessors to execute software instructions in parallel by using the compiler, rather than complex on-die circuitry, to control parallel instruction execution. This was intended to allow simple performance scaling without resorting to higher clock frequencies.
The POWER5 is a microprocessor developed and fabricated by IBM. It is an improved version of the POWER4. The principal improvements are support for simultaneous multithreading (SMT) and an on-die memory controller. The POWER5 is a dual-core microprocessor, with each core supporting one physical thread and two logical threads, for a total of two physical threads and four logical threads.
In computer engineering, microarchitecture, also called computer organization and sometimes abbreviated as µarch or uarch, is the way a given instruction set architecture (ISA) is implemented in a particular processor. A given ISA may be implemented with different microarchitectures; implementations may vary due to different goals of a given design or due to shifts in technology.
MAJC was a Sun Microsystems multi-core, multithreaded, very long instruction word (VLIW) microprocessor design from the mid-to-late 1990s. Originally called the UltraJava processor, the MAJC processor was targeted at running Java programs, whose "late compiling" allowed Sun to make several favourable design decisions. The processor was released into two commercial graphical cards from Sun. Lessons learned regarding multi-threads on a multi-core processor provided a basis for later OpenSPARC implementations such as the UltraSPARC T1.
The history of general-purpose CPUs is a continuation of the earlier history of computing hardware.
The HP Superdome is a high-end server computer developed and produced by Hewlett Packard Enterprise. The latest version of product, "Superdome 2" was introduced in 2010. Superdome 2 scales from 2 to 32 sockets and 4 TB of memory. When introduced in 2000, the Superdome used PA-RISC processors. Since 2002, there has been another version of the machine based on Itanium 2 processors, marketed in parallel as the HP Integrity Superdome. The classic PA-RISC Superdome was subsequently rebranded to HP 9000 Superdome. The predecessor to the Superdome was the HP V-Class.
The PA-8000 (PCX-U), code-named Onyx, is a microprocessor developed and fabricated by Hewlett-Packard (HP) that implemented the PA-RISC 2.0 instruction set architecture (ISA). It was a completely new design with no circuitry derived from previous PA-RISC microprocessors. The PA-8000 was introduced on 2 November 1995 when shipments began to members of the Precision RISC Organization (PRO). It was used exclusively by PRO members and was not sold on the merchant market. All follow-on PA-8x00 processors are based on the basic PA-8000 processor core.