Hyper-threading (officially called Hyper-Threading Technology or HT Technology and abbreviated as HTT or HT) is Intel's proprietary simultaneous multithreading (SMT) implementation used to improve parallelization of computations (doing multiple tasks at once) performed on x86 microprocessors. It was introduced on Xeon server processors in February 2002 and on Pentium 4 desktop processors in November 2002. [ citation needed ]Since then, Intel has included this technology in Itanium, Atom, and Core 'i' Series CPUs, among others.
For each processor core that is physically present, the operating system addresses two virtual (logical) cores and shares the workload between them when possible. The main function of hyper-threading is to increase the number of independent instructions in the pipeline; it takes advantage of superscalar architecture, in which multiple instructions operate on separate data in parallel. With HTT, one physical core appears as two processors to the operating system, allowing concurrent scheduling of two processes per core. In addition, two or more processes can use the same resources: If resources for one process are not available, then another process can continue if its resources are available.
In addition to requiring simultaneous multithreading support in the operating system, hyper-threading can be properly utilized only with an operating system specifically optimized for it.
Hyper-Threading Technology is a form of simultaneous multithreading technology introduced by Intel, while the concept behind the technology has been patented by Sun Microsystems. Architecturally, a processor with Hyper-Threading Technology consists of two logical processors per core, each of which has its own processor architectural state. Each logical processor can be individually halted, interrupted or directed to execute a specified thread, independently from the other logical processor sharing the same physical core.
Unlike a traditional dual-processor configuration that uses two separate physical processors, the logical processors in a hyper-threaded core share the execution resources. These resources include the execution engine, caches, and system bus interface; the sharing of resources allows two logical processors to work with each other more efficiently, and allows a logical processor to borrow resources from a stalled logical core (assuming both logical cores are associated with the same physical core). A processor stalls when it is waiting for data it has sent for so it can finish processing the present thread. The degree of benefit seen when using a hyper-threaded or multi core processor depends on the needs of the software, and how well it and the operating system are written to manage the processor efficiently.
Hyper-threading works by duplicating certain sections of the processor—those that store the architectural state—but not duplicating the main execution resources. This allows a hyper-threading processor to appear as the usual "physical" processor and an extra "logical" processor to the host operating system (HTT-unaware operating systems see two "physical" processors), allowing the operating system to schedule two threads or processes simultaneously and appropriately. When execution resources would not be used by the current task in a processor without hyper-threading, and especially when the processor is stalled, a hyper-threading equipped processor can use those execution resources to execute another scheduled task. (The processor may stall due to a cache miss, branch misprediction, or data dependency.)
This technology is transparent to operating systems and programs. The minimum that is required to take advantage of hyper-threading is symmetric multiprocessing (SMP) support in the operating system, as the logical processors appear as standard separate processors.
It is possible to optimize operating system behavior on multi-processor hyper-threading capable systems. For example, consider an SMP system with two physical processors that are both hyper-threaded (for a total of four logical processors). If the operating system's thread scheduler is unaware of hyper-threading, it will treat all four logical processors the same. If only two threads are eligible to run, it might choose to schedule those threads on the two logical processors that happen to belong to the same physical processor; that processor would become extremely busy while the other would idle, leading to poorer performance than is possible by scheduling the threads onto different physical processors. This problem can be avoided by improving the scheduler to treat logical processors differently from physical processors; in a sense, this is a limited form of the scheduler changes that are required for NUMA systems.
The first published paper describing what is now known as hyper-threading in a general purpose computer was written by Edward S. Davidson and Leonard. E. Shar in 1973.
Denelcor, Inc. introduced multi-threading with the Heterogeneous Element Processor (HEP) in 1982. The HEP pipeline could not hold multiple instructions from the same process. Only one instruction from a given process was allowed to be present in the pipeline at any point in time. Should an instruction from a given process block the pipe, instructions from other processes would continue after the pipeline drained.
US patent for the technology behind hyper-threading was granted to Kenneth Okin at Sun Microsystems in November 1994. At that time, CMOS process technology was not advanced enough to allow for a cost-effective implementation.
Intel implemented hyper-threading on an x86 architecture processor in 2002 with the Foster MP-based Xeon. It was also included on the 3.06 GHz Northwood-based Pentium 4 in the same year, and then remained as a feature in every Pentium 4 HT, Pentium 4 Extreme Edition and Pentium Extreme Edition processor since. The Intel Core & Core 2 processor lines (2006) that succeeded the Pentium 4 model line didn't utilize hyper-threading. The processors based on the Core microarchitecture did not have hyper-threading because the Core microarchitecture was a descendant of the older P6 microarchitecture. The P6 microarchitecture was used in earlier iterations of Pentium processors, namely, the Pentium Pro, Pentium II and Pentium III (plus their Celeron & Xeon derivatives at the time).
Intel released the Nehalem microarchitecture (Core i7) in November 2008, in which hyper-threading made a return. The first generation Nehalem processors contained four physical cores and effectively scaled to eight threads. Since then, both two- and six-core models have been released, scaling four and twelve threads respectively. Earlier Intel Atom cores were in-order processors, sometimes with hyper-threading ability, for low power mobile PCs and low-price desktop PCs. The Itanium 9300 launched with eight threads per processor (two threads per core) through enhanced hyper-threading technology. The next model, the Itanium 9500 (Poulson), features a 12-wide issue architecture, with eight CPU cores with support for eight more virtual cores via hyper-threading. The Intel Xeon 5500 server chips also utilize two-way hyper-threading.
According to Intel, the first hyper-threading implementation used only 5% more die area than the comparable non-hyperthreaded processor, but the performance was 15–30% better. 4. Tom's Hardware states: "In some cases a P4 running at 3.0 GHz with HT on can even beat a P4 running at 3.6 GHz with HT turned off." Intel also claims significant performance improvements with a hyper-threading-enabled Pentium 4 processor in some artificial-intelligence algorithms.Intel claims up to a 30% performance improvement compared with an otherwise identical, non-simultaneous multithreading Pentium
Overall the performance history of hyper-threading was a mixed one in the beginning. As one commentary on high-performance computing from November 2002 notes:
Hyper-Threading can improve the performance of some MPI applications, but not all. Depending on the cluster configuration and, most importantly, the nature of the application running on the cluster, performance gains can vary or even be negative. The next step is to use performance tools to understand what areas contribute to performance gains and what areas contribute to performance degradation.
As a result, performance improvements are very application-dependent; 4 tying up valuable execution resources, equalizing the processor resources between the two programs, which adds a varying amount of execution time. The Pentium 4 "Prescott" and the Xeon "Nocona" processors received a replay queue that reduces execution time needed for the replay system and completely overcomes the performance penalty.however, when running two programs that require full attention of the processor, it can actually seem like one or both of the programs slows down slightly when Hyper-Threading Technology is turned on. This is due to the replay system of the Pentium
According to a November 2009 analysis by Intel, performance impacts of hyper-threading result in increased overall latency in case the execution of threads does not result in significant overall throughput gains, which varyby the application. In other words, overall processing latency is significantly increased due to hyper-threading, with the negative effects becoming smaller as there are more simultaneous threads that can effectively use the additional hardware resource utilization provided by hyper-threading. A similar performance analysis is available for the effects of hyper-threading when used to handle tasks related to managing network traffic, such as for processing interrupt requests generated by network interface controllers (NICs). Another paper claims no performance improvements when hyper-threading is used for interrupt handling.
When the first HT processors were released, many operating systems were not optimized for hyper-threading technology (e.g. Windows 2000 and Linux older than 2.4).
In 2006, hyper-threading was criticised for energy inefficiency.For example, specialist low-power CPU design company ARM stated that simultaneous multithreading can use up to 46% more power than ordinary dual-core designs. Furthermore, they claimed that SMT increases cache thrashing by 42%, whereas dual core results in a 37% decrease.
In 2010, ARM said it might include simultaneous multithreading in its future chips;however, this was rejected in favor of their 2012 64-bit design.
In 2013, Intel dropped SMT in favor of out-of-order execution for its Silvermont processor cores, as they found this gave better performance with better power efficiency than a lower number of cores with SMT.
In 2017, it was revealed Intel's Skylake and Kaby Lake processors had a bug with their implementation of hyper-threading that could cause data loss.Microcode updates were later released to address the issue.
In 2019, with Coffee Lake, Intel began to move away from including hyper-threading in mainstream Core i7 desktop processors except for highest-end Core i9 parts or Pentium Gold CPUs.It also started recommending disabling hyper-threading as new CPU vulnerability attacks were revealed which could be mitigated by disabling HT.
In May 2005, Colin Percival demonstrated that a malicious thread on a Pentium 4 can use a timing-based side-channel attack to monitor the memory access patterns of another thread with which it shares a cache, allowing the theft of cryptographic information. This is not actually a timing attack, as the malicious thread measures the time of only its own execution. Potential solutions to this include the processor changing its cache eviction strategy or the operating system preventing the simultaneous execution, on the same physical core, of threads with different privileges. In 2018 the OpenBSD operating system has disabled hyper-threading "in order to avoid data potentially leaking from applications to other software" caused by the Foreshadow/L1TF vulnerabilities. In 2019 a set of vulnerabilities led to security experts recommending the disabling of hyper-threading on all devices.
A superscalar processor is a CPU that implements a form of parallelism called instruction-level parallelism within a single processor. In contrast to a scalar processor that can execute at most one single instruction per clock cycle, a superscalar processor can execute more than one instruction during a clock cycle by simultaneously dispatching multiple instructions to different execution units on the processor. It therefore allows for more throughput than would otherwise be possible at a given clock rate. Each execution unit is not a separate processor, but an execution resource within a single CPU such as an arithmetic logic unit.
Pentium 4 is a series of single-core CPUs for desktops, laptops and entry-level servers manufactured by Intel. The processors were shipped from November 20, 2000 until August 8, 2008. The production of Netburst processors was active from 2000 until May 21, 2010.
Tejas was a code name for Intel's microprocessor, which was to be a successor to the latest Pentium 4 with the Prescott core and was sometimes referred to as Pentium V. Jayhawk was a code name for its Xeon counterpart. The cancellation of the processors in May 2004 underscored Intel's historical transition of its focus on single-core processors to multi-core processors.
Xeon is a brand of x86 microprocessors designed, manufactured, and marketed by Intel, targeted at the non-consumer workstation, server, and embedded system markets. It was introduced in June 1998. Xeon processors are based on the same architecture as regular desktop-grade CPUs, but have advanced features such as support for ECC memory, higher core counts, more PCI Express lanes, support for larger amounts of RAM, larger cache memory and extra provision for enterprise-grade reliability, availability and serviceability (RAS) features responsible for handling hardware exceptions through the Machine Check Architecture. They are often capable of safely continuing execution where a normal processor cannot due to these extra RAS features, depending on the type and severity of the machine-check exception (MCE). Some also support multi-socket systems with two, four, or eight sockets through use of the Quick Path Interconnect (QPI) bus.
Simultaneous multithreading (SMT) is a technique for improving the overall efficiency of superscalar CPUs with hardware multithreading. SMT permits multiple independent threads of execution to better utilize the resources provided by modern processor architectures.
The NetBurst microarchitecture, called P68 inside Intel, was the successor to the P6 microarchitecture in the x86 family of central processing units (CPUs) made by Intel. The first CPU to use this architecture was the Willamette-core Pentium 4, released on November 20, 2000 and the first of the Pentium 4 CPUs; all subsequent Pentium 4 and Pentium D variants have also been based on NetBurst. In mid-2004, Intel released the Foster core, which was also based on NetBurst, thus switching the Xeon CPUs to the new architecture as well. Pentium 4-based Celeron CPUs also use the NetBurst architecture.
The POWER5 is a microprocessor developed and fabricated by IBM. It is an improved version of the POWER4. The principal improvements are support for simultaneous multithreading (SMT) and an on-die memory controller. The POWER5 is a dual-core microprocessor, with each core supporting one physical thread and two logical threads, for a total of two physical threads and four logical threads.
The Pentium D brand refers to two series of desktop dual-core 64-bit x86-64 microprocessors with the NetBurst microarchitecture, which is the dual-core variant of Pentium 4 "Prescott" manufactured by Intel. Each CPU comprised two dies, each containing a single core, residing next to each other on a multi-chip module package. The brand's first processor, codenamed Smithfield, was released by Intel on May 25, 2005. Nine months later, Intel introduced its successor, codenamed Presler, but without offering significant upgrades in design, still resulting in relatively high power consumption. By 2004, the NetBurst processors reached a clock speed barrier at 3.8 GHz due to a thermal limit exemplified by the Presler's 130 watt thermal design power. The future belonged to more energy efficient and slower clocked dual-core CPUs on a single die instead of two. The final shipment date of the dual die Presler chips was August 8, 2008, which marked the end of the Pentium D brand and also the NetBurst microarchitecture.
The P6 microarchitecture is the sixth-generation Intel x86 microarchitecture, implemented by the Pentium Pro microprocessor that was introduced in November 1995. It is frequently referred to as i686. It was succeeded by the NetBurst microarchitecture in 2000, but eventually revived in the Pentium M line of microprocessors. The successor to the Pentium M variant of the P6 microarchitecture is the Core microarchitecture which in turn is also derived from the P6 microarchitecture.
The Intel Core microarchitecture is a multi-core processor microarchitecture unveiled by Intel in Q1 2006. It is based on the Yonah processor design and can be considered an iteration of the P6 microarchitecture introduced in 1995 with Pentium Pro. High power consumption and heat intensity, the resulting inability to effectively increase clock rate, and other shortcomings such as an inefficient pipeline were the primary reasons why Intel abandoned the NetBurst microarchitecture and switched to a different architectural design, delivering high efficiency through a small pipeline rather than high clock rates. The Core microarchitecture initially did not reach the clock rates of the NetBurst microarchitecture, even after moving to 45 nm lithography. However after many generations of successor microarchitectures which used Core as their basis, Intel managed to eventually surpass the clock rates of Netburst with the Devil's Canyon microarchitecture reaching a base frequency of 4 GHz and a maximum tested frequency of 4.4 GHz using 22 nm lithography.
The replay system is a subsystem within the Intel Pentium 4 processor. Its primary function is to catch operations that have been mistakenly sent for execution by the processor's scheduler. Operations caught by the replay system are then re-executed in a loop until the conditions necessary for their proper execution have been fulfilled.
Pentium is a brand used for a series of x86 architecture-compatible microprocessors produced by Intel since 1993. In their form as of November 2011, Pentium processors are considered entry-level products that Intel rates as "two stars", meaning that they are above the low-end Atom and Celeron series, but below the faster Intel Core lineup, and workstation Xeon series.
In computer architecture, multithreading is the ability of a central processing unit (CPU) to provide multiple threads of execution concurrently, supported by the operating system. This approach differs from multiprocessing. In a multithreaded application, the threads share the resources of a single or multiple cores, which include the computing units, the CPU caches, and the translation lookaside buffer (TLB).
The AMD Bulldozer Family 15h is a microprocessor microarchitecture for the FX and Opteron line of processors, developed by AMD for the desktop and server markets. Bulldozer is the codename for this family of microarchitectures. It was released on October 12, 2011, as the successor to the K10 microarchitecture.
Haswell is the codename for a processor microarchitecture developed by Intel as the "fourth-generation core" successor to the Ivy Bridge. Intel officially announced CPUs based on this microarchitecture on June 4, 2013, at Computex Taipei 2013, while a working Haswell chip was demonstrated at the 2011 Intel Developer Forum. With Haswell, which uses a 22 nm process, Intel also introduced low-power processors designed for convertible or "hybrid" ultrabooks, designated by the "U" suffix.
Bloomfield is the code name for Intel high-end desktop processors sold as Core i7-9xx and single-processor servers sold as Xeon 35xx., in almost identical configurations, replacing the earlier Yorkfield processors. The Bloomfield core is closely related to the dual-processor Gainestown, which has the same CPUID value of 0106Ax and which uses the same socket. Bloomfield uses a different socket than the later Lynnfield and Clarksfield processors based on the same 45 nm Nehalem microarchitecture, even though some of these share the same Intel Core i7 brand.
Intel Core are streamlined midrange consumer, workstation and enthusiast computers central processing units (CPU) marketed by Intel Corporation. These processors displaced the existing mid- to high-end Pentium processors at the time of their introduction, moving the Pentium to the entry level. Identical or more capable versions of Core processors are also sold as Xeon processors for the server and workstation markets.
Skylake is the codename used by Intel for a processor microarchitecture that was launched in August 2015 succeeding the Broadwell microarchitecture. Skylake is a microarchitecture redesign using the same 14 nm manufacturing process technology as its predecessor, serving as a "tock" in Intel's "tick–tock" manufacturing and design model. According to Intel, the redesign brings greater CPU and GPU performance and reduced power consumption. Skylake CPUs share their microarchitecture with Kaby Lake, Coffee Lake, Cannon Lake, Whiskey Lake, and Comet Lake CPUs.
Per-cpu load can be observed using the mpstat utility, but note that on processors with hyperthreading (HT), each hyperthread is represented as a separate CPU. For interrupt handling, HT has shown no benefit in initial tests, so limit the number of queues to the number of CPU cores in the system.