NEC SX-Aurora TSUBASA

NEC SX-Aurora TSUBASA A300-8 server with eight vector engines on display at the NEC booth at SC'17 in Denver

The NEC SX-Aurora TSUBASA is a vector processor of the NEC SX architecture family. [1] [2] Unlike previous SX supercomputers, the SX-Aurora TSUBASA is provided as a PCIe card, which NEC calls a "Vector Engine" (VE). [2] Up to eight VE cards can be inserted into a vector host (VH), which is typically an x86-64 server running the Linux operating system. [2] The product was announced in a press release on 25 October 2017, and NEC started selling it in February 2018. [3] It succeeds the SX-ACE.

Hardware

SX-Aurora TSUBASA is a successor to the NEC SX series and its SUPER-UX operating system, the vector computer platform on which the Earth Simulator supercomputer is based. Its hardware consists of x86 Linux hosts with vector engines (VEs) connected via a PCI Express (PCIe) interconnect. [4]

Each VE combines eight cores and six HBM2 memory modules on a silicon interposer in the form factor of a PCIe card, which yields a high memory bandwidth of 0.75–1.2 TB/s. [5] Operating system functionality for the VE is offloaded to the VH and handled mainly by the user space daemons of VEOS. [6]

Each VE CPU has eight cores and, depending on the clock frequency (1.4 or 1.6 GHz), a peak performance of 2.15 or 2.45 TFLOPS in double precision. The processor has the world's first implementation of six HBM2 modules on a silicon interposer, with a total of 24 or 48 GB of high-bandwidth memory. It is integrated in the form factor of a standard full-length, full-height, double-width PCIe card that is hosted by an x86-64 server, the Vector Host (VH). A server can host up to eight VEs, and clusters of VHs can scale to an arbitrary number of nodes. [1] [7] [2]
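As a rough illustration of how these figures scale, the following Python sketch multiplies the per-VE numbers quoted above by the number of VEs per host and the number of hosts. The helper name and the choice of the 1.6 GHz, 48 GB model as the default are assumptions made for this example, and the results are theoretical peaks rather than measured performance.

```python
# Aggregate peak figures for a cluster of vector hosts (VHs).
# Per-VE defaults are the 1.6 GHz / 48 GB figures quoted in the text;
# cluster_peak is a hypothetical helper, not an NEC tool.

MAX_VES_PER_VH = 8  # a VH hosts up to eight VEs


def cluster_peak(num_vhs, ves_per_vh=MAX_VES_PER_VH,
                 ve_tflops=2.45, ve_memory_gb=48):
    """Return (peak double-precision TFLOPS, total HBM2 capacity in GB)."""
    if not 1 <= ves_per_vh <= MAX_VES_PER_VH:
        raise ValueError("a VH hosts between 1 and 8 VEs")
    num_ves = num_vhs * ves_per_vh
    return num_ves * ve_tflops, num_ves * ve_memory_gb


if __name__ == "__main__":
    tflops, memory_gb = cluster_peak(num_vhs=1)
    print(f"one fully populated VH: {tflops:.1f} TFLOPS, {memory_gb} GB")
```

Calling `cluster_peak(num_vhs=1)` describes a single fully populated host; passing a larger `num_vhs` models a cluster of identical hosts.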

Product releases

Version 2 Vector Engine [8]

| SKU | 20A | 20B |
| --- | --- | --- |
| Clock speed (GHz) | 1.6 | 1.6 |
| Number of cores | 10 | 8 |
| Core peak performance (double precision, GFLOPS) | 307 | 307 |
| Core peak performance (single precision, GFLOPS) | 614 | 614 |
| CPU peak performance (double precision, TFLOPS) | 3.07 | 2.45 |
| CPU peak performance (single precision, TFLOPS) | 6.14 | 4.91 |
| Memory bandwidth (TB/s) | 1.53 | 1.53 |
| Memory capacity (GB) | 48 | 48 |

Version 1 Vector Engine

Version 1.0 of the Vector Engine was produced in a 16 nm FinFET process (from TSMC) and released in three SKUs; subsequent revisions add an E at the end of the SKU name: [9]

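| SKU | 10A | 10B | 10C | 10AE | 10BE | 10CE |
| --- | --- | --- | --- | --- | --- | --- |
| Clock speed (GHz) | 1.6 | 1.4 | 1.4 | 1.584 | 1.408 | 1.400 |
| Number of cores | 8 | 8 | 8 | 8 | 8 | 8 |
| Core peak performance (double precision, GFLOPS) | 307.2 | 268.8 | 268.8 | 304 | 270 | 268 |
| Core peak performance (single precision, GFLOPS) |  | 537 | 537 | 608 | 540 | 537 |
| CPU peak performance (double precision, TFLOPS) | 2.45 | 2.15 | 2.15 | 2.43 | 2.16 | 2.15 |
| CPU peak performance (single precision, TFLOPS) | 4.9 | 4.3 | 4.3 | 4.86 | 4.32 | 4.30 |
| Memory bandwidth (TB/s) | 1.2 | 1.2 | 0.75 | 1.35 | 1.35 | 1.00 |
| Memory capacity (GB) | 48 | 48 | 24 | 48 | 48 | 24 |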

Functional units

Each of the eight SX-Aurora cores has 64 logical vector registers. [10] Each register holds 256 elements of 64 bits, and vector operations are implemented as a mix of pipelining and 32-fold parallel SIMD units. The registers are connected to three fused multiply-add (FMA) floating-point units that can run in parallel, two arithmetic logic units (ALUs) handling fixed-point operations, and a divide and square root pipe. [10] Considering only the FMA units and their 32-fold SIMD parallelism, a vector core is capable of 192 double-precision operations per cycle (3 FMA units × 32 lanes × 2 operations per FMA). [10] In "packed" vector operations, where two single-precision values are stored in the space of one double-precision slot in the vector registers, the vector unit delivers twice as many operations per clock cycle as in double precision.
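The per-cycle figure can be related to the peak rates listed for the individual models. The sketch below assumes that each FMA counts as two floating-point operations and that peak performance is simply operations per cycle times clock frequency times core count; the function names are illustrative only.

```python
# Peak floating-point rate derived from the functional-unit counts in the text.
# Assumption: one FMA counts as two operations (a multiply and an add).

FMA_UNITS = 3      # FMA pipes per core
SIMD_LANES = 32    # 32-fold parallel SIMD
OPS_PER_FMA = 2    # multiply + add


def core_ops_per_cycle(packed_single=False):
    """Floating-point operations one core can issue per clock cycle."""
    ops = FMA_UNITS * SIMD_LANES * OPS_PER_FMA
    # Packed mode stores two single-precision values per 64-bit slot,
    # which doubles the number of operations per cycle.
    return ops * 2 if packed_single else ops


def peak_gflops(clock_ghz, cores=1, packed_single=False):
    """Theoretical peak in GFLOPS for the given clock and core count."""
    return core_ops_per_cycle(packed_single) * clock_ghz * cores


if __name__ == "__main__":
    print(core_ops_per_cycle())                      # operations per cycle, double precision
    print(f"{peak_gflops(1.6):.1f} GFLOPS per core at 1.6 GHz")
    print(f"{peak_gflops(1.6, cores=8) / 1000:.2f} TFLOPS for eight cores at 1.6 GHz")
    print(f"{peak_gflops(1.4, cores=8) / 1000:.2f} TFLOPS for eight cores at 1.4 GHz")
```

Under these assumptions an eight-core VE at 1.6 GHz comes out at 2457.6 GFLOPS and one at 1.4 GHz at 2150.4 GFLOPS, in line with the 2.45 and 2.15 TFLOPS quoted for the version 1 models.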

A Scalar Processing Unit (SPU) handles non-vector instructions on each of the cores.

Memory and caches

The memory of the SX-Aurora TSUBASA processor consists of six second-generation high-bandwidth memory (HBM2) modules implemented in the same package as the CPU using Chip-on-Wafer-on-Substrate technology. Depending on the processor model, the HBM2 modules are 3D stacks of either 4 or 8 dies, with a capacity of 4 or 8 GB each. The SX-Aurora CPUs thus have either 24 GB or 48 GB of HBM2 memory. The models implemented with the large HBM2 modules have 1.2 TB/s memory bandwidth. [11]
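The capacity figures follow directly from the module count. The short sketch below reproduces them and also estimates the bandwidth share of a single module, assuming that the total bandwidth is spread evenly over the six modules and that 1 TB/s equals 1000 GB/s; both assumptions and the function names are made for this example.

```python
# Memory figures per VE derived from the HBM2 module count in the text.

HBM2_MODULES = 6  # modules per VE


def total_capacity_gb(gb_per_module):
    """Total HBM2 capacity of one VE in GB."""
    return HBM2_MODULES * gb_per_module


def bandwidth_per_module_gbs(total_bandwidth_tbs):
    """Per-module bandwidth in GB/s, assuming an even split (1 TB/s = 1000 GB/s)."""
    return total_bandwidth_tbs * 1000 / HBM2_MODULES


if __name__ == "__main__":
    for gb in (4, 8):
        print(f"{gb} GB modules: {total_capacity_gb(gb)} GB per VE")
    print(f"{bandwidth_per_module_gbs(1.2):.0f} GB/s per module at 1.2 TB/s")
```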

The cores of a vector engine share 16 MB of last-level cache (LLC), a write-back cache directly connected to the vector registers and to the L2 cache of the SPU. The LLC cache line size is 128 bytes. The priority of data retention in the LLC can be controlled in software to some extent, which allows the programmer to specify which variables or arrays should be retained in cache. This feature is comparable to the Assignable Data Buffer (ADB) of the NEC SX-ACE.
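To put the cache geometry in relation to the vector registers, the following sketch counts cache lines. It assumes that "16 MB" means 16 MiB and that a vector register of 256 elements of 64 bits is loaded from contiguous, line-aligned memory; the names are illustrative.

```python
# Cache-line arithmetic for the LLC and a single vector register.
# Assumption: 16 MB is taken as 16 MiB (16 * 2**20 bytes).

LLC_BYTES = 16 * 2**20
CACHE_LINE_BYTES = 128
VREG_ELEMENTS = 256
VREG_ELEMENT_BYTES = 8  # 64 bits per element


def llc_lines():
    """Number of cache lines in the LLC."""
    return LLC_BYTES // CACHE_LINE_BYTES


def vreg_bytes():
    """Size of one full vector register in bytes."""
    return VREG_ELEMENTS * VREG_ELEMENT_BYTES


def lines_per_vreg():
    """Cache lines touched by one contiguous, aligned full-register load."""
    return vreg_bytes() // CACHE_LINE_BYTES


if __name__ == "__main__":
    print(llc_lines(), "cache lines in the LLC")
    print(vreg_bytes(), "bytes per vector register")
    print(lines_per_vreg(), "cache lines per full vector load")
```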

Platforms

NEC is currently selling the SX-Aurora TSUBASA vector engine integrated into four platforms: [12] [9]

Within a VH node, VEs can communicate with each other through PCIe. Large parallel systems built with SX-Aurora use InfiniBand in a PeerDirect setup as the interconnect.

NEC also used to sell the SX-Aurora TSUBASA vector engine integrated into five platforms:

All types are air cooled only, with the exception of the A500 series, which also uses water cooling.

Software

Operating system

The operating system of the vector engine (VE) is called "VEOS", and has been offloaded entirely to the host system, the vector host (VH). [14] VEOS consists of kernel modules and user space daemons that:

VEOS supports multitasking on the VE, and almost all Linux system calls are supported in the VE libc. [15] Offloading operating system services to the VH shifts OS jitter away from the VE at the expense of increased latencies. [15] All VE operating system packages are licensed under the GNU General Public License and are published at github.com/veos-sxarr-nec.

Software development

A Software Development Kit is available from NEC for developers and customers. It contains proprietary products and must be purchased from NEC. The SDK contains:

NEC MPI is also proprietary and conforms to the MPI-3.1 standard. [19]

Hybrid programs can be created that use the VE as an accelerator for certain host kernel functions by using the VE offloading (VEO) C API. [20] To some extent, VE offloading is comparable to OpenCL and CUDA, but it provides a simpler API and allows the kernels to be developed in normal C, C++ or Fortran and to use almost any syscall on the VE. [citation needed] Python bindings to VEO are available at github.com/SX-Aurora/py-veo.

Comparison of Mathematical Functions

Compared libraries: NLC (note 1), MKL, CUDA
Linear Algebra: Dense Matrix
Sparse Matrix
Function Transform: Fourier
Real-to-Real (DCT, …)
Laplace, Wavelet, …
Statistics: Random Number Generator (✓ w/o MPI; ✓ w/o MPI)
Multivariate, Regression, …
Other: Sorting
Special Functions
Integrals, Derivatives, …
Stencil Code
Deep Learning (✗, planned)

Note 1: The NEC Numerical Library Collection (NLC) is a collection of mathematical libraries that supports the development of numerical simulation programs.


References

  1. "NEC SX-Aurora TSUBASA - Vector Engine". www.nec.com. Retrieved 2018-03-20.
  2. Morgan, Timothy Prickett (2017-10-27). "Can Vector Supercomputing Be Revived?". The Next Platform.
  3. "NEC releases new high-end HPC product line, SX-Aurora TSUBASA". NEC. Retrieved 2018-03-21.
  4. Imai, Teruyuki (2019), Gerofi, Balazs; Ishikawa, Yutaka; Riesen, Rolf; Wisniewski, Robert W. (eds.), "NEC Earth Simulator and the SX-Aurora TSUBASA", Operating Systems for Supercomputers and High Performance Computing, High-Performance Computing Series, vol. 1, Singapore: Springer, pp. 139–160, doi:10.1007/978-981-13-6624-6_9, ISBN 978-981-13-6624-6, S2CID 204811906
  5. Morgan, Timothy Prickett (2017-11-22). "A Deep Dive Into NEC's Aurora Vector Engine". The Next Platform. Retrieved 2020-07-02.
  6. Focht, Erich. "First steps with the SX-Aurora TSUBASA vector engine". sx-aurora.github.io. Retrieved 2020-07-02.
  7. "SX-Aurora TSUBASA Brochure".
  8. "NEC Vector Engine Models". www.nec.com. Retrieved 2020-09-15.
  9. "SX-Aurora TSUBASA" (PDF). NEC Corporation. February 2020.
  10. "NEC SX-Aurora TSUBASA Architecture". www.nec.com. Retrieved 2018-03-20.
  11. "SX-Aurora - Microarchitectures - NEC - WikiChip". en.wikichip.org. Retrieved 2020-07-02.
  12. "NEC SX-Aurora TSUBASA".
  13. "NEC SX-Aurora TSUBASA A500-64". www.nec.com.
  14. "NEC SX Aurora TSUBASA — VSC documentation 1.0 documentation". vlaams-supercomputing-centrum-vscdocumentation.readthedocs-hosted.com. Retrieved 2020-07-02.
  15. "A Look at NEC's Latest Vector Processor, the SX-Aurora". WikiChip Fuse. 2018-12-09. Retrieved 2020-08-27.
  16. "NEC SX Aurora TSUBASA — VSC documentation 1.0 documentation". vlaams-supercomputing-centrum-vscdocumentation.readthedocs-hosted.com. Retrieved 2020-08-27.
  17. "NEC SX-Aurora TSUBASA Documentation".
  18. "NEC SX-Aurora TSUBASA Vector System". Rechenzentrum der CAU. Retrieved 2020-08-27.
  19. "NEC MPI User's Guide".
  20. "SX-Aurora/veoffload". GitHub. Retrieved 2018-03-21.