Ne-XVP

Ne-XVP was a research project carried out between 2006 and 2008 at NXP Semiconductors. The project took a holistic approach to defining a next-generation multimedia processing architecture for embedded MPSoCs, targeting programmability, performance scalability, and silicon efficiency in an evolutionary way. Evolutionary here means building on existing processor cores, such as the NXP TriMedia, and supporting industry programming standards such as POSIX threads. Based on technology-aware design space exploration, the project concluded that hardware accelerators for task management and cache coherency, combined with the right dimensioning of the compute cores, deliver good programmability, scalable performance, and competitive silicon efficiency.
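
The emphasis on POSIX threads implies an ordinary shared-memory threading model for application code. The sketch below is a minimal, hypothetical illustration of that model, not Ne-XVP code: each worker thread brightens one horizontal slice of a video frame, the kind of data-parallel decomposition the architecture was intended to run efficiently. The frame size, worker count and kernel are assumptions made for the example.

    /* Hypothetical data-parallel frame filter using POSIX threads.
     * Illustrative only; not taken from the Ne-XVP code base. */
    #include <pthread.h>
    #include <stdint.h>

    #define WIDTH   1920
    #define HEIGHT  1080
    #define WORKERS 4          /* assumed number of cores */

    static uint8_t frame[HEIGHT][WIDTH];

    typedef struct { int first_row, last_row; } slice_t;

    static void *brighten_slice(void *arg)
    {
        slice_t *s = (slice_t *)arg;
        for (int y = s->first_row; y < s->last_row; y++)
            for (int x = 0; x < WIDTH; x++) {
                int v = frame[y][x] + 16;          /* simple brightness boost */
                frame[y][x] = v > 255 ? 255 : v;   /* saturate to 8 bits */
            }
        return NULL;
    }

    int main(void)
    {
        pthread_t tid[WORKERS];
        slice_t   slice[WORKERS];
        int rows = HEIGHT / WORKERS;

        for (int i = 0; i < WORKERS; i++) {
            slice[i].first_row = i * rows;
            slice[i].last_row  = (i == WORKERS - 1) ? HEIGHT : (i + 1) * rows;
            pthread_create(&tid[i], NULL, brighten_slice, &slice[i]);
        }
        for (int i = 0; i < WORKERS; i++)
            pthread_join(tid[i], NULL);
        return 0;
    }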

NXP Semiconductors

NXP Semiconductors N.V. is a Dutch global semiconductor manufacturer headquartered in Eindhoven, Netherlands. The company employs approximately 31,000 people in more than 35 countries, including 11,200 engineers in 33 countries. NXP reported revenue of $6.1 billion in 2015, including one month of revenue contribution from the recently merged Freescale Semiconductor.

Multimedia

Multimedia is content that uses a combination of different content forms such as text, audio, images, animations, video and interactive content. Multimedia contrasts with media that use only rudimentary computer displays such as text-only or traditional forms of printed or hand-produced material.

Central processing unit

A central processing unit (CPU), also called a central processor or main processor, is the electronic circuitry within a computer that carries out the instructions of a computer program by performing the basic arithmetic, logic, controlling, and input/output (I/O) operations specified by the instructions. The computer industry has used the term "central processing unit" at least since the early 1960s. Traditionally, the term "CPU" refers to a processor, more specifically to its processing unit and control unit (CU), distinguishing these core elements of a computer from external components such as main memory and I/O circuitry.

Research

Ne-XVP architecture at the end of 2008. Two different core types core1 and core2 are used to construct a multicore processor. To increase performance density the multicore is supported by several accelerators for inter-thread synchronization and communication. For example, the Hardware Task Scheduler can schedule tasks for many complex multimedia applications, and the cache coherence coprocessors enable inter-thread communication via shared memory.
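
The publications describe the Hardware Task Scheduler as handing out fine-grained tasks to the cores, but the exact hardware interface is not reproduced here. The C sketch below only mimics that behavior in software with an atomically incremented task counter; all names (next_task, process_macroblock, the task and worker counts) are hypothetical.

    /* Software mock-up of a task pool; in Ne-XVP a Hardware Task Scheduler
     * would hand out task indices instead of this atomic counter.
     * All names are hypothetical. */
    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>

    #define NUM_TASKS 8160     /* e.g. macroblocks in a 1080p H.264 frame */
    #define WORKERS   4

    static atomic_int next_task = 0;

    static void process_macroblock(int id)
    {
        (void)id;              /* placeholder for real decoding work */
    }

    static void *worker(void *arg)
    {
        (void)arg;
        for (;;) {
            int t = atomic_fetch_add(&next_task, 1);   /* grab the next task index */
            if (t >= NUM_TASKS)
                break;
            process_macroblock(t);
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t tid[WORKERS];
        for (int i = 0; i < WORKERS; i++)
            pthread_create(&tid[i], NULL, worker, NULL);
        for (int i = 0; i < WORKERS; i++)
            pthread_join(tid[i], NULL);
        printf("processed %d tasks\n", NUM_TASKS);
        return 0;
    }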

Ne-XVP's research subjects and corresponding publications:

  1. Asymmetric multicore architecture with generic accelerators [1]
  2. Hardware multithreading in VLIWs [2]
  3. Low-complexity cache coherence [1]
  4. Hardware accelerators for task scheduling and synchronization:
    1. A Hardware Task Scheduler [3]
    2. Hardware Synchronization Unit to synchronize threads [1] [2] (see the barrier sketch after this list)
    3. Task Management Unit [4]
  5. Instruction cache sharing [1]
  6. Design Space Exploration with Performance Density as the optimization function [1]
  7. Technology modeling for embedded processors [1] [5] [6]
  8. Parallelization of complex multimedia algorithms (H.264, Frame Rate Conversion) [7] [8] [9] [10]
  9. Auto-parallelizing compilers
  10. Time-aware programming languages in cooperation with the ACOTES project [11]
  11. Visual programming
  12. Task-level speculation
  13. Porting GCC to Exposed Pipeline VLIW Processors [12]
  14. Multiprogram workload for embedded processing
  15. A 1-GHz embedded VLIW processor
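
As referenced in item 4.2, a rough software analogue of what a Hardware Synchronization Unit accelerates is a barrier between pipeline stages. The pthreads sketch below shows only the synchronization pattern, with assumed stage names and worker count; the real unit would provide the rendezvous in hardware rather than through a software barrier.

    /* Pthreads barrier between the stages of a hypothetical two-stage
     * video pipeline; a Hardware Synchronization Unit would provide the
     * same rendezvous in hardware. Illustrative sketch only. */
    #include <pthread.h>
    #include <stdio.h>

    #define WORKERS 4

    static pthread_barrier_t stage_done;

    static void *pipeline_worker(void *arg)
    {
        long id = (long)arg;
        printf("worker %ld: stage 1 (e.g. motion estimation)\n", id);
        pthread_barrier_wait(&stage_done);   /* wait until all workers finish stage 1 */
        printf("worker %ld: stage 2 (e.g. motion compensation)\n", id);
        return NULL;
    }

    int main(void)
    {
        pthread_t tid[WORKERS];
        pthread_barrier_init(&stage_done, NULL, WORKERS);
        for (long i = 0; i < WORKERS; i++)
            pthread_create(&tid[i], NULL, pipeline_worker, (void *)i);
        for (int i = 0; i < WORKERS; i++)
            pthread_join(tid[i], NULL);
        pthread_barrier_destroy(&stage_done);
        return 0;
    }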

Project members

Ne-XVP team at the end of 2008. (left-to-right, top-to-bottom) Surendra Guntur, Jan Hoogerbrugge, Ghiath Al-Kadi, Marc Duranton, Andrei Terechko, Anirban Lahiri.

Related Research Articles

Superscalar processor

A superscalar processor is a CPU that implements a form of parallelism called instruction-level parallelism within a single processor. In contrast to a scalar processor, which can execute at most one single instruction per clock cycle, a superscalar processor can execute more than one instruction during a clock cycle by simultaneously dispatching multiple instructions to different execution units on the processor. It therefore allows more throughput than would otherwise be possible at a given clock rate. Each execution unit is not a separate processor, but an execution resource within a single CPU, such as an arithmetic logic unit.

Very long instruction word (VLIW) refers to instruction set architectures designed to exploit instruction level parallelism (ILP). Whereas conventional central processing units mostly allow programs to specify instructions to execute in sequence only, a VLIW processor allows programs to explicitly specify instructions to execute in parallel. This design is intended to allow higher performance without the complexity inherent in some other designs.

Digital signal processor

A digital signal processor (DSP) is a specialized microprocessor, with its architecture optimized for the operational needs of digital signal processing.

Explicitly parallel instruction computing (EPIC) is a term coined in 1997 by the HP–Intel alliance to describe a computing paradigm that researchers had been investigating since the early 1980s. This paradigm is also called Independence architectures. It was the basis for Intel and HP development of the Intel Itanium architecture, and HP later asserted that "EPIC" was merely an old term for the Itanium architecture. EPIC permits microprocessors to execute software instructions in parallel by using the compiler, rather than complex on-die circuitry, to control parallel instruction execution. This was intended to allow simple performance scaling without resorting to higher clock frequencies.

Multiflow Computer, Inc., founded in April 1984 near New Haven, Connecticut, USA, was a manufacturer and seller of minisupercomputer hardware and software embodying the VLIW design style. Multiflow, incorporated in Delaware, ended operations in March 1990, after selling about 125 VLIW minisupercomputers in the United States, Europe, and Japan.

In computing, hardware acceleration is the use of computer hardware specially made to perform some functions more efficiently than is possible in software running on a general-purpose CPU. Any transformation of data or routine that can be computed can be calculated purely in software running on a generic CPU, purely in custom-made hardware, or in some mix of both. An operation can often be computed faster in application-specific hardware designed or programmed for it than in software running on a general-purpose processor. Each approach has advantages and disadvantages; implementing computing tasks in hardware to decrease latency and increase throughput is what is known as hardware acceleration.

In computer architecture, a transport triggered architecture (TTA) is a kind of processor design in which programs directly control the internal transport buses of a processor. Computation happens as a side effect of data transports: writing data into a triggering port of a functional unit triggers the functional unit to start a computation. This is similar to what happens in a systolic array. Due to its modular structure, TTA is an ideal processor template for application-specific instruction-set processors (ASIP) with customized datapath but without the inflexibility and design cost of fixed function hardware accelerators.

The Fujitsu FR-V is one of the very few processors ever able to execute both very long instruction word (VLIW) and vector processor instructions at the same time, increasing throughput through highly parallel computing while improving performance per watt and hardware efficiency. The family was presented in 1999. Its design was influenced by the VPP500/5000 models of the Fujitsu VP/2000 vector processor supercomputer line.

Multi-core processor

A multi-core processor is a single computing component with two or more independent processing units called cores, which read and execute program instructions. The instructions are ordinary CPU instructions but the single processor can run multiple instructions on separate cores at the same time, increasing overall speed for programs amenable to parallel computing. Manufacturers typically integrate the cores onto a single integrated circuit die or onto multiple dies in a single chip package. The microprocessors currently used in almost all personal computers are multi-core.

Scratchpad memory (SPM), also known as scratchpad, scratchpad RAM or local store in computer terminology, is a high-speed internal memory used for temporary storage of calculations, data, and other work in progress. In reference to a microprocessor ("CPU"), scratchpad refers to a special high-speed memory circuit used to hold small items of data for rapid retrieval. It is similar to the usage and size of a scratchpad in life: a pad of paper for preliminary notes or sketches or writings, etc.

No instruction set computing (NISC) is a computing architecture and compiler technology for designing highly efficient custom processors and hardware accelerators by allowing a compiler to have low-level control of hardware resources.

Intel Tera-Scale is a research program by Intel that focuses on development in Intel processors and platforms that utilize the inherent parallelism of emerging visual-computing applications. Such applications require teraFLOPS of parallel computing performance to process terabytes of data quickly. Parallelism is the concept of performing multiple tasks simultaneously. Utilizing parallelism not only increases the efficiency of computer processing units (CPUs), but also increases the amount of data analyzed each second. In order to apply parallelism appropriately, the CPU must be able to handle multiple threads, and to do so it must consist of multiple cores. The conventional number of cores in consumer-grade computers is 2–8, while workstation-grade computers can have even more. However, even this number of cores is not enough to reach teraFLOPS performance, so even more cores must be added. As a result of the program, two prototypes were manufactured and used to test the feasibility of having many more cores than the conventional amount, and they proved successful.

Stream Processors, Inc was a Silicon Valley-based fabless semiconductor company specializing in the design and manufacture of high-performance digital signal processors for applications including video surveillance, multi-function printers and video conferencing. The company ceased operations in 2009.

TriMedia (mediaprocessor)

TriMedia is a family of very long instruction word media processors from NXP Semiconductors. TriMedia is a Harvard architecture CPU that features many DSP and SIMD operations to efficiently process audio and video data streams. Unlike most other VLIW/DSP processors, which require assembly language programming to achieve optimal performance, TriMedia processors can be programmed for optimal performance entirely in C/C++. The high-level programmability of TriMedia relies on the large uniform register file and the orthogonal instruction set, in which RISC-like operations can be scheduled independently of each other in the VLIW issue slots. Furthermore, TriMedia processors boast advanced caches supporting unaligned accesses without performance penalty, hardware and software data/instruction prefetch, allocate-on-write-miss, as well as collapsed load operations combining a traditional load with a 2-tap filter function. TriMedia development has been supported by various research studies on hardware cache coherency, multithreading and diverse accelerators to build scalable shared memory multiprocessor systems.
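
As a rough illustration of the C-level programmability described above, the loop below mixes two audio buffers with saturation; independent iterations like these are what a VLIW compiler can schedule across issue slots without hand-written assembly. The code is generic C written for this article, not TriMedia library code, and the buffer size and function names are assumptions.

    /* Generic C mix of two 16-bit audio buffers with saturation.
     * Independent iterations let a VLIW compiler fill issue slots;
     * illustrative only, not TriMedia-specific code. */
    #include <stdint.h>

    #define SAMPLES 1024

    static int16_t saturate16(int32_t v)
    {
        if (v >  32767) return  32767;
        if (v < -32768) return -32768;
        return (int16_t)v;
    }

    void mix_buffers(const int16_t *a, const int16_t *b, int16_t *out)
    {
        for (int i = 0; i < SAMPLES; i++)
            out[i] = saturate16((int32_t)a[i] + (int32_t)b[i]);
    }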

QorIQ

QorIQ is a brand of ARM-based and Power ISA-based communications microprocessors from NXP Semiconductors. It is the evolutionary step from the PowerQUICC platform; initial products were built around one or more e500mc cores and came in five product platforms, P1, P2, P3, P4 and P5, segmented by performance and functionality. The platform keeps software compatibility with older PowerPC products such as the PowerQUICC platform. In 2012, Freescale announced ARM-based QorIQ offerings beginning in 2013.

The Multicore Association, founded in 2005, is a member-funded, non-profit industry consortium focused on the creation of open standard APIs, specifications, and guidelines that allow system developers and programmers to more readily adopt multicore technology into their applications.

Manycore processors are specialist multi-core processors designed for a high degree of parallel processing, containing a large number of simpler, independent processor cores. Manycore processors are used extensively in embedded computers and high-performance computing. As of November 2018, the world's third fastest supercomputer, the Chinese Sunway TaihuLight, obtains its performance from 40,960 SW26010 manycore processors, each containing 256 cores.

The Collective Tuning Initiative is a community-driven initiative started by Grigori Fursin to develop free collaborative open-source research tools with unified API for code and architecture characterization, optimization and co-design. This enables the sharing of benchmarks, data sets and optimization cases from the community in the open optimization repository through unified web services to predict better optimizations or architecture designs. Using common research-and-development tools should help to improve the quality and reproducibility of research into code, architecture design and optimization, encouraging innovation in this area. This approach helped establish Artifact Evaluation at several ACM-sponsored conferences to encourage sharing of artifacts and validation of experimental results from accepted papers.

For several years parallel hardware was only available for distributed computing, but recently it has become available for low-end computers as well. Hence it has become inevitable for software programmers to start writing parallel applications. Programmers naturally tend to think sequentially and are therefore less acquainted with writing multi-threaded or parallel processing applications. Parallel programming requires handling issues such as synchronization and deadlock avoidance, so programmers need added expertise beyond their expertise in the application domain. Hence programmers prefer to write sequential code, which most popular programming languages support and which allows them to concentrate on the application. Therefore, there is a need to convert such sequential applications to parallel applications with the help of automated tools. The need is also non-trivial because a large amount of legacy code written over the past few decades needs to be reused and parallelized.
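
As a small example of why the conversion is not mechanical, the sketch below parallelizes a sequential summation with POSIX threads: the accumulator carries a dependence across iterations, so each thread computes a partial sum over its own chunk and the results are combined after the join. The data size, worker count and names are illustrative assumptions.

    /* Parallelizing a sequential reduction: the accumulator creates a
     * loop-carried dependence, so each thread sums its own chunk and
     * the partial results are combined at the end. Illustrative only. */
    #include <pthread.h>
    #include <stdio.h>

    #define N       (1 << 20)
    #define WORKERS 4

    static int data[N];
    static long long partial[WORKERS];

    static void *sum_chunk(void *arg)
    {
        long id = (long)arg;
        int begin = id * (N / WORKERS);
        int end   = (id == WORKERS - 1) ? N : begin + N / WORKERS;
        long long s = 0;
        for (int i = begin; i < end; i++)
            s += data[i];
        partial[id] = s;            /* no sharing: each thread writes its own slot */
        return NULL;
    }

    int main(void)
    {
        for (int i = 0; i < N; i++)
            data[i] = 1;

        pthread_t tid[WORKERS];
        for (long i = 0; i < WORKERS; i++)
            pthread_create(&tid[i], NULL, sum_chunk, (void *)i);

        long long total = 0;
        for (int i = 0; i < WORKERS; i++) {
            pthread_join(tid[i], NULL);
            total += partial[i];    /* combine partial sums after the join */
        }
        printf("sum = %lld\n", total);
        return 0;
    }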

References

  1. A. Terechko, J. Hoogerbrugge, G. Alkadi, S. Guntur, A. Lahiri, M. Duranton, C. Wust, P. Christie, A. Nackaerts, A. Kumar, "Balancing programmability and silicon efficiency of heterogeneous multicore architectures", ACM Transactions on Embedded Computing Systems, Special Issue on Real-time Multimedia, 2010.
  2. J. Hoogerbrugge, A. Terechko, "A multithreaded multicore system for embedded media processing", Transactions on High-Performance Embedded Architectures and Compilers, Volume 4, Issue 2, 2008.
  3. G. Al-Kadi, A.S. Terechko, "A Hardware Task Scheduler for Embedded Video Processing", in Proceedings of the 4th International Conference on High Performance and Embedded Architectures and Compilers, Paphos, Cyprus, January 25–28, 2009.
  4. M. Sjalander, A. Terechko, M. Duranton, "A Look-Ahead Task Management Unit for Embedded Multi-Core Architectures", Proceedings of the 2008 11th EUROMICRO Conference on Digital System Design Architectures, Methods and Tools, pages 149-157, 2008, ISBN 978-0-7695-3277-6, IEEE Computer Society, Washington, DC, USA.
  5. A. Terechko, J. Hoogerbrugge; G. Al-Kadi; A. Lahiri; S. Guntur; M. Duranton; P. Christie; A. Nackaerts; A. Kumar, “Performance Density Exploration of Heterogeneous Multicore Architectures”, invited presentation at Rapid Simulation and Performance Evaluation: Methods and Tools (RAPIDO’09), January 25, 2009, in conjunction with the 4th International Conference on High-Performance and Embedded Architectures and Compilers (HiPEAC), Paphos, Cyprus, January 25–28, 2009.
  6. P. Christie, A. Nackaerts, A. Kumar, A. S. Terechko, G. Doornbos, “Rapid Design Flows for Advanced Technology Pathfinding”, invited paper, International Electron Devices Meeting, San Francisco, 2008.
  7. G. Al-Kadi, J. Hoogerbrugge, S. Guntur, A. Terechko, M. Duranton, “Meandering based parallel 3DRS algorithm for the multicore era”, in IEEE International Conference on Consumer Electronics, Las Vegas, USA, January 11–13, 2010.
  8. A. Azevedo, B. Juurlink, C. Meenderinck, A. Terechko, J. Hoogerbrugge, M. Alvarez, A. Ramirez, M. Valero, “A Highly Scalable Parallel Implementation of H.264”, in Transactions on High-Performance Embedded Architectures and Compilers, Volume 4, Issue 2, pp. 404-418, 2009.
  9. A. Azevedo, C. Meenderinck, B. Juurlink, A. Terechko, J. Hoogerbrugge, M. Alvarez, A. Ramirez, "Parallel H.264 Decoding on an Embedded Multicore Processor", in Proceedings of the 4th International Conference on High Performance and Embedded Architectures and Compilers, Paphos, Cyprus, January 2009.
  10. M. Alvarez, A. Azevedo, C. Meenderinck, B. Juurlink, A. Terechko, J. Hoogerbrugge, A. Ramirez, "Analyzing Scalability Limits of H.264 Decoding Due to TLP Overhead", in Proceedings of 6th HiPEAC Industrial Workshop, November 2008.
  11. ACOTES: http://www.hitech-projects.com/euprojects/ACOTES/
  12. A. Turjan, D. Cheresiz, "Porting GCC to an exposed pipeline vector VLIW processor", GCC Developer's summit, Montreal, Québec, Canada, June 8–10, 2009.