Stream Processors, Inc.

Type: Private
Industry: Semiconductors-Specialized
Founded: 2004
Headquarters: Sunnyvale, California, United States
Key people: Bill Dally, Co-Founder and ex-Chairman
Products: Digital signal processors
Number of employees: Approximately 100 (2007)
Website: www.streamprocessors.com

Stream Processors, Inc. (SPI), was a Silicon Valley-based fabless semiconductor company specializing in the design and manufacture of high-performance digital signal processors for applications including video surveillance, multi-function printers and video conferencing. The company ceased operations in 2009.

Company history

Foundational work in stream processing was initiated in 1995 by a research team led by MIT professor Bill Dally. In 1996, he moved to Stanford University, where he continued this work, receiving a multimillion-dollar grant from DARPA with additional resources from Intel and Texas Instruments to fund the development of a project called "Imagine", the first stream processor chip and accompanying compiler tools.

The Imagine Project

The goal of the Imagine project was to develop a C-programmable signal and image processor intended to provide the performance density and efficiency of a special-purpose processor (such as a hard-wired ASIC). The project successfully demonstrated the advantages of stream processing. Details on the Imagine project and its results are posted on the Stanford Imagine project page. The work also showed that a number of applications, ranging from wireless baseband processing, 3D graphics, encryption and IP forwarding to video processing, could take advantage of the efficiency of stream processing. This research inspired other designs such as GPUs from ATI Technologies as well as the Cell microprocessor from Sony, Toshiba, and IBM.

The main deliverables from the Imagine program included:

SPI established

Dally, together with other team members, obtained a license from Stanford to commercialize the resulting technology. Stream Processors, Incorporated (SPI) was incorporated in California in 2004. Professor Dally remained at Stanford, and the company hired industry veteran Chip Stearns as President and CEO in December of that year. By June 2006, SPI had raised a total of $26M from three venture capital firms: Austin Ventures, Norwest Venture Partners and the Woodside Fund.

The company launched its first two products concurrently with the International Solid State Circuits Conference (ISSCC) in February 2006 [1] and later introduced two others. [2] [3]

SPI was headquartered in Sunnyvale, California, and had a software development group (SPI Software Technologies Pvt. Ltd) in Bangalore, India.

In January 2009, Co-Founder Prof. Bill Dally accepted a position as Chief Scientist with NVIDIA Corporation. [4] At the same time he resigned as chairman. [5] In an interview, Dally reflected on his experiences with startups: [5]

"I have done several chip startups myself. It’s getting hard. The ante is very high. If you do a chip startup, you need patient investors with very deep pockets. It’s many tens of millions of dollars to get to a first product and $50 million to get to profits. That’s very difficult to do because investors want an exit some multiple over that investment. I am hoping we return to the days of frequent IPOs and get beyond the fire-sale acquisitions. That’s not what you can see right now. If it’s a programmable chip, the cost is even more."

In the summer of 2009, CEO Stearns left the company and was replaced by Mike Fister, an executive with senior-level experience at Cadence Design Systems and Intel.

In September 2009 the company ceased operations. [6]

Technology

Like graphics and scientific computing, media and signal processing are characterized by abundant data parallelism, locality and a high ratio of computation to global memory access. Stream processing exploits these characteristics using data-parallel processing fed by a distributed memory hierarchy managed by the compiler. The main challenge for next-generation massively parallel processors is data bandwidth, not computational resources. Unlike most conventional processors, the technology does not rely on a hardware cache; instead, data movement is explicitly managed by the compiler and hardware.

The execution model is based on accelerating performance-critical functions (kernels) that process and produce data records (streams). Kernels and streams are scheduled at compile time and moved to on-chip memory at runtime via a scoreboard. The compiler analyses the live ranges of streams to optimize allocation and minimize external memory bandwidth needs. Stream and kernel loads can overlap with execution to improve latency tolerance, and the explicit data movement provides predictable performance. There are no CPU cache misses, and the design presents a single-core model to the programmer; data parallelism is expressed within the kernels.
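The kernel-and-stream model described above can be illustrated with a minimal Python sketch. It is not SPI's API: the names `strips`, `run_kernel` and `brighten`, the integer record type and the strip size are all assumptions made for illustration. The sketch shows a long stream being cut into strips that would fit in on-chip memory, with one kernel applied to every record of each strip.

```python
from typing import Callable, Iterator, List, Sequence

Record = int  # hypothetical record type; real streams carry structured records


def strips(stream: Sequence[Record], strip_size: int) -> Iterator[Sequence[Record]]:
    """Cut a long stream into strips small enough for on-chip memory."""
    for start in range(0, len(stream), strip_size):
        yield stream[start:start + strip_size]


def run_kernel(kernel: Callable[[Record], Record],
               stream: Sequence[Record],
               strip_size: int) -> List[Record]:
    """Apply one kernel to every record, one strip at a time.

    Each loop iteration stands in for: load a strip on chip, execute the
    kernel on it, and store the output strip.
    """
    out: List[Record] = []
    for strip in strips(stream, strip_size):
        on_chip = list(strip)                    # explicit load, no cache involved
        out.extend(kernel(r) for r in on_chip)   # kernel execution on local data
    return out


def brighten(pixel: Record) -> Record:
    """Hypothetical kernel: add 16 and saturate at 255."""
    return min(pixel + 16, 255)


result = run_kernel(brighten, [0, 100, 250, 7, 240], strip_size=2)
print(result)  # [16, 116, 255, 23, 255]
```

In this sketch the loads and the kernel run one after the other; on hardware of the kind described, the load of the next strip could overlap with execution of the current one.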

Architecture

The architecture includes a host CPU (System MIPS) for system-level tasks and a DSP Coprocessor Subsystem, where the DSP MIPS runs the main threads that make kernel function calls to the Data Parallel Unit (DPU). For users who rely on libraries and do not intend to develop DSP code, the architecture is a MIPS-based system-on-a-chip with an API to a “black box” coprocessor. The DPU Dispatcher receives kernel function calls and manages runtime kernel and stream loads. One kernel at a time is executed across the lanes, operating on local stream data stored in the Lane Register File of each lane. Each lane has a set of VLIW ALUs, and distributed operand register files (ORF) allow for a large working data set and processing bandwidth exceeding 1 TeraByte/s. The Stream Load/Store Unit provides gather/scatter with a wide variety of access patterns. The InterLane Switch is a compiler-scheduled full crossbar for high-speed access between lanes.
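As a rough illustration of how one kernel runs across lanes on lane-local data, the following Python sketch models a strided gather into per-lane storage, lock-step execution of the same kernel in every lane, and a scatter back to a single stream. The function names, the four-lane count and the round-robin access pattern are assumptions chosen for the example; they do not describe the actual Storm-1 hardware interfaces.

```python
from typing import Callable, List


def gather_to_lanes(stream: List[int], n_lanes: int) -> List[List[int]]:
    """Strided gather: record i lands in lane i % n_lanes."""
    return [stream[lane::n_lanes] for lane in range(n_lanes)]


def execute_on_lanes(kernel: Callable[[int], int],
                     lanes: List[List[int]]) -> List[List[int]]:
    """Run the same kernel in every lane, each on its own local data."""
    return [[kernel(x) for x in local] for local in lanes]


def scatter_from_lanes(lanes: List[List[int]], length: int) -> List[int]:
    """Inverse of gather_to_lanes: interleave lane outputs into one stream."""
    n_lanes = len(lanes)
    out = [0] * length
    for lane, data in enumerate(lanes):
        out[lane::n_lanes] = data
    return out


stream = list(range(10))
lanes = gather_to_lanes(stream, n_lanes=4)          # hypothetical 4-lane unit
squared = execute_on_lanes(lambda x: x * x, lanes)
print(lanes)                                        # [[0, 4, 8], [1, 5, 9], [2, 6], [3, 7]]
print(scatter_from_lanes(squared, len(stream)))     # [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
```

The sketch assumes a kernel that produces exactly one output record per input record, which is what makes the scatter a simple inverse of the gather.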

Tools

SPI's RapiDev Tools Suite leverages the predictability of stream processing to provide a fast path to optimized results using C programming. Starting with C reference code, the Fast Functional Debugger (FFD) library plugs into standard tools, such as Microsoft Visual Studio and the GNU toolchain, and simulates the DPU to support restructuring code into kernels and streams. Because kernels are statically scheduled and data movement is explicit, DPU cycle accuracy can be obtained even at this high functional level. This is one source of the predictability of the architecture.

For targeting code to the device, the Stream Processor Compiler (SPC) generates the VLIW executable and pre-processed C code that is compiled and linked via standard GCC for MIPS. SPC allocates streams in the Lane Register Files and provides dependency information for the kernel function calls. Software pipelining and loop unrolling are supported. Branch penalties are avoided by using predicated selects, and larger conditionals use conditional streams.

Running under Eclipse, the Target Code Simulator provides comprehensive Host or Device binary code simulation with breakpoint and single-stepping capabilities, along with bandwidth and load statistics. A kernel view shows the VLIW pipeline for kernel optimizations, and a stream view shows kernel execution and stream loads to review global data movement for system profiling.
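The idea of replacing a branch with a predicated select can be sketched in a few lines of Python. This is a generic illustration of the technique, not output of the Stream Processor Compiler; `select`, `clamp_branchy` and `clamp_predicated` are hypothetical names. Both candidate values are computed up front and a mask built from the predicate picks one, so no control-flow decision is needed.

```python
def select(pred: bool, if_true: int, if_false: int) -> int:
    """Branch-free select: the predicate picks one of two precomputed values."""
    mask = -int(pred)  # True -> -1 (all bits set), False -> 0
    return (if_true & mask) | (if_false & ~mask)


def clamp_branchy(x: int, lo: int, hi: int) -> int:
    """Reference version written with ordinary branches."""
    if x < lo:
        return lo
    if x > hi:
        return hi
    return x


def clamp_predicated(x: int, lo: int, hi: int) -> int:
    """Same result using two selects and no branches."""
    x = select(x < lo, lo, x)
    x = select(x > hi, hi, x)
    return x


print([clamp_predicated(v, 0, 255) for v in (-3, 42, 300)])  # [0, 42, 255]
```

On a statically scheduled VLIW machine the benefit of this form is that every iteration takes the same number of cycles regardless of the data, which fits the predictability argument made above.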

Products

SPI marketed its Storm-1 family, which included four fully software-programmable DSPs of varying performance levels.

Product | GMACS* | Applications
SP16HP-G220 | 224 | Broadcasting/transcoding; Wireless Infrastructure
SP16-G160 | 160 | Telepresence; Surveillance DVRs
SP8-G80 | 80 | Printers, Scanners and MFPs; Surveillance DVRs
SP8LP-G30 | 32 | Professional camcorder; IP Camera

Note: GMACS stands for Giga (billions of) Multiply-Accumulate operations per Second, a common measure of DSP performance.
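As a worked example of the unit, peak GMACS can be estimated as the number of multiply-accumulate units times the MACs each completes per cycle times the clock rate, divided by one billion. The Python sketch below uses made-up figures (100 MAC units at 500 MHz); they are assumptions for illustration and are not the specifications of any Storm-1 device.

```python
def gmacs(mac_units: int, clock_hz: float, macs_per_unit_per_cycle: int = 1) -> float:
    """Peak throughput in billions of multiply-accumulates per second."""
    return mac_units * macs_per_unit_per_cycle * clock_hz / 1e9


# Hypothetical device: 100 MAC units clocked at 500 MHz.
peak = gmacs(mac_units=100, clock_hz=500e6)
print(peak)  # 50.0
```

A peak figure of this kind is an upper bound; sustained throughput depends on how well an application keeps the MAC units supplied with data.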

Support hardware and software

References

  1. EETimes.com - Startup touts stream processing architecture for DSPs
  2. Data-parallel DSP aimed at cost-sensitive video surveillance apps | Video Imaging DesignLine
  3. EETimes.com - Stream Processors claims fastest DSP
  4. "Home".
  5. "Stanford's Bill Dally leaps from academia to the computer graphics wars". 22 May 2009.
  6. http://sanjose.bizjournals.com/sanjose/stories/2009/11/02/daily124.html

Coordinates: 37°22′59.48″N 122°04′42.08″W (37.3831889, -122.0783556)